Configuring the IK Analyzer in Elasticsearch

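Background:

Elasticsearch's built-in analyzers do not handle Chinese well: they split Chinese text into single characters rather than words, so word-level search does not work as intended.

For example, analyzing with the default (standard) analyzer:

$ curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -H 'Content-Type: application/json;charset=UTF-8' -d '{"analyzer":"standard","text":"非凡社区"}'

The analysis result is: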

{
     "tokens": [
          {
               "token": "非",
               "start_offset": 0,
               "end_offset": 1,
               "type": "<IDEOGRAPHIC>",
               "position": 0
          },
          {
               "token": "凡",
               "start_offset": 1,
               "end_offset": 2,
               "type": "<IDEOGRAPHIC>",
               "position": 1
          },
          {
               "token": "社",
               "start_offset": 2,
               "end_offset": 3,
               "type": "<IDEOGRAPHIC>",
               "position": 2
          },
          {
               "token": "区",
               "start_offset": 3,
               "end_offset": 4,
               "type": "<IDEOGRAPHIC>",
               "position": 3
          }
     ]
}

As shown, “非凡社区” is split into individual characters instead of the meaningful words “非凡” and “社区”.

This is why the IK analyzer is needed.

IK analyzer repository:

IK analyzer on GitHub: https://github.com/medcl/elasticsearch-analysis-ik

Environment:

1. Maven.

2. JDK 8.

3. Elasticsearch 6.5.3.

4. IK analyzer 6.5.3. (The analyzer version must match the Elasticsearch version. Download: https://github.com/medcl/elasticsearch-analysis-ik/releases)
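Since a version mismatch is the easiest mistake to make here, a small check before installing can help. The sketch below is only one way to do it: the function names are made up for illustration, and it assumes the release zip keeps its original name (elasticsearch-analysis-ik-<version>.zip) and that the node answers on http://127.0.0.1:9200 as elsewhere in this post.

```shell
# Sketch: check that an IK release zip matches the running Elasticsearch version.
# All function names below are illustrative; they are not part of Elasticsearch or IK.

# Read the JSON returned by "GET /" on stdin and print the value of version.number.
es_version_from_info() {
  sed -n 's/.*"number"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' | head -n 1
}

# Print the version embedded in a release file name,
# e.g. elasticsearch-analysis-ik-6.5.3.zip -> 6.5.3
ik_version_from_zip() {
  name=$(basename "$1" .zip)
  printf '%s\n' "${name#elasticsearch-analysis-ik-}"
}

# Succeed only when the ES version is non-empty and equals the zip's version.
versions_match() {
  [ -n "$1" ] && [ "$1" = "$(ik_version_from_zip "$2")" ]
}

# Usage against a live node:
#   es_version=$(curl -s http://127.0.0.1:9200 | es_version_from_info)
#   versions_match "$es_version" elasticsearch-analysis-ik-6.5.3.zip \
#     && echo "versions match" || echo "version mismatch"
```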

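Installing the IK analyzer:

Elasticsearch path: /opt/es.

1. Unzip the IK release package into the plugins directory under the Elasticsearch directory.

$ mkdir -p /opt/es/plugins/ik && unzip elasticsearch-analysis-ik-6.5.3.zip -d /opt/es/plugins/ik

2. Restart Elasticsearch.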

3. Verify that the analyzer is enabled:

$ curl -XPOST 'http://127.0.0.1:9200/_analyze?pretty' -H 'Content-Type: application/json;charset=UTF-8' -d '{"analyzer":"ik_smart","text":"非凡社区"}'

Result:

{
     "tokens": [
          {
               "token": "非凡",
               "start_offset": 0,
               "end_offset": 2,
               "type": "CN_WORD",
               "position": 0
          },
          {
               "token": "社区",
               "start_offset": 2,
               "end_offset": 4,
               "type": "CN_WORD",
               "position": 1
          }
     ]
}
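If the request fails instead (for example with an unknown-analyzer error), it is worth checking whether the node loaded the plugin at all. A minimal sketch, assuming the plugin reports itself as analysis-ik in the _cat/plugins listing; has_ik_plugin is just an illustrative helper name:

```shell
# Sketch: confirm that the node has loaded the IK plugin.
# has_ik_plugin is an illustrative helper; it reads a plugin listing on stdin
# and succeeds if any line mentions analysis-ik.
has_ik_plugin() {
  grep -q 'analysis-ik'
}

# Usage against a live node (host and port as used in this post):
#   curl -s 'http://127.0.0.1:9200/_cat/plugins' | has_ik_plugin \
#     && echo "IK plugin loaded" || echo "IK plugin not found"
```

If the plugin is missing from the listing, the usual suspects are a version mismatch or a zip extracted into the wrong directory; the Elasticsearch startup log normally says which.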

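The two IK analyzers:

The IK plugin ships two analyzers, ik_max_word and ik_smart.

1. ik_max_word: splits the text at the finest granularity, producing as many words as possible.

2. ik_smart: splits the text at the coarsest granularity; text already assigned to one word is not reused in another.

For example, analyzing “分词器效果” with each analyzer: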

ik_max_word:

{
     "tokens": [
          {
               "token": "分词器",
               "start_offset": 0,
               "end_offset": 3,
               "type": "CN_WORD",
               "position": 0
          },
          {
               "token": "分词",
               "start_offset": 0,
               "end_offset": 2,
               "type": "CN_WORD",
               "position": 1
          },
          {
               "token": "器",
               "start_offset": 2,
               "end_offset": 3,
               "type": "CN_CHAR",
               "position": 2
          },
          {
               "token": "效果",
               "start_offset": 3,
               "end_offset": 5,
               "type": "CN_WORD",
               "position": 3
          }
     ]
}

ik_smart:

{
     "tokens": [
          {
               "token": "分词器",
               "start_offset": 0,
               "end_offset": 3,
               "type": "CN_WORD",
               "position": 0
          },
          {
               "token": "效果",
               "start_offset": 3,
               "end_offset": 5,
               "type": "CN_WORD",
               "position": 1
          }
     ]
}
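To reproduce the comparison yourself, something like the following should work. It is a rough sketch: analyze_body and compare_ik are names invented here, the node address is the one used throughout this post, and the text is pasted into the JSON as-is, so it must not contain double quotes or backslashes.

```shell
# Sketch: run the same text through both IK analyzers via the _analyze API.
# analyze_body and compare_ik are illustrative names.

# Print a JSON request body for _analyze.
# No escaping is done: $2 must not contain double quotes or backslashes.
analyze_body() {
  printf '{"analyzer":"%s","text":"%s"}' "$1" "$2"
}

# Call _analyze once per analyzer and print each response under a header line.
compare_ik() {
  for analyzer in ik_max_word ik_smart; do
    echo "== $analyzer =="
    curl -s -XPOST 'http://127.0.0.1:9200/_analyze?pretty' \
      -H 'Content-Type: application/json' \
      -d "$(analyze_body "$analyzer" "$1")"
  done
}

# Usage: compare_ik "分词器效果"
```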

Specifying the IK analyzer when creating an index:

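PUT user_v1
{
  "settings":{
    "number_of_shards": 6,
    "number_of_replicas": 1,
    "analysis":{
      "analyzer":{
        "ik":{
          "tokenizer":"ik_max_word"
        }
      }
    }
  },
  "mappings":{
    "novel":{
      "properties":{
        "author":{
          "type":"text",
          "analyzer":"ik"
        },
        "wordCount":{
          "type":"integer"
        },
        "publishDate":{
          "type":"date",
          "format":"yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"
        },
        "briefIntroduction":{
          "type":"text",
          "analyzer":"ik"
        },
        "bookName":{
          "type":"text",
          "analyzer":"ik"
        }
      }
    }
  }
}

The analysis section under settings defines a custom analyzer named ik that uses the ik_max_word tokenizer, and each text field references it through "analyzer". Indexing, updating, and searching those fields in user_v1 then go through ik_max_word. A text field with no "analyzer" setting falls back to the standard analyzer.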
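One way to confirm which analyzer a field actually uses is to call _analyze on the index with a "field" parameter instead of an "analyzer". The sketch below assumes the user_v1 index and its bookName field from above; field_analyze_body is an illustrative name, and again the text must not contain double quotes or backslashes.

```shell
# Sketch: check how a mapped field tokenizes text.
# field_analyze_body is an illustrative helper that prints an _analyze body
# which selects the analyzer through the field's mapping.
field_analyze_body() {
  printf '{"field":"%s","text":"%s"}' "$1" "$2"
}

# Usage against a live node:
#   curl -s -XPOST 'http://127.0.0.1:9200/user_v1/_analyze?pretty' \
#     -H 'Content-Type: application/json' \
#     -d "$(field_analyze_body bookName 非凡社区)"
```

If the field is mapped with an IK analyzer, the response should contain word-level CN_WORD tokens; single-character <IDEOGRAPHIC> tokens mean the field is still using the standard analyzer.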

Hot-word dictionary updates:

Omitted here; to be covered in a later update.