elasticsearch-jieba-plugin
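Update 2022-12-01
- New branch 7.17.x: supports ES 7.17.0 (JDK version 11.0.7, Gradle version 7.6)
- New branch 8.4.1: supports ES 8.4.1 (JDK version 18.0.2.1, Gradle version 7.6)
- When adapting the plugin to a different ES or JDK version, check which JDK versions that ES version supports.
- To adapt to a different ES version, modify the following files (the places that need changes are marked in the files):
- build.gradle
- src/main/resources/plugin-descriptor.properties
- To switch to a different Gradle version, run the command below, where 7.6 is the target version:
gradle wrapper --gradle-version 7.6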
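Before building against a new ES version, it can help to confirm what the descriptor currently declares. The following is a minimal Python sketch, not part of the plugin. It assumes the descriptor uses the standard Elasticsearch plugin keys such as elasticsearch.version and that the value is written literally rather than as a build placeholder. The helper names read_descriptor and check_es_version are hypothetical.

```python
from pathlib import Path


def read_descriptor(text: str) -> dict:
    """Parse simple key=value lines from a Java-style .properties file."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def check_es_version(path: str, expected: str) -> bool:
    """Return True if the descriptor at `path` declares the expected ES version."""
    props = read_descriptor(Path(path).read_text(encoding="utf-8"))
    return props.get("elasticsearch.version") == expected
```

For example, check_es_version("src/main/resources/plugin-descriptor.properties", "7.17.0") would report whether the descriptor already targets ES 7.17.0.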
jieba analysis plugin for elasticsearch: 7.7.0, 7.4.2, 7.3.0, 7.0.0, 6.4.0, 6.0.0 , 5.4.0, 5.3.0, 5.2.2, 5.2.1, 5.2.0, 5.1.2, 5.1.1
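Features
- Supports adding dictionaries dynamically, without restarting ES.
Simple modifications are enough to adapt the plugin to different ES versions
Using jieba_index and jieba_search
New tokenizer support
If you are on ES 6.4.0, use the latest code from the 6.4.0 branch or the master branch, or download the 6.4.1 release. Upgrading is strongly recommended!
ES tokenization: PositionIncrement explained
The 6.4.1 release fixes the PositionIncrement issue. For details, see the version mapping: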
Branch | Tag | Elasticsearch version | Release link |
---|---|---|---|
7.7.0 | tag v7.7.1 | v7.7.0 | Download: v7.7.0 |
7.4.2 | tag v7.4.2 | v7.4.2 | Download: v7.4.2 |
7.3.0 | tag v7.3.0 | v7.3.0 | Download: v7.3.0 |
7.0.0 | tag v7.0.0 | v7.0.0 | Download: v7.0.0 |
6.4.0 | tag v6.4.1 | v6.4.0 | Download: v6.4.1 |
6.4.0 | tag v6.4.0 | v6.4.0 | Download: v6.4.0 |
6.0.0 | tag v6.0.0 | v6.0.0 | Download: v6.0.1 |
5.4.0 | tag v5.4.0 | v5.4.0 | Download: v5.4.0 |
5.3.0 | tag v5.3.0 | v5.3.0 | Download: v5.3.0 |
5.2.2 | tag v5.2.2 | v5.2.2 | Download: v5.2.2 |
5.2.1 | tag v5.2.1 | v5.2.1 | Download: v5.2.1 |
5.2 | tag v5.2.0 | v5.2.0 | Download: v5.2.0 |
5.1.2 | tag v5.1.2 | v5.1.2 | Download: v5.1.2 |
5.1.1 | tag v5.1.1 | v5.1.1 | Download: v5.1.1 |
more details
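- Choose the source code version that matches your ES version.
- Clone the repository and build the plugin zip:
git clone https://github.com/sing1ee/elasticsearch-jieba-plugin.git --recursive
cd elasticsearch-jieba-plugin
./gradlew clean pz
- Copy the zip file to the plugin directory (the file name carries the plugin version, 5.1.2 in this example):
cp build/distributions/elasticsearch-jieba-plugin-5.1.2.zip ${path.home}/plugins
- Unzip it inside the plugin directory and remove the zip file:
cd ${path.home}/plugins
unzip elasticsearch-jieba-plugin-5.1.2.zip
rm elasticsearch-jieba-plugin-5.1.2.zip
- Start Elasticsearch from ${path.home}:
cd ${path.home}
./bin/elasticsearch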
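The copy, unzip, and remove steps can also be scripted. The following is a minimal Python sketch, assuming the zip was built into build/distributions as shown above. The helper name install_plugin_zip is hypothetical. Because it extracts straight from the build directory, no zip file is left behind in the plugin directory.

```python
import zipfile
from pathlib import Path


def install_plugin_zip(zip_path: str, plugins_dir: str) -> list:
    """Extract the plugin zip into plugins_dir and return the archived entry names."""
    target = Path(plugins_dir)
    # Create the plugin directory if it does not exist yet.
    target.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as archive:
        names = archive.namelist()
        archive.extractall(target)
    return names
```

A call such as install_plugin_zip("build/distributions/elasticsearch-jieba-plugin-5.1.2.zip", "/path/to/elasticsearch/plugins") mirrors the manual steps; the second path is a placeholder for your own ${path.home}/plugins.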
Custom User Dict
Put your dict file, with the suffix .dict, into ${path.home}/plugins/jieba/dic. Each line holds a word followed by its frequency, like this:
小清新 3
百搭 3
显瘦 3
隨身碟 100
your_word word_freq
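A dictionary file can be checked before it is dropped into the dic folder. The following is a minimal Python sketch, assuming every line has exactly the two whitespace-separated fields shown above (the plugin itself may accept other variants). The function names parse_dict_line and parse_dict are hypothetical.

```python
def parse_dict_line(line: str):
    """Split one dictionary line into (word, frequency)."""
    parts = line.split()
    if len(parts) != 2:
        raise ValueError(f"expected 'word freq', got: {line!r}")
    word, freq = parts
    # int() raises ValueError if the frequency is not a whole number.
    return word, int(freq)


def parse_dict(text: str) -> dict:
    """Parse a whole dictionary file, skipping blank lines."""
    entries = {}
    for line in text.splitlines():
        if line.strip():
            word, freq = parse_dict_line(line)
            entries[word] = freq
    return entries
```

Running parse_dict on the four sample entries above yields a mapping from each word to its frequency, and a malformed line raises ValueError instead of being silently skipped.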
Using stopwords
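- Find stopwords.txt in ${path.home}/plugins/jieba/dic.
- Create a folder named stopwords under ${path.home}/config:
mkdir -p ${path.home}/config/stopwords
- Copy stopwords.txt into the folder just created:
cp ${path.home}/plugins/jieba/dic/stopwords.txt ${path.home}/config/stopwords
- Create the index. The paths in the settings are relative to ${path.home}/config, and the jieba_synonym filter also expects a synonyms/synonyms.txt file there; leave that filter out if you have no synonyms file: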
PUT http://localhost:9200/jieba_index
{
  "settings": {
    "analysis": {
      "filter": {
        "jieba_stop": {
          "type": "stop",
          "stopwords_path": "stopwords/stopwords.txt"
        },
        "jieba_synonym": {
          "type": "synonym",
          "synonyms_path": "synonyms/synonyms.txt"
        }
      },
      "analyzer": {
        "my_ana": {
          "tokenizer": "jieba_index",
          "filter": [
            "lowercase",
            "jieba_stop",
            "jieba_synonym"
          ]
        }
      }
    }
  }
}
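The same index can be created from a script. The following is a minimal Python sketch using only the standard library, assuming Elasticsearch listens on http://localhost:9200 without authentication. The function names build_index_settings and create_index, and the use_synonyms switch, are hypothetical conveniences and not part of the plugin.

```python
import json
import urllib.request


def build_index_settings(use_synonyms: bool = True) -> dict:
    """Build the settings body shown above; optionally leave out the synonym filter."""
    filters = {
        "jieba_stop": {"type": "stop", "stopwords_path": "stopwords/stopwords.txt"},
    }
    chain = ["lowercase", "jieba_stop"]
    if use_synonyms:
        filters["jieba_synonym"] = {
            "type": "synonym",
            "synonyms_path": "synonyms/synonyms.txt",
        }
        chain.append("jieba_synonym")
    return {
        "settings": {
            "analysis": {
                "filter": filters,
                "analyzer": {"my_ana": {"tokenizer": "jieba_index", "filter": chain}},
            }
        }
    }


def create_index(base_url: str, index: str, body: dict) -> int:
    """Send PUT <base_url>/<index> with a JSON body and return the HTTP status code."""
    request = urllib.request.Request(
        f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

With a running node, create_index("http://localhost:9200", "jieba_index", build_index_settings()) sends the request; urlopen raises urllib.error.HTTPError if Elasticsearch rejects it.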
- Test the analyzer (the _analyze endpoint accepts GET or POST):
POST http://localhost:9200/jieba_index/_analyze
{
  "analyzer": "my_ana",
  "text": "黄河之水天上来"
}
The response is as follows:
{
  "tokens": [
    {
      "token": "黄河",
      "start_offset": 0,
      "end_offset": 2,
      "type": "word",
      "position": 0
    },
    {
      "token": "黄河之水天上来",
      "start_offset": 0,
      "end_offset": 7,
      "type": "word",
      "position": 0
    },
    {
      "token": "之水",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 1
    },
    {
      "token": "天上",
      "start_offset": 4,
      "end_offset": 6,
      "type": "word",
      "position": 2
    },
    {
      "token": "上来",
      "start_offset": 5,
      "end_offset": 7,
      "type": "word",
      "position": 2
    }
  ]
}
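In the response above, several tokens share the same position value. Grouping tokens by position makes that overlap easier to see. The following is a minimal Python sketch, assuming the response has been parsed into a dict with the shape shown above. The function name group_by_position is hypothetical.

```python
from collections import defaultdict


def group_by_position(response: dict) -> dict:
    """Map each position to the list of tokens emitted at that position."""
    groups = defaultdict(list)
    for token in response["tokens"]:
        groups[token["position"]].append(token["token"])
    return dict(groups)
```

For the sample response this gives {0: ['黄河', '黄河之水天上来'], 1: ['之水'], 2: ['天上', '上来']}, which shows that the jieba_index tokenizer emits overlapping tokens at positions 0 and 2.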
NOTE
This plugin was migrated from jieba-solr.
Roadmap
I will add more analyzer support:
- stanford chinese analyzer
- fudan nlp analyzer
- ...
If you have ideas, please open an issue and we can work on it together.