elasticsearch-analysis-openkoreantext
A Korean (Hangul) analysis plugin for Elasticsearch, built on the open-korean-text Korean processing engine.
Elasticsearch 4.x and earlier are not supported.
Install
$ cd ${ES_HOME}
$ bin/elasticsearch-plugin install {download URL}
After installation, run bin/elasticsearch and check that the log prints loaded plugin [elasticsearch-analysis-openkoreantext].
For {download URL}, see Compatible Versions below and download the plugin version that matches your Elasticsearch version.
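Besides reading the startup log, the installation can also be checked through Elasticsearch's `_cat/plugins` API. The following is a minimal sketch, assuming a node at `http://localhost:9200` and assuming the plugin is reported in the `component` column under the same name that appears in the log line; the helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: the plugin is listed under the same name as in the startup log.
PLUGIN_NAME = "elasticsearch-analysis-openkoreantext"


def plugin_is_listed(cat_plugins_rows, plugin_name=PLUGIN_NAME):
    """Return True if any row reports the plugin in its `component` column."""
    return any(row.get("component") == plugin_name for row in cat_plugins_rows)


def fetch_cat_plugins(base_url="http://localhost:9200"):
    """Fetch `_cat/plugins` as a list of JSON rows (needs a running node)."""
    with urllib.request.urlopen(base_url + "/_cat/plugins?format=json") as resp:
        return json.load(resp)
```

With a node running, `plugin_is_listed(fetch_cat_plugins())` should return `True` once the plugin is installed.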
Example
Input
curl -X POST 'http://localhost:9200/_analyze' -H 'Content-Type: application/json' -d '{
"analyzer": "openkoreantext-analyzer",
"text": "한국어를 처리하는 예시입니닼ㅋㅋ"
}'
Output
{
"tokens": [
{
"token": "한국어",
"start_offset": 0,
"end_offset": 3,
"type": "Noun",
"position": 0
},
{
"token": "처리",
"start_offset": 5,
"end_offset": 7,
"type": "Noun",
"position": 1
},
{
"token": "하다",
"start_offset": 7,
"end_offset": 9,
"type": "Verb",
"position": 2
},
{
"token": "예시",
"start_offset": 10,
"end_offset": 12,
"type": "Noun",
"position": 3
},
{
"token": "이다",
"start_offset": 12,
"end_offset": 15,
"type": "Adjective",
"position": 4
},
{
"token": "ㅋㅋ",
"start_offset": 15,
"end_offset": 17,
"type": "KoreanParticle",
"position": 5
}
]
}
With Elasticsearch's default analyzer, the same input yields:
{
"tokens": [
{
"token": "한국어를",
"start_offset": 0,
"end_offset": 4,
"type": "<HANGUL>",
"position": 0
},
{
"token": "처리하는",
"start_offset": 5,
"end_offset": 9,
"type": "<HANGUL>",
"position": 1
},
{
"token": "예시입니닼ㅋㅋ",
"start_offset": 10,
"end_offset": 17,
"type": "<HANGUL>",
"position": 2
}
]
}
For actual usage, see Elasticsearch analysis.
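The same `_analyze` call can be made from a script. The sketch below uses only the Python standard library and assumes a node at `http://localhost:9200`; `build_analyze_request` and `token_texts` are hypothetical helper names, not part of the plugin.

```python
import json
import urllib.request


def build_analyze_request(text, analyzer="openkoreantext-analyzer",
                          base_url="http://localhost:9200"):
    """Build a POST request for the `_analyze` endpoint without sending it."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        base_url + "/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def token_texts(analyze_response):
    """Extract only the token strings from a parsed `_analyze` response."""
    return [t["token"] for t in analyze_response["tokens"]]


# Usage against a running node:
#   req = build_analyze_request("한국어를 처리하는 예시입니닼ㅋㅋ")
#   with urllib.request.urlopen(req) as resp:
#       print(token_texts(json.load(resp)))
```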
User Dictionary
In addition to the default dictionary, you can add your own words. For example, analyzing 말썽쟁이 yields 말썽(Noun) and 쟁이(suffix), but once 말썽쟁이 is added to the dictionary it is extracted as 말썽쟁이(Noun).
After installing the analyzer plugin, you will find a dic/ directory under {ES_HOME}/plugins/elasticsearch-analysis-openkoreantext. Add your dictionary text files to that directory.
In a dictionary text file, put each word on its own line. (Note that spaces are not recognized as part of a word.)
# {ES_HOME}/plugins/elasticsearch-analysis-openkoreantext/dic/sampledictionary
말썽쟁이
λμμ΄
μμμ΄ν λ¨Έλ
...
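A dictionary file in this format can be generated by a small script. This is a sketch under stated assumptions: the function name and the file name `mydictionary` are hypothetical, and rejecting entries that contain whitespace is one reading of the note that spaces are not recognized as part of a word.

```python
from pathlib import Path


def write_user_dictionary(dic_dir, filename, words):
    """Write one word per line; reject entries that contain whitespace."""
    cleaned = []
    for word in words:
        word = word.strip()
        if not word:
            continue  # skip empty entries
        if any(ch.isspace() for ch in word):
            raise ValueError(f"entry must not contain whitespace: {word!r}")
        cleaned.append(word)
    path = Path(dic_dir) / filename
    path.write_text("\n".join(cleaned) + "\n", encoding="utf-8")
    return path


# Usage (the path is an assumption; substitute your own ES_HOME):
#   write_user_dictionary(
#       "/path/to/es/plugins/elasticsearch-analysis-openkoreantext/dic",
#       "mydictionary",
#       ["말썽쟁이"],
#   )
```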
Components
This analyzer is made up of several components.
Character Filter
- openkoreantext-normalizer
- Normalizes colloquial text.
훌쩍훌쩍훌쩍훌쩍 -> 훌쩍훌쩍, 하겟다 -> 하겠다, 안됔ㅋㅋㅋ -> 안돼ㅋㅋ
Tokenizer
- openkoreantext-tokenizer
- Tokenizes sentences.
한국어를 처리하는 예시입니다 ㅋㅋ -> [한국어, 를, 처리, 하는, 예시, 입니다, ㅋㅋ]
Token Filter
- openkoreantext-stemmer
- Stems adjectives and verbs.
새로운 스테밍을 추가했었다. -> [새롭다, 스테밍, 을, 추가하다, .]
- openkoreantext-redundant-filter
- Removes conjunctions, whitespace, particles, periods, and similar tokens.
그리고 이것은 예시, 또는 예로써, 한국어를 처리하기 -> [예시, 예, 한국어, 처리, 하다]
- openkoreantext-phrase-extractor
- Extracts phrases.
한국어를 처리하는 예시입니다 ㅋㅋ -> [한국어, 처리, 예시, 처리하는 예시]
Analyzer
[openkoreantext-normalizer] -> [openkoreantext-tokenizer] -> [openkoreantext-stemmer, openkoreantext-redundant-filter, classic, length, lowercase]
- openkoreantext-phrase-extractor is not applied as a default token filter in this analyzer.
- To build a custom analyzer, see the Elasticsearch custom analyzer documentation.
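As an illustration of a custom analyzer that does include openkoreantext-phrase-extractor, the sketch below builds index settings using Elasticsearch's standard `custom` analyzer type. The analyzer name `my_korean_analyzer` is hypothetical, and applying the phrase extractor directly after the tokenizer is an assumption; the source does not say how it combines with the other token filters.

```python
import json


def custom_analyzer_settings(analyzer_name="my_korean_analyzer"):
    """Build index settings for a custom analyzer using the plugin's components."""
    return {
        "settings": {
            "analysis": {
                "analyzer": {
                    analyzer_name: {
                        "type": "custom",
                        "char_filter": ["openkoreantext-normalizer"],
                        "tokenizer": "openkoreantext-tokenizer",
                        # Assumption: the phrase extractor runs on raw tokens.
                        "filter": ["openkoreantext-phrase-extractor"],
                    }
                }
            }
        }
    }


# json.dumps(custom_analyzer_settings()) gives a body suitable for index creation.
```

Sending this body when creating an index (for example `PUT /my_index`) would register the analyzer for that index.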
Compatible Versions
- Versions below 5.0.0 are not supported. Please refer to other plugins built with open-korean-text.
License
Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0