  • Language
    Java
  • License
    Apache License 2.0
  • Created over 7 years ago
  • Updated over 1 year ago

Repository Details

Korean analysis plugin that integrates open-korean-text module into elasticsearch.

elasticsearch-analysis-openkoreantext

An Elasticsearch analyzer for Korean (Hangul) text, built on the open-korean-text Korean processing engine.

Korean analysis plugin that integrates open-korean-text module into Elasticsearch.

Elasticsearch versions 4.x and below are not supported.

Install

$ cd ${ES_HOME}
$ bin/elasticsearch-plugin install {download URL}

After installing, start bin/elasticsearch and confirm that the log line loaded plugin [elasticsearch-analysis-openkoreantext] is printed.

For the download URL, see Compatible Versions below and download the plugin version that matches your Elasticsearch version.
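
The log check described above can be scripted. The following is a minimal sketch, not part of the plugin: the function name plugin_loaded is hypothetical and the log file location is an assumption that depends on your installation. Only the log message itself comes from this README.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given Elasticsearch log file contains
# the "loaded plugin" message for this plugin.
# The log file path is supplied by the caller because it varies per install.
plugin_loaded() {
  log_file=$1
  # -F treats the pattern as a fixed string, so the square brackets are literal.
  grep -qF 'loaded plugin [elasticsearch-analysis-openkoreantext]' "$log_file"
}
```

Assuming the default log location of an archive install, it could be called as plugin_loaded "${ES_HOME}/logs/elasticsearch.log" && echo "plugin loaded".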

Example

Input

curl -X POST 'http://localhost:9200/_analyze' -H 'Content-Type: application/json' -d '{
  "analyzer": "openkoreantext-analyzer",
  "text": "한국어를 처리하는 예시입니닼ㅋㅋ"
}'

Output

{
  "tokens": [
    {
      "token": "한국어",
      "start_offset": 0,
      "end_offset": 3,
      "type": "Noun",
      "position": 0
    },
    {
      "token": "처리",
      "start_offset": 5,
      "end_offset": 7,
      "type": "Noun",
      "position": 1
    },
    {
      "token": "하다",
      "start_offset": 7,
      "end_offset": 9,
      "type": "Verb",
      "position": 2
    },
    {
      "token": "예시",
      "start_offset": 10,
      "end_offset": 12,
      "type": "Noun",
      "position": 3
    },
    {
      "token": "이다",
      "start_offset": 12,
      "end_offset": 15,
      "type": "Adjective",
      "position": 4
    },
    {
      "token": "ㅋㅋ",
      "start_offset": 15,
      "end_offset": 17,
      "type": "KoreanParticle",
      "position": 5
    }
  ]
}

Output when using Elasticsearch's default analyzer

{
  "tokens": [
    {
      "token": "한국어를",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<HANGUL>",
      "position": 0
    },
    {
      "token": "처리하는",
      "start_offset": 5,
      "end_offset": 9,
      "type": "<HANGUL>",
      "position": 1
    },
    {
      "token": "예시입니닼ㅋㅋ",
      "start_offset": 10,
      "end_offset": 17,
      "type": "<HANGUL>",
      "position": 2
    }
  ]
}

For actual usage, see the Elasticsearch analysis documentation.
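
As a starting point, the analyzer can be attached to a text field when an index is created. This is a sketch under assumptions: the index name korean-docs, the mapping type doc, and the field content are made up for illustration, and the mapping layout assumes the 5.x/6.x releases listed under Compatible Versions, which still use mapping types.

```shell
#!/bin/sh
# Assumed defaults; override them through the environment.
ES_URL="${ES_URL:-http://localhost:9200}"
INDEX_NAME="${INDEX_NAME:-korean-docs}"   # hypothetical index name

# Index body: one text field analyzed with openkoreantext-analyzer.
# "doc" and "content" are illustrative names, not required by the plugin.
INDEX_BODY=$(cat <<'JSON'
{
  "mappings": {
    "doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "openkoreantext-analyzer"
        }
      }
    }
  }
}
JSON
)

# Sends the index creation request. Nothing is sent until this is called.
create_korean_index() {
  curl -X PUT "${ES_URL}/${INDEX_NAME}" \
    -H 'Content-Type: application/json' \
    -d "$INDEX_BODY"
}
```

Calling create_korean_index against a running node creates the index. Text indexed into the content field is then analyzed by the plugin at both index and search time, unless a different search analyzer is configured.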

User Dictionary

In addition to the default dictionary, you can add words of your own. For example, analyzing 말썽쟁이 yields 말썽 (Noun) and 쟁이 (suffix), but once 말썽쟁이 is added to the dictionary it is extracted as 말썽쟁이 (Noun).

After the analyzer plugin is installed, you will find a dic/ directory under {ES_HOME}/plugins/elasticsearch-analysis-openkoreantext. Add your dictionary text files to that directory.

A dictionary text file lists one word per line. (Note that whitespace is not recognized as part of a word.)

# {ES_HOME}/plugins/elasticsearch-analysis-openkoreantext/dic/sampledictionary
말썽쟁이
뚜쟁이
욕쟁이할머니
...
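
A dictionary file in this format can be generated from a word list. The helper below is a hypothetical sketch (the function name and its arguments are not part of the plugin). It trims each line, drops empty lines and duplicates, and skips entries that contain whitespace, in line with the note above.

```shell
#!/bin/sh
# write_user_dictionary DIC_DIR FILE_NAME
# Reads candidate words from stdin and writes them, one per line,
# to DIC_DIR/FILE_NAME. DIC_DIR is the plugin's dic/ directory.
write_user_dictionary() {
  dic_dir=$1
  file_name=$2
  mkdir -p "$dic_dir" || return 1
  # 1. sed:  trim leading and trailing whitespace on every line
  # 2. grep: drop entries that still contain whitespace, and empty lines
  # 3. awk:  keep only the first occurrence of each word
  sed 's/^[[:space:]]*//; s/[[:space:]]*$//' \
    | grep -v -e '[[:space:]]' -e '^$' \
    | awk '!seen[$0]++' > "$dic_dir/$file_name"
}
```

For example, write_user_dictionary "$PLUGIN_DIR/dic" sampledictionary < my_words.txt, where PLUGIN_DIR stands for the plugin's install directory and my_words.txt is your own word list.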

Components

This analyzer is made up of several components.

Character Filter

  • openkoreantext-normalizer
    • Normalizes colloquial text.

    훌쩍훌쩍훌쩍훌쩍 -> 훌쩍훌쩍, 하겟다 -> 하겠다, 안됔ㅋㅋㅋ -> 안돼ㅋㅋ

Tokenizer

  • openkoreantext-tokenizer
    • Tokenizes sentences.

    한국어를 처리하는 예시입니다 ㅋㅋ -> [한국어, 를, 처리, 하는, 예시, 입니다, ㅋㅋ]

Token Filter

  • openkoreantext-stemmer

    • Stems adjectives and verbs.

    새로운 스테밍을 추가했었다. -> [새롭다, 스테밍, 을, 추가하다, .]

  • openkoreantext-redundant-filter

    • Removes conjunctions, whitespace, postpositions (josa), periods, and similar tokens.

    그리고 이것은 예시, 또는 예로써, 한국어를 처리하기 -> [예시, 예, 한국어, 처리, 하다]

  • openkoreantext-phrase-extractor

    • Extracts phrases.

    한국어를 처리하는 예시입니다 ㅋㅋ -> [한국어, 처리, 예시, 처리하는 예시]

Analyzer

[openkoreantext-normalizer] -> [openkoreantext-tokenizer] -> [openkoreantext-stemmer, openkoreantext-redundant-filter, classic, length, lowercase]

  • openkoreantext-phrase-extractor is not applied as a default token filter in this analyzer.
  • To build a custom analyzer, see the Elasticsearch custom analyzer documentation.
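
For instance, a custom analyzer could add the phrase extractor that the bundled analyzer leaves out. The sketch below is an illustration under assumptions, not a recommended configuration: the index name korean-phrases and the analyzer name korean_phrases are invented. The filter chain is limited to the phrase extractor because this README does not describe how it combines with the other token filters.

```shell
#!/bin/sh
# Assumed defaults; override them through the environment.
ES_URL="${ES_URL:-http://localhost:9200}"
PHRASE_INDEX="${PHRASE_INDEX:-korean-phrases}"   # hypothetical index name

# Index settings defining a custom analyzer from the plugin's components.
# "korean_phrases" is an illustrative analyzer name.
PHRASE_SETTINGS=$(cat <<'JSON'
{
  "settings": {
    "analysis": {
      "analyzer": {
        "korean_phrases": {
          "type": "custom",
          "char_filter": ["openkoreantext-normalizer"],
          "tokenizer": "openkoreantext-tokenizer",
          "filter": ["openkoreantext-phrase-extractor"]
        }
      }
    }
  }
}
JSON
)

# Creates the index with the custom analyzer. Nothing is sent until called.
create_phrase_index() {
  curl -X PUT "${ES_URL}/${PHRASE_INDEX}" \
    -H 'Content-Type: application/json' \
    -d "$PHRASE_SETTINGS"
}

# Runs the custom analyzer on a sample sentence through the index-level _analyze API.
analyze_with_phrases() {
  curl -X POST "${ES_URL}/${PHRASE_INDEX}/_analyze" \
    -H 'Content-Type: application/json' \
    -d '{"analyzer": "korean_phrases", "text": "한국어를 처리하는 예시입니다 ㅋㅋ"}'
}
```

Running create_phrase_index and then analyze_with_phrases against a node with the plugin installed shows which tokens this chain produces, which is a quick way to compare alternative filter orders before settling on one.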

Compatible Versions

Elasticsearch  open-korean-text  Download URL
6.1.1          2.1.0             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/6.1.1/elasticsearch-analysis-openkoreantext-6.1.1.2-plugin.zip
6.1.0          2.1.0             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/6.1.1/elasticsearch-analysis-openkoreantext-6.1.0.2-plugin.zip
6.0.0          2.1.0             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/6.0.0.2/elasticsearch-analysis-openkoreantext-6.0.0.2-plugin.zip
5.6.5          2.1.0             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/6.1.1/elasticsearch-analysis-openkoreantext-5.6.5.2-plugin.zip
5.6.4          2.1.0             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.6.4.2/elasticsearch-analysis-openkoreantext-5.6.4.2-plugin.zip
5.6.3          2.1.0             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.6.4.2/elasticsearch-analysis-openkoreantext-5.6.3.2-plugin.zip
5.6.2          2.1.0             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/v5.6.x/elasticsearch-analysis-openkoreantext-5.6.2.2-plugin.zip
5.6.1          2.1.0             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/v5.6.x/elasticsearch-analysis-openkoreantext-5.6.1.2-plugin.zip
5.6.0          2.1.0             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/v5.6.x/elasticsearch-analysis-openkoreantext-5.6.0.2-plugin.zip
5.5.2          2.1.0             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.5.2.2/elasticsearch-analysis-openkoreantext-5.5.2.2-plugin.zip
5.5.1          2.1.0             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.5.1.2.1/elasticsearch-analysis-openkoreantext-5.5.1.2-plugin.zip
5.5.0          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.5.0.2/elasticsearch-analysis-openkoreantext-5.5.0.2-plugin.zip
5.4.3          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.2.2/elasticsearch-analysis-openkoreantext-5.4.3.2-plugin.zip
5.4.2          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.2.2/elasticsearch-analysis-openkoreantext-5.4.2.2-plugin.zip
5.4.1          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.1.2/elasticsearch-analysis-openkoreantext-5.4.1.2-plugin.zip
5.4.0          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.0.2/elasticsearch-analysis-openkoreantext-5.4.0.2-plugin.zip
5.3.2          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.0.2/elasticsearch-analysis-openkoreantext-5.3.2.2-plugin.zip
5.3.1          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.0.2/elasticsearch-analysis-openkoreantext-5.3.1.2-plugin.zip
5.3.0          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.0.2/elasticsearch-analysis-openkoreantext-5.3.0.2-plugin.zip
5.2.2          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.0.2/elasticsearch-analysis-openkoreantext-5.2.2.2-plugin.zip
5.2.1          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.0.2/elasticsearch-analysis-openkoreantext-5.2.1.2-plugin.zip
5.1.2          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.0.2/elasticsearch-analysis-openkoreantext-5.1.2.2-plugin.zip
5.1.1          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.0.2/elasticsearch-analysis-openkoreantext-5.1.1.2-plugin.zip
5.1.0          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.0.2/elasticsearch-analysis-openkoreantext-5.1.0.2-plugin.zip
5.0.2          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.0.2/elasticsearch-analysis-openkoreantext-5.0.2.2-plugin.zip
5.0.1          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.0.2/elasticsearch-analysis-openkoreantext-5.0.1.2-plugin.zip
5.0.0          2.0.1             https://github.com/open-korean-text/elasticsearch-analysis-openkoreantext/releases/download/5.4.0.2/elasticsearch-analysis-openkoreantext-5.0.0.2-plugin.zip
  • Versions below 5.0.0 are not supported. For those versions, please refer to other plugins written with open-korean-text.

License

Licensed under the Apache License, Version 2.0: http://www.apache.org/licenses/LICENSE-2.0