

STConvert Analysis for Elasticsearch

STConvert is an analyzer that converts Chinese characters between Traditional and Simplified Chinese. It supports Simplified-to-Traditional conversion, Traditional-to-Simplified conversion, and Simplified/Traditional query expansion.

You can download the pre-built package from the release page.
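
Alternatively, Elasticsearch can fetch and install the zip in one step. A minimal sketch with a placeholder URL (not a real link; pick the release that matches your Elasticsearch version from the release page):

./bin/elasticsearch-plugin install <release-zip-url>

Restart the node afterwards so the plugin is loaded.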

The plugin includes an analyzer: stconvert, a tokenizer: stconvert, a token filter: stconvert, and a char filter: stconvert.
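
As a quick smoke test, the bundled stconvert analyzer can be called directly. A sketch, assuming it uses the default s2t conversion described below, so Simplified input should come back Traditional (国际 → 國際):

GET _analyze
{
  "analyzer" : "stconvert",
  "text" : "国际"
}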

Supported config:

  • convert_type: default s2t. Optional values:

    1. s2t, convert characters from Simplified Chinese to Traditional Chinese
    2. t2s, convert characters from Traditional Chinese to Simplified Chinese
  • keep_both: default false. When true, the output keeps both the original and the converted form, joined by the delimiter (see the sketch after this list).

  • delimiter: default ",". The separator placed between the two forms when keep_both is enabled.
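
The keep_both and delimiter options can be tried without creating an index, because _analyze accepts an inline tokenizer definition. A sketch: with keep_both enabled, the emitted token should contain both the original and the converted form joined by #:

GET _analyze
{
  "tokenizer" : {
    "type" : "stconvert",
    "convert_type" : "s2t",
    "keep_both" : true,
    "delimiter" : "#"
  },
  "text" : "国际"
}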

Custom example:

PUT /stconvert/
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "tsconvert" : {
                    "tokenizer" : "tsconvert"
                }
            },
            "tokenizer" : {
                "tsconvert" : {
                    "type" : "stconvert",
                    "delimiter" : "#",
                    "keep_both" : false,
                    "convert_type" : "t2s"
                }
            },
            "filter" : {
                "tsconvert" : {
                    "type" : "stconvert",
                    "delimiter" : "#",
                    "keep_both" : false,
                    "convert_type" : "t2s"
                }
            },
            "char_filter" : {
                "tsconvert" : {
                    "type" : "stconvert",
                    "convert_type" : "t2s"
                }
            }
        }
    }
}
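
With the index created, the custom tsconvert analyzer can be exercised directly. Since it wraps the t2s tokenizer defined above, Traditional input should come back Simplified (國際 → 国际):

GET stconvert/_analyze
{
  "analyzer" : "tsconvert",
  "text" : "國際"
}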

Analyze tests

GET stconvert/_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["lowercase"],
  "char_filter" : ["tsconvert"],
  "text" : "ๅ›ฝ้™…ๅœ‹้š›"
}

Output:
{
  "tokens": [
    {
      "token": "ๅ›ฝ้™…ๅ›ฝ้™…",
      "start_offset": 0,
      "end_offset": 4,
      "type": "word",
      "position": 0
    }
  ]
}
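
The same conversion also works as a token filter. Using the tsconvert filter from the index settings in place of the char filter should likewise reduce 國際 to 国际:

GET stconvert/_analyze
{
  "tokenizer" : "keyword",
  "filter" : ["tsconvert"],
  "text" : "國際"
}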

Normalizer usage

DELETE index
PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "tsconvert": {
          "type": "stconvert",
          "convert_type": "t2s"
        }
      },
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [
            "tsconvert"
          ],
          "filter": [
            "lowercase"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword",
        "normalizer": "my_normalizer"
      }
    }
  }
}

PUT index/_doc/1
{
  "foo": "ๅœ‹้š›"
}

PUT index/_doc/2
{
  "foo": "ๅ›ฝ้™…"
}

GET index/_search
{
  "query": {
    "term": {
      "foo": "ๅ›ฝ้™…"
    }
  }
}

GET index/_search
{
  "query": {
    "term": {
      "foo": "ๅœ‹้š›"
    }
  }
}
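
Both term queries should return both documents: my_normalizer applies the t2s char filter at index time and at query time, so 國際 and 国际 both normalize to the same stored value (国际) and therefore match each other.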