• This repository has been archived on 18/Dec/2021
  • Language
    Java
  • License
    Apache License 2.0


A plugin for language detection in Elasticsearch using Nakatani Shuyo's language detector

A langdetect plugin for Elasticsearch


This plugin for Elasticsearch is built on Nakatani Shuyo's language detector.

It uses character 3-grams and a Bayesian filter with various normalizations and feature sampling. The precision is over 99% for 53 languages.
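
To make the idea of character 3-grams combined with a Bayesian filter more concrete, here is a minimal sketch in Java. It is not the plugin's code: the class name `TrigramBayesSketch`, the toy profile format (language code mapped to 3-gram frequencies), and the `smoothing` parameter are assumptions made for illustration, and the normalizations and feature sampling mentioned above are left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration class, not part of the plugin.
final class TrigramBayesSketch {

    /** Returns all overlapping 3-grams of the text (UTF-16 code units, for brevity). */
    static List<String> trigrams(String text) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + 3 <= text.length(); i++) {
            grams.add(text.substring(i, i + 3));
        }
        return grams;
    }

    /**
     * Naive Bayes with a uniform prior over the given languages.
     * profiles: language code -> (3-gram -> relative frequency), toy data.
     * smoothing must be positive so that unseen 3-grams do not zero out a language.
     */
    static Map<String, Double> classify(String text,
                                        Map<String, Map<String, Double>> profiles,
                                        double smoothing) {
        Map<String, Double> logScores = new LinkedHashMap<>();
        for (String lang : profiles.keySet()) {
            logScores.put(lang, 0.0);
        }
        // Add up the log-likelihood of every 3-gram for every language.
        for (String gram : trigrams(text)) {
            for (Map.Entry<String, Map<String, Double>> profile : profiles.entrySet()) {
                double likelihood = profile.getValue().getOrDefault(gram, 0.0) + smoothing;
                logScores.merge(profile.getKey(), Math.log(likelihood), Double::sum);
            }
        }
        // Turn log scores into probabilities that sum to 1.
        double max = Double.NEGATIVE_INFINITY;
        for (double score : logScores.values()) {
            max = Math.max(max, score);
        }
        double sum = 0.0;
        for (double score : logScores.values()) {
            sum += Math.exp(score - max);
        }
        Map<String, Double> posterior = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : logScores.entrySet()) {
            posterior.put(e.getKey(), Math.exp(e.getValue() - max) / sum);
        }
        return posterior;
    }
}
```

Working in log space avoids numeric underflow on longer texts; real language profiles contain many thousands of n-grams per language rather than the handful a toy example would use.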

The plugin offers a mapping type to specify fields where you want to enable language detection. Detected languages are indexed into a subfield of the field named 'lang', as you can see in the example. The field can be queried for language codes.

You can use the multi_field mapping type to combine this plugin with the attachment mapper plugin, to enable language detection in base64-encoded binary data. Currently, only UTF-8 texts are supported.

The plugin also offers a REST endpoint to which a short UTF-8 text can be posted; the plugin responds with a list of recognized languages.

Here is the list of recognized language codes:

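Table 1. Languages

| Code | Description |
|------|-------------|
| af | Afrikaans |
| ar | Arabic |
| bg | Bulgarian |
| bn | Bengali |
| cs | Czech |
| da | Danish |
| de | German |
| el | Greek |
| en | English |
| es | Spanish |
| et | Estonian |
| fa | Farsi |
| fi | Finnish |
| fr | French |
| gu | Gujarati |
| he | Hebrew |
| hi | Hindi |
| hr | Croatian |
| hu | Hungarian |
| id | Indonesian |
| it | Italian |
| ja | Japanese |
| kn | Kannada |
| ko | Korean |
| lt | Lithuanian |
| lv | Latvian |
| mk | Macedonian |
| ml | Malayalam |
| mr | Marathi |
| ne | Nepali |
| nl | Dutch |
| no | Norwegian |
| pa | Eastern Punjabi |
| pl | Polish |
| pt | Portuguese |
| ro | Romanian |
| ru | Russian |
| sk | Slovak |
| sl | Slovene |
| so | Somali |
| sq | Albanian |
| sv | Swedish |
| sw | Swahili |
| ta | Tamil |
| te | Telugu |
| th | Thai |
| tl | Tagalog |
| tr | Turkish |
| uk | Ukrainian |
| ur | Urdu |
| vi | Vietnamese |
| zh-cn | Chinese |
| zh-tw | Traditional Chinese characters (Taiwan, Hongkong, Macau) |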

Table 2. Compatibility matrix

| Plugin version | Elasticsearch version | Release date |
|----------------|-----------------------|--------------|
| 5.4.0.2 | 5.4.0 | Jun 8, 2017 |
| 5.4.0.1 | 5.4.0 | May 30, 2017 |
| 5.4.0.0 | 5.4.0 | May 10, 2017 |
| 5.3.2.0 | 5.3.2 | Apr 30, 2017 |
| 5.3.1.0 | 5.3.1 | Apr 30, 2017 |
| 5.3.0.2 | 5.3.0 | Apr 3, 2017 |
| 5.3.0.1 | 5.3.0 | Apr 1, 2017 |
| 5.3.0.0 | 5.3.0 | Mar 30, 2017 |
| 5.2.2.0 | 5.2.2 | Mar 2, 2017 |
| 5.2.1.0 | 5.2.1 | Mar 2, 2017 |
| 5.1.2.0 | 5.1.2 | Jan 26, 2017 |
| 2.4.4.1 | 2.4.4 | Jan 25, 2017 |
| 2.3.3.0 | 2.3.3 | Jun 11, 2016 |
| 2.3.2.0 | 2.3.2 | Jun 11, 2016 |
| 2.3.1.0 | 2.3.1 | Apr 11, 2016 |
| 2.2.1.0 | 2.2.1 | Apr 11, 2016 |
| 2.2.0.2 | 2.2.0 | Mar 25, 2016 |
| 2.2.0.1 | 2.2.0 | Mar 6, 2016 |
| 2.1.1.0 | 2.1.1 | Dec 20, 2015 |
| 2.1.0.0 | 2.1.0 | Dec 15, 2015 |
| 2.0.1.0 | 2.0.1 | Dec 15, 2015 |
| 2.0.0.0 | 2.0.0 | Nov 12, 2015 |
| 1.6.0.0 | 1.6.0 | Jul 1, 2015 |
| 1.4.4.1 | 1.4.4 | Apr 3, 2015 |
| 1.4.4.1 | 1.4.4 | Mar 4, 2015 |
| 1.4.0.2 | 1.4.0 | Nov 26, 2014 |
| 1.4.0.1 | 1.4.0 | Nov 20, 2014 |
| 1.4.0.0 | 1.4.0 | Nov 14, 2014 |
| 1.3.1.0 | 1.3.0 | Jul 30, 2014 |
| 1.2.1.1 | 1.2.1 | Jun 18, 2014 |

Installation

Elasticsearch 5.x

./bin/elasticsearch-plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/5.4.0.2/elasticsearch-langdetect-5.4.0.2-plugin.zip

Elasticsearch 2.x

./bin/plugin install http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/2.4.4.1/elasticsearch-langdetect-2.4.4.1-plugin.zip

Elasticsearch 1.x

./bin/plugin -install langdetect -url http://xbib.org/repository/org/xbib/elasticsearch/plugin/elasticsearch-langdetect/1.6.0.0/elasticsearch-langdetect-1.6.0.0-plugin.zip

Do not forget to restart the node after installing.

Examples

Note
The examples are written for Elasticsearch 5.x and need to be adapted for earlier versions of Elasticsearch.

A simple language detection example

In this example, we create a simple detector field, and write text to it for detection.

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "langdetect",
               "languages" : [ "en", "de", "fr" ]
            }
         }
      }
   }
}

PUT /test/docs/1
{
      "text" : "Oh, say can you see by the dawn`s early light, What so proudly we hailed at the twilight`s last gleaming?"
}

PUT /test/docs/2
{
      "text" : "Einigkeit und Recht und Freiheit für das deutsche Vaterland!"
}

PUT /test/docs/3
{
      "text" : "Allons enfants de la Patrie, Le jour de gloire est arrivé!"
}

POST /test/_search
{
       "query" : {
           "term" : {
                "text" : "en"
           }
       }
}

POST /test/_search
{
       "query" : {
           "term" : {
                "text" : "de"
           }
       }
}

POST /test/_search
{
       "query" : {
           "term" : {
                "text" : "fr"
           }
       }
}

Indexing language-detected text alongside the language code

Indexing only the language code is not enough in most cases. The language-detected text should also be passed to a specific analyzer so that language-specific analysis is applied. The plugin supports this through the language_to parameter, which maps a detected language code to the field the text is copied to.

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "langdetect",
               "languages": [
                  "de",
                  "en",
                  "fr",
                  "nl",
                  "it"
               ],
               "language_to": {
                  "de": "german_field",
                  "en": "english_field"
               }
            },
            "german_field": {
               "analyzer": "german",
               "type": "string"
            },
            "english_field": {
               "analyzer": "english",
               "type": "string"
            }
         }
      }
   }
}

PUT /test/docs/1
{
  "text" : "Oh, say can you see by the dawn`s early light, What so proudly we hailed at the twilight`s last gleaming?"
}

POST /test/_search
{
   "query" : {
       "match" : {
            "english_field" : "light"
       }
   }
}

Language code and multi_field

Using multi-fields, it is possible to store the text alongside the detected language(s). Here, we use another (short nonsense) example text for demonstration, which has more than one detected language code.

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "text",
               "fields": {
                  "language": {
                     "type": "langdetect",
                     "languages": [
                        "de",
                        "en",
                        "fr",
                        "nl",
                        "it"
                     ],
                     "store": true
                  }
               }
            }
         }
      }
   }
}

PUT /test/docs/1
{
    "text" : "Oh, say can you see by the dawn`s early light, What so proudly we hailed at the twilight`s last gleaming?"
}

POST /test/_search
{
   "query" : {
       "match" : {
            "text" : "light"
       }
   }
}

POST /test/_search
{
   "query" : {
       "match" : {
            "text.language" : "en"
       }
   }
}

Language detection in a binary field with the attachment mapper plugin

DELETE /test
PUT /test
{
   "mappings": {
      "docs": {
         "properties": {
            "text": {
               "type": "attachment",
               "fields": {
                  "content": {
                     "type": "text",
                     "fields": {
                        "language": {
                           "type": "langdetect",
                           "binary": true
                        }
                     }
                  }
               }
            }
         }
      }
   }
}

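On a shell, enter the following commands:

# build a JSON document whose "content" field holds the base64-encoded text
rm -f index.tmp
echo -n '{"content":"' >> index.tmp
# strip the trailing newline that base64 emits, so the JSON string stays valid
echo "This is a very simple text in plain english" | base64 | tr -d '\n' >> index.tmp
echo -n '"}' >> index.tmp
curl -XPOST --data-binary "@index.tmp" 'localhost:9200/test/docs/1'
rm index.tmp

POST /test/_refresh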

POST /test/_search
{
   "query" : {
       "match" : {
            "content" : "very simple"
       }
   }
}

POST /test/_search
{
   "query" : {
       "match" : {
            "content.language" : "en"
       }
   }
}
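
For readers who index from Java rather than from a shell, the request body assembled by the shell commands above can be sketched as follows. The class name `AttachmentBodySketch` and the method `contentBody` are hypothetical; only the `content` field name and the use of base64 come from the example above.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of the plugin.
final class AttachmentBodySketch {

    /** Builds {"content":"<base64 of the UTF-8 bytes of text>"}. */
    static String contentBody(String text) {
        String encoded = Base64.getEncoder()
                .encodeToString(text.getBytes(StandardCharsets.UTF_8));
        // The base64 alphabet (A-Z, a-z, 0-9, +, /, =) needs no JSON escaping.
        return "{\"content\":\"" + encoded + "\"}";
    }
}
```

Unlike the `echo` pipeline, this does not append a newline to the text before encoding, so the encoded value differs by that one trailing character.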

Language detection REST API Example

curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'This is a test'
{
  "languages" : [
    {
      "language" : "en",
      "probability" : 0.9999972283490304
    }
  ]
}
curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'Das ist ein Test'
{
  "languages" : [
    {
      "language" : "de",
      "probability" : 0.9999985460514316
    }
  ]
}
curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'Datt isse ne test'
{
  "languages" : [
    {
      "language" : "no",
      "probability" : 0.5714275763833249
    },
    {
      "language" : "nl",
      "probability" : 0.28571402563882925
    },
    {
      "language" : "de",
      "probability" : 0.14285660343967294
    }
  ]
}
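
A client usually wants the most probable language from such a response. The following is a minimal sketch that extracts the `language`/`probability` pairs with a regular expression, assuming the response has the shape shown above. The class `LangdetectResponseSketch` is hypothetical, and a real client would more likely use a JSON library instead of a regular expression.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the plugin.
final class LangdetectResponseSketch {

    // Matches one {"language": "...", "probability": ...} entry of the response.
    private static final Pattern ENTRY = Pattern.compile(
            "\"language\"\\s*:\\s*\"([^\"]+)\"\\s*,\\s*\"probability\"\\s*:\\s*([0-9.eE+-]+)");

    /** Returns language code -> probability, in the order of the response. */
    static Map<String, Double> parse(String responseBody) {
        Map<String, Double> result = new LinkedHashMap<>();
        Matcher matcher = ENTRY.matcher(responseBody);
        while (matcher.find()) {
            result.put(matcher.group(1), Double.parseDouble(matcher.group(2)));
        }
        return result;
    }

    /** Returns the code with the highest probability, or null if there is none. */
    static String best(Map<String, Double> languages) {
        String best = null;
        double bestProbability = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> entry : languages.entrySet()) {
            if (entry.getValue() > bestProbability) {
                bestProbability = entry.getValue();
                best = entry.getKey();
            }
        }
        return best;
    }
}
```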

Use _langdetect endpoint from Sense

GET _langdetect
{
   "text": "das ist ein test"
}

Change profile of language detection

There is a "short-text" profile that is better at detecting the language of texts that are only a few words long. It is selected with the profile request parameter.

curl -XPOST 'localhost:9200/_langdetect?pretty&profile=short-text' -d 'Das ist ein Test'
{
  "profile" : "/langdetect/short-text/",
  "languages" : [ {
    "language" : "de",
    "probability" : 0.9999993070517024
  } ]
}
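
The curl calls above can also be issued from Java. The sketch below uses the JDK's `java.net.http` client (Java 11 or later) and assumes a node reachable at a base URL such as `http://localhost:9200`. The class `LangdetectClientSketch` and its method names are hypothetical; only the `_langdetect` path and the `profile` parameter are taken from the examples above.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

// Hypothetical client helper, not part of the plugin.
final class LangdetectClientSketch {

    /** Builds a POST to <baseUrl>/_langdetect; profile may be null for the default. */
    static HttpRequest buildRequest(String baseUrl, String text, String profile) {
        String url = baseUrl + "/_langdetect";
        if (profile != null) {
            url += "?profile=" + URLEncoder.encode(profile, StandardCharsets.UTF_8);
        }
        return HttpRequest.newBuilder(URI.create(url))
                // The endpoint expects the text in UTF-8.
                .POST(HttpRequest.BodyPublishers.ofString(text, StandardCharsets.UTF_8))
                .build();
    }

    /** Sends the request and returns the raw JSON response body. */
    static String detect(String baseUrl, String text, String profile)
            throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(buildRequest(baseUrl, text, profile),
                      HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

For example, `LangdetectClientSketch.detect("http://localhost:9200", "Das ist ein Test", "short-text")` would send the same request as the curl call above, provided a node with the plugin installed is running there.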

Settings

These settings can be used in elasticsearch.yml to modify language detection.

Use with caution. You don't need to modify these settings; the list is given only for the sake of completeness. To modify the model parameters successfully, you should study the source code and be familiar with probabilistic matching using naive Bayes with character n-grams. See also Ted Dunning, Statistical Identification of Language, 1994.

| Name | Description |
|------|-------------|
| languages | a comma-separated list of language codes such as (de,en,fr…) used to restrict (and speed up) the detection process |
| map.<code> | a substitution code for a language code |
| number_of_trials | number of trials, affects CPU usage (default: 7) |
| alpha | additional smoothing parameter, default: 0.5 |
| alpha_width | the width of smoothing, default: 0.05 |
| iteration_limit | safeguard to break loop, default: 10000 |
| prob_threshold | default: 0.1 |
| conv_threshold | detection is terminated when normalized probability exceeds this threshold, default: 0.99999 |
| base_freq | default: 10000 |
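
To give a feeling for how these parameters might interact, here is a simplified sketch of a randomized detection loop. It is an assumption-laden illustration, not the plugin's implementation. It assumes that each trial starts from a uniform prior, jitters `alpha` by a Gaussian of width `alpha_width`, uses `alpha / base_freq` as the smoothing weight, samples n-grams at random, and stops once the best normalized probability exceeds `conv_threshold` or `iteration_limit` is reached. `prob_threshold` is not covered. The class `DetectionLoopSketch` and the `gramProbs` data layout are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Random;

// Hypothetical illustration of the settings' roles; verify against the source code.
final class DetectionLoopSketch {
    static final int NUMBER_OF_TRIALS = 7;
    static final double ALPHA = 0.5;
    static final double ALPHA_WIDTH = 0.05;
    static final int ITERATION_LIMIT = 10000;
    static final double CONV_THRESHOLD = 0.99999;
    static final double BASE_FREQ = 10000;

    /**
     * grams: n-grams extracted from the text.
     * gramProbs: n-gram -> probability per language (array index = language).
     * Returns the per-language probability averaged over all trials.
     */
    static double[] detect(List<String> grams, Map<String, double[]> gramProbs,
                           int languageCount, long seed) {
        double[] result = new double[languageCount];
        if (grams.isEmpty()) {
            return result;
        }
        Random random = new Random(seed);
        for (int trial = 0; trial < NUMBER_OF_TRIALS; trial++) {
            double[] prob = new double[languageCount];
            Arrays.fill(prob, 1.0 / languageCount); // uniform prior
            // Jitter the smoothing per trial; clamp to keep the weight positive.
            double alpha = Math.max(1e-9, ALPHA + random.nextGaussian() * ALPHA_WIDTH);
            double weight = alpha / BASE_FREQ;
            for (int i = 0; ; i++) {
                double[] p = gramProbs.get(grams.get(random.nextInt(grams.size())));
                if (p != null) {
                    for (int j = 0; j < languageCount; j++) {
                        prob[j] *= weight + p[j]; // Bayesian update with smoothing
                    }
                }
                // Normalize and check for convergence every fifth step.
                if (i % 5 == 0) {
                    double best = normalize(prob);
                    if (best > CONV_THRESHOLD || i >= ITERATION_LIMIT) {
                        break;
                    }
                }
            }
            for (int j = 0; j < languageCount; j++) {
                result[j] += prob[j] / NUMBER_OF_TRIALS;
            }
        }
        return result;
    }

    /** Scales prob so that it sums to 1 and returns its largest entry. */
    private static double normalize(double[] prob) {
        double sum = 0.0;
        for (double value : prob) {
            sum += value;
        }
        double max = 0.0;
        for (int j = 0; j < prob.length; j++) {
            prob[j] /= sum;
            max = Math.max(max, prob[j]);
        }
        return max;
    }
}
```

Under these assumptions, more trials smooth out the randomness of the sampling at the cost of CPU time, which matches the note on number_of_trials in the table.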

Issues

All feedback is welcome! If you find issues, please post them on GitHub.

Credits

Thanks to Alexander Reelsen for his OpenNLP plugin, from where I have copied and adapted the mapping type code.

License

elasticsearch-langdetect - a language detection plugin for Elasticsearch

Derived work of language-detection by Nakatani Shuyo http://code.google.com/p/language-detection/

Copyright © 2012 Jörg Prante

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
