🔍 Emoji, flags & emoticons support for Elasticsearch
Add support for emoji and flags in any Lucene-compatible search engine!
If you wish to search 🍩 to find donuts in your documents, you have come to the right place. We offer synonym files ready to use in Elasticsearch and OpenSearch analyzers.
Requirements to index emoji in Elasticsearch
There are no requirements for Elasticsearch >= 6.7.
Using an older version of Elasticsearch? Open me! 😱
Version | Requirements |
---|---|
Elasticsearch >= 6.4 and < 6.7 | You need to install the official ICU Plugin. See our blog post about this change. |
Elasticsearch < 6.4 | You need our custom ICU Tokenizer Plugin, see our blog post (2016). |
Run the following test to verify that you get 4 EMOJI tokens:
GET _analyze
{
  "text": ["🍩 🇫🇷 👩‍🚒 🚣🏾‍♀"]
}
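If you prefer to check the result programmatically, here is a minimal Python sketch that counts the emoji tokens in the JSON body returned by _analyze. It assumes the tokenizer reports the token type as `<EMOJI>`; the helper name `count_emoji_tokens` and the abbreviated `sample_response` are hypothetical and only illustrate the shape of the response.

```python
def count_emoji_tokens(analyze_response: dict) -> int:
    """Count the tokens whose type is <EMOJI> in an _analyze response body."""
    tokens = analyze_response.get("tokens", [])
    return sum(1 for token in tokens if token.get("type") == "<EMOJI>")


# Hypothetical, abbreviated response for the request above
# (a real response also carries offsets for each token).
sample_response = {
    "tokens": [
        {"token": "🍩", "type": "<EMOJI>", "position": 0},
        {"token": "🇫🇷", "type": "<EMOJI>", "position": 1},
        {"token": "👩‍🚒", "type": "<EMOJI>", "position": 2},
        {"token": "🚣🏾‍♀", "type": "<EMOJI>", "position": 3},
    ]
}

emoji_count = count_emoji_tokens(sample_response)
print(emoji_count)
```

If the count is lower than 4, the tokenizer is probably splitting or dropping some emoji, and the requirements for older versions apply.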
The Synonyms, flags and emoticons
What you need to search with emoji is a way to expand them to words that can match searches and documents, in your language. That's the goal of the synonym dictionaries.
We build Solr / Lucene compatible synonym files in all languages supported by Unicode CLDR so you can set them up in an analyzer. They look like this:
👩‍🚒 => 👩‍🚒, firefighter, firetruck, woman
👩‍✈ => 👩‍✈, pilot, plane, woman
🥓 => 🥓, bacon, meat, food
🥔 => 🥔, potato, vegetable, food
😅 => 😅, cold, face, open, smile, sweat
😆 => 😆, face, laugh, mouth, open, satisfied, smile
🚎 => 🚎, bus, tram, trolley
🇫🇷 => 🇫🇷, france
🇬🇧 => 🇬🇧, united kingdom
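To make the expansion concrete, here is a rough Python sketch of what such a rule does to a token stream. It is a simplification and not how Lucene implements synonyms: the helpers `parse_synonym_line` and `expand` are hypothetical, and multi-word terms such as "united kingdom" are kept as a single string here, whereas a real analyzer would tokenize them.

```python
from typing import Dict, List, Tuple


def parse_synonym_line(line: str) -> Tuple[str, List[str]]:
    """Split 'source => a, b, c' into the source token and its replacement terms."""
    source, _, targets = line.partition("=>")
    terms = [term.strip() for term in targets.split(",") if term.strip()]
    return source.strip(), terms


def expand(tokens: List[str], rules: Dict[str, List[str]]) -> List[str]:
    """Replace each token that has a rule by its terms; keep other tokens as-is."""
    expanded: List[str] = []
    for token in tokens:
        expanded.extend(rules.get(token, [token]))
    return expanded


rules = dict(
    parse_synonym_line(line)
    for line in [
        "🥓 => 🥓, bacon, meat, food",
        "🇫🇷 => 🇫🇷, france",
    ]
)

expanded_tokens = expand(["i", "love", "🥓"], rules)
print(expanded_tokens)
```

Because the emoji itself is kept on the right-hand side of each rule, a search for the emoji and a search for "bacon" can both match a document containing 🥓.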
For emoticons, use this mapping with a char_filter to replace emoticons with emoji.
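As an illustration, the following Python sketch builds the analysis settings for such a char_filter and prints them as JSON. It assumes the emoticon file is stored as analysis/emoticons.txt and loaded through the `mappings_path` option of the `mapping` char filter; the names `emoticons` and `with_emoticons` are hypothetical and can be changed freely.

```python
import json


def emoticon_analysis_settings(mappings_path: str = "analysis/emoticons.txt") -> dict:
    """Build an 'analysis' settings fragment that maps emoticons to emoji."""
    return {
        "char_filter": {
            # Hypothetical filter name; "mapping" is the char filter type.
            "emoticons": {
                "type": "mapping",
                "mappings_path": mappings_path,
            }
        },
        "analyzer": {
            # Hypothetical analyzer name, kept minimal for the example.
            "with_emoticons": {
                "char_filter": ["emoticons"],
                "tokenizer": "standard",
                "filter": ["lowercase"],
            }
        },
    }


settings_body = {"settings": {"analysis": emoticon_analysis_settings()}}
print(json.dumps(settings_body, indent=2))
```

Char filters run before the tokenizer, so the emoticon is already an emoji by the time the emoji synonym filter sees the token.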
Installation
Download the emoji and emoticon files you want from this repository and store them in PATH_TO_ES/config/analysis (or anywhere Elasticsearch can read).
config
├── analysis
│   ├── cldr-emoji-annotation-synonyms-en.txt
│   └── emoticons.txt
├── elasticsearch.yml
...
Use them like this (this is a complete English example for Elasticsearch >= 6.7):
PUT /tweets
{
  "settings": {
    "analysis": {
      "filter": {
        "english_emoji": {
          "type": "synonym",
          "synonyms_path": "analysis/cldr-emoji-annotation-synonyms-en.txt"
        },
        "emoji_variation_selector_filter": {
          "type": "pattern_replace",
          "pattern": "\\uFE0E|\\uFE0F",
          "replacement": ""
        },
        "english_stop": {
          "type": "stop",
          "stopwords": "_english_"
        },
        "english_keywords": {
          "type": "keyword_marker",
          "keywords": ["example"]
        },
        "english_stemmer": {
          "type": "stemmer",
          "language": "english"
        },
        "english_possessive_stemmer": {
          "type": "stemmer",
          "language": "possessive_english"
        }
      },
      "analyzer": {
        "english_with_emoji": {
          "tokenizer": "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "emoji_variation_selector_filter",
            "english_emoji",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "english_with_emoji"
      }
    }
  }
}
You can now test the result with:
GET tweets/_analyze
{
  "field": "content",
  "text": "🍩 🇫🇷 👩‍🚒 🚣🏾‍♀"
}
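The emoji_variation_selector_filter in the analyzer above removes the variation selectors U+FE0E and U+FE0F from tokens, presumably so that an emoji typed with or without a selector ends up as the same token before the synonym lookup. The Python sketch below reproduces that replacement outside Elasticsearch; the function name `strip_variation_selectors` is hypothetical.

```python
import re

# Same pattern as the filter: text-style (U+FE0E) and emoji-style (U+FE0F) selectors.
VARIATION_SELECTORS = re.compile("\uFE0E|\uFE0F")


def strip_variation_selectors(token: str) -> str:
    """Remove variation selectors so both forms of an emoji compare equal."""
    return VARIATION_SELECTORS.sub("", token)


heart_plain = "\u2764"              # heart without any selector
heart_emoji_style = "\u2764\uFE0F"  # heart followed by the emoji-style selector

normalized = strip_variation_selectors(heart_emoji_style)
print(normalized == heart_plain)
```

Without this step the two forms are different strings, so only one of them would match the corresponding line of the synonym file.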
How to contribute
Build from CLDR SVN
You will need:
- php cli
- php zip and curl extensions
Edit the tag in tools/build-released.php and run php tools/build-released.php.
Update emoticons
Run php tools/build-emoticon.php.
Licenses
Emoji data courtesy of CLDR. See unicode-license.txt for details. Some modifications are made to the data, see here. Emoticon data based on https://github.com/wooorm/emoticon/ (MIT).
This repository is distributed under the MIT License. Feel free to use and contribute as you please!