• This repository has been archived on 24/Feb/2021
  • Stars
    star
    202
  • Rank 193,691 (Top 4 %)
  • Language
    Java
  • Created about 13 years ago
  • Updated about 9 years ago

Repository Details

Plugin for elasticsearch which uses the lucene FSTSuggester

DO NOT USE THIS PLUGIN ANYMORE

This plugin has been superseded by the completion suggester in Elasticsearch and is not developed further. There is an excellent introductory blog post available as well.

Development of this plugin stopped with the release for Elasticsearch 1.3, a version you should not use anymore!

Suggester Plugin for Elasticsearch

Note: If you only need prefix suggestions, please use the new completion suggest feature available since elasticsearch 0.90.3, which features blazing fast real time suggestions, uses the AnalyzingSuggester under the hood and will also support fuzzy mode in 0.90.4.

This plugin uses the FSTSuggester, the AnalyzingSuggester or the FuzzySuggester from Lucene to create suggestions from a certain field for a specified term instead of returning the whole document data.

Feel free to comment, improve and help - I am thankful for any insights, no matter whether they concern elasticsearch, lucene or any other mistakes I will surely have made.

Oh and in case you have not read it above:

In case you want to contact me, drop me a mail at [email protected]

Breaking changes for elasticsearch 0.90

Because elasticsearch now comes with its own suggest API (not based on in-memory automatons per shard), large parts of this plugin need to be changed.

REST endpoints have been moved

Both REST endpoints have been moved. The /_suggest endpoint now resides at /__suggest. Refreshing has changed from /_suggestRefresh to /__suggestRefresh. I do not like this renaming either, but I have not yet come up with a better name. I am totally open to suggestions. This is a WIP until elasticsearch 1.0 is released.

Package names have been moved

Everything is now in the de.spinscale package name space in order to avoid clashes. This means, if you are using the request builder classes, you will have to change your application.

Installation

If you do not want to work on the repository, just use the standard elasticsearch plugin command (inside your elasticsearch/bin directory)

bin/plugin -install de.spinscale/elasticsearch-plugin-suggest/0.90.5-0.9

Compatibility

Note: Please make sure the plugin version matches your elasticsearch version. Follow this compatibility matrix:

----------------------------------------
| suggest plugin   | Elasticsearch     |
----------------------------------------
| 1.3.2-2.0.1      | 1.3.2 -> master   |
----------------------------------------
| 1.0.1-2.0.0      | 1.0.1             |
----------------------------------------
| 0.90.12-1.1      | 0.90.12           |
----------------------------------------
| 0.90.7-1.0       | 0.90.7            |
----------------------------------------
| 0.90.5-0.9       | 0.90.5            |
----------------------------------------
| 0.90.3-0.8.*     | 0.90.3            |
----------------------------------------
| 0.90.1-0.7       | 0.90.1            |
----------------------------------------
| 0.90.0-0.6.*     | 0.90.0            |
----------------------------------------
| 0.20.5-0.5       | 0.20.5 -> 0.20.6  |
----------------------------------------
| 0.20.2-0.4       | 0.20.2 -> 0.20.4  |
----------------------------------------
| 0.19.12-0.2      | 0.19.12           |
----------------------------------------
| 0.19.11-0.1      | 0.19.11           |
----------------------------------------

Development

If you want to work on the repository

  • Clone this repo with git clone git://github.com/spinscale/elasticsearch-suggest-plugin.git
  • Check out the tag you want to build with (list the tags via git tag); master may not match your elasticsearch version
  • Run: mvn clean package -DskipTests=true - this does not run any unit tests, as they take some time. If you want to run them, better run mvn clean package
  • Install the plugin: /path/to/elasticsearch/bin/plugin -install elasticsearch-suggest -url file:///$PWD/target/releases/elasticsearch-suggest-$version.zip

Alternatively you can now use this plugin via maven and include it via the sonatype repo like this in your pom.xml (or any other dependency manager)

<repositories>
  <repository>
    <id>Sonatype</id>
    <name>Sonatype</name>
    <url>http://oss.sonatype.org/content/repositories/releases/</url>
  </repository>
</repositories>

<dependencies>
  <dependency>
    <groupId>de.spinscale</groupId>
    <artifactId>elasticsearch-suggest-plugin</artifactId>
    <version>1.3.2-2.0.1</version>
  </dependency>
  ...
</dependencies>

The maven repo can be visited at https://oss.sonatype.org/content/repositories/releases/de/spinscale/elasticsearch-plugin-suggest/

Usage

FST based suggestions

Fire up curl like this, in case you have a products index and the right fields - if not, read below how to set up a clean elasticsearch in order to support suggestions.

# curl -X POST 'localhost:9200/products1/product/__suggest?pretty=1' -d '{ "field": "ProductName.suggest", "term": "tischwäsche", "size": "10"  }'
{
  "suggest" : [ "tischwäsche", "tischwäsche 100", 
    "tischwäsche aberdeen", "tischwäsche acryl", "tischwäsche ambiente", 
    "tischwäsche aquarius", "tischwäsche atlanta", "tischwäsche atlas", 
    "tischwäsche augsburg", "tischwäsche aus", "tischwäsche austria" ]
}

As you can see, this queries the products index for the field ProductName.suggest with the specified term and size.
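The same request can be put together from plain Java without the plugin classes. The following is only a minimal sketch: the class and method names (SuggestHttpSketch, quote, body, request) are made up for illustration, it relies on the java.net.http API from Java 11, and it only builds the request without sending it. The JSON escaping is deliberately minimal and does not handle control characters.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper, not part of the plugin: builds the HTTP request for /__suggest.
final class SuggestHttpSketch {

    // Minimal JSON string quoting: escapes backslashes and double quotes only.
    static String quote(String value) {
        return "\"" + value.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Builds the request body with the three parameters used in the curl example.
    static String body(String field, String term, int size) {
        return "{ \"field\": " + quote(field)
                + ", \"term\": " + quote(term)
                + ", \"size\": " + size + " }";
    }

    // Builds (but does not send) a POST request to $index/$type/__suggest.
    static HttpRequest request(String baseUrl, String index, String type, String body) {
        URI uri = URI.create(baseUrl + "/" + index + "/" + type + "/__suggest");
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        String json = body("ProductName.suggest", "my", 10);
        HttpRequest req = request("http://localhost:9200", "products", "product", json);
        System.out.println(req.method() + " " + req.uri());
        System.out.println(json);
    }
}
```

To actually send the request you would pass it to java.net.http.HttpClient; that step is left out here so the sketch runs without a node.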

You can also use HTTP GET for getting suggestions - even with the callback and the source parameters like in any normal elasticsearch search.

You might want to check out the included unit tests as well. I use a shingle filter in my examples; take a look at the files in the src/test/resources directory.

Full suggestions

With Lucene 4 (and the upgrade to elasticsearch 0.90.0) two new suggesters were added: the AnalyzingSuggester and the FuzzySuggester, which is based on the former. Both have the great capability of searching on the analyzed form while returning the original form. Take this example (notice the search for a lowercase b, but getting back the original field value):

» curl -X POST localhost:9200/cars/car/__suggest -d '{ "field" : "name", "type": "full", "term" : "b", "analyzer" : "standard" }'

{"suggestions":["BMW 320","BMW 525d"],"_shards":{"total":5,"successful":5,"failed":0}}

Note: If you use type full or type fuzzy, the similarity parameter will not have any effect. In addition, these parameters are supported only for full and fuzzy:

  • analyzer:
  • index_analyzer:
  • search_analyzer:

This suggester can even ignore stopwords if configured appropriately - but only if you disable position increments for stopwords. Use this mapping and index settings when creating an index:

curl -X DELETE localhost:9200/cars
curl -X PUT localhost:9200/cars -d '{
  "mappings" : {
    "car" : {
      "properties" : {
        "name" : {
          "type" : "multi_field",
          "fields" : {
            "name":    { "type": "string", "index": "not_analyzed" }
          }
        }
      }
    }
  },
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "suggest_analyzer_stopwords" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "standard", "lowercase", "stopword_no_position_increment" ]
        },
        "suggest_analyzer_synonyms" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "standard", "lowercase", "my_synonyms" ]
        }
      },
      "filter" : {
        "stopword_no_position_increment" : {
          "type" : "stop",
          "enable_position_increments" : false
        },
        "my_synonyms" : {
          "type" : "synonym",
          "synonyms" : [ "jetta, bora" ]
        }
      }
    }
  }
}'


curl -X POST localhost:9200/cars/car -d '{ "name" : "The BMW ever" }'
curl -X POST localhost:9200/cars/car -d '{ "name" : "BMW 320" }'
curl -X POST localhost:9200/cars/car -d '{ "name" : "BMW 525d" }'
curl -X POST localhost:9200/cars/car -d '{ "name" : "VW Jetta" }'
curl -X POST localhost:9200/cars/car -d '{ "name" : "VW Bora" }'

Now when querying with a stopwords analyzer, you can even get back "The BMW ever":

» curl -X POST localhost:9200/cars/car/__suggest -d '{ "field" : "name", "type": "full", "term" : "b", "analyzer" : "suggest_analyzer_stopwords" }'
{"suggestions":["BMW 320","BMW 525d","The BMW ever"],"_shards":{"total":5,"successful":5,"failed":0}}
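To make the idea of "search on the analyzed form, return the original form" more tangible, here is a toy sketch in plain Java. It is not how Lucene implements the AnalyzingSuggester (which builds an FST); it merely lowercases, drops stopwords and does a prefix lookup in a sorted map. All names in it (AnalyzedPrefixSketch, analyze, suggest) are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Toy illustration only: analyzed form -> original form, queried by prefix.
final class AnalyzedPrefixSketch {

    // Very small "analyzer": lowercase, split on whitespace, drop stopwords.
    static String analyze(String text, Set<String> stopwords) {
        StringBuilder out = new StringBuilder();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (token.isEmpty() || stopwords.contains(token)) {
                continue;
            }
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(token);
        }
        return out.toString();
    }

    // Returns the original names whose analyzed form starts with the analyzed term.
    // Names with identical analyzed forms overwrite each other in this sketch.
    static List<String> suggest(List<String> names, String term, Set<String> stopwords) {
        TreeMap<String, String> byAnalyzedForm = new TreeMap<>();
        for (String name : names) {
            byAnalyzedForm.put(analyze(name, stopwords), name);
        }
        String prefix = analyze(term, stopwords);
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : byAnalyzedForm.tailMap(prefix, true).entrySet()) {
            if (!entry.getKey().startsWith(prefix)) {
                break; // sorted map: no later key can match the prefix
            }
            result.add(entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> names = List.of("The BMW ever", "BMW 320", "BMW 525d", "VW Jetta", "VW Bora");
        System.out.println(suggest(names, "b", Set.of()));      // no stopword handling
        System.out.println(suggest(names, "b", Set.of("the"))); // "the" removed before matching
    }
}
```

Without stopwords the analyzed form of "The BMW ever" starts with "the" and is not found for the term "b"; once "the" is dropped, it is.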

Or you could use synonyms (FYI: the Jetta and the Bora were the same car, but named differently in the USA and Europe, so a search should return both)

» curl -X POST localhost:9200/cars/car/__suggest -d '{ "field" : "name", "type": "full", "term" : "vw je", "analyzer" : "suggest_analyzer_synonyms" }'

{"suggestions":["VW Bora","VW Jetta"],"_shards":{"total":5,"successful":5,"failed":0}}

Full fuzzy suggestions

The FuzzySuggester uses Levenshtein distance to cater for typos.

» curl -X POST localhost:9200/cars/car/__suggest -d '{ "field" : "name", "type": "fuzzy", "term" : "bwm", "analyzer" : "standard" }'

{"suggestions":["BMW 320","BMW 525d"],"_shards":{"total":5,"successful":5,"failed":0}}
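For readers who have not met the Levenshtein distance before, here is a small, self-contained Java sketch of the textbook dynamic programming version. It is not the code Lucene uses (Lucene works with automatons); the class name EditDistanceSketch is made up. The optional transpositions flag counts two swapped neighbouring characters as a single edit. As far as I know the FuzzySuggester treats transpositions this way by default, but take that as an assumption rather than a statement about its internals.

```java
// Textbook edit distance, optionally counting adjacent transpositions as one edit.
final class EditDistanceSketch {

    static int distance(String a, String b, boolean transpositions) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) {
            d[i][0] = i; // delete all characters of a
        }
        for (int j = 0; j <= b.length(); j++) {
            d[0][j] = j; // insert all characters of b
        }
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int best = Math.min(
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), // deletion, insertion
                        d[i - 1][j - 1] + cost);                    // substitution or match
                if (transpositions && i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    best = Math.min(best, d[i - 2][j - 2] + 1);     // swap of two neighbours
                }
                d[i][j] = best;
            }
        }
        return d[a.length()][b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("bwm", "bmw", false)); // prints 2
        System.out.println(distance("bwm", "bmw", true));  // prints 1
    }
}
```

The typo "bwm" from the example above is two edits away from "bmw" in the plain variant, but only one edit away once a swap of neighbouring letters counts as a single edit.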

Statistics

The FuzzySuggester and the AnalyzingSuggester provide a method to find out their size, which is also exposed as its own endpoint, in case you want to monitor memory consumption of the in-memory structures.

» curl localhost:9200/__suggestStatistics
{"_shards":{"total":2,"successful":2,"failed":0},"fstStats":{"cars-0":[{"analyzingsuggester-name-queryAnalyzer:suggest_analyzer_synonyms-indexAnalyzer:suggest_analyzer_synonyms":147},{"analyzingsuggester-name-queryAnalyzer:suggest_analyzer_stopwords-indexAnalyzer:suggest_analyzer_stopwords":126}]}}

Configuration

The suggest data is not updated each time you index a new product, but periodically. The default is to update the index every 10 minutes, but you can change that in your elasticsearch.yml configuration:

suggest:
  refresh_interval: 600s

In this case the suggest indexes are refreshed every 10 minutes. This is also the default. You can use values like "10s", "10ms" or "10m" as with most other time based configuration settings in elasticsearch.
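If you generate this setting from your own tooling, it can help to check what a value like "600s" amounts to. The following Java sketch converts the three suffixes mentioned above into milliseconds. It is a hypothetical helper (RefreshIntervalSketch is a made-up name), not the parser elasticsearch itself uses, and it does not cover other units.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical helper: converts "10ms", "10s" or "10m" into milliseconds.
final class RefreshIntervalSketch {

    static long toMillis(String value) {
        String v = value.trim();
        // "ms" has to be checked first, because it also ends with "s".
        if (v.endsWith("ms")) {
            return Long.parseLong(v.substring(0, v.length() - 2));
        }
        if (v.endsWith("s")) {
            return TimeUnit.SECONDS.toMillis(Long.parseLong(v.substring(0, v.length() - 1)));
        }
        if (v.endsWith("m")) {
            return TimeUnit.MINUTES.toMillis(Long.parseLong(v.substring(0, v.length() - 1)));
        }
        throw new IllegalArgumentException("unsupported time value: " + value);
    }

    public static void main(String[] args) {
        System.out.println(toMillis("600s")); // prints 600000
        System.out.println(toMillis("10m"));  // prints 600000
    }
}
```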

If you want to deactivate automatic refresh completely, put this in your elasticsearch configuration

suggest:
  refresh_disabled: true

If you want to refresh your FST suggesters manually instead of waiting for 10 minutes just issue a POST request to the /__suggestRefresh URL.

# curl -X POST 'localhost:9200/__suggestRefresh' 
# curl -X POST 'localhost:9200/products/product/__suggestRefresh' 
# curl -X POST 'localhost:9200/products/product/__suggestRefresh' -d '{ "field" : "ProductName.suggest" }'

Usage from Java

SuggestRequest request = new SuggestRequest(index);
request.term(term);
request.field(field);
request.size(size);
request.similarity(similarity);

SuggestResponse response = node.client().execute(SuggestAction.INSTANCE, request).actionGet();

Refresh works like this - you can add an index and a field in the suggest refresh request as well, if you want to trigger it externally:

SuggestRefreshRequest refreshRequest = new SuggestRefreshRequest();
SuggestRefreshResponse response = node.client().execute(SuggestRefreshAction.INSTANCE, refreshRequest).actionGet();

You can also use the included builders

List<String> suggestions = new SuggestRequestBuilder(client)
            .field(field)
            .term(term)
            .size(size)
            .similarity(similarity)
            .execute().actionGet().suggestions();
SuggestRefreshRequestBuilder builder = new SuggestRefreshRequestBuilder(client);
builder.execute().actionGet();

Thanks

  • Shay (@kimchy) for giving feedback
  • David (@dadoonet) for pushing me to get it into the maven repo
  • Adrien (@jpountz) for helping me to understand the AnalyzingSuggester details and having the idea for only creating the FST on the primary shard

TODO

  • Find and verify the absence of the current resource leak (open deleted files after lots of merging) with the new architecture
  • Create the FST structure only on the primary shard and send it to the replica over the wire as byte array
  • Allow deletion of fields in cache instead of refresh
  • Reenable the field refresh tests by checking statistics
  • Also expose the guava cache statistics in the endpoint
  • Create a testing rule that does start a node/cluster only once per test run, not per test. This costs so much time.

Changelog

  • 2014-09-03: Version bump to 1.3.2, also created a 1.0, 1.1 and 1.2 branch
  • 2014-03-01: Version bump to 1.0.1, created a 0.90 branch
  • 2014-03-01: Version bump to 0.90.12, switched to randomized elasticsearch testing resulting in testing code cleanups and waaaaaaaay faster tests
  • 2013-12-10: Version bump to 0.90.7
  • 2013-08-13: Version bump to 0.90.3. Due to changes in Lucene 4.4, please check the tests to see that stopwords need to be handled on the request side if you use the fuzzy or full mode.
  • 2013-05-31: Removing usage of jdk7 only methods, version bump to 0.90.1
  • 2013-05-25: Changing suggest statistics format, fixing cache loading bug for analyzing/fuzzysuggester
  • 2013-05-12: Fix for trying to access a closed index reader in AnalyzingSuggester (i.e. after refresh)
  • 2013-05-12: Documentation update
  • 2013-05-01: Added support for the fuzzy suggester
  • 2013-04-28: Added support for the analyzing suggester and stopwords
  • 2013-03-20: Moved to own package namespaces, changed REST endpoints in order to be compatible with elasticsearch 0.90
  • 2013-01-18: Support for HTTP GET, together with JSONP and the source parameter.
  • 2012-10-21: The REST urls can now be used without specifying a type (which is unused at the moment anyway). You can now use the $index/_suggest and $index/_suggestRefresh urls
  • 2012-10-21: Allowing to set suggest.refresh_disabled = true in order to deactivate automatic refreshing of the suggest index
  • 2012-10-06: Shutting down the shard suggest service clean in case the instance is stopped or a shard is moved
  • 2012-10-03: Starting cluster nodes in parallel in tests where several nodes are created (big speedup)
  • 2012-10-03: Added tests for refreshing suggest in memory structures for one index or one field in an index only
  • 2012-10-03: Replaced gradle with maven
  • 2012-10-03: Updated to elasticsearch 0.19.10
  • 2012-10-03: You can use the plugin now with a TransportClient for the first time. Yay!
  • 2012-10-03: Using the FSTCompletionLookup now instead of the deprecated FSTLookup
  • 2012-10-03: Pretty much a core rewrite today (having tests is great, even if they run 10 minutes). The suggest service is now implemented as service on shard level - no more central Suggester structures. The whole implementation is much cleaner and adheres way better to the whole elasticsearch architecture instead of being cowboy coded together - at least that is what I think.
  • 2012-09-30: Updated to elasticsearch 0.19.9. Making TransportClients work again not spitting an exception on startup, when the module is in classpath. Updated this docs.
  • 2012-06-25: Trying to fix another resource leak, which did not eat up diskspace but still did not close all files
  • 2012-06-11: Fixing bad resource leak due to not closing index reader properly - this led to lots of deleted files, which still had open handles, thus taking up space
  • 2012-05-13: Updated to work with elasticsearch 0.19.3
  • 2012-03-07: Updated to work with elasticsearch 0.19.0
  • 2012-02-10: Created SuggestRequestBuilder and SuggestRefreshRequestBuilder classes - results in easy to use request classes (check the examples and tests)
  • 2011-12-29: The refresh interval can now be chosen as time based value like any other elasticsearch configuration
  • 2011-12-29: Instead of having all nodes sleeping the same time and updating the suggester asynchronously, the master node now triggers the update for all slaves
  • 2011-12-20: Added transport action (and REST action) to trigger reloading of all FST suggesters
  • 2011-12-11: Fixed the biggest issue: Searchers are released now and do not leak
  • 2011-12-11: Indexing is now done periodically
  • 2011-12-11: Found a way to get the injector from the node, so I can build my tests without using HTTP requests

HOWTO - the long version

This HOWTO will help you to set up a clean elasticsearch installation with the correct index settings and mappings, so you can use the plugin as easily as possible. We will set up elasticsearch, index some products and query those for suggestions.

Get elasticsearch, install it, get this plugin, install it.

Add a suggest and a lowercase analyzer to your elasticsearch/config/elasticsearch.yml config file (or do it on index creation whatever you like)

index:
  analysis:
    analyzer:
      lowercase_analyzer:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase] 
      suggest_analyzer:
        type: custom
        tokenizer: standard
        filter: [standard, lowercase, shingle]

Start elasticsearch and create a mapping. You can either create it via configuration in a file or during index creation. We will create an index with a mapping now

curl -X PUT localhost:9200/products -d '{
    "mappings" : {
        "product" : {
            "properties" : {
                "ProductId" : { "type": "string", "index": "not_analyzed" },
                "ProductName" : {
                    "type" : "multi_field",
                    "fields" : {
                        "ProductName" : { "type": "string", "index": "not_analyzed" },
                        "lowercase" :   { "type": "string", "analyzer": "lowercase_analyzer" },
                        "suggest" :     { "type": "string", "analyzer": "suggest_analyzer" }
                    }
                }
            }
        }
    }
}'

Let's add some products

for i in 1 2 3 4 5 6 7 8 9 10 100 101 1000; do
    json=$(printf '{"ProductId": "%s", "ProductName": "%s" }' "$i" "My Product $i")
    curl -X PUT localhost:9200/products/product/$i -d "$json"
done

Queries

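Time to query and understand the different analyzers. Note that the URLs below use the old /_suggest endpoint; as described in the breaking changes above, it resides at /__suggest for elasticsearch 0.90 and later.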

Queries the not analyzed field, returns 10 matches (default), always the full product name:

curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName", "term": "My" }'

Queries the not analyzed field, returns nothing (because lowercase):

curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName", "term": "my" }'

Queries the lowercase field, returns only the occurring word (which is pretty bad for suggests):

curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.lowercase", "term": "m" }'

Queries the suggest field, returns two words (this is the default length of the shingle filter), in this case "my" and "my product"

curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "my" }'

Queries the suggest field, returns ten product names as we started with the second word + another one due to the shingle

curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "product" }'

Queries the suggest field, returns all products with "product 1" in the shingle

curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "product 1" }'

The same query as above, but limits the result set to two

curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "product 1", "size": 2 }'

And last but not least, typo finding. The same query without the similarity parameter returns nothing:

curl -X POST localhost:9200/products/product/_suggest -d '{ "field": "ProductName.suggest", "term": "proudct", "similarity": 0.7 }'

The similarity is a float between 0.0 and 1.0 - if it is not specified, 1.0 is used, which means the term must match exactly. I've found 0.7 ok for cases where two letters were exchanged, but your mileage may vary, as I tested merely on German product names.

With the tests I did, a shingle filter yielded the best results. Please check http://www.elasticsearch.org/guide/reference/index-modules/analysis/shingle-tokenfilter.html for more information about setup, like the default tokenization of two terms.
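To see why a term like "product 1" can match at all, it helps to look at what a shingle filter with two-term shingles produces. The following plain Java sketch imitates that output for a whitespace-separated, lowercased input: every single token plus every pair of neighbouring tokens. It is only an approximation for illustration (ShingleSketch is a made-up name) and does not use the Lucene filter itself.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Approximates two-term shingles: each token, plus each pair of adjacent tokens.
final class ShingleSketch {

    static List<String> shingles(String text) {
        List<String> out = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return out;
        }
        String[] tokens = trimmed.toLowerCase(Locale.ROOT).split("\\s+");
        for (int i = 0; i < tokens.length; i++) {
            out.add(tokens[i]);                           // the single term
            if (i + 1 < tokens.length) {
                out.add(tokens[i] + " " + tokens[i + 1]); // the term plus its successor
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [my, my product, product, product 1, 1]
        System.out.println(shingles("My Product 1"));
    }
}
```

This matches the queries above: the term "my" finds "my" and "my product", while "product 1" only exists because the shingle joins the second and third word.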

Now test with your data and come up with improvements to this configuration. I am happy to hear about your specific configuration for successful suggestion queries.
