• This repository has been archived on 24/Feb/2021

Additional opennlp mapping type for elasticsearch in order to perform named entity recognition

Elasticsearch OpenNLP Plugin

DO NOT USE THIS ANYMORE, IT IS DEPRECATED AND A BAD DESIGN IDEA

  • This PoC was built against a really old 0.90 version of Elasticsearch; porting it to a newer version would require a significant amount of work.
  • NLP enrichment in general is clearly a preprocessing step that should not be done in Elasticsearch itself. First, the NLP model needs to be loaded on all nodes, which forces you to dedicate a significant amount of heap to NLP instead of Elasticsearch and destabilizes Elasticsearch.
  • Upgrading your model requires a restart of all of the Elasticsearch nodes, resulting in unwanted downtime.
  • Workaround 1: You could have your own service in front of Elasticsearch that is doing NLP enrichments before sending the document to Elasticsearch. This one is decoupled and can be updated anytime, and even scaled up and down independently.
  • Workaround 2: Check out the work which is currently (early 2016) being done in the ingest branch in Elasticsearch. It is a mechanism that allows you to change a document before indexing, and it is where it would make sense to port this NLP plugin to in the future. Also check out the GitHub issues around this topic.
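
As a rough illustration of Workaround 1, the sketch below enriches a document before it would be sent to Elasticsearch. It is not part of this plugin: the `enrich` function, the `content_names` field name, and the regex-based `find_names` extractor are all hypothetical, and the regex only stands in for a real NER model such as an OpenNLP name finder. The HTTP call to Elasticsearch is left out on purpose.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stand-in for a real NER model: it treats any run of two or
# more capitalised words as a "name". A real service would call OpenNLP here.
NAME_PATTERN = re.compile(r"\b[A-Z][a-z]+(?: [A-Z][a-z]+)+\b")


def find_names(text: str) -> List[str]:
    """Return all capitalised word runs found in text."""
    return NAME_PATTERN.findall(text)


def enrich(doc: Dict[str, object], field: str,
           extractors: Dict[str, Callable[[str], List[str]]]) -> Dict[str, object]:
    """Return a copy of doc with one extra list field per extractor.

    The original document is left untouched, so the enrichment step can be
    retried or swapped out without side effects.
    """
    enriched = dict(doc)
    text = str(doc.get(field, ""))
    for entity_type, extract in extractors.items():
        enriched[f"{field}_{entity_type}"] = extract(text)
    return enriched


doc = {
    "title": "Some title",
    "content": "Kobe Bryant is one of the best basketball players of all times. "
               "Not even Michael Jordan has ever scored 81 points in one game.",
}

# The enriched document is what the service would index into Elasticsearch.
enriched = enrich(doc, "content", {"names": find_names})
print(enriched["content_names"])  # ['Kobe Bryant', 'Michael Jordan']
```

Because the extractors are passed in as plain functions, a new model can be deployed by restarting only this service, which is the decoupling the workaround describes.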

If you are searching for an update on this, you might want to check out the elasticsearch ingest opennlp processor for Elasticsearch 5.0 and above

This plugin uses the opennlp project to extract named entities from an indexed field. This means, when a certain field of a document is indexed, you can extract entities like persons, dates and locations from it automatically and store them in additional fields.

Add the configuration

opennlp.models.name.file: /path/to/elasticsearch-0.20.5/models/en-ner-person.bin
opennlp.models.date.file: /path/to/elasticsearch-0.20.5/models/en-ner-date.bin
opennlp.models.location.file: /path/to/elasticsearch-0.20.5/models/en-ner-location.bin

Add a mapping

curl -X PUT localhost:9200/articles
curl -v http://localhost:9200/articles/article/_mapping -d '
{ "article" : { "properties" : { "content" : { "type" : "opennlp" } } } }'

Index a document

curl -X PUT http://localhost:9200/articles/article/1 -d '
{ "title" : "Some title" ,
"content" : "Kobe Bryant is one of the best basketball players of all times. Not even Michael Jordan has ever scored 81 points in one game." }'

Query for a person's name

curl -X POST http://localhost:9200/articles/article/_search -d '
{ "query" : { "term" : { "content.name" : "kobe" } } }'

Query for another part of the article and you will not find it

curl -X POST http://localhost:9200/articles/article/_search -d '
{ "query" : { "term" : { "content.name" : "basketball" } } }'

Querying also works for locations or dates

curl -X PUT http://localhost:9200/articles/article/2 -d '{ "content" : "My next travel destination is going to be in Amsterdam. I will be going to Schiphol Airport next Sunday." }'

curl -X POST http://localhost:9200/articles/article/_search -d '
{ "query" : { "text" : { "content.location" : "schiphol airport" } } }'

curl -X POST http://localhost:9200/articles/article/_search -d '
{ "query" : { "term" : { "content.location" : "amsterdam" } } }'

curl -X POST http://localhost:9200/articles/article/_search -d '
{ "query" : { "term" : { "content.date" : "sunday" } } }'

Faceting is supported as well

curl -X PUT localhost:9200/articles
curl -v http://localhost:9200/articles/article/_mapping -d '{ "article" : { "properties" : { "content" : { "type" : "opennlp", "location_analyzer" : "keyword" } } } }'
curl -X PUT http://localhost:9200/articles/article/2 -d '{ "content" : "My next travel destination is going to be in Amsterdam. I will be going to Schiphol Airport next Sunday." }'
curl -X POST http://localhost:9200/articles/article/_search -d '{ "query" : { "match_all" : {} }, "facets" : { "location" : { "terms" : { "field" : "content.location" }} }, "size" : 0 }'

Downloading the models

In case you want to run the tests or use the plugin in elasticsearch, you need to download the models from sourceforge.

wget http://opennlp.sourceforge.net/models-1.5/en-ner-person.bin
wget http://opennlp.sourceforge.net/models-1.5/en-ner-location.bin
wget http://opennlp.sourceforge.net/models-1.5/en-ner-date.bin

Copy these models somewhere in your filesystem.

Installation

Package the plugin by calling mvn package -DskipTests=true and then install the plugin via

bin/plugin -install opennlp -url file:///path/to/elasticsearch-opennlp-plugin/target/releases/elasticsearch-plugin-opennlp-0.1-SNAPSHOT.zip

Running the tests

In case you want to run the tests, copy the above downloaded models to src/test/resources/models and run mvn clean package

Mapping configuration

If you want to enable any field for NLP parsing, you need to set it via mapping, similar to the elasticsearch attachments mapper plugin.

{
  "article" : {
    "properties" : {
      "content" : { "type" : "opennlp" }
    }
  }
}

Using different analyzers per field

You can also use different analyzers per field if you want. For example, it might not make sense to use the default analyzer for dates.

{
  "article" : {
    "properties" : {
      "content" : {
        "type" : "opennlp",
        "person_analyzer"   : "your_own_analyzer",
        "date_analyzer"     : "your_next_analyzer",
        "location_analyzer" : "your_other_analyzer"
      }
    }
  }
}

Problems & considerations

  • The whole NLP process is pretty RAM costly, consider this when starting elasticsearch
  • Bad architecture, the OpenNlpService uses final variables instead of being built correctly. I am not too proud of it, but it works and I hacked it up in a few days of prototyping.

Future directions

Supporting different taggers

My first implementation was using a POS tagger, but this only yielded some grammatical content, so a tagger should be more capable. But perhaps other people could make use of that in different use cases.

Supporting more languages

Currently only English is supported.

Suggestions

Another interesting thing would be to support these fields with suggestions, so you could have a "name" input field in your application which suggests only the names found in a field. For example, you have a set of news articles in which you search names only (very useful if the person you are searching for is surnamed "Good").

If you have an article which contains "Kobe Bryant" and "Michael Jordan", you would want to suggest each of them properly. This currently does not work, because the extracted names are simply appended to one string in the field.
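
To make the limitation concrete, here is a small sketch comparing the two storage shapes. It does not reproduce the plugin's code: the `suggest` helper is hypothetical, and keeping each name as its own list entry is only an assumed alternative to the appended string described above.

```python
from typing import List


def suggest(names: List[str], prefix: str) -> List[str]:
    """Return every stored entry containing a word that starts with prefix."""
    p = prefix.lower()
    return [n for n in names
            if any(word.lower().startswith(p) for word in n.split())]


# Storage as described above: all extracted names appended to one string.
appended = ["Kobe Bryant Michael Jordan"]
# Assumed alternative: every extracted name kept as a separate value.
separate = ["Kobe Bryant", "Michael Jordan"]

print(suggest(appended, "mi"))  # ['Kobe Bryant Michael Jordan']
print(suggest(separate, "mi"))  # ['Michael Jordan']
```

With the appended string, a suggestion can only return the whole blob, so the boundary between the two names is lost. With separate values, each name can be suggested on its own.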

Credits

Some code has been copied from the Taming Text book and its sources. In case you want an engineering driven introduction into this topic, I highly recommend this book.

License

The plugin is licensed under Apache 2.0 License.
