  • Stars: 894
  • Rank: 51,071 (Top 2%)
  • Language: Python
  • License: MIT License
  • Created: about 5 years ago
  • Updated: over 1 year ago


Repository Details

Elasticsearch with BERT for advanced document search.

Elasticsearch meets BERT

Below is a job search example:

[Screenshot: an example of bertsearch]

System architecture

[Figure: system architecture diagram]

Requirements

  • Docker
  • Docker Compose >= 1.22.0

Getting Started

1. Download a pretrained BERT model

List of released pretrained BERT models:

  • BERT-Base, Uncased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Uncased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Cased: 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Large, Cased: 24-layer, 1024-hidden, 16-heads, 340M parameters
  • BERT-Base, Multilingual Cased (New): 104 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Multilingual Cased (Old): 102 languages, 12-layer, 768-hidden, 12-heads, 110M parameters
  • BERT-Base, Chinese: Chinese Simplified and Traditional, 12-layer, 768-hidden, 12-heads, 110M parameters

$ wget https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
$ unzip cased_L-12_H-768_A-12.zip

2. Set environment variables

You need to set the path to the pretrained BERT model and the Elasticsearch index name as environment variables:

$ export PATH_MODEL=./cased_L-12_H-768_A-12
$ export INDEX_NAME=jobsearch

3. Run Docker containers

$ docker-compose up

CAUTION: If possible, allocate more than 8 GB of memory in Docker's configuration, because the BERT container needs a lot of memory.

4. Create index

You can use the create index API to add a new index to an Elasticsearch cluster. When creating an index, you can specify the following:

  • Settings for the index
  • Mappings for fields in the index
  • Index aliases

For example, if you want to create a jobsearch index with title, text, and text_vector fields, you can create the index with the following command:

$ python example/create_index.py --index_file=example/index.json --index_name=jobsearch
# index.json
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  },
  "mappings": {
    "dynamic": "true",
    "_source": {
      "enabled": "true"
    },
    "properties": {
      "title": {
        "type": "text"
      },
      "text": {
        "type": "text"
      },
      "text_vector": {
        "type": "dense_vector",
        "dims": 768
      }
    }
  }
}

CAUTION: The dims value of text_vector must match the hidden size of the pretrained BERT model (768 for BERT-Base).
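
For reference, here is a minimal sketch of what an index-creation script could look like, assuming the official elasticsearch Python client with the 7.x-style body argument (the bundled example/create_index.py may be organized differently):

# create_index.py (illustrative sketch, not the bundled script)
import json

from elasticsearch import Elasticsearch

def create_index(index_file, index_name):
    client = Elasticsearch("http://localhost:9200")
    with open(index_file) as f:
        index_spec = json.load(f)  # the settings and mappings shown above
    client.indices.create(index=index_name, body=index_spec)

if __name__ == "__main__":
    create_index("example/index.json", "jobsearch")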

5. Create documents

Once you have created an index, you're ready to index some documents. The point here is to convert each document into a vector using BERT. The resulting vector is stored in the text_vector field. Let's convert your data into JSON documents:

$ python example/create_documents.py --data=example/example.csv --index_name=jobsearch
# example/example.csv
"Title","Description"
"Saleswoman","lorem ipsum"
"Software Developer","lorem ipsum"
"Chief Financial Officer","lorem ipsum"
"General Manager","lorem ipsum"
"Network Administrator","lorem ipsum"

After the script finishes, you get a JSON Lines file like the following:

# documents.jsonl
{"_op_type": "index", "_index": "jobsearch", "text": "lorem ipsum", "title": "Saleswoman", "text_vector": [...]}
{"_op_type": "index", "_index": "jobsearch", "text": "lorem ipsum", "title": "Software Developer", "text_vector": [...]}
{"_op_type": "index", "_index": "jobsearch", "text": "lorem ipsum", "title": "Chief Financial Officer", "text_vector": [...]}
...
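
Under the hood, this step amounts to encoding each CSV row with BERT and writing one bulk action per line. Below is a rough sketch, assuming the BERT container speaks the bert-serving protocol and the bert-serving-client package is installed; the bundled example/create_documents.py may differ:

# create_documents.py (illustrative sketch, not the bundled script)
import csv
import json

from bert_serving.client import BertClient

def create_documents(csv_path, index_name, out_path="documents.jsonl"):
    bc = BertClient(ip="localhost", output_fmt="list")  # talks to the BERT container
    with open(csv_path, newline="") as f, open(out_path, "w") as out:
        for row in csv.DictReader(f):
            doc = {
                "_op_type": "index",
                "_index": index_name,
                "title": row["Title"],
                "text": row["Description"],
                # encode() returns one vector per input string; 768 dims for BERT-Base
                "text_vector": bc.encode([row["Description"]])[0],
            }
            out.write(json.dumps(doc) + "\n")

if __name__ == "__main__":
    create_documents("example/example.csv", "jobsearch")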

6. Index documents

After converting your data into JSON, you can add the documents to the specified index and make them searchable.

$ python example/index_documents.py
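
The indexing step itself can be as small as streaming documents.jsonl into the Elasticsearch bulk helper. A sketch assuming the elasticsearch Python client (the bundled example/index_documents.py may differ):

# index_documents.py (illustrative sketch, not the bundled script)
import json

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

def load_actions(path="documents.jsonl"):
    with open(path) as f:
        for line in f:
            yield json.loads(line)  # each line already carries _op_type and _index

if __name__ == "__main__":
    client = Elasticsearch("http://localhost:9200")
    bulk(client, load_actions())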

7. Open browser

Go to http://127.0.0.1:5000.
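
For illustration only, here is one common way to run a similarity search against a dense_vector field such as text_vector: an Elasticsearch script_score query with cosineSimilarity (available since Elasticsearch 7.3). The query text "python engineer" is made up, and the bundled web app may build its query differently:

# search example (illustrative sketch)
from bert_serving.client import BertClient
from elasticsearch import Elasticsearch

bc = BertClient(ip="localhost", output_fmt="list")
client = Elasticsearch("http://localhost:9200")

query_vector = bc.encode(["python engineer"])[0]

response = client.search(
    index="jobsearch",
    body={
        "size": 10,
        "_source": ["title", "text"],
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # add 1.0 so the score stays non-negative, as Elasticsearch requires
                    "source": "cosineSimilarity(params.query_vector, 'text_vector') + 1.0",
                    "params": {"query_vector": query_vector},
                },
            }
        },
    },
)

for hit in response["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])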

More Repositories

1. BossSensor: Hide the screen when the boss is approaching. (Python, 6,197 stars)
2. awesome-embedding-models: A curated list of awesome embedding models, tutorials, projects, and communities. (Jupyter Notebook, 1,740 stars)
3. anago: Bidirectional LSTM-CRF and ELMo for Named-Entity Recognition, Part-of-Speech Tagging, and so on. (Python, 1,481 stars)
4. HotPepperGourmetDialogue: Restaurant search system through dialogue in Japanese. (Python, 271 stars)
5. HateSonar: Hate speech detection library for Python. (Jupyter Notebook, 187 stars)
6. asari: Japanese sentiment analyzer implemented in Python. (Python, 143 stars)
7. natural-language-preprocessings: Some recipes for natural language pre-processing. (Python, 132 stars)
8. ja.text8: Japanese text8 corpus for word embedding. (Python, 108 stars)
9. keras-crf-layer: Implementation of a CRF layer in Keras. (Python, 74 stars)
10. IOB2Corpus: Japanese IOB2-tagged corpus for Named Entity Recognition. (60 stars)
11. neraug: A text augmentation tool for named entity recognition. (Python, 53 stars)
12. WikipediaQA (HTML, 46 stars)
13. google-vision-sampler: Code examples for the Google Vision API. (46 stars)
14. tensorflow-nlp-examples: TensorFlow examples for Natural Language Processing. (Python, 32 stars)
15. awesome-text-classification: Text classification meets word embeddings. (Python, 30 stars)
16. google-natural-language-sampler: Code examples for the Google Natural Language API. (13 stars)
17. sentiment-analysis-toolbox: Sentiment analysis toolbox for all NLPers. (Jupyter Notebook, 11 stars)
18. wiki-article-dataset: Wikipedia article dataset. (Jupyter Notebook, 11 stars)
19. kintone-handson (Python, 10 stars)
20. japanese-news-crawler: A complete automated Japanese news crawler built on top of the Scrapy framework. (Python, 8 stars)
21. protext: Python library for processing Japanese text. (Python, 8 stars)
22. ChatDeTornado (CSS, 5 stars)
23. TatsujinDaifugo: A service for getting closer to becoming a master programmer by building a client for Computer Daihinmin. (JavaScript, 5 stars)
24. PyFaceRecognizer (Python, 4 stars)
25. uecda-pyclient: Standard UECda client written in Python. (Python, 3 stars)
26. CourseraMachineLearning (Python, 3 stars)
27. Internship (Python, 2 stars)
28. spacy-hearst: Hearst patterns for finding hyponyms, written in Python and spaCy. (Python, 1 star)
29. Hironsan (1 star)