LASER embeddings
Out-of-the-box multilingual sentence embeddings.
laserembeddings is a pip-packaged, production-ready port of Facebook Research's LASER (Language-Agnostic SEntence Representations) to compute multilingual sentence embeddings.
- A compatibility issue with subword-nmt 0.3.8 was fixed (#39)
- The behavior of Laser.embed_sentences was unclear/misleading when the number of language codes received in the lang argument did not match the number of sentences to encode: it now raises an error in that case
Context
LASER is a collection of scripts and models created by Facebook Research to compute multilingual sentence embeddings for zero-shot cross-lingual transfer.
What does it mean? LASER is able to transform sentences into language-independent vectors. Similar sentences get mapped to close vectors (in terms of cosine distance), regardless of the input language.
That is great, especially if you don't have training sets for the language(s) you want to process: you can build a classifier on top of LASER embeddings, train it on whatever language(s) you have in your training data, and let it classify texts in any language.
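As a rough sketch of that workflow (assuming scikit-learn is installed alongside this package, and using the embed_sentences API described in the Usage section below; the sentences and labels are made up for the example):

from laserembeddings import Laser
from sklearn.linear_model import LogisticRegression

laser = Laser()

# train a tiny sentiment classifier on English sentences only...
X_train = laser.embed_sentences(
    ['I love this movie', 'This movie is terrible'],
    lang='en')
clf = LogisticRegression().fit(X_train, ['positive', 'negative'])

# ...and apply it to a French sentence, with no French training data at all
X_test = laser.embed_sentences(["J'adore ce film"], lang='fr')
print(clf.predict(X_test))  # should print ['positive']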
The aim of the package is to make LASER as easy to use and easy to deploy as possible: zero-config, production-ready, and just two commands to install.
Getting started
Prerequisites
You'll need Python 3.6+ and PyTorch. Please refer to PyTorch installation instructions.
Installation
pip install laserembeddings
Chinese language
Chinese is not supported by default. If you need to embed Chinese sentences, please install laserembeddings with the "zh" extra. This extra includes jieba.
pip install laserembeddings[zh]
Japanese language
Japanese is not supported by default. If you need to embed Japanese sentences, please install laserembeddings with the "ja" extra. This extra includes mecab-python3 and the ipadic dictionary, which is used in the original LASER project.
If you have issues running laserembeddings on Japanese sentences, please refer to mecab-python3 documentation for troubleshooting.
pip install laserembeddings[ja]
Downloading the pre-trained models
python -m laserembeddings download-models
This will download the models to the default data directory, next to the source code of the package. Use python -m laserembeddings download-models path/to/model/directory to download the models to a specific location.
Usage
from laserembeddings import Laser
laser = Laser()
# if all sentences are in the same language:
embeddings = laser.embed_sentences(
    ['let your neural network be polyglot',
     'use multilingual embeddings!'],
    lang='en')  # lang is only used for tokenization
# embeddings is a N*1024 (N = number of sentences) NumPy array
If the sentences are not in the same language, you can pass a list of language codes:
embeddings = laser.embed_sentences(
    ['I love pasta.',
     "J'adore les pâtes.",
     'Ich liebe Pasta.'],
    lang=['en', 'fr', 'de'])
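Since the three sentences above are translations of each other, their embeddings should be close in terms of cosine similarity. Here is a small check using plain NumPy (this snippet is not part of the laserembeddings API):

import numpy as np

# normalize each embedding to unit length, then compare rows with dot products
normalized = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
similarities = normalized @ normalized.T

# off-diagonal values should be high (close to 1), since the three
# sentences mean the same thing in different languages
print(similarities)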
If you downloaded the models into a specific directory:
from laserembeddings import Laser
path_to_bpe_codes = ...
path_to_bpe_vocab = ...
path_to_encoder = ...
laser = Laser(path_to_bpe_codes, path_to_bpe_vocab, path_to_encoder)
# you can also supply file objects instead of file paths
If you want to pull the models from S3:
from io import BytesIO, StringIO
from laserembeddings import Laser
import boto3
s3 = boto3.resource('s3')
MODELS_BUCKET = ...
f_bpe_codes = StringIO(s3.Object(MODELS_BUCKET, 'path_to_bpe_codes.fcodes').get()['Body'].read().decode('utf-8'))
f_bpe_vocab = StringIO(s3.Object(MODELS_BUCKET, 'path_to_bpe_vocabulary.fvocab').get()['Body'].read().decode('utf-8'))
f_encoder = BytesIO(s3.Object(MODELS_BUCKET, 'path_to_encoder.pt').get()['Body'].read())
laser = Laser(f_bpe_codes, f_bpe_vocab, f_encoder)
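Note that the BPE codes and vocabulary are text files, hence the decoded StringIO objects, while the encoder is a binary PyTorch checkpoint, hence the BytesIO object.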
What are the differences from the original implementation?
Some dependencies of the original project have been replaced with pure-python dependencies, to make this package easy to install and deploy.
Here's a summary of the differences:
| Part of the pipeline | LASER dependency (original project) | laserembeddings dependency (this package) | Reason |
|---|---|---|---|
| Normalization / tokenization | Moses | Sacremoses 0.0.35, which seems to be the closest version to the Moses version used to train the models | Moses is implemented in Perl |
| BPE encoding | fastBPE | subword-nmt | fastBPE cannot be installed via pip and requires compiling C++ code |
| Japanese segmentation (optional) | MeCab / JapaneseTokenizer | mecab-python3 with the ipadic dictionary | mecab-python3 comes with wheels for major platforms (no compilation needed) |
Will I get the exact same embeddings?
For most languages, in most cases, yes.
Some slight (and not so slight) differences can appear for some languages, though.
An exhaustive comparison of the embeddings generated with LASER and laserembeddings is automatically generated and will be updated for each new release.
FAQ
How can I train the encoder?
You can't. LASER models are pre-trained and do not need to be fine-tuned. The embeddings are generic and perform well without fine-tuning. See facebookresearch/LASER#3 (comment).
Credits
Thanks a lot to the creators of LASER for open-sourcing the code and releasing the pre-trained models. All the kudos should go to them!
A big thanks to the creators of Sacremoses and Subword Neural Machine Translation for their great packages.
Testing
The first thing you'll need is Poetry. Please refer to the installation guidelines.
Clone this repository and install the project:
poetry install -E zh -E ja
To run the tests:
poetry run pytest
Testing the similarity between the embeddings computed with LASER and laserembeddings
First, install the project with the extra dependencies (Chinese and Japanese support):
poetry install -E zh -E ja
Then, download the test data:
poetry run python -m laserembeddings download-test-data
Then, run the test with the SIMILARITY_TEST environment variable set to 1:
SIMILARITY_TEST=1 poetry run pytest tests/test_laser.py
Now, have a coffee: the similarity test takes a while to run.
The similarity report will be generated here: tests/report/comparison-with-LASER.md.