  • Stars: 177
  • Rank: 215,985 (Top 5%)
  • Language: Python
  • License: MIT License
  • Created: almost 5 years ago
  • Updated: over 1 year ago


Repository Details

Google USE (Universal Sentence Encoder) for spaCy


Spacy - Universal Sentence Encoder

Use Google's Universal Sentence Encoder directly within spaCy. This library lets you embed Docs, Spans and Tokens with the models of the Universal Sentence Encoder family available on TensorFlow Hub.

To use sentence-BERT in spaCy, see https://github.com/MartinoMensio/spacy-sentence-bert

Motivation

There are many reasons not to always use BERT; for example, you may want embeddings tuned specifically for another task (e.g. sentence similarity). See this useful blog article: https://blog.floydhub.com/when-the-best-nlp-model-is-not-the-best-choice/

The Universal Sentence Encoder is trained on tasks that are better suited to identifying sentence similarity (see the Google AI blog post and the paper).

This library uses spaCy's user_hooks to obtain the vectors from an external model, in this case a simple wrapper around the models available on TensorFlow Hub.
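As a rough sketch of the mechanism (this is not the library's exact implementation), a vector user hook can be attached like this; external_embed is a hypothetical stand-in for a call to a TensorFlow Hub model:

import numpy
import spacy

def external_embed(text):
    # hypothetical stand-in for a TensorFlow Hub model returning a fixed-size vector
    return numpy.zeros(512, dtype=numpy.float32)

nlp = spacy.load('en_core_web_sm')
doc = nlp('Hi there, how are you?')

# user hooks let an external model provide the vectors for Doc, Span and Token
doc.user_hooks['vector'] = lambda d: external_embed(d.text)
doc.user_span_hooks['vector'] = lambda span: external_embed(span.text)
doc.user_token_hooks['vector'] = lambda token: external_embed(token.text)

print(doc.vector.shape)  # (512,), now coming from the external model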

Install

You can install this library from:

  • GitHub: pip install git+https://github.com/MartinoMensio/spacy-universal-sentence-encoder.git
  • PyPI: pip install spacy-universal-sentence-encoder

Compatibility:

  • python:
    • 3.6: compatible but not actively tested
    • 3.7/3.8/3.9/3.10: compatible and actively tested
    • 3.11: compatible, but relies on the RC version of tensorflow-text for multilingual models
  • tensorflow>=2.4.0,<3.0.0
  • spacy>=3.0.0,<4.0.0 (the spaCy v3 API changed a lot from v2)

To use the multilingual version of the models, you need to install the extra named multi with the command pip install spacy-universal-sentence-encoder[multi]. This installs the tensorflow-text dependency required to run the multilingual models. Note that tensorflow-text is currently only available as an RC version for Python 3.11.

Alternatively, you can install the following standalone pre-packaged models with pip. Each model can be installed independently:

model name | source | pip package
en_use_md | https://tfhub.dev/google/universal-sentence-encoder | pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.4.6/en_use_md-0.4.6.tar.gz#en_use_md-0.4.6
en_use_lg | https://tfhub.dev/google/universal-sentence-encoder-large | pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.4.6/en_use_lg-0.4.6.tar.gz#en_use_lg-0.4.6
xx_use_md | https://tfhub.dev/google/universal-sentence-encoder-multilingual | pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.4.6/xx_use_md-0.4.6.tar.gz#xx_use_md-0.4.6
xx_use_lg | https://tfhub.dev/google/universal-sentence-encoder-multilingual-large | pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.4.6/xx_use_lg-0.4.6.tar.gz#xx_use_lg-0.4.6

In addition, CMLM models are also available:

model name | source | pip package
en_use_cmlm_md | https://tfhub.dev/google/universal-sentence-encoder-cmlm/en-base | pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.4.6/en_use_cmlm_md-0.4.6.tar.gz#en_use_cmlm_md-0.4.6
en_use_cmlm_lg | https://tfhub.dev/google/universal-sentence-encoder-cmlm/en-large | pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.4.6/en_use_cmlm_lg-0.4.6.tar.gz#en_use_cmlm_lg-0.4.6
xx_use_cmlm | https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-base | pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.4.6/xx_use_cmlm-0.4.6.tar.gz#xx_use_cmlm-0.4.6
xx_use_cmlm_br | https://tfhub.dev/google/universal-sentence-encoder-cmlm/multilingual-base-br | pip install https://github.com/MartinoMensio/spacy-universal-sentence-encoder/releases/download/v0.4.6/xx_use_cmlm_br-0.4.6.tar.gz#xx_use_cmlm_br-0.4.6

Usage

Loading the model

If you installed one of the standalone model packages (see the tables above), you can load the model with the usual spaCy API:

import spacy
nlp = spacy.load('en_use_md')

Otherwise you need to load the model in the following way:

import spacy_universal_sentence_encoder
nlp = spacy_universal_sentence_encoder.load_model('xx_use_lg')

The third option is to add the model to an existing spaCy pipeline:

import spacy
# this is your nlp object that can be any spaCy model
nlp = spacy.load('en_core_web_sm')

# add the pipeline stage (will be mapped to the most adequate model from the table above, en_use_md)
nlp.add_pipe('universal_sentence_encoder')

With all three options, the first time you load a given Universal Sentence Encoder model it will be downloaded from TF Hub (see the section below to use an already downloaded model or to change the location of the model files).

The last option (using nlp.add_pipe) can be customised with the following configurations:

  • use_model_url: use a specific TFHub URL for the model
  • preprocessor_url: for TFHub models that need specific preprocessing with another TFHub model (e.g., see the documentation of the CMLM models)
  • model_name: load a specific model instead of mapping the current (language, size) pair to one of the options in the table above
  • enable_cache: default True; enables an internal cache that avoids embedding the same text (doc/span/token) twice. This makes computation faster (when enough duplicates are embedded) but increases memory usage, because all extracted embeddings are kept in the cache
  • debug: default False; shows debugging information

To use these configurations, pass a dict as the config argument when adding the pipe stage, for example:

nlp.add_pipe('universal_sentence_encoder', config={'enable_cache': False})
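As another example, here is a sketch of pointing the stage at a specific TFHub model via use_model_url (the URL is the en_use_lg source from the table above; model_name works similarly with one of the names from the tables):

import spacy
nlp = spacy.load('en_core_web_sm')
# use a specific TFHub model instead of the default (language, size) mapping
nlp.add_pipe('universal_sentence_encoder', config={
    'use_model_url': 'https://tfhub.dev/google/universal-sentence-encoder-large'
})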

Use the embeddings

After adding the stage to the pipeline, you can use the embeddings through the usual properties and methods of Docs, Spans and Tokens:

# load as before
import spacy
nlp = spacy.load('en_core_web_lg')
nlp.add_pipe('universal_sentence_encoder')

# get two documents
doc_1 = nlp('Hi there, how are you?')
doc_2 = nlp('Hello there, how are you doing today?')
# Inspect the shape of the Doc, Span and Token vectors
print(doc_1.vector.shape) # the full document representation
print(doc_1[3], doc_1[3].vector.shape) # the word "how"
print(doc_1[3:6], doc_1[3:6].vector.shape) # the span "how are you"

# or use the similarity method that is based on the vectors, on Doc, Span or Token
print(doc_1.similarity(doc_2[0:7]))

Common issues

Here you can find the most common issues with possible solutions.

Using a pre-downloaded model

If you have already downloaded a model of the Universal Sentence Encoder family from TensorFlow Hub, you can use it as follows:

  • locate the full path of the folder where you have downloaded and extracted the model. Let's suppose the location is $HOME/tfhub_models
  • rename the folder of the extracted model (the one directly containing the variables folder and the saved_model.pb file) to the SHA1 hash of the TFHub model source URL. The mapping between model and SHA1 value is the following:
    • en_use_md: 063d866c06683311b44b4992fd46003be952409c
    • en_use_lg: c9fe785512ca4a1b179831acb18a0c6bfba603dd
    • xx_use_md: 26c892ffbc8d7b032f5a95f316e2841ed4f1608c
    • xx_use_lg: 97e68b633b7cf018904eb965602b92c9f3ad14c9
  • set the environment variable TFHUB_CACHE_DIR to the location containing the renamed folder, for example $HOME/tfhub_models (set it before trying to download the model: export TFHUB_CACHE_DIR=$HOME/tfhub_models)
  • now load your model: the library should detect that it has already been downloaded
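For example, assuming the en_use_md folder has been renamed to its hash and placed in $HOME/tfhub_models, a sketch of the procedure from Python (the environment variable must be set before the model is loaded):

import os
# point TF Hub to the folder containing the renamed model folder;
# this must happen before the model is loaded
os.environ['TFHUB_CACHE_DIR'] = os.path.expanduser('~/tfhub_models')

import spacy_universal_sentence_encoder
nlp = spacy_universal_sentence_encoder.load_model('en_use_md')
print(nlp('Hi there, how are you?').vector.shape)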

Serialisation

When serialising and deserialising Doc objects, spaCy does not restore the user_hooks after deserialisation, so a plain call to from_bytes produces a Doc that does not use the TensorFlow vectors and whose similarities will not be meaningful. For this reason, the suggested solution is:

  • serialise with bytes = doc.to_bytes() normally
  • deserialise with spacy_universal_sentence_encoder.doc_from_bytes(nlp, bytes) which will also restore the user hooks
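A minimal round trip following these two steps (doc_from_bytes is the helper provided by this library):

import spacy
import spacy_universal_sentence_encoder

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('universal_sentence_encoder')

doc = nlp('Hi there, how are you?')
data = doc.to_bytes()  # serialise as usual

# deserialise with the library helper so that the user hooks are restored
doc_restored = spacy_universal_sentence_encoder.doc_from_bytes(nlp, data)
print(doc_restored.similarity(doc))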

Multiprocessing

This library, relying on TensorFlow, is not fork-safe. This means that if you are using this library inside multiple processes (e.g. with a multiprocessing.pool.Pool), your processes will deadlock. The solutions are:

  • use a thread-based environment (e.g. multiprocessing.pool.ThreadPool)
  • only use this library inside the created processes (first create the processes and then import and use the library)
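A sketch of the thread-based option (the same pattern as a process pool, but threads share the TensorFlow runtime instead of forking it):

from multiprocessing.pool import ThreadPool

import spacy
nlp = spacy.load('en_core_web_sm')
nlp.add_pipe('universal_sentence_encoder')

texts = ['Hi there, how are you?', 'Hello there, how are you doing today?']

def embed(text):
    # every thread shares the same nlp object and TensorFlow model
    return nlp(text).vector

with ThreadPool(2) as pool:
    vectors = pool.map(embed, texts)
print(len(vectors), vectors[0].shape)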

Using nlp.pipe with multiple processes

spaCy does not restore user hooks (UserWarning: [W109]), so if you use nlp.pipe with multiple processes you will not be able to use .vector on Doc, Span and Token objects. A workaround is under development.

Utils

To build and upload

# change version
VERSION=0.4.6
# change version references everywhere
# update locally installed package
pip install -r requirements.txt
# build the standalone models (8)
./build_models.sh
# build the archive at dist/spacy_universal_sentence_encoder-${VERSION}.tar.gz
python setup.py sdist
# upload to pypi
twine upload dist/spacy_universal_sentence_encoder-${VERSION}.tar.gz
# upload language packages to github

More Repositories

1. spacy-dbpedia-spotlight: A spaCy wrapper for DBpedia Spotlight (Python, 105 stars)
2. spacy-sentence-bert: Sentence transformers models for SpaCy (Python, 105 stars)
3. it_vectors_wiki_spacy: Word embeddings for the Italian language, spaCy 2 prebuilt model (Python, 8 stars)
4. polito-aule-bot: Bot to get information about free classrooms at the Politecnico di Torino (Python, 4 stars)
5. semantic_parsing_annotations: Let's see how different semantic parsers annotate some NLU datasets (intents+slots) (Jupyter Notebook, 3 stars)
6. claimreview-data: (2 stars)
7. doc2latex_process_chapters: A stupid tool to process the outputs of docx2latex.com to create clean LaTeX content divided into chapters (Python, 2 stars)
8. simple-text-editor: Electron and monaco-loader powered editor (JavaScript, 2 stars)
9. dirtyjson: A wrapper for https://github.com/RyanMarcus/dirty-json (Python, 2 stars)
10. bert-as-service-docker: (Shell, 2 stars)
11. MisinfoMe_frontend_old: Frontend part of https://github.com/MartinoMensio/MisinfoMe (TypeScript, 1 star)
12. SDP-exams: System and Devices Programming previous exam themes @ Polytechnic University of Turin (C++, 1 star)
13. masters_thesis: My Master's Thesis: Deep Semantic Learning for Conversational Agents (TeX, 1 star)
14. botcycle-server: Server component of BotCycle (JavaScript, 1 star)
15. DP2-Labs: Distributed Programming 2 Laboratories @ Polytechnic University of Turin (Java, 1 star)
16. corenlp-sentiment-tree-parser: Parse the sentimentTree string coming out of CoreNLP (JavaScript, 1 star)
17. misinfome: (Python, 1 star)
18. PrullenbakVaccinNotificatie: (Python, 1 star)
19. botcycle-nlu: This project has been moved into botcycle (Python, 1 star)
20. language_intensity: A playground to measure the intensity of language (Python, 1 star)
21. PDS-Labs: System Programming labs @ Polytechnic University of Turin (C, 1 star)
22. docker-example: (Dockerfile, 1 star)
23. phd_kmi_template: A template for a PhD thesis/upgrade report (TeX, 1 star)
24. CAs-Labs: Computer Architectures labs @ Polytechnic University of Turin (Assembly, 1 star)
25. twitter-connector: (Python, 1 star)
26. OMA-project: Optimization Methods and Algorithms project @ Polytechnic University of Turin (Java, 1 star)
27. claimreview-collector: Dataset processing part of https://github.com/MartinoMensio/MisinfoMe (Jupyter Notebook, 1 star)