  • Stars: 243
  • Rank: 165,509 (top 4%)
  • Language: Python
  • License: MIT License
  • Created: almost 6 years ago
  • Updated: almost 2 years ago

Repository Details

spacy-wordnet creates annotations that make it easy to use WordNet and WordNet Domains with spaCy, via the NLTK WordNet interface.

spaCy WordNet

spaCy WordNet is a simple custom component for using WordNet, MultiWordNet and WordNet Domains with spaCy.

The component combines the NLTK wordnet interface with WordNet domains to allow users to:

  • Get all synsets for a processed token. For example, getting all the synsets (word senses) of the word bank.
  • Get and filter synsets by domain. For example, getting synonyms of the verb withdraw in the financial domain.

Getting started

The spaCy WordNet component can be easily integrated into spaCy pipelines. You just need the following:

Prerequisites

  • Python 3.X
  • spaCy

You also need to download the following NLTK WordNet data:

python -m nltk.downloader wordnet
python -m nltk.downloader omw
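
If you prefer to fetch the data from inside Python (for example in a setup script), the NLTK downloader can be called directly; this is just an alternative to the two commands above:

import nltk

# Download the WordNet corpus and the Open Multilingual Wordnet data used by spacy-wordnet
nltk.download('wordnet')
nltk.download('omw')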

Install

pip install spacy-wordnet

Supported languages

Almost all languages covered by the Open Multilingual Wordnet are supported.
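
To see exactly which language codes your installed Open Multilingual Wordnet data covers, you can ask NLTK directly; this small check uses only the NLTK WordNet API, not spacy-wordnet itself:

from nltk.corpus import wordnet as wn

# Print the ISO 639 language codes available in the installed wordnet/OMW data
print(sorted(wn.langs()))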

Usage

Once you have chosen the desired language (from the list of supported ones above), you will need to manually download a spaCy model for it. Check the list of available models for each language in the spaCy 2.x or spaCy 3.x documentation.

English example

Download example model:

python -m spacy download en_core_web_sm

Run:

import spacy

from spacy_wordnet.wordnet_annotator import WordnetAnnotator 

# Load a spaCy model
nlp = spacy.load('en_core_web_sm')
# spaCy 3.x
nlp.add_pipe("spacy_wordnet", after='tagger')
# spaCy 2.x
# nlp.add_pipe(WordnetAnnotator(nlp, name="spacy_wordnet"), after='tagger')
token = nlp('prices')[0]

# The wordnet extension links the spaCy token to the NLTK WordNet interface, giving access
# to synsets and lemmas
token._.wordnet.synsets()
token._.wordnet.lemmas()

# And automatically tags with wordnet domains
token._.wordnet.wordnet_domains()

spaCy WordNet lets you find synonyms filtered by a domain of interest, for example economy:

economy_domains = ['finance', 'banking']
enriched_sentence = []
sentence = nlp('I want to withdraw 5,000 euros')

# For each token in the sentence
for token in sentence:
    # We get those synsets within the desired domains
    synsets = token._.wordnet.wordnet_synsets_for_domain(economy_domains)
    if not synsets:
        enriched_sentence.append(token.text)
    else:
        lemmas_for_synset = [lemma for s in synsets for lemma in s.lemma_names()]
        # If we found a synset in the economy domains
        # we get the variants and add them to the enriched sentence
        enriched_sentence.append('({})'.format('|'.join(set(lemmas_for_synset))))

# Let's see our enriched sentence
print(' '.join(enriched_sentence))
# >> I (need|want|require) to (draw|withdraw|draw_off|take_out) 5,000 euros
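
The enrichment loop above can be wrapped into a small reusable helper. The sketch below is only an illustration (the enrich_with_domain_synonyms function and its lemma_lang parameter are ours, not part of the library); it uses only the token._.wordnet API shown above:

def enrich_with_domain_synonyms(doc, domains, lemma_lang='eng'):
    # Replace each token that has synsets in the given domains with the set of its lemma names
    enriched = []
    for token in doc:
        synsets = token._.wordnet.wordnet_synsets_for_domain(domains)
        if not synsets:
            enriched.append(token.text)
        else:
            lemmas = {lemma for s in synsets for lemma in s.lemma_names(lemma_lang)}
            enriched.append('({})'.format('|'.join(lemmas)))
    return ' '.join(enriched)

print(enrich_with_domain_synonyms(nlp('I want to withdraw 5,000 euros'), ['finance', 'banking']))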
    

Portuguese example

Download example model:

python -m spacy download pt_core_news_sm

Run:

import spacy

from spacy_wordnet.wordnet_annotator import WordnetAnnotator 

# Load a spaCy model
nlp = spacy.load('pt_core_news_sm')
# spaCy 3.x
nlp.add_pipe("spacy_wordnet", after='tagger', config={'lang': nlp.lang})
# spaCy 2.x
# nlp.add_pipe(WordnetAnnotator(nlp.lang), after='tagger')
text = "Eu quero retirar 5.000 euros"
economy_domains = ['finance', 'banking']
enriched_sentence = []
sentence = nlp(text)

# For each token in the sentence
for token in sentence:
    # We get those synsets within the desired domains
    synsets = token._.wordnet.wordnet_synsets_for_domain(economy_domains)
    if not synsets:
        enriched_sentence.append(token.text)
    else:
        lemmas_for_synset = [lemma for s in synsets for lemma in s.lemma_names('por')]
        # If we found a synset in the economy domains
        # we get the variants and add them to the enriched sentence
        enriched_sentence.append('({})'.format('|'.join(set(lemmas_for_synset))))

# Let's see our enriched sentence
print(' '.join(enriched_sentence))
# >> Eu (querer|desejar|esperar) retirar 5.000 euros
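
The helper sketched at the end of the English example can be reused here by passing the Open Multilingual Wordnet code for Portuguese (again, the helper is our illustration, not part of the library):

print(enrich_with_domain_synonyms(nlp("Eu quero retirar 5.000 euros"),
                                  ['finance', 'banking'], lemma_lang='por'))
# Expected to produce the same enriched sentence as the loop above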

More Repositories

1. argilla - ✨Argilla: the open-source data curation platform for LLMs (Python, 2,260 stars)
2. biome-text - Custom Natural Language Processing with big and small models 🌲🌱 (Python, 67 stars)
3. get_started_with_deep_learning_for_text_with_allennlp - Getting started with AllenNLP and PyTorch by training a tweet classifier (Python, 67 stars)
4. adept-augmentations - A Python library aimed at dissecting and augmenting NER training data (Python, 47 stars)
5. kcap17-tutorial - Material for the tutorial "Hybrid techniques for knowledge-based NLP: Knowledge graphs meet machine learning and all their friends" at KCAP 2017, Austin (Texas) (Jupyter Notebook, 16 stars)
6. awesome-llm-datasets - 👩🤝🤖 A curated list of datasets for large language models (LLMs), RLHF and related resources (continually updated) (12 stars)
7. argilla-streamlit - 👑 Streamlit for extended UI functionalities for Argilla (Python, 5 stars)
8. pln_con_rubrix (Jupyter Notebook, 4 stars)
9. rubrix-streamlit-example - Streamlit web app for monitoring and collecting data from third-party apps with Rubrix (Python, 4 stars)
10. argilla-plugins - 🔌 Open-source plugins with practical features for Argilla, using listeners (Python, 3 stars)
11. custom_models_allennlp (Python, 3 stars)
12. profner - Repo for our ProfNER contribution (Jupyter Notebook, 2 stars)
13. rdf-processing - Processing RDF triples to build graphs (Scala, 1 star)
14. kg-builder (Scala, 1 star)
15. coset - Code for the Mediaflows Coset competition (Jupyter Notebook, 1 star)
16. selectra - Repo to pre-train a Spanish language model and zero-shot classifier based on the ELECTRA model (Python, 1 star)
17. docker-remote-aws-with-gpu - Configures a docker-machine remote environment with the amazonec2 driver (Shell, 1 star)
18. bds-nlp-2018 - Materials for the Big Data Spain 2018 training "Get started with NLP and AI in Python" (Jupyter Notebook, 1 star)
19. cantemist-ner - Recognai's contribution to the CANTEMIST NER track (Jupyter Notebook, 1 star)