  • Stars: 212
  • Rank: 186,122 (top 4%)
  • Language: Python
  • Created: about 9 years ago
  • Updated: over 3 years ago

Repository Details

Making sense embeddings out of word embeddings using graph-based word sense induction

SenseGram

This repository contains an implementation of a method that takes word embeddings, such as word2vec, as input and splits them into the different senses of the input words. For instance, the vector for the word "table" will be split into "table (data)" and "table (furniture)" as shown below.

Our method performs word sense induction and disambiguation based on sense embeddings. The sense inventory is induced from existing word embeddings via clustering of ego-networks of related words. A detailed description of the method is available in the original paper cited below.
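A minimal sketch of this ego-network construction (assuming gensim word vectors and networkx; the function, parameter defaults, and similarity threshold are illustrative, not the repository's actual implementation):

import networkx as nx
from gensim.models import KeyedVectors

def ego_network(wv: KeyedVectors, word: str, n: int = 200, threshold: float = 0.5) -> nx.Graph:
    """Build the ego-network of `word`: its n nearest neighbours,
    linked whenever two neighbours are themselves similar enough."""
    neighbours = [w for w, _ in wv.most_similar(word, topn=n)]
    g = nx.Graph()
    g.add_nodes_from(neighbours)
    for i, u in enumerate(neighbours):
        for v in neighbours[i + 1:]:
            sim = float(wv.similarity(u, v))
            if sim > threshold:  # illustrative similarity threshold
                g.add_edge(u, v, weight=sim)
    return g

Clustering this graph (the repository uses Chinese Whispers) yields one cluster per sense of the word, and averaging the word vectors of each cluster produces the corresponding sense vector.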

The picture below illustrates the main idea of the underlying approach:

[Figure: clustering of the ego-network of a word into senses]

If you use the method, please cite the following paper:

@InProceedings{pelevina-EtAl:2016:RepL4NLP,
  author    = {Pelevina, Maria  and  Arefiev, Nikolay  and  Biemann, Chris  and  Panchenko, Alexander},
  title     = {Making Sense of Word Embeddings},
  booktitle = {Proceedings of the 1st Workshop on Representation Learning for NLP},
  month     = {August},
  year      = {2016},
  address   = {Berlin, Germany},
  publisher = {Association for Computational Linguistics},
  pages     = {174--183},
  url       = {http://anthology.aclweb.org/W16-1620}
}

Use cases

This software can be used for:

  • Generation of word sense embeddings from a raw text corpus

  • Generation of word sense embeddings from pre-trained word embeddings (in the word2vec format)

  • Generation of graphs of semantically related words

  • Generation of graphs of semantically related word senses

  • Generation of a word sense inventory specific to the input text corpus

Installation

This project is implemented in Python 3. It makes use of the word2vec toolkit (via gensim), FAISS for the computation of graphs of related words, and the Chinese Whispers graph clustering algorithm. We suggest using Ubuntu Linux 16.04 on a computational server (ideally with at least 64 GB of RAM and 16 cores) for computing the models, as some stages are computationally intensive. To install all dependencies on Ubuntu Linux 16.04, use the following commands:

git clone --recursive https://github.com/tudarmstadt-lt/sensegram.git
make install-ubuntu-16-04

Optional: set the PYTHONPATH variable to the root directory of this repository (needed only for working with the "egvi" scripts), e.g. export PYTHONPATH="/home/user/sensegram:$PYTHONPATH"

Note that this command will also install an appropriate version of Python 3 via Anaconda. If you already have a properly configured recent version of Python 3 and/or are running a system other than Ubuntu 16.04, use the command make install to install the dependencies. Note, however, that in this case you will also need to manually install the binary dependencies required by FAISS.

Training a new model from a text corpus

The way to train your own sense embeddings is with the train.py script. You will have to provide a raw text corpus as input. If you run train.py with no parameters, it will print usage information:

usage: train.py [-h] [-cbow CBOW] [-size SIZE] [-window WINDOW]
                [-threads THREADS] [-iter ITER] [-min_count MIN_COUNT] [-N N]
                [-n N] [-min_size MIN_SIZE] [-make-pcz]
                train_corpus

Performs training of a word sense embeddings model from a raw text corpus
using the SkipGram approach based on word2vec and graph clustering of ego
networks of semantically related terms.

positional arguments:
  train_corpus          Path to a training corpus in text form (can be .gz).

optional arguments:
  -h, --help            show this help message and exit
  -cbow CBOW            Use the continuous bag of words model (default is 1,
                        use 0 for the skip-gram model).
  -size SIZE            Set size of word vectors (default is 300).
  -window WINDOW        Set max skip length between words (default is 5).
  -threads THREADS      Use <int> threads (default 40).
  -iter ITER            Run <int> training iterations (default 5).
  -min_count MIN_COUNT  This will discard words that appear less than <int>
                        times (default is 10).
  -N N                  Number of nodes in each ego-network (default is 200).
  -n N                  Maximum number of edges a node can have in the network
                        (default is 200).
  -min_size MIN_SIZE    Minimum size of the cluster (default is 5).
  -make-pcz             Perform two extra steps to label the original sense
                        inventory with hypernymy labels and disambiguate the
                        list of related words. The obtained resource is called
                        a proto-conceptualization or PCZ.

The training produces the following output files:

  • model/ + CORPUS_NAME + .word_vectors - word vectors in the word2vec text format
  • model/ + CORPUS_NAME + .sense_vectors - sense vectors in the word2vec text format
  • model/ + CORPUS_NAME + .sense_vectors.inventory.csv - sense probabilities in TSV format

In addition, it produces several intermediary files that can be inspected for error analysis or removed after training:

  • model/ + CORPUS_NAME + .graph - word similarity graph (distributional thesaurus) in TSV format
  • model/ + CORPUS_NAME + .clusters - sense clusters produced by Chinese Whispers in TSV format
  • model/ + CORPUS_NAME + .minsize + MIN_SIZE - clusters that remain after filtering out small clusters, in TSV format

In train.sh we provide an example of how to use the train.py script; an illustrative invocation is also shown below. You can test it using the command make train. More useful commands can be found in the Makefile.
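For instance, assuming a gzipped training corpus at corpus/wikipedia.txt.gz (the path is illustrative), a training run with the skip-gram model on 16 threads could look like this:

python train.py -cbow 0 -size 300 -window 5 -threads 16 -iter 5 -min_count 10 corpus/wikipedia.txt.gz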

Using a pre-trained model

See the QuickStart tutorial on how to perform word sense disambiguation and inspection of a trained model.

You can download pre-trained models for English, German, and Russian. Note that to run the examples from the QuickStart you only need the files with the extensions .word_vectors, .sense_vectors, and .sense_vectors.inventory.csv. The other files are supplementary.
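As a minimal sketch of such an inspection (assuming gensim is installed and a model has been downloaded to model/wiki.word_vectors and model/wiki.sense_vectors; the file names, the word#1 sense labeling, and the disambiguation helper are illustrative, see the QuickStart for the actual interface):

import numpy as np
from gensim.models import KeyedVectors

# Load word and sense vectors (both in the word2vec text format).
wv = KeyedVectors.load_word2vec_format("model/wiki.word_vectors", binary=False)
sv = KeyedVectors.load_word2vec_format("model/wiki.sense_vectors", binary=False)

# Inspect the nearest neighbours of two induced senses of "table".
for sense in ["table#1", "table#2"]:
    if sense in sv:
        print(sense, sv.most_similar(sense, topn=5))

# Simplified context-based disambiguation: pick the sense whose vector is
# most similar (by cosine) to the average vector of the context words.
def disambiguate(word, context, max_senses=10):
    ctx = np.mean([wv[w] for w in context if w in wv], axis=0)
    senses = [f"{word}#{i}" for i in range(max_senses) if f"{word}#{i}" in sv]
    cosine = lambda a, b: np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return max(senses, key=lambda s: cosine(sv[s], ctx))

print(disambiguate("table", ["chair", "furniture", "wooden"]))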

Transforming pre-trained word embeddings to sense embeddings

Instead of learning a model from a text corpus, you can provide a pre-trained word embedding model. To do so, you just need to:

  1. Save the word embeddings file (in word2vec text format) with the extension .word_vectors, e.g. wikipedia.word_vectors.

  2. Run the train.py script, indicating the path to the word embeddings file, e.g.:

python train.py model/wikipedia

Note: do not indicate the .word_vectors extension when launching the train.py script.
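Put together, and assuming a pre-trained embedding file named fasttext-wiki.vec (the file name is illustrative), the two steps amount to:

cp fasttext-wiki.vec model/wikipedia.word_vectors
python train.py model/wikipedia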

More Repositories

  1. kaldi-tuda-de (Shell, 171 stars): Scripts for training general-purpose large vocabulary German acoustic models for ASR with Kaldi.
  2. BlurbGenreCollection-HMC (Python, 83 stars): Hierarchical multi-label text classification of the BlurbGenreCollection using capsule networks.
  3. bert-sense (Python, 61 stars): Source code accompanying the KONVENS 2019 paper "Does BERT Make Any Sense? Interpretable Word Sense Disambiguation with Contextualized Embeddings".
  4. newsleak (Java, 53 stars): Information extraction and interactive visualization of textual datasets for investigative data-driven journalism and eDiscovery.
  5. kaldi-model-server (JavaScript, 37 stars): Simple Kaldi model server for chain (nnet3) models in online recognition mode directly from a local microphone.
  6. path2vec (Jupyter Notebook, 32 stars): Learning to represent shortest paths and other graph-based measures of node similarities with graph embeddings.
  7. targer (JavaScript, 29 stars): A web application for tagging and retrieval of arguments in text.
  8. taxi (Jupyter Notebook, 29 stars): TAXI: a Taxonomy Induction Method based on Lexico-Syntactic Patterns, Substrings and Focused Crawling.
  9. subtitle2go (Python, 27 stars)
  10. bbb-live-subtitles (Python, 26 stars): BBB plugin for automatic subtitles in conference calls.
  11. Taxonomy_Refinement_Embeddings (Python, 26 stars): Taxonomy refinement method to improve domain-specific taxonomy systems.
  12. wsd (JavaScript, 19 stars): A system for unsupervised knowledge-free interpretable word sense disambiguation based on distributional semantics.
  13. MeetingBot (JavaScript, 18 stars): Minute Meeting Bot.
  14. josimtext (Scala, 16 stars): A system for word sense induction and disambiguation based on the JoBimText approach.
  15. ethiopicmodels (Jupyter Notebook, 16 stars): Different semantic models for Amharic.
  16. dats (TypeScript, 15 stars): Discourse Analysis Tool Suite.
  17. kb2vec (Python, 15 stars): Vectorizing knowledge bases for entity linking.
  18. par4Acad (Jupyter Notebook, 14 stars): Paraphrasing for academic texts.
  19. storyfinder (JavaScript, 13 stars): A browser plugin and server backend for personalized knowledge and information management.
  20. hatespeech (Jupyter Notebook, 11 stars): Hate speech.
  21. neural-coref (Python, 10 stars): State-of-the-art neural coreference resolution on German data.
  22. cam (Python, 10 stars): The Comparative Argument Machine.
  23. lt-expertfinder (Java, 10 stars): An evaluation framework for expert finding methods.
  24. microNER (Jupyter Notebook, 9 stars): A micro-service for German named entity recognition.
  25. triframes (Python, 9 stars): Unsupervised semantic frame induction using triclustering.
  26. AmharicHateSpeech (Jupyter Notebook, 8 stars): Amharic hate speech dataset and classification models.
  27. context-eval (Jupyter Notebook, 8 stars): Tools for evaluation of unsupervised word sense disambiguation systems.
  28. chinese-whispers (Java, 8 stars): Implementation of the Chinese Whispers graph clustering algorithm.
  29. comparative (Jupyter Notebook, 8 stars): Comparative sentence classifier.
  30. GermEval2017-Baseline (Java, 8 stars): Baseline classification system for GermEval 2017, aspect-based sentiment analysis.
  31. academic_countdown (HTML, 8 stars)
  32. amharicprocessor (Python, 7 stars): Amharic segmenter and tokenizer.
  33. TextGraphs17-shared-task (Jupyter Notebook, 6 stars)
  34. ASAB (Python, 6 stars): Amharic Sentiment Annotator Bot.
  35. mangosteen (Python, 5 stars): A system for inducing distributional sense-aware semantic classes labeled with hypernyms.
  36. par4sem (Java, 4 stars): Adaptive paraphrasing for semantic writing aid tools.
  37. 158 (Python, 4 stars): WSD for 158 languages.
  38. speech-lex-edit (Python, 4 stars): Speech lexicon editor.
  39. lefex (Java, 4 stars): A tool for extraction of lexical features from text based on UIMA and MapReduce.
  40. anno-plot (TypeScript, 3 stars)
  41. hindi-wordnet-extension (Python, 3 stars): Enriching Hindi WordNet using knowledge graph completion approaches.
  42. thesis-template-uhh-lt-latex (TeX, 3 stars)
  43. pss-lrev (Jupyter Notebook, 3 stars): Multi-modal page stream segmentation with convolutional neural networks.
  44. LexiExp (Java, 3 stars): Free open-source sentiment lexicon expansion script.
  45. newsleak-docker (2 stars): Docker configuration to run new/s/leak.
  46. news-speaker-attribution-2023 (Python, 2 stars): Data for the shared task on German speaker attribution.
  47. multi-summ-german (2 stars): Multi-document summarization for the German language.
  48. NoDCore (Scala, 2 stars)
  49. lt-keyterms (Java, 2 stars): Simple but effective key term extraction for 40+ languages.
  50. NLP_ThumbnailAnnotator (Java, 1 star): M.Sc. project in Language Technology at Uni Hamburg by Florian Schneider.
  51. uhhlt-offenseval2020 (Python, 1 star)
  52. NoDWeb (JavaScript, 1 star)
  53. english-asr-lm-tools (Python, 1 star)
  54. kaldi-asr-english (1 star)
  55. sovereign-perspectiveArg24 (Python, 1 star)
  56. senseasim (R, 1 star)
  57. dimensions-of-similarity (Python, 1 star)
  58. LT-ABSA (Java, 1 star)
  59. autolinks (JavaScript, 1 star): Automatic proactive researching.
  60. HILANO-Flower (Python, 1 star)
  61. par4sim (CSS, 1 star)
  62. tell-me-again (Python, 1 star): A dataset of multiple summaries for many stories.
  63. dblplink (Python, 1 star): Code for the ISWC 2023 demo paper titled "DBLPLink: An Entity Linker for the DBLP Scholarly Knowledge Graph".
  64. vida (1 star): VIDA: The Visual Incel Data Archive. A theory-oriented annotated dataset to enhance hate detection through visual culture.
  65. cam-2.0 (Python, 1 star)