  • Stars: 1,232
  • Rank: 36,639 (top 0.8%)
  • Language: Python
  • License: Apache License 2.0
  • Created: over 9 years ago
  • Updated: about 2 years ago


Repository Details

Toy Python implementation of http://www-nlp.stanford.edu/projects/glove/

glove-python

A toy Python implementation of GloVe.

GloVe produces dense vector embeddings of words, where words that frequently occur together are close in the resulting vector space.

While this produces embeddings that are similar to those of word2vec (which has a great Python implementation in gensim), the method is different: GloVe produces embeddings by factorizing the logarithm of the corpus word co-occurrence matrix.
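
For reference, the weighted least-squares objective minimized by GloVe (this is from the original GloVe paper, not something specific to this implementation) is

J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

where X_{ij} is the co-occurrence count of words i and j, w_i and \tilde{w}_j are the word and context vectors, b_i and \tilde{b}_j are bias terms, and f is a weighting function that caps the influence of very frequent pairs.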

The code uses asynchronous stochastic gradient descent and is implemented in Cython. Most likely, it contains a tremendous number of bugs.

Installation

Install from PyPI using pip: pip install glove_python.

Note for OSX users: due to its use of OpenMP, glove-python does not compile under Clang. To install it, you will need a reasonably recent version of gcc (from Homebrew for instance). This should be picked up by setup.py; if it is not, please open an issue.

Building with the default Python distribution included in OSX is also not supported; please try the version from Homebrew or Anaconda.

Usage

Producing the embeddings is a two-step process: creating a co-occurrence matrix from the corpus, and then using it to produce the embeddings. The Corpus class helps in constructing a corpus from an iterable of tokens; the Glove class trains the embeddings (with a sklearn-esque API), as in the sketch below.
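
A minimal sketch of the two-step flow, assuming the Corpus/Glove API described above; the exact parameter names (window, no_components, no_threads) and default values shown here are illustrative and may differ between versions:

from glove import Corpus, Glove

# Any iterable of token lists will do; each inner list is one sentence.
sentences = [
    ['hello', 'world'],
    ['glove', 'produces', 'dense', 'vector', 'embeddings'],
    ['hello', 'glove'],
]

# Step 1: build the word co-occurrence matrix from the corpus.
corpus = Corpus()
corpus.fit(sentences, window=10)

# Step 2: factorize the log co-occurrence matrix to train the embeddings.
glove = Glove(no_components=100, learning_rate=0.05)
glove.fit(corpus.matrix, epochs=10, no_threads=4, verbose=True)

# Attach the dictionary so that words can be queried by string.
glove.add_dictionary(corpus.dictionary)

print(glove.most_similar('hello'))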

There is also support for rudimentary paragraph vectors. A paragraph vector (in this case) is an embedding of a paragraph (a multi-word piece of text) in the word vector space, such that the paragraph representation is close to the words it contains, adjusted for the frequency of words in the corpus (in a manner similar to tf-idf weighting). These can be obtained after having trained word embeddings by calling the transform_paragraph method on the trained model, as in the sketch below.
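
Continuing from the sketch above, a hedged example of embedding a new piece of text; the keyword arguments shown for transform_paragraph are assumptions about the signature and may vary between versions:

# transform_paragraph embeds a tokenized piece of text into the trained
# word vector space. The epochs and ignore_missing keyword arguments are
# assumptions about the signature, not confirmed from the source.
paragraph = ['the', 'history', 'of', 'physics', 'and', 'chemistry']
paragraph_vector = glove.transform_paragraph(paragraph, epochs=50, ignore_missing=True)

print(paragraph_vector.shape)  # a single vector in the word embedding space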

Examples

examples/example.py has some example code for running simple training scripts: ipython -i -- examples/example.py -c my_corpus.txt -t 10 should process your corpus, run 10 training epochs of GloVe, and drop you into an IPython shell where glove.most_similar('physics') should produce a list of similar words.

If you want to process a Wikipedia corpus, you can pass a dump file into the example.py script using the -w flag. Running make all-wiki should download a small Wikipedia dump file, process it, and train the embeddings. Building the co-occurrence matrix will take some time; training the vectors can be sped up by increasing the training parallelism to match the number of physical CPU cores available.

Running this on my machine yields roughly the following results:

In [1]: glove.most_similar('physics')
Out[1]:
[('biology', 0.89425889335342257),
 ('chemistry', 0.88913708236100086),
 ('quantum', 0.88859617025616333),
 ('mechanics', 0.88821824562025431)]

In [4]: glove.most_similar('north')
Out[4]:
[('west', 0.99047203572917908),
 ('south', 0.98655786905501008),
 ('east', 0.97914140138065575),
 ('coast', 0.97680427897282185)]

In [6]: glove.most_similar('queen')
Out[6]:
[('anne', 0.88284931171714842),
 ('mary', 0.87615260138308615),
 ('elizabeth', 0.87362497374226267),
 ('prince', 0.87011034923161801)]

In [19]: glove.most_similar('car')
Out[19]:
[('race', 0.89549347066796814),
 ('driver', 0.89350343749207217),
 ('cars', 0.83601334715106568),
 ('racing', 0.83157724991920212)]

Development

Pull requests are welcome.

When making changes to the .pyx extension files, you'll need to run python setup.py cythonize to produce the extension .c and .cpp files before running pip install -e .

More Repositories

1. spotlight (Python, 2,857 stars): Deep recommender models using PyTorch.
2. rustlearn (Rust, 598 stars): Machine learning crate for Rust.
3. triplet_recommendations_keras (Jupyter Notebook, 417 stars): An example of doing MovieLens recommendations using triplet loss in Keras.
4. wyrm (Rust, 169 stars): Autodifferentiation package in Rust.
5. sbr-go (Go, 169 stars): Recommender systems for Go.
6. sbr-rs (Rust, 115 stars): Deep recommender systems for Rust.
7. netrex (Python, 57 stars): Neural network recommendation models in PyTorch.
8. recommender_datasets (Python, 38 stars): A common format and repository for various recommender datasets.
9. mixture (TeX, 26 stars)
10. python-rustlearn (Python, 26 stars): Calling rustlearn from Python.
11. explicit-vs-implicit (Jupyter Notebook, 26 stars): An experiment on explicit vs implicit feedback recommenders.
12. binge (TeX, 21 stars): Recommendation models that use binary rather than floating point operations at prediction time.
13. fizzbuzz (Rust, 20 stars): A FizzBuzz solver using a Rust autodifferentiation library.
14. dictionarylearning (Java, 13 stars): Online learning of sparse dictionaries.
15. dynarray (Python, 13 stars): Dynamic NumPy arrays.
16. hugo-blog (Rust, 8 stars)
17. BayesianTestJS (JavaScript, 6 stars): Rudimentary Bayesian Beta-Bernoulli A/B testing inference and visualization code.
18. BetaJS (JavaScript, 3 stars): Beta distribution probability density function (PDF) for JavaScript.
19. sbr-sys (Rust, 2 stars)
20. spotlight-recipe (Shell, 2 stars): Conda recipes for Spotlight.
21. blas_cython (Python, 1 star)
22. lightfm_datasets (Python, 1 star): Datasets for the LightFM package.
23. sketchy (Python, 1 star)
24. wheedle (Rust, 1 star)
25. fictional-octo-chainsaw (1 star)
26. ndarray-npy (Rust, 1 star): Simple serialization of ndarray's arrays in Rust into NumPy's .npy format.