word2gauss

Gaussian word embeddings

Python/Cython implementation of Luke Vilnis and Andrew McCallum's "Word Representations via Gaussian Embedding" (ICLR 2015), which represents each word as a multivariate Gaussian. Scales to (relatively) large corpora using Cython extensions and threading with asynchronous stochastic gradient descent (Adagrad).

Getting started

Installing

  1. Install the dependencies: numpy, scipy, and the packages in requirements.txt. The Travis CI provisioning script (provision.sh) installs these packages and may be useful as a starting point.

  2. Build/install word2gauss: sudo make install

  3. Finally, it's a good idea to run the test suite: make test

NOTE: The performance-sensitive parts of the code have been carefully written in a way that allows gcc to auto-vectorize all the important loops. Accordingly, we recommend compiling with gcc and setting these flags for the build:

export CFLAGS="-ftree-vectorizer-verbose=2 -O3 -ffast-math"
sudo -E bash -c "make install"

If you are using a Mac, code compiled with gcc runs approximately 2.5X faster than code compiled with the default clang compiler. You can force the build to use gcc instead of clang with:

# change these to the location of gcc -- note that /usr/bin/gcc is really
# clang in a default XCode installation
export CC=/usr/local/bin/gcc
export CXX=/usr/local/bin/g++
export CFLAGS="-ftree-vectorizer-verbose=2 -O3 -ffast-math"
sudo -E bash -c "make install"

Code overview

GaussianEmbedding

The GaussianEmbedding class is the main workhorse for most tasks. It stores the model data, handles serialization to and from files, and learns the parameters. To allow embedding of non-word types like hierarchies and entailment relations, GaussianEmbedding has no knowledge of any vocabulary and operates only on unique IDs. Each ID is a uint32 from 0 .. N-1, with -1 signifying an out-of-vocabulary (OOV) token.

Vocabulary

For learning word embeddings, the token <-> id mapping is off-loaded to a Vocabulary class. This class bundles together a string tokenizer, a token <-> id map, and a random token id generator (used for the negative sampling in training, see below). This allows us to translate streams of documents into training examples.

The class needs this interface:

    .word2id: given a token, return the id or raise KeyError if not in the vocab
    .id2word: given a token id, return the token or raise IndexError if invalid
    .tokenize: given a string, tokenize it using the tokenizer and then remove all OOV tokens
    .tokenize_ids: given a string, tokenize and return the token ids
    .random_ids: given an integer, return a numpy array of random token ids
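
As an illustration of the interface above, here is a minimal sketch of a class that satisfies it. The class name SimpleVocabulary, the whitespace tokenizer, and the seed argument are assumptions made for this example and are not part of word2gauss. It also defines __len__, since the training example calls len(vocab).

```python
import numpy as np


class SimpleVocabulary:
    """Hypothetical minimal vocabulary; not the word2gauss implementation."""

    def __init__(self, tokens, tokenizer=None, seed=None):
        # tokens: an iterable of unique strings; position in the list is the id
        self._id2word = list(tokens)
        self._word2id = {tok: i for i, tok in enumerate(self._id2word)}
        # assumption: lowercase + whitespace split is a good enough tokenizer
        self._tokenizer = tokenizer or (lambda text: text.lower().split())
        self._rng = np.random.RandomState(seed)

    def __len__(self):
        return len(self._id2word)

    def word2id(self, token):
        # raises KeyError if the token is not in the vocab
        return self._word2id[token]

    def id2word(self, token_id):
        # reject negative ids explicitly so Python's negative indexing
        # does not silently return a token from the end of the list
        if token_id < 0:
            raise IndexError(token_id)
        return self._id2word[token_id]

    def tokenize(self, text):
        # tokenize, then drop all OOV tokens
        return [t for t in self._tokenizer(text) if t in self._word2id]

    def tokenize_ids(self, text):
        ids = [self._word2id[t] for t in self.tokenize(text)]
        return np.array(ids, dtype=np.uint32)

    def random_ids(self, num):
        # uniform sampling over the id space, used for negative samples
        ids = self._rng.randint(0, len(self._id2word), size=num)
        return ids.astype(np.uint32)
```

For example, SimpleVocabulary(['rock', 'king', 'man']).tokenize('The king of rock') drops the two OOV tokens and returns ['king', 'rock'].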

There is a simple implementation of a vocabulary class (word2gauss.words.Vocabulary) that draws negative samples uniformly at random from the token id space.

Alternatively, you can use https://github.com/seomoz/vocab, which samples based on the token counts, or provide your own implementation.

Learning embeddings

To learn embeddings, you will need a suitable corpus and an implementation of the vocab class.

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

from gzip import GzipFile

from word2gauss import GaussianEmbedding, iter_pairs
from vocab import Vocabulary

# load the vocabulary
vocab = Vocabulary(...)

# create the embedding to train
# use 100 dimensional spherical Gaussian with KL-divergence as energy function
embed = GaussianEmbedding(len(vocab), 100,
    covariance_type='spherical', energy_type='KL')

# open the corpus and train with 8 threads
# the corpus is just an iterator of documents, here a new line separated
# gzip file for example
with GzipFile('location_of_corpus', 'r') as corpus:
    embed.train(iter_pairs(corpus, vocab), n_workers=8)

# save the model for later
embed.save('model_file_location', vocab=vocab.id2word, full=True)

Examining trained models

from word2gauss import GaussianEmbedding
from vocab import Vocabulary

# load in a previously trained model and the vocab
vocab = Vocabulary(...)
embed = GaussianEmbedding.load('model_file_location')

# find nearest neighbors to 'rock'
embed.nearest_neighbors('rock', vocab=vocab)

# find nearest neighbors to 'rock' sorted by covariance
embed.nearest_neighbors('rock', num=100, vocab=vocab, sort_order='sigma')

# solve king + woman - man = ??
embed.nearest_neighbors([['king', 'woman'], ['man']], num=10, vocab=vocab)

Background details

Instead of representing a word as a vector as in word2vec, word2gauss represents each word as a multivariate Gaussian. Assuming some dictionary of known tokens w[i], i = 0 .. N-1, each word is represented as a probability P[i], a K-dimensional Gaussian parameterized by

   P[i] ~ N(x; mu[i], Sigma[i])

Here, mu[i] and Sigma[i] are the mean and covariance matrix for word i. The mean is a vector of length K and, in the most general case, Sigma[i] is a (K, K) matrix. The paper makes one of two approximations to simplify Sigma[i]:

  • 'diagonal', in which case Sigma[i] is a vector of length K
  • 'spherical', in which case Sigma[i] is a single float
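
To make the effect of these approximations concrete, the sketch below counts the number of stored parameters per word for each case. The function name is hypothetical, and the 'full' option is included only for comparison; it is not described above as something the library supports.

```python
def params_per_word(k, covariance_type):
    """Number of floats needed for one word: mean plus covariance."""
    mean_size = k
    if covariance_type == 'spherical':
        cov_size = 1          # a single variance shared by all dimensions
    elif covariance_type == 'diagonal':
        cov_size = k          # one variance per dimension
    elif covariance_type == 'full':
        cov_size = k * k      # full (K, K) matrix, for comparison only
    else:
        raise ValueError('unknown covariance_type: %r' % covariance_type)
    return mean_size + cov_size


for cov in ('spherical', 'diagonal', 'full'):
    print(cov, params_per_word(100, cov))
```

With K = 100, this gives 101 parameters per word for 'spherical', 200 for 'diagonal', and 10100 for a full matrix.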

To learn the probabilities, first define an energy function E(P[i], P[j]) that returns a similarity-like measure of the two probabilities. Both the symmetric Expected Likelihood Inner Product and the asymmetric KL-divergence are implemented.
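
As a rough illustration of the two energy functions, the sketch below evaluates them for diagonal covariances in plain Python. The function names, the use of the logarithm of the expected likelihood inner product, and the sign and argument order of the KL term (here -KL(P[j] || P[i])) are assumptions of this sketch and may differ from what word2gauss does internally.

```python
import math


def log_expected_likelihood(mu_i, var_i, mu_j, var_j):
    """log of the integral of N(x; mu_i, var_i) * N(x; mu_j, var_j) dx.

    For Gaussians this equals log N(0; mu_i - mu_j, var_i + var_j).
    Symmetric in i and j. Inputs are per-dimension lists (diagonal case).
    """
    k = len(mu_i)
    total = -0.5 * k * math.log(2.0 * math.pi)
    for mi, vi, mj, vj in zip(mu_i, var_i, mu_j, var_j):
        v = vi + vj
        total -= 0.5 * (math.log(v) + (mi - mj) ** 2 / v)
    return total


def neg_kl_energy(mu_i, var_i, mu_j, var_j):
    """-KL(N_j || N_i) for diagonal Gaussians. Asymmetric, always <= 0."""
    total = 0.0
    for mi, vi, mj, vj in zip(mu_i, var_i, mu_j, var_j):
        # trace term + Mahalanobis term - 1 - log determinant ratio
        total += vj / vi + (mi - mj) ** 2 / vi - 1.0 - math.log(vj / vi)
    return -0.5 * total


mu_a, var_a = [0.0, 0.0], [1.0, 1.0]
mu_b, var_b = [1.0, 0.0], [2.0, 0.5]
print(neg_kl_energy(mu_a, var_a, mu_b, var_b))  # differs from the line below
print(neg_kl_energy(mu_b, var_b, mu_a, var_a))
```

Swapping the arguments leaves the expected likelihood energy unchanged but changes the KL-based energy, which is why the choice of left and right token matters during training.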

Given a pair of "positive" and "negative" indices, define Delta E = E(P[posi], P[posj]) - E(P[negi], P[negj]). Intuitively the training process optimizes the parameters to make Delta E positive. Formally, use a max-margin loss:

    loss = max(0, Closs - Delta E)

and optimize the parameters to minimize the sum of the loss over the entire training set of positive/negative pairs.
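
The loss for a single positive/negative pair can be sketched as follows. The function and argument names are hypothetical; margin plays the role of Closs, and the default value of 1.0 is an arbitrary choice for this example.

```python
def max_margin_loss(energy_pos, energy_neg, margin=1.0):
    """Hinge loss on the energy gap between a positive and a negative pair."""
    delta_e = energy_pos - energy_neg
    # zero loss once the positive pair beats the negative pair by the margin
    return max(0.0, margin - delta_e)


print(max_margin_loss(2.0, 0.0))  # gap exceeds the margin: no loss
print(max_margin_loss(0.5, 0.0))  # gap is smaller than the margin: positive loss
```

Only pairs whose energy gap is smaller than the margin contribute to the loss, so only those pairs produce gradient updates.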

To generate the training pairs, use co-occurring words as the positive examples and randomly sampled words as the negative examples. Since the energy function is potentially asymmetric, for each co-occurring word pair randomly sample both the left and right tokens for negative examples. In addition, we allow the option to generate several sets of training pairs from each word. In pseudo-code:

for sentence in corpus:
    for i in 0..len(sentence)-1:
        for k in 1..window_size:
            # skip pairs that would run past the end of the sentence
            if i + k >= len(sentence): continue
            for nsample in 1..number_of_samples_per_word:
                positive_pair = (left, right) = (sentence[i], sentence[i + k])
                negative_pairs = [(left, random ID), (random ID, right)]
                update model weights
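
A runnable version of this loop might look like the sketch below. It only generates the pairs and leaves the weight update to the caller. The function name, its arguments, and the uniform negative sampling are assumptions of this example; this is not the library's iter_pairs.

```python
import random


def iter_training_pairs(sentences, vocab_size, window_size=2,
                        samples_per_word=1, rng=None):
    """Yield (positive_pair, negative_pairs) from sentences of token ids."""
    rng = rng or random.Random()
    for sentence in sentences:
        for i in range(len(sentence)):
            for k in range(1, window_size + 1):
                if i + k >= len(sentence):
                    break  # the window runs past the end of the sentence
                left, right = sentence[i], sentence[i + k]
                for _ in range(samples_per_word):
                    # corrupt each side in turn, since the energy function
                    # may be asymmetric
                    negatives = [
                        (left, rng.randrange(vocab_size)),
                        (rng.randrange(vocab_size), right),
                    ]
                    yield (left, right), negatives


for positive, negatives in iter_training_pairs([[0, 1, 2]], vocab_size=10):
    print(positive, negatives)
```

For the single sentence [0, 1, 2] with a window of 2, the positive pairs are (0, 1), (0, 2) and (1, 2), each paired with two randomly corrupted negatives.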
