  • Stars: 347
  • Rank: 122,141 (Top 3%)
  • Language: C
  • License: Apache License 2.0
  • Created: about 10 years ago
  • Updated: 4 months ago

Repository Details

Node.js interface to the Google word2vec tool.

node-word2vec

What is it?

This is a Node.js interface to the word2vec tool developed at Google Research for "efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words", which can be used in a variety of NLP tasks. For further information about the word2vec project, consult https://code.google.com/p/word2vec/.

Installation

Currently, node-word2vec is ONLY supported on Unix operating systems.

Install it via npm:

npm install word2vec

To use it inside Node.js, require the module as follows:

var w2v = require( 'word2vec' );

Usage

API

.word2phrase( input, output, params, callback )

For applications where it is important that certain pairs of words are treated as a single term (e.g. "Barack Obama" or "New York" should be treated as one word), the text corpora used for training should be pre-processed via the word2phrase function. Words which frequently occur next to each other are concatenated with an underscore, e.g. the words "New" and "York", when they appear next to each other often enough, are transformed into the single token "New_York".

Internally, this function calls the C command line application from the Google word2vec project. This allows it to make use of multi-threading and preserves the efficiency of the original C code. It processes the text in the input file and writes the result to a file with the name given by output.

The params parameter expects a JS object optionally containing some of the following keys and associated values. If they are not supplied, the default values are used.

minCount: discard words appearing less than minCount times (default: 5)
threshold: determines the number of phrases; a higher value means fewer phrases (default: 100)
debug: sets debug mode (default: 2)
silent: sets whether any output should be printed to the console (default: false)

After successful execution, the supplied callback function is invoked. It receives the exit code as its first parameter.
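
A minimal invocation might look as follows; the file names corpus.txt and phrases.txt are placeholders for your own input corpus and output path, and any omitted parameters fall back to their defaults:

// 'corpus.txt' and 'phrases.txt' are placeholder paths
w2v.word2phrase( 'corpus.txt', 'phrases.txt', {
    minCount: 5,
    threshold: 100,
    silent: false
}, function( code ) {
    console.log( 'word2phrase exited with code: ' + code );
});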

.word2vec( input, output, params, callback )

This function calls Google's word2vec command line application and finds vector representations for the words in the input training corpus, writing the results to the output file. The output can then be loaded into node via the loadModel function, which exposes several methods to interact with the learned vector representations of the words.

The params parameter expects a JS object optionally containing some of the following keys and associated values. For those missing, the default values are used:

size: sets the size of word vectors (default: 100)
window: sets the maximal skip length between words (default: 5)
sample: sets the threshold for the occurrence of words; words that appear with higher frequency in the training data are randomly down-sampled; a useful range is (0, 1e-5) (default: 1e-3)
hs: 1 = use Hierarchical Softmax (default: 0)
negative: number of negative examples; common values are 3 to 10, 0 = not used (default: 5)
threads: number of threads to use (default: 12)
iter: number of training iterations (default: 5)
minCount: discard words that appear less than minCount times (default: 5)
alpha: sets the starting learning rate (default: 0.025 for skip-gram and 0.05 for CBOW)
classes: output word classes rather than word vectors (default: 0, i.e. vectors are written)
debug: sets debug mode (default: 2)
binary: save the resulting vectors in binary mode (default: 0, i.e. off)
saveVocab: the vocabulary will be saved to the file given by saveVocab
readVocab: the vocabulary will be read from the file given by readVocab and not constructed from the training data
cbow: use the continuous bag-of-words model (default: 1; use 0 for the skip-gram model)
silent: sets whether any output should be printed to the console (default: false)

After successful execution, the supplied callback function is invoked. It receives the exit code as its first parameter.
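
For instance, a phrase file produced by word2phrase could be turned into vectors as follows; the file names are again placeholders, and any omitted parameters use their defaults:

// 'phrases.txt' and 'vectors.txt' are placeholder paths
w2v.word2vec( 'phrases.txt', 'vectors.txt', {
    size: 200,
    cbow: 1,
    silent: false
}, function( code ) {
    // exit code 0 indicates that training finished successfully
    console.log( 'word2vec exited with code: ' + code );
});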

.loadModel( file, callback )

This is the main function of the package. It loads a saved model file containing vector representations of words into memory. Such a file can be created with the word2vec function. After the file has been successfully loaded, the supplied callback function is fired, which, following Node.js conventions, has two parameters: err and model. If everything runs smoothly and no error occurred, the first argument will be null. The model parameter is a model object holding all data and exposing the properties and methods explained in the Model Object section.

Example:

w2v.loadModel( './vectors.txt', function( error, model ) {
    console.log( model );
});

Sample Output:

{
    getVectors: [Function],
    distance: [Function: distance],
    analogy: [Function: analogy],
    words: '98331',
    size: '200'
}

Model Object

Properties

.words

Number of unique words in the training corpus.

.size

Length of the learned word vectors.
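
For example, with the model loaded above, both properties can be inspected directly:

console.log( model.words );    // e.g. '98331'
console.log( model.size );     // e.g. '200'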

Methods

.similarity( word1, word2 )

Calculates the word similarity between word1 and word2.

Example:

model.similarity( 'ham', 'cheese' );

Sample Output:

0.4907762118841032

.mostSimilar( phrase[, number] )

Calculates the cosine distance between the supplied phrase (a string which is internally converted to an array of words forming a phrase vector) and the other word vectors in the vocabulary. It returns the number words with the highest similarity to the supplied phrase. If number is not supplied, the 40 highest-scoring words are returned by default. If none of the words in the phrase appears in the vocabulary, the function returns null. In all other cases, unknown words are dropped from the computation of the cosine distance.

Example:

model.mostSimilar( 'switzerland', 20 );

Sample Output:

[
    { word: 'chur', dist: 0.6070252929307018 },
    { word: 'ticino', dist: 0.6049085549621765 },
    { word: 'bern', dist: 0.6001648890419077 },
    { word: 'cantons', dist: 0.5822226582323267 },
    { word: 'z_rich', dist: 0.5671853621346818 },
    { word: 'iceland_norway', dist: 0.5651901750812693 },
    { word: 'aargau', dist: 0.5590524831511438 },
    { word: 'aarau', dist: 0.555220055372284 },
    { word: 'zurich', dist: 0.5401119092258485 },
    { word: 'berne', dist: 0.5391358099043649 },
    { word: 'zug', dist: 0.5375590160292268 },
    { word: 'swiss_confederation', dist: 0.5365824598661265 },
    { word: 'germany', dist: 0.5337325187293028 },
    { word: 'italy', dist: 0.5309218588704736 },
    { word: 'alsace_lorraine', dist: 0.5270204106304165 },
    { word: 'belgium_denmark', dist: 0.5247942780963807 },
    { word: 'sweden_finland', dist: 0.5241634037188426 },
    { word: 'canton', dist: 0.5212495170066538 },
    { word: 'anterselva', dist: 0.5186651140386938 },
    { word: 'belgium', dist: 0.5150383129735169 }
]

.analogy( word, pair[, number] )

For a pair of words in a relationship such as man and king, this function tries to find the term which stands in an analogous relationship to the supplied word. If number is not supplied, by default the 40 highest-scoring results are returned.

Example:

model.analogy( 'woman', [ 'man', 'king' ], 10 );

Sample Output:

[
    { word: 'queen', dist: 0.5607083309028658 },
    { word: 'queen_consort', dist: 0.510974781496456 },
    { word: 'crowned_king', dist: 0.5060923120115347 },
    { word: 'isabella', dist: 0.49319425034513376 },
    { word: 'matilda', dist: 0.4931204901924969 },
    { word: 'dagmar', dist: 0.4910608716969606 },
    { word: 'sibylla', dist: 0.4832698899279795 },
    { word: 'died_childless', dist: 0.47957251302898396 },
    { word: 'charles_viii', dist: 0.4775804990655765 },
    { word: 'melisende', dist: 0.47663194967001704 }
]

.getVector( word )

Returns the learned vector representation for the input word. If word does not exist in the vocabulary, the function returns null.

Example:

model.getVector( 'king' );

Sample Output:

{
    word: 'king',
    values: [
        0.006371254151248689,
        -0.04533821363410406,
        0.1589142808632736,
        ...
        0.042080221123209825,
        -0.038347102017109225
    ]
}

.getVectors( [words] )

Returns the learned vector representations for the supplied words. If words is undefined, i.e. the function is invoked without any arguments, it returns the vectors for all learned words. The returned value is an array of objects which are instances of the WordVector class.

Example:

model.getVectors( [ 'king', 'queen', 'boy', 'girl' ] );

Sample Output:

[
    {
        word: 'king',
        values: [
            0.006371254151248689,
            -0.04533821363410406,
            0.1589142808632736,
            ...
            0.042080221123209825,
            -0.038347102017109225
        ]
    },
    {
        word: 'queen',
        values: [
            0.014399041122817985,
            -0.000026896638109750347,
            0.20398248693190596,
            ...
            -0.05329081648586445,
            -0.012556868376422963
        ]
    },
    {
        word: 'girl',
        values: [
            -0.1247347144692245,
            0.03834108759049417,
            -0.022911846734360187,
            ...
            -0.0798994867922872,
            -0.11387393949666696
        ]
    },
    {
        word: 'boy',
        values: [
            -0.05436531234037158,
            0.008874993957578164,
            -0.06711992414442335,
            ...
            0.05673998568026764,
            -0.04885347925837509
        ]
    }
]

.getNearestWord( vec )

Returns the word whose vector representation is closest to the input vec. The function expects a word vector, either an instance of the WordVector constructor or an array of Number values of length size, and returns the word in the vocabulary whose vector is most similar to the supplied vector.

Example:

model.getNearestWord( model.getVector('empire') );

Sample Output:

{ word: 'empire', dist: 1.0000000000000002 }

.getNearestWords( vec[, number] )

Returns the words whose vector representations are closest to the input vec. The first parameter expects a word vector, either an instance of the WordVector constructor or an array of Number values of length size. The second parameter, number, is optional and specifies how many words are returned. If it is not supplied, a default value of 10 is used.

Example:

model.getNearestWords( model.getVector( 'man' ), 3 )

Sample Output:

[
    { word: 'man', dist: 1.0000000000000002 },
    { word: 'woman', dist: 0.5731114915085445 },
    { word: 'boy', dist: 0.49110060323870924 }
]

WordVector

Properties

.word

The word in the vocabulary.

.values

The learned vector representation for the word, an array of length size.

Methods

.add( wordVector )

Adds the vector of the input wordVector to the vector .values.

.subtract( wordVector )

Subtracts the vector of the input wordVector from the vector .values.
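
As a sketch of how these methods combine with the model methods above, the king/man/woman analogy can be built by hand; this assumes, as the descriptions suggest, that .add() and .subtract() update the vector's .values in place:

// assumes in-place updates of vec.values by .add() and .subtract()
var vec = model.getVector( 'king' );
vec.subtract( model.getVector( 'man' ) );   // remove the 'man' component
vec.add( model.getVector( 'woman' ) );      // add the 'woman' component
model.getNearestWords( vec, 5 );            // 'queen' is expected to rank near the top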

Unit Tests

Run tests via the command npm test

Build from Source

Clone the git repository with the command

$ git clone https://github.com/Planeshifter/node-word2vec.git

Change into the project directory and compile the C source files via

$ cd node-word2vec
$ make --directory=src

License

Apache v2.

More Repositories

1. text-miner: text mining utilities for Node.js (JavaScript, 142 stars)
2. node-wordnet-magic: tools for working with Princeton's lexical database WordNet (JavaScript, 74 stars)
3. node-Rstats: [UNMAINTAINED] An interface for node.js to the statistical programming language R, based on the fabulous Rcpp package (C++, 58 stars)
4. deidentify: De-identification of Protected Health Information according to the HIPAA Privacy Rule (JavaScript, 38 stars)
5. node-concept-net: node.js interface to the ConceptNet semantic network API [DEPRECATED; the ConceptNet API has changed] (JavaScript, 29 stars)
6. emscripten-examples: collection of emscripten use cases showing how to port C/C++ functions to JS (JavaScript, 29 stars)
7. kernel-smooth: nonparametric kernel smoothing (JavaScript, 20 stars)
8. tweet-sentiment: SVM Classifier to Detect Sentiment of Tweets (JavaScript, 16 stars)
9. order: Returns a permutation which rearranges an input array (CoffeeScript, 9 stars)
10. SVD: singular value decomposition via emscripten (JavaScript, 9 stars)
11. node-sparqling-star: node.js client for creating SPARQL queries and communicating with services like DBpedia (JavaScript, 8 stars)
12. node-MetaMap: Package providing access to the MetaMap Web API for UMLS (HTML, 8 stars)
13. node-chvocab: Mapping texts to UMLS via the Consumer Health Vocabulary (CHV) (JavaScript, 7 stars)
14. plusArrays.js: Extensions of the Array.prototype object (JavaScript, 7 stars)
15. fisher-transform: inference for the correlation rho via the Fisher transformation (JavaScript, 6 stars)
16. node-wandbox-api: Node.js bindings to the Wandbox API (JavaScript, 6 stars)
17. R-Introduction: short tutorial for the statistical programming language R (HTML, 6 stars)
18. wordnetify: Command line interface to turn text documents into WordNet synset trees (CoffeeScript, 5 stars)
19. node-armadillo: node.js wrapper for the Armadillo C++ linear algebra library (C++, 5 stars)
20. hackernews-ai-digest: Daily Digest of HackerNews submissions related to AI (MDX, 5 stars)
21. qality.js: Multiple-Choice QA System for JavaScript (JavaScript, 4 stars)
22. liquid-screen: Easy Animations (JavaScript, 4 stars)
23. node-spam-detector: small utility app to detect spam URLs (JavaScript, 4 stars)
24. multtest: adjustments of p-values for multiple comparisons (JavaScript, 4 stars)
25. guess-age: Guess the age of a person only from their first name (JavaScript, 4 stars)
26. anagrams: find anagrams and similar-sounding words (JavaScript, 4 stars)
27. ndarray-inv: calculating matrix inverses (JavaScript, 4 stars)
28. feedback-buttons: jQuery plugin for adding feedback buttons to HTML elements (JavaScript, 3 stars)
29. react-latex: React LaTeX Component: http://cmu-isle.github.io/react-latex (JavaScript, 3 stars)
30. GameOfLife.js: (Yet) another implementation of John Horton Conway's Game of Life (JavaScript, 3 stars)
31. generator-node-sweetjs: Yeoman generator for node.js modules powered by sweet.js macros (JavaScript, 3 stars)
32. statbytes-js: StatBytes presentation on JavaScript (JavaScript, 3 stars)
33. node-wordnet-visualizer: Visualization for WordNet objects as created by the wordnet-magic package (JavaScript, 3 stars)
34. gsl-js (C, 2 stars)
35. GSoC-Application (TeX, 2 stars)
36. insert-equations: Parse README.md files and turn LaTeX equations into SVGs (JavaScript, 2 stars)
37. StochasticProcessesNotes: Notes for the course 36-733: Probability Models and Stochastic Processes (TeX, 2 stars)
38. Flashcards: Flashcards for important theorems and definitions in statistics (TeX, 2 stars)
39. detect-generator-support: Detect native generator function support (Makefile, 2 stars)
40. discrete-markovchain: discrete-time, discrete-state-space Markov chain package (JavaScript, 2 stars)
41. node-wordnet-JSON: Visualization frontend for WordNet trees saved in JSON format (JavaScript, 2 stars)
42. traj-utilities: Calculate conditional probabilities for a multi-trajectory model fitted by traj (JavaScript, 2 stars)
43. polymer-codebox: A Polymer web component for evaluating JavaScript code (HTML, 2 stars)
44. node-make-latex: convert JS objects to LaTeX tables from inside node.js (JavaScript, 2 stars)
45. instruction: Introduction to Probability Distributions and Hypothesis Testing (HTML, 2 stars)
46. data-science-survival-kit: Slides for the Data Science Survival Kit Workshop (JavaScript, 1 star)
47. stdlib-dependency-test: A demo repository for a library which depends both directly on stdlib and indirectly through a dependency (JavaScript, 1 star)
48. test-github-create-repo (1 star)
49. Insights-From-Kidney-Space-2: First Heinz Paper presentation (HTML, 1 star)
50. todo-list (1 star)
51. spell-check: simple spelling checker (CoffeeScript, 1 star)
52. express-saml2 (JavaScript, 1 star)
53. isle-ace-builds: Ace editor builds for the ISLE editor (JavaScript, 1 star)
54. locality-sensitive-hashing: locality-sensitive hashing for nearest neighbor search (1 star)
55. chi2gof-test: Test repository for running `chi2gof` (JavaScript, 1 star)
56. reproducible-error-reports: Reproducible example of screen.find erroring (JavaScript, 1 star)
57. ScienceforumsPresentation (R, 1 star)
58. react-code-playground: Created with CodeSandbox (JavaScript, 1 star)
59. empty: Empty package (JavaScript, 1 star)
60. bundle-sizes: Test repository to compare bundle sizes (JavaScript, 1 star)
61. math-snippets: LaTeX Snippets for Atom.io (CoffeeScript, 1 star)
62. stdlib-realtime-statistics-code: Code snippets accompanying an article on realtime statistics with stdlib (JavaScript, 1 star)
63. gsl-polynomial-js: functions for evaluating and solving polynomial equations in JS via the GNU Scientific Library (1 star)
64. introduction-to-stdlib: Talk for the International JavaScript Conference, iJS, London (2018) (JavaScript, 1 star)
65. StickBreakingConstruction: Shiny visualization of the stick-breaking construction of the Dirichlet Process (R, 1 star)
66. node-wordnetify-sample: generate random samples from Wordnetify output (JavaScript, 1 star)