  • Stars: 127
  • Rank: 272,567 (top 6%)
  • Language: C++
  • License: Other
  • Created: almost 9 years ago
  • Updated: almost 8 years ago

Repository Details

Single Machine implementation of LDA

Modules

  1. parallelLDA contains various implementations of multi-threaded LDA
  2. singleLDA contains various implementations of single-threaded LDA
  3. topwords, a tool to explore the topics learnt by LDA/HDP
  4. perplexity, a tool to calculate perplexity on another dataset using the word|topic matrix (see the formula sketch after this list)
  5. datagen, which packages txt files for our program
  6. preprocessing, for converting from the UCI or cLDA format to a simple txt file with one document per line
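
For reference, the perplexity tool's output is conventionally the held-out measure below; this is a sketch of the standard definition, and the tool's exact estimator may differ in details. Here phi is the word|topic matrix produced by training and theta_d is the topic distribution inferred for a held-out document d:

    \[
      \mathrm{perplexity}(\mathcal{D}_{\mathrm{test}})
        = \exp\!\Biggl(
            -\frac{\sum_{d=1}^{D}\sum_{i=1}^{N_d}
                   \log \sum_{k=1}^{K} \theta_{dk}\,\phi_{k,w_{di}}}
                  {\sum_{d=1}^{D} N_d}
          \Biggr)
    \]

Lower is better: it is the exponentiated negative mean log-likelihood per token.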

Organisation

  1. All code is under src, within the respective module's folder
  2. Many template scripts for running the topic models are provided under scripts
  3. data is a placeholder folder where the datasets go
  4. The build and dist folders will be created to hold the executables

Requirements

  1. gcc >= 5.0 or Intel® C++ Compiler 2016, for the C++14 features used
  2. split >= 8.21 (part of GNU coreutils)

How to use

We will show how to run our LDA on a UCI bag-of-words dataset.

  1. First, compile by running make:

      make
  2. Download an example dataset from the UCI repository; a script is provided for this:

      scripts/get_data.sh
  3. Prepare the data for our program:

      scripts/prepare.sh data/nytimes 1

    For other datasets, replace nytimes with the dataset's name or location.

  4. Run LDA!

      scripts/lda_runner.sh

    Inside lda_runner.sh, all the parameters, e.g. the number of topics, the hyperparameters of the LDA, the number of threads, etc., can be specified. By default the outputs are stored under out/. You can also specify which LDA inference algorithm to run:

    1. simpleLDA: plain vanilla Gibbs sampling by Griffiths04 (a sketch follows this list)
    2. sparseLDA: sparse LDA of Yao09
    3. aliasLDA: alias LDA
    4. FTreeLDA: F++LDA (inspired by Yu14)
    5. lightLDA: light LDA of Yuan14
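
To make the first option concrete, here is a minimal sketch of one sweep of plain collapsed Gibbs sampling, the algorithm behind the simpleLDA option (after Griffiths04). It is illustrative only: the names are not from this repository's sources, and building the initial counts from a random topic assignment is omitted.

    // Sketch of one sweep of plain collapsed Gibbs sampling for LDA
    // (what the simpleLDA option does, after Griffiths04).
    #include <cstddef>
    #include <random>
    #include <vector>

    struct State {
        int K = 0, V = 0;                    // number of topics, vocabulary size
        double alpha = 0.1, beta = 0.01;     // symmetric Dirichlet priors
        std::vector<std::vector<int>> docs;  // docs[d][i] = word id of token i
        std::vector<std::vector<int>> z;     // z[d][i]    = topic of token i
        std::vector<int> nwk;                // nwk[w*K+k] = count of word w in topic k
        std::vector<int> ndk;                // ndk[d*K+k] = count of topic k in doc d
        std::vector<int> nk;                 // nk[k]      = total tokens in topic k
    };

    // Resample every token's topic from the collapsed conditional
    //   p(z = k | rest)  ∝  (ndk + α) · (nwk + β) / (nk + V·β)
    void gibbs_sweep(State& s, std::mt19937& rng) {
        std::vector<double> cdf(s.K);
        for (std::size_t d = 0; d < s.docs.size(); ++d) {
            for (std::size_t i = 0; i < s.docs[d].size(); ++i) {
                const int w = s.docs[d][i];
                int k = s.z[d][i];
                // Take the token out of the counts.
                --s.nwk[w * s.K + k]; --s.ndk[d * s.K + k]; --s.nk[k];
                // Build the unnormalized CDF over topics: O(K) per token.
                double total = 0.0;
                for (int t = 0; t < s.K; ++t) {
                    total += (s.ndk[d * s.K + t] + s.alpha)
                           * (s.nwk[w * s.K + t] + s.beta)
                           / (s.nk[t] + s.V * s.beta);
                    cdf[t] = total;
                }
                // Invert the CDF with a uniform draw on [0, total).
                const double u =
                    std::uniform_real_distribution<double>(0.0, total)(rng);
                k = 0;
                while (cdf[k] < u) ++k;
                // Put the token back under its new topic.
                s.z[d][i] = k;
                ++s.nwk[w * s.K + k]; ++s.ndk[d * s.K + k]; ++s.nk[k];
            }
        }
    }

The inner loop makes every token cost O(K); the other options above (sparseLDA, aliasLDA, FTreeLDA, lightLDA) are different strategies for cutting that per-token cost.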

The makefile has some useful features:

  • if you have the Intel® C++ Compiler, you can instead run

      make intel
  • or, to use the Intel® C++ Compiler's cross-file optimization (IPO), run

      make inteltogether
  • you can also compile individual modules selectively with

      make <module-name>
  • or clean them individually with

      make clean-<module-name>

Performance

Based on our evaluation, F++LDA works best in terms of both speed and perplexity on a held-out dataset. For example, on an Amazon EC2 c4.8xlarge instance we obtained more than 25 million tokens per second. Below we provide a performance comparison against various inference procedures on publicly available datasets.
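
To put that rate in perspective (a back-of-envelope calculation using the NY Times statistics from the table below), one full sweep over the training tokens would take roughly

    \[
      \frac{99{,}542{,}127\ \text{tokens}}{25{,}000{,}000\ \text{tokens/s}}
        \approx 4\ \text{seconds.}
    \]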

Datasets

Dataset     V         L              D          L/V       L/D
NY Times    101,330   99,542,127     299,753    982.36    332.08
PubMed      141,043   737,869,085    8,200,000  5,231.52  89.98
Wikipedia   210,218   1,614,349,889  3,731,325  7,679.41  432.65

Experimental datasets and their statistics. V denotes vocabulary size, L denotes the number of training tokens, D denotes the number of documents, L/V indicates the average number of occurrences of a word, L/D indicates the average length of a document.
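
As a quick check of the ratio columns, the NY Times row gives

    \[
      L/V = \frac{99{,}542{,}127}{101{,}330} \approx 982.36,
      \qquad
      L/D = \frac{99{,}542{,}127}{299{,}753} \approx 332.08,
    \]

matching the table.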

[Figure: log-perplexity with time]
