  • Stars: 127
  • Rank: 272,567 (top 6%)
  • Language: C++
  • License: Other
  • Created: almost 9 years ago
  • Updated: almost 8 years ago

Repository Details

Single Machine implementation of LDA

Modules

  1. parallelLDA contains various implementations of multi-threaded LDA
  2. singleLDA contains various implementations of single-threaded LDA
  3. topwords, a tool to explore the topics learnt by LDA/HDP
  4. perplexity, a tool to calculate perplexity on another dataset using the word|topic matrix (see the formula sketch after this list)
  5. datagen, which packages txt files for our program
  6. preprocessing, for converting from the UCI or cLDA format to a simple txt file with one document per line
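
For reference, the perplexity tool's output is conventionally the held-out measure below; this is a sketch of the standard definition, and the tool's exact estimator may differ in details. Here phi is the word|topic matrix produced by training and theta_d is the topic distribution inferred for a held-out document d:

    \[
      \mathrm{perplexity}(\mathcal{D}_{\mathrm{test}})
        = \exp\!\Biggl(
            -\frac{\sum_{d=1}^{D}\sum_{i=1}^{N_d}
                   \log \sum_{k=1}^{K} \theta_{dk}\,\phi_{k,w_{di}}}
                  {\sum_{d=1}^{D} N_d}
          \Biggr)
    \]

Lower is better: it is the exponentiated negative mean log-likelihood per token.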

Organisation

  1. All code is under src, within the respective module's folder
  2. Many template scripts for running the topic models are provided under scripts
  3. data is a placeholder folder where the datasets go
  4. The build and dist folders will be created to hold the executables

Requirements

  1. gcc >= 5.0 or Intel® C++ Compiler 2016, for the C++14 features used
  2. split >= 8.21 (part of GNU coreutils)

How to use

We will show how to run our LDA on a UCI bag-of-words dataset.

  1. First, compile by running make:

      make
  2. Download an example dataset from the UCI repository; a script is provided for this:

      scripts/get_data.sh
  3. Prepare the data for our program:

      scripts/prepare.sh data/nytimes 1

    For other datasets, replace nytimes with the dataset's name or location.

  4. Run LDA!

      scripts/lda_runner.sh

    Inside lda_runner.sh, all the parameters, e.g. the number of topics, the hyperparameters of the LDA, the number of threads, etc., can be specified. By default the outputs are stored under out/. You can also specify which LDA inference algorithm to run:

    1. simpleLDA: plain vanilla Gibbs sampling by Griffiths04 (a sketch follows this list)
    2. sparseLDA: sparse LDA of Yao09
    3. aliasLDA: alias LDA
    4. FTreeLDA: F++LDA (inspired by Yu14)
    5. lightLDA: light LDA of Yuan14
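
To make the first option concrete, here is a minimal sketch of one sweep of plain collapsed Gibbs sampling, the algorithm behind the simpleLDA option (after Griffiths04). It is illustrative only: the names are not from this repository's sources, and building the initial counts from a random topic assignment is omitted.

    // Sketch of one sweep of plain collapsed Gibbs sampling for LDA
    // (what the simpleLDA option does, after Griffiths04).
    #include <cstddef>
    #include <random>
    #include <vector>

    struct State {
        int K = 0, V = 0;                    // number of topics, vocabulary size
        double alpha = 0.1, beta = 0.01;     // symmetric Dirichlet priors
        std::vector<std::vector<int>> docs;  // docs[d][i] = word id of token i
        std::vector<std::vector<int>> z;     // z[d][i]    = topic of token i
        std::vector<int> nwk;                // nwk[w*K+k] = count of word w in topic k
        std::vector<int> ndk;                // ndk[d*K+k] = count of topic k in doc d
        std::vector<int> nk;                 // nk[k]      = total tokens in topic k
    };

    // Resample every token's topic from the collapsed conditional
    //   p(z = k | rest)  ∝  (ndk + α) · (nwk + β) / (nk + V·β)
    void gibbs_sweep(State& s, std::mt19937& rng) {
        std::vector<double> cdf(s.K);
        for (std::size_t d = 0; d < s.docs.size(); ++d) {
            for (std::size_t i = 0; i < s.docs[d].size(); ++i) {
                const int w = s.docs[d][i];
                int k = s.z[d][i];
                // Take the token out of the counts.
                --s.nwk[w * s.K + k]; --s.ndk[d * s.K + k]; --s.nk[k];
                // Build the unnormalized CDF over topics: O(K) per token.
                double total = 0.0;
                for (int t = 0; t < s.K; ++t) {
                    total += (s.ndk[d * s.K + t] + s.alpha)
                           * (s.nwk[w * s.K + t] + s.beta)
                           / (s.nk[t] + s.V * s.beta);
                    cdf[t] = total;
                }
                // Invert the CDF with a uniform draw on [0, total).
                const double u =
                    std::uniform_real_distribution<double>(0.0, total)(rng);
                k = 0;
                while (cdf[k] < u) ++k;
                // Put the token back under its new topic.
                s.z[d][i] = k;
                ++s.nwk[w * s.K + k]; ++s.ndk[d * s.K + k]; ++s.nk[k];
            }
        }
    }

The inner loop makes every token cost O(K); the other options above (sparseLDA, aliasLDA, FTreeLDA, lightLDA) are different strategies for cutting that per-token cost.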

The makefile has some useful features:

  • if you have the Intel® C++ Compiler, you can instead run

      make intel
  • or, to use the Intel® C++ Compiler's cross-file optimization (IPO), run

      make inteltogether
  • you can also compile individual modules selectively with

      make <module-name>
  • or clean them individually with

      make clean-<module-name>

Performance

Based on our evaluation, F++LDA works best in terms of both speed and perplexity on a held-out dataset. For example, on an Amazon EC2 c4.8xlarge instance we obtained more than 25 million tokens per second. Below we provide a performance comparison against various inference procedures on publicly available datasets.
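
To put that rate in perspective (a back-of-envelope calculation using the NY Times statistics from the table below), one full sweep over the training tokens would take roughly

    \[
      \frac{99{,}542{,}127\ \text{tokens}}{25{,}000{,}000\ \text{tokens/s}}
        \approx 4\ \text{seconds.}
    \]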

Datasets

Dataset     V         L              D          L/V       L/D
NY Times    101,330   99,542,127     299,753    982.36    332.08
PubMed      141,043   737,869,085    8,200,000  5,231.52  89.98
Wikipedia   210,218   1,614,349,889  3,731,325  7,679.41  432.65

Experimental datasets and their statistics. V denotes vocabulary size, L denotes the number of training tokens, D denotes the number of documents, L/V indicates the average number of occurrences of a word, L/D indicates the average length of a document.
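
As a quick check of the ratio columns, the NY Times row gives

    \[
      L/V = \frac{99{,}542{,}127}{101{,}330} \approx 982.36,
      \qquad
      L/D = \frac{99{,}542{,}127}{299{,}753} \approx 332.08,
    \]

matching the table.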

[Figure: log-perplexity with time]
