

WarpLDA: Cache Efficient Implementation of Latent Dirichlet Allocation

Introduction

WarpLDA is a cache efficient implementation of Latent Dirichlet Allocation, which samples each token in O(1) time.

Installation

Prerequisites:

  • GCC (>=4.8.5)
  • CMake (>=2.8.12)
  • git
  • libnuma
    • CentOS: yum install libnuma-devel
    • Ubuntu: apt-get install libnuma-dev

Clone this project

git clone https://github.com/thu-ml/warplda

Install the third-party dependency (gflags)

./get_gflags.sh

Download some data and split it into training and testing sets

cd data
wget https://raw.githubusercontent.com/sudar/Yahoo_LDA/master/test/ydir_1k.txt
head -n 900 ydir_1k.txt > ydir_train.txt
tail -n 100 ydir_1k.txt > ydir_test.txt
cd ..

Compile the project

./build.sh
cd release/src
make -j

Quick-start

Format the data

./format -input ../../data/ydir_train.txt -prefix train
./format -input ../../data/ydir_test.txt -vocab_in train.vocab -test -prefix test

Train the model

./warplda --prefix train --k 100 --niter 300

Check the result. Each line describes a topic: its id, the number of tokens assigned to it, and the ten most frequent words with their probabilities.

vim train.info.full.txt

Infer latent topics of some testing data.

./warplda --prefix test --model train.model --inference --niter 40 --perplexity 10

Data format

The data format is identical to that of Yahoo! LDA. The input is a plain text file in which each line is a document. The format of each line is

id1 id2 word1 word2 word3 ...

id1 and id2 are two string document identifiers, each word is a string, and all fields are separated by white space.
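For illustration, here is a minimal Python sketch (file and document names are hypothetical) that writes two toy documents in this format:

# write_toy_corpus.py -- hypothetical helper, not part of WarpLDA
docs = [
    ("doc1", ["apple", "banana", "apple"]),
    ("doc2", ["carrot", "banana"]),
]
with open("toy_train.txt", "w") as f:
    for doc_id, words in docs:
        # id1 and id2 are both document identifiers; here the same id is used twice
        f.write("%s %s %s\n" % (doc_id, doc_id, " ".join(words)))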

Output format

WarpLDA generates a number of files:

.vocab (generated by format)

Each line is a word in the vocabulary.

.info.full.txt (generated by warplda -estimate)

The most frequent words for each topic. Each line is a topic, with its topic id, the number of tokens assigned to it, and a number of the most frequent words in the format (probability, word). The number of most frequent words is controlled by -ntop. .info.words.txt is a simpler version which contains only the words.

.model (generated by warplda -estimate)

The word-topic count matrix. The first line contains four numbers

<size of vocabulary> <number of topics> <alpha> <beta>

Each of the remaining lines is a row of the word-topic count matrix, represented in the libsvm sparse vector format,

<number of elements> index:count index:count ...

For example, 0:2 on the first line means that the first word in the vocabulary is assigned to topic 0 two times.
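As a rough illustration, the following Python sketch (hypothetical, assuming exactly the layout described above) reads a .model file into a list of sparse rows:

# read_model.py -- hypothetical sketch, assuming the layout described above
def read_model(path):
    with open(path) as f:
        vocab_size, num_topics, alpha, beta = f.readline().split()
        rows = []  # one sparse row (a dict mapping topic id -> count) per word
        for line in f:
            fields = line.split()
            # fields[0] is the number of elements; the rest are index:count pairs
            row = {}
            for item in fields[1:]:
                topic, count = item.split(":")
                row[int(topic)] = int(count)
            rows.append(row)
    return int(vocab_size), int(num_topics), float(alpha), float(beta), rows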

.z.estimate (generated by warplda -estimate)

The topic assignment of each token, in the libsvm sparse vector format. Each line is a document,

<number of tokens> <word id>:<topic id> <word id>:<topic id> ...

.z.inference (generated by warplda -inference)

The format is the same as .z.estimate.
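A similar hedged Python sketch for reading either assignment file, assuming the layout above:

# read_assignments.py -- hypothetical sketch, assuming the layout above
def read_assignments(path):
    docs = []
    with open(path) as f:
        for line in f:
            fields = line.split()
            # fields[0] is the token count; the rest are <word id>:<topic id> pairs
            pairs = [tuple(map(int, item.split(":"))) for item in fields[1:]]
            docs.append(pairs)
    return docs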

Other features

  • Use a custom prefix for output files: -prefix myprefix

  • Output the perplexity every 10 iterations: -perplexity 10

  • Tune the Dirichlet hyperparameters: -alpha 10 -beta 0.1

  • Use data from the UCI Machine Learning Repository, as shown below (a rough Python sketch of the uci-to-yahoo step follows these commands)

      wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/vocab.nips.txt
      wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nips.txt.gz
      gunzip docword.nips.txt.gz
      ./uci-to-yahoo docword.nips.txt vocab.nips.txt -o nips.txt
      head -n 1400 nips.txt > nips_train.txt
      tail -n 100 nips.txt > nips_test.txt
    

License

MIT

Reference

Please cite WarpLDA if you find it useful!

@inproceedings{chen2016warplda,
  title={WarpLDA: a Cache Efficient O(1) Algorithm for Latent Dirichlet Allocation},
  author={Chen, Jianfei and Li, Kaiwei and Zhu, Jun and Chen, Wenguang},
  booktitle={VLDB},
  year={2016}
}
