Hierarchical Dirichlet processes. Topic models where the data determine the number of topics. This implements Gibbs sampling.

Hierarchical Dirichlet Process (with Split-Merge Operations)


(C) Copyright 2010, Chong Wang and David Blei. Written by Chong Wang.

This is a C++ implementation of the hierarchical Dirichlet process for topic modeling.

README

NB: The split-merge algorithm is preliminary. This code requires the GNU Scientific Library (GSL), http://www.gnu.org/software/gsl/


TABLE OF CONTENTS

A. COMPILING

B. POSTERIOR INFERENCE

C. INFERENCE ON NEW DATA

D. PARAMETER SETTINGS

E. PRINTING TOPICS


A. COMPILING

Make sure the GSL is installed, then type "make" in a shell. You may need to adjust the Makefile for your system (for example, the GSL include and library paths).

B. POSTERIOR INFERENCE

The following command performs posterior inference on a set of documents:

hdp --algorithm train --data data --directory train_dir

Data format

--data points to a file in the LDA-C format, where each line describes one document and has the form:

 [M] [term_1]:[count] [term_2]:[count] ...  [term_N]:[count]

where [M] is the number of unique terms in the document, and the [count] associated with each term is how many times that term appeared in the document.
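The line format above can be sketched in a few lines of Python. This is only an illustration, not part of the hdp distribution: the helper names to_ldac_line and parse_ldac_line are hypothetical, and the sketch assumes that terms are integer vocabulary indices, as is usual for LDA-C data.

```python
from collections import Counter


def to_ldac_line(token_ids):
    """Format one document (a list of integer term ids) as an LDA-C line."""
    counts = Counter(token_ids)
    # [M] is the number of unique terms, followed by term:count pairs.
    pairs = " ".join(f"{term}:{count}" for term, count in sorted(counts.items()))
    return f"{len(counts)} {pairs}".rstrip()


def parse_ldac_line(line):
    """Parse an LDA-C line back into a {term_id: count} dictionary."""
    fields = line.split()
    num_unique = int(fields[0])
    counts = {}
    for field in fields[1:]:
        term, count = field.split(":")
        counts[int(term)] = int(count)
    if len(counts) != num_unique:
        raise ValueError("[M] does not match the number of term:count pairs")
    return counts


# A toy document in which term 0 occurs twice, term 3 three times, term 7 once.
print(to_ldac_line([0, 3, 3, 7, 0, 3]))  # 3 0:2 3:3 7:1
```

Writing one such line per document yields a file that can be passed to --data.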

The sampler writes the following files to the --directory:

*-topics.dat: the word counts for each topic, with each line as a topic

*-word-assignments.dat: each word's assignment to a topic and a table, in an R-friendly format with the columns d w z t, where

 d: document id
 w: word id
 z: topic index
 t: table index (document level only; irrelevant if you only analyze the topics)
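As a rough sketch of how the d w z t columns might be consumed outside of R, the Python snippet below counts how many word tokens are assigned to each topic. The function name topic_sizes is hypothetical, and the sketch assumes whitespace-separated integer columns, one word token per row, and possibly a "d w z t" header line, which it skips.

```python
from collections import Counter


def topic_sizes(lines):
    """Count word tokens per topic index z from rows of the form 'd w z t'."""
    sizes = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank lines and a possible non-numeric header such as "d w z t".
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        d, w, z, t = (int(x) for x in fields[:4])
        sizes[z] += 1
    return sizes


# Toy rows in the assumed layout; real files come from the sampler.
sample = ["d w z t", "0 12 0 0", "0 40 1 1", "1 12 0 0"]
for topic, size in sorted(topic_sizes(sample).items()):
    print(topic, size)
```

For a real run, pass the open file object for a *-word-assignments.dat file instead of the toy list.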

*.bin: the binary model file used for inference on new data.

state.log: various information to monitor the Markov chain.

For more parameter settings, run: hdp --help

Note: some parameters for split-merge are hard-coded at the beginning of the hdp.cpp file.


C. INFERENCE ON NEW DATA

To perform inference on a different set of data (in the same format as before), run:

hdp --algorithm test --data data --saved_model saved_model --directory test_dir

where --saved_model is the binary file from the posterior inference on training data.
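When the train and test steps are scripted, the two invocations can be assembled programmatically. The sketch below is a hypothetical Python helper (hdp_command is not part of this package) that uses only the flags shown in this README and accepts only the two modes shown here. It builds the argument list and does not run hdp itself.

```python
import shlex


def hdp_command(algorithm, data, directory, saved_model=None):
    """Build the argument list for the train or test invocation shown above."""
    if algorithm not in ("train", "test"):
        raise ValueError("this sketch only covers 'train' and 'test'")
    args = ["hdp", "--algorithm", algorithm, "--data", data]
    if algorithm == "test":
        # Test mode needs the binary model written during training.
        if saved_model is None:
            raise ValueError("test mode requires saved_model")
        args += ["--saved_model", saved_model]
    args += ["--directory", directory]
    return args


cmd = hdp_command("test", "data", "test_dir", saved_model="saved_model")
print(shlex.join(cmd))
```

The resulting list can be handed to subprocess.run once the hdp binary is on the PATH.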

The sampler writes the following files to the --directory:

test-*-topics.dat: the word counts for each topic, with each line as a topic

test*-word-assignments.dat: each word's assignment to a topic and a table, in the same R-friendly format as for training.

test.log: various information to monitor the Markov chain.

test-*.bin: the binary model file used for inference on newer data.

For more parameter settings, run: hdp --help


D. PARAMETER SETTINGS

The parameters have the same meaning as in the following paper:

Y. Teh, M. Jordan, M. Beal, and D. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566-1581, 2006.


E. PRINTING TOPICS

An R script (print.topics.R) is included to print topics. Make sure it is executable (chmod +x print.topics.R). For example,

print.topics.R mode-topics.dat vocab.dat topics.dat 10

will produce a topic list with the top 10 words of each topic. For help, run:

print.topics.R
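For readers without R, the same idea can be sketched in Python. This is not a port of print.topics.R. It assumes that each line of the topics file holds whitespace-separated integer word counts, that column i corresponds to line i of the vocabulary file, and that the vocabulary file lists one term per line. The function name top_words is hypothetical.

```python
def top_words(topic_lines, vocab, n=10):
    """Return the n highest-count words for each topic (one topic per line)."""
    topics = []
    for line in topic_lines:
        if not line.strip():
            continue  # ignore blank lines
        counts = [int(x) for x in line.split()]
        # Rank word indices by descending count; ties keep vocabulary order.
        ranked = sorted(range(len(counts)), key=lambda i: -counts[i])
        topics.append([vocab[i] for i in ranked[:n]])
    return topics


# Toy vocabulary and topic counts in the assumed layout.
vocab = ["apple", "bank", "cat", "data"]
topic_lines = ["5 0 2 9", "0 7 1 0"]
for k, words in enumerate(top_words(topic_lines, vocab, n=2)):
    print(k, " ".join(words))
```

With real output, topic_lines would be the lines of a *-topics.dat file and vocab the stripped lines of the vocabulary file.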
