• This repository has been archived on 22/Nov/2018
• Stars: 149
• Rank: 248,619 (Top 5%)
• Language: Java
• License: Apache License 2.0
• Created: almost 13 years ago
• Updated: about 9 years ago

Repository Details

Scalable Topic Modeling using Variational Inference in MapReduce

Mr.LDA

Mr.LDA is an open-source package for flexible, scalable, multilingual topic modeling using variational inference in MapReduce. For more details, please consult this paper. The latest version of the code can always be found on GitHub.

Please send any bug reports or questions to Ke Zhai ([email protected]).

Getting Started

Clone the repo:

$ git clone git@github.com:lintool/Mr.LDA.git

Then build using the standard invocation:

$ mvn clean package

If you want to set up your Eclipse environment:

$ mvn eclipse:clean
$ mvn eclipse:eclipse

Corpus Preparation

Some sample data from the Associated Press can be found in this separate repo. This is the same sample data that is used in Blei's LDA implementation in C.

The repo includes a Python script for parsing the corpus into a format that Mr.LDA uses. The output of the script is stored in ap-sample.txt.gz. This is the data file that you'll want to load into HDFS (e.g., unpack it and copy it up with hadoop fs -put) before running the commands below.

Mr.LDA takes plain text files as input, where each line in the text file represents a document. The document id and content are separated by a tab, and words in the content are separated by spaces. For example, the first two lines of ap-sample.txt look like:

ap881218-0003   student privat baptist school allegedli kill ...
ap880224-0195   bechtel group offer sell oil israel discount ...

To convert the corpus into the internal format used by Mr.LDA, run the following command:

$ hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar cc.mrlda.ParseCorpus \
    -input ap-sample.txt -output ap-sample-parsed

When you examine the output, you'll see:

$ hadoop fs -ls ap-sample-parsed
ap-sample-parsed/document
ap-sample-parsed/term
ap-sample-parsed/title

The directory term stores the mapping between each unique token and the integer id used internally (i.e., the dictionary). The directory title stores the mapping between each document id and its internal integer id. Both are stored in SequenceFile format, with IntWritable as the key and Text as the value.
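
If you prefer to inspect these mappings programmatically rather than through ReadSequenceFile, a minimal sketch using Hadoop's standard SequenceFile.Reader API might look like the following (the path is illustrative and assumes the mapping is stored as a single SequenceFile):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class DumpDictionary {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Illustrative path: the dictionary written by ParseCorpus.
    Path path = new Path("ap-sample-parsed/term");

    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    IntWritable key = new IntWritable();  // internal integer id
    Text value = new Text();              // the token itself
    try {
      while (reader.next(key, value)) {
        System.out.println(key.get() + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}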

To examine the first 20 document id mappings:

$ hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar \
    edu.umd.cloud9.io.ReadSequenceFile ap-sample-parsed/title 20

And to examine the first 20 terms of the dictionary:

$ hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar \
    edu.umd.cloud9.io.ReadSequenceFile ap-sample-parsed/term 20

Running "Vanilla" LDA

Mr.LDA implements LDA using variational inference. Here's an invocation for running 50 iterations on the sample dataset:

$ nohup hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar \
    cc.mrlda.VariationalInference \
    -input ap-sample-parsed/document -output ap-sample-lda \
    -term 10000 -topic 20 -iteration 50 -mapper 50 -reducer 20 >& lda.log &

The above command puts the process in the background, and you can tail -f lda.log to follow its progress.

Note that the -term option specifies the number of unique tokens in the corpus. This just needs to be a reasonable upper bound.

If the MapReduce jobs are interrupted for any reason, you can restart at a particular iteration with the -modelindex parameter. For example, to pick up where the previous command left off and run another 10 iterations, do this:

$ nohup hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar \
    cc.mrlda.VariationalInference \
    -input ap-sample-parsed/document -output ap-sample-lda \
    -term 10000 -topic 20 -iteration 60 -mapper 50 -reducer 20 \
    -modelindex 50 >& lda.log &

Evaluation on 20newsgroup

In this section, we will evaluate Mr. LDA using held-out likelihood and topic coherence on the 20newsgroup data. The key to evaluating any machine learning algorithm is to split the corpus into three datasets: a training set, a development set, and a test set. The training set is used to fit the model, the development set is used to select parameters, and the test set is used for evaluation. For this task, since we do not focus on tuning parameters, we use only the training set and the test set.

Step 1: Preprocessing

The sample split and scripts for preprocessing 20newsgroup can be found in a separate repo. Clone the repo and use it as your working directory:

$ git clone git@github.com:lintool/Mr.LDA-data.git
$ cd Mr.LDA-data

Download the 20newsgroup data from here:

$ wget http://qwone.com/~jason/20Newsgroups/20news-bydate.tar.gz
$ tar -xzvf 20news-bydate.tar.gz

Unpack the following file:

$ tar xvfz 20news-labels.tgz

Preprocess the collection:

# 50/500 are min/max document frequencies for a word, alter these numbers as you wish
$ python parse_20news/createLdacMrlda.py 20news-labels/train.labels \
   20news-labels/dev.labels 20news-labels/test.labels 50 500

The above command generates the following files:

  • LDAC files: 20news.ldac.train, 20news.ldac.dev, 20news.ldac.test
  • Mr. LDA files: 20news.mrlda.train, 20news.mrlda.dev, 20news.mrlda.test
  • Raw data files: 20news.raw.train, 20news.raw.dev, 20news.raw.test
  • Vocabulary file: 20news.vocab.txt
  • Statistics and final labels: 20news.stat.train, 20news.stat.dev, 20news.stat.test

Step 2: Run Mr. LDA

Follow the steps below to run Mr. LDA:

# Copy data to hdfs
$ hadoop fs -put 20news.mrlda.train

# Parse corpus
$ hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar cc.mrlda.ParseCorpus \
   -input 20news.mrlda.train -output 20news.mrlda.train-parsed

# Run Vanilla LDA with symmetric alpha
$ nohup hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar \
    cc.mrlda.VariationalInference \
    -input 20news.mrlda.train-parsed/document \
    -output 20news.mrlda.train-lda \
    -symmetricalpha 0.01 \
    -topic 20 -term 100000 -iteration 1000 \
    -mapper 50 -reducer 20 >& 20news.log &

Step 3: Compute held-out likelihood

We use Blei's LDA implementation in C (LDAC) to compute held-out likelihood. LDAC requires a .beta file and a .other file to compute the held-out score on unseen data.

First, download and untar LDAC:

$ wget http://www.cs.princeton.edu/~blei/lda-c/lda-c-dist.tgz
$ tar -xzvf lda-c-dist.tgz
$ cd lda-c-dist
$ make

Grab the beta file from Mr. LDA and convert it to LDAC format:

# Grab the beta file from HDFS; ITERATION is the iteration at which Mr. LDA converged
$ hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar \
    edu.umd.cloud9.io.ReadSequenceFile \
    20news.mrlda.train-lda/beta-ITERATION > 20news.mrlda.train.20.beta

# convert to proper format for LDAC
$ python parse_20news/convertMrldaBetaToBeta.py 20news.mrlda.train.20.beta \
    20news.mrlda.train.20.ldac.beta VOCAB_SIZE

Here VOCAB_SIZE is the size of the vocabulary; you can get this number (the line count) by running wc 20news.vocab.txt.

Create a .other file and name it 20news.mrlda.train.20.ldac.other. This file contains values for alpha, the number of topics, and the vocabulary size.

For example:

num_topics 20
num_terms 3126
alpha 0.015

Remember that alpha is the document-topic hyperparameter learned by Mr. LDA. You can find its value in 20news.mrlda.train-lda/alpha-ITERATION using the following command:

$ hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar \
   edu.umd.cloud9.io.ReadSequenceFile 20news.mrlda.train-lda/alpha-ITERATION

Finally, compute the held-out likelihood:

# Infer held-out likelihood for each document
$ cd lda-c-dist
$ ./lda inf inf-settings.txt \
   ../20news.mrlda.train.20.ldac ../20news.ldac.test ../20news.mrlda.20.HL

# Average heldout scores
$ cd ..
$ python parse_20news/calculate_heldout_likelihood.py 20news.mrlda.20.HL-lda-lhood.dat

Step 4: Compute topic coherence

We use the Topic Interpretability tool to compute topic coherence. Topic coherence evaluates topics against a ground corpus using measures such as NPMI (normalized pointwise mutual information). The ground corpus could be all of Wikipedia; for this task, we use the training and test sets as the ground corpus.
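
For reference, the NPMI score of a word pair (w_i, w_j) is typically defined as follows (this is the standard formulation; the tool's exact estimation details may differ):

    \mathrm{NPMI}(w_i, w_j) = \frac{\log \frac{P(w_i, w_j)}{P(w_i)\,P(w_j)}}{-\log P(w_i, w_j)}

A topic's coherence is then typically the average NPMI over the pairs of its top-N words.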

Download the Topic Interpretability tool:

$ git clone https://github.com/jhlau/topic_interpretability

Prepare the corpus:

$ mkdir 20news_train_test_raws
$ cp -r 20news.raw.train 20news_train_test_raws/
$ cp -r 20news.raw.test 20news_train_test_raws/

Compute statistics:

# Generate .oneline file
$ python parse_20news/createOneLineDict.py 20news.vocab.txt

# Get statistics of corpus
$ python topic_interpretability/ComputeWordCount.py 20news.vocab.txt.oneline \
    20news_train_test_raws > 20news.train.test.wc

If you want to use Wikipedia as the ground corpus, it is better to split it into a directory of many small files (e.g., 1000 documents per file, one document per line).

Prepare topics:

# Get topic file from HDFS
$ hadoop jar target/mrlda-0.9.0-SNAPSHOT-fatjar.jar cc.mrlda.DisplayTopic \
    -index 20news.mrlda.train-parsed/term \
    -input 20news.mrlda.train-lda/beta-ITERATION \
    -topdisplay 20 > 20news.mrlda.train.20.topics

# Convert it to proper format
$ python parse_20news/convertMrldaTopicsToTopics.py 20news.mrlda.train.20.topics \
    20news.mrlda.train.20.ti.topics 20

Compute Topic Coherence using npmi:

$ python topic_interpretability/ComputeObservedCoherence.py \
    20news.mrlda.train.20.ti.topics npmi 20news.train.test.wc > 20news.mrlda.20.oc

WARNING: The following documentation may be out of date...

Input Data Format

The data format for the Mr. LDA package is defined in the class Document.java of every package. It consists of an HMapII.java object, which stores all word:count pairs in a document using an integer:integer hash map. Take note that word indexing starts from 1; index 0 is reserved for system messages. Interested users can refer to the following piece of code to convert an indexed document String to a Document:

// An indexed document is a whitespace-separated string of integer word ids
// (ids start from 1; id 0 is reserved), obtained by mapping each token of the
// raw text through the dictionary produced by ParseCorpus.
String inputDocument = "1 2 3 4 5 6 7 8";
Document outputDocument = new Document();
HMapII content = new HMapII();
StringTokenizer stk = new StringTokenizer(inputDocument);
while (stk.hasMoreTokens()) {
  content.increment(Integer.parseInt(stk.nextToken()), 1);
}
outputDocument.setDocument(content);

By default, Mr. LDA accepts only sequence file format as input. The sequence file should be keyed by a unique document ID of type IntWritable.java and valued by the corresponding Document.java data type.
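
If you want to generate this input format yourself rather than through ParseCorpus, a rough sketch using Hadoop's SequenceFile.createWriter API might look like the following (the output path and document contents are illustrative, and the imports for Document and HMapII are omitted because their exact packages depend on the Mr.LDA/Cloud9 version you build against):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;

// Document and HMapII come from the Mr.LDA/Cloud9 jars (imports omitted).
public class WriteMrldaInput {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("my-corpus/document");  // illustrative output path

    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, IntWritable.class, Document.class);
    try {
      HMapII content = new HMapII();
      content.increment(1, 2);  // word id 1 occurs twice in this document
      content.increment(5, 1);  // word id 5 occurs once

      Document doc = new Document();
      doc.setDocument(content);

      writer.append(new IntWritable(1), doc);  // key: unique document id
    } finally {
      writer.close();
    }
  }
}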

If you preprocess the raw text using the ParseCorpus.java command, the directory /hadoop/index/document/output/directory/document is the exact input to the following stage.

Latent Dirichlet Allocation

The primary entry point of the Mr. LDA package is the VariationalInference.java class. You may start training, resume training, or launch testing on input data.

To print the help information and usage hints, please run the following command

hadoop jar Mr.LDA.jar cc.mrlda.VariationalInference -help

To train an LDA model on a dataset, please run one of the following commands:

hadoop jar Mr.LDA.jar cc.mrlda.VariationalInference -input /hadoop/index/document/output/directory/document -output /hadoop/mrlda/output/directory -term 60000 -topic 100
hadoop jar Mr.LDA.jar cc.mrlda.VariationalInference -input /hadoop/index/document/output/directory/document -output /hadoop/mrlda/output/directory -term 60000 -topic 100 -iteration 40
hadoop jar Mr.LDA.jar cc.mrlda.VariationalInference -input /hadoop/index/document/output/directory/document -output /hadoop/mrlda/output/directory -term 60000 -topic 100 -iteration 40 -mapper 50 -reducer 20
hadoop jar Mr.LDA.jar cc.mrlda.VariationalInference -input /hadoop/index/document/output/directory/document -output /hadoop/mrlda/output/directory -term 60000 -topic 100 -iteration 40 -mapper 50 -reducer 20 -localmerge

The first four parameters are required; the remaining options are free parameters with their respective default values. Take note that the -term option specifies the total number of unique tokens in the whole corpus. If this value is not available at run time, it is advised to set this option to an approximate upper bound on the total number of unique tokens in the corpus.

To resume training an LDA model on a dataset, please run the following command; it resumes Mr. LDA from iteration 5 to iteration 40:

hadoop jar Mr.LDA.jar cc.mrlda.VariationalInference -input /hadoop/index/document/output/directory/document -output /hadoop/mrlda/output/directory -term 60000 -topic 100 -iteration 40 -modelindex 5

Take note that resuming Mr. LDA learning requires the corresponding beta (distribution over tokens for a given topic), alpha (hyper-parameter over topics), and gamma (distribution over topics for a given document) to be present.

To test an LDA model on a held-out dataset, please run the following command:

hadoop jar Mr.LDA.jar cc.mrlda.VariationalInference -input /hadoop/index/document/output/directory/test-data -output /hadoop/mrlda/output/test-output -term 60000 -topic 100 -iteration 100 -modelindex 40 -test /hadoop/mrlda/output/directory

This command tests the model at iteration 40 from the training output /hadoop/mrlda/output/directory and runs 100 iterations on the test data /hadoop/index/document/output/directory/test-data. Take note that the -test option specifies the training output, and -modelindex specifies the model index from the training output.

Informed Prior

An informed prior guides latent Dirichlet allocation toward topics that are of particular interest. A typical informed prior word list looks like the following, where every row is a set of words that belong (or "should" belong) to the same topic.

foreign eastern western domestic immigration foreigners ethnic immigrants cultural culture easterns westerners westernstyle immigrant
believe church hope believed determine christian religious christmas believes god determined fatal islamic faith christ jesus fate christopher christians churches belief religion gods christies fatalities saint islam beliefs faithful fatally determining bible lord ritual soul destined determination mosque churchs blessing destiny fatality christine saints godfather
fighting fight battle challenge argued arguments fought challenger fighters threw dominated riot argument challenged fighter knife argue battles confrontation stones cruel challenges challenging battling disagreed disagree fights disagreement knives challengers domination battled dominate
military war chief service army corp troops soldiers officer officers corps combat marine wars veterans soldier troop veteran marines
private person identified personal concern concerned concerns basis natural affected affect identify nature identification tend character concerning identity personally affecting core characters naturalization characterized personality tendency selfdefense identities affects characteristics selfdetermination naturally foundations identical
...

Let us refer to the above content as an informed prior file in HDFS --- /hadoop/raw/text/input/informed-prior.txt. To generate an informed prior acceptable to Mr. LDA, with the correct word index mapping, please run the following command

hadoop jar Mr.LDA.jar cc.mrlda.InformedPrior -input /hadoop/raw/text/input/informed-prior.txt -output /hadoop/index/document/output/directory/prior -index /hadoop/index/document/output/directory/term

To print the help information and usage hints, please run the following command

hadoop jar Mr.LDA.jar cc.mrlda.InformedPrior -help

By the end of the execution, you should get an informed prior file with the correct index mapping, ready for training topics using Mr. LDA. For example:

hadoop fs -ls /hadoop/index/document/output/directory/
Found 4 items
drwxr-xr-x   - user supergroup          0 2012-01-12 12:18 /hadoop/index/document/output/directory/document
-rw-r--r--   3 user supergroup         57 2012-01-12 12:25 /hadoop/index/document/output/directory/prior
-rw-r--r--   3 user supergroup        282 2012-01-12 12:18 /hadoop/index/document/output/directory/term
-rw-r--r--   3 user supergroup        189 2012-01-12 12:18 /hadoop/index/document/output/directory/title

To train LDA model on a dataset with informed prior, please run the following command

hadoop jar Mr.LDA.jar cc.mrlda.VariationalInference -input /hadoop/index/document/output/directory/document -informedprior /hadoop/index/document/output/directory/prior -output /hadoop/mrlda/output/directory -term 60000 -topic 100

After Running Experiments

The output is a set of parameters in sequence file format. In your output folder, you will see a set of 'beta-*' files, 'alpha-*' files, and 'document-*' directories. 'alpha-*' files hold the hyperparameters, 'beta-*' files hold the distribution over words for each topic, and 'document-*' directories hold the topic distribution for each document (where '*' is the iteration index).

To display the top 20 ranked words of each topic, access the 'beta-*' file using the following command

hadoop jar Mr.LDA.jar cc.mrlda.DisplayTopic -input /hadoop/mrlda/output/directory/beta-* -term /hadoop/index/document/output/document/term -topdisplay 20

Set the '-topdisplay' option to a very large value to display all the words in each topic. Note that the output scores are sorted and in log scale.

To display the distribution over all topics for each document, access the 'document-*' file using the following command

hadoop jar Mr.LDA.jar cc.mrlda.DisplayDocument -input /path/to/document-*

To display the hyper-parameters, access the 'alpha-*' file using the following command

hadoop jar Mr.LDA.jar edu.umd.cloud9.io.ReadSequenceFile /path/to/alpha-*

You may refer to the -help option for further information.
