Requirements
- python2.7 (experimental python3 code is available in the python3 branch, but it still requires testing)
- gensim: pip install gensim
- tensorflow 0.8-0.12
Data Format
- One line per document
- Sentences are delimited by tabs in each document
- See examples in data/
- Datasets from the ACL 2017 paper: AP News, BNC and IMDB
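The format above can be read and written with a few lines of Python. This is a minimal sketch rather than the loader TDLM itself uses: the helper names read_documents and write_documents are hypothetical, and UTF-8 encoding and the skipping of blank lines are assumptions not stated in this README.

```python
import io
import os
import tempfile


def read_documents(path):
    """Return a list of documents, each a list of sentence strings."""
    docs = []
    # Assumption: files are UTF-8 encoded.
    with io.open(path, encoding="utf-8") as f:
        for line in f:
            # One line per document; tabs separate the sentences.
            sentences = [s.strip() for s in line.rstrip("\n").split("\t")]
            sentences = [s for s in sentences if s]
            if sentences:  # assumption: blank lines are ignored
                docs.append(sentences)
    return docs


def write_documents(docs, path):
    """Write documents back out, one per line, sentences joined by tabs."""
    with io.open(path, "w", encoding="utf-8") as f:
        for sentences in docs:
            f.write(u"\t".join(sentences) + u"\n")


# Round-trip a toy corpus through a temporary file.
toy_docs = [
    [u"the cat sat on the mat .", u"it purred ."],
    [u"stocks fell sharply today ."],
]
tmp_path = os.path.join(tempfile.mkdtemp(), "toy.txt")
write_documents(toy_docs, tmp_path)
loaded_docs = read_documents(tmp_path)
```

Pointing read_documents at one of the files in data/ should give one list of sentences per document, assuming those files follow the format described here.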
Running the code (example.sh)
Train a word2vec model using gensim. This step is optional; you only need it if you want to initialise TDLM with pre-trained embeddings. The word2vec model settings are defined in the script itself (word2vec_train.py)
python word2vec_train.py
Train a TDLM model. Configuration and hyper-parameters are defined in tdlm_config.py
python tdlm_train.py
All test inferences are run through tdlm_test.py. For example, to compute language model and topic model perplexity
python tdlm_test.py -m output/toy-model/ -d data/toy-valid.txt --print_perplexity
Print topics (to topics.txt)
python tdlm_test.py -m output/toy-model/ -d data/toy-valid.txt --output_topic topics.txt
Infer the topic distribution of documents (saved as an npy file)
python tdlm_test.py -m output/toy-model/ -d data/toy-valid.txt --output_topic_dist topic-dist.npy
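The saved npy file can be loaded with numpy for further analysis. The sketch below assumes the array is two-dimensional, with one row per input document and one column per topic; this README does not confirm that layout, so check the shape of your own output first. The helper top_topics is hypothetical, and a small stand-in array is saved and reloaded so the example runs without a trained model.

```python
import os
import tempfile

import numpy as np


def top_topics(topic_dist, top_k=3):
    """Return, for each row, the indices of the top_k highest-weight topics."""
    topic_dist = np.asarray(topic_dist)
    # Sort each row in ascending order, reverse it, and keep the first top_k.
    return np.argsort(topic_dist, axis=1)[:, ::-1][:, :top_k]


# Stand-in for topic-dist.npy (assumed shape: documents x topics).
demo_dist = np.array([
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
])
demo_path = os.path.join(tempfile.mkdtemp(), "topic-dist.npy")
np.save(demo_path, demo_dist)

# Load the file the same way you would load the real output.
loaded_dist = np.load(demo_path)
best = top_topics(loaded_dist, top_k=2)
```

With a real model, replace demo_path with the path passed to --output_topic_dist. The topic indices can then be matched against the topics written by --output_topic, assuming both outputs use the same topic ordering.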
Generate sentences conditioned on topics
python tdlm_test.py -m output/toy-model/ -d data/toy-valid.txt --gen_sent_on_topic topic-sents.txt
tdlm_test.py arguments:
usage: tdlm_test.py [-h] -m MODEL_DIR [-d INPUT_DOC] [-l INPUT_LABEL]
                    [-t INPUT_TAG] [--print_perplexity] [--print_acc]
                    [--output_topic OUTPUT_TOPIC]
                    [--output_topic_dist OUTPUT_TOPIC_DIST]
                    [--output_tag_embedding OUTPUT_TAG_EMBEDDING]
                    [--gen_sent_on_topic GEN_SENT_ON_TOPIC]
                    [--gen_sent_on_doc GEN_SENT_ON_DOC]

Given a trained TDLM model, perform various test inferences

optional arguments:
  -h, --help            show this help message and exit
  -m MODEL_DIR, --model_dir MODEL_DIR
                        directory of the saved model
  -d INPUT_DOC, --input_doc INPUT_DOC
                        input file containing the test documents
  -l INPUT_LABEL, --input_label INPUT_LABEL
                        input file containing the test labels
  -t INPUT_TAG, --input_tag INPUT_TAG
                        input file containing the test tags
  --print_perplexity    print topic and language model perplexity of the input
                        test documents
  --print_acc           print supervised classification accuracy
  --output_topic OUTPUT_TOPIC
                        output file to save the topics (prints top-N words of
                        each topic)
  --output_topic_dist OUTPUT_TOPIC_DIST
                        output file to save the topic distribution of input
                        docs (npy format)
  --output_tag_embedding OUTPUT_TAG_EMBEDDING
                        output tag embeddings to file (npy format)
  --gen_sent_on_topic GEN_SENT_ON_TOPIC
                        generate sentences conditioned on topics
  --gen_sent_on_doc GEN_SENT_ON_DOC
                        generate sentences conditioned on input test documents
Publication
Jey Han Lau, Timothy Baldwin and Trevor Cohn (2017). Topically Driven Neural Language Model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (ACL 2017), Vancouver, Canada, pp. 355--365.