A repository of learning & R resources related to topic models

Topic Models Learning and R Resources

This is a collection documenting the resources I find related to topic models, with an R-flavored focus. A topic model is a type of generative model used to "discover" the latent topics that compose a corpus or collection of documents. Topic modeling is typically applied to a collection of text documents, but it can also be used with other modalities, for example to generate captions for images.
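To make "generative model" concrete, here is a minimal base R sketch of the generative story usually told for LDA: each topic is a distribution over words, each document is a mixture of topics, and every word is produced by first picking a topic and then picking a word from it. The vocabulary, sizes, hyperparameters, and the helper `rdirichlet` are all illustrative assumptions, not part of any package used later in this document.

```r
# Toy simulation of an LDA-style generative process (base R only).
# All sizes and hyperparameters below are illustrative assumptions.
set.seed(1)

# Draw n samples from a Dirichlet(alpha) by normalising gamma draws
rdirichlet <- function(n, alpha) {
    draws <- matrix(rgamma(n * length(alpha), shape = alpha),
        nrow = n, byrow = TRUE)
    draws / rowSums(draws)
}

vocab    <- c("tax", "job", "economy", "war", "iran", "troop")
n_topics <- 2
n_docs   <- 3
doc_len  <- 20

# Each topic is a distribution over the vocabulary (rows sum to 1)
beta  <- rdirichlet(n_topics, rep(0.5, length(vocab)))
# Each document is a mixture over topics (rows sum to 1)
theta <- rdirichlet(n_docs, rep(0.5, n_topics))

# For every word slot: pick a topic from theta, then a word from that topic
docs <- lapply(seq_len(n_docs), function(d) {
    z <- sample(n_topics, doc_len, replace = TRUE, prob = theta[d, ])
    vapply(z, function(k) sample(vocab, 1, prob = beta[k, ]), character(1))
})

docs[[1]]
```

Fitting a topic model runs this story in reverse: given only the words in `docs`, it estimates plausible values for `beta` and `theta`.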

Just the Essentials

This is my rundown of the minimal readings, websites, videos, & scripts the reader needs to become familiar with topic modeling. The list is in the order I believe will be of greatest use and contains a nice mix of introduction, theory, application, and interpretation. As you want to learn more about topic modeling, the other sections will become more useful.

  1. Boyd-Graber, J. (2013). Computational Linguistics I: Topic Modeling
  2. Underwood, T. (2012). Topic Modeling Made Just Simple Enough
  3. Weingart, S. (2012). Topic Modeling for Humanists: A Guided Tour
  4. Blei, D. M. (2012). Probabilistic topic models. Communications of the ACM, 55(4), 77-84. doi:10.1145/2133806.2133826
  5. inkhorn82 (2014). A Delicious Analysis! (aka topic modelling using recipes) (CODE)
  6. Grün, B. & Hornik, K. (2011). topicmodels: An R Package for Fitting Topic Models. Journal of Statistical Software, 40(13), 1-30.
  7. Marwick, B. (2014a). The input parameters for using latent Dirichlet allocation
  8. Tang, J., Meng, Z., Nguyen, X., Mei, Q., & Zhang, M. (2014). Understanding the limiting factors of topic modeling via posterior contraction analysis. In 31st International Conference on Machine Learning, 190-198.
  9. Sievert, C. (2014). LDAvis: A method for visualizing and interpreting topic models
  10. Rhody, L. M. (2012). Some Assembly Required: Understanding and Interpreting Topics in LDA Models of Figurative Language
  11. Rinker, T.W. (2015). R Script: Example Topic Model Analysis

Key Players

Papadimitriou, Raghavan, Tamaki, & Vempala (1997) first introduced the notion of topic modeling in their "Latent Semantic Indexing: A probabilistic analysis". Thomas Hofmann (1999) developed "Probabilistic latent semantic indexing". Blei, Ng, & Jordan (2003) proposed latent Dirichlet allocation (LDA) as a means of modeling documents with multiple topics, but LDA assumes the topics are uncorrelated. Blei & Lafferty (2007) proposed the correlated topic model (CTM), extending LDA to allow for correlations between topics. Roberts, Stewart, Tingley, & Airoldi (2013) proposed the Structural Topic Model (STM), which allows meta-data to be included in the modeling process.

Videos

Introductory

Theory

Visualization

Articles

Applied

Theoretical

Websites & Blogs

R Resources

Package Comparisons

Package | Functionality | Pluses | Author | R Language Interface
--- | --- | --- | --- | ---
lda* | Collapsed Gibbs for LDA | Graphing utilities | Chang | R
topicmodels | LDA and CTM | Follows Blei's implementation; great vignette; takes DTM | Grün & Hornik | C
stm | Model w/ meta-data | Great documentation; nice visualization | Roberts, Stewart, & Tingley | C
LDAvis | Interactive visualization | Aids in model interpretation | Sievert & Shirley | R + Shiny
mallet** | LDA | MALLET is well known | Mimno | Java

*StackExchange discussion of lda vs. topicmodels
**Setting Up MALLET

R Specific References

Example Modeling

Topic Modeling R Demo

topicmodels Package

The .R script for this demonstration can be downloaded from scripts/Example_topic_model_analysis.R

Install/Load Tools & Data

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/gofastr")
pacman::p_load(tm, topicmodels, dplyr, tidyr, igraph, devtools, LDAvis, ggplot2)

## Source topicmodels2LDAvis & optimal_k functions
invisible(lapply(
    file.path(
        "https://raw.githubusercontent.com/trinker/topicmodels_learning/master/functions", 
        c("topicmodels2LDAvis.R", "optimal_k.R")
    ),
    devtools::source_url
))

## SHA-1 hash of file is 5ac52af21ce36dfe8f529b4fe77568ced9307cf0
## SHA-1 hash of file is 7f0ab64a94948c8b60ba29dddf799e3f6c423435

data(presidential_debates_2012)

Generate Stopwords

stops <- c(
        tm::stopwords("english"),
        tm::stopwords("SMART"),
        "governor", "president", "mister", "obama","romney"
    ) %>%
    gofastr::prep_stopwords() 

Create the DocumentTermMatrix

doc_term_mat <- presidential_debates_2012 %>%
    with(gofastr::q_dtm_stem(dialogue, paste(person, time, sep = "_"))) %>%           
    gofastr::remove_stopwords(stops, stem=TRUE) %>%                                                    
    gofastr::filter_tf_idf() %>%
    gofastr::filter_documents() 
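
For readers new to the structure built above, a document-term matrix has one row per document and one column per term, with each cell holding a count. The base R sketch below builds a tiny one by hand. The texts, the stopword list `toy_stops`, and the tokenizing rule are hypothetical, and the sketch skips stemming and tf-idf filtering, so it illustrates the data structure only and is not how gofastr is implemented.

```r
# Hypothetical mini-corpus keyed by "person_time", mirroring the grouping used above
texts <- c(
    OBAMA_time1  = "we will create jobs and cut taxes",
    ROMNEY_time1 = "i will cut taxes and create more jobs",
    LEHRER_time1 = "let us move to the next question"
)
toy_stops <- c("we", "i", "will", "and", "to", "the", "us", "let")

# Lower-case, split on anything that is not a letter, drop empties and stopwords
tokens <- lapply(strsplit(tolower(texts), "[^a-z]+"), function(x) {
    x <- x[nzchar(x)]
    x[!x %in% toy_stops]
})
names(tokens) <- names(texts)

# One column per unique term, one row per document, cells are term counts
vocab <- sort(unique(unlist(tokens)))
dtm <- t(vapply(
    tokens,
    function(x) tabulate(match(x, vocab), nbins = length(vocab)),
    integer(length(vocab))
))
dimnames(dtm) <- list(names(texts), vocab)

dtm
```

Real corpora produce very wide, mostly empty matrices, which is why packages such as tm store the DocumentTermMatrix in a sparse format.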

Control List

control <- list(burnin = 500, iter = 1000, keep = 100, seed = 2500)

Determine Optimal Number of Topics

The plot below shows the harmonic mean of the log likelihoods against k (number of topics).

(k <- optimal_k(doc_term_mat, 40, control = control))

## 
## Grab a cup of coffee this could take a while...

## 10 of 40 iterations (Current: 08:54:32; Elapsed: .2 mins)
## 20 of 40 iterations (Current: 08:55:07; Elapsed: .8 mins; Remaining: ~2.3 mins)
## 30 of 40 iterations (Current: 08:56:03; Elapsed: 1.7 mins; Remaining: ~1.3 mins)
## 40 of 40 iterations (Current: 08:57:30; Elapsed: 3.2 mins; Remaining: ~0 mins)
## Optimal number of topics = 20

It appears the optimal number of topics is ~k = 20.
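
The criterion used above, the harmonic mean of the sampled likelihoods, can be computed stably on the log scale. The sketch below is one way to do that in base R; the helper name `log_harmonic_mean` and the log-likelihood values are hypothetical, and the sourced `optimal_k` function may compute the quantity differently.

```r
# Log of the harmonic mean of likelihoods, given their logs.
# HM = n / sum(1 / L_i), so log(HM) = log(n) - log(sum(exp(-logL_i))).
log_harmonic_mean <- function(log_liks) {
    neg <- -log_liks
    m   <- max(neg)    # shift by the maximum so exp() cannot overflow
    log(length(neg)) - (m + log(sum(exp(neg - m))))
}

# Hypothetical post-burn-in log-likelihoods for two candidate values of k
ll_k10 <- c(-10510, -10495, -10502, -10490)
ll_k20 <- c(-10210, -10198, -10205, -10201)

log_harmonic_mean(ll_k10)
log_harmonic_mean(ll_k20)   # less negative, so k = 20 would be preferred here
```

Because the harmonic mean is dominated by the smallest likelihoods, a single poor draw can pull the estimate down noticeably. That is one reason to treat the resulting "optimal" k as a starting point rather than a definitive answer.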

Run the Model

control[["seed"]] <- 100
lda_model <- topicmodels::LDA(doc_term_mat, k=as.numeric(k), method = "Gibbs", 
    control = control)

Plot the Topics Per Person & Time

topics <- topicmodels::posterior(lda_model, doc_term_mat)[["topics"]]
topic_dat <- dplyr::add_rownames(as.data.frame(topics), "Person_Time")
colnames(topic_dat)[-1] <- apply(terms(lda_model, 10), 2, paste, collapse = ", ")

tidyr::gather(topic_dat, Topic, Proportion, -c(Person_Time)) %>%
    tidyr::separate(Person_Time, c("Person", "Time"), sep = "_") %>%
    dplyr::mutate(Person = factor(Person, 
        levels = c("OBAMA", "ROMNEY", "LEHRER", "SCHIEFFER", "CROWLEY", "QUESTION" ))
    ) %>%
    ggplot2::ggplot(ggplot2::aes(weight=Proportion, x=Topic, fill=Topic)) +
        ggplot2::geom_bar() +
        ggplot2::coord_flip() +
        ggplot2::facet_grid(Person~Time) +
        ggplot2::guides(fill = "none") +
        ggplot2::ylab("Proportion")    # bar lengths map to y; coord_flip() draws y horizontally

Plot the Topics Matrix as a Heatmap

heatmap(topics, scale = "none")

Network of the Word Distributions Over Topics (Topic Relation)

post <- topicmodels::posterior(lda_model)

cor_mat <- cor(t(post[["terms"]]))
cor_mat[ cor_mat < .05 ] <- 0
diag(cor_mat) <- 0

graph <- graph.adjacency(cor_mat, weighted=TRUE, mode="lower")
graph <- delete.edges(graph, E(graph)[ weight < 0.05])

E(graph)$edge.width <- E(graph)$weight*20
V(graph)$label <- paste("Topic", V(graph))
V(graph)$size <- colSums(post[["topics"]]) * 15

par(mar=c(0, 0, 3, 0))
set.seed(110)
plot.igraph(graph, edge.width = E(graph)$edge.width, 
    edge.color = "orange", vertex.color = "orange", 
    vertex.frame.color = NA, vertex.label.color = "grey30")
title("Strength Between Topics Based On Word Probabilities", cex.main=.8)
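
The thresholding step above decides which topics end up connected. A small base R sketch with a hypothetical 3-topic, 5-term probability matrix (`toy_terms`) shows which pairs survive the same 0.05 cutoff:

```r
# Hypothetical topic-by-term probabilities: 3 topics over 5 terms (rows sum to 1)
toy_terms <- rbind(
    c(0.40, 0.30, 0.20, 0.05, 0.05),
    c(0.35, 0.35, 0.10, 0.10, 0.10),
    c(0.05, 0.05, 0.10, 0.40, 0.40)
)
rownames(toy_terms) <- paste("Topic", 1:3)

# cor() correlates columns, so transpose to correlate topics with each other
toy_cor <- cor(t(toy_terms))
toy_cor[toy_cor < 0.05] <- 0   # drop weak and negative associations
diag(toy_cor) <- 0             # no self-loops

# Non-zero entries in the lower triangle are the edges that would be drawn
edges <- which(toy_cor > 0 & lower.tri(toy_cor), arr.ind = TRUE)
edges
```

Topics 1 and 2 put most of their mass on the same terms and stay linked; Topic 3 is negatively correlated with both and ends up isolated.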

Network of the Topics Over Documents (Topic Relation)

minval <- .1
topic_mat <- topicmodels::posterior(lda_model)[["topics"]]

graph <- graph_from_incidence_matrix(topic_mat, weighted=TRUE)
graph <- delete.edges(graph, E(graph)[ weight < minval])

E(graph)$edge.width <- E(graph)$weight*17
E(graph)$color <- "blue"
V(graph)$color <- ifelse(grepl("^\\d+$", V(graph)$name), "grey75", "orange")
V(graph)$frame.color <- NA
V(graph)$label <- ifelse(grepl("^\\d+$", V(graph)$name), paste("topic", V(graph)$name), gsub("_", "\n", V(graph)$name))
V(graph)$size <- c(rep(10, nrow(topic_mat)), colSums(topic_mat) * 20)
V(graph)$label.color <- ifelse(grepl("^\\d+$", V(graph)$name), "red", "grey30")

par(mar=c(0, 0, 3, 0))
set.seed(369)
plot.igraph(graph, edge.width = E(graph)$edge.width, 
    vertex.color = adjustcolor(V(graph)$color, alpha.f = .4))
title("Topic & Document Relationships", cex.main=.8)
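
The bipartite graph above keeps a document-topic edge only when the topic's proportion in that document is at least `minval`. The same filtering can be sketched without igraph; the matrix `toy_topics` below is hypothetical.

```r
# Hypothetical document-by-topic proportions (rows sum to 1)
toy_topics <- rbind(
    OBAMA_time1  = c(0.70, 0.25, 0.05),
    ROMNEY_time1 = c(0.08, 0.52, 0.40)
)
colnames(toy_topics) <- as.character(1:3)
minval <- 0.1

# Keep only document-topic pairs whose proportion reaches the threshold
keep <- which(toy_topics >= minval, arr.ind = TRUE)
edge_list <- data.frame(
    document = rownames(toy_topics)[keep[, "row"]],
    topic    = colnames(toy_topics)[keep[, "col"]],
    weight   = toy_topics[keep]
)

edge_list
```

Raising `minval` thins the graph to each document's dominant topics, while lowering it shows weaker associations at the cost of a more cluttered plot.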

LDAvis of Model

The output from LDAvis is not easily embedded within an R Markdown document; however, the reader may see the results here.

lda_model %>%
    topicmodels2LDAvis() %>%
    LDAvis::serVis()

Apply Model to New Data

## Create the DocumentTermMatrix for New Data
doc_term_mat2 <- partial_republican_debates_2015 %>%
    with(gofastr::q_dtm_stem(dialogue, paste(person, location, sep = "_"))) %>%           
    gofastr::remove_stopwords(stops, stem=TRUE) %>%                                                    
    gofastr::filter_tf_idf() %>%
    gofastr::filter_documents() 


## Update Control List
control2 <- control
control2[["estimate.beta"]] <- FALSE


## Run the Model for New Data
lda_model2 <- topicmodels::LDA(doc_term_mat2, k = k, model = lda_model, 
    control = list(seed = 100, estimate.beta = FALSE))


## Plot the Topics Per Person & Location for New Data
topics2 <- topicmodels::posterior(lda_model2, doc_term_mat2)[["topics"]]
topic_dat2 <- dplyr::add_rownames(as.data.frame(topics2), "Person_Location")
colnames(topic_dat2)[-1] <- apply(terms(lda_model2, 10), 2, paste, collapse = ", ")

tidyr::gather(topic_dat2, Topic, Proportion, -c(Person_Location)) %>%
    tidyr::separate(Person_Location, c("Person", "Location"), sep = "_") %>%
    ggplot2::ggplot(ggplot2::aes(weight=Proportion, x=Topic, fill=Topic)) +
        ggplot2::geom_bar() +
        ggplot2::coord_flip() +
        ggplot2::facet_grid(Person~Location) +
        ggplot2::guides(fill = "none") +
        ggplot2::ylab("Proportion")    # bar lengths map to y; coord_flip() draws y horizontally


## LDAvis of Model for New Data
lda_model2 %>%
    topicmodels2LDAvis() %>%
    LDAvis::serVis()

Contributing

You are welcome to:
