  • Stars: 183
  • Rank: 210,154 (Top 5%)
  • Language: JavaScript
  • License: MIT License
  • Created over 11 years ago
  • Updated over 2 years ago

jsLDA

An implementation of latent Dirichlet allocation (LDA) in JavaScript. A live demonstration is available.

Instructions:

When you first load the page, it will request a file containing documents and a file containing stopwords. The default example is a corpus of paragraphs from US State of the Union speeches.

Click the "Run 50 iterations" button to start training. The browser may appear to freeze for a while. Initially, every word token is assigned to a topic at random. We train the model by cycling through every word token in the documents and sampling a new topic for that token. An "iteration" corresponds to one pass through the documents.
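The training loop described above can be sketched as a collapsed Gibbs sampler. This is a minimal illustration, not the code used by the page: the function names (`initModel`, `sweep`), the data layout (documents as arrays of integer word IDs), and the hyperparameter defaults `alpha` and `beta` are all assumptions made for the example.

```javascript
// Illustrative sketch only; names, layout, and defaults are assumptions.
// Each document is an array of integer word IDs in [0, vocabSize).
function initModel(docs, numTopics, vocabSize) {
  const model = {
    numTopics,
    vocabSize,
    docTopicCounts: docs.map(() => new Array(numTopics).fill(0)),
    topicWordCounts: Array.from({ length: numTopics }, () =>
      new Array(vocabSize).fill(0)),
    tokensPerTopic: new Array(numTopics).fill(0),
    assignments: [],
  };
  docs.forEach((doc, d) => {
    // Start with a uniformly random topic for every token.
    const z = doc.map((w) => {
      const t = Math.floor(Math.random() * numTopics);
      model.docTopicCounts[d][t]++;
      model.topicWordCounts[t][w]++;
      model.tokensPerTopic[t]++;
      return t;
    });
    model.assignments.push(z);
  });
  return model;
}

// One "iteration": a single pass over every token in every document.
function sweep(docs, model, alpha = 0.1, beta = 0.01) {
  const { numTopics, vocabSize } = model;
  const weights = new Array(numTopics);
  docs.forEach((doc, d) => {
    doc.forEach((w, i) => {
      // Remove the token's current assignment from the counts.
      const oldTopic = model.assignments[d][i];
      model.docTopicCounts[d][oldTopic]--;
      model.topicWordCounts[oldTopic][w]--;
      model.tokensPerTopic[oldTopic]--;

      // Unnormalized conditional probability of each topic for this token.
      let sum = 0;
      for (let t = 0; t < numTopics; t++) {
        weights[t] =
          ((model.docTopicCounts[d][t] + alpha) *
            (model.topicWordCounts[t][w] + beta)) /
          (model.tokensPerTopic[t] + vocabSize * beta);
        sum += weights[t];
      }

      // Draw a topic in proportion to the weights.
      let u = Math.random() * sum;
      let newTopic = 0;
      while (newTopic < numTopics - 1 && u >= weights[newTopic]) {
        u -= weights[newTopic];
        newTopic++;
      }

      // Record the new assignment.
      model.assignments[d][i] = newTopic;
      model.docTopicCounts[d][newTopic]++;
      model.topicWordCounts[newTopic][w]++;
      model.tokensPerTopic[newTopic]++;
    });
  });
}

// Usage: three tiny documents, five word types, two topics, 50 iterations.
const exampleDocs = [[0, 1, 2, 1], [2, 3, 3], [0, 0, 4, 4, 4]];
const exampleModel = initModel(exampleDocs, 2, 5);
for (let iter = 0; iter < 50; iter++) sweep(exampleDocs, exampleModel);
```

Because each token is resampled in a plain loop on the main thread, a sketch like this would also block the page while it runs, which is consistent with the apparent freeze mentioned above.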

The topics on the left side of the page should now look more interesting. Run more iterations if you would like -- there's probably still a lot of room for improvement after only 50 iterations.

Once you're satisfied with the model, click a topic in the topic list to sort documents in descending order by their use of that topic. Proportions are weighted so that longer documents come first.
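One way to get a ranking in which longer documents come first is to add a smoothing constant to the topic count before dividing by the document length. The sketch below assumes that approach; the function name `rankDocsByTopic` and the default `smoothing` value of 10 are illustrative choices, not values confirmed by this README.

```javascript
// Illustrative sketch; the smoothing scheme and its default are assumptions.
// docTopicCounts[d][t] = number of tokens in document d assigned to topic t.
function rankDocsByTopic(docTopicCounts, topic, smoothing = 10) {
  return docTopicCounts
    .map((counts, docIndex) => {
      const length = counts.reduce((a, b) => a + b, 0);
      // Additive smoothing pulls short documents toward the uniform
      // proportion, so a long document with a high share outranks a
      // very short document with a slightly higher raw share.
      const score =
        (counts[topic] + smoothing) / (length + smoothing * counts.length);
      return { docIndex, score };
    })
    .sort((a, b) => b.score - a.score);
}
```

For example, with two topics, a 2-token document entirely in topic 0 scores 12/22 (about 0.55), while a 100-token document with 90 tokens in topic 0 scores 100/120 (about 0.83), so the longer document is listed first.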

You can also explore correlations between topics by clicking the "Topic Correlations" tab. This view shows a force-directed layout with connections between topics whose correlations are above a certain threshold. You can control this threshold with the slider: a low cutoff will display more edges, while a high cutoff will remove all but the strongest correlations.

Topic correlations are actually pointwise mutual information (PMI) scores. This score measures whether two topics occur in the same document more often than we would expect by chance. Previous versions of this script calculated correlations on log ratios; PMI is simpler to calculate.
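As a rough illustration of the PMI score and the cutoff slider, the sketch below counts a topic as "occurring" in a document when at least `minTokens` tokens are assigned to it, then computes PMI(a, b) = log( P(a, b) / (P(a) P(b)) ) over documents. The occurrence rule, the function names `topicPMI` and `edgesAbove`, and the treatment of never-co-occurring pairs as negative infinity are assumptions for the example.

```javascript
// Illustrative sketch; the occurrence rule and names are assumptions.
// docTopicCounts[d][t] = number of tokens in document d assigned to topic t.
function topicPMI(docTopicCounts, numTopics, minTokens = 1) {
  const numDocs = docTopicCounts.length;
  const occur = new Array(numTopics).fill(0);
  const cooccur = Array.from({ length: numTopics }, () =>
    new Array(numTopics).fill(0));

  for (const counts of docTopicCounts) {
    // Topics considered present in this document.
    const present = [];
    for (let t = 0; t < numTopics; t++) {
      if (counts[t] >= minTokens) present.push(t);
    }
    for (const a of present) {
      occur[a]++;
      for (const b of present) {
        if (a !== b) cooccur[a][b]++;
      }
    }
  }

  const pmi = Array.from({ length: numTopics }, () =>
    new Array(numTopics).fill(0));
  for (let a = 0; a < numTopics; a++) {
    for (let b = 0; b < numTopics; b++) {
      if (a === b) continue; // diagonal is left at 0 and never used
      // log( (co/N) / ((occ_a/N) * (occ_b/N)) ) = log( N * co / (occ_a * occ_b) )
      pmi[a][b] = cooccur[a][b] === 0
        ? Number.NEGATIVE_INFINITY
        : Math.log((numDocs * cooccur[a][b]) / (occur[a] * occur[b]));
    }
  }
  return pmi;
}

// Keep only topic pairs whose PMI exceeds the cutoff (the slider value).
function edgesAbove(pmi, cutoff) {
  const edges = [];
  for (let a = 0; a < pmi.length; a++) {
    for (let b = a + 1; b < pmi.length; b++) {
      if (pmi[a][b] > cutoff) edges.push({ source: a, target: b, pmi: pmi[a][b] });
    }
  }
  return edges;
}
```

Lowering `cutoff` lets more pairs through, which matches the behavior described for the slider; the resulting edge list is the kind of input a force-directed layout would consume.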

Using your own corpus:

The best way to use your own corpus is to place the files in this repository in the document root of a web server. Replace the files documents.txt and stoplist.txt with your own corpus and stop list. The documents file is a tab-delimited text file with one document per line. Each line has three fields:

[doc ID] [tab] [label] [tab] [text...]

(This is the default format for Mallet.) The "label" field is currently unused, but I plan to support timestamps, labels, etc.
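To make the three-field format concrete, here is a small parsing sketch. The function name `parseDocuments` and the choice to throw on lines with fewer than three fields are assumptions for illustration; the page itself may handle malformed lines differently.

```javascript
// Illustrative sketch; the name and error handling are assumptions.
// Parses "[doc ID] [tab] [label] [tab] [text...]" with one document per line.
function parseDocuments(fileText) {
  const docs = [];
  fileText.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return; // skip blank lines
    const fields = line.split("\t");
    if (fields.length < 3) {
      throw new Error("Line " + (index + 1) + " does not have three tab-separated fields");
    }
    const [id, label, ...rest] = fields;
    // Any further tabs are treated as part of the text field.
    docs.push({ id, label, text: rest.join("\t") });
  });
  return docs;
}

// Usage with two hypothetical documents.
const exampleCorpus = parseDocuments(
  "doc1\tspeechA\tFour score and seven years ago\n" +
  "doc2\tspeechB\tWe hold these truths\n"
);
```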

The format for stopwords is one word per line. The "Vocabulary" tab may be useful in customizing a stoplist. Unicode is supported, so most languages that have meaningful whitespace (i.e., not CJK) should work.
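The sketch below shows one way a one-word-per-line stoplist and Unicode-aware tokenization could fit together. It uses JavaScript's built-in Unicode property escapes (`\p{L}`, `\p{M}`) rather than whatever the page actually uses; the function names and the decision to lowercase and NFC-normalize tokens are assumptions.

```javascript
// Illustrative sketch; names and normalization choices are assumptions.
// Stoplist format: one word per line.
function parseStoplist(fileText) {
  return new Set(
    fileText
      .split(/\r?\n/)
      .map((line) => line.trim().normalize("NFC").toLowerCase())
      .filter((word) => word.length > 0)
  );
}

// Split text into runs of letters (plus combining marks) from any script,
// lowercase them, and drop stopwords. Scripts that do not separate words
// with whitespace or punctuation (such as CJK) come out as long runs.
function tokenize(text, stopwords) {
  const matches = text.normalize("NFC").toLowerCase().match(/[\p{L}\p{M}]+/gu) || [];
  return matches.filter((token) => !stopwords.has(token));
}

// Usage with a hypothetical two-word stoplist.
const exampleStopwords = parseStoplist("the\nof\n");
const exampleTokens = tokenize("The state of the Union", exampleStopwords);
```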

The page works best in Chrome. Safari and Firefox work too, but may be considerably slower. It doesn't seem to work in IE.

Download results:

You can create reports about your topic model. Hit the Downloads tab. Reports are in CSV format. The sampling state file contains the same information as a Mallet state file, but in a more compact format.

More Repositories

1. Mallet (Java, 981 stars): MALLET is a Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
2. info3300-spr2015 (HTML, 46 stars): Notes and in-class problems for ML+d3 course
3. RMallet (R, 38 stars): R package wrapping Mallet
4. anchor (Java, 37 stars): Mallet-compatible anchor-based topic model
5. jsLBFGS (JavaScript, 26 stars): A javascript implementation of limited-memory BFGS
6. info3300-spr2017 (HTML, 24 stars): Course materials for Data-Driven Web Applications
7. info3300-spr2016 (HTML, 16 stars): Notes and pre-class work for INFO/CS 3300 and INFO 5100
8. PyMallet (Jupyter Notebook, 15 stars): Python tools for text
9. info3300-spr2018 (HTML, 13 stars): Course materials for Data-Driven Web Applications
10. TidyMallet (C++, 12 stars): A tidy-native LDA implementation in Rcpp
11. info6150-fall2018 (Python, 9 stars): Resources for Advanced Topic Modeling (Fall 2018)
12. info-3350-fall-2017 (Python, 8 stars): Materials for "Text Mining for History and Literature"
13. admixture-ppc (Java, 5 stars): Posterior predictive checks for genetic admixture models
14. info-3350-fall-2019 (Jupyter Notebook, 4 stars)
15. arxivtopics (Python, 4 stars)
16. hathitools (Python, 3 stars): Tools for working with Hathi Trust Research Center extracted features files
17. CulturalAnalytics (Python, 3 stars): Articles from CA
18. GRMM (Java, 3 stars): Mallet-compatible graphical model toolkit
19. MalletPPC (Python, 3 stars): Posterior predictive checks for Mallet state files
20. naivebayes (2 stars): in-browser classification and analysis
21. ota (2 stars): Creative commons texts from Oxford Text Archive
22. TwelveMedievalGhostStories (2 stars): Stories transcribed by M.R. James from a manuscript from Byland Abbey
23. info-3350-fall-2015 (Python, 2 stars)
24. networks (Java, 1 star): Poisson network community model
25. tada2022 (1 star): Text as Data 2022