Latent Dirichlet Allocation for topic modeling of streamed data sources

Update

I am no longer maintaining this repo. A more up-to-date version is available at https://github.com/kzhai/InfVocLDA


STREAM VARIATIONAL BAYES FOR LATENT DIRICHLET ALLOCATION

Stream LDA implements a version of the LDA algorithm that accepts a continuous stream of documents. The model continues to learn new words and refine its topics over time, while keeping memory use within a constant bound.

Original implementation by Matthew D. Hoffman, (C) Copyright 2009, Matthew D. Hoffman

Extensions by Jessy Cowan-Sharp and Jordan Boyd-Graber


This is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License.

The GNU General Public License does not permit this software to be redistributed in proprietary programs.

This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA


This Python code is based on the implementation of the online Variational Bayes (VB) algorithm presented in the paper "Online Learning for Latent Dirichlet Allocation" by Matthew D. Hoffman, David M. Blei, and Francis Bach (NIPS 2010). It has been extended to support arbitrary streams of documents without fixing the vocabulary in advance or requiring knowledge of the total number of expected documents.
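As a rough illustration of what "arbitrary streams of documents" means in practice, the sketch below groups an iterator of unknown length into fixed-size minibatches without ever holding the whole corpus in memory. The function name minibatches and the batch_size parameter are hypothetical and are not taken from streamlda.py.

```python
from itertools import islice


def minibatches(stream, batch_size):
    """Yield lists of up to batch_size documents from any iterable.

    Hypothetical helper: the stream may be endless, and its total
    length is never needed.
    """
    it = iter(stream)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # the stream is exhausted
        yield batch


if __name__ == "__main__":
    docs = iter(["first doc", "second doc", "third doc"])
    for batch in minibatches(docs, 2):
        print(len(batch), batch)
```

Only one minibatch is alive at a time, so memory use depends on batch_size rather than on how many documents the stream eventually produces.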

The algorithm uses stochastic optimization to maximize the variational objective function for the Latent Dirichlet Allocation (LDA) topic model. It looks at only a subset of the corpus in each iteration, and can therefore find a locally optimal setting of the variational posterior over the topics more quickly than a batch VB algorithm could for large corpora.
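The core of this kind of stochastic update, as described in the Hoffman, Blei, and Bach paper, is a decaying step size rho_t = (tau0 + t)^(-kappa) that blends the current topic parameters with an estimate computed from the latest minibatch. The sketch below shows only that blending step in plain Python. The function names and default values are illustrative assumptions, and the code in streamlda.py may organize this differently.

```python
def learning_rate(t, tau0=1.0, kappa=0.7):
    """Step size rho_t = (tau0 + t) ** -kappa.

    The paper requires kappa in (0.5, 1] for convergence; the defaults
    here are illustrative only.
    """
    return (tau0 + t) ** (-kappa)


def blend(lam, lam_hat, rho):
    """Return (1 - rho) * lam + rho * lam_hat, entry by entry.

    lam is the current topic-word parameter matrix (a list of rows) and
    lam_hat is the estimate obtained from the current minibatch alone.
    """
    return [
        [(1.0 - rho) * old + rho * new for old, new in zip(row, row_hat)]
        for row, row_hat in zip(lam, lam_hat)
    ]


if __name__ == "__main__":
    lam = [[1.0, 3.0]]
    lam_hat = [[3.0, 1.0]]
    for t in range(3):
        lam = blend(lam, lam_hat, learning_rate(t))
        print(t, lam)
```

Because rho_t shrinks as t grows, early minibatches move the topics a long way and later ones make smaller corrections.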

Files provided:

  • streamlda.py: A package of functions for fitting LDA using stochastic optimization.
  • dirichlet_words.py: A class to represent the evolving vocabulary as probability distributions over words and topics. Provides backoff estimates of unseen words.
  • streamwikipedia.py: An example Python script that uses the functions in streamlda.py to fit a set of topics to the documents in Wikipedia.
  • wikirandom.py: A package of functions for downloading randomly chosen Wikipedia articles.
  • printtopics.py: A Python script that displays the topics fit using the functions in streamlda.py.
  • documentation.txt: More detailed commentary and implementation details.
  • readme.txt: This file.
  • LICENSE: A copy of the GNU General Public License, version 3.
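To give a feel for what a "backoff estimate of unseen words" can look like, here is a minimal sketch that smooths observed word counts toward a base distribution, so a word never seen before still gets nonzero probability. This is a generic additive-smoothing scheme chosen for illustration. The class BackoffWordDist, its methods, and the alpha parameter are hypothetical and do not describe the actual class in dirichlet_words.py.

```python
from collections import Counter


class BackoffWordDist:
    """Hypothetical smoothed word distribution for one topic.

    p(w) = (count(w) + alpha * base_prob(w)) / (total + alpha)
    """

    def __init__(self, base_prob, alpha=1.0):
        self.counts = Counter()
        self.total = 0
        self.base_prob = base_prob  # callable: word -> prior probability
        self.alpha = alpha          # weight given to the base distribution

    def observe(self, word, n=1):
        """Record n occurrences of word."""
        self.counts[word] += n
        self.total += n

    def prob(self, word):
        """Smoothed probability; unseen words fall back on base_prob."""
        prior = self.alpha * self.base_prob(word)
        return (self.counts[word] + prior) / (self.total + self.alpha)


if __name__ == "__main__":
    dist = BackoffWordDist(base_prob=lambda w: 0.5, alpha=2.0)
    dist.observe("cat", 3)
    print(dist.prob("cat"), dist.prob("dog"))
```

As more data arrives, the counts dominate and the base distribution matters mainly for rare and new words.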

Dependencies:

  • numpy
  • scipy
  • nltk

Example:

    python streamwikipedia.py 101
    python printtopics.py

This would run the algorithm for 101 iterations and display the topics fit by the algorithm (more precisely, their expected value under the variational posterior). Note that the algorithm will not have fully converged after 101 iterations; this is just to give an idea of how to use the code.
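For a Dirichlet variational posterior with parameters lambda, the expected value of each topic is the corresponding row of lambda normalized to sum to one. The sketch below computes that expectation and lists the most probable words per topic. The function names and the toy vocabulary are illustrative assumptions, not the interface of printtopics.py.

```python
def expected_topics(lam):
    """Normalize each row of lam: E[beta_kw] = lam_kw / sum over w of lam_kw."""
    return [[value / sum(row) for value in row] for row in lam]


def top_words(topic, vocab, n=3):
    """Return the n words with the highest probability in one topic."""
    order = sorted(range(len(topic)), key=lambda i: -topic[i])
    return [vocab[i] for i in order[:n]]


if __name__ == "__main__":
    vocab = ["star", "galaxy", "senate", "vote"]        # toy vocabulary
    lam = [[6.0, 3.0, 0.5, 0.5], [0.5, 0.5, 4.0, 5.0]]  # toy parameters
    for k, topic in enumerate(expected_topics(lam)):
        print("topic", k, top_words(topic, vocab, n=2))
```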
