Topic modeling with word vectors

lda2vec

A PyTorch implementation of Moody's lda2vec, a way of doing topic modeling with word embeddings.
The original paper: Mixing Dirichlet Topic Models and Word Embeddings to Make lda2vec.

Warning: in my experience it is quite hard to make the lda2vec algorithm work.
Sometimes it finds a couple of topics, sometimes not, and many of the topics it does find are a total mess.
The algorithm is prone to poor local minima and depends heavily on the initial topic assignments.

For my results see 20newsgroups/explore_trained_model.ipynb. Also see Implementation details below.

Loss

The training proceeds as follows. First, convert a document corpus to a set of tuples
{(document id, word, the window around the word) | for each word in the corpus}.
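This corpus-to-tuples step might look like the following sketch (the function and parameter names are illustrative, not from the repo):

```python
def make_windows(corpus, window_size=5):
    """corpus: a list of documents, each a list of token ids.
    Yields (document id, pivot word, context window) tuples,
    one per word in the corpus."""
    for doc_id, doc in enumerate(corpus):
        for i, pivot in enumerate(doc):
            # words to the left and right of the pivot, clipped at the edges
            left = doc[max(0, i - window_size):i]
            right = doc[i + 1:i + 1 + window_size]
            yield doc_id, pivot, left + right

# toy corpus of two documents
corpus = [[1, 2, 3, 4], [5, 6, 7]]
tuples = list(make_windows(corpus, window_size=2))
# the first tuple is (0, 1, [2, 3]): document 0, pivot word 1, its window
```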
Second, for each tuple maximize the following objective function

Σ_i [ log σ(c · w_i) + Σ_k log σ(−c · w_k) ] + λ Σ_j log p_j

where:

  • c — the context vector,
  • w — the embedding vector of a word,
  • lambda — a positive constant that controls sparsity,
  • i — runs over the words in the window around the pivot word,
  • k — runs over sampled negative words,
  • j — runs over topics,
  • p — the probability distribution over topics for a document,
  • t — the topic vectors.
When training I also shuffle and batch the tuples.
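Putting the glossary together, the per-tuple objective can be sketched in NumPy. This is only a sketch: it assumes the standard lda2vec construction of the context vector, c = pivot word vector + Σ_j p_j t_j, which the repo's code may implement differently, and the value of `lam` is illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def objective(pivot_vec, window_vecs, negative_vecs,
              doc_topic_probs, topic_vecs, lam=1.0):
    """One tuple's objective (to be maximized), following the
    variable glossary above. Shapes: pivot_vec (d,), window_vecs (i, d),
    negative_vecs (k, d), doc_topic_probs (j,), topic_vecs (j, d)."""
    # document vector: topic-probability-weighted sum of topic vectors
    # (assumption: the standard lda2vec construction of c)
    doc_vec = doc_topic_probs @ topic_vecs
    c = pivot_vec + doc_vec
    # negative-sampling terms: pull window words toward c,
    # push sampled negative words away from it
    positive = np.log(sigmoid(window_vecs @ c)).sum()
    negative = np.log(sigmoid(-(negative_vecs @ c))).sum()
    # sparsity term on the document's topic distribution
    sparsity = lam * np.log(doc_topic_probs).sum()
    return positive + negative + sparsity
```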

How to use it

  1. Go to 20newsgroups/.
  2. Run get_windows.ipynb to prepare data.
  3. Run python train.py for training.
  4. Run explore_trained_model.ipynb.

To use this on your own data, edit get_windows.ipynb. Hyperparameters are set in 20newsgroups/train.py, utils/training.py, and utils/lda2vec_loss.py.

Implementation details

  • I use vanilla LDA to initialize lda2vec (the topic assignments for each document). This is not like in the original paper, and it is not how the algorithm is supposed to work, but without this initialization the results are quite bad.
    I also apply a temperature to smooth the initialization, in the hope that lda2vec will have a chance to find better topic assignments.
  • I add noise to some gradients while training.
  • I reweight loss according to document lengths.
  • Before training lda2vec, I train a 50-dimensional skip-gram word2vec model to initialize the word embeddings.
  • For text preprocessing:
    1. do word lemmatization
    2. remove rare and frequent words
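The temperature smoothing of the LDA initialization mentioned above might look like this sketch (the function name and temperature value are illustrative assumptions, not taken from the repo):

```python
import numpy as np

def smooth_with_temperature(doc_topic_probs, temperature=5.0):
    """Flatten a vanilla-LDA document-topic distribution so that
    lda2vec can still move away from the initialization.
    temperature > 1 flattens; temperature = 1 is a no-op."""
    # divide log-probabilities by the temperature, then re-normalize
    # with a numerically stable softmax
    logits = np.log(doc_topic_probs + 1e-12) / temperature
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

p = np.array([0.9, 0.05, 0.05])
q = smooth_with_temperature(p, temperature=5.0)
# q still peaks at the first topic, but much less sharply than p
```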

Requirements

  • pytorch 0.2, spacy 1.9, gensim 3.0
  • numpy, sklearn, tqdm
  • matplotlib, Multicore-TSNE
