• This repository has been archived on 11/Sep/2020
  • Language
    Python
  • License
    MIT License
  • Created over 6 years ago
  • Updated over 5 years ago

Repository Details

Tensorflow 1.5 implementation of Chris Moody's Lda2vec, adapted from @meereeum

Lda2vec-Tensorflow

Note

This is very much a research algorithm. It does not always work well, and it has to be trained for a long time. As the author noted in the paper, standard LDA will work better most of the time.

Note that you should run this algorithm for at least 100 epochs before expecting to see any results. The algorithm is meant to run for a very long time.

Usage

Installation

Warning: You may have to install dependencies manually before you can use the package. Requirements can be found here. Installing them into a clean environment should be enough. I am seeking help on this issue.

Clone the repo and run python setup.py install to install the package as is, or run python setup.py develop if you want to make your own edits.

You can also just pip install lda2vec (Last updated 3/13/19)

Pretrained Embeddings

This repo can load a wide variety of pretrained embedding files (see nlppipe.py for more info). The examples are all using GloVe embeddings. You can download them from here.
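For orientation, GloVe text files store one token per line followed by its vector components, separated by spaces. The following is a minimal sketch of parsing that format into a matrix aligned with a vocabulary. It is not the package's load_glove implementation: the function name build_embedding_matrix, the zero-vector fallback for missing words, and the tiny in-memory sample are all assumptions made for illustration.

```python
import io


def build_embedding_matrix(lines, word_to_idx, embed_size):
    """Parse GloVe-style lines into a list-of-lists matrix.

    Row i holds the vector for the word whose index is i in word_to_idx.
    Words missing from the file keep a zero vector (an assumption; a real
    loader might use random initialisation instead).
    """
    matrix = [[0.0] * embed_size for _ in range(len(word_to_idx))]
    for line in lines:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        # Skip words outside the vocabulary and malformed rows
        if word in word_to_idx and len(values) == embed_size:
            matrix[word_to_idx[word]] = [float(v) for v in values]
    return matrix


# Tiny stand-in for a real file such as glove.6B.300d.txt
sample = io.StringIO("the 0.1 0.2 0.3\ncat 0.4 0.5 0.6\n")
vocab = {"cat": 0, "dog": 1, "the": 2}
embedding_matrix = build_embedding_matrix(sample, vocab, embed_size=3)
```

With a real file you would pass an open file handle instead of the StringIO object; "dog" is absent from the sample, so its row stays at zero.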

Preprocessing

The preprocessing is all done through the "nlppipe.py" file using spaCy. Feel free to use your own preprocessing instead.

At the most basic level, if you would like to get your data processed for lda2vec, you can do the following:

import pandas as pd
from lda2vec.nlppipe import Preprocessor

# Data directory
data_dir = "data"
# Where to save preprocessed data
clean_data_dir = "data/clean_data"
# Name of input file. Should be inside of data_dir
input_file = "20_newsgroups.txt"
# Should we load pretrained embeddings from file
load_embeds = True

# Read in data file
df = pd.read_csv(data_dir+"/"+input_file, sep="\t")

# Initialize a preprocessor
P = Preprocessor(df, "texts", max_features=30000, maxlen=10000, min_count=30)

# Run the preprocessing on your dataframe
P.preprocess()

# Load embeddings from file if we choose to do so
if load_embeds:
    # Load embedding matrix from file path - change path to where you saved them
    embedding_matrix = P.load_glove("PATH/TO/GLOVE/glove.6B.300d.txt")
else:
    embedding_matrix = None

# Save data to data_dir
P.save_data(clean_data_dir, embedding_matrix=embedding_matrix)
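The max_features and min_count arguments above suggest a frequency-based vocabulary cut. As a rough sketch of that idea (an assumption about what the parameters mean, not the Preprocessor's actual code), the hypothetical build_vocab helper below keeps words seen at least min_count times and caps the vocabulary at max_features entries:

```python
from collections import Counter


def build_vocab(tokenized_docs, max_features, min_count):
    """Return (word_to_idx, idx_to_word, freqs) for the retained words."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    # Most frequent first; ties broken alphabetically so output is stable
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Drop rare words, then cap the vocabulary size
    kept = [(w, c) for w, c in ranked if c >= min_count][:max_features]
    word_to_idx = {w: i for i, (w, _) in enumerate(kept)}
    idx_to_word = {i: w for w, i in word_to_idx.items()}
    freqs = [c for _, c in kept]
    return word_to_idx, idx_to_word, freqs


docs = [["topic", "model", "topic"], ["model", "vector", "topic"], ["rare"]]
word_to_idx, idx_to_word, freqs = build_vocab(docs, max_features=2, min_count=2)
```

On a small corpus, a min_count as high as 30 can remove most of the vocabulary, so it is worth checking len(freqs) after preprocessing.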

When you run the twenty newsgroups preprocessing example, it will create a directory tree that looks like this:

├── my_project
│   ├── data
│   │   ├── 20_newsgroups.txt
│   │   └── clean_data_dir
│   │       ├── doc_lengths.npy
│   │       ├── embedding_matrix.npy
│   │       ├── freqs.npy
│   │       ├── idx_to_word.pickle
│   │       ├── skipgrams.txt
│   │       └── word_to_idx.pickle
│   ├── load_20newsgroups.py
│   └── run_20newsgroups.py
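The loader in the next section returns pivot_ids, target_ids, and doc_ids, which suggests that skipgrams.txt stores (pivot word, target word, document) triples. A minimal sketch of how such triples can be generated from index-encoded documents follows. The window size, the function name make_skipgrams, and the triple layout are assumptions, not a file format the package guarantees.

```python
def make_skipgrams(encoded_docs, window=2):
    """Yield (pivot_id, target_id, doc_id) for every word pair within `window`."""
    for doc_id, doc in enumerate(encoded_docs):
        for pos, pivot in enumerate(doc):
            start = max(0, pos - window)
            stop = min(len(doc), pos + window + 1)
            for ctx in range(start, stop):
                if ctx != pos:  # a word is not its own context
                    yield pivot, doc[ctx], doc_id


# Two toy documents, already mapped to vocabulary indices
encoded_docs = [[0, 1, 2], [2, 0]]
triples = list(make_skipgrams(encoded_docs, window=1))
pivot_ids = [t[0] for t in triples]
target_ids = [t[1] for t in triples]
doc_ids = [t[2] for t in triples]
num_docs = max(doc_ids) + 1  # document ids are zero-based
```

Keeping the document id next to each pair is what lets the model learn a separate topic mixture per document.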

Using the Model

To run the model, pass the directory where you saved the preprocessed data (clean_data_dir in the preprocessing step) as data_path to the load_preprocessed_data function, then use the loaded data to instantiate and train the model.

from lda2vec import utils, model

# Path to preprocessed data
data_path = "data/clean_data"
# Whether or not to load saved embeddings file
load_embeds = True

# Load data from files
(idx_to_word, word_to_idx, freqs, pivot_ids,
 target_ids, doc_ids, embed_matrix) = utils.load_preprocessed_data(data_path, load_embed_matrix=load_embeds)

# Number of unique documents
num_docs = doc_ids.max() + 1
# Number of unique words in vocabulary (int)
vocab_size = len(freqs)
# Embed layer dimension size
# If not loading embeds, change 128 to whatever size you want.
embed_size = embed_matrix.shape[1] if load_embeds else 128
# Number of topics to cluster into
num_topics = 20
# Number of passes over the entire dataset
num_epochs = 200
# Batch size - Increase/decrease depending on memory usage
batch_size = 4096
# Epoch that we want to "switch on" LDA loss
switch_loss_epoch = 0
# Pretrained embeddings value
pretrained_embeddings = embed_matrix if load_embeds else None
# If True, save logdir, otherwise don't
save_graph = True


# Initialize the model
m = model(num_docs,
          vocab_size,
          num_topics,
          embedding_size=embed_size,
          pretrained_embeddings=pretrained_embeddings,
          freqs=freqs,
          batch_size=batch_size,
          save_graph_def=save_graph)

# Train the model
m.train(pivot_ids,
        target_ids,
        doc_ids,
        len(pivot_ids),
        num_epochs,
        idx_to_word=idx_to_word,
        switch_loss_epoch=switch_loss_epoch)
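For readers new to the method, the core idea in Moody's lda2vec paper is that each prediction is made from a context vector formed by adding the pivot word vector to a document vector, where the document vector is a softmax-weighted mixture of topic vectors. The pure-Python sketch below illustrates that computation on toy numbers. The function names and values are illustrative, and the TensorFlow model in this package may organise the computation differently.

```python
import math


def softmax(weights):
    """Numerically stable softmax over a list of floats."""
    peak = max(weights)
    exps = [math.exp(w - peak) for w in weights]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(word_vec, doc_topic_weights, topic_matrix):
    """Word vector plus the document's topic mixture.

    doc_topic_weights: unnormalised weights, one per topic
    topic_matrix: num_topics rows, each an embedding-sized topic vector
    """
    proportions = softmax(doc_topic_weights)
    # Document vector = proportion-weighted sum of the topic vectors
    doc_vec = [
        sum(p * topic[d] for p, topic in zip(proportions, topic_matrix))
        for d in range(len(word_vec))
    ]
    return [w + d for w, d in zip(word_vec, doc_vec)]


word_vec = [1.0, 0.0]
topic_matrix = [[0.0, 2.0], [4.0, 0.0]]  # two topics, embedding size 2
doc_topic_weights = [0.0, 0.0]           # equal weights give a 50/50 mixture
ctx = context_vector(word_vec, doc_topic_weights, topic_matrix)
```

Because the topic vectors live in the same space as the word vectors, the embedding size passed to the model has to match the width of any pretrained embedding matrix.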

Visualizing the Results

We can now visualize the results of our model using pyLDAvis:

utils.generate_ldavis_data(data_path, m, idx_to_word, freqs, vocab_size)

This will launch pyLDAvis in your browser so you can explore the learned topics interactively.
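pyLDAvis works from per-topic word distributions. One common way to obtain them from an lda2vec-style model is to score every word vector against each topic vector and normalise the scores with a softmax. The sketch below shows that step in plain Python. It is an illustration under that assumption, not a description of what generate_ldavis_data does internally, and topic_term_distribution is a hypothetical helper.

```python
import math


def topic_term_distribution(topic_vec, word_matrix):
    """Softmax over dot products between one topic and every word vector."""
    scores = [sum(t * w for t, w in zip(topic_vec, word_vec))
              for word_vec in word_matrix]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


idx_to_word = {0: "game", 1: "team", 2: "disk"}
word_matrix = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
topic_vec = [2.0, 0.0]  # toy topic pointing towards "game" and "team"

dist = topic_term_distribution(topic_vec, word_matrix)
# Indices of the two most probable words for this topic
top_words = [idx_to_word[i] for i in
             sorted(range(len(dist)), key=lambda i: -dist[i])[:2]]
```

Printing the top words per topic in this way is a quick sanity check during long training runs, before opening the full visualization.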
