• Stars
    star
    4,444
  • Rank 9,645 (Top 0.2 %)
  • Language
    Python
  • License
    MIT License
  • Created about 4 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Leveraging BERT and c-TF-IDF to create easily interpretable topics.

PyPI - Python Build docs PyPI - PyPi PyPI - License arXiv

BERTopic

BERTopic is a topic modeling technique that leverages ๐Ÿค— transformers and c-TF-IDF to create dense clusters allowing for easily interpretable topics whilst keeping important words in the topic descriptions.

BERTopic supports all kinds of topic modeling techniques:

Guided Supervised Semi-supervised
Manual Multi-topic distributions Hierarchical
Class-based Dynamic Online/Incremental
Multimodal Multi-aspect Text Generation/LLM

Corresponding medium posts can be found here, here and here. For a more detailed overview, you can read the paper or see a brief overview.

Installation

Installation, with sentence-transformers, can be done using pypi:

pip install bertopic

If you want to install BERTopic with other embedding models, you can choose one of the following:

# Choose an embedding backend
pip install bertopic[flair, gensim, spacy, use]

# Topic modeling with images
pip install bertopic[vision]

Getting Started

For an in-depth overview of the features of BERTopic you can check the full documentation or you can follow along with one of the examples below:

Name Link
BERTopic - Best Practices Open In Colab
Topic Modeling with BERTopic Open In Colab
(Custom) Embedding Models in BERTopic Open In Colab
Advanced Customization in BERTopic Open In Colab
(semi-)Supervised Topic Modeling with BERTopic Open In Colab
Dynamic Topic Modeling with Trump's Tweets Open In Colab
Topic Modeling on Large Data Open In Colab
Topic Modeling arXiv Abstracts Kaggle

Quick Start

We start by extracting topics from the well-known 20 newsgroups dataset containing English documents:

from bertopic import BERTopic
from sklearn.datasets import fetch_20newsgroups
 
docs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']

topic_model = BERTopic()
topics, probs = topic_model.fit_transform(docs)

After generating topics and their probabilities, we can access all of the topics together with their topic representations:

>>> topic_model.get_topic_info()

Topic	Count	Name
-1	4630	-1_can_your_will_any
0	693	49_windows_drive_dos_file
1	466	32_jesus_bible_christian_faith
2	441	2_space_launch_orbit_lunar
3	381	22_key_encryption_keys_encrypted
...

The -1 topic refers to all outlier documents and are typically ignored. Each word in a topic describes the underlying theme of that topic and can be used for interpreting that topic. Next, let's take a look at the most frequent topic that was generated:

>>> topic_model.get_topic(0)

[('windows', 0.006152228076250982),
 ('drive', 0.004982897610645755),
 ('dos', 0.004845038866360651),
 ('file', 0.004140142872194834),
 ('disk', 0.004131678774810884),
 ('mac', 0.003624848635985097),
 ('memory', 0.0034840976976789903),
 ('software', 0.0034415334250699077),
 ('email', 0.0034239554442333257),
 ('pc', 0.003047105930670237)]

Using .get_document_info, we can also extract information on a document level, such as their corresponding topics, probabilities, whether they are representative documents for a topic, etc.:

>>> topic_model.get_document_info(docs)

Document                               Topic	Name	                        Top_n_words                     Probability    ...
I am sure some bashers of Pens...	0	0_game_team_games_season	game - team - games...	        0.200010       ...
My brother is in the market for...      -1     -1_can_your_will_any	        can - your - will...	        0.420668       ...
Finally you said what you dream...	-1     -1_can_your_will_any	        can - your - will...            0.807259       ...
Think! It's the SCSI card doing...	49     49_windows_drive_dos_file	windows - drive - docs...	0.071746       ...
1) I have an old Jasmine drive...	49     49_windows_drive_dos_file	windows - drive - docs...	0.038983       ...

๐Ÿ”ฅ Tip: Use BERTopic(language="multilingual") to select a model that supports 50+ languages.

Fine-tune Topic Representations

In BERTopic, there are a number of different topic representations that we can choose from. They are all quite different from one another and give interesting perspectives and variations of topic representations. A great start is KeyBERTInspired, which for many users increases the coherence and reduces stopwords from the resulting topic representations:

from bertopic.representation import KeyBERTInspired

# Fine-tune your topic representations
representation_model = KeyBERTInspired()
topic_model = BERTopic(representation_model=representation_model)

However, you might want to use something more powerful to describe your clusters. You can even use ChatGPT or other models from OpenAI to generate labels, summaries, phrases, keywords, and more:

import openai
from bertopic.representation import OpenAI

# Fine-tune topic representations with GPT
openai.api_key = "sk-..."
representation_model = OpenAI(model="gpt-3.5-turbo", chat=True)
topic_model = BERTopic(representation_model=representation_model)

๐Ÿ”ฅ Tip: Instead of iterating over all of these different topic representations, you can model them simultaneously with multi-aspect topic representations in BERTopic.

Visualizations

After having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good understanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can use one of the many visualization options in BERTopic. For example, we can visualize the topics that were generated in a way very similar to LDAvis:

topic_model.visualize_topics()

Modularity

By default, the main steps for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, it assumes some independence between these steps which makes BERTopic quite modular. In other words, BERTopic not only allows you to build your own topic model but to explore several topic modeling techniques on top of your customized topic model:

BERTopicOverview.mp4

You can swap out any of these models or even remove them entirely. The following steps are completely modular:

  1. Embedding documents
  2. Reducing dimensionality of embeddings
  3. Clustering reduced embeddings into topics
  4. Tokenization of topics
  5. Weight tokens
  6. Represent topics with one or multiple representations

Functionality

BERTopic has many functions that quickly can become overwhelming. To alleviate this issue, you will find an overview of all methods and a short description of its purpose.

Common

Below, you will find an overview of common functions in BERTopic.

Method Code
Fit the model .fit(docs)
Fit the model and predict documents .fit_transform(docs)
Predict new documents .transform([new_doc])
Access single topic .get_topic(topic=12)
Access all topics .get_topics()
Get topic freq .get_topic_freq()
Get all topic information .get_topic_info()
Get all document information .get_document_info(docs)
Get representative docs per topic .get_representative_docs()
Update topic representation .update_topics(docs, n_gram_range=(1, 3))
Generate topic labels .generate_topic_labels()
Set topic labels .set_topic_labels(my_custom_labels)
Merge topics .merge_topics(docs, topics_to_merge)
Reduce nr of topics .reduce_topics(docs, nr_topics=30)
Reduce outliers .reduce_outliers(docs, topics)
Find topics .find_topics("vehicle")
Save model .save("my_model", serialization="safetensors")
Load model BERTopic.load("my_model")
Get parameters .get_params()

Attributes

After having trained your BERTopic model, several attributes are saved within your model. These attributes, in part, refer to how model information is stored on an estimator during fitting. The attributes that you see below all end in _ and are public attributes that can be used to access model information.

Attribute Description
.topics_ The topics that are generated for each document after training or updating the topic model.
.probabilities_ The probabilities that are generated for each document if HDBSCAN is used.
.topic_sizes_ The size of each topic
.topic_mapper_ A class for tracking topics and their mappings anytime they are merged/reduced.
.topic_representations_ The top n terms per topic and their respective c-TF-IDF values.
.c_tf_idf_ The topic-term matrix as calculated through c-TF-IDF.
.topic_aspects_ The different aspects, or representations, of each topic.
.topic_labels_ The default labels for each topic.
.custom_labels_ Custom labels for each topic as generated through .set_topic_labels.
.topic_embeddings_ The embeddings for each topic if embedding_model was used.
.representative_docs_ The representative documents for each topic if HDBSCAN is used.

Variations

There are many different use cases in which topic modeling can be used. As such, several variations of BERTopic have been developed such that one package can be used across many use cases.

Method Code
Topic Distribution Approximation .approximate_distribution(docs)
Online Topic Modeling .partial_fit(doc)
Semi-supervised Topic Modeling .fit(docs, y=y)
Supervised Topic Modeling .fit(docs, y=y)
Manual Topic Modeling .fit(docs, y=y)
Multimodal Topic Modeling .fit(docs, images=images)
Topic Modeling per Class .topics_per_class(docs, classes)
Dynamic Topic Modeling .topics_over_time(docs, timestamps)
Hierarchical Topic Modeling .hierarchical_topics(docs)
Guided Topic Modeling BERTopic(seed_topic_list=seed_topic_list)

Visualizations

Evaluating topic models can be rather difficult due to the somewhat subjective nature of evaluation. Visualizing different aspects of the topic model helps in understanding the model and makes it easier to tweak the model to your liking.

Method Code
Visualize Topics .visualize_topics()
Visualize Documents .visualize_documents()
Visualize Document Hierarchy .visualize_hierarchical_documents()
Visualize Topic Hierarchy .visualize_hierarchy()
Visualize Topic Tree .get_topic_tree(hierarchical_topics)
Visualize Topic Terms .visualize_barchart()
Visualize Topic Similarity .visualize_heatmap()
Visualize Term Score Decline .visualize_term_rank()
Visualize Topic Probability Distribution .visualize_distribution(probs[0])
Visualize Topics over Time .visualize_topics_over_time(topics_over_time)
Visualize Topics per Class .visualize_topics_per_class(topics_per_class)

Citation

To cite the BERTopic paper, please use the following bibtex reference:

@article{grootendorst2022bertopic,
  title={BERTopic: Neural topic modeling with a class-based TF-IDF procedure},
  author={Grootendorst, Maarten},
  journal={arXiv preprint arXiv:2203.05794},
  year={2022}
}

More Repositories

1

KeyBERT

Minimal keyword extraction with BERT
Python
2,474
star
2

PolyFuzz

Fuzzy string matching, grouping, and evaluation.
Python
649
star
3

Concept

Concept Modeling: Topic Modeling on Images and Text
Python
149
star
4

soan

Social Analysis based on Whatsapp data
Python
124
star
5

cTFIDF

Creating class-based TF-IDF matrices
Python
67
star
6

ML-API

Guide on creating an API for serving your ML model
Jupyter Notebook
63
star
7

Projects

Data Science Portfolio
Jupyter Notebook
63
star
8

ReinLife

Creating Artificial Life with Reinforcement Learning
Python
56
star
9

CustomerSegmentation

Analysis for Customer Segmentation
Jupyter Notebook
56
star
10

streamlit_guide

A guide on creating and deploying your Streamlit application to Heroku
Python
47
star
11

feature-engineering

Tips for Advanced Feature Engineering
Jupyter Notebook
47
star
12

BERTopic_evaluation

Code and experiments for *BERTopic: Neural topic modeling with a class-based TF-IDF procedure*
Python
40
star
13

boardgame

Heroku app to explore boardgame data
Jupyter Notebook
20
star
14

UnitTesting

Guide for applying Unit Testing in data-driven projects
Python
18
star
15

Sprite-Generator

Python procedural sprite generator
Jupyter Notebook
15
star
16

VLAC

Vectors of Locally Aggregated Concepts
Jupyter Notebook
10
star
17

Reviewer

Tool for extracting and analyzing IMDB reviews
Jupyter Notebook
7
star
18

InterpretableML

My analyses for interpretable Machine Learning
Jupyter Notebook
7
star
19

validation

Overview of validation techniques
Jupyter Notebook
6
star
20

ReinforcementLearning

Train SOTA RL-algorithms using Stable Baselines andย Gym
Jupyter Notebook
4
star
21

cars_dashboard

Dashboard for the cars dataset
Python
3
star
22

MaartenGr

Python
3
star
23

PotholeDetection

Detection of Potholes in Images
Jupyter Notebook
2
star
24

fitbit

Analysis of my FitBit data
Jupyter Notebook
1
star
25

DisneyTournament

Statistically Generated Disney Tournament Bracket
Jupyter Notebook
1
star
26

BoardGames

Analysis of board game matches
Jupyter Notebook
1
star