• Stars
    star
    139
  • Rank 261,477 (Top 6 %)
  • Language
    Python
  • License
    MIT License
  • Created over 4 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Steam review texting embedding analysis

Contextual Topic Identification for Steam Reviews

This repository is the implementation of contextual topic identification model. The model is based on LDA probabilistic topic assignment and pre-trained sentence embeddings from BERT/RoBERTa. The analysis is conducted on the dataset of game reviews on the steam platform.

Motivation

Product reviews are important as influencing people's choices especially for online shopping. We usually have dreadfully huge numbers of reviews for whatever products. However, many platforms have barely a satisfying categorization system for the reviews when it comes to what the reviewers are really talking about. Steam, for example, has a very carefully designed system where people can do a lot of things, but still there is no such access to categorizing the reviews by their semantic meanings.

Steam review logo

Therefore, we provide a topic identification procedure thats combines both bag-of-words and contextual information to explore potential semantically meaningful categories out of the oceans of steam reviews.

Setup

Clone the repo

git clone https://github.com/Stveshawn/contextual_topic_identification.git
cd contextual_topic_identification

and make sure you have dataset in the data folder (you can specify the path in the bash script later).

To run the model and get trained model objects and visualization

With Docker

run the bash script on your terminal

sudo bash docker_build_run.sh

The results will be saved in the docs folder with corresponding model id (Method_Year_Month_Day_Hour_Minute_Second).

Four parameters can be specified in the bash script

  • samp_size: number of reviews used in the model
  • method={"TFIDF", "LDA", "BERT", "LDA_BERT"}: method for the topic model
  • ntopic: number of topics
  • fpath=/contextual_topic_identification/data/steam_reviews.csv: file path to the csv data

To run a test case on the sampled test data, do

sudo bash test.sh

Data

The dataset (Steam review dataset) is published on Kaggle covering ~480K reviews for 46 best selling video games on steam.

To successfully run the model, you should have this dataset downloaded (kaggle authentication required) and placed in the data folder (or specify your own file path in the bash script).

Model

To identify the potential topics of the target documents, traditional approaches are

  • Latent Dirichlet Allocation

  • Embedding + Clustering

Although LDA generally works well for topic modeling tasks, it fails with short documents, in which there isn’t much text to model and documents that don’t coherently discuss topics. Using only bag-of-words information also make it quite limited.

The contextual topic identification model leverages both bag-of-words and contextual information by including both the LDA topic probabilities and the sentence embeddings. The model is as follows

Model

where we

  • take the information from LDA probabilistic topic assignment (v_1) and sentence embeddings (v_2)
  • concatenate \lambda * v_1 and v_2 to get v_{full}
  • learn the latent space representation v_{latent} of by autoencoder
  • implement clustering on the latent space representations.

Result

Visualizations (2D UMAP) of clustering results with different vectorization methods with n_topic=10

TF-IDF BERT LDA_BERT
Model Model Model

Evaluation of different topic identification models with n_topic=10

Metric\Method TF-IDF + Clustering LDA BERT + Clustering LDA_BERT + Clustering
C_Umass -2.161 -5.233 -4.368 -3.394
CV 0.538 0.482 0.547 0.551
Silhouette score 0.025 / 0.063 0.234

Acknowledgements

Libraries

Sentence Transformers: Sentence Embeddings using BERT / RoBERTa / DistilBERT / ALBERT / XLNet with PyTorch

SymSpell: 1 million times faster through Symmetric Delete spelling correction algorithm

Gensim: Topic Modelling in Python