• Stars: 220
  • Rank: 180,422 (Top 4%)
  • Language: Kotlin
  • Created: over 8 years ago
  • Updated: over 7 years ago

Repository Details

Pragmatic & Practical Bayesian Sentiment Classifier

Sentiment Analysis - Improving Bayesian Methods

About

The point of this project is to explore practical and pragmatic means to improve results from Bayesian classifiers in the area of sentiment analysis.

This project explores a few optimizations to the typical Naive Bayesian methods for sentiment analysis along with a few notes.

Depending on your requirements, you can achieve >90% accuracy with Bayesian techniques. Results of >99% accuracy are possible in aggregate.

The best results typically range from 89-94% accuracy while rating 50-70% of the data. As this is a stochastic method, results vary. All results were generated with a proprietary tokenizer, though I suspect the Lucene tokenizer will come close, or at least demonstrate the improvement over a single Bayesian classifier.

  1. Instead of using a single Bayesian classifier, we will train a small cluster of Bayesian classifiers on random samples of the training data (Bagging). (This offered the most immediate improvement)

    • This is similar to the improvement Random Forest provides over Decision Tree algorithms.
    • The implementing class is called StochasticBayesianClassifier.
    • Choose your strategy for determining how the cluster of classifiers "vote" for the final sentiment: all-or-none, majority wins, average, etc. (A minimal sketch of this approach appears after this list.)
  2. You don't have to rate 100% of the data.

    • Depending on your business case, you may not have to rate 100% of your data. E.g. Rating 50% of the data is usually more than enough to get an idea of overall sentiment.
    • The rated data just needs to be representative of the overall data.
    • Adjust the confidence required for the classifier to apply a sentiment label.
  3. N-Gram models can do pretty well.

    • While skipgrams have the potential to offer more language coverage with less data, they are also very expensive to compute.
    • Bigrams seemed to hit the sweet spot for speed, accuracy, and memory.
  4. Prune your trained model.

    • Removing n-grams/features that occur "too rarely" can help prevent over-fitting, as well as decrease the model's memory footprint (~4x in these tests).
  5. Tokenization of the text matters!

    • This project contains two tokenizers: a naive whitespace tokenizer and a more advanced Lucene standard analyzer. The Lucene standard analyzer does much better.
    • At DataRank/Simply Measured, we use much more developed tokenizers that better handle language, emoji, URLs, etc. This gives the model more structured information to learn from and offers a significant improvement.
  6. In practice, accuracy in aggregate matters more than the sentiment accuracy on individual texts.

    • E.g. the algorithm achieves 90% accuracy when rating individual texts, yet in aggregate it is 99.1% accurate.
    • An example of an aggregate sentiment result is saying a set of text is 70% positive and 30% negative.
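
To make items 1-4 above concrete, here is a minimal, self-contained Kotlin sketch of the approach: a small Naive Bayes classifier over bigram features, a bagged "stochastic" ensemble trained on random samples, a confidence threshold that leaves low-confidence texts unrated, and a simple frequency-based prune. The names (NaiveBayes, StochasticEnsemble, bigrams) and parameters are illustrative assumptions, not the project's actual API, and the voting strategy shown is a plain average.

```kotlin
import kotlin.math.abs
import kotlin.math.exp
import kotlin.math.ln
import kotlin.random.Random

// Item 3: naive whitespace tokenization into bigram features.
fun bigrams(text: String): List<String> {
    val tokens = text.lowercase().split(Regex("\\s+")).filter { it.isNotBlank() }
    return tokens.zipWithNext { a, b -> "$a $b" }
}

// A tiny multinomial Naive Bayes with Laplace smoothing (illustrative, not the repo's class).
class NaiveBayes {
    private val posCounts = HashMap<String, Int>()
    private val negCounts = HashMap<String, Int>()
    private var posTotal = 0
    private var negTotal = 0
    private var posDocs = 0
    private var negDocs = 0

    fun train(features: List<String>, positive: Boolean) {
        val counts = if (positive) posCounts else negCounts
        for (f in features) counts[f] = (counts[f] ?: 0) + 1
        if (positive) { posTotal += features.size; posDocs++ } else { negTotal += features.size; negDocs++ }
    }

    // Returns p(positive | features), computed in log space for numerical stability.
    fun pPositive(features: List<String>): Double {
        val vocab = (posCounts.keys + negCounts.keys).size + 1
        var logPos = ln((posDocs + 1.0) / (posDocs + negDocs + 2.0))
        var logNeg = ln((negDocs + 1.0) / (posDocs + negDocs + 2.0))
        for (f in features) {
            logPos += ln(((posCounts[f] ?: 0) + 1.0) / (posTotal + vocab))
            logNeg += ln(((negCounts[f] ?: 0) + 1.0) / (negTotal + vocab))
        }
        val m = maxOf(logPos, logNeg)
        return exp(logPos - m) / (exp(logPos - m) + exp(logNeg - m))
    }

    // Item 4: drop features seen fewer than minFrequency times to shrink the model.
    fun prune(minFrequency: Int) {
        listOf(posCounts, negCounts).forEach { c -> c.entries.removeIf { it.value < minFrequency } }
        posTotal = posCounts.values.sum()
        negTotal = negCounts.values.sum()
    }
}

// Items 1 and 2: a cluster of classifiers, each trained on a random sample of the data
// (bagging); the ensemble only labels a text when the averaged vote is confident enough.
class StochasticEnsemble(
    classifierCount: Int = 10,
    private val samplingRate: Double = 0.2,
    private val confidenceThreshold: Double = 0.2,
    seed: Long = 42L,
) {
    private val members = List(classifierCount) { NaiveBayes() }
    private val rng = Random(seed)

    fun train(labeled: List<Pair<String, Boolean>>) {
        for (member in members) {
            labeled.filter { rng.nextDouble() < samplingRate }          // random sample per member
                .forEach { (text, positive) -> member.train(bigrams(text), positive) }
        }
    }

    fun prune(minFrequency: Int) = members.forEach { it.prune(minFrequency) }

    // Returns true/false for positive/negative, or null when the text is left unrated.
    fun classify(text: String): Boolean? {
        val features = bigrams(text)
        val p = members.map { it.pPositive(features) }.average()       // "average" voting strategy
        return if (abs(p - 0.5) < confidenceThreshold) null else p > 0.5
    }
}

fun main() {
    val training = listOf(
        "what a great movie , loved every minute of it" to true,
        "wonderful acting and a great story" to true,
        "terrible plot , hated every minute of it" to false,
        "boring , badly acted , a complete waste of time" to false,
    )
    val model = StochasticEnsemble(classifierCount = 5, samplingRate = 0.5, confidenceThreshold = 0.05)
    model.train(training)
    println(model.classify("a great story with wonderful acting"))     // true, false, or null (unrated)
}
```

With larger classifier counts and lower sampling rates, each member sees a different slice of the data, which is where the Random-Forest-like variance reduction comes from.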

Why use Bayesian models?

There are plenty of alternative models for sentiment analysis, and a few have been shown to outperform Bayesian classifiers. These include LSTM recurrent neural networks, SVMs, convolutional neural networks, etc.

I have implemented many different models, including some of those mentioned above, and achieved success with them. So why use the simple man's Bayesian classifier?

If you have ever had to debug why a classifier is right or wrong about its results, or had to retrain a large neural network or SVM, then you probably know the difficulty that comes with managing these technologies. They are also surprisingly difficult to explain to customers (this may or may not be relevant to you). Imagine explaining to your customer or boss how the neural network mis-learned certain word pairs because of how the word vectors were encoded; this can be difficult even for an engineer to debug. Retraining and testing large models can also be very time consuming, considering some models may take hours to retrain and verify.

Oftentimes the gains of these algorithms (a few percentage points of accuracy) do not outweigh the aforementioned costs. A Bayesian classifier, by contrast, is incredibly easy to debug (just look at the word/n-gram probabilities), and it trains about as fast as you can tokenize, compute n-grams, and pump the data into the model.

Results

IMDB Movie Review Test

  • (Scroll to the bottom for instructions on downloading the IMDB movie review dataset.)

  • single model = a single Bayesian classifier; stochastic model = a cluster of Bayesian classifiers trained on random samples (bagging)

Results generated from BayesianClassifierImdbDemo.kt

Confidence Threshold: 0.25

| Model | Train+ | Train- | Test+ | Test- | Net Accuracy | % of data rated | Misc Parameters |
|---|---|---|---|---|---|---|---|
| stochastic bigram | 98.7% | 97.1% | 94.8% | 92.8% | 93.8% | 61.0% | classifier count: 10, sampling rate: 0.2 |

Confidence Threshold: 0.2

| Model | Train+ | Train- | Test+ | Test- | Net Accuracy | % of data rated | Misc Parameters |
|---|---|---|---|---|---|---|---|
| stochastic bigram | 99.5% | 97.7% | 94.8% | 89.1% | 91.95% | 54.4% | classifier count: 10, sampling rate: 0.2 |
| stochastic skipgram(2,2) | 98.5% | 97.3% | 89.1% | 87.4% | 88.3% | 37% | default |
| single bigram | 99.98% | 100.0% | 70.9% | 77.2% | 74.0% | 73% | default |
| single skipgram(2,2) | 100.0% | 100.0% | 69.3% | 72.0% | 70.7% | 70.5% | default |

Confidence Threshold: 0.05

| Model | Train+ | Train- | Test+ | Test- | Net Accuracy | % of data rated | Misc Parameters |
|---|---|---|---|---|---|---|---|
| stochastic bigram | 99.6% | 99.5% | 92.8% | 96.5% | 94.6% | 16% | classifier count: 10, sampling rate: 0.2 |
| stochastic skipgram(2,2) | 99.8% | 98.9% | 93.8% | 92.1% | 92.8% | 15% | default |
| single bigram | 100.0% | 100.0% | 77.3% | 77.3% | 77% | 72% | default |
| single skipgram(2,2) | 99.7% | 99.4% | 94.1% | 94.7% | 94.2% | 10% | default |

Tweet Test

Results Generated from StochasticBayesianClassifierTwitterSampleDemo

| Model | Train+ | Train- | Test+ | Test- | Net Accuracy | % of data rated | Misc Parameters |
|---|---|---|---|---|---|---|---|
| stochastic bigram (kaggle data) | 100.0% | 99.9% | N/A | N/A | 99.95% | 95.7% | classifier count: 15, sampling rate: 0.5 |
| stochastic bigram (hand rated) | 99.8% | 99.9% | N/A | N/A | 99.85% | 80.5% | classifier count: 15, sampling rate: 0.5 |
| stochastic bigram (hand rated, 50% train, 50% test) | 99.8% | 99.9% | 86.3% | 95.9% | 91.1% | 64.2% | classifier count: 15, sampling rate: 0.5 |

Aggregation Testing

Refer to the machine-learning/sentiment_analysis.xlsx file for more details.

| Simulations | Sample Size | Percent Positive | Average Error | Standard Deviation | Data Set |
|---|---|---|---|---|---|
| 100 | 1000 | 50% | 0.031 | 0.019 | IMDB |
| 100 | 2000 | 50% | 0.010 | 0.013 | IMDB |
| 100 | 2000 | 50% | 0.009 | 0.010 | Twitter |
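
The spreadsheet itself is not reproduced here, but the methodology can be sketched: repeatedly draw a sample of texts from a set with a known positive rate, flip each individual prediction with probability equal to the classifier's error rate, and measure how far the predicted positive fraction drifts from the truth. The Kotlin sketch below is an illustration that assumes a symmetric, independent per-text error model, which is a simplification of the real experiments; the function name and parameters are hypothetical.

```kotlin
import kotlin.math.abs
import kotlin.math.sqrt
import kotlin.random.Random

// Illustration only: estimate the aggregate error when each individual rating is `accuracy`
// reliable and the true data set is `percentPositive` positive.
fun simulateAggregateError(
    simulations: Int = 100,
    sampleSize: Int = 2000,
    percentPositive: Double = 0.5,
    accuracy: Double = 0.90,
    rng: Random = Random(7),
): Pair<Double, Double> {
    val errors = DoubleArray(simulations) { _ ->
        var predictedPositive = 0
        repeat(sampleSize) {
            val actuallyPositive = rng.nextDouble() < percentPositive
            val correct = rng.nextDouble() < accuracy                 // symmetric error model
            val predicted = if (correct) actuallyPositive else !actuallyPositive
            if (predicted) predictedPositive++
        }
        abs(predictedPositive.toDouble() / sampleSize - percentPositive)
    }
    val mean = errors.average()
    val stdDev = sqrt(errors.map { (it - mean) * (it - mean) }.average())
    return mean to stdDev
}

fun main() {
    val (avgError, stdDev) = simulateAggregateError()
    println("average error = %.3f, standard deviation = %.3f".format(avgError, stdDev))
}
```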

The graph below shows how the classifier skews against data that is known to be 50% positive/50% negative. The skew is plotted over 100 simulations.

The graph below shows the classifier's aggregate sentiment for each simulation, sorted and then plotted.

The graph below shows, for confidence thresholds varying over [0.01, 0.50], the relationship between accuracy and the percentage of data rated.

Performance

Model Pruning

| Min Frequency | Prune Threshold: abs(0.5 - p(pos)) < threshold | Avg Size Before | Avg Size After | Accuracy Before | % Rated Before | Accuracy After | % Rated After |
|---|---|---|---|---|---|---|---|
| 2 | 0.05 | 409534 | 130448 | 91.3% | 40% | 94.7% | 40% |
| 2 | 0.05 | 413216 | 112896 | 91.9% | 40% | 95.2% | 43% |

Heap size of the same model used in the above tests, before and after pruning: 888 MB -> 335 MB
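
The table header encodes two pruning criteria: a minimum frequency and an informativeness threshold on abs(0.5 - p(pos)). One plausible reading of that rule is sketched below in Kotlin; the FeatureStats type and prune function are hypothetical, not the repository's code.

```kotlin
import kotlin.math.abs

// Hypothetical feature record: how often an n-gram appeared in positive vs. negative training text.
data class FeatureStats(val ngram: String, val posCount: Int, val negCount: Int)

// Keep a feature only if it is both frequent enough and informative enough:
//   - frequency  : posCount + negCount >= minFrequency
//   - informative: abs(0.5 - p(pos)) >= threshold, where p(pos) = posCount / (posCount + negCount)
fun prune(features: List<FeatureStats>, minFrequency: Int = 2, threshold: Double = 0.05): List<FeatureStats> =
    features.filter { f ->
        val total = f.posCount + f.negCount
        val pPos = if (total == 0) 0.5 else f.posCount.toDouble() / total
        total >= minFrequency && abs(0.5 - pPos) >= threshold
    }
```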

Processing Speed

Pruned model:

  • 37,153 reviews/min, 18 days to process 1 billion reviews
  • 571,724 tweets/min, 29 hours to process 1 billion tweets
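
For reference, these figures follow directly from the per-minute rates: 1,000,000,000 reviews / 37,153 reviews/min ≈ 26,900 minutes, or roughly 18-19 days, and 1,000,000,000 tweets / 571,724 tweets/min ≈ 1,750 minutes, or roughly 29 hours.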

Data

Many thanks to the Stanford team for putting together the IMDB movie review dataset. A small sample of the dataset is included in the test resources for quick testing and experimenting; the full dataset can be downloaded as shown below.

To download the IMDB movie review dataset and extract it, run:

wget http://ai.stanford.edu/~amaas/data/sentiment/aclImdb_v1.tar.gz
tar xvf aclImdb_v1.tar.gz

Make note of the output directory, as its full path is passed into many of the IMDB "demo" programs.

Kotlin

This project is written in Kotlin (https://kotlinlang.org/), a JVM language that hits the sweet spot between Java and Scala, making it an almost perfect language. I hope you enjoy it as much as I do. :)

More Repositories

1. kumo: Kumo - Java Word Cloud (Java, 601 stars)
2. swift_boxxle: A mostly complete port of GameBoy's Boxxle to Swift (Swift, 102 stars)
3. genetic_draw: Genetic Algorithms for drawing images (Kotlin, 101 stars)
4. ultimate_tictactoe: Ultimate Tic Tac Toe - HTML5 (JavaScript, 53 stars)
5. java_games: Java 2D Game Engine - Building Zelda to test it (also includes a MineSweeper demo) (Java, 31 stars)
6. neural_network_kotlin: Autoencoder + Deep Autoencoder + Deep Convoluted Autoencoder + Hebbian (Kotlin, 18 stars)
7. rbm: Deep Convoluted Restricted Boltzmann Machine (Java, 17 stars)
8. kakyll: Static Site Generator in Kotlin (Kotlin, 17 stars)
9. struktural: API Testing Made Easy - Kotlin (Kotlin, 14 stars)
10. cellular-automata-pokemon-types: Cellular Automata - Pokemon Type Battle Simulation (Kotlin, 8 stars)
11. sdl_sprite: A simple SDL sprite library (C++, 8 stars)
12. haskell_boxxle: GameBoy's Boxxle in Haskell (Haskell, 7 stars)
13. neuralnetwork: Neural network class with multiple hidden layers (error backpropagation) (Java, 4 stars)
14. legend_of_mu: Legend of Mu (Playdate game) (Lua, 3 stars)
15. gdx-controller: LibGDX Controller Library (Java, 3 stars)
16. c_gameoflife: John Conway's Game of Life in C++ (SDL) (C++, 3 stars)
17. watchgpt: WatchGPT (Swift, 3 stars)
18. supportvectormachine: Support Vector Machine (Perl, 3 stars)
19. chatgpt: Programs written by ChatGPT (Python, 2 stars)
20. game_of_life_nes: NES Game of Life (C++, 2 stars)
21. log4php: PHP Logger (PHP, 2 stars)
22. blockchain: Simple blockchain in Kotlin (Kotlin, 2 stars)
23. art: Digital art (Kotlin, 2 stars)
24. blocks: A Tetris clone + added difficulty (PureBasic, 2 stars)
25. soroban: Calculator library using a Pratt parser for mixfix grammar (Java, 2 stars)
26. mandelbrot: Mandelbrot generator in Kotlin (Kotlin, 2 stars)
27. vulcan: Vulcan - Space Shooter - PureBasic (PureBasic, 2 stars)
28. games_classic: Collection of various games' source code (2 stars)
29. photo_filter: Some basic photo filtering algorithms (Java, 2 stars)
30. fleschkincaid: Flesch-Kincaid Reading Index (Java, 2 stars)
31. pipes: Start of a "data flow" programming language based around data flowing through "pipes" (Java, 2 stars)
32. ml: Machine Learning - collection of my personal machine learning libraries/experiments (Terra, 2 stars)
33. breakout: Breakout - PureBasic (PureBasic, 2 stars)
34. catuskoti_automata: Catuskoti cellular automata (Kotlin, 1 star)
35. minesweeper: Simple Minesweeper game in C (C++, 1 star)
36. antlr_dna: A sample DNA grammar written using Antlr (Java, 1 star)
37. trie_kotlin: Prefix Tree / Radix Tree / Trie in Kotlin (Kotlin, 1 star)
38. linux_asm: My Assembly collection (Assembly, 1 star)
39. selforganizingmap: Self-Organizing Map (Java, 1 star)
40. jGames: jQuery plugin for displaying various board games (JavaScript, 1 star)
41. euler: Project Euler (85 solutions) - Java (Java, 1 star)
42. resource_optimization: Resource Optimization via Monte Carlo (Kotlin, 1 star)
43. minesweeper_libgdx: A Minesweeper implementation using LibGDX (Java, 1 star)
44. snakes: Snakes - PureBasic (PureBasic, 1 star)