  • Language
    Python
  • License
    MIT License
  • Created about 7 years ago
  • Updated over 1 year ago

Sentiment analysis on tweets using Naive Bayes, SVM, CNN, LSTM, etc.

Sentiment Analysis on Tweets

Update(21 Sept. 2018): I don't actively maintain this repository. This work was done for a course project and the dataset cannot be released because I don't own the copyright. However, everything in this repository can be easily modified to work with other datasets. I recommend reading the sloppily written project report for this project which can be found in docs/.

Dataset Information

We compare several methods for sentiment analysis on tweets (a binary classification problem). The training dataset is expected to be a CSV file with rows of the form tweet_id,sentiment,tweet, where tweet_id is a unique integer identifying the tweet, sentiment is either 1 (positive) or 0 (negative), and tweet is the tweet text enclosed in "". Similarly, the test dataset is a CSV file with rows of the form tweet_id,tweet. Please note that CSV headers are not expected and should be removed from the training and test datasets.
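As a rough illustration of the expected layout, the following sketch reads header-less rows in both formats with Python's standard csv module. The function names and sample tweets are made up for this example, and this is not necessarily how the scripts in this repository load their data.

```python
import csv
import io


def read_train_rows(handle):
    """Yield (tweet_id, sentiment, tweet) from a header-less training CSV."""
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        tweet_id, sentiment, tweet = row
        sentiment = int(sentiment)
        if sentiment not in (0, 1):
            raise ValueError(f"unexpected sentiment label: {sentiment}")
        yield int(tweet_id), sentiment, tweet


def read_test_rows(handle):
    """Yield (tweet_id, tweet) from a header-less test CSV."""
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        tweet_id, tweet = row
        yield int(tweet_id), tweet


# Hypothetical sample data; the quoted tweet may itself contain commas.
sample_train = io.StringIO(
    '1,1,"what a great day, loving it"\n'
    '2,0,"worst service ever"\n'
)
sample_test = io.StringIO('3,"not sure how I feel"\n')

train_rows = list(read_train_rows(sample_train))
test_rows = list(read_test_rows(sample_test))
```

Because the tweet is quoted, a proper CSV parser keeps commas inside the tweet text intact, which a plain split on "," would not.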

Requirements

There are some general library requirements for the project and some which are specific to individual methods. The general requirements are as follows.

  • numpy
  • scikit-learn
  • scipy
  • nltk

The library requirements specific to some methods are:

  • keras with TensorFlow backend for Logistic Regression, MLP, RNN (LSTM), and CNN.
  • xgboost for XGBoost.

Note: It is recommended to use the Anaconda distribution of Python.

Usage

Preprocessing

  1. Run preprocess.py <raw-csv-path> on both train and test data. This will generate a preprocessed version of the dataset.
  2. Run stats.py <preprocessed-csv-path>, where <preprocessed-csv-path> is the path of the CSV generated by preprocess.py. This prints general statistical information about the dataset and generates two pickle files containing the frequency distributions of unigrams and bigrams in the training dataset.

After the above steps, you should have four files in total: <preprocessed-train-csv>, <preprocessed-test-csv>, <freqdist>, and <freqdist-bi>, which are the preprocessed train dataset, the preprocessed test dataset, the frequency distribution of unigrams, and the frequency distribution of bigrams, respectively.
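To give a feel for what a pickled frequency distribution of unigrams and bigrams might contain, here is a minimal sketch. It uses collections.Counter as a stand-in for an NLTK frequency distribution and splits on whitespace; both choices, along with the function name and sample tweets, are assumptions made for this example and may differ from what stats.py actually does.

```python
import pickle
from collections import Counter


def count_ngrams(tweets):
    """Count unigrams and adjacent-word bigrams over whitespace-split tweets."""
    unigrams, bigrams = Counter(), Counter()
    for tweet in tweets:
        words = tweet.split()
        unigrams.update(words)
        # zip pairs each word with its successor, giving the bigrams
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


# Hypothetical preprocessed tweets
tweets = ["good morning world", "good morning again"]
unigrams, bigrams = count_ngrams(tweets)

# Round-trip through pickle in memory; a script would write to a file instead
restored = pickle.loads(pickle.dumps(unigrams))
```

Later steps can then use such counts to, for example, keep only the most frequent words as features.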

For all the methods that follow, change the values of TRAIN_PROCESSED_FILE, TEST_PROCESSED_FILE, FREQ_DIST_FILE, and BI_FREQ_DIST_FILE in the respective files to your own paths. Wherever applicable, the values of USE_BIGRAMS and FEAT_TYPE can be changed to obtain results using different types of features, as described in the report.

Baseline

  1. Run baseline.py. With TRAIN = True, it shows the accuracy on the training dataset.

Naive Bayes

  1. Run naivebayes.py. With TRAIN = True, it shows the accuracy on a 10% validation set.
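The 10% validation accuracy reported by this and the following scripts can be pictured with the sketch below, which shuffles the rows, sets aside one tenth for validation, and computes accuracy as the fraction of matching labels. The function names, the fixed seed, and the shuffling strategy are assumptions for illustration; the scripts themselves may split the data differently.

```python
import random


def split_holdout(rows, validation_fraction=0.1, seed=0):
    """Shuffle rows and return (train_rows, validation_rows)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_val = int(len(rows) * validation_fraction)
    return rows[n_val:], rows[:n_val]


def accuracy(predicted, actual):
    """Fraction of predictions that equal the true labels."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical (tweet_id, sentiment) rows
rows = [(i, i % 2) for i in range(100)]
train_rows, val_rows = split_holdout(rows)

example_accuracy = accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```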

Maximum Entropy

  1. Run logistic.py to run the logistic regression model, or run maxent-nltk.py <> to run the MaxEnt model of NLTK. With TRAIN = True, it shows the accuracy on a 10% validation set.

Decision Tree

  1. Run decisiontree.py. With TRAIN = True, it shows the accuracy on a 10% validation set.

Random Forest

  1. Run randomforest.py. With TRAIN = True, it shows the accuracy on a 10% validation set.

XGBoost

  1. Run xgboost.py. With TRAIN = True, it shows the accuracy on a 10% validation set.

SVM

  1. Run svm.py. With TRAIN = True, it shows the accuracy on a 10% validation set.

Multi-Layer Perceptron

  1. Run neuralnet.py. It validates on 10% of the data and saves the best model to best_mlp_model.h5.

Recurrent Neural Networks

  1. Run lstm.py. It validates on 10% of the data and saves a model for each epoch in ./models/. (Please make sure this directory exists before running lstm.py.)
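Since the ./models/ directory has to exist before training starts, one simple way to create it from Python is sketched below. The constant name MODELS_DIR is only used for this example.

```python
import os

# Directory in which per-epoch models are saved
MODELS_DIR = "./models"

# exist_ok=True makes this a no-op if the directory is already there
os.makedirs(MODELS_DIR, exist_ok=True)
```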

Convolutional Neural Networks

  1. Run cnn.py. This runs the 4-Conv-NN (a neural network with 4 conv layers) model as described in the report. To run other versions of the CNN, comment out or remove the lines where Conv layers are added. It validates on 10% of the data and saves a model for each epoch in ./models/. (Please make sure this directory exists before running cnn.py.)

Majority Vote Ensemble

  1. To extract penultimate layer features for the training dataset, run extract-cnn-feats.py <saved-model>. This will generate three files: train-feats.npy, train-labels.txt, and test-feats.npy.
  2. Run cnn-feats-svm.py which uses files from the previous step to perform SVM classification on features extracted from CNN model.
  3. Place all prediction CSV files for which you want to take majority vote in ./results/ and run majority-voting.py. This will generate majority-voting.csv.
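The voting step itself can be sketched as follows: for each tweet, count the labels predicted by the individual models and keep the most common one. The sketch takes predictions as in-memory dictionaries mapping tweet_id to a 0/1 label; that representation, the function name, and the sample values are assumptions for illustration and are not taken from majority-voting.py.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine per-model {tweet_id: label} dicts by majority vote."""
    voted = {}
    for tweet_id in predictions_per_model[0]:
        votes = Counter(model[tweet_id] for model in predictions_per_model)
        # most_common(1) returns [(label, count)] for the top label
        voted[tweet_id] = votes.most_common(1)[0][0]
    return voted


# Hypothetical predictions from three models
model_a = {1: 1, 2: 0, 3: 1}
model_b = {1: 1, 2: 1, 3: 0}
model_c = {1: 0, 2: 0, 3: 1}

ensemble = majority_vote([model_a, model_b, model_c])
```

With binary labels, using an odd number of models avoids ties; with an even number, this sketch would silently pick whichever tied label was seen first.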

Information about other files

  • dataset/positive-words.txt: List of positive words.
  • dataset/negative-words.txt: List of negative words.
  • dataset/glove-seeds.txt: GloVe word vectors from StanfordNLP which match our dataset, used for seeding word embeddings.
  • Plots.ipynb: IPython notebook used to generate the plots in the report.
