• This repository has been archived on 19/Aug/2020
  • Stars: 521
  • Rank: 84,391 (Top 2%)
  • Language: Jupyter Notebook
  • License: Other
  • Created over 9 years ago
  • Updated about 8 years ago

Repository Details

Sentiment Analysis Challenge

sunny-side-up

Lab41's foray into Sentiment Analysis with Deep Learning. In addition to checking out the source code, visit the Wiki for Learning Resources and possible Conferences to attend.

Try them, try them, and you may! Try them and you may, I say.

Blog Overviews

  • Can Word Vectors Help Predict Whether Your Chinese Tweet Gets Censored? (March 2016)
  • One More Reason Not To Be Scared of Deep Learning (March 2016)
  • Some Tips for Debugging in Deep Learning (January 2016)
  • Faster On-Ramp to Deep Learning With Jupyter-driven Docker Containers (November 2015)
  • A Tour of Sentiment Analysis Techniques: Getting a Baseline for Sunny Side Up (November 2015)
  • Learning About Deep Learning! (September 2015)

Docker Environments

  • lab41/itorch-[cpu|cuda]: iTorch IPython kernel for Torch scientific computing GPU framework
  • lab41/keras-[cpu|cuda|cuda-jupyter]: Keras neural network library (CPU or GPU backend from command line or within Jupyter notebook)
  • lab41/neon-[cuda|cuda7.5]: neon Deep Learning framework (with CUDA backend) by Nervana
  • lab41/pylearn2: pylearn2 machine learning research library
  • lab41/sentiment-ml: build word vectors (Word2Vec from gensim; GloVe from glove-python), tokenize Chinese text (jieba and pypinyin), and tokenize Arabic text (NLTK and Stanford Parser)
  • lab41/mechanical-turk: convert CSV of Arabic tweets to individual PNG images for each Tweet (to avoid machine-translation of text) and auto-submit/score Arabic sentiment survey via AWS Mechanical Turk

Binary Classification with Word Vectors

Execution

python -m benchmarks.baseline_classifiers

Word Vector Models

model | filename | filesize | vocabulary | details
--- | --- | --- | --- | ---
Sentiment140 | sentiment140_800000.bin | 153M | 83,586 | gensim Word2Vec(size=200, window=5, min_count=10)
Open Weiboscope | openweibo_fullset_hanzi_CLEAN_vocab31357747.bin | 56G | 31,357,746 | jieba-tokenized Hanzi Word2Vec(size=200, window=5, min_count=1)
Open Weiboscope | openweibo_fullset_min10_hanzi_vocab2548911.bin | 4.6G | 2,548,911 | jieba-tokenized Hanzi Word2Vec(size=200, window=5, min_count=10)
Arabic Tweets | arabic_tweets_min10vocab_vocab1520226.bin | 1.2G | 1,520,226 | Stanford Parser-tokenized Word2Vec(size=200, window=5, min_count=10)
Arabic Tweets | arabic_tweets_NLTK_min10vocab_vocab981429.bin | 759M | 981,429 | NLTK-tokenized Word2Vec(size=200, window=5, min_count=10)
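
The min_count setting accounts for much of the size gap between the two Open Weiboscope models: with min_count=1 every token is kept, while min_count=10 drops tokens seen fewer than ten times. Below is a minimal sketch of that pruning step in plain Python. The helper name build_vocab and the toy corpus are hypothetical; gensim performs this filtering internally when it builds its vocabulary, and this sketch only mirrors the effect.

```python
from collections import Counter

def build_vocab(sentences, min_count=1):
    """Return {token: count} for tokens seen at least min_count times.

    Hypothetical helper for illustration; it mirrors the effect of the
    min_count parameter, not gensim's actual implementation.
    """
    counts = Counter(token for sentence in sentences for token in sentence)
    return {token: n for token, n in counts.items() if n >= min_count}

# Toy corpus (made up): each sentence is a list of tokens.
corpus = [
    ["good", "morning", "sunshine"],
    ["good", "night"],
    ["good", "morning"],
]

full_vocab = build_vocab(corpus, min_count=1)
pruned_vocab = build_vocab(corpus, min_count=2)
print(len(full_vocab), len(pruned_vocab))  # 4 2
```

Raising min_count trades coverage of rare tokens for a much smaller model file, which is the trade-off visible in the table.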

Training and Testing Data

train/test set | filename | filesize | details
--- | --- | --- | ---
Sentiment140 | sentiment140_800000_samples_[test/train].bin | 183M | 80/20 split of 1.6M emoticon-labeled Tweets
Open Weiboscope | openweibo_hanzi_censored_27622_samples_[test/train].bin | 25M | 80/20 split of 55,244 censored posts
Open Weiboscope | openweibo_800000_min1vocab_samples_[test/train].bin | 564M | 80/20 split of 1.6M deleted posts
Arabic Twitter | arabic_twitter_1067972_samples_[test/train].bin | 912M | 80/20 split of 2,135,944 emoticon-and-emoji labeled Tweets
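
Each row above is an 80/20 train/test split. How the repository performs the split is not shown here, so the following is only a sketch of one common approach: shuffle with a fixed seed, then cut at 80%. The function name split_train_test, the seed, and the toy data are assumptions for illustration.

```python
import random

def split_train_test(samples, train_fraction=0.8, seed=0):
    """Shuffle a copy of samples deterministically and split it in two."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy labeled data (made up): (text, label) pairs, label 1 = positive.
data = [("tweet %d" % i, i % 2) for i in range(10)]
train, test = split_train_test(data)
print(len(train), len(test))  # 8 2
```

Fixing the seed keeps the split reproducible, so results from different classifiers stay comparable across runs.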

Binary Classification via Deep Learning

CNN (Convolutional Neural Network)

Character-by-character processing, following Zhang and LeCun's "Text Understanding from Scratch":

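# Written against the legacy Keras 0.x API (channels-first input)
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten
from keras.layers.convolutional import Convolution2D, MaxPooling2D
from keras.optimizers import SGD

# Sizes of the final fully connected layers
fully_connected = [1024, 1024, 1]

model = Sequential()

# Input = 1 x 67 x 1014 (channels x alphabet size x characters per sample)
model.add(Convolution2D(256, 67, 7, input_shape=(1, 67, 1014)))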
model.add(MaxPooling2D(pool_size=(1,3)))

#Input = 336 x 256
model.add(Convolution2D(256,1,7))
model.add(MaxPooling2D(pool_size=(1,3)))

#Input = 110 x 256
model.add(Convolution2D(256,1,3))

#Input = 108 x 256
model.add(Convolution2D(256,1,3))

#Input = 106 x 256
model.add(Convolution2D(256,1,3))

#Input = 104 x 256
model.add(Convolution2D(256,1,3))
model.add(MaxPooling2D(pool_size=(1,3)))

model.add(Flatten())

#Fully Connected Layers

#Input is 8704 (34 positions x 256 feature maps after the last pooling layer) Output is 1024
model.add(Dense(fully_connected[0]))
model.add(Dropout(0.5))
model.add(Activation('relu'))

#Input is 1024 Output is 1024
model.add(Dense(fully_connected[1]))
model.add(Dropout(0.5))
model.add(Activation('relu'))

#Input is 1024 Output is 1
model.add(Dense(fully_connected[2]))
model.add(Activation('sigmoid'))

#Stochastic gradient parameters as set by paper
sgd = SGD(lr=0.01, decay=1e-5, momentum=0.9, nesterov=True)
model.compile(loss='binary_crossentropy', optimizer=sgd, class_mode="binary")
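
The "#Input = ..." comments in the listing above can be checked with a little arithmetic: an unpadded convolution of width k shortens a length-n sequence to n - k + 1, and non-overlapping pooling of width p reduces it to floor(n / p). The sketch below traces the 1014-character input through the six convolution layers. The helper names are hypothetical and the code does not depend on Keras.

```python
def conv_length(n, kernel):
    """Length after an unpadded ('valid') convolution with stride 1."""
    return n - kernel + 1

def pool_length(n, pool):
    """Length after non-overlapping max pooling."""
    return n // pool

# (kernel width, pool width or None) for the six convolution layers above.
layers = [(7, 3), (7, 3), (3, None), (3, None), (3, None), (3, 3)]

length = 1014  # characters per sample
trace = []
for kernel, pool in layers:
    length = conv_length(length, kernel)
    if pool is not None:
        length = pool_length(length, pool)
    trace.append(length)

flattened = length * 256  # 256 feature maps per position
print(trace)      # [336, 110, 108, 106, 104, 34]
print(flattened)  # 8704
```

The final value matches the 8704 inputs expected by the first fully connected layer.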

LSTM (Long Short Term Memory)

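# NOTE: this listing builds an embedding layer followed by a 2D convolution stack; it contains no LSTM layer.
# Written against an early Keras 0.x API.
from keras.models import Sequential
from keras.layers.core import Dense, Dropout, Activation, Flatten, Reshape
from keras.layers.embeddings import Embedding
from keras.layers.convolutional import Convolution2D, MaxPooling2D

# Hyperparameters such as max_features, embedding_size, maxlen, nb_feature_maps, filter_size_row, filter_size_col and fully_connected_size are assumed to be defined beforehand.
# The shape comments below use maxlen = 100, embedding_size = 256 and nb_feature_maps = 32.

# initialize the neural net and reshape the data
model = Sequential()
model.add(Embedding(max_features, embedding_size)) # embed into dense 3D float tensor (samples, maxlen, embedding_size)
model.add(Reshape(1, maxlen, embedding_size)) # reshape into 4D tensor (samples, 1, maxlen, embedding_size)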

# convolution stack
# second argument is the input stack size: 1, because the Reshape above yields a single channel
# border_mode='full' grows each axis by (filter size - 1): 32 x (maxlen + filter_size_row - 1) x (embedding_size + filter_size_col - 1)
model.add(Convolution2D(nb_feature_maps, 1, filter_size_row, filter_size_col, border_mode='full'))
model.add(Activation('relu'))

# convolution stack with regularization
# 'full' again: 32 x (maxlen + 2*(filter_size_row - 1)) x (embedding_size + 2*(filter_size_col - 1))
model.add(Convolution2D(nb_feature_maps, nb_feature_maps, filter_size_row, filter_size_col, border_mode='full'))
model.add(Activation('relu'))
model.add(MaxPooling2D(poolsize=(2, 2))) # halves each axis: 32 x (maxlen/2 + filter_size_row - 1) x (embedding_size/2 + filter_size_col - 1)
model.add(Dropout(0.25))

# convolution stack with regularization
# default 'valid' mode shrinks each axis by (filter size - 1): 32 x maxlen/2 x embedding_size/2 (32 x 50 x 128)
model.add(Convolution2D(nb_feature_maps, nb_feature_maps, filter_size_row, filter_size_col))
model.add(Activation('relu'))
model.add(MaxPooling2D(poolsize=(2, 2))) # 32 x maxlen/2/2 x embedding_size/2/2 (32 x 25 x 64)
model.add(Dropout(0.25))

# fully-connected layer
model.add(Flatten())
# integer division (//) keeps the layer size an int on both Python 2 and Python 3
model.add(Dense(nb_feature_maps * (maxlen//2//2) * (embedding_size//2//2), fully_connected_size))
model.add(Activation("relu"))
model.add(Dropout(0.50))

# output classifier
model.add(Dense(fully_connected_size, 1))
model.add(Activation("sigmoid"))

# try using different optimizers and different optimizer configs
model.compile(loss='binary_crossentropy', optimizer='adam', class_mode="binary")
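
The first fully connected layer above expects nb_feature_maps * (maxlen/2/2) * (embedding_size/2/2) inputs. A 'full' convolution grows each axis by filter size minus one, while the default unpadded mode shrinks it by the same amount, so the two 'full' convolutions ahead of the first pooling step are offset by the unpadded convolution after it. The sketch below checks that arithmetic with maxlen = 100 and embedding_size = 256 (the values in the shape comments). The filter sizes tried are illustrative assumptions, since the listing does not state them, and the helper names are hypothetical.

```python
def conv_size(n, kernel, border_mode="valid"):
    """Output size along one axis for a stride-1 convolution."""
    if border_mode == "full":
        return n + kernel - 1
    return n - kernel + 1  # 'valid': no padding

def trace_axis(n, kernel):
    """Follow one spatial axis through the three convolution stacks."""
    n = conv_size(n, kernel, "full")   # first stack
    n = conv_size(n, kernel, "full")   # second stack ...
    n //= 2                            # ... followed by 2 x 2 max pooling
    n = conv_size(n, kernel, "valid")  # third stack ...
    n //= 2                            # ... followed by 2 x 2 max pooling
    return n

maxlen, embedding_size = 100, 256  # values from the shape comments
for kernel in (3, 5, 7):           # illustrative filter sizes (assumed)
    print(kernel, trace_axis(maxlen, kernel), trace_axis(embedding_size, kernel))
# each line ends with: 25 64
```

With both dimensions divisible by 4, every filter size tried gives 25 x 64, which is maxlen/4 x embedding_size/4 as the Dense layer assumes.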

More Repositories

1. PySEAL (C++, 225 stars): This repository is a fork of Microsoft Research's homomorphic encryption implementation, the Simple Encrypted Arithmetic Library (SEAL). This code wraps the SEAL build in a Docker container and provides Python APIs to the encryption library.
2. ipython-spark-docker (Python, 146 stars)
3. hermes (Jupyter Notebook, 124 stars): Recommender System Framework
4. cyphercat (Jupyter Notebook, 98 stars): Implementation of membership inference and model inversion attacks, extracting training data information from an ML model. Benchmarking attacks and defenses.
5. Dendrite (JavaScript, 92 stars): People. Places. Things. Graphs.
6. attalos (Jupyter Notebook, 89 stars): Joint Vector Spaces
7. Circulo (Python, 79 stars): Community Detection Research Effort
8. pythia (Jupyter Notebook, 79 stars): Supervised learning for novelty detection in text
9. survey-community-detection (70 stars): Market Survey: Community Detection
10. Magnolia (Jupyter Notebook, 45 stars)
11. pelops (Jupyter Notebook, 44 stars): The Pelops car re-ID project
12. altair (Python, 41 stars): Assessing Source Code Semantic Similarity with Unsupervised Learning
13. SkyLine (Python, 28 stars): An Exploration into Graph Databases
14. soft-boiled (Python, 28 stars): Library for Geo-Inferencing in Twitter Data
15. magichour (Jupyter Notebook, 28 stars): Security log file challenge
16. Redwood (Python, 22 stars): A project that implements statistical methods for identifying anomalous files
17. gestalt (JavaScript, 20 stars): Data storytelling. See http://lab41.github.io/gestalt for detailed documentation.
18. d-script (Jupyter Notebook, 13 stars): Writer Identification of Handwritten Documents
19. Misc (Jupyter Notebook, 11 stars): Miscellaneous utility functions
20. graph-generators (Python, 11 stars): Scripts for generating graphs in various formats.
21. lab41.github.com (HTML, 10 stars): Lab41 Blog
22. etl-by-example (Java, 10 stars)
23. VOiCES-subset (Jupyter Notebook, 9 stars)
24. MRKronecker (Java, 7 stars)
25. graphlab-twill (Java, 7 stars)
26. Hemlock (JavaScript, 7 stars): Hemlock is a way of providing a common data access layer.
27. try41 (CSS, 7 stars): try41 - a demonstration platform
28. Summer2018ML (Jupyter Notebook, 7 stars): This repository educates users on the basics of machine learning, from basic linear algebra to backward propagation.
29. Rio (Java, 7 stars): gephi <3 blueprints
30. Hemlock-Frontend (Ruby, 4 stars): Rails frontend for Hemlock
31. ganymede_nbextension (Python, 4 stars): Ganymede logging extension for the Jupyter Notebook Server
32. verboten_words (Python, 3 stars): Pre-commit hook that searches for words you do not want in your repo.
33. Hemlock-REST (Python, 3 stars): RESTful server for Lab41/Hemlock
34. Epiphyte (Java, 3 stars): Code for bulk loading data into Titan
35. Blogs (MATLAB, 2 stars): Code that is relevant to our blog posts.
36. Papers (2 stars): Lab41 Submitted Academic Paper
37. nbhub (Python, 2 stars)
38. hadoop-dev-env (Shell, 1 star)
39. mediumblog (1 star)
40. reading-group-generation-1 (1 star): Reading group summaries and resources
41. condo (Jupyter Notebook, 1 star): 🌇 Simulated codon optimized CDS dataset
42. titan-python-tutorial (Python, 1 star)