Repository Details

Tutorial for Sentiment Analysis using Doc2Vec in gensim (or "getting 87% accuracy in sentiment analysis in under 100 lines of code")

Sentiment Analysis using Doc2Vec

Word2Vec is dope. In short, it takes in a corpus and churns out a vector for each word in it. What's so special about these vectors, you ask? Well, similar words end up near each other. Furthermore, the vectors capture how we use the words. For example, v_man - v_woman is approximately equal to v_king - v_queen, illustrating the relationship "man is to woman as king is to queen". This process, in NLP voodoo, is called word embedding, and these representations have been applied widely. Doc2Vec makes this even more awesome by representing not only words but entire sentences and documents. Imagine being able to represent an entire sentence as a fixed-length vector and then run all your standard classification algorithms on it. Isn't that amazing?
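To make the "man is to woman as king is to queen" arithmetic concrete, here is a minimal sketch in plain Python. The 2-D vectors are hand-made toy values (one axis for "royalty", one for "gender"), not output from a trained Word2Vec model, and the helper names are made up for illustration.

```python
import math

# Hand-made 2-D toy vectors: axis 0 is "royalty", axis 1 is "gender".
# These are NOT real Word2Vec output; they only illustrate the arithmetic.
VECTORS = {
    "man": (0.0, 1.0),
    "woman": (0.0, -1.0),
    "king": (1.0, 1.0),
    "queen": (1.0, -1.0),
    "apple": (-1.0, 0.0),
}


def add(a, b):
    """Element-wise sum of two vectors."""
    return tuple(x + y for x, y in zip(a, b))


def sub(a, b):
    """Element-wise difference of two vectors."""
    return tuple(x - y for x, y in zip(a, b))


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def analogy(a, b, c):
    """Return the word d such that a is to b as c is to d."""
    target = add(sub(VECTORS[b], VECTORS[a]), VECTORS[c])
    # As in the usual analogy test, the three query words are excluded.
    candidates = [w for w in VECTORS if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(VECTORS[w], target))


# The two difference vectors match exactly in this toy setup.
print(sub(VECTORS["man"], VECTORS["woman"]) == sub(VECTORS["king"], VECTORS["queen"]))
print(analogy("man", "woman", "king"))
```

With real embeddings the two difference vectors are only approximately equal, which is why the lookup uses nearest-neighbour search by cosine similarity rather than an exact match.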

However, the Word2Vec documentation is shit. The C code is nigh unreadable (700 lines of highly optimized, and sometimes weirdly optimized, code). I personally spent a lot of time untangling Doc2Vec and crashing into ~50% accuracies, which is no better than guessing on a two-class problem, due to implementation mistakes. This tutorial aims to help other users get off the ground using Word2Vec for their own research. We use Doc2Vec for sentiment analysis by attempting to classify the Cornell IMDB movie review corpus (http://www.cs.cornell.edu/people/pabo/movie-review-data/). The specific data set used is available for download at http://ai.stanford.edu/~amaas/data/sentiment/.

Show Me The Code

The IPython Notebook (code + tutorial) can be found in word2vec-sentiments.ipynb.

The code that just trains Doc2Vec and saves the model as imdb.d2v can be found in run.py. It should be useful for running on computer clusters.
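For orientation, a train-and-save script could look roughly like the sketch below. It is not the contents of run.py: it assumes gensim 4.x (where the relevant classes are `Doc2Vec` and `TaggedDocument`), and the function names, the tag scheme (`TRAIN_POS_0`, `TRAIN_POS_1`, ...) and every hyperparameter value are illustrative assumptions.

```python
def tag_lines(lines, prefix):
    """Pair each whitespace-tokenized line with a unique tag, e.g. TRAIN_POS_0."""
    return [(line.split(), f"{prefix}_{i}") for i, line in enumerate(lines)]


def train_and_save(tagged, path="imdb.d2v", epochs=10):
    """Train a Doc2Vec model on (tokens, tag) pairs and save it to `path`."""
    # Imported lazily so that tag_lines stays usable without gensim installed.
    from gensim.models.doc2vec import Doc2Vec, TaggedDocument

    documents = [TaggedDocument(words=words, tags=[tag]) for words, tag in tagged]

    # Illustrative hyperparameters (assumptions, not tuned values).
    model = Doc2Vec(vector_size=100, window=10, min_count=1, workers=4, epochs=epochs)
    model.build_vocab(documents)
    model.train(documents, total_examples=model.corpus_count, epochs=model.epochs)
    model.save(path)
    return model
```

A call such as `train_and_save(tag_lines(reviews, "TRAIN_POS"))`, where `reviews` is a list of review strings, would write imdb.d2v to the working directory. In gensim 4.x the trained document vectors can then be looked up by tag via `model.dv["TRAIN_POS_0"]` and fed to any standard classifier.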

What Does This Repo Contain

  • test-neg.txt, test-pos.txt, train-neg.txt, train-pos.txt, train-unsup.txt: Training and testing data. Explained in more detail in the notebook.
  • word2vec-sentiments.ipynb: The notebook (code + tutorial)
  • run.py: Just the code

License

Copyright (c) 2015 Linan Qiu

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
