  • Stars 7
  • Rank 2,294,772 (Top 46 %)
  • Language Python
  • Created over 9 years ago
  • Updated over 9 years ago

Repository Details

The Rotten Tomatoes movie review corpus is a collection of movie reviews gathered by Pang and Lee [2]. The corpus was analysed in [3], where each sentence is parsed into its tree structure and each node is assigned a fine-grained sentiment label from 1 to 5, representing very negative, negative, neutral, positive and very positive respectively. In this paper we use this data on ath000 phrases, and all the methods in this paper are assessed by training on a random subset of phrases (and their subphrases) containing approximately 4/5 of the data set and testing on the remaining 1/5. The idea is to use non-associative functions and the parse tree structures to modify the feature vectors.
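The 4/5 to 1/5 evaluation split described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_phrases`, the toy phrases, and the tuple layout are hypothetical and are not taken from the repository. The one point it tries to capture is that the split is made over top-level phrases, so each phrase's subphrases land on the same side as their parent.

```python
import random

# Label scheme from the description: 1 (very negative) to 5 (very positive).
SENTIMENT_LABELS = {
    1: "very negative",
    2: "negative",
    3: "neutral",
    4: "positive",
    5: "very positive",
}


def split_phrases(phrases, train_fraction=0.8, seed=0):
    """Randomly split top-level phrases into train and test parts.

    Each item carries its own subphrases, so a subphrase always ends up
    in the same part as its parent phrase.
    """
    order = list(phrases)
    random.Random(seed).shuffle(order)  # seeded for reproducibility
    cut = round(train_fraction * len(order))
    return order[:cut], order[cut:]


# Hypothetical toy data: (phrase, label, [(subphrase, label), ...]).
toy = [
    ("a gripping thriller", 5, [("gripping", 4), ("thriller", 3)]),
    ("painfully dull", 1, [("painfully", 2), ("dull", 2)]),
    ("fine but forgettable", 3, [("fine", 4), ("forgettable", 2)]),
    ("a charming surprise", 4, [("charming", 4), ("surprise", 3)]),
    ("mostly a mess", 2, [("mostly", 3), ("a mess", 2)]),
]

train, test = split_phrases(toy)
```

With five toy phrases, four go to training and one to testing. The seed is fixed only so the split can be repeated.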

More Repositories

1

Predict-financial-recession

The major goal of this project is to predict financial recession given the frequencies of the top 500 word stems in the reports of financial companies. After applying various learning models, we find that predicting financial recession from the bag of words reaches an accuracy of more than 90%. Hence, there is indeed a correlation between the two. Moreover, we compared different learning models (ensemble methods with Decision Tree, SVM, and KNN) with various parameters, using cross-validation on the training data set to find the model with a relatively high average accuracy and a low variance of accuracy. In addition, we tried several pre-processing methods (tf-idf, feature selection, and centroid-based clustering) to improve the accuracy of the learning models. In the end, the best model is Gradient Boosting with Decision Tree using the pre-processed tf-idf data set.
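As a rough illustration of the tf-idf pre-processing mentioned above, the sketch below re-weights a small matrix of word-stem counts. The description does not say which tf-idf variant the project used, so the formula here (term frequency normalised by row total, multiplied by log(N / document frequency)) is an assumption, and the function name `tfidf` and the toy counts are hypothetical.

```python
import math


def tfidf(counts):
    """Re-weight a count matrix (rows = reports, columns = word stems).

    Assumed variant: tf = count / row total, idf = log(N / df).
    A stem that appears in every report gets weight 0.
    """
    n_docs = len(counts)
    n_terms = len(counts[0])
    # Document frequency: number of reports in which each stem occurs.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]
    idf = [math.log(n_docs / d) if d else 0.0 for d in df]
    weighted = []
    for row in counts:
        total = sum(row) or 1  # guard against an all-zero row
        weighted.append([(c / total) * idf[j] for j, c in enumerate(row)])
    return weighted


# Hypothetical counts for 3 reports over 3 word stems.
counts = [
    [3, 0, 1],
    [2, 2, 0],
    [1, 0, 0],
]
weights = tfidf(counts)
```

The first stem occurs in all three reports, so its weight is zero everywhere, while stems that occur in only one report are weighted up. The re-weighted rows would then be passed to the classifier in place of the raw frequencies.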
Python
14 stars

2

Visualization-of-Latent-Factors-from-Movies

The goal is to visualize and interpret two-dimensional latent features for the movies in the given data set. We are given the categorization of about 1600 movies into 19 genres, and the ratings that some users gave to specific movies. We applied matrix factorization to the sparse ratings matrix (sparse because not every user rates every movie) to obtain latent factor matrices for the movies and the users. We then used principal component analysis (PCA) on the latent factor matrix of the movies and projected each movie onto the two strongest latent factors. Finally, we visualized and interpreted each category and compared the category averages of the two major latent factors.
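The matrix factorization step described above can be sketched as follows. The description does not state which optimizer or hyperparameters the project used, so stochastic gradient descent with L2 regularisation, the names `factorize` and `rmse`, and all parameter values are assumptions chosen for illustration. Only observed ratings are visited, which is how the sparsity of the ratings matrix is handled here. The later PCA projection is not shown.

```python
import random


def factorize(ratings, n_users, n_movies, k=2, lr=0.01, reg=0.02,
              epochs=500, seed=0):
    """Fit user factors P and movie factors Q so that P[u] . Q[i] ~ rating.

    ratings: list of (user, movie, rating) triples for observed entries only.
    """
    rng = random.Random(seed)
    P = [[rng.random() * 0.5 for _ in range(k)] for _ in range(n_users)]
    Q = [[rng.random() * 0.5 for _ in range(k)] for _ in range(n_movies)]
    for _ in range(epochs):
        for u, i, r in ratings:
            pu, qi = P[u], Q[i]
            err = r - sum(a * b for a, b in zip(pu, qi))
            for f in range(k):
                puf, qif = pu[f], qi[f]  # keep old values for both updates
                pu[f] += lr * (err * qif - reg * puf)
                qi[f] += lr * (err * puf - reg * qif)
    return P, Q


def rmse(ratings, P, Q):
    """Root-mean-square error over the observed ratings."""
    sq = [(r - sum(a * b for a, b in zip(P[u], Q[i]))) ** 2
          for u, i, r in ratings]
    return (sum(sq) / len(sq)) ** 0.5


# Hypothetical toy data: 4 users, 4 movies, 10 observed ratings.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 3, 1),
    (1, 0, 4), (1, 2, 1),
    (2, 1, 1), (2, 2, 5), (2, 3, 4),
    (3, 0, 1), (3, 3, 5),
]

P0, Q0 = factorize(ratings, 4, 4, epochs=0)  # untrained starting point
P, Q = factorize(ratings, 4, 4)
before, after = rmse(ratings, P0, Q0), rmse(ratings, P, Q)
```

Each row of `Q` is a latent vector for one movie. With more than two factors, these rows would be the input to the PCA step that projects each movie onto two dimensions.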
Python
1 star