  • Stars: 939
  • Rank 48,667 (Top 1.0 %)
  • Language
  • License
    Creative Commons ...
  • Created over 6 years ago
  • Updated about 1 year ago

Repository Details

A list of Twitter datasets and related resources.

awesome-twitter-data

Awesome CC0

A list of Twitter datasets and related resources, released under CC0. If you have a resource to add to the list, feel free to open a pull request, or email me at [email protected].

The license, when known, is given in {curly brackets}. Dataset size is given in [square brackets] when available.
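As a rough illustration of this convention, the sketch below splits an entry line into its parts. It assumes entries look like `Name {license} [size] - description`, with the license, size and description all optional; the pattern `ENTRY_RE` and the helper `parse_entry` are hypothetical names made up for this example and are not part of the list itself.

```python
import re

# Hypothetical pattern for "Name {license} [size] - description".
# Every part after the name is optional.
ENTRY_RE = re.compile(
    r"^(?P<name>[^{\[]+?)"                 # name, up to the first brace or bracket
    r"(?:\s*\{(?P<license>[^}]*)\})?"      # optional {license}
    r"(?:\s*\[(?P<size>[^\]]*)\])?"        # optional [size]
    r"(?:\s+-\s+(?P<description>.*))?$"    # optional " - description"
)


def parse_entry(line):
    """Return a dict of the entry's fields, or None if the line does not match."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    # Missing optional parts come back as None.
    return {key: (value.strip() if value else None)
            for key, value in match.groupdict().items()}


entry = parse_entry("Sanders Analytics {?} [5k] - Hand-classified tweets.")
print(entry["license"], entry["size"])
```

The dash separator must be surrounded by whitespace, so hyphenated names such as `Weather-sentiment` are kept whole.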

1   Twitter Datasets

1.1   Tweet datasets

1.1.1   Tweet ID datasets

1.2   Tweet datasets (labelled)

  • Sentiment140 - Automatically labelled; the authors assume that any tweet with a positive emoticon, like :), is positive, and that any tweet with a negative emoticon, like :(, is negative.
  • Weather-sentiment
  • Crowdflower Gender Classifier Data [20k] - Contributors were asked to simply view a Twitter profile and judge whether the user was a male, a female, or a brand (non-individual). The dataset contains 20,000 rows, each with a user name, a random tweet, account profile and image, location, and even link and sidebar color.
  • Sanders Analytics {?} [5k] - Use the Internet Archive's Wayback Machine to get the data. The dataset consists of 5,513 hand-classified tweets. Each tweet was classified with respect to one of four different topics.
  • Geoparse Benchmark Open Dataset {BSD-4_Clause} [?] - The geoparsing benchmark dataset contains thousands of tweets recorded during four different natural disasters: Hurricane Sandy 2012, Milan Blackouts 2013, Turkish Earthquake 2012 and the Christchurch Earthquake 2012. Each tweet in the dataset has been manually labelled with location entries at the building, street and region levels to provide a gold standard for evaluation work. The data consists of the full JSON serialized tweet metadata (i.e. including text) with an additional ‘entities’ field of type ‘mentions’ for the ground truth location annotations.
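The emoticon-based labelling described for Sentiment140 can be sketched in a few lines. This is only an illustration of the general idea, not the dataset authors' actual procedure: the emoticon lists, the function name `emoticon_label`, and the choice to leave tweets with mixed or no emoticons unlabelled are all assumptions made for this example.

```python
# Illustrative emoticon lists; the real dataset may use different ones.
POSITIVE_EMOTICONS = (":)", ":-)", ":D")
NEGATIVE_EMOTICONS = (":(", ":-(")


def emoticon_label(text):
    """Label a tweet by the emoticons it contains.

    Returns "positive", "negative", or None when the tweet has no
    emoticon or contains both kinds (an assumption of this sketch).
    """
    has_positive = any(emoticon in text for emoticon in POSITIVE_EMOTICONS)
    has_negative = any(emoticon in text for emoticon in NEGATIVE_EMOTICONS)
    if has_positive == has_negative:
        return None  # nothing to go on, or conflicting signals
    return "positive" if has_positive else "negative"


for tweet in ["Great game :)", "Missed my bus :(", "Just landed"]:
    print(tweet, "->", emoticon_label(tweet))
```

Labels produced this way are noisy, since an emoticon does not always reflect the sentiment of the whole tweet.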

1.3   User datasets

1.4   Lost Datasets

2   Other Lists

3   Tools

3.1   Data Collection

3.2   Analysis

4   Academic Papers

  • Learning Multiview Embeddings of Twitter Users

4.1   Demographics Prediction

  • Developing Age and Gender Predictive Lexica over Social Media, 2014 - We derive predictive lexica (words and weights) for age and gender using regression and classification models from word usage in Facebook, blog, and Twitter data with associated demographic labels. The lexica, made publicly available, achieved state-of-the-art accuracy in language-based age and gender prediction over Facebook and Twitter, and were evaluated for generalization across social media genres as well as in limited message situations.
  • Predicting the Demographics of Twitter Users from Website Traffic Data
  • Inferring Perceived Demographics from User Emotional Tone and User-Environment Emotional Contrast
  • Mining User Interests to Predict Perceived Psycho-Demographic Traits on Twitter
  • Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment
  • Who Tweets? Deriving the Demographic Characteristics of Age, Occupation and Social Class from Twitter User Meta-Data

5   Articles & blog posts

6   Contributing

  • Please check for duplicates first.
  • Keep descriptions short, simple and unbiased.
  • Please make an individual commit for each suggestion.
  • Add a new category if needed.
  • For datasets, please keep the format when possible: The license, when known, is given in {curly brackets}. Dataset size is given in [square brackets] when available.

Thank you for your suggestions!

7   License

CC0

To the extent possible under law, Shay Palachy has waived all copyright and related or neighboring rights to this work.
