  • Created over 8 years ago
  • Updated almost 7 years ago

Repository Details

Dataset of threads and comments from reddit

Reddit Comment and Thread Data

Around 260,000 threads / comments scraped from Reddit. Useful dataset for NLP projects.

Quick Start

Scraped using omega-red

The CSV files are named <metareddit>_<subreddit>.csv. The headers are described below and in headers.txt.
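The naming scheme can be taken apart programmatically. The following is a minimal sketch, not part of the repository: the function name parse_dataset_filename and the example file name are hypothetical, and it assumes metareddit names never contain an underscore, so the first underscore is treated as the separator.

```python
from pathlib import Path


def parse_dataset_filename(filename: str) -> tuple[str, str]:
    """Split '<metareddit>_<subreddit>.csv' into (metareddit, subreddit).

    Assumption: the metareddit name contains no underscore, so the first
    underscore is the separator; subreddit names may contain underscores.
    """
    stem = Path(filename).stem  # drop the '.csv' extension
    meta, sep, subreddit = stem.partition("_")
    if not sep or not meta or not subreddit:
        raise ValueError(f"unexpected file name: {filename!r}")
    return meta, subreddit


# Hypothetical file name built from the gaming / leagueoflegends example.
print(parse_dataset_filename("gaming_leagueoflegends.csv"))
```

If a metareddit name does turn out to contain an underscore, the split would need a list of known metareddits instead.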

Headers are:

text,id,subreddit,meta,time,author,ups,downs,authorlinkkarma,authorkarma,authorisgold
  • text: Text of the comment / thread
  • id: Unique reddit id for the comment / thread
  • subreddit: Subreddit that the comment / thread belongs to
  • meta: Metareddit that the comment / thread belongs to. Subreddits are grouped into metareddits. For example, the subreddit leagueoflegends belongs to the metareddit gaming, which also includes the subreddit dota2
  • time: UNIX timestamp of the comment / thread
  • author: Username of the author of the comment / thread
  • ups: Number of upvotes the comment / thread received
  • downs: Number of downvotes the comment / thread received
  • authorlinkkarma: The author's link karma (see the Reddit FAQ on link karma)
  • authorkarma: The author's karma (see the Reddit FAQ on karma)
  • authorisgold: Boolean indicator for the gold status of the user: 1 for gold users, 0 for non-gold (normal) users (see the Reddit FAQ on gold status)
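As a rough illustration of how the columns above might be read, here is a sketch using only the Python standard library. The sample rows are made up for the example and are not taken from the dataset. The helper parse_row is hypothetical, and it assumes that the numeric columns hold plain integers and that time is expressed in seconds.

```python
import csv
import io
from datetime import datetime, timezone

# Made-up sample rows in the documented column order; not real dataset content.
SAMPLE = """text,id,subreddit,meta,time,author,ups,downs,authorlinkkarma,authorkarma,authorisgold
gg wp,c1,leagueoflegends,gaming,1451606400,alice,5,0,10,200,1
nice play !,c2,dota2,gaming,1451610000,bob,2,1,0,15,0
"""


def parse_row(row: dict) -> dict:
    """Convert the string fields of one CSV row to typed Python values."""
    parsed = dict(row)
    for key in ("ups", "downs", "authorlinkkarma", "authorkarma"):
        parsed[key] = int(row[key])
    # 'time' is a UNIX timestamp; assumed here to be in seconds.
    parsed["time"] = datetime.fromtimestamp(float(row["time"]), tz=timezone.utc)
    # 'authorisgold' is documented as 1 for gold users and 0 otherwise.
    parsed["authorisgold"] = row["authorisgold"] == "1"
    return parsed


rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
for r in rows:
    print(r["time"].isoformat(), r["subreddit"], r["ups"] - r["downs"], r["authorisgold"])
```

For a real file, the io.StringIO object would be replaced by a file opened with open(path, newline="", encoding="utf-8"); the encoding is an assumption.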

Original Texts

If you prefer the texts with punctuation, they are included (as the original output files from omega-red) at https://mega.nz/#F!NtsCGTgD!urXdXLJ6yITYdWEdWN-H1w

Threads

Threads are in /threads.csv at https://mega.nz/#F!NtsCGTgD!urXdXLJ6yITYdWEdWN-H1w

Headers are:

'text', 'title', 'url', 'id', 'subreddit', 'meta', 'time', 'author', 'ups', 'downs', 'authorlinkkarma', 'authorcommentkarma', 'authorisgold'
  • text: text of the thread
  • title: title of the thread
  • url: url of the thread
  • id: unique ID of the thread
  • subreddit: subreddit that the thread belongs to
  • meta: meta tag assigned to the subreddit of the thread in config.json
  • time: timestamp of the thread
  • author: username of the author of the thread
  • ups: number of ups the thread has received
  • downs: number of downs the thread has received
  • authorlinkkarma: the author's link karma
  • authorcommentkarma: the author's comment karma
  • authorisgold: 1 if the author has gold status, 0 otherwise

Comments

Comments are in comments.csv at https://mega.nz/#F!NtsCGTgD!urXdXLJ6yITYdWEdWN-H1w

Headers are (different from threads.csv):

'text', 'id', 'subreddit', 'meta', 'time', 'author', 'ups', 'downs', 'authorlinkkarma', 'authorcommentkarma', 'authorisgold'
  • text: text of the comment
  • id: unique ID of the comment
  • subreddit: subreddit that the comment belongs to
  • meta: meta tag assigned to the subreddit of the comment in config.json
  • time: timestamp of the comment
  • author: username of the author of the comment
  • ups: number of ups the comment has received
  • downs: number of downs the comment has received
  • authorlinkkarma: the author's link karma
  • authorcommentkarma: the author's comment karma
  • authorisgold: 1 if the author has gold status, 0 otherwise
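The difference between the two header sets can be checked mechanically. The sketch below copies the two header lists from this README. The helper kind_of is hypothetical and simply classifies a file by its header row.

```python
THREAD_HEADERS = [
    "text", "title", "url", "id", "subreddit", "meta", "time", "author",
    "ups", "downs", "authorlinkkarma", "authorcommentkarma", "authorisgold",
]
COMMENT_HEADERS = [
    "text", "id", "subreddit", "meta", "time", "author",
    "ups", "downs", "authorlinkkarma", "authorcommentkarma", "authorisgold",
]

# Columns that exist only in threads.csv.
thread_only = [h for h in THREAD_HEADERS if h not in COMMENT_HEADERS]
print(thread_only)


def kind_of(fieldnames: list) -> str:
    """Guess whether a header row belongs to threads.csv or comments.csv."""
    if list(fieldnames) == THREAD_HEADERS:
        return "thread"
    if list(fieldnames) == COMMENT_HEADERS:
        return "comment"
    return "unknown"


print(kind_of(COMMENT_HEADERS))
```

The header row of a file read with csv.DictReader is available as reader.fieldnames, which can be passed to kind_of directly.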

All text is normalized to lower case, tokenized using a TreebankTokenizer from the natural library, then joined with spaces. As a result, punctuation is separated from words, which is the desired effect.
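To give a feel for what this normalization produces, here is a deliberately simplified sketch. It is not the tokenizer used to build the dataset: the regular expression below is a stand-in that only lower-cases the text and splits punctuation from words, whereas a real Treebank tokenizer also handles contractions, quotes, and other cases. The function name normalize is hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lower-case text and separate punctuation from words with spaces.

    Simplified stand-in for Treebank-style tokenization: each run of word
    characters is one token and each punctuation character is its own token.
    """
    tokens = re.findall(r"\w+|[^\w\s]", text.lower())
    return " ".join(tokens)


normalized = normalize("Hello, world! Nice play.")
print(normalized)          # hello , world ! nice play .
print(normalized.split())  # ['hello', ',', 'world', '!', 'nice', 'play', '.']
```

Because the stored text is already joined with single spaces, splitting a text field on whitespace should recover the tokens without any further tokenization.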


The MIT License (MIT)

Copyright (c) 2016, Linan Qiu

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
