StackLite: A simple dataset of Stack Overflow questions and tags

This repository shares a dataset about Stack Overflow questions. For each question, it includes:

  • Question ID
  • Creation date
  • Closed date, if applicable
  • Deletion date, if applicable
  • Score
  • Owner user ID
  • Number of answers
  • Tags

This dataset is well suited to exploring questions such as:

  • The increase or decrease in questions in each tag over time
  • Correlations among tags on questions
  • Which tags tend to get higher or lower scores
  • Which tags tend to be asked on weekends vs weekdays
  • Rates of question closure or deletion over time
  • The speed at which questions are closed or deleted

This is all public data from the Stack Exchange Data Dump, which is much more comprehensive (it includes question and answer text) but also takes much more computation to download and process. This dataset is designed to be easy to read in and start analyzing. Similarly, the data can be examined in the Stack Exchange Data Explorer, but this dataset lets analysts work with it locally using their tool of choice.

Status

This dataset was extracted from the Stack Overflow database at 2017-04-06 16:39:26 UTC and contains questions up to 2017-04-05. This includes 13,629,741 non-deleted questions and 4,133,745 deleted ones. (The script for downloading the data can be found in setup-data.R, though it can be run only by Stack Overflow employees with database access.)
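
As a quick sanity check on those counts, deleted questions can be told apart by a non-missing DeletionDate. The sketch below uses a made-up five-row table (toy_questions is a hypothetical stand-in, not the real data), so only the technique carries over:

```r
library(dplyr)

# Toy stand-in for questions.csv.gz; the values are invented for illustration
toy_questions <- tibble(
  Id = 1:5,
  DeletionDate = as.POSIXct(
    c("2011-03-28 00:53:47", NA, NA, "2015-02-11 08:26:40", NA),
    tz = "UTC"
  )
)

# A question counts as deleted when DeletionDate is not missing
deletion_counts <- toy_questions %>%
  count(Deleted = !is.na(DeletionDate))

deletion_counts
```

Run against the full questions table, the two counts should add up to its total number of rows.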

Examples in R

The dataset is provided as csv.gz files, which means you can use almost any language or statistical tool to process it. Here I'll share some examples of an analysis in R.

The question data and the question-tag pairings are stored separately. You can read in the dataset with:

library(readr)
library(dplyr)

questions <- read_csv("questions.csv.gz")
question_tags <- read_csv("question_tags.csv.gz")
questions
## # A tibble: 17,763,486 × 7
##       Id        CreationDate          ClosedDate        DeletionDate Score
##    <int>              <dttm>              <dttm>              <dttm> <int>
## 1      1 2008-07-31 21:26:37                <NA> 2011-03-28 00:53:47     1
## 2      4 2008-07-31 21:42:52                <NA>                <NA>   472
## 3      6 2008-07-31 22:08:08                <NA>                <NA>   210
## 4      8 2008-07-31 23:33:19 2013-06-03 04:00:25 2015-02-11 08:26:40    42
## 5      9 2008-07-31 23:40:59                <NA>                <NA>  1452
## 6     11 2008-07-31 23:55:37                <NA>                <NA>  1154
## 7     13 2008-08-01 00:42:38                <NA>                <NA>   464
## 8     14 2008-08-01 00:59:11                <NA>                <NA>   296
## 9     16 2008-08-01 04:59:33                <NA>                <NA>    84
## 10    17 2008-08-01 05:09:55                <NA>                <NA>   119
## # ... with 17,763,476 more rows, and 2 more variables: OwnerUserId <int>,
## #   AnswerCount <int>
question_tags
## # A tibble: 52,224,835 × 2
##       Id                 Tag
##    <int>               <chr>
## 1      1                data
## 2      4                  c#
## 3      4            winforms
## 4      4     type-conversion
## 5      4             decimal
## 6      4             opacity
## 7      6                html
## 8      6                 css
## 9      6                css3
## 10     6 internet-explorer-7
## # ... with 52,224,825 more rows

As one example, you could find the most popular tags:

question_tags %>%
  count(Tag, sort = TRUE)
## # A tibble: 59,140 × 2
##           Tag       n
##         <chr>   <int>
## 1  javascript 1712324
## 2        java 1614786
## 3         php 1406127
## 4          c# 1356681
## 5     android 1327680
## 6      jquery 1035978
## 7      python  898647
## 8        html  804340
## 9         ios  652484
## 10        c++  645197
## # ... with 59,130 more rows
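
The same pairing of tags and questions could be used to see which tags tend to get higher or lower scores. This is a minimal sketch on invented toy tables (toy_questions and toy_question_tags are hypothetical stand-ins for the two files), assuming the Id, Score, and Tag columns shown above:

```r
library(dplyr)

# Toy stand-ins for the two files; values are invented for illustration
toy_questions <- tibble(Id = c(1L, 2L, 3L), Score = c(10L, 2L, 6L))
toy_question_tags <- tibble(
  Id  = c(1L, 1L, 2L, 3L),
  Tag = c("r", "dplyr", "r", "dplyr")
)

# Attach each question's score to its tags, then average within each tag
tag_scores <- toy_question_tags %>%
  inner_join(toy_questions, by = "Id") %>%
  group_by(Tag) %>%
  summarize(Questions = n(), MeanScore = mean(Score)) %>%
  arrange(desc(MeanScore))

tag_scores
```

On the full data it would probably make sense to keep only tags with a reasonable number of questions before ranking them, since averages for rare tags are noisy; the cutoff is a judgment call.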

Or plot the number of questions asked per week:

library(ggplot2)
library(lubridate)

questions %>%
  count(Week = round_date(CreationDate, "week")) %>%
  ggplot(aes(Week, n)) +
  geom_line()

[Plot: number of questions asked per week]
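
The creation timestamps can also show which tags tend to be asked on weekends vs weekdays. A minimal sketch on invented toy tables (hypothetical stand-ins for the real files); it assumes the timestamps are in UTC, so "weekend" here means a UTC Saturday or Sunday, not the asker's local time:

```r
library(dplyr)
library(lubridate)

# Toy stand-ins for the two files; values are invented for illustration
toy_questions <- tibble(
  Id = 1:4,
  CreationDate = as.POSIXct(c(
    "2008-08-01 12:00:00",  # Friday
    "2008-08-02 12:00:00",  # Saturday
    "2008-08-03 12:00:00",  # Sunday
    "2008-08-04 12:00:00"   # Monday
  ), tz = "UTC")
)
toy_question_tags <- tibble(
  Id  = c(1L, 2L, 3L, 4L, 4L),
  Tag = c("java", "r", "r", "java", "r")
)

# With week_start = 1, Monday is 1, so Saturday and Sunday are 6 and 7
weekend_share <- toy_question_tags %>%
  inner_join(toy_questions, by = "Id") %>%
  mutate(Weekend = wday(CreationDate, week_start = 1) >= 6) %>%
  group_by(Tag) %>%
  summarize(Questions = n(), WeekendShare = mean(Weekend))

weekend_share
```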

Or you could compare the growth of particular tags over time:

library(lubridate)

tags <- c("c#", "javascript", "python", "r")

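# Total number of questions asked in each year
q_per_year <- questions %>%
  count(Year = year(CreationDate)) %>%
  rename(YearTotal = n)

# Questions per year for each selected tag, alongside the yearly total
tags_per_year <- question_tags %>%
  filter(Tag %in% tags) %>%
  inner_join(questions, by = "Id") %>%
  count(Year = year(CreationDate), Tag) %>%
  inner_join(q_per_year, by = "Year")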

ggplot(tags_per_year, aes(Year, n / YearTotal, color = Tag)) +
  geom_line() +
  scale_y_continuous(labels = scales::percent_format()) +
  ylab("% of Stack Overflow questions with this tag")

[Plot: share of Stack Overflow questions with each tag, by year]
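
Rates of question closure over time, and the speed at which questions are closed, could be approached in a similar way. A minimal sketch on an invented toy table (toy_questions is a hypothetical stand-in), assuming the CreationDate and ClosedDate columns shown earlier:

```r
library(dplyr)
library(lubridate)

# Toy stand-in for questions.csv.gz; values are invented for illustration
toy_questions <- tibble(
  Id = 1:4,
  CreationDate = as.POSIXct(c(
    "2008-07-31 23:33:19", "2008-08-01 00:42:38",
    "2009-01-10 08:00:00", "2009-03-05 09:30:00"
  ), tz = "UTC"),
  ClosedDate = as.POSIXct(c(
    "2008-08-02 23:33:19", NA,
    "2009-01-10 20:00:00", NA
  ), tz = "UTC")
)

# Share of questions closed, and median hours from creation to closure,
# grouped by the year in which the question was asked
closure_by_year <- toy_questions %>%
  mutate(
    HoursToClose = as.numeric(difftime(ClosedDate, CreationDate, units = "hours"))
  ) %>%
  group_by(Year = year(CreationDate)) %>%
  summarize(
    Questions = n(),
    ClosedShare = mean(!is.na(ClosedDate)),
    MedianHoursToClose = median(HoursToClose, na.rm = TRUE)
  )

closure_by_year
```

Swapping ClosedDate for DeletionDate would give the corresponding figures for deletion.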
