Readings in applied data science

Stats 337: Readings in Applied Data Science

Stats 337 is a small discussion class available to Stanford students in Spring 2018. Students in this class will read 3-4 papers (or equivalent) per week, write a brief response, and then discuss the papers (and related ideas) in class.

Readings

These readings reflect my personal thoughts about applied data science, and are skewed towards topics that I think are important but generally underappreciated. They are not a systematic attempt to survey the field. That said, if you think there's something major that I've missed, please feel free to submit an issue (or pull request!). These readings will evolve as the quarter goes by.

Many of the readings come from Practical Data Science for Stats, a joint PeerJ collection and special issue of The American Statistician. Jenny Bryan and I pulled this collection together in order to publish some of the important parts of data science that were previously unpublished. Other readings are blog posts because so much of applied data science is outside the comfort zone of traditional academic fields.

The development of much of this course has been driven by conversations on Twitter. A big thank you goes to everyone who has helped me out! Key threads: classroom discussion, ethics, Google Sheets, citation management.

What the *&!% is data science? (Apr 2)

In-class resources

Data collection and collaboration (Apr 9)

In-class photos

Spend 3-5 minutes filling out class feedback.

Software engineering (Apr 16)

Collaborative google doc

DevOps (Apr 23)

Collaborative google doc

Teaching (Apr 30)

Reproducibility (May 7)

Ethics (May 14)

Career (May 21)

Industry

Workflow

Annotated bibliographies

Many students in the Spring 2018 class elected to share their final annotated bibliographies.

Grading

This is a discussion-based class, so the majority of your final grade will come from your preparation for discussion (weekly 1-page responses, 30%) and your in-class participation (also 30%). This class is not meant to be self-contained, so the final component of your grade will be an annotated bibliography (40%) describing other papers that you read outside of this class. The goal of these assessments is to force you to do things that are in your own best interests, and to encourage you to learn helpful workflows that will stand you in good stead outside of this class.

I am not interested in policing excuses, so no late responses will be accepted, and absences from class will count as a zero for participation. That said, I also don't want one bad week to affect your final grade, so your lowest two scores for responses and your lowest two scores for participation will be dropped.
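
As a rough illustration of how the weights and the drop-the-lowest-two rule could combine, here is a minimal Python sketch. The 0-to-1 scoring scale, the unweighted mean of the remaining scores, the function names, and the sample student are all assumptions made for the example; the actual grading procedure is not spelled out beyond the percentages above.

```python
def drop_lowest_mean(scores, n_drop=2):
    """Mean of the scores that remain after discarding the n_drop lowest."""
    kept = sorted(scores)[n_drop:]
    if not kept:
        raise ValueError("need more scores than the number dropped")
    return sum(kept) / len(kept)


def final_grade(responses, participation, bibliography):
    """Combine the three components using the 30/30/40 weights."""
    return (
        0.30 * drop_lowest_mean(responses)
        + 0.30 * drop_lowest_mean(participation)
        + 0.40 * bibliography
    )


# Hypothetical student; every score is a fraction of full credit (assumed scale).
responses = [1.0, 0.0, 1.0, 0.5, 1.0, 1.0, 1.0]           # one late response counts as 0
participation = [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0]  # one absence counts as 0
bibliography = 0.9

grade = final_grade(responses, participation, bibliography)
print(f"Final grade: {grade:.2f}")
```

In this made-up case the missed response and the absence are both among the dropped scores, so neither lowers the final grade.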

Responses

Each week (after the first week), you need to turn in a 1-2 page written response to the papers that you read that week. The goal of the response is to ensure that you've read the weekly readings, thought about them, and connected them to your existing knowledge, interests, and experience. In your response, you should briefly summarise the paper (1-2 sentences to jog your memory when you re-read your notes), and then focus on your response to the paper: How did it make you feel? What questions were you left with? What do you think it got wrong? If you found one of the readings to be particularly thought-provoking, feel free to devote your entire response to that paper.

Each response will be graded on the check/plus/minus system. You will get a check if you briefly summarise the readings and add your own commentary. You will get a check-plus if you synthesize the readings, and combine them with outside knowledge/experience. You will get a check-minus if you only summarise the paper. (I will likely evolve these guidelines to be more concrete once I've read a few responses.)

If you're not familiar with reading academic papers (or you want to polish your skills), you might want to read these guidelines from Jeff Leek. I'd also highly recommend that you learn and use a citation management system. Having a system for managing citations is crucial if you plan to write a thesis. If you don't have an existing system, start by reading the advice of Caleb McDaniel.

Participation

This is a discussion class, so your classroom participation is essential. But don't worry if you're an introvert, shy, or speak English as a second language: there will be plenty of opportunities to participate that don't require verbal agility. In this class, I'll be drawing on the techniques described in The Discussion Book by Stephen D. Brookfield and Stephen Preskill to make sure that everyone gets a chance to participate. I'll also collect regular feedback to make sure that everything is going well.

Annotated bibliography

Your final project will be an annotated bibliography containing at least 20 papers or blog posts related to data science that we did not cover in this course. (See citation tracing)

Due June 6 (electronically)

There are three components to the bibliography:

  • Executive summary (25%). Introduce the overall theme of your bibliography in 1-2 paragraphs. Then use 1-2 pages to synthesise the most important or interesting ideas from your annotated bibliography.

  • Top 3 (25%). List the three papers that you would most highly recommend and describe briefly why.

  • Bibliography (50%). List all the papers you have read with a proper reference and any notes you find helpful.

Each component will be graded 1 (C), 2 (B), or 3 (A):

  • Executive summary:

    • 3:
    • 2:
    • 1:
  • Top 3:

    • 3: Your description of the top 3 papers makes me want to run out and read them immediately, and you make that easy with impeccable citations and links to PDFs.

    • 2:

    • 1: You have listed 3 papers and briefly described why they are interesting.

  • Bibliography:

    • 1: 6-10 papers
    • 2: 11-16 papers
    • 3: >25 papers

License

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
