Stats 337: Readings in Applied Data Science
Stats 337 is a small discussion class available to Stanford students in Spring 2018. Students in this class will read 3-4 papers (or equivalent) per week, write a brief response, and then discuss the papers (and related ideas) in class.
Readings
These readings reflect my personal thoughts about applied data science, and are skewed towards topics that I think are important but generally underappreciated. They are not a systematic attempt to survey the field. That said, if you think there's something major that I've missed, please feel free to submit an issue (or pull request!). These readings will evolve as the quarter goes by.
Many of the readings come from Practical Data Science for Stats, a joint PeerJ collection and special issue of The American Statistician. Jenny Bryan and I pulled this collection together in order to publish some of the important parts of data science that were previously unpublished. Other readings are blog posts because so much of applied data science is outside the comfort zone of traditional academic fields.
The development of much of this course has been driven by conversations on Twitter. A big thanks goes to everyone who has helped me out! Key threads: classroom discussion, ethics, google sheets, citation management.
What the *&!% is data science? (Apr 2)
-
Data scientists mostly just do arithmetic and that's a good thing; Noah Lorang (2016).
-
Optional: Enterprise Data Analysis and Visualization: An Interview Study; Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer (2012).
-
Optional: 50 years of data science (OA preprint); David Donoho (2017). This is a discussion paper, and a number of notable statisticians have contributed commentary. Make sure to read some of the commentaries as well.
Data collection and collaboration (Apr 9)
-
Tidy data; Hadley Wickham (2013).
-
Data organization in spreadsheets; Karl W Broman, Kara Woo (2017).
-
Best practices for using google sheets in your data project; Matthew Lincoln (2018).
-
Bonus: Modeling as a core component of structuring data; Clifford Konold, William Finzer, Kozoom Kreetong (2017)
Spend 3-5 minutes filling out class feedback.
Software engineering (Apr 16)
-
Software development skills for data scientists; Trey Causey (2015).
-
Excuse me, do you have a moment to talk about version control?; Jennifer Bryan (2017).
-
Good enough practices in scientific computing; Greg Wilson, Jennifer Bryan, Karen Cranston, Justin Kitzes, Lex Nederbragt, Tracy K. Teal (2017).
DevOps (Apr 23)
-
Opinionated analysis development; Hillary Parker (2017)
-
An introduction to Docker for reproducible research, with examples from the R environment; Carl Boettiger (2014).
-
Hidden Technical Debt in Machine Learning Systems; D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, Dan Dennison (2015).
Teaching (Apr 30)
-
The Introductory Statistics Course: A Ptolemaic Curriculum?; George W Cobb (2007).
-
The democratization of data science education; Sean Kross, Roger D Peng, Brian S Caffo, Ira Gooding, Jeffrey T Leek (2017).
-
Teaching stats for data science; Danny Kaplan (2017).
-
Ten quick tips for teaching programming; Neil C. C. Brown, Greg Wilson (2018).
Reproducibility (May 7)
-
Best practices for computational science; Victoria Stodden, Sheila Miguez (2014).
-
How rOpenSci uses Code review to promote reproducible science; Noam Ross, Scott Chamberlain, Karthik Ram, Maëlle Salmon (2017).
-
A practical guide for transparency in psychological science; Olivier Klein, Tom Hardwicke, Frederik Aust, Johannes Breuer, Henrik Danielsson, Alicia Hofelich Mohr, Hans IJzerman, Gustav Nilsonne, Wolf Vanpaemel, Michael Frank (2018).
-
Lessons Learned Reproducing a Deep Reinforcement Learning Paper; Matthew Rahtz (2018).
-
Bonus: The Practice of Reproducible Research; Justin Kitzes, Daniel Turek, Fatma Deniz (2018).
Ethics (May 14)
-
The Ethical Data Scientist; Cathy O'Neil (2016).
-
Big data, machine learning, and the social sciences; Hannah Wallach (2014).
-
A Code of Ethics for Data Science; DJ Patil (2018).
-
An ethical code can't be about ethics; Schaun Wheeler (2018).
-
Ethical Guidelines for Statistical Practice; Committee on Professional Ethics of the American Statistical Association (2016).
-
Journalism as a Professional Model for Data Science; Brian C. Keegan (2016)
Career (May 21)
-
What it's like to be on the data science job market; Trey Causey (2015)
-
Academic job search advice; Matt Might (????).
-
Importance of sponsorship; Emily Robinson (2018).
-
Imposter syndrome in data science; Caitlin Hudon (2018).
Industry
-
Doing data science at twitter; Robert Chang (2015).
-
Engineers shouldn't write ETL: A guide to building a high functioning data science Department; Jeff Magnusson (2016).
-
Using R packages and education to scale data science at Airbnb; Ricardo Bion (2016).
-
Data science at Instacart; Jeremy Stanley (2017).
-
.rprofile: Jenny Bryan; Kelly O'Briant (2017)
-
Marketing for data science; Erik Oberg (2018).
Workflow
-
The plain person's guide to plain text social science; Kieran Healy (2016).
-
Open notebook history; Caleb McDaniel (2013).
-
Optional: How to be a modern scientist; Jeff Leek (2016).
Annotated bibliographies
Many students in the Spring 2018 class elected to share their final annotated bibliographies:
-
Communication and visualization by Kenneth Tay
-
Connections to cognitive science by Sara Altman.
-
Data science in modern medicine by Sean R. Zion.
-
Ethics in data science (pdf)
-
Graphical advice by Nick Hershey
-
Sharing analyses across research groups by Hershel Mehta.
-
Tailoring learning experiences for adults through data analytics (pdf) by Anna Khazenzon.
-
Teaching data science (pdf) by Ben Stenhaug.
Grading
This is a discussion-based class, so the majority of your final grade will come from your preparation for discussion (weekly 1-page responses, 30%) and your in-class participation (also 30%). This class is not meant to be self-contained, so the final component of your grade will be an annotated bibliography (40%) describing other papers that you read outside of this class. The goal of these assessments is to force you to do things that are in your own best interests, and to encourage you to learn helpful workflows that will stand you in good stead outside of this class.
I am not interested in policing excuses, so no late responses will be accepted, and absences from class will count as a zero for participation. That said, I also don't want one bad week to affect your final grade, so your lowest two scores from each of responses and participation will be dropped.
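As a rough illustration of the arithmetic described above, here is a minimal Python sketch of how the 30/30/40 weighting and the "drop the lowest two" rule could combine. The function names, the 0-1 score scale, and the example scores are all assumptions made for illustration; the course does not specify how check/plus/minus marks map to numbers, or whether dropped scores are averaged this way.

```python
def drop_lowest(scores, n_drop=2):
    """Mean of the scores after discarding the n_drop lowest values."""
    kept = sorted(scores)[n_drop:]
    if not kept:
        raise ValueError("need more scores than the number dropped")
    return sum(kept) / len(kept)


def final_grade(responses, participation, bibliography):
    """Weighted total: responses 30%, participation 30%, bibliography 40%.

    All inputs are assumed to be on a 0-1 scale (a hypothetical convention).
    """
    return (
        0.30 * drop_lowest(responses)
        + 0.30 * drop_lowest(participation)
        + 0.40 * bibliography
    )


# Hypothetical student with one missed response and one absence (both 0).
responses = [1.0, 0.8, 0.0, 1.0, 0.8, 1.0, 0.8, 1.0]
participation = [1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
grade = final_grade(responses, participation, bibliography=0.9)
print(f"{grade:.3f}")
```

In this sketch the missed week disappears from both averages because it is among the two lowest scores, which is the intent of the drop rule.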
Responses
Each week (after the first week), you need to turn in a 1-2 page written response to the papers that you read that week. The goal of the response is to ensure that you've read the weekly readings, thought about them, and connected them to your existing knowledge, interests, and experience. In your response, you should briefly summarise the paper (1-2 sentences to jog your memory when you re-read your notes), and then focus on your response to the paper: How did it make you feel? What questions were you left with? What do you think it got wrong? If you found one of the readings to be particularly thought provoking, feel free to devote your entire response to that paper.
Each response will be graded on the check/plus/minus system. You will get a check if you briefly summarise the readings and add your own commentary. You will get a check-plus if you synthesize the readings, and combine them with outside knowledge/experience. You will get a check-minus if you only summarise the paper. (I will likely evolve these guidelines to be more concrete once I've read a few responses.)
If you're not familiar with reading academic papers (or you want to polish your skills), you might want to read these guidelines from Jeff Leek. I'd also highly recommend that you learn and use a citation management system. Having a system for managing citations is crucial if you plan to write a thesis. If you don't have an existing system, start by reading the advice of Caleb McDaniel.
Participation
This is a discussion class so your classroom participation is essential. But don't worry if you're an introvert, shy, or English is your second language: there will be plenty of opportunities to participate that don't require verbal agility. In this class, I'll be drawing on the techniques described in The Discussion Book by Stephen D. Brookfield and Stephen Preskill to make sure that everyone gets a chance to participate. I'll also collect regular feedback to make sure that everything is going well.
Annotated bibliography
Your final project will be an annotated bibliography containing at least 20 papers or blog posts related to data science that we did not cover in this course. (See citation tracing)
Due June 6 (electronically)
There are three components to the bibliography:
-
Executive summary (25%). Introduce the overall theme of your bibliography in 1-2 paragraphs. Then use 1-2 pages to synthesise the most important or interesting ideas from your annotated bibliography.
-
Top 3 (25%). List the three papers that you would most highly recommend and describe briefly why.
-
Bibliography (50%). List all the papers you have read with a proper reference and any notes you find helpful.
Each component will be graded 1 (C), 2 (B), or 3 (A):
-
Executive summary:
- 3:
- 2:
- 1:
-
Top 3:
-
3: Your description of the top 3 papers makes me want to run out and read them immediately, and you make that easy with impeccable citations and links to PDFs.
-
2:
-
1: You have listed 3 papers and briefly described why they are interesting.
-
Bibliography:
- 1: 6-10 papers
- 2: 11-16 papers
- 3: >25 papers
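The component weights (25%, 25%, 50%) and the 1-3 marks can be combined into a single weighted mark. The sketch below shows one way to do that in Python; the dictionary keys, the function name, and the example marks are hypothetical, and the conversion from the weighted mark to a letter grade or percentage is not specified in this document, so it is left out.

```python
# Component weights taken from the bibliography description (25/25/50).
WEIGHTS = {"executive_summary": 0.25, "top_3": 0.25, "bibliography": 0.50}


def bibliography_mark(marks):
    """Weighted average of component marks, each 1 (C), 2 (B), or 3 (A)."""
    for name in WEIGHTS:
        if marks[name] not in (1, 2, 3):
            raise ValueError(f"{name} must be marked 1, 2, or 3")
    return sum(WEIGHTS[name] * marks[name] for name in WEIGHTS)


# Hypothetical marks for one student.
example = {"executive_summary": 3, "top_3": 2, "bibliography": 3}
weighted = bibliography_mark(example)
print(weighted)
```

Because the bibliography itself carries half of the weight, a low mark there costs twice as much as the same mark on either of the other two components.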
License
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.