DAT3 Course Repository
Course materials for General Assembly's Data Science course in Washington, DC (10/2/14 - 12/18/14).
Instructors: Josiah Davis and Kevin Markham (Data School blog, email newsletter, YouTube channel)
Week | Tuesday | Thursday |
---|---|---|
0 | 10/2: Introduction | |
1 | 10/7: Git and GitHub | 10/9: Base Python |
2 | 10/14: Getting and Cleaning Data | 10/16: Exploratory Data Analysis |
3 | 10/21: Linear Regression Milestone: Question and Data Set |
10/23: Linear Regression Part 2 |
4 | 10/28: Machine Learning and KNN | 10/30: Model Evaluation |
5 | 11/4: Logistic Regression Milestone: Data Exploration and Analysis Plan |
11/6: Logistic Regression Part 2, Clustering |
6 | 11/11: Dimension Reduction | 11/13: Clustering Part 2, Naive Bayes |
7 | 11/18: Natural Language Processing | 11/20: Decision Trees |
8 | 11/25: Recommenders Milestone: First Draft Due |
Thanksgiving |
9 | 12/2: Ensembling | 12/4: Ensembling Part 2, Python Companion Tools |
10 | 12/9: Working a Data Problem Milestone: Second Draft Due |
12/11: Neural Networks |
11 | 12/16: Review | 12/18: Project Presentations |
Class 1: Introduction
- Introduction to General Assembly
- Course overview and philosophy (slides)
- What is data science? (slides)
- Brief demo of Slack
Homework:
- Install Anaconda distribution of Python 2.7, Git, and Slack
- Add a photo to your Slack profile
- Create a GitHub account
- Read Analyzing the Analyzers (40 pages) and think about where you'd like to fit in!
Optional:
- Subscribe to some data-focused newsletters, to keep current: Center for Data Innovation, O'Reilly Data Newsletter, Data Community DC
- Watch Introduction to Data Science and Analysis (50 minutes) for another look at the data science workflow
- Find an open source project hosted on GitHub that interests you
Class 2: Git and GitHub
- Homework discussion: Any installation issues? Find any interesting GitHub projects? Any takeaways from "Analyzing the Analyzers"?
- Introduce yourself: What's your technical background? Why did you join this course? How do you define success in this course?
- Office hours
- Git and GitHub lesson (slides)
- Create a repo on GitHub, clone it, make changes, and push up to GitHub
- Fork the DAT3-students repo, clone it, add a Markdown file (
about.md
) in your folder, push up to GitHub, and create a pull request
Homework:
- Review the course project information, past projects from other GA students, and public data sources
Optional:
- Clone this repo (DAT3) for easy access to the course files
- Watch Introduction to Git and GitHub (36 minutes) to repeat a lot of today's presentation
- Read the first two chapters of Pro Git for a much deeper understanding of version control and the basic Git commands
- Learn some more Markdown and add it to your
about.md
file, then push those edits to GitHub and send another pull request - Read this friendly command line tutorial if you are brand new to the command line
- For more project inspiration, browse the student projects from Andrew Ng's Machine Learning course at Stanford
Resources:
- Dillinger is a browser-based Markdown editor, useful for checking your Markdown code
- GitRef is an excellent reference guide for Git commands
- Git quick reference for beginners is a shorter reference guide with commands grouped by workflow
Class 3: Base Python
- Any questions about Git/GitHub?
- Discuss the course project. What's one thing you learned from reviewing student projects?
- Base Python lesson, with exercises (code)
Homework:
- Complete the exercises at the end of the Python script we went over in class today and add your solutions to your folder in the DAT3-students repo
- Keep thinking about your project, and consult past projects and public data sources for more inspiration
Class 4: Getting and Cleaning Data
- Discuss homework solutions (code)
- File input/output in Python
- Article, original data, modified data
- Open in Sublime Text
- Reading and writing files (code)
- Getting data from APIs
- What is an API? Why provide one?
- Apigee: API providers, Echo Nest API console
- Echo Nest Developer Center for API key and documentation
- Three options for reading data into Python (code):
- curl to file, view file in browser, read with json module
- Use requests
- Use Pyechonest
Homework:
- Exercise 2 from file input/output
- Read What I do when I get a new data set as told through tweets
- Watch Look at Your Data (18 minutes)
Optional:
- Exercise 3 from file input/output
- Read this fun article about using web scraping to analyze Netflix's "micro-genres"
Resources:
- Online Python Tutor is useful for visualizing (and debugging) your code
- Directory of API wrappers for Python
Class 5: Exploratory Data Analysis
- Discuss homework solutions (code)
- Scraping the web for data
- What is web scraping? Why use it?
- Web scraping example (code):
- Pages to scrape using Beautiful Soup 4
- Adapted from Web scraping 101 with Python
- Pandas for data analysis (code)
- Split-Apply-Combine pattern
Homework:
- Project milestone: Submit your question and data set to DAT3-students by Tuesday!
- Read through this excellent example of data wrangling and exploration in Pandas
Optional:
- To learn more Pandas, read through this three-part tutorial (some overlap with today's class), or read through these two excellent (but extremely long) notebooks: Introduction to Pandas, Data Wrangling with Pandas
Resources:
- For more web scraping with Beautiful Soup 4, here's a longer example: slides, code
- Web scraping without writing any code: "turn any website into an API" with import.io or kimono
- Simple examples of joins in Pandas, for when you need to merge multiple DataFrames together
Class 6: Linear Regression
- Discuss your project question and data set
- Pandas for visualization (code)
- Linear regression (code, slides)
- What is linear regression?
- How to interpret the output?
- What assumptions does linear regression depend upon?
- What is multicollinearity and heteroskedasticity, and why should I care?
- How do I represent categorical variables?
Optional:
- Post your favorite visualization in the "viz" channel on Slack, and tell us what you like about it!
Resources:
- For more on Pandas plotting, browse through this IPython notebook or read the visualization page from the official Pandas documentation
- To learn how to customize your plots further, browse through this IPython notebook on matplotlib
- To explore different types of visualizations and when to use them, Choosing a Good Chart is a handy one-page reference, and here is an excellent slide deck from Columbia's Data Mining class
- If you are already a master of ggplot2 in R, you may prefer "ggplot for Python" over matplotlib: introduction, tutorial
Class 7: Linear Regression Part 2
- Linear regression, continued
Homework:
- Complete the exercises at the end of the python script from class
Resources:
- One of the best places to go for more information about linear regression is chapter 3 of our course "textbook": An Introduction to Statistical Learning - or just read Kevin's highly abbreviated version
- For more information about core assumptions, check out this article and this one
- For more on log transformations, check out this article
- This handout provides an overview of the computation of the F-test
- This may be a helpful article on how to derive the coefficient estimates
Class 8: Machine Learning and KNN
- Discuss homework solutions (code)
- "Human learning" on iris data using Pandas (code)
- Introduction to numpy (code)
- Machine learning and K-Nearest Neighbors (slides)
Homework:
- Read this excellent article, Understanding the Bias-Variance Tradeoff, and be prepared to discuss it on Thursday
Optional:
- Walk through the rest of the numpy reference and see if you can understand each of the functions
Resources:
- For a more thorough introduction to numpy, this guide is quite good
Class 9: Model Evaluation
- Introduction to scikit-learn with iris data (code)
- Discuss the article on the bias-variance tradeoff
- Model evaluation procedures (slides, code)
- Training error
- Underfitting and overfitting
- Test set approach
- Cross-validation
- Model evaluation metrics (slides, code)
- Confusion matrix
- Introduction to Kaggle
Homework:
- Project milestone: Submit your "Data Exploration and Analysis Plan" to DAT3-students by Tuesday!
- Read this simple example of machine learning and see if you understand everything in the article
- Watch Kevin's Kaggle project presentation video (16 minutes) for a tour of the machine learning process
Optional:
- For more on Kaggle, watch the video Kaggle Transforms Data Science Into Competitive Sport (28 minutes)
- For much more on the Kaggle Allstate competition, read Kevin's project paper, read a brief interview with the first place team, review the Python code from the second place team, or skim the solution sharing thread
- If you want to try out the Kaggle Bike Sharing Demand competition, feel free to reuse Kevin's starter code
Resources:
- If you'd like to see more on today's topics, these videos from Hastie and Tibshirani are excellent: bias-variance tradeoff (10 minutes), test set (aka "validation set") approach (14 minutes), cross-validation (14 minutes) - or just read section 5.1 from their book (free PDF download!)
- Kevin wrote a simple guide to confusion matrix terminology that you can use as a reference guide
- The Kaggle wiki has a decent page describing other common model evaluation metrics
Class 10: Logistic Regression
- Any questions from last time: model evaluation, Kaggle, article on Smart Autofill?
- Summary of your feedback
- Discuss your data exploration and analysis plan
- Logistic Regression (slides, code)
Homework:
- Continue to work on Part I of the exercise from class and submit your solution to DAT3-students
Class 11: Logistic Regression Part 2, Clustering
- Logistic Regression, continued (exercise solution)
- Clustering (slides)
- Why cluster?
- Introduction to the K-means algorithm
Homework:
- Read through section 8.2 on K-means Clustering from Introduction to Data Mining by next Thursday. What are some of the strengths and limitations of k-means clustering?
Resources:
- If you would like a review on the topics we covered today (and Tuesday), the videos from Hastie and Tibshirani from Stanford are very good:
- Introduction to Classification (10 minutes)
- Logistic Regression and Maximum Likelihood (9 minutes)
- Multivariate Logistic Regression and Confounding Variables (10 minutes)
- If you want to understand the math of how coefficients are estimated, check out these notes from CMU's Advanced Data Analysis class. Written by Cosma Shalizi, one of CMU's professors.
- Documentation for plotting math text
- Documentation for plotting scatter plots
Class 12: Dimension Reduction
- Model evaluation metrics, continued
- ROC curves and AUC (visualization, code)
- Root Mean Squared Error (slides)
- Dimension Reduction (Guest Lecturer: Sinan Ozdemir)
- Slides
- Code: PCA and SVD
- Code: image compression with PCA (original source)
Homework:
- Read Paul Graham's "A Plan for Spam" in preparation for Thursday's class on Naive Bayes
Resources:
- Kevin has a video tutorial (14 minutes) and blog post summarizing ROC curves and AUC
- scikit-learn has extensive documentation on model evaluation
- On Cross Validated, this question has dozens of explanations of PCA, and this question has a useful visualization of what is essentially PCA
Class 13: Clustering Part 2, Naive Bayes
- Clustering Analysis (slides)
- Understanding the K-means algorithm
- Choosing K for k-means
- Exercises
- Visualizing data in multi-dimensional space
- Limitations of K-means, K-means cluster validation
- Naive Bayes (slides)
- Briefly discuss "A Plan for Spam"
- Probability and Bayes' Theorem (original source, confusion matrix reference)
- Naive Bayes classification
Homework:
- Open Python, type
import nltk
, typenltk.download()
, find the "NLTK Downloader" popup window, click on "all", then click on "Download". Do this at home, since it's more than 300 MB! If you have space constraints on your computer, we can tell you next class exactly which packages to download.
Resources:
- For clustering, scikit-learn has documentation on K-means clustering, alternative clustering algorithms, and clustering metrics
- Vipin Kumar from the University of Minnesota has a helpful chapter on clustering from his textbook: Introduction to Data Mining
- For an alternative introduction to Bayes' Theorem, Bayes' Rule for Ducks, Bayes' Rule in an animated gif, and this 5-minute video on conditional probability may be helpful
- For more details on Naive Bayes classification, Wikipedia has two useful articles: Naive Bayes classifier, Naive Bayes spam filtering
- If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004
Class 14: Natural Language Processing
- Naive Bayes, continued (code)
- Create a spam classifier using CountVectorizer and MultinomialNB
- Natural Language Processing (code)
- Real-world examples
- NLTK: tokenization, stemming, lemmatization, stopwords, Named Entity Recognition (Stanford NER Tagger), TF-IDF, document summarization
- Alternative: TextBlob
Resources:
- Natural Language Processing with Python: free online book to go in-depth with NLTK
- NLP online course: no sessions are available, but video lectures and slides are still accessible
- Brief slides on the major task areas of NLP
- Detailed slides on a lot of NLP terminology
- A visual survey of text visualization techniques: for exploration and inspiration
- DC Natural Language Processing: active Meetup group
- Stanford CoreNLP: suite of tools if you want to get serious about NLP
- Getting started with regex: Python introductory lesson and reference guide, real-time regex tester, in-depth tutorials
Homework:
- We will use Graphviz to visualize the output of the classification trees. Please download this before class.
- In order to download and install for Windows, you will also need to manually add the folder location (e.g., C:\Program Files (x86)\Graphviz2.30\bin;) to your environment path. (download graphviz-2.38.msi)
- You can also download and install for Mac. I am not aware of any issues with installation.
Class 15: Decision Trees
At the end of this class, you should be able to do the following:
- Describe the output of a decision tree to someone without a data science background
- Describe how the algorithm creates the decision tree
- Predict the likelihood of a binary event using the decision tree algorithm in scikit-learn
- Create a decision tree visualization
- Determine the optimal tree size using a tune grid and the AUC metric in Python
- Describe the strengths and weaknesses of a decision tree
Homework:
- Work on your project. The first draft of your project is due on Tuesday at 5 pm.
Resources:
- Dr. Justin Esarey from Rice University has a nice video lecture on CART that also includes a code walkthrough
- For those of you with background in javascript, d3.js has a nice tree layout that would make more presentable tree diagrams
- Here is a link to a static version, as well as a link to a dynamic version with collapsable nodes
- If this is something you are interested in, Gary Sieling wrote a nice function in python to take the output of a scikit-learn tree and convert into json format
- If you are intersted in learning d3.js, this a good tutorial for understanding the building blocks of a decision tree. Here is another tutorial focusing on building a tree diagram in d3.js.
- Chapter 8.1 of the Introduction to Statistical Learning also covers the basics of Classification and Regression Trees
Class 16: Recommenders
- Recommenders (Guest Lecturer: Sinan Ozdemir)
Class 17: Ensembling
Resources:
- Leo Brieman's paper on Random Forests
- yhat has a brief primer on Random Forests that can provide a review of many of the topics we covered today.
- Here is a link to some Kaggle competitions that were won using Random Forests
- Ensemble models... tend to strongly outperform their component models on new data. Doesn't this violate βOccamβs razorβ? In this paper entitled: The Generalization Paradox of Ensembles John Elder IV argues for a more refined understanding of model complexity.
Class 18: Ensembling Part 2, Python Companion Tools
- Ensembling
- Review the Random Forest algorithm
- Discuss tuning the Random Forest Algorithm
- Give an overview of the AdaBoost algorithm
- Implement boosted trees in Python
- IPython Notebook
- nbviewer
- Biostatistics course: GitHub repo, direct links to nbviewer
- A gallery of interesting IPython Notebooks
Resources:
- Chapter 10 of the Elements of Statistical Learning covers Boosting. See page 339 for the algorithm presented in class.
- Dr. Justin Esary has a nice tutorial on Boosting. Watch from 32:00 β 59:00 for relevant material.
- Tutorial by Professor Rob Schapire of Princeston on the AdaBoost Algorithm
- IPython documentation in website form and notebook form: does not focus exclusively on the IPython Notebook
Class 19: Working a Data Problem
Class 20: Neural Networks
- Inspiration for Neural Networks
- Neural Networks
- Gradient Descent
Resources:
- Michael Neilson has a free book called Neural Networks and Deep Learning that gives thorough introduction to Neural Networks
- Geoffrey Hinton, one of the pioneers of the deep learning movement, has an entire class on Coursera called Neural Networks for Machine Learning
- The Python wiki has a list of Python packages commonly used for Neural Networks
Homework:
- Read this "classic" paper, which may help you to connect many of the topics we have studied throughout the course: A Few Useful Things to Know about Machine Learning
Class 21: Review
- Review of all of data science (slides)
- Comparing supervised learning algorithms
- Special guest: Laura Lorenz
Resources:
- scikit-learn "machine learning map": Guide for choosing the optimal estimator
- Choosing a Machine Learning Classifier: Short and highly readable
- Machine Learning Done Wrong: Thoughtful advice on common mistakes to avoid in machine learning
- Practical machine learning tricks from the KDD 2011 best industry paper: More advanced advice than the resources above
- An Empirical Comparison of Supervised Learning Algorithms: Research paper from 2006
- Getting in Shape for the Sport of Data Science: 75-minute video of practical tips for machine learning (by the past president of Kaggle)
- Resources for continued learning
- Bonus content (below)
- Keep using Slack!
Class 22: Project Presentations
- Note: Guests are welcome! Invite your friends and family!
Bonus Content
- Python reference guide
- Regular expressions ("regex"):
- Simple example of regex usage: code, data
- Detailed reference guide
- Python introductory lesson
- In-depth tutorials
- Real-time regex tester
- Fun example: Exploring Expressions of Emotions in GitHub Commit Messages
- Tidy data
- Brief summary of tidy data
- Tidy data resources from Hadley Wickham: detailed paper, shorter version of the paper with more R code, slides from his class on tidy data, video presentation
- Examples: tidy, untidy, very untidy
- Common issues with obtaining tidy data from Excel
- Reproducibility
- Overview of reproducibility and reproducible research
- Twitter definition of reproducibility
- How to share data with a statistician: practical guide for turning raw data into tidy data in a reproducible and documented manner
- Example of reproducible data processing: includes raw data, processed data, processing scripts, and documentation
- Colbert on reproducibility (8-minute video)
- Domino Data Lab
- Primary use cases: run your model in the cloud, create a self-service web form or API that interacts with your model
- Usage rates and subscription pricing
- Command reference