General Assembly's Data Science course in Washington, DC

DAT7 Course Repository

Course materials for General Assembly's Data Science course in Washington, DC (6/1/15 - 8/12/15).

Instructor: Kevin Markham (Data School blog, email newsletter, YouTube channel)

Course Project

Monday | Wednesday
6/1: Introduction to Data Science | 6/3: Command Line and Version Control
6/8: Data Reading and Cleaning | 6/10: Exploratory Data Analysis
6/15: Visualization | 6/17: Machine Learning
6/22: Getting Data (Project Discussion Deadline) | 6/24: K-Nearest Neighbors (Project Question and Dataset Due)
6/29: Basic Model Evaluation | 7/1: Linear Regression
7/6: Logistic Regression | 7/8: Advanced Model Evaluation
7/13: First Project Presentation | 7/15: Naive Bayes and Text Data
7/20: Natural Language Processing | 7/22: Kaggle Competition
7/27: Decision Trees (Draft Paper Due) | 7/29: Ensembling
8/3: Advanced scikit-learn and Clustering (Peer Review Due) | 8/5: Course Review
8/10: Final Project Presentation | 8/12: Final Project Presentation

Python Resources

Submission Forms

What's next?

Additional resources


Class 1: Introduction to Data Science

Homework:

Resources:


Class 2: Command Line and Version Control

  • Command line exercise (code)
  • Git and GitHub (slides)
  • Intermediate command line
  • Wrap up: Course schedule, office hours

Homework:

  • Complete the homework exercise listed in the command line introduction. Create a Markdown document that includes your answers and the code you used to arrive at those answers. Add this file to a GitHub repo that you'll use for all of your coursework, and submit a link to your repo using the homework submission form.
  • Review the code from the beginner and intermediate Python workshops. If you don't feel comfortable with any of the content (up through the "dictionaries" section), you should spend some time this weekend practicing Python. Here are my recommended resources:
    • If you like learning from a book, Python for Informatics has useful chapters on strings, lists, and dictionaries.
    • If you prefer interactive exercises, try these lessons from Codecademy: "Python Lists and Dictionaries" and "A Day at the Supermarket".
    • If you have more time, try these much longer lessons from DataQuest: "Find the US city with the lowest crime rate" and "Discover weather patterns in LA".
    • If you've already mastered these topics and want more of a challenge, try solving the second Python Challenge and send me your code in Slack.
  • If there are specific Python topics you want me to cover next week, send me a Slack message.

Git and Markdown Resources:

  • Pro Git is an excellent book for learning Git. Read the first two chapters to gain a deeper understanding of version control and basic commands.
  • If you want to practice a lot of Git (and learn many more commands), Git Immersion looks promising.
  • If you want to understand how to contribute on GitHub, you first have to understand forks and pull requests.
  • GitRef is my favorite reference guide for Git commands, and Git quick reference for beginners is a shorter guide with commands grouped by workflow.
  • Markdown Cheatsheet provides a thorough set of Markdown examples with concise explanations. GitHub's Mastering Markdown is a simpler and more attractive guide, but is less comprehensive.

Command Line Resources:

  • If you want to go much deeper into the command line, Data Science at the Command Line is a great book. The companion website provides installation instructions for a "data science toolbox" (a virtual machine with many more command line tools), as well as a long reference guide to popular command line tools.
  • If you want to do more at the command line with CSV files, try out csvkit, which can be installed via pip.

Class 3: Data Reading and Cleaning

  • Git and GitHub assorted tips (slides)
  • Review command line homework (solution)
  • Python:
    • Spyder interface
    • Review of list comprehensions
    • Lesson on file reading with airline safety data (code, data, article)
    • Data cleaning exercise
    • Walkthrough of homework with Chipotle order data (code, data, article)
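
As a rough illustration of the list comprehension and file reading topics listed above, here is a minimal sketch. The sample rows and column names are invented for illustration and are not the actual airline safety data; `io.StringIO` stands in for a real file opened with `open()`.

```python
import csv
import io

# Hypothetical sample data standing in for a CSV file on disk;
# the column names and values are invented for illustration.
sample = io.StringIO(
    "airline,incidents\n"
    "Alpha Air,2\n"
    "Beta Jet,0\n"
    "Gamma Wings,5\n"
)

# Read every row into a list of lists, then split off the header.
rows = [row for row in csv.reader(sample)]
header, data = rows[0], rows[1:]

# List comprehension: convert the second column to integers.
incidents = [int(row[1]) for row in data]

# List comprehension with a condition: airlines with at least one incident.
flagged = [row[0] for row in data if int(row[1]) > 0]

print(header)
print(incidents)
print(flagged)
```

With a real file, the `io.StringIO` object would be replaced by a file handle from a `with open(...)` block; the comprehensions stay the same.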

Homework:

  • Complete the homework assignment with the Chipotle data, and add a commented Python script to your GitHub repo. If you are unable to complete a part, try writing some pseudocode instead! You have until Monday to complete this assignment.

Resources:

  • PEP 8 is Python's "classic" style guide, and is worth a read if you want to write readable code that is consistent with the rest of the Python community.

Class 4: Exploratory Data Analysis

Homework:

Resources:


Class 5: Visualization

  • Part 2 of Exploratory Data Analysis with Pandas (code)
  • Visualization with Pandas and Matplotlib (code)

Homework:

Pandas Resources:

Visualization Resources:


Class 6: Machine Learning

Homework:

  • Your deadline for discussing your project ideas with an instructor is Monday, and your project question and dataset are due Wednesday.

Resources:


Class 7: Getting Data

Homework:

API Resources:

Web Scraping Resources:


Class 8: K-Nearest Neighbors

Homework:

  • Reading assignment on the bias-variance tradeoff
  • Browse through the scikit-learn documentation for KNN to get a sense of how it's organized: user guide, module reference, class documentation
  • Work on your project... your first project presentation is in less than three weeks!
  • Optional: Read the Teaching Assistant Evaluation dataset into Pandas, create the X and y objects (the response variable is "class attribute"), and go through scikit-learn's 4-step modeling process. (There's no need to submit your code unless you have a question or would like feedback!)
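
To make the KNN idea concrete, here is a minimal pure-Python sketch written with `fit` and `predict` methods so that it mirrors the instantiate, fit, predict pattern used by scikit-learn estimators. The class name `SimpleKNN` and the toy data are hypothetical; this is not scikit-learn's implementation and does not use the Teaching Assistant Evaluation dataset.

```python
from collections import Counter
import math


class SimpleKNN:
    """Toy K-nearest neighbors classifier (illustrative, not scikit-learn)."""

    def __init__(self, n_neighbors=3):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        # KNN has no real training step: it just memorizes the data.
        self.X_ = [list(row) for row in X]
        self.y_ = list(y)
        return self

    def predict(self, X):
        return [self._predict_one(row) for row in X]

    def _predict_one(self, row):
        # Sort training points by Euclidean distance to the query point.
        by_distance = sorted(
            zip(self.X_, self.y_),
            key=lambda pair: math.dist(pair[0], row),
        )
        # Majority vote among the labels of the k closest points.
        nearest = [label for _, label in by_distance[: self.n_neighbors]]
        return Counter(nearest).most_common(1)[0][0]


# Made-up feature matrix X and response vector y.
X = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]]
y = ["a", "a", "a", "b", "b", "b"]

knn = SimpleKNN(n_neighbors=3)                        # instantiate
knn.fit(X, y)                                         # fit
predictions = knn.predict([[1.1, 1.0], [5.1, 5.0]])   # predict
print(predictions)
```

In practice you would use scikit-learn's `KNeighborsClassifier` rather than a hand-rolled class; the sketch only shows what "nearest neighbors" and "majority vote" mean.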

KNN Resources:

Reproducibility Resources:

Other Resources:

  • If you would like to learn the IPython Notebook, the official Notebook tutorials are useful.
  • To get started with Seaborn for visualization, the official website has a series of tutorials and an example gallery.

Class 9: Basic Model Evaluation

Homework:

Resources:


Class 10: Linear Regression

Homework:

Resources:


Class 11: Logistic Regression

Homework:

Resources:


Class 12: Advanced Model Evaluation

  • Advanced model evaluation (notebook, notebook code)
    • Null accuracy, handling missing values
    • Confusion matrix
    • Handling categorical features
  • ROC curves and AUC
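
The evaluation terms above can be sketched in a few lines of plain Python. The labels and predictions below are made up for illustration; null accuracy is computed as the accuracy you would get by always predicting the most frequent class.

```python
from collections import Counter

# Made-up true labels and predictions for a binary problem (1 = positive).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

# Null accuracy: accuracy from always predicting the most frequent class.
most_common_count = Counter(y_true).most_common(1)[0][1]
null_accuracy = most_common_count / len(y_true)

# Confusion matrix counts.
pairs = list(zip(y_true, y_pred))
tn = sum(1 for t, p in pairs if t == 0 and p == 0)
fp = sum(1 for t, p in pairs if t == 0 and p == 1)
fn = sum(1 for t, p in pairs if t == 1 and p == 0)
tp = sum(1 for t, p in pairs if t == 1 and p == 1)

accuracy = (tp + tn) / len(y_true)
sensitivity = tp / (tp + fn)           # true positive rate (y-axis of an ROC curve)
false_positive_rate = fp / (fp + tn)   # x-axis of an ROC curve

print(null_accuracy, accuracy, sensitivity, false_positive_rate)
```

Comparing `accuracy` against `null_accuracy` shows how much a model adds over the trivial baseline. An ROC curve plots `sensitivity` against `false_positive_rate` as the classification threshold varies; this sketch computes only a single point.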

Homework:

  • Your first project presentation is on Monday! Please submit a link to your project repository (with slides, code, data, and visualizations) before class using the submission form.

ROC Resources:

Other Resources:


Class 13: First Project Presentation

  • Project presentations!

Homework:


Class 14: Naive Bayes and Text Data

Homework:

  • Confirm that you have TextBlob installed by running import textblob from within your preferred Python environment. If it's not installed, run pip install textblob at the command line (not from within Python).
  • Complete the Yelp review text homework, and add a Python script (or IPython notebook) to your GitHub repo. This assignment is due on Monday.
  • There is a video/reading assignment on cross-validation, for those of you who have not already watched the video or would prefer a reading instead.
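
If you would rather check for TextBlob without triggering an ImportError, one possible approach is sketched below. The helper name `is_installed` is hypothetical; it relies only on the standard library's `importlib.util.find_spec`.

```python
import importlib.util


def is_installed(package_name):
    """Return True if the named top-level package can be imported."""
    return importlib.util.find_spec(package_name) is not None


if is_installed("textblob"):
    print("textblob is installed")
else:
    print("textblob is missing; run 'pip install textblob' at the command line")
```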

Resources:

  • For more on conditional probability, read these slides, or read section 2.2 of the OpenIntro Statistics textbook (14 pages).
  • For an intuitive explanation of Naive Bayes classification, read this post on airport security.
  • For more details on Naive Bayes classification, Wikipedia has two excellent articles (Naive Bayes classifier and Naive Bayes spam filtering), and Cross Validated has a good Q&A.
  • When applying Naive Bayes classification to a dataset with continuous features, it is best to use GaussianNB rather than MultinomialNB. Wikipedia has a short description of Gaussian Naive Bayes, as well as an excellent example of its usage.
  • These slides from the University of Maryland provide more mathematical details on both logistic regression and Naive Bayes, and also explain how Naive Bayes is actually a "special case" of logistic regression.
  • Andrew Ng has a paper comparing the performance of logistic regression and Naive Bayes across a variety of datasets.
  • If you enjoyed Paul Graham's article, you can read his follow-up article on how he improved his spam filter and this related paper about state-of-the-art spam filtering in 2004.
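
To show why a Gaussian model suits continuous features, here is a small pure-Python sketch of Gaussian Naive Bayes: it estimates a mean and variance per feature and class, then picks the class with the highest log posterior. The function names and toy data are hypothetical, the small variance floor is an assumption added for numerical safety, and this is not scikit-learn's `GaussianNB` implementation.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate per-class priors, feature means, and feature variances."""
    grouped = defaultdict(list)
    for row, label in zip(X, y):
        grouped[label].append(row)

    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # Small constant added so a zero variance cannot cause division by zero.
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9
            for col, m in zip(zip(*rows), means)
        ]
        model[label] = (n / len(X), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one observation."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        for v, m, var in zip(row, means, variances):
            # Log of the Gaussian density, treating features as independent.
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up continuous features (two measurements per observation).
X = [[1.0, 2.1], [1.2, 1.9], [0.8, 2.0], [3.0, 4.1], [3.2, 3.9], [2.9, 4.0]]
y = ["low", "low", "low", "high", "high", "high"]

model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 2.0]))
print(predict_gaussian_nb(model, [3.1, 4.0]))
```

The "naive" part is the inner loop: each feature contributes its own log density independently, and the contributions are simply added.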

Class 15: Natural Language Processing

Homework:

  • Download the competition files, move them to the DAT7/data directory, and make sure you can open the CSV files using Pandas. If you have any problems opening the files, you probably need to turn off real-time virus scanning (especially Microsoft Security Essentials).
  • Come up with some theories about which features might be relevant to predicting the response, and then explore the data to see if those theories appear to be true.
  • Optional: Think about some features that might be worth creating from the data, and then figure out how to actually create those features.
  • Optional: Watch my project presentation video (16 minutes) for a tour of the end-to-end machine learning process for a Kaggle competition, including the creation of new features. (Or, just read through the slides.)

NLP Resources:

Cross-Validation Resources:


Class 16: Kaggle Competition

Homework:

  • Your draft paper is due on Monday! Please submit a link to your project repository (with paper, code, data, and visualizations) before class using the submission form.
  • Optional: Keep working on this competition! You can make up to 5 submissions per day, and the competition doesn't close until 6:30pm ET on Wednesday, August 5 (class 20).

Resources:


Class 17: Decision Trees

Homework:

Resources:

Installing GraphViz (optional):

  • Mac: Download and install PKG file
  • Windows: Download and install MSI file, and then add GraphViz to your path:
    • Go to Control Panel, System, Advanced System Settings, Environment Variables
    • Under system variables, edit "Path" to include the path to the "bin" folder, such as: C:\Program Files (x86)\Graphviz2.38\bin
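
After editing the path, one way to confirm the change took effect is to look for GraphViz's `dot` executable from Python. This is a minimal sketch using the standard library's `shutil.which`; the helper name `find_executable` is hypothetical, and you may need to restart your Python session before a changed path is picked up.

```python
import shutil


def find_executable(name):
    """Return the full path of an executable on the PATH, or None."""
    return shutil.which(name)


dot_path = find_executable("dot")
if dot_path:
    print("GraphViz 'dot' found at:", dot_path)
else:
    print("GraphViz 'dot' not found; check your Path setting")
```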

Class 18: Ensembling

Resources:


Class 19: Advanced scikit-learn and Clustering

Homework:

scikit-learn Resources:

Clustering Resources:


Class 20: Course Review

Homework:

  • Your final project is due next week!

Resources:


Classes 21 and 22: Final Project Presentation


Bonus Resources

Databases and SQL

Tidy Data

Regular Expressions ("Regex")

  • RegexOne is an interactive tutorial for learning the basics of regular expressions.
  • Google's Python Class includes an excellent introductory lesson on regular expressions (which also has an associated video).
  • Python for Informatics has a nice chapter on regular expressions. (If you want to run the examples, you'll need to download mbox.txt and mbox-short.txt.)
  • My reference guide to regular expressions includes lots of short explanations and simple examples.
  • regex101 is an online tool for testing your regular expressions in real time.
  • If you want to go really deep with regular expressions, RexEgg includes endless articles and tutorials.
  • Exploring Expressions of Emotions in GitHub Commit Messages is a fun example of how regular expressions can be used for data analysis, and Emojineering explains how Instagram uses regular expressions to detect emoji in hashtags.
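
For a quick taste of regular expressions in Python, here is a minimal sketch using the standard `re` module. The sample lines are made up in the style of a mail log and are not taken from mbox.txt.

```python
import re

# Made-up lines in the style of a mail log; not taken from mbox.txt.
lines = [
    "From alice@example.com Sat Jan  5 09:14:16 2008",
    "Subject: meeting notes",
    "From bob@example.org Sun Jan  6 10:02:41 2008",
]

# \S+ matches a run of non-whitespace characters on either side of the @.
email_pattern = re.compile(r"\S+@\S+")

# ^From followed by a space keeps only the lines that start a message.
senders = [
    email_pattern.search(line).group()
    for line in lines
    if re.search(r"^From ", line)
]

# A capture group pulls out just the hour from a time like 09:14:16.
hours = [
    m.group(1)
    for line in lines
    if (m := re.search(r"(\d{2}):\d{2}:\d{2}", line))
]

print(senders)
print(hours)
```

Patterns like these are easy to try out in a tool such as regex101 before putting them into a script.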

Regularization

Recommendation Systems
