  • Language
    Jupyter Notebook
  • License
    MIT License
  • Created about 5 years ago
  • Updated 4 months ago

Repository Details

An introduction to data science in Python, for people with no programming experience.

Elements of Data Science

Elements of Data Science is an introduction to data science for people with no programming experience. My goal is to present a small, powerful subset of Python that allows you to do real work in data science as quickly as possible.

I don't assume that the reader knows anything about programming, statistics, or data science. When I use a term, I try to define it immediately, and when I use a programming feature, I try to explain it.

This book is in the form of Jupyter notebooks. Jupyter is a software development tool you can run in a web browser, so you don't have to install any software. A Jupyter notebook is a document that contains text, Python code, and results. So you can read it like a book, but you can also modify the code, run it, develop new programs, and test them.

The notebooks contain exercises where you can practice what you learn. Most of the exercises are meant to be quick, but a few are more substantial.

The license for this book is the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

This material is a work in progress, so suggestions are welcome. The best way to provide feedback is to click here and create an issue in this GitHub repository.

Case Studies

In addition to the notebooks below, the Elements of Data Science curriculum includes these case studies:

  • Political Alignment Case Study: Using data from the General Social Survey, this case study explores changing opinions on a variety of topics among survey respondents in the United States. Readers choose one of about 120 survey questions and see how responses have changed over time and how these changes relate to political alignment (conservative, moderate, or liberal).

  • Recidivism Case Study: This case study is based on a well-known article, "Machine Bias", which was published by ProPublica in 2016. It relates to COMPAS, a statistical tool used in the criminal justice system to assess the risk that a defendant will commit another crime if released. The ProPublica article concludes that COMPAS is unfair to Black defendants because they are more likely to be misclassified as high risk. A response article in the Washington Post suggests that "It's actually not that clear." Using the data from the original article, this case study explains the (many) metrics used to evaluate binary classifiers, shows the challenges of defining algorithmic fairness, and starts a discussion of the context, ethics, and social impact of data science.

  • Bite Size Bayes: An introduction to probability with a focus on Bayes's Theorem.

  • Astronomical Data in Python: An introduction to SQL using data from the Gaia space telescope as an example.
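
To give a sense of the binary-classifier metrics mentioned in the Recidivism Case Study, here is a minimal sketch in plain Python. It is not taken from the case study: the function names and the tiny data set are made up for illustration, and the four metrics shown are only a few of the many in use.

```python
def confusion_counts(actual, predicted):
    """Count true/false positives and negatives for two parallel lists of booleans."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p and a:
            tp += 1      # predicted high risk, did reoffend
        elif p and not a:
            fp += 1      # predicted high risk, did not reoffend
        elif not p and a:
            fn += 1      # predicted low risk, did reoffend
        else:
            tn += 1      # predicted low risk, did not reoffend
    return tp, fp, tn, fn


def classifier_metrics(actual, predicted):
    """Compute a few common metrics from the confusion counts."""
    tp, fp, tn, fn = confusion_counts(actual, predicted)
    return {
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "positive_predictive_value": tp / (tp + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical outcomes and predictions (not real COMPAS data)
actual = [True, True, True, False, False, False, False, False]
predicted = [True, True, False, True, False, False, False, False]

metrics = classifier_metrics(actual, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Computing the same metrics separately for each group of defendants is one way to see why fairness is hard to define: a classifier can have equal predictive value across groups and still have unequal false positive rates.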

The notebooks

For each of the notebooks below, you have three options:

  • If you view the notebook on NBViewer, you can read it, but you can't run the code.

  • If you run the notebook on Colab, you'll be able to run the code, do the exercises, and save your modified version of the notebook in a Google Drive (if you have one).

  • Or, if you download the notebook, you can run it in your own environment. But in that case it is up to you to make sure you have the libraries you need.

Notebook 1

Variables and values: The first notebook explains how to use Jupyter and introduces variables, values, and numerical computation.

Click here to run this notebook on Colab

or click here to download it

Notebook 2

Times and places: This notebook shows how to represent times, dates, and locations in Python, and uses the GeoPandas library to plot points on a map.

Click here to run this notebook on Colab

or click here to download it
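
As a rough illustration of representing dates and locations, the sketch below uses only the standard library. The dates and coordinates are made up, and the haversine formula is an illustrative choice for computing distance, not necessarily the approach the notebook takes; plotting with GeoPandas is not shown.

```python
from datetime import date
from math import radians, sin, cos, asin, sqrt

# Dates are objects; subtracting two of them gives a timedelta.
start = date(2020, 1, 15)      # hypothetical dates
end = date(2020, 3, 1)
elapsed = end - start
print(elapsed.days)            # number of days between the two dates


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two (latitude, longitude) points in degrees.

    Assumes a spherical Earth with the given mean radius.
    """
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return 2 * radius_km * asin(sqrt(a))


# A location is just a pair of numbers: latitude and longitude in degrees.
point_a = (0.0, 0.0)           # hypothetical points on the equator
point_b = (0.0, 1.0)
print(haversine_km(*point_a, *point_b))
```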

Notebook 3

Lists and Arrays: This notebook presents lists and NumPy arrays. It discusses absolute, relative, and percent errors, and ways to summarize them.

Click here to run this notebook on Colab

or click here to download it
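
The three kinds of error can be sketched in a few lines. This example uses plain lists so it has no dependencies; with NumPy arrays the same arithmetic can be written without the explicit loops. The numbers are made up.

```python
actual = [200.0, 150.0, 400.0]       # hypothetical true values
estimates = [210.0, 135.0, 400.0]    # hypothetical estimates

# Absolute error: how far off each estimate is, ignoring direction.
abs_errors = [abs(e - a) for e, a in zip(estimates, actual)]

# Relative error: absolute error as a fraction of the true value.
rel_errors = [err / a for err, a in zip(abs_errors, actual)]

# Percent error: relative error expressed as a percentage.
pct_errors = [100 * r for r in rel_errors]

# One way to summarize: the mean of each kind of error.
mean_abs_error = sum(abs_errors) / len(abs_errors)
mean_pct_error = sum(pct_errors) / len(pct_errors)

print(abs_errors)
print(pct_errors)
print(mean_abs_error, mean_pct_error)
```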

Notebook 4

Loops and Files: This notebook presents the for loop and the if statement; then it uses them to speed-read War and Peace and count the words.

Click here to run this notebook on Colab

or click here to download it
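
Here is a minimal sketch of the pattern: a for loop reads a file line by line and an if statement skips blank lines. To keep the example self-contained, an in-memory `io.StringIO` object with made-up text stands in for a real file; with a real book you would use `open(filename)` instead.

```python
import io

# A few made-up lines stand in for a text file on disk.
fake_file = io.StringIO("the quick brown fox\n\njumps over\nthe lazy dog\n")

total_words = 0
nonblank_lines = 0

for line in fake_file:
    words = line.split()          # split on whitespace
    if len(words) > 0:            # skip blank lines
        nonblank_lines += 1
        total_words += len(words)

print(nonblank_lines, total_words)
```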

Notebook 5

Dictionaries: This notebook presents one of the most powerful features of Python, dictionaries, and uses them to count the unique words in a text and their frequencies.

Click here to run this notebook on Colab

or click here to download it
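
Counting word frequencies with a dictionary can be sketched as follows. The sentence is made up, and the variable names are only for illustration.

```python
text = "the cat and the hat and the bat"   # hypothetical text

# Map each word to the number of times it appears.
counts = {}
for word in text.split():
    if word in counts:
        counts[word] += 1
    else:
        counts[word] = 1

# The number of keys is the number of unique words.
unique_words = len(counts)

# Sort (word, count) pairs from most to least frequent.
most_common = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

print(unique_words)
print(most_common[:2])
```

Looking up a key in a dictionary takes roughly the same time no matter how many keys it holds, which is why this approach scales to a whole book.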

Notebook 6

Plotting: This notebook introduces a plotting library, Matplotlib, and uses it to generate a few common data visualizations and one less common one, a Zipf plot.

Click here to run this notebook on Colab

or click here to download it

Notebook 7

DataFrames: This notebook presents DataFrames, which are used to represent tables of data. As an example, it uses data from the National Survey of Family Growth to find the average weight of babies in the U.S.

Click here to run this notebook on Colab

or click here to download it

Notebook 8

Distributions: This notebook explains what a distribution is and presents three ways to represent one: a PMF, a CDF, or a PDF. It also shows how to compare a distribution to another distribution or a mathematical model.

Click here to run this notebook on Colab

or click here to download it
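
To make two of these representations concrete, here is a minimal sketch that builds a PMF and a CDF from a small made-up sample using plain dictionaries. The helper names `make_pmf` and `make_cdf` are hypothetical, not the API of any library, and the PDF is left out because estimating one takes more machinery.

```python
from collections import Counter


def make_pmf(sample):
    """Map each distinct value to the fraction of the sample equal to it."""
    counts = Counter(sample)
    n = len(sample)
    return {value: counts[value] / n for value in sorted(counts)}


def make_cdf(pmf):
    """Map each value to the fraction of the sample less than or equal to it."""
    cdf = {}
    total = 0.0
    for value, prob in pmf.items():   # keys were inserted in sorted order
        total += prob
        cdf[value] = total
    return cdf


sample = [1, 2, 2, 3, 3, 3, 4, 4]     # hypothetical data
pmf = make_pmf(sample)
cdf = make_cdf(pmf)

print(pmf)
print(cdf)
```

The PMF answers "how often does this exact value appear?", while the CDF answers "what fraction of values are at or below this one?"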

Notebook 9

Relationships: This notebook explores relationships between variables using scatter plots, violin plots, and box plots. It quantifies the strength of a relationship using the correlation coefficient and uses simple regression to estimate the slope of a line.

Click here to run this notebook on Colab

or click here to download it
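
The correlation coefficient and the slope of a least-squares line can both be computed from the same sums, as this minimal sketch shows. It uses plain Python and made-up data; in practice you would more likely call a library function.

```python
from math import sqrt


def mean(xs):
    return sum(xs) / len(xs)


def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def slope_intercept(xs, ys):
    """Slope and intercept of the least-squares line y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    slope = cov / var_x
    return slope, my - slope * mx


xs = [1, 2, 3, 4, 5]      # hypothetical data
ys = [2, 4, 5, 4, 5]

r = correlation(xs, ys)
slope, intercept = slope_intercept(xs, ys)
print(round(r, 3), round(slope, 3), round(intercept, 3))
```

The correlation measures how closely the points follow a line, but it says nothing about how steep the line is; that is what the slope adds.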

Notebook 10

Regression: This notebook presents multiple regression and uses it to explore the relationship between age, education, and income. It uses visualization to interpret multivariate models. It also presents binary variables and logistic regression.

Click here to run this notebook on Colab

or click here to download it

Notebook 11

Resampling: This notebook presents computational methods we can use to quantify variation due to random sampling, which is one of several sources of error in statistical estimation.

Click here to run this notebook on Colab

or click here to download it

Notebook 12

Bootstrapping: Bootstrapping is a kind of resampling that is well suited to the kind of survey data we've been working with.

Click here to run this notebook on Colab

or click here to download it
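
The core idea of bootstrapping fits in a few lines: draw new samples from the observed data with replacement, compute the statistic for each one, and look at how much the results vary. This is a minimal sketch under simplifying assumptions: the data are made up, the function names are hypothetical, and it ignores the sampling weights that real survey data usually come with.

```python
import random


def bootstrap_means(sample, iterations, rng):
    """Resample with replacement and return the mean of each resample."""
    means = []
    for _ in range(iterations):
        resample = rng.choices(sample, k=len(sample))
        means.append(sum(resample) / len(resample))
    return means


def percentile_interval(values, low=0.05, high=0.95):
    """Approximate interval between two percentiles of the values."""
    ordered = sorted(values)
    lo_index = int(low * (len(ordered) - 1))
    hi_index = int(high * (len(ordered) - 1))
    return ordered[lo_index], ordered[hi_index]


rng = random.Random(17)                              # seeded for repeatability
sample = [61, 64, 65, 67, 68, 70, 71, 73, 75, 80]    # hypothetical measurements

means = bootstrap_means(sample, 1000, rng)
ci_low, ci_high = percentile_interval(means)
print(f"90% interval for the mean: {ci_low:.1f} to {ci_high:.1f}")
```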

Notebook 13

Hypothesis Testing: Hypothesis testing is the bugbear of classical statistics. This notebook presents a computational approach to the topic that makes it clear that there is only one test.

Click here to run this notebook on Colab

or click here to download it
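
One way to read "there is only one test" is that every hypothesis test follows the same recipe: choose a test statistic, simulate it under the null hypothesis, and see how often the simulated value is at least as extreme as the observed one. The sketch below applies that recipe with a permutation test for a difference in means. The data and function names are made up for illustration, and this is not necessarily the example used in the notebook.

```python
import random


def diff_in_means(group_a, group_b):
    """Test statistic: absolute difference between the group means."""
    return abs(sum(group_a) / len(group_a) - sum(group_b) / len(group_b))


def permutation_p_value(group_a, group_b, iterations, rng):
    """Estimate how often shuffled labels produce a difference this large."""
    observed = diff_in_means(group_a, group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    count = 0
    for _ in range(iterations):
        rng.shuffle(pooled)   # null hypothesis: group labels don't matter
        simulated = diff_in_means(pooled[:n_a], pooled[n_a:])
        if simulated >= observed:
            count += 1
    return count / iterations


rng = random.Random(42)                  # seeded for repeatability
group_a = [12, 15, 14, 16, 13, 17]       # hypothetical measurements
group_b = [22, 25, 24, 26, 23, 27]

p_value = permutation_p_value(group_a, group_b, 1000, rng)
print(p_value)
```

Swapping in a different test statistic or a different model of the null hypothesis changes the details, but the structure of the computation stays the same.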
