• This repository has been archived on 08/Aug/2024

Statistical Data Analysis in Python

Introductory Tutorial, SciPy 2013, 25 June 2013

Christopher Fonnesbeck - Vanderbilt University School of Medicine

Chris Fonnesbeck is an Assistant Professor in the Department of Biostatistics at the Vanderbilt University School of Medicine. He specializes in computational statistics, Bayesian methods, meta-analysis, and applied decision analysis. He originally hails from Vancouver, BC and received his Ph.D. from the University of Georgia.

Description

This tutorial will introduce the use of Python for statistical data analysis, using data stored as Pandas DataFrame objects. Much of the work involved in analyzing data resides in importing, cleaning and transforming data in preparation for analysis. Therefore, the first half of the course consists of a two-part overview of basic and intermediate Pandas usage that will show how to effectively manipulate datasets in memory. This includes tasks like indexing, alignment, join/merge methods, date/time types, and handling of missing data. Next, we will cover plotting and visualization using Pandas and Matplotlib, focusing on creating effective visual representations of your data while avoiding common pitfalls. Finally, participants will be introduced to methods for statistical data modeling using some of the advanced functions in NumPy, SciPy and Pandas. This will include fitting your data to probability distributions, estimating relationships among variables using linear and non-linear models, and a brief introduction to bootstrapping methods. Each section of the tutorial will involve hands-on manipulation and analysis of sample datasets, to be provided to attendees in advance.

The target audience for the tutorial includes all new Python users, though we recommend that users also attend the NumPy and IPython session in the introductory track.

Student Instructions

Students familiar with Git may simply clone this repository to obtain all the materials (IPython notebooks and data) for the tutorial. Alternatively, you may download a zip file containing the materials. A third option is to view static notebooks by clicking on the titles of each section below.

Outline

Introduction to Pandas

  • Importing data
  • Series and DataFrame objects
  • Indexing, data selection and subsetting
  • Hierarchical indexing
  • Reading and writing files
  • Sorting and ranking
  • Missing data
  • Data summarization
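
As a rough illustration of a few topics in this section (not taken from the tutorial notebooks), the sketch below builds a small DataFrame, selects data by label and by condition, and summarizes a column that contains a missing value. The data and column names are made up for the example, and it assumes a recent pandas release under Python 3.

```python
import numpy as np
import pandas as pd

# Hypothetical data; not one of the tutorial datasets.
df = pd.DataFrame(
    {"patient": ["a", "b", "c", "d"],
     "weight": [70.5, np.nan, 82.0, 64.5]}
).set_index("patient")

# Label-based selection of a single value
weight_c = df.loc["c", "weight"]

# Boolean subsetting: rows with an observed weight above 65
heavy = df[df["weight"] > 65]

# Missing data: count the gaps, then summarize (mean skips NaN by default)
n_missing = int(df["weight"].isna().sum())
mean_weight = df["weight"].mean()

# Sorting by value places NaN last by default
ranked = df.sort_values("weight", ascending=False)
```

Note that the comparison `df["weight"] > 65` evaluates to False for the missing entry, so that row is silently excluded from `heavy`.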

Data Wrangling with Pandas

  • Date/time types
  • Merging and joining DataFrame objects
  • Concatenation
  • Reshaping DataFrame objects
  • Pivoting
  • Data transformation
  • Permutation and sampling
  • Data aggregation and GroupBy operations
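
The merge, GroupBy and pivot steps listed above might look roughly like the following. This is a minimal sketch with invented tables and column names (`visits`, `groups`, `arm`), assuming a recent pandas release; the tutorial notebooks use their own datasets.

```python
import pandas as pd

# Hypothetical tables; not the tutorial datasets.
visits = pd.DataFrame({
    "patient": ["a", "a", "b", "c"],
    "date": pd.to_datetime(["2013-01-05", "2013-02-10", "2013-01-20", "2013-03-01"]),
    "score": [3.0, 5.0, 4.0, 6.0],
})
groups = pd.DataFrame({
    "patient": ["a", "b", "c"],
    "arm": ["treatment", "control", "treatment"],
})

# Merge: attach each patient's study arm to every visit
merged = visits.merge(groups, on="patient", how="left")

# GroupBy: mean score per arm
mean_by_arm = merged.groupby("arm")["score"].mean()

# Pivot: one row per patient, one column per visit month
merged["month"] = merged["date"].dt.month
wide = merged.pivot(index="patient", columns="month", values="score")
```

Patient/month combinations with no visit appear as NaN in `wide`, which ties back to the missing-data handling covered in the first section.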

Plotting and Visualization

  • Plotting in Pandas vs Matplotlib
  • Bar plots
  • Histograms
  • Box plots
  • Grouped plots
  • Scatterplots
  • Trellis plots

Statistical Data Modeling

  • Statistical modeling
  • Fitting data to probability distributions
  • Fitting regression models
  • Model selection
  • Bootstrapping
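
For the bootstrapping item, one common variant is the percentile bootstrap, sketched below with NumPy. The function name `bootstrap_ci`, its parameters, and the sample values are hypothetical, and the code assumes NumPy 1.17 or later (for `default_rng`), which is newer than the minimum version listed under Required Packages. It is not necessarily the approach used in the tutorial notebook.

```python
import numpy as np

def bootstrap_ci(data, stat=np.mean, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Resample with replacement, recomputing the statistic each time
    estimates = np.array([
        stat(rng.choice(data, size=data.size, replace=True))
        for _ in range(n_boot)
    ])
    # The interval endpoints are empirical quantiles of the resampled statistic
    lower, upper = np.percentile(
        estimates, [100 * alpha / 2, 100 * (1 - alpha / 2)]
    )
    return lower, upper

sample = np.array([2.1, 3.4, 1.9, 5.6, 4.4, 3.8, 2.7, 4.9])  # made-up values
low, high = bootstrap_ci(sample)
```

Passing a different `stat` (for example `np.median`) reuses the same resampling loop for another statistic.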

Required Packages

  • Python 2.7 or higher (including Python 3)
  • pandas >= 0.11.1 and its dependencies
  • NumPy >= 1.6.1
  • matplotlib >= 1.0.0
  • pytz
  • IPython >= 0.12
  • pyzmq
  • tornado

Optional: statsmodels, xlrd and openpyxl

For students running the latest version of Mac OS X (10.8), the easiest way to obtain all the packages is to install the SciPy Superpack, which works with the Python 2.7.2 that ships with OS X.

Otherwise, another easy way to install all the necessary packages is to use Continuum Analytics' Anaconda.
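
Whichever installation route you choose, one way to see which of the packages above are present is to query their installed versions. The sketch below is an assumption-laden convenience, not part of the tutorial materials: it relies on `importlib.metadata` from the standard library (Python 3.8 or later), and the helper name `installed_versions` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as published on PyPI ("ipython" is IPython's).
REQUIRED = ["pandas", "numpy", "matplotlib", "pytz", "ipython", "pyzmq", "tornado"]
OPTIONAL = ["statsmodels", "xlrd", "openpyxl"]

def installed_versions(names):
    """Map each package name to its installed version, or None if absent."""
    found = {}
    for name in names:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found

if __name__ == "__main__":
    for name, ver in installed_versions(REQUIRED + OPTIONAL).items():
        print(f"{name}: {ver or 'not installed'}")
```

The script only reports what is installed; comparing the reported versions against the minimums listed above is left to the reader.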

Statistical Reading List

The Ecological Detective: Confronting Models with Data, Ray Hilborn and Marc Mangel

Though the book is aimed at ecologists, Hilborn and Mangel identify key methods that scientists can use to build useful and credible models for their data. They don't shy away from the math, but the book is very readable and example-laden.

Data Analysis Using Regression and Multilevel/Hierarchical Models, Andrew Gelman and Jennifer Hill

The go-to reference for applied hierarchical modeling.

The Elements of Statistical Learning, Hastie, Tibshirani and Friedman

A comprehensive machine learning guide for statisticians.

A First Course in Bayesian Statistical Methods, Peter Hoff

An excellent, approachable book to get started with Bayesian methods.

Regression Modeling Strategies, Frank Harrell

Frank Harrell's bag of tricks for regression modeling. I pull this off the shelf every week.


Statistical Data Analysis in Python by Christopher Fonnesbeck is licensed under a Creative Commons Attribution 4.0 International License.
