  • Stars: 116
  • Rank: 303,894 (Top 6 %)
  • Language: Jupyter Notebook
  • License: MIT License
  • Created over 9 years ago
  • Updated about 4 years ago

Repository Details

A short tutorial for data scientists on how to write tests for code + data.

Best Testing Practices for Data Science

A short tutorial for data scientists on how to write tests for your code and your data. Before the tutorial, please read through this README file, for it contains a lot of useful information that will help you best prepare for the tutorial.
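To make the distinction between testing code and testing data concrete, here is a minimal sketch. It is not taken from the tutorial notebooks: the `normalize` function, the CSV contents, and the temperature bounds are all hypothetical, chosen only for illustration.

```python
import csv
import io


def normalize(values):
    """Rescale numbers to the range [0, 1] (assumes at least two distinct values)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def test_normalize():
    # A test for code: a known input must produce a known output.
    assert normalize([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


# Hypothetical raw data, inlined so the example is self-contained.
RAW = "sample,temperature\ns1,21.5\ns2,19.0\ns3,23.1\n"


def test_temperature_column():
    # A test for data: check assumptions about the values themselves.
    rows = list(csv.DictReader(io.StringIO(RAW)))
    assert len(rows) == 3
    for row in rows:
        assert row["temperature"] != ""  # no missing values
        assert -50.0 <= float(row["temperature"]) <= 60.0  # plausible range


test_normalize()
test_temperature_column()
```

Functions named `test_*` like these would also be collected by a test runner such as pytest; calling them directly at the bottom just keeps the sketch runnable on its own.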

How to use this repository

The tutorial notes are typed up in Jupyter notebooks, and static HTML versions are available under the docs folder. For the non-bonus material, I suggest working through the notes in order. With the exception of the Projects, the bonus material can be tackled in any order. During the tutorial, be sure to have the HTML versions open.

Pre-Requisite Knowledge

I am assuming you are of the following type of coder:

  • You are a data analytics type, who knows how to read/write CSV files with Pandas, and do basic data manipulation (slicing, indexing rows + columns, using the .apply() function).
  • You are not necessarily a seasoned software developer who has experience running tests.
  • You are comfortable with operating in the Terminal environment.
  • You have some rudimentary knowledge of numpy, particularly the array.min(), array.max(), array.mean(), array.std(), and numpy.allclose(a1, a2) function calls.
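As a quick refresher on the numpy calls named above, the following sketch may help. The array values are made up for illustration, and the variable names are arbitrary.

```python
import numpy as np

# Hypothetical measurements, made up for illustration.
values = np.array([1.0, 2.0, 3.0, 4.0])

smallest = values.min()
largest = values.max()
average = values.mean()
spread = values.std()  # population standard deviation (ddof=0 by default)

# Floating-point arithmetic rarely gives exact equality, so compare
# arrays within a tolerance rather than element by element with ==.
a1 = np.array([0.1 + 0.2, 0.5])
a2 = np.array([0.3, 0.5])
exactly_equal = bool((a1 == a2).all())
close_enough = bool(np.allclose(a1, a2))
```

Here `exactly_equal` is False while `close_enough` is True, which is why tolerance-based comparisons such as numpy.allclose tend to be the safer choice when asserting on numerical results.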

In order to prepare for the tutorial, there are some pieces of Python syntax that will come in handy to know:

  • the context manager syntax (with ....),
  • assertions (assert condition1 == condition2),
  • file I/O (with open(....) as f:...),
  • list/dict comprehensions and generator expressions ([a for a in container if condition(a)]),
  • checking types & attributes (isinstance(obj, type) or hasattr(obj, attr)).
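The syntax in the list above can be seen together in one short sketch. The records and the file name are hypothetical, and a temporary directory is used so that nothing is left behind on disk.

```python
import os
import tempfile

# Hypothetical records, made up for illustration.
records = [
    {"name": "a", "value": 1},
    {"name": "b", "value": -2},
    {"name": "c", "value": 3},
]

# List comprehension: keep only the records with a positive value.
positive = [r for r in records if r["value"] > 0]

# Dict comprehension: map each name to its value.
lookup = {r["name"]: r["value"] for r in records}

# Assertion: raises AssertionError if the condition is false.
assert len(positive) == 2

# Context managers: the directory and files are cleaned up or closed
# automatically when each `with` block exits.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "names.txt")
    with open(path, "w") as f:
        f.write("\n".join(r["name"] for r in positive))
    with open(path) as f:
        names = f.read().splitlines()

# Checking types and attributes.
assert isinstance(names, list)
assert hasattr(names, "append")
```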

Feedback

If you've taken a version of this tutorial, please leave feedback here. I use the suggestions in there to adjust the tutorial content and make it better. The changes are always released publicly on GitHub, so everybody benefits!

Environment Setup

conda setup

This installation route should work cross-platform. I recommend using the Anaconda distribution of Python because it is a good way to bootstrap your data science environment.

To get set up, create a conda environment based on the provided environment.yml spec file. Run the following command in your bash terminal.

$ bash conda-setup.sh

pip setup

The alternative way is to use a virtualenv environment:

$ bash venv-setup.sh
$ source datatest/bin/activate

Alternatively, you can pip install each of the dependencies listed in the environment.yml file. (The requirements.txt file may be less eagerly maintained than the environment.yml file, given the conda-biases that I have.)

Manual Setup

If you prefer having more control over your installation process, conda or pip install the dependencies listed in the environment.yml file.

Checks

To check whether the environment is correctly set up, run the checkenv.py script:

$ python checkenv.py

It should print "All packages found; environment checks passed." to your terminal. Otherwise, conda or pip install the missing packages that it reports (they will show up one by one).
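For readers curious how a check like this can work, here is a minimal sketch. It is not the repository's actual checkenv.py: the helper names `find_missing` and `report`, the wording of the "not found" message, and the package list are all assumptions made for illustration.

```python
from importlib.util import find_spec


def find_missing(packages):
    """Return the top-level package names that cannot be imported."""
    return [name for name in packages if find_spec(name) is None]


def report(packages):
    """Build a message listing missing packages, or a success message."""
    missing = find_missing(packages)
    if missing:
        return "\n".join(f"{name} not found; please install it." for name in missing)
    return "All packages found; environment checks passed."


# Hypothetical package list; substitute the names from environment.yml.
print(report(["json", "csv"]))
```

Using find_spec avoids importing each package just to see whether it is installed, which keeps the check fast and free of import side effects.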

Authors

Contributors

Special thanks go to the individuals who have contributed in ways big and small to the improvement of the material.

  • Renee Chu
  • Matt Bachmann: @Bachmann1234
  • Hugo Bowne-Anderson: @hugobowne
  • Boston Python tutorial attendees:
    • @races1986
    • Thao Nguyen: @ThaoNguyen15
    • @ChrisMuir

Data Credits
