Cookiecutter Modern Data Science
Cookiecutter template for starting a Data Science project with modern, fast Python tools.
Features
- Pipenv for managing packages and virtualenvs in a modern way.
- Prefect for modern pipelines and data workflows.
- Weights and Biases for experiment tracking.
- FastAPI for fast, self-documenting HTTP APIs - with performance on par with NodeJS and Go - built on asyncio, ASGI, and Uvicorn.
- Modern CLI with Typer.
- Batteries included: pandas, NumPy, SciPy, seaborn, and JupyterLab already installed.
- Consistent code quality: black, isort, autoflake, and pylint already installed.
- Pytest for testing.
- GitHub Pages for the public website.
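As a taste of the Typer feature above, here is a minimal sketch of what a Typer-based CLI entry point can look like. The command names and options are illustrative, not the template's actual code:

```python
# A minimal Typer CLI sketch; command names and defaults are illustrative.
import typer

app = typer.Typer()

@app.command()
def etl(input_dir: str = "data/0_raw", output_dir: str = "data/1_interim") -> None:
    """Download, generate, and process data."""
    typer.echo(f"Running ETL: {input_dir} -> {output_dir}")

@app.command()
def train(model_name: str = "baseline") -> None:
    """Train and evaluate a model."""
    typer.echo(f"Training model: {model_name}")

if __name__ == "__main__":
    app()
```

Typer derives `--help` output, argument parsing, and shell completion from the type hints, so each command is self-documenting.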
Quickstart
Install the latest Cookiecutter and Pipenv:
pip install -U pipenv cookiecutter
Generate the project:
cookiecutter gh:crmne/cookiecutter-modern-datascience
Get inside the project:
cd <repo_name>
pipenv shell # activates virtualenv
(Optional) Start Weights & Biases locally, if you don't want to use the cloud/on-premise version:
wandb local
Start working:
jupyter-lab
Directory structure
This is how your new project will look:
├── .gitignore               <- GitHub's excellent Python .gitignore customized for this project
├── LICENSE                  <- Your project's license.
├── Pipfile                  <- The Pipfile for reproducing the analysis environment
├── README.md                <- The top-level README for developers using this project.
│
├── data
│   ├── 0_raw                <- The original, immutable data dump.
│   ├── 0_external           <- Data from third party sources.
│   ├── 1_interim            <- Intermediate data that has been transformed.
│   └── 2_final              <- The final, canonical data sets for modeling.
│
├── docs                     <- GitHub Pages website
│   ├── data_dictionaries    <- Data dictionaries
│   └── references           <- Papers, manuals, and all other explanatory materials.
│
├── notebooks                <- Jupyter notebooks. Naming convention is a number (for ordering),
│                               the creator's initials, and a short `_` delimited description, e.g.
│                               `01_cp_exploratory_data_analysis.ipynb`.
│
├── output
│   ├── features             <- Fitted and serialized features
│   ├── models               <- Trained and serialized models, model predictions, or model summaries
│   └── reports              <- Generated analyses as HTML, PDF, LaTeX, etc.
│       └── figures          <- Generated graphics and figures to be used in reporting
│
├── pipelines                <- Pipelines and data workflows.
│   ├── Pipfile              <- The Pipfile for reproducing the pipelines environment
│   ├── pipelines.py         <- The CLI entry point for all the pipelines
│   ├── <repo_name>          <- Code for the various steps of the pipelines
│   │   ├── __init__.py
│   │   ├── etl.py           <- Download, generate, and process data
│   │   ├── visualize.py     <- Create exploratory and results oriented visualizations
│   │   ├── features.py      <- Turn raw data into features for modeling
│   │   └── train.py         <- Train and evaluate models
│   └── tests
│       ├── fixtures         <- Where to put example inputs and outputs
│       │   ├── input.json   <- Test input data
│       │   └── output.json  <- Test output data
│       └── test_pipelines.py <- Integration tests for the pipelines
│
└── serve                    <- HTTP API for serving predictions
    ├── Dockerfile           <- Dockerfile for the HTTP API
    ├── Pipfile              <- The Pipfile for reproducing the serving environment
    ├── app.py               <- The entry point of the HTTP API
    └── tests
        ├── fixtures         <- Where to put example inputs and outputs
        │   ├── input.json   <- Test input data
        │   └── output.json  <- Test output data
        └── test_app.py      <- Integration tests for the HTTP API