Awesome Machine Learning Engineer
What is this and how do I use it?
- This is a curated list of delightful resources for everything you need to develop Machine Learning solutions.
- Each item in this list will teach you at least one distinct and significant skill or piece of information.
- There are three content levels:
  - 🐥 Essential reading for all ML engineers
  - 🐓 Advanced reading for professional ML engineers
  - 🦉 Expert material for expert ML engineers
- Descriptions are written to complete the sentence "After reading this article you will have learned ...".
Contents
Communication
- 🐥 BLUF: The Military Standard That Can Make Your Writing More Powerful - How to make your communication more powerful (5 min)
- 🐥 The XY Problem - How to focus on explaining your end goal when asking for help (5 min)
- 🐥 Bike-shedding: how mature are you as an engineer? - How to avoid and call out bike-shedding (5 min)
- 🐥 E-mail like a boss - How to write better e-mails (5 min)
- 🐥 Stop Swiss Cheesing your calendar - How to manage your calendar so you can focus (15 min)
- 🐥 How to write in plain English - How to write in plain English (30 min)
- 🐥 Presentation Rules - How to create a great slide deck (30 min)
- 🐓 SMART criteria - How to define goals (15 min)
- 🐓 MECE principle - How to fully decompose a problem into a structured list (15 min)
- 🐓 SCQA: What is it, how does it work, and how can it help me? - How to structure your presentations, proposals, and sales outlines (15 min)
- 🐓 No More Misunderstandings - How to avoid miscommunication by paraphrasing (15 min)
- 🦉 Nonviolent communication - How to deliver constructive feedback in difficult situations (15 min)
- 🦉 The Halo effect - How to recognize and use the Halo effect to your advantage (15 min)
- 🦉 Mythical Man Month - The relationship between person-days and throughput time in a project (15 min)
- 🦉 Four-sides model - How to communicate effectively by considering how the receiver interprets your message (30 min)
Software Engineering
API design
- 🐥 Semantic Versioning - How to bump the version of your apps and packages (15 min)
- 🐥 `__all__` and wild imports in Python - How `__all__` defines the public API of your Python packages (15 min)
- 🐥 APIs for Machine Learning - How to design RESTful APIs for Machine Learning applications (30 min)
- 🐓 FastAPI docs - How to build RESTful APIs that correspond one-to-one with an OpenAPI specification (1 day)
- 🐓 The Rule of Three - When to build reusable components and when not (15 min)
- 🐓 Falsehoods programmers believe about time - How to avoid common pitfalls about time (15 min)
- 🐓 Falsehoods programmers believe about names - How to avoid common pitfalls about names (15 min)
- 🐓 Command Line Interface Guidelines - How to write great CLIs (1 hour)
- 🦉 Zalando's RESTful API guidelines - How to design RESTful APIs (1 day)
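As a rough illustration of the Semantic Versioning item above, here is a minimal sketch of a version bump. The `bump` helper is hypothetical (it is not part of any listed package) and assumes a plain `MAJOR.MINOR.PATCH` string without pre-release or build metadata.

```python
def bump(version: str, part: str) -> str:
    """Return `version` with the given part incremented.

    Hypothetical helper: assumes a plain "MAJOR.MINOR.PATCH" string.
    """
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        # Backwards-incompatible change: reset minor and patch.
        return f"{major + 1}.0.0"
    if part == "minor":
        # Backwards-compatible feature: reset patch.
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        # Backwards-compatible bug fix.
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


next_version = bump("1.4.2", "minor")
```

In practice, a tool from the package list below would likely handle this for you; the sketch only shows which parts reset when another part is incremented.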
Workflow
- 🐥 Poetry Cookiecutter - How to scaffold a modern Poetry-based development environment for Python packages and apps (30 min)
- 🐥 The seven rules of a great Git commit message - How to write great Git commit messages (15 min)
- 🐥 Learn Git Branching - Practice Git from beginner to advanced (1 hour)
- 🐓 Keep a Changelog - How to keep a changelog for your apps and packages (30 min)
- 🐓 Conventional Commits - How to prefix your commit messages to automate Semantic Versioning and Keep a Changelog (15 min)
- 🐓 Testing Python Applications with Pytest - How to properly test a package with pytest (30 min)
- 🐓 A successful Git branching model - How to release software with Git (15 min)
- 🐓 Code Review Best Practices - What to look for when reviewing a Pull Request (30 min)
- 🐓 Code Health: Respectful Reviews == Useful Reviews - How to communicate code review comments respectfully (15 min)
- 🐓 The Code Review Pyramid - What to look for and what to automate when reviewing a Pull Request (15 min)
- 🦉 Poetry workspace plugin - How to create and manage a Poetry-based monorepo (15 min)
Python patterns
- 🐥 PEP20 "The Zen of Python" - How to write idiomatic Python (15 min)
- 🐥 The Definitive Guide to Python import Statements - How to write import statements (30 min)
- 🐥 Understanding Python's logging module - How to use the `logging` module effectively (30 min)
- 🐓 Don't run code at import time - Why you shouldn't run code at import time
- 🐓 Please fix your decorators - Why you should probably use `wrapt` to write your decorators (30 min)
- 🦉 Do not log - What you should be doing instead of logging (30 min)
- 🦉 The Little Book of Python Anti-Patterns - A collection of Python anti-patterns (X hours)
- 🦉 Effective Python - A collection of Python idioms (X hours)
- 🦉 Python Design Patterns - A collection of software architecture patterns (1 hour)
- 🦉 SOLID - A standard set of software architecture patterns (1 hour)
- 🦉 What the f*ck Python! - How to master Python by understanding its edge cases (1 day)
Typing
- 🐓 The Comprehensive Guide to mypy - How to write type annotations in Python (1 hour)
- 🐓 Pydantic overview - How to write type annotations for complex types instead of a meaningless `Dict[str, Any]` (1 hour)
- 🐓 Magic number - Why magic values are an anti-pattern (15 min)
- 🐓 Enums - How to write `Enum`s in Python instead of type-unsafe magic values (15 min)
- 🦉 Mypy generics - How to use `TypeVar`s to write generic types such as `List[T]` (30 min)
- 🦉 Mypy protocols - How to use `Protocol`s to define interfaces such as `Iterable` (30 min)
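To make the typing items above a little more concrete, here is a small sketch that combines an `Enum`, a `TypeVar`, and a `Protocol` using only the standard library. All names in it (`Split`, `first`, `SupportsPredict`, `ConstantModel`, `mean_prediction`) are hypothetical and chosen for illustration; it assumes Python 3.8 or later.

```python
from enum import Enum
from typing import List, Protocol, TypeVar


class Split(Enum):
    """An Enum replaces magic strings such as "train" and "test"."""

    TRAIN = "train"
    TEST = "test"


T = TypeVar("T")


def first(items: List[T]) -> T:
    """Generic function: the return type follows the list's element type."""
    return items[0]


class SupportsPredict(Protocol):
    """Structural interface: any object with a matching `predict` qualifies."""

    def predict(self, rows: List[float]) -> List[float]: ...


class ConstantModel:
    """Satisfies SupportsPredict without inheriting from it."""

    def __init__(self, value: float) -> None:
        self.value = value

    def predict(self, rows: List[float]) -> List[float]:
        return [self.value for _ in rows]


def mean_prediction(model: SupportsPredict, rows: List[float]) -> float:
    """Average the model's predictions over the given rows."""
    predictions = model.predict(rows)
    return sum(predictions) / len(predictions)


split = Split("train")  # look up a member by value; an unknown value raises ValueError
average = mean_prediction(ConstantModel(2.0), [0.0, 1.0, 5.0])
```

A type checker such as mypy should accept `ConstantModel` wherever `SupportsPredict` is expected, because protocols are matched by structure and not by inheritance.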
Curated Python packages
Workflow
- 🐓 cookiecutter - Scaffold new Python packages or apps quickly with a Cookiecutter template
- 🐓 cruft - Update a Python package's underlying Cookiecutter scaffolding
- 🐓 commitizen - Check that commit messages satisfy Conventional Commits and automate Semantic Versioning and Keep a Changelog
- 🐓 poetry - Manage the packaging and dependencies of your Python project
- 🐓 poe - Define and run tasks in a Poetry project with Poe the Poet
- 🐓 poetry-workspace-plugin - Manage a Python monorepo with this Poetry plugin
Code quality
- 🐥 black - Automatically format your code
- 🐥 isort - Automatically sort your import statements
- 🐓 pre-commit - Automatically run code quality checks on commit
- 🐓 bandit - Find common security issues
- 🐓 darglint - Check that your docstrings match your function signature
- 🐓 flake8 - Check your code for bugs and that your code style is PEP8-compliant
- 🐓 flake8 extensions - An awesome list of Flake8 extensions
- 🐓 mypy - Check the type-correctness of your code
- 🐓 pre-commit hooks - A collection of pre-commit hooks that check file quality
- 🐓 pydocstyle - Check that your code is documented
- 🐓 pygrep hooks - A collection of pre-commit hooks that check for common Python code smells
- 🐓 pytest-recording - Record and play back HTTP requests in your pytest tests
- 🐓 pyupgrade - Check that your code is written using the latest Python language features
- 🐓 safety - Check that your dependencies don't have any known security vulnerabilities
- 🐓 shellcheck - Check the quality of your shell scripts
- 🐓 coverage.py - Check your code's test coverage
- 🦉 hypothesis - Write tests that automatically look for edge cases that break your code
- 🦉 hypothesis-auto - Automatically generate Hypothesis tests based on your code's type annotations
Application development
- 🐓 fastapi - Create RESTful APIs based on type annotations
- 🐓 typer - Create CLIs based on type annotations
- 🐓 streamlit - Create web apps with a single Python file
Utilities
- 🐓 bump2version - Release a new version of your package
- 🐓 coloredlogs - Increase your logs' readability with colour
- 🐓 hvplot - Create interactive plots from pandas DataFrames
- 🐓 mkdocs - Create developer documentation for your project
- 🐓 pdoc - Generate API documentation for your code
- 🐓 birdseye - Graphically debug your Python code
- 🐓 scalene - Profile your code's CPU and memory usage by line
- 🐓 viztracer - Visualize your code's performance with a flamegraph
- 🐓 tqdm - Easily add progress bars to long-running jobs
Machine Learning
Practical theory
- 🐓 Bias-variance tradeoff - How a model's expected error decomposes into bias, variance, and irreducible noise (30 min)
- 🐓 The two different uses of cross-validation - How to use nested cross-validation to combine the two different uses of cross-validation (30 min)
- 🐓 Modes, Medians and Means: A Unifying Perspective - Why minimizing the Mean Absolute Error (MAE) is more robust than minimizing the Mean Squared Error (MSE) (30 min)
- 🐓 Backpropagation is the chain rule to compute the gradient - How backpropagation is an algorithm to compute the objective function's gradient (30 min)
- 🐓 Stacked generalization - How to stack models (30 min)
- 🐓 We have been using the wrong initialization for t-SNE and UMAP - How to initialize t-SNE and UMAP properly (15 min)
- 🦉 From classic Fully Connected Networks to Transformers - How neural networks evolved from Fully Connected Networks to Transformers (30 min)
- 🦉 What is the .632+ rule? - How to measure generalization performance with bootstrapping (30 min)
- 🦉 Stacking strategies with and without leaks - Different strategies to stack models (30 min)
- 🦉 Data Distribution Shifts and Monitoring - How to detect and address the different types of data shift (1 hour)
- 🦉 Backprop is not just the chain rule - How backpropagation relates to Lagrange multipliers (30 min)
- 🦉 Why ML algorithms are hard to tune - Optimize multiple objectives when the Pareto front is concave (30 min)
- 🦉 Deep learning model compression - How quantization, pruning, and distillation can be used to compress models (30 min)
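The MAE versus MSE item above can be checked with a few lines of standard-library Python. This is only a sketch on made-up numbers: the best constant prediction under MSE is the mean, the best constant under MAE is the median, and a single outlier pulls the mean far more than the median. The `mae` and `mse` helpers and the sample `values` are illustrative assumptions.

```python
from statistics import mean, median
from typing import List


def mae(values: List[float], constant: float) -> float:
    """Mean Absolute Error of predicting `constant` for every value."""
    return sum(abs(value - constant) for value in values) / len(values)


def mse(values: List[float], constant: float) -> float:
    """Mean Squared Error of predicting `constant` for every value."""
    return sum((value - constant) ** 2 for value in values) / len(values)


# Made-up sample with one large outlier.
values = [1.0, 2.0, 3.0, 4.0, 100.0]

best_for_mse = mean(values)    # the mean minimizes MSE; the outlier drags it upwards
best_for_mae = median(values)  # the median minimizes MAE; it stays with the bulk of the data

print(f"MSE-optimal constant: {best_for_mse}")
print(f"MAE-optimal constant: {best_for_mae}")
```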
Explainability
- 🐓 SHAP: SHapley Additive exPlanations - How to explain a model's output with Shapley values (30 min)
- 🦉 Intro to Shapley and SHAP - How Shapley values are approximated by SHAP (30 min)
Unsupervised
- 🐓 UMAP: Uniform Manifold Approximation and Projection - How to reduce dimensionality for visualization and modelling (30 min)
- 🐓 PyNNDescent - How to find nearest neighbours in huge datasets (15 min)
Classification
- 🐥 Precision and recall - How precision and recall measure a classifier's performance (30 min)
- 🐓 Probability calibration - How and for which model types you should calibrate the model's output scores into probabilities (30 min)
- 🐓 You're all calculating churn rates wrong - How to correctly define churn (30 min)
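For the precision and recall item above, a minimal sketch for binary labels may help. The `precision_recall` helper and the sample labels are hypothetical; it treats `1` as the positive class and returns `0.0` when a denominator is zero, which is one convention among several.

```python
from typing import List, Tuple


def precision_recall(y_true: List[int], y_pred: List[int]) -> Tuple[float, float]:
    """Compute (precision, recall) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    true_positives = sum(1 for truth, pred in pairs if truth == 1 and pred == 1)
    false_positives = sum(1 for truth, pred in pairs if truth == 0 and pred == 1)
    false_negatives = sum(1 for truth, pred in pairs if truth == 1 and pred == 0)

    predicted_positives = true_positives + false_positives
    actual_positives = true_positives + false_negatives

    # Precision: of everything predicted positive, how much was right?
    precision = true_positives / predicted_positives if predicted_positives else 0.0
    # Recall: of everything actually positive, how much was found?
    recall = true_positives / actual_positives if actual_positives else 0.0
    return precision, recall


# A cautious classifier: it flags only one positive, and that one is correct.
y_true = [1, 1, 1, 1, 0]
y_pred = [1, 0, 0, 0, 0]
precision, recall = precision_recall(y_true, y_pred)
```

In this made-up example the classifier is precise but misses most positives, which is why the two metrics are usually reported together.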
Regression
- 🦉 Gaussian processes - From scratch - How to build probabilistic regression models with Gaussian Processes (1 hour)
Computer Vision
- 🐓 Microsoft's Document Image Transformer - A self-supervised pre-trained model that achieves SotA performance on PubLayNet and can be used for various downstream tasks (30 min)
Natural Language Processing
- 🐓 Awesome Sentence Embedding - A curated list of pretrained sentence and word embedding models (15 min)
Time Series Analysis
- 🐓 The Prophet model - How Meta's Prophet model decomposes a time series into a trend, seasonality, and holiday components (30 min)
- 🐓 Darts - Time Series Made Easy in Python - How to build forecasting models with `darts` (1 hour)
Recommender Systems
- 🐓 Microsoft Recommenders - A comparison of recommender system models (30 min)
Tensor computation libraries
- 🐓 What I Wish Someone Had Told Me About Tensor Computation Libraries - How JAX, PyTorch, TensorFlow, and Theano are different (30 min)
Pandas
- 🐓 Modern Pandas series (Part 1 - 7) - Write idiomatic pandas (1 hour)
- 🐓 Awesome Pandas - An awesome list of Pandas resources (1 hour)
scikit-learn
- 🐥 Using scikit-learn Pipelines and FeatureUnions - How to use `Pipeline`s to build end-to-end models (30 min)
- 🐥 Transforming target in regression - How to transform the target to build more robust models (15 min)
- 🐓 ColumnTransformer for heterogeneous data - How to use `ColumnTransformer` to process pandas DataFrames in sklearn `Pipeline`s (30 min)
- 🐓 Custom Estimators - Create your own custom `Estimator` (30 min)
- 🐓 Hyperparameter optimization with successive halving - How to optimize hyperparameters efficiently with successive halving (30 min)
Labelling
- 🐓 Doccano - A tool for labelling text (30 min)
- 🐓 CVAT: Computer Vision Annotation Tool - A tool for labelling images (30 min)
- 🐓 Awesome Data Labelling - An awesome list of data labelling tools (30 min)
DevOps
CI/CD
- 🐓 invoke - How to implement common tasks you run on your project as a CLI (30 min)
- 🐓 poe - How to implement common tasks you run on your project as a CLI (30 min)
Environment and dependency management
- 🐥 Intro to packaging and dependency management for Python with Poetry - How to manage your Python package's dependencies and environment (30 min)
- 🐓 Intro to Pyenv for Machine Learning - How to use pyenv to manage your Python interpreter (30 min)
- 🐓 Modern Python Environments - dependency and workspace management - A comparison between pyenv, venv + pip, venv + pip-tools, poetry, pipenv, and conda (30 min)
- 🦉 Conda: Myths and Misconceptions - Common misconceptions about Conda (15 min)
Docker
- 🐥 Docker Curriculum - How to use Docker (4 hours)
- 🐓 Docker layer caching - How to write Dockerfiles to benefit from layer caching (30 min)
- 🐓 Dockerfile best practices - How to write good Dockerfiles (1 hour)
- 🐓 Configuring Gunicorn for Docker - How to best configure Gunicorn for a Docker image (30 min)
- 🐓 Speed up Docker with BuildKit's new caching - How to speed up Docker builds with a build cache (30 min)
- 🦉 Build secrets in Docker and Compose, the secure way - How to use secrets in a Docker build (15 min)
- 🦉 Security scanners for Python and Docker - How to scan your code and Docker image for security issues (30 min)
- 🦉 The security scanner that cried wolf - How to scan your Docker image for security issues without false positives (15 min)
- 🦉 Awesome Docker - An awesome list of Docker resources (30 min)
Data pipelines
- 🐓 Great Expectations - How to test and document your data and data pipelines (30 min)
Shell
- 🐓 Cron best practices - How to best use cron to schedule tasks (30 min)
- 🐓 A visual guide to SSH tunnels - How to forward ports and create tunnels with SSH (30 min)
- 🐓 Safe ways to do things in bash - How to write safe and robust shell scripts (1 hour)
- 🦉 Your terminal is not a terminal: An Introduction to Streams - How your terminal is a tool to manipulate streams (30 min)
- 🦉 Bash Heredoc - How to pass multiline arguments to commands with a heredoc (30 min)
- 🦉 Please stop writing shell scripts - Why you shouldn't write shell scripts for CI/CD or Docker images (30 min)
Terraform
- 🐥 An Introduction to Terraform - How to use Terraform (1 hour)
- 🐓 Terraform best practices - Terraform best practices (1 hour)
- 🦉 Terraform pre-commit hooks collection - How to automate Terraform code quality checks with pre-commit (1 hour)
- 🦉 Awesome Terraform - An awesome list of Terraform resources (30 min)
- 🐥 Terraform Tutorial - How to get started with Terraform (1 hour)
Infrastructure
- 🐥 Using Redis In-Memory Storage for your Python Applications - How to use Redis as an in-memory cache for your Python application (30 min)
- 🐓 Python Kafka Consumers: at-least-once, at-most-once, exactly-once - How to write different types of Kafka consumers in Python (30 min)
- 🦉 Kafka Exactly-Once-Semantics - How to produce and consume messages exactly once (1 hour)
- 🦉 RabbitMQ: a message queue library with persistence - RabbitMQ is a messaging system with a message broker (4 hours)
- 🦉 ZeroMQ: a socket library with message queue primitives - ZeroMQ is a lightweight messaging system without a message broker (8 hours)
Curated by Radix
Radix is a Belgium-based Machine Learning company.
We invent, design and develop AI-powered software. Together with our clients, we identify which problems within organizations can be solved with AI, demonstrating the value of Artificial Intelligence for each problem.
Our team is constantly looking for novel and better-performing solutions and we challenge each other to come up with the best ideas for our clients and our company.
Here are some examples of what we do with Machine Learning, the technology behind AI:
- Help job seekers find great jobs that match their expectations. On the Belgian Public Employment Service website, you can find our job recommendations based on your CV alone.
- Help hospitals save time. We extract diagnoses from patient discharge letters.
- Help publishers estimate their impact by detecting copycat articles.
We work hard and we have fun together. We foster a culture of collaboration, where each team member feels supported when taking on a challenge, and trusted when taking on responsibility.