Daily Dose of Data Science
Daily Dose of Data Science is a publication on Substack that brings together intriguing frameworks, libraries, technologies, and tips that make the life cycle of a Data Science project effortless.
This repository is a collection of all the code snippets presented in my publication. If you want to receive these tips in your mailbox daily, you can subscribe to my Substack newsletter.
Run These Code Snippets on Your Local Machine
To download the tips listed here, you can clone this repo.
git clone https://github.com/ChawlaAvi/Daily-Dose-of-Data-Science
Table of Contents
- Pandas
- Jupyter Tips
- Python
- Plotting
- NumPy
- Memory Optimization
- Cool Tools
- Run-time Optimization
- Sklearn
- Debugging
- Missing Data
- ML-AI News
- Machine Learning
- Statistics
- Testing
- Terminal
- Documents
- Animations
Pandas
Title | Notebook | Substack | Article |
---|---|---|---|
One-Minute Guide To Becoming a Polars-savvy Data Scientist | |||
Avoid Using Pandas' Apply() Method At All Times | |||
Pandas vs Polars - Run-time and Memory Comparison | |||
A Lesser-Known Feature of the Merge Method in Pandas | |||
A Highly Overlooked Approach To Analysing Pandas DataFrames | |||
The Most Common Misconception About Inplace Operations in Pandas | |||
Become A Bilingual Data Scientist With These Pandas to SQL Translations | |||
Avoid This Costly Mistake When Indexing A DataFrame | |||
AutoProfiler: Automatically Profile Your DataFrame As You Work | |||
Why You Should Avoid Appending Rows To A DataFrame | |||
Are You Sure You Are Using The Correct Pandas Terminologies? | |||
If You Are Not Able To Code A Vectorized Approach, Try This. | |||
Why Are We Typically Advised To Never Iterate Over A DataFrame? | |||
PyGWalker: Analyze Pandas Dataframe in Jupyter using a Tableau-style Interface | |||
A Simple Trick to Make The Most Out of Pivot Tables in Pandas | |||
Never Worry About Parsing Errors Again While Reading CSV with Pandas | |||
An Interesting and Lesser-Known Way To Create Plots Using Pandas | |||
Generate Helpful Hints As You Write Your Pandas Code | |||
Speed-up Parquet I/O of Pandas by 5x | |||
Stop Using The Describe Method in Pandas. Instead, use Skimpy. | |||
Stop Using The Describe Method in Pandas. Instead, use Summarytools. | |||
Analyze A Pandas DataFrame Without Code | |||
70x Faster Pandas By Changing Just One Line of Code | |||
Reduce Memory Usage Of A Pandas DataFrame By 90% | |||
Speed-up Pandas Apply 5x with NumPy | |||
A Lesser-Known Feature of Apply Method In Pandas | |||
Create Pandas DataFrame from Dataclass | |||
Run SQL in Jupyter To Analyze A Pandas DataFrame | |||
When You Should Not Use the head() Method In Pandas | |||
Three Lesser-known Tips For Reading a CSV File Using Pandas | |||
The Best File Format To Store A Pandas DataFrame | |||
Lesser-Known Feature of the Merge Method in Pandas | |||
The Best Way to Use Apply() in Pandas | |||
A No-code Tool To Understand Your Data Quickly | |||
Display Progress Bar With Apply() in Pandas | |||
Supercharge value_counts() Method in Pandas With Sidetable | |||
Explore CSV Data Right From The Terminal | |||
Define the Correct DataType for Categorical Columns | |||
Don't Create Conditional Columns in Pandas with Apply | |||
Write Your Own Flavor Of Pandas | |||
Create DataFrame Hassle-free By Using Clipboard | |||
Alter the Datatype of Multiple Columns at Once | |||
Why you should not dump DataFrames to a CSV | |||
Why You Should Not Read CSVs with Pandas | |||
Parallelize Pandas Apply() With Swifter | |||
A Hidden Feature of Describe Method In Pandas | |||
Enrich Your Notebook With Interactive Controls | |||
Data Analysis Using No-Code Pandas In Jupyter | |||
Create Pivot Tables, Aggregations and Plots Without Any Code | |||
Parallelize Pandas with Pandarallel | |||
Pretty Plotting With Pandas | |||
How to Read Multiple CSV Files Efficiently | |||
Configure Sklearn To Output Pandas DataFrame | |||
Datatype For Handling Missing Valued Columns in Pandas | |||
Vectorization Does Not Always Guarantee Better Performance |
Jupyter Tips
Title | Notebook | Substack | Article |
---|---|---|---|
Declutter Your Jupyter Notebook Using Interactive Controls | |||
The Coolest GitHub-Colab Integration You Would Ever See | |||
Break the Linear Presentation of Notebooks With Stickyland | |||
Restart Jupyter Kernel Without Losing Variables | |||
Annotate Data With The Click Of A Button Using Pigeon | |||
Build Elegant Web Apps Right From Jupyter Notebook with Mercury | |||
Supercharge Your Jupyter Kernel With ipyflow | |||
PyGWalker: Analyze Pandas Dataframe in Jupyter using a Tableau-style Interface | |||
Draw The Data You Are Looking For In Seconds | |||
Never Search Jupyter Notebooks Manually Again To Find Your Code | |||
Stop Previewing Raw DataFrames. Instead, Use DataTables | |||
Label Your Data With The Click Of A Button | |||
The Coolest Jupyter Notebook Hack | |||
View Documentation in Jupyter Notebook | |||
Get Notified When Jupyter Cell Has Executed | |||
Clear Cell Output In Jupyter Notebook During Run-time | |||
CodeSquire: The AI Coding Assistant You Should Use Over GitHub Copilot | |||
Find Your Code Hiding In Some Jupyter Notebook With Ease | |||
Enrich Your Notebook With Interactive Controls | |||
Data Analysis Using No-Code Pandas In Jupyter | |||
Create Pivot Tables, Aggregations and Plots Without Any Code | |||
Restart Notebook Without Losing Variables | |||
Retrieve Previously Computed Output In Jupyter Notebook | |||
Transfer Variables Between Jupyter Notebooks |
Python
Title | Notebook | Substack | Article |
---|---|---|---|
7 Elegant Usages of Underscore in Python | |||
How To Enforce Type Hints in Python? | |||
A Common Misconception About Deleting Objects in Python | |||
What Makes The Join() Method Blazingly Faster Than Iteration? | |||
A Hidden Feature of a Popular String Method in Python | |||
Execute Python Project Directory as a Script | |||
Improve Python Run-time Without Changing A Single Line of Code | |||
A Lesser-Known Difference Between For-Loops and List Comprehensions | |||
Magic Methods: An Underrated Gem of Python OOP | |||
9 Command Line Flags To Run Python Scripts More Flexibly | |||
Use Custom Python Objects In A Boolean Context | |||
You Were Probably Given Incomplete Info About A Tuple's Immutability | |||
A Counterintuitive Thing About Python Dictionaries | |||
Probably The Fastest Way To Execute Your Python Code | |||
A Counterintuitive Fact About Python Functions | |||
Manipulating Mutable Objects In Python Can Get Confusing At Times | |||
Most Python Programmers Don't Know This About Python OOP | |||
You Can Add a List As a Dictionary's Key (Technically)! | |||
Why Python Does Not Offer True OOP Encapsulation | |||
Most Python Programmers Don't Know This About Python For-loops | |||
How To Enable Function Overloading In Python | |||
The Right Way to Roll Out Library Updates in Python | |||
F-strings Are Much More Versatile Than You Think | |||
A Single Line That Will Make Your Python Code Faster | |||
Make Dot Notation More Powerful in Python | |||
An Elegant Way To Perform Shutdown Tasks in Python | |||
What Are Class Methods and When To Use Them? | |||
Hide Attributes While Printing A Dataclass Object | |||
List : Tuple :: Set : ? | |||
__post_init__: Add Attributes To A Dataclass Post Initialization | |||
Simplify Your Functions With Partial Functions | |||
DotMap: A Better Alternative to Python Dictionary | |||
Prevent Wild Imports With __all__ in Python | |||
Performance Comparison of Python 3.11 and Python 3.10 | |||
Why 256 is 256 But 257 is not 257? | |||
Make a Class Object Behave Like a Function | |||
Lesser-known Feature of Pickle Files | |||
Specify Loops and Runs In %%timeit | |||
Don't Use time.time() To Measure Execution Time | |||
Import Your Python Package as a Module | |||
Fine-grained Error Tracking With Python 3.11 | |||
Run Python Project Directory As A Script | |||
Use Slotted Class To Improve Your Python Code | |||
Using Dictionaries In Place of If-conditions | |||
In Defense of Match-case Statements in Python |
Plotting
Title | Notebook | Substack | Article |
---|---|---|---|
Don't Overuse Scatter, Line and Bar Plots. Try These Four Elegant Alternatives. | |||
Sankey Diagrams: An Underrated Gem of Data Visualization | |||
Enrich Your Heatmaps With This Simple Trick | |||
The Coolest Matplotlib Hack to Create Subplots Intuitively | |||
Waterfall Charts: A Better Alternative to Line/Bar Plot | |||
Enrich Your Confusion Matrix With A Sankey Diagram | |||
A Simple One-Liner to Create Professional Looking Matplotlib Plots | |||
Visualise The Change In Rank Over Time With Bump Charts | |||
A Simple Trick That Significantly Improves The Quality of Matplotlib Plots | |||
A Lesser-known Feature of Creating Plots with Plotly | |||
A Little Bit Of Extra Effort Can Hugely Transform Your Basic Matplotlib Plots | |||
Interactively Visualise A Decision Tree With A Sankey Diagram | |||
Use Histograms With Caution. They Are Highly Misleading! | |||
Three Simple Ways To (Instantly) Make Your Scatter Plots Clutter Free | |||
Matplotlib Has Numerous Hidden Gems. Here's One of Them. | |||
A Simple Trick That Will Make Heatmaps More Elegant | |||
The Limitations Of Heatmap That Are Slowing Down Your Data Analysis | |||
An Underrated Technique To Improve Your Data Visualizations | |||
Who Said Matplotlib Cannot Create Interactive Plots? | |||
Don't Create Messy Bar Plots. Instead, Try Bubble Charts! | |||
Use Box Plots With Caution! They May Be Misleading. | |||
An Underrated Technique To Create Better Data Plots | |||
An Interesting and Lesser-Known Way To Create Plots Using Pandas | |||
Style Matplotlib Plots To Make Them More Attractive | |||
Simple One-Liners to Preview a Decision Tree Using Sklearn | |||
Create Data Plots Right From The Terminal | |||
Make Your Matplotlib Plots More Professional | |||
Perfplot: Measure, Visualize and Compare Run-time With Ease | |||
Prettify Word Clouds In Python | |||
Calendar Map As A Richer Alternative to Line Plot | |||
Density Plot As A Richer Alternative to Scatter Plot | |||
Python One-Liner To Create Sketchy Hand-drawn Plots | |||
Create a Moving Bubbles Chart in Python | |||
Visualizing Google Search Trends of 2022 using Python | |||
Create A Racing Bar Chart In Python | |||
Elegantly Plot the Decision Boundary of a Classifier | |||
Dot Plot: A Potential Alternative to Bar Plot | |||
Hexbin Plots As A Richer Alternative to Scatter Plots | |||
Enrich Your Notebook With Interactive Controls | |||
Regression Plot Made Easy with Plotly | |||
Pretty Plotting With Pandas | |||
Polynomial Linear Regression Plot Made Easy With Seaborn | |||
Analyse Flow Data With Sankey Diagrams | |||
Waterfall Charts: A Better Alternative to Line/Bar Plot |
NumPy
Title | Notebook | Substack | Article |
---|---|---|---|
A Major Limitation of NumPy Which Most Users Aren't Aware Of | |||
Beware of This Unexpected Behaviour of NumPy Methods | |||
Speedup NumPy Methods 25x With Bottleneck | |||
Speed-up NumPy 20x with Numexpr | |||
An Elegant Way To Perform Matrix Multiplication | |||
Difference Between Dot and Matmul in NumPy | |||
Don't Print NumPy Arrays! Use Lovely-NumPy Instead | |||
Polynomial Linear Regression with NumPy |
Memory Optimization
Title | Notebook | Substack | Article |
---|---|---|---|
70x Faster Pandas By Changing Just One Line of Code | |||
Reduce Memory Usage Of A Pandas DataFrame By 90% | |||
The Best File Format To Store A Pandas DataFrame | |||
Define the Correct DataType for Categorical Columns | |||
Datatype For Handling Missing Valued Columns in Pandas | |||
Save Memory with Python Generators |
Cool Tools
Title | Notebook | Substack | Article |
---|---|---|---|
CNN Explainer: Interactively Visualize a Convolutional Neural Network | |||
Break the Linear Presentation of Notebooks With Stickyland | |||
Annotate Data With The Click Of A Button Using Pigeon | |||
Mito Just Got Supercharged With AI! | |||
PyGWalker: Analyze Pandas Dataframe in Jupyter using a Tableau-style Interface | |||
Supercharge Shell With Python Using Xonsh | |||
Draw The Data You Are Looking For In Seconds | |||
Preview Your README File Locally In GitHub Style | |||
This GUI Tool Can Possibly Save You Hours Of Manual Work | |||
Stop Previewing Raw DataFrames. Instead, Use DataTables. | |||
Converting Python To LaTeX Has Possibly Never Been So Simple | |||
Label Your Data With The Click Of A Button | |||
Analyze A Pandas DataFrame Without Code | |||
A No-Code Online Tool To Explore and Understand Neural Networks | |||
Speed-up NumPy 20x with Numexpr | |||
Debugging Made Easy With PySnooper | |||
Deep Learning Network Debugging Made Easy | |||
CodeSquire: The AI Coding Assistant You Should Use Over GitHub Copilot | |||
Find Unused Python Code With Ease | |||
Enrich Your Notebook With Interactive Controls | |||
Data Analysis Using No-Code Pandas In Jupyter | |||
Modify Python Code During Run-Time | |||
Modify Function During Run-Time | |||
Importing Modules Made Easy with Pyforest | |||
Create Pivot Tables, Aggregations and Plots Without Any Code |
Run-time Optimization
Title | Notebook | Substack | Article |
---|---|---|---|
Pandas vs Polars - Run-time and Memory Comparison | |||
The Limitation of KMeans Which Is Often Overlooked by Many | |||
Most Sklearn Users Don't Know This About Its LinearRegression Implementation | |||
Probably The Fastest Way To Execute Your Python Code | |||
Why Are We Typically Advised To Never Iterate Over A DataFrame? | |||
Speed-up Parquet I/O of Pandas by 5x | |||
A Single Line That Will Make Your Python Code Faster | |||
Make Sklearn KMeans 20x times faster | |||
Speed-up NumPy 20x with Numexpr | |||
The Best File Format To Store A Pandas DataFrame | |||
The Best Way to Use Apply() in Pandas | |||
Don't Create Conditional Columns in Pandas with Apply | |||
Why you should not dump DataFrames to a CSV | |||
Parallelize Pandas Apply() With Swifter | |||
Parallelize Pandas with Pandarallel | |||
How to Read Multiple CSV Files Efficiently |
Sklearn
Title | Notebook | Substack | Article |
---|---|---|---|
Why Sklearn's Linear Regression Has No Hyperparameters? | |||
Scikit-LLM: Integrate Sklearn API with Large Language Models | |||
Most Sklearn Users Don't Know This About Its LinearRegression Implementation | |||
A Lesser-Known Feature of Sklearn To Train Models on Large Datasets | |||
Sklearn One-liner to Generate Synthetic Data | |||
Skorch: Use Scikit-learn API on PyTorch Models | |||
Make Sklearn KMeans 20x times faster | |||
Build Baseline Models Effortlessly With Sklearn | |||
Polynomial Linear Regression with NumPy | |||
An Elegant Way to Import Metrics From Sklearn | |||
Feature Tracking Made Simple In Sklearn Transformers | |||
Configure Sklearn To Output Pandas DataFrame |
Debugging
Title | Notebook | Substack | Article |
---|---|---|---|
Debugging Made Easy With PySnooper | |||
Don't use print() to debug your code. | |||
Inspect Program Flow with IceCream | |||
Lesser-known Feature of f-strings in Python |
Missing Data
Title | Notebook | Substack | Article |
---|---|---|---|
Handle Missing Data With Missingno | |||
Datatype For Handling Missing Valued Columns in Pandas |
ML-AI News
Title | Notebook | Substack | Article |
---|---|---|---|
Now You Can Use DALL·E With OpenAI API |
Machine Learning
Title | Notebook | Substack | Article |
---|---|---|---|
Decision Trees ALWAYS Overfit. Here's A Lesser-Known Technique To Prevent It. | |||
Evaluate Clustering Performance Without Ground Truth Labels | |||
The Most Common Misconception About Continuous Probability Distributions | |||
A Common Misconception About Feature Scaling and Standardization | |||
Random Forest May Not Need An Explicit Validation Set For Evaluation | |||
A Visual and Overly Simplified Guide To Bagging and Boosting | |||
10 Most Common (and Must-Know) Loss Functions in ML | |||
Theil-Sen Regression: The Robust Twin of Linear Regression | |||
The Limitations Of Elbow Curve And What You Should Replace It With | |||
21 Most Important (and Must-know) Mathematical Equations in Data Science | |||
Try This If Your Linear Regression Model is Underperforming | |||
The Limitation of KMeans Which Is Often Overlooked by Many | |||
Nine Most Important Distributions in Data Science | |||
The Limitation of Linear Regression Which is Often Overlooked By Many | |||
A Reliable and Efficient Technique To Measure Feature Importance | |||
Does Every ML Algorithm Rely on Gradient Descent? | |||
Visualize The Performance Of Linear Regression With This Simple Plot | |||
Confidence Interval and Prediction Interval Are Not The Same | |||
The Ultimate Categorization of Performance Metrics in ML | |||
The Most Overlooked Problem With One-Hot Encoding | |||
9 Most Important Plots in Data Science | |||
Is Categorical Feature Encoding Always Necessary Before Training ML Models? | |||
The Counterintuitive Behaviour of Training Accuracy and Training Loss | |||
A Highly Overlooked Point In The Implementation of Sigmoid Function | |||
The Ultimate Categorization of Clustering Algorithms | |||
A Lesser-Known Feature of Sklearn To Train Models on Large Datasets | |||
Visualize The Performance Of Any Linear Regression Model With This Simple Plot | |||
How To Truly Use The Train, Validation and Test Set | |||
The Advantages and Disadvantages of PCA To Consider Before Using It | |||
Loss Functions: An Algorithm-wise Comprehensive Summary | |||
Is Data Normalization Always Necessary Before Training ML Models? | |||
A Visual Guide to Stochastic, Mini-batch, and Batch Gradient Descent | |||
The Taxonomy Of Regression Algorithms That Many Don't Bother To Remember | |||
The Limitation of PCA Which Many Folks Often Ignore | |||
Breathing KMeans: A Better and Faster Alternative to KMeans | |||
How Many Dimensions Should You Reduce Your Data To When Using PCA? | |||
A Visual Guide To Sampling Techniques in Machine Learning | |||
A Visual and Overly Simplified Guide to PCA | |||
The Limitation Of Euclidean Distance Which Many Often Ignore | |||
Visualising The Impact Of Regularisation Parameter | |||
A (Highly) Important Point to Consider Before You Use KMeans Next Time | |||
Is Class Imbalance Always A Big Problem To Deal With? | |||
A Visual Comparison Between Locality and Density-based Clustering | |||
Why Don't We Call It Logistic Classification Instead? | |||
A Typical Thing About Decision Trees Which Many Often Ignore | |||
Always Validate Your Output Variable Before Using Linear Regression | |||
Why Is It Important To Shuffle Your Dataset Before Training An ML Model | |||
Why Are We Typically Advised To Set Seeds for Random Generators? | |||
This Small Tweak Can Significantly Boost The Run-time of KMeans | |||
Most ML Folks Often Neglect This While Using Linear Regression | |||
Is This The Best Animated Guide To KMeans Ever? | |||
An Effective Yet Underrated Technique To Improve Model Performance | |||
How to Encode Categorical Features With Many Categories? | |||
Why KMeans May Not Be The Apt Clustering Algorithm Always | |||
Skorch: Use Scikit-learn API on PyTorch Models | |||
A No-Code Online Tool To Explore and Understand Neural Networks | |||
Make Sklearn KMeans 20x times faster | |||
Deep Learning Network Debugging Made Easy | |||
Build Baseline Models Effortlessly With Sklearn | |||
Polynomial Linear Regression with NumPy |
Statistics
Title | Notebook | Substack | Article |
---|---|---|---|
Be Cautious Before Drawing Any Conclusions Using Summary Statistics | |||
The Limitation Of Pearson Correlation Which Many Often Ignore | |||
Pandas and NumPy Return Different Values for Standard Deviation. Why? | |||
Why Correlation (and Other Statistics) Can Be Misleading |
Testing
Title | Notebook | Substack | Article |
---|---|---|---|
Generate Your Own Fake Data In Seconds |
Terminal
Title | Notebook | Substack | Article |
---|---|---|---|
Supercharge Shell With Python Using Xonsh | |||
Most Command-line Users Don't Know This Cool Trick About Using Terminals | |||
Never Refactor Your Code Manually Again. Instead, Use Sourcery! | |||
Create Data Plots Right From The Terminal | |||
Visualize Commit History of Git Repo With Beautiful Animations | |||
How Would You Identify Fuzzy Duplicates In A Data With Million Records? | |||
Automated Code Refactoring With Sourcery | |||
Explore CSV Data Right From The Terminal |
Documents
Title | Document | Substack | Article |
---|---|---|---|
Daily Dose of Data Science - Full Archive | |||
35 Hidden Python Libraries That Are Absolute Gems | |||
40 Open-Source Tools to Supercharge Your Pandas Workflow | |||
37 Hidden Python Libraries That Are Absolute Gems | |||
10 Automated EDA Tools That Will Save You Hours Of (Tedious) Work | |||
30 Python Libraries to (Hugely) Boost Your Data Science Productivity |
Animations
Title | Notebook | Substack | Video |
---|---|---|---|
Visualizing The Data Transformation of a Neural Network |