Data Science, Machine Learning & Visualization Dojo
A collection of Data Science and ML projects, and a dojo where I practice skills and theory in Data Science, Machine Learning, Deep Learning, Data Visualization, probability, statistics, etc.
Built with
Machine Learning, Deep Learning, Data Science libraries
- NumPy - package for scientific computing with Python
- Pandas - fast, powerful, flexible and easy to use open source data analysis and manipulation tool
- Pandas Profiling - generate reports from dataframe
- GeoPandas - adds support for geographic data to pandas objects.
- Scikit-learn - Simple and efficient tools for predictive data analysis
- TensorFlow - An end-to-end open source machine learning platform
- Keras - Deep Learning framework
- NLTK - Natural Language Toolkit
- dlib - A toolkit for making real world machine learning and data analysis applications in C++
- Face Recognition - The world's simplest facial recognition api for Python and the command line
Data Visualization libraries
- Matplotlib - a comprehensive library for creating static, animated, and interactive visualizations in Python
- Seaborn - statistical data visualization
- Bokeh - interactive visualization library for modern web browsers
- Plotly - The front-end for ML and data science models
- Cufflinks - Productivity Tools for Plotly + Pandas
Turning into Web applications
- Streamlit - The fastest way to build and share data apps
- Flask - a micro web framework written in Python
Spark
- Apache Spark - a unified analytics engine for large-scale data processing.
- Spark with PySpark - PySpark is the Python API for Apache Spark
- Databricks - Unified Data Analytics Platform - One cloud platform for massive scale data engineering and collaborative data science.
Tools and Datasources
- Jupyter Notebook - Notebook system for data analysis
- Google Colab - Great notebook system by Google, which gives free access to GPUs
- Kaggle - Source of Dataset collections
- Plotly Chart Studio - The fastest way to publish & embed interactive charts online
Projects
Breast Cancer Tumor Diagnostic - Classification Project
- The project is to build a machine learning model to predict whether a tumor is benign or malignant based on several observations/features.
- using data from Breast Cancer Wisconsin (Diagnostic) Data Set - UCI
Fandango movie ratings - Capstone Project
Data Analysis and Visualization Capstone project from Machine Learning and Data Science Masterclass Course.
- This is the data behind the story Be Suspicious Of Online Movie Ratings, Especially Fandango’s
- using data from 538
- If you are planning on going out to see a movie, how well can you trust online reviews and ratings? Especially if the same company showing the rating also makes money by selling movie tickets.
- Do they have a bias towards rating movies higher than they should be rated?
- etc.
Supervised Learning Capstone Project - Cohort Analysis & Customer Churn Predictions
- This project is to build a machine learning model to predict whether or not a customer will churn.
- Includes cohort analysis based on Telco subscribers' contract type, etc.
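As a rough illustration of the cohort idea above, the sketch below groups customers by contract type and computes the churn rate per group. The records and the helper name `churn_rate_by_contract` are hypothetical and not taken from the project's dataset or notebooks.

```python
from collections import defaultdict

# Hypothetical records: (contract_type, churned). Not the project's real data.
customers = [
    ("Month-to-month", True),
    ("Month-to-month", True),
    ("Month-to-month", False),
    ("One year", False),
    ("One year", True),
    ("Two year", False),
    ("Two year", False),
]


def churn_rate_by_contract(records):
    """Return {contract_type: fraction of customers in that group who churned}."""
    totals = defaultdict(int)
    churned = defaultdict(int)
    for contract, did_churn in records:
        totals[contract] += 1
        if did_churn:
            churned[contract] += 1
    return {contract: churned[contract] / totals[contract] for contract in totals}


rates = churn_rate_by_contract(customers)
for contract, rate in rates.items():
    print(f"{contract}: {rate:.0%}")
```

With real data, the same grouping would typically be done on a dataframe, but the per-cohort ratio is the same calculation.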
Predicting Heart Disease - Classification Project
Milestone project from Complete Machine Learning and Data Science - Zero to Mastery course.
- The project is to build a machine learning model capable of predicting whether or not someone has heart disease based on their medical attributes.
- using data from Heart Disease Data Set of UCI - kaggle version
Predicting Bulldozer Sale Price - Regression Project
Milestone project from Complete Machine Learning and Data Science - Zero to Mastery course.
- The project is to build a machine learning model to predict the sale price of bulldozers based on the past prices.
- using data from Blue Book for Bulldozers - kaggle version
Deep Learning ANN Project - Dog breed predictions
Project from Complete Machine Learning and Data Science - Zero to Mastery course.
- The project is to build a deep learning model with TensorFlow to predict dog breeds.
- using data from Dog Breed Identification - kaggle version
911 Calls - Data Capstone Project
Data Analysis and Visualization Capstone project from Data Science and Machine Learning Bootcamp Course.
- analyzing 911 calls data from kaggle
- top 5 zip codes for 911 calls
- top 5 townships for 911 calls
- most common reason for a 911 call
- different types of visualizations based on the findings
- etc.
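The "top 5" questions above boil down to a frequency count. Here is a minimal sketch using only the standard library; the zip codes and the helper name `top_n` are made up for illustration, and the actual project works on the Kaggle CSV.

```python
from collections import Counter

# Hypothetical zip codes for a handful of calls (illustrative only).
call_zips = [
    "19401", "19464", "19401", "19403", "19446", "19401",
    "19464", "19406", "19403", "19401", "19002",
]


def top_n(values, n=5):
    """Return the n most frequent values with their counts, most common first."""
    return Counter(values).most_common(n)


top_zips = top_n(call_zips)
for zip_code, count in top_zips:
    print(zip_code, count)
```

In pandas, calling `value_counts().head(5)` on the relevant column gives the same kind of ranking.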
ML App - Random Forest Algorithm - ML Project
- Machine learning app using streamlit, for building a regression model using the Random Forest algorithm.
Machine Learning & Data Science Projects
Masterclass Projects
- Ames Housing Data Project - Linear Regression
- Heart Disease Detection Project - Logistic Regression
- Sonar Data - Detecting Rock or Mine Project - KNN
- Wine Fraud Detection Project - SVM
- Mushroom Edible or Poisonous Prediction Project - with AdaBoost
- Mushroom Edible or Poisonous Prediction Project - with Gradient Boosting
- Ecommerce Project - Linear Regression
- Advertisement Project - Logistic Regression
- Anonymized Data Project - KNN
- Supervised Learning Capstone Project - Cohort Analysis & Customer Churn Predictions
- NLP - Flight Tweets Sentiment Analysis - Classification
- NLP - Movie Review Sentiment Analysis - Classification
- Color Quantization - KMeans
- CIA Country Analysis and Clustering - KMeans
- Cars Model - Hierarchical Clustering
- Wholesale Customers - DBSCAN Clustering
- Breast Cancer - PCA Manual Implementation
- Breast Cancer - PCA with sklearn
Other Projects
- Project - Used Car Price Prediction with XG-Boost
- Project - Predict Career Longevity for NBA Rookies with Binary Classification - Logistic Regression
- Project - Facial Classification - SVM
- Project - Predict Sales Revenue with Interaction Term - Multiple Linear Regression
- Project - Predict Sales Revenue - Simple Linear Regression
- Project - Breast Cancer Tumor Diagnostic Classification - SVM
- Project - Music Recommender
- Project - Smarty Brain Image Prediction
Deep Learning Projects
- Iris Flower Predictions App on Flask
- ANN - Loan Default Prediction Project
- ANN - Predict House Price for House Sales in King County, USA Project
- ANN - Breast Cancer Wisconsin (Diagnostic) Project
- CNN - Convolutional Neural Networks for Image Classification - MNIST data Project
- CNN - Convolutional Neural Networks for Image Classification - CIFAR 10 data Project
- CNN - Convolutional Neural Networks for Image Classification - Real Image - Malaria Detection Project
- CNN - Convolutional Neural Networks for Image Classification - Fashion MNIST Data Project
- RNN - Frozen Dessert Sales Forecasting with LSTM
- NLP - Yelp Reviews Classification - Natural Language Processing Project
- Average Eating Habits of UK Countries - Autoencoders
Data Analysis and Visualization Projects
- Data Visualization with Python - Project: Data analysis and data visualization using Pandas and Matplotlib for countries' GDP, life expectancy comparison across continents, GDP per capita relative growth, population relative growth comparison, etc.
- Fuel Economy Case Study - Project: Analyzing fuel economy data provided by EPA for distributions of greenhouse gas score and combined mpg in 2008 and 2018, and the correlation between displacement and combined mpg, and between greenhouse gas score and combined mpg. Are more unique models using alternative fuels in 2018 compared to 2008? By how much? How much have vehicle classes improved in fuel economy (increased in mpg)? What are the characteristics of SmartWay vehicles? Have they changed over time (mpg, greenhouse gas)? What features are associated with better fuel economy (mpg)? Which vehicle improved the most in terms of combined mpg from 2008 to 2018?
- Wine Quality Case Study - Project: Analyzing wine data for the following points for wine businesses to model better wine. Is a certain type of wine (red or white) associated with higher quality? What level of acidity (pH value) receives the highest average rating? Do wines with higher alcoholic content receive better ratings? Do sweeter wines (more residual sugar) receive better ratings? White Vs Red Wine Proportions by Color & Quality
- TV, Halftime Shows, and the Big Game - Project: Analyzing Super Bowl data and answering questions like - What are the most extreme game outcomes? How does the game affect television viewership? How have viewership, TV ratings, and ad cost evolved over time? Who are the most prolific musicians in terms of halftime show performances?
- Weather Trend - Project: Analyzing Global weather trends, Singapore weather trends, Comparing Global vs Singapore 10 years Moving Average trends
- Real-time Insights from Social Media Data - Project: Analyzing Twitter data and answering questions like: What are the global and local trends? Includes finding the common trends.
- frequency analysis on tweets and hashtags, etc.
- Statistics From Stock Data: Analyzing Google, Apple and Amazon stock prices and checking the rolling mean.
- Android Play Store App Data Analysis - Project: Analyzing Android Play Store data and answering questions like - How many apps are paid? How much money are they making? When were these apps released?
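The rolling mean used in the Statistics From Stock Data item (and the moving averages in the Weather Trend item) can be sketched without any libraries. This is a minimal illustration with hypothetical prices; the function name `rolling_mean` is an assumption, not code from the project.

```python
from collections import deque


def rolling_mean(values, window):
    """Simple moving average; positions before a full window are None."""
    if window <= 0:
        raise ValueError("window must be positive")
    buf = deque(maxlen=window)
    running = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            running -= buf[0]  # oldest value, evicted by the append below
        buf.append(v)
        running += v
        out.append(running / window if len(buf) == window else None)
    return out


prices = [10.0, 11.0, 12.0, 13.0, 14.0]  # hypothetical closing prices
smoothed = rolling_mean(prices, window=3)
print(smoothed)
```

This mirrors the behaviour of pandas `Series.rolling(window).mean()`, which returns NaN (rather than None) until the window is full.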
Bootcamps
RL - Practical AI with Python and Reinforcement Learning - JP - On Hold
- 00. NumPy Crash Course
- 01. Matplotlib Visualization
- 02. Pandas and Scikit-learn
- 03. ANNs
- 04. CNNs
- 05. Introduction to gym
- 06. Classical Q Learning
- 07. Deep Q Learning
- 08. Deep Q Learning on Images
- 09. Creating Custom Open AI Gym Environment
Tensorflow 2.0: Deep Learning and Artificial Intelligence - LP
- Section 2 - Google Colab
- Section 3 - Machine Learning and Neurons
- Section 4 - Feedforward Artificial Neural Networks
- Section 5 - CNN Convolutional Neural Networks
- Section 6 - RNN - Recurrent Neural Networks, Time Series, Sequence Data
- Section 7 - NLP
- Section 8 - Recommender Systems
- Section 9 - Transfer Learning for Computer Vision
- Section 10 - GANs
- Section 11 - Deep Reinforcement Learning (Theory)
- Section 12 - Stock Trading Project with DL
- Section 13: Advanced Tensorflow Usage
- Section 14: Low-Level Tensorflow
- Section 15: In-Depth: Loss Functions
- Section 16: In-Depth: Gradient Descent
- Section 17 - 21: Misc
DeepLearning.AI - Course 04.Sequences, Time Series and Predictions in Tensorflow
- Week 01 - Sequences and Prediction
- Week 02 - Deep Neural Networks for Time Series
- Week 03 - Recurrent Neural Networks for Time Series
- Week 04 - Real-world time series data
DeepLearning.AI - Course 03.Natural Language Processing in Tensorflow
- Week 01 - Sentiment in Text
- Week 02 - Word Embeddings
- Week 03 - Sequence Models
- Week 04 - Sequence Models and Literature
DeepLearning.AI - Course 02.Convolutional Neural Networks in TensorFlow
- Week 01 - Exploring a Larger Dataset
- Week 02 - Augmentation: A technique to avoid overfitting
- Week 03 - Transfer Learning
- Week 04 - Multiclass Classification
DeepLearning.AI - Course 01.Introduction to TensorFlow for Artificial Intelligence, Machine Learning, and Deep Learning
- Week 01 - A New Programming Paradigm
- Week 02 - Introduction to Computer Vision
- Week 03 - Enhancing Vision with CNN
- Week 04 - Using Real-world images
Deep Learning TensorFlow Developer Certificate - ZTM - IN PROGRESS
- 01. Introduction
- 02. Deep Learning and Tensorflow Fundamentals
- 03. Neural Network Regression with Tensorflow
- 04. Neural Network Classification with Tensorflow
- 05. Computer Vision and Convolutional Neural Networks in Tensorflow
- 06. Transfer Learning - Feature Extraction
- 07. Transfer Learning - Fine Tuning
- 08. Transfer Learning - Scaling up
- 09. Milestone Project 1 - Food Vision Big
- 10. NLP Fundamentals in Tensorflow
- 11. Milestone Project 2 - SkimLit
- 12. Time Series Fundamentals + Milestone Project 3 - BitPredict
- 13. Passing Tensorflow Certificate Exam
- 15. Appendix - Machine Learning Primer
- 16. Appendix - Machine Learning Framework
- 14, 17-19. Misc
Complete Tensorflow 2 and Keras Deep Learning Bootcamp - JP
- NumPy Crash Course
- Pandas Crash Course
- Visualization Crash Course
- Basic Artificial Neural Networks - ANNs
- Convolutional Neural Networks - CNNs
- Convolutional Neural Networks for Image Classification - MNIST data
- Convolutional Neural Networks for Image Classification - CIFAR 10 data
- Convolutional Neural Networks for Image Classification - Real Image - Malaria Detection Project
- Convolutional Neural Networks for Image Classification - Fashion MNIST Data Project
- Recurrent Neural Networks - RNNs
- Natural Language Processing - NLP
- Auto Encoders
- Generative Adversarial Networks - GANs
- Deployment
Machine Learning & Data Science Masterclass - JP
- new track 2021 Python for Machine Learning & Data Science Masterclass
- Python Crash Course
- NumPy
- Pandas
- Matplotlib
- Seaborn Data Visualizations
- Data Analysis and Data Visualization Capstone Project
- Linear Regression Models
- Feature Engineering and Data Preparation
- Cross Validation, Grid Search and Linear Regression Project
- Logistic Regression Models
- KNN - K Nearest Neighbors
- SVM - Support Vector Machines
- Tree Based Methods - Decision Tree Learning
- Random Forests
- Boosting Methods
- Supervised Learning Capstone Project - Cohort Analysis & Customer Churn Predictions
- Naive Bayes Classification and Natural Language Processing (Supervised Learning)
- K Means Clustering (Unsupervised Learning)
- Hierarchical Clustering (Unsupervised Learning)
- DBSCAN (Unsupervised Learning)
- Principal Component Analysis (Unsupervised Learning)
- Model Deployment
- Serving model as API with Flask
Complete Machine Learning and Data Science - Zero to Mastery
- Data Analysis with Pandas
- Data Analysis with NumPy
- Linear Regression with Polyfit - Data 36
- Matplotlib - Data Visualizations
- Scikit-learn - Creating Machine Learning Models
- Milestone Project - Supervised Learning (Classification)- Heart Disease Detection
- Milestone Project - Supervised Learning (Regression)- Bulldozer Sales Price Prediction
- Deep Learning Project - Dog breed predictions
ML - Machine Learning & Data Science A-Z Hands-on Python - NS
- 03. Preprocessing
- 04. Machine Learning Types
- 05. Supervised Learning - Classification
- 06. Supervised Learning - Regression
- 07. Unsupervised Learning - Clustering
- 08. Hyper Parameters Optimization
Data Science and Machine Learning Bootcamp
- Python Crash Course
- Python for Data Analysis - NumPy
- Python for Data Analysis - Pandas
- Python for Data Visualization - Matplotlib
- Python for Data Visualization - Seaborn
- Pandas Built In Data Visualization
- Visualization with Plotly and Cufflinks
- Data Capstone Projects
- Linear Regression
- Logistic Regression
- K Nearest Neighbors (KNN)
- Decision Tree and Random Forest
- Support Vector Machine (SVM)
- K Means Clustering
- Principal Component Analysis
- Recommender Systems
- Natural Language Processing
- Neural Nets and Deep Learning
- Big Data and Spark with Python
- SciPy
Complete Data Science Bootcamp - 365
- Part 1 - The Field of Data Science
- Part 2 - Probability
- Part 3 - Statistics (Descriptive & Inferential)
- Part 4 - Python
- Part 5 - Advanced Statistical Methods in Python / Machine Learning in Python
- Part 6 - Mathematics
- Part 7 - Deep Learning
- Software Integration
- Case Study - Absenteeism
Books
Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (in progress)
- The Fundamentals of Machine Learning
- The Machine Learning Landscape
- End-to-End Machine Learning Project
- Classification
- Training Models
The Hundred-Page Machine Learning Book
- Introduction
- Notation and Definitions
- Fundamental Algorithms
- Anatomy of a Learning Algorithm
- Basic Practice
- Neural Networks and Deep Learning
- Problems and Solutions
- Advanced Practice
- Unsupervised Learning
- Unsupervised Learning - in-depth material
- Other Forms of Learning
- Conclusion
Advancing Machine Learning & Data Science Journey - (In Progress)
To skill up my ML & DS related skills in specific areas and topics:
Applied Machine Learning - Ensemble Learning
- Project: Titanic dataset
- 01.ML Basic
- 02.Preparing the Data
- 03.Ensemble Learning
- 04.Boosting
- 05.Bagging
- 06.Stacking
- 07.Evaluation and Selection of Models
Applied Machine Learning - Feature Engineering
- Project: Titanic dataset
- 01.ML Basic
- 02.Intro to Feature Engineering
- 03.Explore Data
- 04.Create and Clean Features
- 05.Prepare Features for Modelling
- 06.Compare and Evaluate Models
Applied Machine Learning - Algorithms
- Project: Titanic dataset
- 01.Review of Foundation
- 02.Logistic Regression
- 03.Support Vector Machine
- 04.Multi-layer Perceptron
- 05.Random Forest
- 06.Boosting
- 07.Final Model Selection and Evaluation
Applied Machine Learning - Foundation
- Project: Titanic dataset
- 01.ML Basic
- 02.Exploratory Data Analysis and Data Cleaning
- 03.Evaluation - Measuring Success
- 04.Optimizing a Model
- 05.End to End Pipeline
ML - Mistakes to avoid in Machine Learning
- Assuming Data is good to go
- Neglecting to consult subject matter experts
- Overfitting your models
- Not standardizing your data
- Focusing on Wrong Factors
- Data Leakage
- Forgetting traditional statistics tools
- Assuming Deployment is a breeze
- Assuming Machine Learning is the answer
- Developing in a silo
- Not treating for imbalanced sampling
- Interpreting your coefficients without properly treating for multicollinearity
- Evaluating by accuracy alone
- Giving overly technical presentations
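Two of the mistakes listed above, not treating imbalanced sampling and evaluating by accuracy alone, are easy to see with a toy example. The sketch below uses hypothetical labels and a deliberately useless "model" that always predicts the majority class; the helper names are illustrative only.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical imbalanced data: 95 negatives, 5 positives.
y_true = [0] * 95 + [1] * 5
# A "model" that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy={acc:.2f}, recall={rec:.2f}")
```

The accuracy looks high even though no positive case is ever detected, which is why recall, precision, or similar metrics are usually reported alongside it.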
Deep Learning, Machine Learning, AI & Data Science
- Deep Learning - Natural Language Processing with TensorFlow
- Deep Learning - Face Recognition
- Deep Learning - Image Recognition
- Deep Learning - Building Deep Learning Applications with Keras 2.0
- Applied Machine Learning - Ensemble Learning
- Applied Machine Learning - Feature Engineering
- Applied Machine Learning - Algorithms
- Applied Machine Learning - Foundation
- Machine Learning with Python - 03_k-Means Clustering
- Machine Learning with Python - 02_Decision Trees
- Machine Learning with Python - 01_Foundations
- ML - Mistakes to avoid in Machine Learning
- ML - Classification Modelling with Iris flowers
- Data Science A-Z Modeling
- Designing for Neural Networks and AI Interfaces
- Introduction to GPT-3: A Leap in Artificial Intelligence
Data Analysis, Manipulation & Data Visualization
- DA & DV - Python Data Analysis & Visualization Masterclass
- Pandas - Pandas Code Challenges
- Pandas - Advanced Pandas
- DV - Data Visualizations with Plotly
- DA - Data Analysis with Pandas and Python - BP
- DA - Python Data Playbook - Cleaning Data
- Pandas - Pandas Playbook - Manipulating Data
- More Python Data Tools - Microsoft
Apache Spark & PySpark
- Intro to Spark SQL and DataFrames
- Apache Spark Essential Training
- Spark for Machine Learning & AI
- Apache PySpark by Example
- Apache Spark Deep Learning Essential Training
Data Scientist Reading Materials
- Supervised Learning
- Lesson 01: Machine Learning Bird's Eye View
- Lesson 02: Linear Regression
- Lesson 03: Perceptron Algorithm
- Lesson 04: Decision Trees
- Lesson 05: Naive Bayes
- Lesson 06: Support Vector Machines
- Lesson 07: Ensemble Methods
- Lesson 08: Model Evaluation Metrics
- Lesson 09: Training and Tuning
- Lesson 10: Finding Donors Project
Kaggle Courses
- Python
- Pandas
- Data Cleaning
- Introduction to Machine Learning
- Machine Learning Intermediate
- Feature Engineering
- Machine Learning Explainability
- Data Visualization
- Intro to Deep Learning
- Intro to Game AI and Reinforcement Learning
- Natural Language Processing
- Micro-challenges
- Computer Vision
- Intro to SQL
- Advanced SQL
Google ML courses
- ML Crash Course
- Problem Framing
- Data Prep
- Clustering
- Recommendation
- Testing and Debugging
- GANs
Probability & Statistics (in progress)
- Linear Regression Analysis
- Multi Regression Analysis
- Practical Statistics
- Admission Case Study with Python (Simpson's Paradox)
- Simulating Coin Flips & Probability
- Simulating multiple Coin Flips & Binomial Distribution
- Cancer Test Results
- Conditional Probability & Bayes' Rule
- Excel Data Manipulation, Analysis and Visualization
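For the cancer test and conditional probability topics above, Bayes' rule can be worked through in a few lines. The prevalence, sensitivity, and specificity below are hypothetical numbers chosen for illustration, not values from the course material.

```python
def posterior_positive(prior, sensitivity, specificity):
    """P(disease | positive test) via Bayes' rule.

    P(D | +) = P(+ | D) P(D) / [P(+ | D) P(D) + P(+ | not D) P(not D)]
    """
    p_pos_given_disease = sensitivity
    p_pos_given_healthy = 1.0 - specificity
    p_pos = p_pos_given_disease * prior + p_pos_given_healthy * (1.0 - prior)
    return p_pos_given_disease * prior / p_pos


# Hypothetical test: 1% prevalence, 90% sensitivity, 90% specificity.
p = posterior_positive(prior=0.01, sensitivity=0.9, specificity=0.9)
print(f"P(disease | positive) = {p:.3f}")
```

Under these assumed numbers, a positive result still leaves the probability of disease well below 10%, because false positives from the much larger healthy group dominate.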
Data Science Math Skills - Duke University
Topics include:
- Set theory, including Venn diagrams
- Properties of the real number line
- etc
License
This project is licensed under the MIT License - see the LICENSE.md file for details