DATA SCIENCE ROADMAP
This repo has been inspired by these:
Why this?
I want to track the progress of my studies in this broad area. I do not intend to list a huge number of resources or courses, just the ones that I have completed so far and the ones I have in mind as next steps (in the short, medium or even long term): valuable resources that may become part of my plan in the future.
How are things classified here?
Classifying subjects in Data Science is not easy. Some courses clearly belong to a single category, while others fit more than one. I have tried to simplify the categories of interest into those listed in the table of contents. There may still be some inconsistencies, but I am happy with the result :)
TABLE OF CONTENTS
- Introductory Courses in Data Science.
- General Courses in Data Science.
- Data Analysis.
- Machine Learning.
  - Octave/Matlab.
  - Python.
  - R.
- Text Mining and NLP.
- Data Visualization and Reporting.
  - Python.
  - R.
  - JavaScript.
- Probability and Statistics.
- Big Data.
- Books.
- Other courses in Computer Science.
1. INTRODUCTORY COURSES IN DATA SCIENCE
Python
- Intro to Python for Data Science (by Filip Schouwenaars from DataCamp and Jonathan Sanito from Microsoft at edX). ~ 12 hours.
- Intro to Python for Data Science (by Filip Schouwenaars at DataCamp). ~ 4 hours.
- Intermediate Python for Data Science (by Filip Schouwenaars at DataCamp). ~ 4 hours.
- Python Data Science Toolbox (Part 1) (by Hugo Bowne-Anderson at DataCamp). ~ 3 hours.
- Python Data Science Toolbox (Part 2) (by Hugo Bowne-Anderson at DataCamp). ~ 4 hours.
- Data Types for Data Science (by Jason Myers at DataCamp). ~ 4 hours.
R
- Mathematical and Statistical Software (by Yosu Yurramendi, Master in Computational Engineering and Intelligent Systems, University of the Basque Country). 3 ECTS ~ 75 hours.
- The Data Scientist’s Toolbox (by Jeff Leek, Roger D. Peng and Brian Caffo from Johns Hopkins University at Coursera). ~ 4 hours.
- R Programming (by Roger D. Peng, Jeff Leek and Brian Caffo from Johns Hopkins University at Coursera). ~ 28 hours.
- Introduction to R (by Jonathan Cornelissen at DataCamp). ~ 4 hours.
- Intermediate R (by Filip Schouwenaars at DataCamp). ~ 6 hours.
- Intermediate R - Practice (by Filip Schouwenaars at DataCamp). Same sections as the previous course. ~ 4 hours.
- Writing Functions in R (by Hadley Wickham and Charlotte Wickham at DataCamp). ~ 4 hours.
- Writing Efficient R Code (by Colin Gillespie at DataCamp). ~ 4 hours.
- Object-Oriented Programming in R: S3 and R6 (by Richie Cotton at DataCamp). ~ 4 hours.
SQL
- Intro to SQL for Data Science (by Nick Carchedi at DataCamp). ~ 4 hours.
- Joining Data in PostgreSQL (by Chester Ismay at DataCamp). ~ 5 hours.
2. GENERAL COURSES IN DATA SCIENCE
Python
- Master thesis corresponding to the Master in Computational Engineering and Intelligent Systems program (by Javier Estraviz. Master in Computational Engineering and Intelligent Systems, University of the Basque Country). 18 ECTS ~ 450 hours. Expected: 2018.
- Intro to Data Science (by Dave Holtz and Cheng-Han Lee at Udacity). ~ 80 hours.
R
- 15.071x The Analytics Edge (by Dimitris Bertsimas from MITx at edX). ~ 120 hours.
- Data Science Capstone (by Jeff Leek, Roger D. Peng and Brian Caffo from Johns Hopkins University at Coursera). ~ 35 hours.
3. DATA ANALYSIS
Python
- Importing Data in Python (Part 1) (by Hugo Bowne-Anderson at DataCamp). ~ 3 hours.
- Importing Data in Python (Part 2) (by Hugo Bowne-Anderson at DataCamp). ~ 2 hours.
- Cleaning Data in Python (by Daniel Chen at DataCamp). ~ 4 hours.
- pandas Foundations (by Dhavide Aruliah at DataCamp). ~ 4 hours.
- Manipulating DataFrames with pandas (by Dhavide Aruliah at DataCamp). ~ 4 hours.
- Merging DataFrames with pandas (by Dhavide Aruliah at DataCamp). ~ 4 hours.
- Introduction to Databases in Python (by Jason Myers at DataCamp). ~ 4 hours.
- Become a Python Data Analyst (by Alvaro Fuentes at Safari). ~ 4 hours 30 minutes. My github repo.
- Intro to Data Analysis (by Caroline Buckey at Udacity). ~ 60 hours.
R
- Exploratory Data Analysis (by Iñaki Inza, Itziar Irigoien, Yosu Yurramendi, Javier Muguerza, Ibai Gurrutxaga, José Ignacio Martín, Olatz Arbelaitz, Txus Perez. Master in Computational Engineering and Intelligent Systems, University of the Basque Country). 6 ECTS ~ 150 hours.
- Getting and Cleaning Data (by Jeff Leek, Roger D. Peng and Brian Caffo from Johns Hopkins University at Coursera). ~ 16 hours.
- Exploratory Data Analysis (by Roger D. Peng, Jeff Leek and Brian Caffo from Johns Hopkins University at Coursera). ~ 16 hours.
- Data Analysis with R (at Udacity). ~ 80 hours.
- Importing Data in R (Part 1) (by Filip Schouwenaars at DataCamp). ~ 3 hours.
- Importing Data in R (Part 2) (by Filip Schouwenaars at DataCamp). ~ 3 hours.
- Cleaning Data in R (by Nick Carchedi at DataCamp). ~ 4 hours.
- Importing & Cleaning Data in R: Case Studies (by Nick Carchedi at DataCamp). ~ 4 hours.
- String Manipulation in R with stringr (by Charlotte Wickham at DataCamp). ~ 4 hours.
- Data Manipulation in R with dplyr (by Garrett Grolemund at DataCamp). ~ 4 hours.
- Joining Data in R with dplyr (by Garrett Grolemund at DataCamp). ~ 4 hours.
- Exploratory Data Analysis in R: Case Study (by David Robinson at DataCamp). ~ 4 hours.
- Data Analysis in R, the data.table Way (by Matt Dowle and Arun Srinivasan at DataCamp). ~ 4 hours.
Weka
- Data Mining and Big Data Analysis (by Itziar Irigoien, Javier Muguerza, Ibai Gurrutxaga, José Ignacio Martín, Olatz Arbelaitz, Txus Perez. Master in Computational Engineering and Intelligent Systems, University of the Basque Country). 3 ECTS ~ 75 hours.
4. MACHINE LEARNING
Octave/Matlab
- Machine Learning (by Andrew Ng from Stanford at Coursera). ~ 55 hours.
- Learning from Data (by Yaser Abu-Mostafa at Caltech). ~ 108 hours.
Python
- Unsupervised Learning in Python (by Benjamin Wilson at DataCamp). ~ 4 hours.
- Making Predictions with Data and Python (by Alvaro Fuentes at Safari). ~ 4 hours. My github repo.
- Neural Networks and Deep Learning (by Andrew Ng from deeplearning.ai at Coursera). ~ 12 hours.
- Improving Deep Neural Networks: Hyperparameter tuning, Regularization and Optimization (by Andrew Ng from deeplearning.ai at Coursera). ~ 9 hours.
- Structuring Machine Learning Projects (by Andrew Ng from deeplearning.ai at Coursera). ~ 6 hours.
- Deep Learning for Business (by Jong-Moon Chung from Yonsei University at Coursera). ~ 8 hours.
- Supervised Learning with scikit-learn (by Andreas Müller and Hugo Bowne-Anderson at DataCamp). ~ 4 hours.
- Machine Learning with the Experts: School Budgets (by Peter Bull at DataCamp). ~ 4 hours.
- Deep Learning in Python (by Dan Becker at DataCamp). ~ 4 hours.
- Introduction to Data Science in Python (by Christopher Brooks from University of Michigan at Coursera). ~ 40 hours.
- Applied Plotting, Charting & Data Representation in Python (by Christopher Brooks from University of Michigan at Coursera). ~ 40 hours.
- Applied Machine Learning in Python (by Kevyn Collins-Thompson from University of Michigan at Coursera). ~ 40 hours.
- Applied Text Mining in Python (by V.G. Vinod Vydiswaran from University of Michigan at Coursera). ~ 40 hours.
- Applied Social Network Analysis in Python (by Daniel Romero from University of Michigan at Coursera). ~ 40 hours.
- Machine Learning Foundations: A Case Study Approach (by Carlos Guestrin and Emily Fox from University of Washington at Coursera). ~ 30 hours.
- Machine Learning: Regression (by Emily Fox and Carlos Guestrin from University of Washington at Coursera). ~ 30 hours.
- Machine Learning: Classification (by Carlos Guestrin and Emily Fox from University of Washington at Coursera). ~ 35 hours.
- Machine Learning: Clustering & Retrieval (by Emily Fox and Carlos Guestrin from University of Washington at Coursera). ~ 30 hours.
- Convolutional Neural Networks (by Andrew Ng from deeplearning.ai at Coursera).
- Sequence Models (by Andrew Ng from deeplearning.ai at Coursera).
- Neural Networks for Machine Learning (by Geoffrey Hinton from University of Toronto at Coursera). ~ 112 hours.
- Probabilistic Graphical Models 1: Representation (by Daphne Koller from Stanford at Coursera). ~ 75 hours.
- Probabilistic Graphical Models 2: Inference (by Daphne Koller from Stanford at Coursera). ~ 75 hours.
- Probabilistic Graphical Models 3: Learning (by Daphne Koller from Stanford at Coursera). ~ 75 hours.
- Introduction to Machine Learning (by Katie Malone and Sebastian Thrun at Udacity). ~ 100 hours.
- Machine Learning (by Michael Littman, Charles Isbell and Pushkar Kolhe at Udacity). ~ 160 hours.
- Deep Learning (by Vincent Vanhoucke and Arpan Chakraborty at Udacity). ~ 120 hours.
- Practical Deep Learning For Coders, Part 1 (by Jeremy Howard from fast.ai). ~ 90 hours.
- Cutting Edge Deep Learning For Coders, Part 2 (by Jeremy Howard from fast.ai). ~ 90 hours.
- Creative Applications of Deep Learning with TensorFlow (by Parag Mital at Kadenze). ~ 30 hours.
- Neural Networks (by Hugo Larochelle from Université de Sherbrooke).
- Machine Learning (by Nando de Freitas at Oxford University).
R
- Practical Machine Learning (by Jeff Leek, Roger D. Peng and Brian Caffo from Johns Hopkins University at Coursera). ~ 16 hours.
- Machine Learning Toolbox (by Zachary Deane-Mayer and Max Kuhn at DataCamp). ~ 4 hours.
- Introduction to Machine Learning (by Vincent Vankrunkelsven and Gilles Inghelbrecht at DataCamp). ~ 6 hours.
- Unsupervised Learning in R (by Hank Roark at DataCamp). ~ 4 hours.
- Supervised Learning in R: Regression (by Nina Zumel and John Mount at DataCamp). ~ 4 hours.
5. TEXT MINING AND NLP
Python
- Natural Language Processing Fundamentals in Python (by Katharine Jarmul at DataCamp). ~ 4 hours.
R
- Text Mining: Bag of Words (by Ted Kwartler at DataCamp). ~ 4 hours.
6. DATA VISUALIZATION AND REPORTING
Python
- Introduction to Data Visualization with Python (by Bryan Van de Ven at DataCamp). ~ 4 hours.
- Interactive Data Visualization with bokeh (by Bryan Van de Ven at DataCamp). ~ 4 hours.
R
- Reproducible Research (by Roger D. Peng, Jeff Leek and Brian Caffo from Johns Hopkins University at Coursera). ~ 16 hours.
- Data Analysis and Visualization (by Arpan Chakraborty at Udacity). ~ 160 hours.
- Data Visualization in R (by Ronald Pearson at DataCamp). ~ 4 hours.
- Data Visualization in R with lattice (by Deepayan Sarkar at DataCamp). ~ 4 hours.
- Data Visualization with ggplot2 (Part 1) (by Rick Scavetta at DataCamp). ~ 5 hours.
- Data Visualization with ggplot2 (Part 2) (by Rick Scavetta at DataCamp). ~ 5 hours.
- Data Visualization with ggplot2 (Part 3) (by Rick Scavetta at DataCamp). ~ 6 hours.
- Reporting with R Markdown (by Garrett Grolemund at DataCamp). ~ 3 hours.
- Working with Geospatial Data in R (by Charlotte Wickham at DataCamp). ~ 4 hours.
JavaScript
- Data Visualization and D3.js (by Ryan Orban, Chris Saden and Jonathan Dinu at Udacity). ~ 70 hours.
7. PROBABILITY AND STATISTICS
Python
- Probabilistic Modeling and Bayesian Networks (by Borja Calvo and Aritz Pérez. Master in Computational Engineering and Intelligent Systems, University of the Basque Country). 3 ECTS ~ 75 hours.
- Statistical Thinking in Python (Part 1) (by Justin Bois at DataCamp). ~ 3 hours.
- Developing Data Products (by Brian Caffo, Jeff Leek and Roger D. Peng from Johns Hopkins University at Coursera). ~ 16 hours.
- Statistical Thinking in Python (Part 2) (by Justin Bois at DataCamp). ~ 4 hours.
- Case Studies in Statistical Thinking (by Justin Bois at DataCamp). ~ 4 hours.
- Network Analysis in Python (Part 1) (by Eric Ma at DataCamp). ~ 4 hours.
- Network Analysis in Python (Part 2) (by Eric Ma at DataCamp). ~ 4 hours.
R
- Statistical Inference (by Brian Caffo, Roger D. Peng and Jeff Leek from Johns Hopkins University at Coursera). ~ 28 hours.
- Regression Models (by Brian Caffo, Roger D. Peng and Jeff Leek from Johns Hopkins University at Coursera). ~ 16 hours.
- Statistical Learning (by Trevor Hastie and Robert Tibshirani from Stanford at Stanford Online). ~ 50 hours.
- Introduction to Data (by Mine Cetinkaya-Rundel at DataCamp). ~ 4 hours.
- Exploratory Data Analysis (by Andrew Bray at DataCamp). ~ 4 hours.
- Correlation and Regression (by Ben Baumer at DataCamp). ~ 4 hours.
- Multiple and Logistic Regression (by Ben Baumer at DataCamp). ~ 4 hours.
- Foundations of Inference (by Jo Hardin at DataCamp). ~ 4 hours.
- Foundations of Probability in R (by David Robinson at DataCamp). ~ 4 hours.
- Beginning Bayes in R (by Jim Albert at DataCamp). ~ 4 hours.
- Spatial Statistics in R (by Barry Rowlingson at DataCamp). ~ 4 hours.
- Sentiment Analysis in R: The Tidy Way (by Julia Silge at DataCamp). ~ 4 hours.
8. BIG DATA
General
- Introduction to Big Data (2015) (by Ilkay Altintas and Amarnath Gupta from UC San Diego at Coursera). ~ 15 hours.
Spark
- CS100.1x Introduction to Big Data with Apache Spark (by Anthony D. Joseph from BerkeleyX at edX). ~ 65 hours.
- CS190.1x Scalable Machine Learning (by Ameet Talwalkar from BerkeleyX at edX). ~ 20 hours.
- Introduction to Spark in R using sparklyr (by Richie Cotton at DataCamp). ~ 4 hours.
9. BOOKS
This is a selection of books on Data Science and related disciplines for which I have good references. The books are listed in descending order of publication date.
| Title | Author | Publisher | Release Date | Code |
|---|---|---|---|---|
| Deep Learning with Python | Francois Chollet | Manning | Jan 2018 (*) | GitHub |
| Python Tricks: The Book | Dan Bader | Ron Holland Designs | Oct 2017 | |
| Python for Data Analysis (2nd ed.) | Wes McKinney | O'Reilly | Oct 2017 | GitHub |
| Python Machine Learning (2nd ed.) | Sebastian Raschka, Vahid Mirjalili | Packt | Sep 2017 | GitHub |
| An Introduction to Statistical Learning (2nd ed.) | Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani | Springer | Sep 2017 | R code |
| Deep Learning | Josh Patterson, Adam Gibson | O'Reilly | Aug 2017 | |
| Fundamentals of Deep Learning | Nikhil Buduma | O'Reilly | Jun 2017 | GitHub |
| The Elements of Statistical Learning (2nd ed.) | Trevor Hastie, Robert Tibshirani, Jerome Friedman | Springer | May 2017 | Datasets |
| Practical Statistics for Data Scientists | Peter Bruce, Andrew Bruce | O'Reilly | May 2017 | GitHub |
| Hands-On Machine Learning with Scikit-Learn and TensorFlow | Aurélien Géron | O'Reilly | Apr 2017 | GitHub |
| Think Like a Data Scientist | Brian Godsey | Manning | Apr 2017 | |
| Deep Learning | Ian Goodfellow, Yoshua Bengio, Aaron Courville | MIT Press | Jan 2017 | |
| Efficient R Programming | Robin Lovelace, Colin Gillespie | O'Reilly | Dec 2016 | GitHub |
| Python Data Science Handbook | Jake VanderPlas | O'Reilly | Nov 2016 | GitHub |
| Introduction to Machine Learning with Python | Sarah Guido, Andreas C. Müller | O'Reilly | Oct 2016 | GitHub |
| Real-World Machine Learning | Henrik Brink, Joseph W. Richards, Mark Fetherolf | Manning | Sep 2016 | GitHub |
| Algorithms of the Intelligent Web (2nd ed.) | Douglas G. McIlwraith, Haralambos Marmanis, Dmitry Babenko | Manning | Aug 2016 | GitHub |
| R for Data Science | Garrett Grolemund, Hadley Wickham | O'Reilly | Jul 2016 | GitHub |
| Introducing Data Science | Davy Cielen, Arno D. B. Meysman, Mohamed Ali | Manning | May 2016 | Code 1, 2 |
| R Deep Learning Essentials | Joshua F. Wiley | Packt | Mar 2016 | GitHub |
| R in Action (2nd ed.) | Robert I. Kabacoff | Manning | May 2015 | GitHub |
| Data Science from Scratch | Joel Grus | O'Reilly | Apr 2015 | GitHub |
| Data Science at the Command Line | Jeroen Janssens | O'Reilly | Oct 2014 | GitHub |
| Learning scikit-learn: Machine Learning in Python | Raúl Garreta, Guillermo Moncecchi | Packt | Nov 2013 | GitHub |
(*) Expected publication date
10. OTHER COURSES IN COMPUTER SCIENCE
Software Design
- Domain-Driven Design Distilled (by Vaughn Vernon at Safari). ~ 4 hours.
- Microservices: The Big Picture (by Antonio Goncalves at Pluralsight). ~ 2 hours.
JavaScript
- Introduction to graphics engines: modeling, animation and graphic representation (by Joseba Makazaga and Aitor Soroa. Master in Computational Engineering and Intelligent Systems, University of the Basque Country). Library: three.js. 6 ECTS ~ 150 hours.
Python
- Heuristic Search (by José Antonio Lozano and Roberto Santana. Master in Computational Engineering and Intelligent Systems, University of the Basque Country). 3 ECTS ~ 75 hours.
- Python Epiphanies. Exploring Fundamental Concepts (by Stuart Williams at Safari). ~ 2.5 hours.
- Python: Design Patterns (by Jungwoo Ryoo at Lynda.com). ~ 2 hours.
- Enterprise Software with Python (by Mahmoud Hashemi at Safari). ~ 8 hours.
- Python: Getting Started (by Bo Milanovich at Pluralsight). ~ 3 hours.
- Python Fundamentals (by Austin Bingham and Robert Smallshire at Pluralsight). ~ 5 hours.
- Intermediate Python Programming (by Jessica McKellar at Safari). ~ 3 hours.
R
- Computation in Science and Engineering: numerical simulation (by Ander Murua. Master in Computational Engineering and Intelligent Systems, University of the Basque Country). 6 ECTS ~ 150 hours.
- Image and signal processing (by Mamen Hernández and Josune Gallego. Master in Computational Engineering and Intelligent Systems, University of the Basque Country). 4.5 ECTS ~ 112.5 hours.
- Cryptography (by Itziar Baragaña and Alicia Roca. Master in Computational Engineering and Intelligent Systems, University of the Basque Country). 4.5 ECTS ~ 112.5 hours.
Miscellaneous
- Methodology and research techniques (by Basilio Sierra. Master in Computational Engineering and Intelligent Systems, University of the Basque Country). 3 ECTS ~ 75 hours.
- Cracking the Data Science Interview (by Jonathan Dinu and Katie Kent at Safari). ~ 3 hours.
- Try Docker (by Jon Friskics at Codeschool.com). ~ 1 hour.
- Introduction to Git for Data Science (by Greg Wilson at DataCamp). ~ 4 hours.
- Introduction to Shell for Data Science Course (by Greg Wilson at DataCamp). ~ 4 hours.