• Stars: 135
• Rank: 269,297 (Top 6 %)
• Language: HTML
• Created almost 8 years ago
• Updated over 7 years ago

Repository Details

Materials for STATS 418 - Tools in Data Science course taught in the Master of Applied Statistics at UCLA

UCLA Master of Applied Statistics (MAS)

STATS 418 - Tools in Data Science

Course designer and main instructor: Szilárd Pafka

Course short description:

Tools for data acquisition, transformation and analysis, data visualization, machine learning and tools for reproducible data analysis, collaboration and model deployment used by data scientists in practice. Advanced R packages, analytical databases, high-performance machine learning libraries, big data tools.

Prerequisites:

404 – Statistical Computing and Programming
405 – Data Management

Course objectives:

Despite the new name and the recent hype, Data Science is hardly new; it has solid foundations in statistics and computing technology that go back several decades. The 405 – Data Management course lays down the foundations necessary for most data science tasks (data acquisition, exploratory data analysis, data cleaning and transformation, databases, data visualization and data mining), while the 404 – Statistical Computing and Programming course prepares students to use R, the most widely used tool for data science. This course builds on that knowledge: it discusses more advanced topics and presents more of the tools (R packages, scalable machine learning libraries, high-performance analytical databases, big data technologies) that data scientists use in daily practice. It also discusses software engineering tools and techniques that matter when working on data science projects or when the results of those projects (models, data visualization dashboards etc.) are deployed in production.

The instructor is a strong proponent of a good balance of theory and practice and, in the context of data science, of a good balance of statistics and computing technology. Because this course is meant to complement the other existing courses in the MAS program, it may appear overly tilted towards software systems. It is therefore important to restate that good statistical and theoretical foundations (e.g. cognitive science for data visualization, or a good understanding of machine learning algorithms) are also crucial when conducting data science in practice.

Syllabus and schedule:

Week 1 [4/5]: Overview of data science. The elements of a data science project. Overview of tools (R/Python, databases, machine learning libraries, big data tools, workflow/reproducibility etc.)

Week 2 [4/12]: The Unix toolbox for manipulating files/text and automating tasks. Cloud computing for scaling up data science.

Week 3 [4/19]: Tools for reproducible research/productive data analysis and collaboration (R Markdown, Jupyter notebooks, Git/GitHub).

Week 4 [4/26]: Tools for data visualization: ggplot2, shiny (interactive web applications with R) / shiny dashboards

Week 5 [5/3]: Foundations for supervised learning (classification/regression): basic algorithms, overfitting, train and test sets, cross-validation, bias-variance tradeoff, regularization, ROC curve for binary classification (various R packages)

Week 6 [5/10]: Tools for supervised learning 1 (GLM, Lasso, random forest, gradient boosted machines) (R packages, Vowpal Wabbit, xgboost, H2O)

Week 7 [5/17]: Analytical databases (columnar/MPP relational databases), SQL. NoSQL databases (key-value stores, document databases). “Big data” technologies (Hadoop, HDFS, MapReduce, Hive, Impala, Spark, EMR etc.)

Week 8 [5/24]: Tools for supervised learning 2 (support vector machines, neural networks, deep learning, ensembles) (R packages, H2O, libraries for deep learning on GPUs)

Week 9 [5/31]: Tools for unsupervised learning (K-means clustering, hierarchical clustering) (R packages)

Week 10 [6/7]: Course discussions/Q&A session. Summary of the course, conclusions, final thoughts etc.
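
To make one of the Week 5 topics concrete, here is a minimal sketch of how k-fold cross-validation partitions a dataset into train and test sets. It is an illustration only, not course material: the course itself works with R packages, Python is used here just for a dependency-free example, and the function name `k_fold_indices` and its parameters are hypothetical.

```python
import random


def k_fold_indices(n_rows, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration: rows are shuffled once, dealt
    into k folds, and each fold serves as the test set exactly once.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the split reproducible
    folds = [indices[i::k] for i in range(k)]  # deal rows round-robin into k folds
    for i in range(k):
        test_idx = folds[i]
        # Training rows are all rows from the other k - 1 folds.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


if __name__ == "__main__":
    for train_idx, test_idx in k_fold_indices(n_rows=10, k=5):
        print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

A model would be fit on `train_idx` and scored on `test_idx` in each round, and the k scores averaged to estimate out-of-sample performance.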

Instructors/guest speakers:

Szilárd Pafka
Eduardo Ariño de la Rubia
Yasmin Lucero

TA:
Medha Uppala

Announcements and Q&A:

Class announcements and student Q&A will be done via GitHub issues.

Grading:

Class Participation 10%
Homework (4 assignments) 60%
Final Exam 30%
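
As a rough illustration of how these weights combine, the sketch below computes a weighted course grade. The weights come from the list above; everything else is an assumption made for the example: the function name `course_grade` is hypothetical, all scores are taken to be percentages on a 0-100 scale, and the four homework assignments are assumed to count equally, which the syllabus does not state.

```python
# Weights as listed in the grading section (fractions of the final grade).
WEIGHTS = {"participation": 0.10, "homework": 0.60, "final_exam": 0.30}


def course_grade(participation, homework_scores, final_exam):
    """Return the weighted course grade on a 0-100 scale.

    Assumptions (not from the syllabus): every input is a percentage
    between 0 and 100, and the homework assignments are weighted equally.
    """
    homework_avg = sum(homework_scores) / len(homework_scores)
    return (
        WEIGHTS["participation"] * participation
        + WEIGHTS["homework"] * homework_avg
        + WEIGHTS["final_exam"] * final_exam
    )


if __name__ == "__main__":
    # Example: full participation, four homework scores, and a final exam score.
    print(course_grade(100, [80, 90, 70, 100], 90))
```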

Sample exam questions here

Supplemental reading:

(Click on link to open/download paper/free book PDF)

Papers:

David Donoho: 50 years of Data Science

Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer: Enterprise Data Analysis and Visualization: An Interview Study

Rexer Analytics: 2015 Data Science Survey (Summary Report)

Leo Breiman: Statistical Modeling: The Two Cultures

Rich Caruana, Alexandru Niculescu-Mizil: An Empirical Comparison of Supervised Learning Algorithms

Books:

John M. Chambers: Software for Data Analysis: Programming with R, Springer, 2008

Hadley Wickham: Advanced R, Chapman & Hall/CRC, 2015

W.N. Venables, B.D. Ripley: Modern Applied Statistics with S, Springer, 4th ed., 2002

William S. Cleveland: The Elements of Graphing Data, Hobart Press, 1994

William S. Cleveland: Visualizing Data, Hobart Press, 1993

Edward R. Tufte: The Visual Display of Quantitative Information, Graphics Press, 2nd ed., 2001

Stephen Few: Show Me the Numbers: Designing Tables and Graphs to Enlighten, Analytics Press, 2nd ed., 2012

Hadley Wickham: ggplot2: Elegant Graphics for Data Analysis, Springer, 2009

Dorian Pyle: Data Preparation for Data Mining, Morgan Kaufmann, 1999

Jiawei Han, Micheline Kamber: Data Mining: Concepts and Techniques, Morgan Kaufmann, 2nd ed., 2005

Gareth James, Daniela Witten, Trevor Hastie, Robert Tibshirani: An Introduction to Statistical Learning with Applications in R, Springer, 2013

Trevor Hastie, Robert Tibshirani, Jerome Friedman: The Elements of Statistical Learning, Springer, 2nd ed., 2009

Eric Redmond and Jim R. Wilson: Seven Databases in Seven Weeks: A Guide to Modern Databases and the NoSQL Movement, The Pragmatic Bookshelf, 2012

More Repositories

1. benchm-ml (R, 1,871 stars): A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).

2. GBM-perf (HTML, 215 stars): Performance of various open source GBM implementations

3. benchm-databases (R, 90 stars): A minimal benchmark of various tools (statistical software, databases etc.) for working with tabular data of moderately large sizes (interactive data analysis).

4. ml-prod (72 stars): Some thoughts on how to use machine learning in production

5. benchm-dl (Python, 69 stars): Playing with various deep learning tools and network architectures

6. survey-ml-tools (51 stars): Quick informal survey at the Los Angeles Machine learning meetup about tools used for machine learning.

7. teach-data-science-msc-analytics-ceu (HTML, 33 stars): Materials for a short introductory/intermediate Data Science course taught in the MSc in Business Analytics program at the Central European University

8. xgboost-adv-workshop-LA (R, 26 stars): Advanced workshop on XGBoost with Tianqi Chen in Santa Monica, June 2, 2016

9. ML-scoring (R, 21 stars): Compare the scoring speed of several open source machine learning libraries.

10. teach-ML-CEU-master-bizanalytics (HTML, 21 stars): Machine Learning #1 and #2 courses at CEU Master of Science in Business Analytics

11. GBM-tune (HTML, 21 stars): Tuning GBMs (hyperparameter tuning) and impact on out-of-sample predictions

12. GBM-multicore (HTML, 20 stars): GBM multicore scaling: h2o, xgboost and lightgbm on multicore and multi-socket systems

13. datascience-latency (20 stars): Latency numbers every data scientist should know (aka the pyramid of analytical tasks) - the order of magnitude of computational time for the most common analytical tasks (SQL-like data munging, linear and non-linear supervised learning etc.) with the typically available tools on commodity hardware.

14. GBM-intro (HTML, 17 stars): GBM intro talk (with R and Python code)

15. dataset-sizes-kdnuggets (HTML, 16 stars): Size of datasets used for analytics based on 10 years of surveys by KDnuggets.

16. talks-main (15 stars): Most recent/important talks given at conferences/meetups

17. GBM-adv-workshop-Bp19 (HTML, 12 stars): Advanced GBM Workshop - Budapest, Nov 2019

18. kaggle-scripts-R-pydata (R, 11 stars): Kaggle scripts: R vs pydata + most popular R and Python packages for Machine Learning

19. awesome-GBMs (10 stars): A curated list of gradient boosting machines (GBM) resources

20. benchm-dplyr-dt (10 stars)

21. datascience-course-historical (9 stars): Inspired by David Donoho's "50 Years of Data Science" (2015) paper, I'm releasing here a course proposal draft I wrote in 2009 for a possible course of "data science".

22. dscomp-winstab (R, 8 stars): Winner stability in data science competitions

23. ml-algos-perf (Python, 7 stars): Performance of Machine Learning Algorithms - playground for experimentation in order to understand their performance characteristics as a function of the attributes of the datasets used for training

24. GBM-workshop (HTML, 6 stars): Code (and other materials) for an introductory talk/workshop on GBMs (developed originally for an R-Ladies Meetup)

25. DS_meetups (5 stars): Contents from the Real Data Science USA (formerly LA Data Science) Meetup

26. h2o-scoring--OLD (Java, 5 stars): Various options for deploying h2o.ai models to production (scoring new data)

27. datascience-1slide (4 stars): Data Science in 1 Slide

28. ml-x1 (HTML, 4 stars): Machine learning tools on monster EC2 X1 instance (128 cores, 2 TB RAM)

29. aboutme (HTML, 4 stars)

30. GBM-meltdown (R, 3 stars): The Effect of the Linux Kernel Page-Table Isolation (KPTI) Patch (Meltdown Vulnerability) on GBMs

31. benchm-ml-talks (3 stars)

32. bio (2 stars): Szilard Pafka's short bio (to go with conference talk abstracts)

33. benchm-R-mysql (R, 2 stars)

34. shinyvalidinp (R, 2 stars)

35. MLprod-1slide (1 star): Machine Learning in Production in 1 Slide

36. LA-data-meetups (1 star)

37. BigDataDayLA2015-DataScience (1 star): List of talks from the Data Science Track of Big Data Day LA 2015 (annual free conference)