  • Language: Jupyter Notebook
  • License: MIT License
  • Created over 2 years ago
  • Updated about 2 years ago

MLSys-NYU-2022

Slides, scripts and materials for the Machine Learning in Finance course at NYU Tandon, 2022.

Overview

DISCLAIMER: A significant part of our course is class participation (this is why, in the end, we have universities and not just books and repos!), and no amount of scripts can provide the same level of educational content or a comparable experience. Please note that this course changes substantially every year, so the best way to keep up to date with us is... by enrolling in the Master's program!

TL;DR: This repository contains some of the teaching materials by Prof. Ethan Rosenthal and myself for the 2022 course in ML at the NYU Tandon School of Engineering. The course is presented as an introduction to Machine Learning with Finance use cases and industry-standard tools.

We open source slides, code snippets and assignments after the class is completed, hoping to benefit the broader community of Machine Learning students and practitioners. I had a calculus book that said, "What one fool can do, another can", and I wish more and more fools could become proficient at building reliable, trustworthy, well-crafted ML systems.

We feel there are now enough books and YouTube videos for people interested purely in the theory of ML; moreover, practitioners produce a much bigger marginal value when bringing into the class their day-to-day experience, which, for the time being, cannot be as easily found on YouTube.

Therefore, the course we run is very practical and focuses on the intuitive understanding of ML problems and their solutions through real-world tools: we emphasize the importance of good coding habits and the use of industry-standard methodology over complex modelling and formulas (alas, we do indeed sometimes need to talk about math).

The whole course runs in 14 weeks, but we cover topics that would keep you busy for a lifetime: every lecture, every slide, every code snippet is the result of many explicit and implicit trade-offs - what should we cover, what should we not? While no material can substitute for real-world interactions and our great sense of humour, we leave it to the open source community to judge how useful the trade-offs we picked actually are.

At a glance

Main themes

The course is structured around 14 weeks: 13 weeks of lectures, and 1 final demo day for students (organized in teams) to present an end-to-end machine learning project that showcases what they learned in the course. Main topics, roughly in order of appearance:

  • Introduction to ML in Finance: use cases, tools, the rise of MLOps.
  • Python setup for scientific computing: notebooks, environments, dependencies.
  • ML best practices: dataset split, hyper-parameter tuning.
  • Modelling: classification, regression.
  • Use case deep dive I: fraud detection.
  • MLOps best practices: experiment tracking, DAG-based pipelines, deployment.
  • Rounded evaluation: slice-based metrics, behavioral testing.
  • Introduction to embeddings: skip-gram, similarity in a latent space.
  • Use case deep dive II: recommender systems.
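To make one of the topics above concrete, here is a minimal sketch of slice-based evaluation: the same metric is computed overall and per data slice, so a weak segment is not hidden by a good average. The records, the `channel` field and the helper name `accuracy_by_slice` are made up for illustration and are not taken from the course materials.

```python
from collections import defaultdict


def accuracy_by_slice(records, slice_key):
    """Return (overall accuracy, {slice value: accuracy}) for labelled predictions."""
    if not records:
        raise ValueError("records must not be empty")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for record in records:
        slice_value = record[slice_key]
        totals[slice_value] += 1
        hits[slice_value] += int(record["label"] == record["prediction"])
    overall = sum(hits.values()) / sum(totals.values())
    per_slice = {value: hits[value] / totals[value] for value in totals}
    return overall, per_slice


# Toy, made-up fraud predictions (1 = fraud), sliced by payment channel.
records = [
    {"channel": "web", "label": 0, "prediction": 0},
    {"channel": "web", "label": 1, "prediction": 1},
    {"channel": "web", "label": 0, "prediction": 0},
    {"channel": "web", "label": 0, "prediction": 0},
    {"channel": "mobile", "label": 1, "prediction": 0},
    {"channel": "mobile", "label": 0, "prediction": 0},
]

overall, per_slice = accuracy_by_slice(records, "channel")
print(f"overall: {overall:.2f}")
for name, accuracy in sorted(per_slice.items()):
    print(f"{name}: {accuracy:.2f}")
```

In this toy data the overall accuracy looks acceptable while the `mobile` slice is no better than a coin flip, which is the kind of gap a single aggregate number hides.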

Repo structure

This year's repository is structured by week: each week has its folder, with a self-contained README, scripts / notebooks and slides. The choice results in some redundancy (especially in the second part, where the same training loop is used several times), but provides more clarity for students pacing themselves through the course, and highlights the highly modular nature of the syllabus.

As part of the course, we emphasize the importance of virtual environments and submitting properly structured projects (notebooks are great, but we leave them for experimentation!). Each week contains a dedicated requirements.txt file to make sure the scripts are reproducible.
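As a rough illustration of why pinned requirements help reproducibility, the sketch below compares `name==version` pins against what is installed in the active environment. It is an assumption-laden simplification, not part of the course code: the helper names and the sample pins are hypothetical, and extras, environment markers and `-r` includes are not handled.

```python
from importlib import metadata


def parse_pins(text):
    """Parse 'name==version' lines; skip blanks, comments and unpinned lines."""
    pins = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop inline comments
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip().lower()] = version.strip()
    return pins


def check_pins(pins):
    """Return {package: (pinned, installed or None)} for every mismatch."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches


# Hypothetical file content, for illustration only.
sample = """
# pins for one week of the course (made up)
scikit-learn==1.1.2
pandas==1.4.3  # inline comment
metaflow
"""

pins = parse_pins(sample)
print(pins)
print(check_pins(pins))  # result depends on the active environment
```

Running a check like this inside a fresh virtual environment, right after installing from the week's requirements.txt, should report no mismatches for the pinned packages.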

Changelog

Compared to the 2021 edition, most of the material is new or revamped. As a non-exhaustive list:

  • new intro to Python / good coding practices;
  • fraud detection as a finance-specific ML use case;
  • NLP section has been replaced by a RecSys deep dive;
  • new section on the importance of metrics, and expanded discussion on evaluating models;
  • new tools: Metaflow sandbox and Streamlit apps.

Tooling overview

Intro to Python and Git

Python is the main language for Machine Learning, but it is surprisingly hard to set up a working environment. We introduce virtualenv and the basics of git to get you started.

Metaflow

Metaflow is an open-source tool designed to simplify building, maintaining and deploying ML pipelines. Starting this year, Outerbounds has provided us with free sandbox accounts (thank you!).

Streamlit

Streamlit turns Python scripts into web apps in minutes, helping with prototyping and sharing the results of our pipelines. Streamlit apps can be used to display artifacts from Metaflow, and make the model interactive for non-technical stakeholders.

Comet

Comet is a machine learning platform that can help you manage, visualize, and optimize training runs. We use Comet to keep track of our experiments, and to share our progress with the rest of our team and technical stakeholders (thank you for the free account!).

Flask

Flask is a micro web framework written in Python. It allows us to build (in Python) APIs that serve predictions made by our trained model in real-time, and display the results in the browser.
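A minimal sketch of what such a prediction API could look like is shown below. Everything specific in it is an assumption made for illustration: the `/predict` route, the feature names and the fixed linear scorer standing in for a trained model are hypothetical and do not come from the course code.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-in for a trained model: a fixed linear scorer.
WEIGHTS = {"amount": 0.002, "n_prior_chargebacks": 0.3}


def predict_score(features):
    """Return a score in [0, 1]; features missing from the payload count as 0."""
    raw = sum(weight * float(features.get(name, 0.0)) for name, weight in WEIGHTS.items())
    return max(0.0, min(1.0, raw))


@app.route("/predict", methods=["POST"])
def predict():
    # silent=True returns None instead of raising on a missing or invalid JSON body.
    payload = request.get_json(silent=True)
    if not isinstance(payload, dict):
        return jsonify({"error": "expected a JSON object"}), 400
    return jsonify({"score": predict_score(payload)})
```

The endpoint can be exercised without starting a server through Flask's test client, for example `app.test_client().post("/predict", json={"amount": 100, "n_prior_chargebacks": 1})`. In a real project the scorer would be replaced by a model loaded once at startup, so that each request only pays for inference.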

Acknowledgments

Thanks to all the outstanding people quoted and linked in the slides: this course is possible only because we truly stand on the shoulders of giants. Special thanks also to:

  • Hugo for being our fantastic guest speaker on Metaflow;
  • Chip for being our fantastic guest speaker on MLOps;
  • Ciro for being our fantastic guest speaker on industry applications of ML and ML careers;
  • Gideon for fantastic support and free Comet accounts for all the students.

Suggested complementary / additional readings

The main topics - Regression, Classification, Time Series, Fraud Detection, MLOps, RecSys etc. - are all huge, and we could only scratch the surface. Aside from all the references to be found in the slides and READMEs, these are a few good places to further explore this world.

Machine Learning

MLOps

Contacts

For questions, feedback, comments, please drop me a message at: jacopo dot tagliabue at nyu.edu.
