  • Stars: 24,859
  • Rank 862 (Top 0.02 %)
  • Language: Jupyter Notebook
  • Created about 3 years ago
  • Updated 3 months ago

Repository Details

Free Data Engineering course!

Data Engineering Zoomcamp

Syllabus

Taking the course

2023 Cohort

Self-paced mode

All the course materials are freely available, so you can take the course at your own pace.

  • Follow the suggested syllabus (see below) week by week
  • You don't need to fill in the registration form. Just start watching the videos and join Slack
  • Check the FAQ if you have problems
  • If you can't find a solution to your problem in the FAQ, ask for help in Slack

Asking for help in Slack

The best way to get support is to use DataTalks.Club's Slack. Join the #course-data-engineering channel.

To make discussions in Slack more organized:

Course UI

Alternatively, you can access this course through the provided UI app, which offers a user-friendly interface for navigating the course material.

dezoomcamp-ui

Syllabus

Note: NYC TLC changed the format of the data we use to Parquet, but you can still access the CSV files here.
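
Because the same trip data may reach you in either format, it can help to load it through one small function. The sketch below is only an illustration: `detect_format` and `load_trips` are hypothetical helper names, the file names in the usage note are examples, and it assumes pandas is installed (plus pyarrow or fastparquet for the Parquet case).

```python
from pathlib import Path


def detect_format(path: str) -> str:
    """Return 'parquet' or 'csv' based on the file name (hypothetical helper)."""
    suffixes = [s.lower() for s in Path(path).suffixes]
    if ".parquet" in suffixes:
        return "parquet"
    if ".csv" in suffixes:  # also matches compressed files such as .csv.gz
        return "csv"
    raise ValueError(f"Unsupported file type: {path}")


def load_trips(path: str):
    """Load a trip file into a pandas DataFrame, whichever format it is in."""
    # Imported lazily so that detect_format works without pandas installed.
    # Assumption: pandas is available; read_parquet also needs pyarrow or fastparquet.
    import pandas as pd

    if detect_format(path) == "parquet":
        return pd.read_parquet(path)
    return pd.read_csv(path)
```

For example, `load_trips("yellow_tripdata_2021-01.parquet")` would go through `pd.read_parquet`, while a file ending in `.csv.gz` would go through `pd.read_csv`, which infers the compression from the extension by default.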

Week 1: Introduction & Prerequisites

  • Course overview
  • Introduction to GCP
  • Docker and docker-compose
  • Running Postgres locally with Docker
  • Setting up infrastructure on GCP with Terraform
  • Preparing the environment for the course
  • Homework

More details

Week 2: Workflow Orchestration

  • Data Lake
  • Workflow orchestration
  • Introduction to Prefect
  • ETL with GCP & Prefect
  • Parametrizing workflows
  • Prefect Cloud and additional resources
  • Homework

More details

Week 3: Data Warehouse

  • Data Warehouse
  • BigQuery
  • Partitioning and clustering
  • BigQuery best practices
  • Internals of BigQuery
  • Integrating BigQuery with Airflow
  • BigQuery Machine Learning

More details
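
To give a flavor of the partitioning and clustering topic, here is a minimal sketch that builds a BigQuery DDL statement for a date-partitioned, clustered table. The function `partitioned_table_ddl` and the table and column names in the usage note are hypothetical and are not taken from the course materials. The function only produces a string; running it against BigQuery is left out.

```python
def partitioned_table_ddl(target, source, partition_col, cluster_cols):
    """Build DDL for a table partitioned by date and clustered by columns.

    All arguments are illustrative. Values are interpolated directly into
    the statement, so only pass identifiers you trust.
    """
    if not cluster_cols:
        raise ValueError("expected at least one clustering column")
    if len(cluster_cols) > 4:
        # BigQuery allows at most four clustering columns per table.
        raise ValueError("BigQuery supports at most four clustering columns")
    return (
        f"CREATE OR REPLACE TABLE `{target}`\n"
        f"PARTITION BY DATE({partition_col})\n"
        f"CLUSTER BY {', '.join(cluster_cols)} AS\n"
        f"SELECT * FROM `{source}`;"
    )
```

Calling `partitioned_table_ddl("nytaxi.yellow_partitioned", "nytaxi.yellow_external", "tpep_pickup_datetime", ["VendorID"])` yields a statement that partitions rows by pickup date and clusters them by vendor inside each partition, so queries filtering on those columns can scan less data.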

Week 4: Analytics engineering

  • Basics of analytics engineering
  • dbt (data build tool)
  • BigQuery and dbt
  • Postgres and dbt
  • dbt models
  • Testing and documenting
  • Deployment to the cloud and locally
  • Visualizing the data with Google Data Studio and Metabase

More details

Week 5: Batch processing

  • Batch processing
  • What is Spark
  • Spark Dataframes
  • Spark SQL
  • Internals: GroupBy and joins

More details
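
As a rough intuition for the GroupBy internals topic, a distributed group-by can be pictured as two stages: each partition first aggregates its own rows, then the partial results are reshuffled by key and merged. The pure-Python sketch below is a conceptual model only, not Spark's implementation; all function names and the sample data are made up for illustration.

```python
from collections import defaultdict


def partial_aggregate(partition):
    """Stage 1: sum amounts per key inside a single partition."""
    totals = defaultdict(float)
    for key, amount in partition:
        totals[key] += amount
    return dict(totals)


def shuffle(partials, num_reducers):
    """Route every partial result to a reducer chosen by hashing its key."""
    buckets = [[] for _ in range(num_reducers)]
    for partial in partials:
        for key, subtotal in partial.items():
            buckets[hash(key) % num_reducers].append((key, subtotal))
    return buckets


def group_by_sum(partitions, num_reducers=2):
    """Run both stages and collect the per-key totals into one dict."""
    partials = [partial_aggregate(p) for p in partitions]
    result = {}
    for bucket in shuffle(partials, num_reducers):
        # Stage 2: merge the partial sums that landed on this reducer.
        result.update(partial_aggregate(bucket))
    return result


partitions = [
    [("green", 10.0), ("yellow", 5.0), ("green", 2.5)],
    [("yellow", 1.0), ("green", 4.0)],
]
totals = group_by_sum(partitions)
```

Because all partial sums for a given key are hashed to the same reducer, each key ends up with exactly one final total. Aggregating inside each partition first also means less data has to move during the shuffle.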

Week 6: Streaming

  • Introduction to Kafka
  • Schemas (Avro)
  • Kafka Streams
  • Kafka Connect and KSQL

More details

Week 7, 8 & 9: Project

Putting everything we learned into practice

  • Week 7 and 8: working on your project
  • Week 9: reviewing your peers

More details

Workshop: Maximizing Confidence in Your Data Model Changes with dbt and PipeRider

More details

Overview

Architecture diagram

Technologies

  • Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
    • Google Cloud Storage (GCS): Data Lake
    • BigQuery: Data Warehouse
  • Terraform: Infrastructure-as-Code (IaC)
  • Docker: Containerization
  • SQL: Data Analysis & Exploration
  • Prefect: Workflow Orchestration
  • dbt: Data Transformation
  • Spark: Distributed Processing
  • Kafka: Streaming

Prerequisites

To get the most out of this course, you should feel comfortable with coding and the command line and know the basics of SQL. Prior experience with Python will be helpful, but you can pick up Python relatively quickly if you have experience with other programming languages.

Prior experience with data engineering is not required.

Instructors

Tools

For this course, you'll need to have the following software installed on your computer:

  • Docker and Docker-Compose
  • Python 3 (e.g. via Anaconda)
  • Google Cloud SDK
  • Terraform

See Week 1 for more details about installing these tools.

Supporters and partners

Thanks to the course sponsors for making it possible to create this course.

Do you want to support our course and our community? Please reach out to [email protected]
