  • Stars: 22,477
  • Rank 997 (Top 0.03%)
  • Language: Jupyter Notebook
  • Created over 2 years ago
  • Updated 16 days ago

Repository Details

Free Data Engineering course!

Data Engineering Zoomcamp

Syllabus

Taking the course

2023 Cohort

Self-paced mode

All course materials are freely available, so you can take the course at your own pace.

  • Follow the suggested syllabus (see below) week by week
  • You don't need to fill in the registration form. Just start watching the videos and join Slack
  • Check the FAQ if you run into problems
  • If you can't find a solution to your problem in the FAQ, ask for help in Slack

Asking for help in Slack

The best way to get support is to use DataTalks.Club's Slack. Join the #course-data-engineering channel.

To make discussions in Slack more organized:

Course UI

Alternatively, you can access this course through the provided UI app, which offers a user-friendly interface for navigating the course material.

dezoomcamp-ui

Syllabus

Note: NYC TLC changed the format of the data we use to Parquet, but you can still access the CSV files here.
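Because the data may come as either Parquet or CSV, a loading script can choose the reader from the file extension. The following is a minimal sketch, not part of the course materials: it assumes pandas is installed (with pyarrow or fastparquet for Parquet support), the helper names `detect_format` and `load_trips` are hypothetical, and the file names in the comments are purely illustrative.

```python
from pathlib import Path


def detect_format(path: str) -> str:
    """Return "parquet" or "csv" based on the file name's suffixes."""
    # Path.suffixes also covers compressed names such as "trips.csv.gz".
    suffixes = [s.lower() for s in Path(path).suffixes]
    if ".parquet" in suffixes:
        return "parquet"
    if ".csv" in suffixes:
        return "csv"
    raise ValueError(f"Unsupported file type: {path}")


def load_trips(path: str):
    """Load a trip-data file into a pandas DataFrame (hypothetical helper)."""
    # Imported here so that detect_format works without pandas installed.
    import pandas as pd

    if detect_format(path) == "parquet":
        return pd.read_parquet(path)
    # read_csv infers gzip compression from a ".gz" extension by default.
    return pd.read_csv(path)


# Example usage (illustrative file names):
#   df = load_trips("yellow_tripdata_2021-01.parquet")
#   df = load_trips("yellow_tripdata_2021-01.csv.gz")
```

Keeping the format check separate from the loading step makes it easy to test the dispatch logic without any data files on disk.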

Week 1: Introduction & Prerequisites

  • Course overview
  • Introduction to GCP
  • Docker and docker-compose
  • Running Postgres locally with Docker
  • Setting up infrastructure on GCP with Terraform
  • Preparing the environment for the course
  • Homework

More details

Week 2: Workflow Orchestration

  • Data Lake
  • Workflow orchestration
  • Introduction to Prefect
  • ETL with GCP & Prefect
  • Parametrizing workflows
  • Prefect Cloud and additional resources
  • Homework

More details

Week 3: Data Warehouse

  • Data Warehouse
  • BigQuery
  • Partitioning and clustering
  • BigQuery best practices
  • Internals of BigQuery
  • Integrating BigQuery with Airflow
  • BigQuery Machine Learning

More details

Week 4: Analytics engineering

  • Basics of analytics engineering
  • dbt (data build tool)
  • BigQuery and dbt
  • Postgres and dbt
  • dbt models
  • Testing and documenting
  • Deployment to the cloud and locally
  • Visualizing the data with Google Data Studio and Metabase

More details

Week 5: Batch processing

  • Batch processing
  • What is Spark
  • Spark Dataframes
  • Spark SQL
  • Internals: GroupBy and joins

More details

Week 6: Streaming

  • Introduction to Kafka
  • Schemas (Avro)
  • Kafka Streams
  • Kafka Connect and KSQL

More details

Week 7, 8 & 9: Project

Putting everything we learned into practice

  • Week 7 and 8: working on your project
  • Week 9: reviewing your peers

More details

Workshop: Maximizing Confidence in Your Data Model Changes with dbt and PipeRider

More details

Overview

Architecture diagram

Technologies

  • Google Cloud Platform (GCP): Cloud-based auto-scaling platform by Google
    • Google Cloud Storage (GCS): Data Lake
    • BigQuery: Data Warehouse
  • Terraform: Infrastructure-as-Code (IaC)
  • Docker: Containerization
  • SQL: Data Analysis & Exploration
  • Prefect: Workflow Orchestration
  • dbt: Data Transformation
  • Spark: Distributed Processing
  • Kafka: Streaming

Prerequisites

To get the most out of this course, you should feel comfortable with coding and the command line and know the basics of SQL. Prior experience with Python will be helpful, but you can pick up Python relatively quickly if you have experience with other programming languages.

Prior experience with data engineering is not required.

Instructors

Tools

For this course, you'll need to have the following software installed on your computer:

  • Docker and Docker-Compose
  • Python 3 (e.g. via Anaconda)
  • Google Cloud SDK
  • Terraform

See Week 1 for more details about installing these tools.
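To check quickly whether the tools listed above are on your PATH, something like the following could help. It is a rough sketch rather than an official course script. The executable names (`docker`, `docker-compose`, `python3`, `gcloud`, `terraform`) are the usual command names for these tools, but they are assumptions about your setup: newer Docker installations may ship Compose as the `docker compose` plugin, and on Windows Python is often available as `python`.

```python
import shutil

# Tool name -> executable expected on PATH (assumed names; adjust for your system).
REQUIRED_TOOLS = {
    "Docker": "docker",
    "Docker Compose": "docker-compose",
    "Python 3": "python3",
    "Google Cloud SDK": "gcloud",
    "Terraform": "terraform",
}


def find_missing(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools whose executable cannot be found on PATH."""
    return [name for name, exe in tools.items() if which(exe) is None]


if __name__ == "__main__":
    missing = find_missing()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out in tests. A tool reported as missing may still be installed under a different command name.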

Supporters and partners

Thanks to the course sponsors for making this course possible.

Do you want to support our course and our community? Please reach out to [email protected]

More Repositories

  1. mlops-zoomcamp (Jupyter Notebook, 8,805 stars): Free MLOps course from DataTalks.Club
  2. machine-learning-zoomcamp (Jupyter Notebook, 8,288 stars): Learn ML engineering for free in 4 months!
  3. project-of-the-week (336 stars): Learn by doing: DIY project groups at DataTalks.Club
  4. awesome-data-podcasts (322 stars): A list of awesome data podcasts
  5. datatalksclub.github.io (Python, 160 stars): The web page for DataTalks.Club
  6. nyc-tlc-data (133 stars): Backup for NYC TLC data for the DE Zoomcamp course
  7. data-paths (121 stars): Learning paths for data roles
  8. data-analytics-interviews (48 stars): Data analytics interview questions and answers
  9. course-management-platform (Python, 28 stars): Django-based course management platform for Zoomcamps
  10. kaggle-qa-challenge-starter (Jupyter Notebook, 27 stars): The getting started notebook for the DTC Zoomcamp Q&A challenge
  11. zoomcamp-analytics (Jupyter Notebook, 24 stars): Public data and analytics for our open course
  12. mlzoomcamp.com (24 stars): The page for the ML Zoomcamp course
  13. reading-club-nlp (14 stars): Notes from our NLP reading club!
  14. kitchenware-competition-starter (Jupyter Notebook, 13 stars): A starter notebook for the Kitchenware classification competition on Kaggle
  15. whylogs-workshop (Jupyter Notebook, 13 stars): The code from the whylogs workshop in DataTalks.Club on 29 March 2022
  16. reading-club-books (11 stars)
  17. stock-markets-analytics-zoomcamp (7 stars): Course Materials for Analytics in Stock Markets Zoomcamp
  18. website-django (Jupyter Notebook, 3 stars): The DTC website in Django
  19. course-management-platform-old (Python, 3 stars): A platform for hosting our courses
  20. llm-zoomcamp (2 stars): LLM Zoomcamp - a free online course about building an AI bot that can answer questions about your knowledge base
  21. fashion (Python, 1 star)