• Stars
    266
  • Rank 154,064 (Top 4 %)
  • Language
    Jupyter Notebook
  • License
    MIT License
  • Created over 5 years ago
  • Updated about 5 years ago

Repository Details

Projects done in the Data Engineering Nanodegree by Udacity.com

Data-engineering-nanodegree

Course 1: Data Modeling

Introduction to Data Modeling

➔ Understand the purpose of data modeling

➔ Identify the strengths and weaknesses of different types of databases and data storage techniques

➔ Create a table in Postgres and Apache Cassandra

Relational Data Models

➔ Understand when to use a relational database

➔ Understand the difference between OLAP and OLTP databases

➔ Create normalized data tables

➔ Implement denormalized schemas (e.g. star, snowflake)
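
As a rough illustration of the star schema mentioned above, the sketch below builds one fact table and two dimension tables and runs an aggregate query across them. It uses Python's built-in `sqlite3` module as a stand-in for Postgres so that it runs without a server; the table and column names (`fact_play`, `dim_user`, `dim_song`) are hypothetical and not taken from the course projects.

```python
import sqlite3

# In-memory SQLite stands in for Postgres; all names here are hypothetical.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Dimension tables hold descriptive attributes.
cur.execute("CREATE TABLE dim_user (user_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE dim_song (song_id INTEGER PRIMARY KEY, title TEXT)")

# The fact table holds measurements plus foreign keys to the dimensions.
cur.execute("""
    CREATE TABLE fact_play (
        play_id    INTEGER PRIMARY KEY,
        user_id    INTEGER REFERENCES dim_user(user_id),
        song_id    INTEGER REFERENCES dim_song(song_id),
        duration_s INTEGER
    )
""")

cur.executemany("INSERT INTO dim_user VALUES (?, ?)", [(1, "Ada"), (2, "Lin")])
cur.executemany("INSERT INTO dim_song VALUES (?, ?)", [(10, "Intro"), (11, "Outro")])
cur.executemany(
    "INSERT INTO fact_play VALUES (?, ?, ?, ?)",
    [(100, 1, 10, 30), (101, 1, 11, 45), (102, 2, 10, 20)],
)

# A typical analytical query: join the fact table to one dimension and aggregate.
rows = cur.execute("""
    SELECT u.name, SUM(f.duration_s)
    FROM fact_play AS f
    JOIN dim_user AS u ON u.user_id = f.user_id
    GROUP BY u.name
    ORDER BY u.name
""").fetchall()
print(rows)
conn.close()
```

The same DDL would need small dialect changes for Postgres (for example `SERIAL` or identity columns for generated keys).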

NoSQL Data Models

➔ Understand when to use NoSQL databases and how they differ from relational databases

➔ Select the appropriate primary key and clustering columns for a given use case

➔ Create a NoSQL database in Apache Cassandra
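
To make the roles of the partition key and clustering columns more concrete, here is a small pure-Python model, not Cassandra's API. It assumes a hypothetical table whose partition key is `session_id` and whose clustering column is `item_in_session`: rows with the same partition key are stored together, and within a partition they are kept sorted by the clustering column.

```python
from bisect import insort
from collections import defaultdict

# Hypothetical model: partition key -> rows kept sorted by the clustering column.
table = defaultdict(list)

def insert_row(session_id, item_in_session, song):
    """Store a row in its partition, keeping clustering order."""
    insort(table[session_id], (item_in_session, song))

def rows_for_partition(session_id):
    """Read a whole partition; rows come back ordered by item_in_session."""
    return list(table[session_id])

insert_row(338, 4, "Song B")
insert_row(338, 1, "Song A")
insert_row(12, 2, "Song C")

print(rows_for_partition(338))
```

The point of the model is that a query is cheap when it names the partition key, which is why the key is usually chosen from the queries the table must serve.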

Project 1: Data Modeling with Postgres and Apache Cassandra

Course 2: Cloud Data Warehouses

Introduction to Data Warehouses

➔ Understand Data Warehousing architecture

➔ Run an ETL process to denormalize a database (3NF to Star)

➔ Create an OLAP cube from facts and dimensions

➔ Compare columnar vs. row-oriented approaches
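
The columnar versus row-oriented comparison can be sketched with plain Python data structures. This is only an illustration of the two layouts, with made-up sample records; real engines add compression, encoding, and on-disk formats that are not modeled here.

```python
# Row-oriented layout: each record is stored together.
row_store = [
    {"order_id": 1, "region": "EU", "amount": 20},
    {"order_id": 2, "region": "US", "amount": 35},
    {"order_id": 3, "region": "EU", "amount": 15},
]

# Column-oriented layout: each column is stored together.
column_store = {key: [row[key] for row in row_store] for key in row_store[0]}

# An aggregate over one column touches every full record in the row layout...
total_from_rows = sum(row["amount"] for row in row_store)

# ...but only a single list in the columnar layout.
total_from_columns = sum(column_store["amount"])

print(total_from_rows, total_from_columns)
```

Analytical (OLAP) queries tend to read a few columns across many rows, which is the access pattern the columnar layout favors.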

Introduction to the Cloud with AWS

➔ Understand cloud computing

➔ Create an AWS account and understand its services

➔ Set up Amazon S3, IAM, VPC, EC2, RDS PostgreSQL

Implementing Data Warehouses on AWS

➔ Identify components of the Redshift architecture

➔ Run an ETL process to extract data from S3 into Redshift

➔ Set up AWS infrastructure using Infrastructure as Code (IaC)

➔ Design an optimized table by selecting the appropriate distribution style and sort key

Project 2: Data Infrastructure on the Cloud

Course 3: Data Lakes with Spark

The Power of Spark

➔ Understand the big data ecosystem

➔ Understand when to use Spark and when not to use it

Data Wrangling with Spark

➔ Manipulate data with Spark SQL and Spark DataFrames

➔ Use Spark for ETL purposes

Debugging and Optimization

➔ Troubleshoot common errors and optimize code using the Spark Web UI

Introduction to Data Lakes

➔ Understand the purpose and evolution of data lakes

➔ Implement data lakes on Amazon S3, EMR, Athena, and AWS Glue

➔ Use Spark to run ELT processes and analytics on data of diverse sources, structures, and vintages

➔ Understand the components and issues of data lakes

Project 3: Big Data with Spark

Course 4: Automate Data Pipelines

Data Pipelines

➔ Create data pipelines with Apache Airflow

➔ Set up task dependencies

➔ Create data connections using hooks
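
The idea behind task dependencies can be shown without Airflow itself. The sketch below uses the standard-library `graphlib` module (Python 3.9+) to order a small set of hypothetical tasks so that each runs only after the tasks it depends on. In Airflow the same relationships are usually declared between operators with `>>`; the task names here are invented for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "stage_events": {"begin"},
    "stage_songs": {"begin"},
    "load_fact": {"stage_events", "stage_songs"},
    "quality_check": {"load_fact"},
}

# static_order() yields tasks so that every dependency comes before its dependents.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The two staging tasks have no dependency on each other, so a scheduler is free to run them in parallel; only their position relative to `begin` and `load_fact` is fixed.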

Data Quality

➔ Track data lineage

➔ Set up data pipeline schedules

➔ Partition data to optimize pipelines

➔ Write tests to ensure data quality

➔ Backfill data
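
A data quality test can be as simple as a function that inspects loaded rows and reports what is wrong. The following is a minimal sketch under assumed requirements (the table must not be empty and certain columns must not contain nulls); the function name `run_quality_checks` and the sample records are hypothetical.

```python
def run_quality_checks(rows, required_columns):
    """Return a list of failure messages; an empty list means all checks passed."""
    failures = []
    if not rows:
        failures.append("table is empty")
        return failures
    for column in required_columns:
        # Count rows where the required column is missing or None.
        null_count = sum(1 for row in rows if row.get(column) is None)
        if null_count:
            failures.append(f"{column}: {null_count} null value(s)")
    return failures

good = [{"user_id": 1, "level": "free"}, {"user_id": 2, "level": "paid"}]
bad = [{"user_id": 1, "level": None}, {"user_id": None, "level": None}]

print(run_quality_checks(good, ["user_id", "level"]))
print(run_quality_checks(bad, ["user_id", "level"]))
```

In a pipeline, a non-empty result would typically cause the checking task to fail so that downstream tasks do not consume bad data.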

Production Data Pipelines

➔ Build reusable and maintainable pipelines

➔ Build your own Apache Airflow plugins

➔ Implement subDAGs

➔ Set up task boundaries

➔ Monitor data pipelines

Project 4: Data Pipelines with Airflow