Data-engineering-nanodegree
Projects completed in the Data Engineering Nanodegree by Udacity.
Course 1: Data Modeling
Introduction to Data Modeling
✓ Understand the purpose of data modeling
✓ Identify the strengths and weaknesses of different types of databases and data storage techniques
✓ Create a table in Postgres and Apache Cassandra (see the sketch below)
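A minimal sketch of that last objective, assuming a local Postgres instance and a single-node Cassandra cluster with default credentials; the database, keyspace, and table names are made up for illustration:

```python
import psycopg2
from cassandra.cluster import Cluster

# Postgres: connect and create a simple table (connection settings are placeholders)
pg_conn = psycopg2.connect("host=127.0.0.1 dbname=studentdb user=student password=student")
pg_cur = pg_conn.cursor()
pg_cur.execute("""
    CREATE TABLE IF NOT EXISTS music_library (
        album_id   int PRIMARY KEY,
        album_name varchar,
        artist     varchar,
        year       int
    );
""")
pg_conn.commit()

# Cassandra: connect, create a keyspace, then create a table in it
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS music
    WITH REPLICATION = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace("music")
session.execute("""
    CREATE TABLE IF NOT EXISTS music_library (
        album_id   int,
        album_name text,
        artist     text,
        year       int,
        PRIMARY KEY (album_id)
    )
""")
```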
Relational Data Models
✓ Understand when to use a relational database
✓ Understand the difference between OLAP and OLTP databases
✓ Create normalized data tables
✓ Implement denormalized schemas (e.g. star and snowflake schemas; see the sketch below)
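A small star-schema sketch in Postgres with one fact table and two dimension tables; the songplay/user/time names are hypothetical and only illustrate the denormalized layout:

```python
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Dimension tables hold descriptive attributes
cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_user (
        user_id    int PRIMARY KEY,
        first_name varchar,
        last_name  varchar,
        level      varchar
    );
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_time (
        start_time timestamp PRIMARY KEY,
        hour       int,
        day        int,
        month      int,
        year       int
    );
""")

# The fact table records measurable events and references the dimensions
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_songplay (
        songplay_id serial PRIMARY KEY,
        start_time  timestamp REFERENCES dim_time (start_time),
        user_id     int REFERENCES dim_user (user_id),
        song_id     varchar,
        duration    numeric
    );
""")
conn.commit()
```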
NoSQL Data Models
✓ Understand when to use NoSQL databases and how they differ from relational databases
✓ Select the appropriate primary key and clustering columns for a given use case
✓ Create a NoSQL database in Apache Cassandra (see the sketch below)
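In Cassandra, tables are modeled around the query. A sketch of one table whose composite partition key and clustering column answer "which songs did a user play in a session, in play order"; names and values are illustrative and reuse the keyspace from the earlier sketch:

```python
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("music")  # keyspace assumed from the earlier sketch

# Partition key (user_id, session_id) keeps a session's rows together;
# clustering column item_in_session sorts rows within the partition.
session.execute("""
    CREATE TABLE IF NOT EXISTS songs_by_user_session (
        user_id         int,
        session_id      int,
        item_in_session int,
        artist          text,
        song            text,
        PRIMARY KEY ((user_id, session_id), item_in_session)
    )
""")

# Queries must filter on the full partition key
rows = session.execute(
    "SELECT artist, song FROM songs_by_user_session WHERE user_id = %s AND session_id = %s",
    (10, 182),
)
for row in rows:
    print(row.artist, row.song)
```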
Project 1: Data Modeling with Postgres and Apache Cassandra
Course 2: Cloud Data Warehouses
Introduction to Data Warehouses
✓ Understand Data Warehousing architecture
✓ Run an ETL process to denormalize a database (3NF to star schema)
✓ Create an OLAP cube from facts and dimensions (see the sketch below)
✓ Compare columnar vs. row-oriented approaches
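One way to build an OLAP cube over a fact table in Postgres is GROUP BY CUBE, which aggregates every combination of the listed dimensions in a single query; the table and column names below reuse the hypothetical star schema above:

```python
import psycopg2

conn = psycopg2.connect("host=127.0.0.1 dbname=sparkifydb user=student password=student")
cur = conn.cursor()

# Aggregate the fact table across all combinations of month and user level,
# including the roll-up totals (the cube's "all" slices).
cur.execute("""
    SELECT t.month,
           u.level,
           count(*) AS songplays
    FROM fact_songplay f
    JOIN dim_time t ON f.start_time = t.start_time
    JOIN dim_user u ON f.user_id = u.user_id
    GROUP BY CUBE (t.month, u.level)
    ORDER BY t.month, u.level;
""")
for month, level, songplays in cur.fetchall():
    print(month, level, songplays)
```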
Introduction to the Cloud with AWS
✓ Understand cloud computing
✓ Create an AWS account and understand its services
✓ Set up Amazon S3, IAM, VPC, EC2, and RDS PostgreSQL
Implementing Data Warehouses on AWS
✓ Identify components of the Redshift architecture
✓ Run an ETL process to extract data from S3 into Redshift
✓ Set up AWS infrastructure using Infrastructure as Code (IaC)
✓ Design an optimized table by selecting the appropriate distribution style and sort key (see the sketch below)
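A sketch of the IaC and loading steps, assuming boto3 with AWS credentials configured and an existing IAM role that can read from S3; the cluster settings, bucket path, endpoint, and role ARN are placeholders:

```python
import boto3
import psycopg2

# Infrastructure as Code: provision a Redshift cluster with boto3
redshift = boto3.client("redshift", region_name="us-west-2")
redshift.create_cluster(
    ClusterIdentifier="dwh-cluster",
    ClusterType="multi-node",
    NodeType="dc2.large",
    NumberOfNodes=4,
    DBName="dwh",
    MasterUsername="dwhuser",
    MasterUserPassword="Passw0rd",                          # placeholder
    IamRoles=["arn:aws:iam::123456789012:role/dwhRole"],    # placeholder ARN
)

# Once the cluster is available: COPY raw data from S3 into a staging table,
# and define the analytics table with an explicit distribution key and sort key.
conn = psycopg2.connect("host=<cluster-endpoint> dbname=dwh user=dwhuser password=Passw0rd port=5439")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS staging_events (
        artist  varchar,
        song    varchar,
        user_id int,
        ts      bigint
    );
""")
cur.execute("""
    COPY staging_events
    FROM 's3://my-bucket/log_data'
    IAM_ROLE 'arn:aws:iam::123456789012:role/dwhRole'
    FORMAT AS JSON 'auto'
    REGION 'us-west-2';
""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_songplay (
        start_time timestamp SORTKEY,
        user_id    int       DISTKEY,
        song       varchar
    );
""")
conn.commit()
```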
Project 2: Data Infrastructure on the Cloud
Course 3: Data Lakes with Spark
The Power of Spark
✓ Understand the big data ecosystem
✓ Understand when to use Spark and when not to use it
Data Wrangling with Spark
✓ Manipulate data with Spark SQL and Spark DataFrames
✓ Use Spark for ETL purposes (see the sketch below)
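A small PySpark ETL sketch that reads local JSON event logs, reshapes them with DataFrame and Spark SQL operations, and writes the result as Parquet; the paths and column names are placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("etl_sketch").getOrCreate()

# Extract: read raw JSON event logs
events = spark.read.json("data/log_data/*.json")

# Transform: keep song plays, derive a timestamp, and select the needed columns
songplays = (
    events.filter(F.col("page") == "NextSong")
          .withColumn("start_time", (F.col("ts") / 1000).cast("timestamp"))
          .select("start_time", "userId", "song", "artist")
)

# The same kind of transform can also be expressed with Spark SQL
songplays.createOrReplaceTempView("songplays")
per_user = spark.sql("SELECT userId, count(*) AS plays FROM songplays GROUP BY userId")

# Load: write the transformed data out as Parquet
songplays.write.mode("overwrite").parquet("output/songplays/")
```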
Debugging and Optimization
✓ Troubleshoot common errors and optimize code using the Spark Web UI
Introduction to Data Lakes
✓ Understand the purpose and evolution of data lakes
✓ Implement data lakes on Amazon S3, EMR, Athena, and AWS Glue
✓ Use Spark to run ELT processes and analytics on data of diverse sources, structures, and vintages (see the sketch below)
✓ Understand the components and issues of data lakes
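A sketch of one ELT step over an S3 data lake, assuming the job runs on EMR (or locally with the s3a connector and AWS credentials configured); the bucket name and partitioning scheme are made up:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data_lake_elt").getOrCreate()

# Load raw song metadata straight from the lake's landing zone on S3
songs = spark.read.json("s3a://my-data-lake/raw/song_data/*/*/*/*.json")

# Transform on the cluster, then write an analytics-ready, partitioned
# Parquet table back to the lake
songs_table = (
    songs.select("song_id", "title", "artist_id", "year", "duration")
         .dropDuplicates(["song_id"])
)
songs_table.write.mode("overwrite") \
    .partitionBy("year", "artist_id") \
    .parquet("s3a://my-data-lake/analytics/songs/")
```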
Project 3: Big Data with Spark
Course 4: Automate Data Pipelines
Data Pipelines
✓ Create data pipelines with Apache Airflow
✓ Set up task dependencies
✓ Create data connections using hooks (see the sketch below)
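A minimal Airflow DAG sketch, assuming Airflow 2.x with the Postgres provider installed and a connection named redshift defined in Airflow; the task logic and SQL are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


def load_staging(**context):
    # Hooks wrap the connection stored in Airflow's metadata database
    redshift = PostgresHook(postgres_conn_id="redshift")
    redshift.run("TRUNCATE staging_events;")  # placeholder SQL


def load_fact(**context):
    redshift = PostgresHook(postgres_conn_id="redshift")
    redshift.run("INSERT INTO fact_songplay SELECT start_time, user_id, song FROM staging_events;")  # placeholder SQL


with DAG(
    dag_id="sparkify_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@hourly",
    catchup=False,
) as dag:
    stage = PythonOperator(task_id="stage_events", python_callable=load_staging)
    fact = PythonOperator(task_id="load_fact", python_callable=load_fact)

    # Task dependencies: staging must finish before the fact load starts
    stage >> fact
```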
Data Quality
✓ Track data lineage
✓ Set up data pipeline schedules
✓ Partition data to optimize pipelines
✓ Write tests to ensure data quality (see the sketch below)
✓ Backfill data
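A sketch of a data-quality test written as a task callable that fails the run when a table is empty; it would be wired into the DAG above as another PythonOperator task, and the table list is hypothetical:

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook


def check_has_rows(**context):
    redshift = PostgresHook(postgres_conn_id="redshift")
    for table in ["fact_songplay", "dim_user"]:
        records = redshift.get_records(f"SELECT count(*) FROM {table}")
        if not records or records[0][0] < 1:
            raise ValueError(f"Data quality check failed: {table} is empty")
```

For backfills, a start_date in the past with catchup enabled (or an explicit `airflow dags backfill -s 2024-01-01 -e 2024-01-07 sparkify_etl`) makes Airflow re-run the DAG once per missed schedule interval over historical data.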
Production Data Pipelines
✓ Build reusable and maintainable pipelines
✓ Build your own Apache Airflow plugins (see the sketch below)
✓ Implement subDAGs
✓ Set up task boundaries
✓ Monitor data pipelines
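A sketch of a reusable custom operator, assuming Airflow 2.x where a custom operator is a plain BaseOperator subclass (in Airflow 1.10 it would additionally be registered through an AirflowPlugin); it generalizes the row-count check above, and its parameters are illustrative:

```python
from airflow.models import BaseOperator
from airflow.providers.postgres.hooks.postgres import PostgresHook


class HasRowsOperator(BaseOperator):
    """Fail the task if the given table has no rows."""

    def __init__(self, redshift_conn_id="redshift", table="", **kwargs):
        super().__init__(**kwargs)
        self.redshift_conn_id = redshift_conn_id
        self.table = table

    def execute(self, context):
        redshift = PostgresHook(postgres_conn_id=self.redshift_conn_id)
        records = redshift.get_records(f"SELECT count(*) FROM {self.table}")
        if not records or records[0][0] < 1:
            raise ValueError(f"Data quality check failed: {self.table} is empty")
        self.log.info("Data quality check passed for %s", self.table)


# Usage inside a DAG definition:
# check_users = HasRowsOperator(task_id="check_dim_user", table="dim_user")
```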