• Stars
    star
    1,417
  • Rank 32,892 (Top 0.7 %)
  • Language
    Python
  • License
    Other
  • Created over 4 years ago
  • Updated about 2 years ago

Repository Details

A few projects related to Data Engineering, including Data Modeling, infrastructure setup on the cloud, Data Warehousing, and Data Lake development.

Data Engineering Projects

Project 1: Data Modeling with Postgres

In this project, we apply Data Modeling with Postgres and build an ETL pipeline using Python. A startup wants to analyze the data it has been collecting on songs and user activity in its new music streaming app. The data is currently collected in JSON format, and the analytics team is particularly interested in understanding which songs users are listening to.
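As a rough illustration of the JSON-to-relational ETL described above, here is a minimal sketch. It uses Python's built-in `sqlite3` as a stand-in for Postgres so that it runs without a database server. The table names, column names, and sample events are all assumptions made for illustration, not the project's actual schema.

```python
import json
import sqlite3

# Hypothetical JSON log lines; the field names are assumptions for illustration.
RAW_EVENTS = [
    '{"user_id": 1, "first_name": "User A", "song": "Song A", "artist": "Artist X", "ts": 1000}',
    '{"user_id": 2, "first_name": "User B", "song": "Song A", "artist": "Artist X", "ts": 2000}',
    '{"user_id": 1, "first_name": "User A", "song": "Song B", "artist": "Artist Y", "ts": 3000}',
]


def run_etl(conn, raw_lines):
    """Create a small dimension table and fact table, then load parsed events."""
    cur = conn.cursor()
    cur.execute(
        "CREATE TABLE IF NOT EXISTS users ("
        "user_id INTEGER PRIMARY KEY, first_name TEXT)"
    )
    cur.execute(
        "CREATE TABLE IF NOT EXISTS songplays ("
        "songplay_id INTEGER PRIMARY KEY AUTOINCREMENT, ts INTEGER, "
        "user_id INTEGER REFERENCES users(user_id), song TEXT, artist TEXT)"
    )
    for line in raw_lines:
        event = json.loads(line)  # extract
        # Load the dimension row once per user. In Postgres the equivalent
        # clause would be ON CONFLICT DO NOTHING.
        cur.execute(
            "INSERT OR IGNORE INTO users (user_id, first_name) VALUES (?, ?)",
            (event["user_id"], event["first_name"]),
        )
        # Load one fact row per song play.
        cur.execute(
            "INSERT INTO songplays (ts, user_id, song, artist) VALUES (?, ?, ?, ?)",
            (event["ts"], event["user_id"], event["song"], event["artist"]),
        )
    conn.commit()


def top_songs(conn):
    """Answer the analytics question: which songs are played most often?"""
    return conn.execute(
        "SELECT song, COUNT(*) AS plays FROM songplays "
        "GROUP BY song ORDER BY plays DESC, song ASC"
    ).fetchall()


conn = sqlite3.connect(":memory:")
run_etl(conn, RAW_EVENTS)
print(top_songs(conn))
```

Splitting users into their own table keeps the fact table narrow, while the `GROUP BY` query answers the "what are users listening to" question directly.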

Link: Data_Modeling_with_Postgres

Project 2: Data Modeling with Cassandra

In this project, we apply Data Modeling with Cassandra and build an ETL pipeline using Python. We build the Data Model around the queries we want answered. For our use case, we want to:

  • Get details of a song that was heard in the music app history during a particular session.
  • Get the songs played by a user during a particular session on the music app.
  • Get all users from the music app history who listened to a particular song.
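The query-first approach behind the three questions above can be sketched in plain Python. Here, ordinary dictionaries stand in for Cassandra tables: each "table" is keyed by the values its query filters on, and the same events are written to all three. The column names and sample rows are assumptions for illustration, and this is not Cassandra driver code.

```python
from collections import defaultdict

# Hypothetical event rows; the column names are assumptions for illustration.
EVENTS = [
    {"session_id": 1, "item_in_session": 0, "user_id": 10, "user": "User A",
     "artist": "Artist X", "song": "Song 1"},
    {"session_id": 1, "item_in_session": 1, "user_id": 10, "user": "User A",
     "artist": "Artist Y", "song": "Song 2"},
    {"session_id": 2, "item_in_session": 0, "user_id": 20, "user": "User B",
     "artist": "Artist X", "song": "Song 1"},
]


def build_tables(events):
    """Denormalize the events into one lookup structure per query."""
    # Query 1: song details for a given session and item in that session.
    song_by_session_item = {}
    # Query 2: songs for a given user and session, ordered by item in session.
    songs_by_user_session = defaultdict(list)
    # Query 3: all users who listened to a given song.
    users_by_song = defaultdict(set)

    for e in events:
        song_by_session_item[(e["session_id"], e["item_in_session"])] = (
            e["artist"], e["song"])
        songs_by_user_session[(e["user_id"], e["session_id"])].append(
            (e["item_in_session"], e["song"]))
        users_by_song[e["song"]].add(e["user"])

    # Sorting mimics what a clustering column would do inside a partition.
    for rows in songs_by_user_session.values():
        rows.sort()
    return song_by_session_item, songs_by_user_session, users_by_song


song_by_session_item, songs_by_user_session, users_by_song = build_tables(EVENTS)
print(song_by_session_item[(1, 1)])
print(songs_by_user_session[(10, 1)])
print(sorted(users_by_song["Song 1"]))
```

The data is duplicated across the three structures on purpose: in this style of modeling, each query gets a table shaped for it, and the read becomes a single key lookup.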

Link: Data_Modeling_with_Apache_Cassandra

Project 3: Data Warehouse

In this project, we apply the Data Warehouse architectures we learned and build a Data Warehouse on the AWS cloud. We build an ETL pipeline that extracts and transforms data stored in JSON format in S3 buckets and moves it to a warehouse hosted on Amazon Redshift.
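One common way to move JSON files from S3 into Redshift is the `COPY` command, typically into a staging table that later SQL transforms into the final tables. The sketch below only builds such a statement as a string. The table name, bucket path, role ARN, and region are placeholders, and the source does not show which statements the project actually runs.

```python
def build_copy_statement(table, s3_path, iam_role_arn, region, json_option="auto"):
    """Build a Redshift COPY statement that loads JSON files from S3.

    The values are interpolated directly into the SQL text, so this is only
    suitable for trusted configuration values, never for user input.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role_arn}'\n"
        f"FORMAT AS JSON '{json_option}'\n"
        f"REGION '{region}';"
    )


# All names below are hypothetical placeholders.
statement = build_copy_statement(
    table="staging_events",
    s3_path="s3://example-bucket/log_data",
    iam_role_arn="arn:aws:iam::123456789012:role/example-redshift-role",
    region="us-west-2",
)
print(statement)
```

The resulting string would then be executed against the cluster with a Postgres-compatible client, after which `INSERT ... SELECT` statements can populate the warehouse tables from staging.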

Use the Redshift IaC script: Redshift_IaC_README

Link: Data_Warehouse

Project 4: Data Lake

In this project, we build a Data Lake on the AWS cloud using Spark and an AWS EMR cluster. The data lake serves as a Single Source of Truth for the Analytics Platform. We write Spark jobs to perform ELT operations that pick up data from the landing zone on S3, transform it, and store it in the processed zone on S3.
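The landing-zone-to-processed-zone flow can be illustrated without a cluster. The sketch below is a plain-Python stand-in, not Spark code: local temporary directories play the role of the two S3 zones, and the output is written into `year=...` directories to mimic the partitioned layout Spark produces. The field names and sample records are assumptions.

```python
import json
import tempfile
from pathlib import Path


def process_zone(landing: Path, processed: Path) -> int:
    """Read raw JSON-lines files, keep song plays, write them partitioned by year."""
    written = 0
    for raw_file in sorted(landing.glob("*.json")):
        for line in raw_file.read_text().splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            if not record.get("song"):
                continue  # drop events that are not song plays
            partition = processed / f"year={record['year']}"
            partition.mkdir(parents=True, exist_ok=True)
            cleaned = {"song": record["song"].strip(), "year": record["year"]}
            with (partition / "part-0000.json").open("a") as out:
                out.write(json.dumps(cleaned) + "\n")
            written += 1
    return written


# Hypothetical raw events; the second one has no song and is filtered out.
SAMPLE_LINES = [
    '{"song": " Song A ", "year": 2018, "page": "NextSong"}',
    '{"song": null, "year": 2018, "page": "Home"}',
    '{"song": "Song B", "year": 2019, "page": "NextSong"}',
]

with tempfile.TemporaryDirectory() as tmp:
    landing = Path(tmp) / "landing"
    processed = Path(tmp) / "processed"
    landing.mkdir()
    (landing / "events.json").write_text("\n".join(SAMPLE_LINES))

    count = process_zone(landing, processed)
    partitions = sorted(p.name for p in processed.iterdir())
    first_row = json.loads(
        (processed / "year=2018" / "part-0000.json").read_text().splitlines()[0]
    )

print(count, partitions, first_row)
```

In a Spark job the same three steps would be a read from the landing path, a chain of DataFrame transformations, and a partitioned write to the processed path.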

Link: Data_Lake

Project 5: Data Pipelines with Airflow

In this project, we orchestrate our Data Pipeline workflow using Apache Airflow, an open-source Apache project. We schedule our ETL jobs in Airflow, create project-specific custom plugins and operators, and automate the pipeline execution.
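At its core, an orchestrator runs tasks in an order that respects their dependencies. The sketch below shows that idea with the standard library's `graphlib` (Python 3.9+). It is not Airflow's API, and the task names are hypothetical; in Airflow the same dependencies would be declared between operators inside a DAG.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each task maps to the set of tasks it depends on.
DEPENDENCIES = {
    "stage_events": {"begin"},
    "stage_songs": {"begin"},
    "load_fact_table": {"stage_events", "stage_songs"},
    "load_dimension_tables": {"load_fact_table"},
    "run_quality_checks": {"load_dimension_tables"},
}


def run_pipeline(dependencies, tasks):
    """Run every task after all of its upstream tasks have finished."""
    executed = []
    for name in TopologicalSorter(dependencies).static_order():
        tasks[name]()  # a real task would extract, load, or validate data
        executed.append(name)
    return executed


# Collect every task name, including ones that only appear as dependencies.
ALL_TASKS = set(DEPENDENCIES) | set().union(*DEPENDENCIES.values())

# Placeholder callables that just record that they ran.
log = []
TASKS = {name: (lambda n=name: log.append(n)) for name in ALL_TASKS}

order = run_pipeline(DEPENDENCIES, TASKS)
print(order)
```

The two staging tasks have no dependency on each other, so their relative order is not fixed; a scheduler such as Airflow can run independent tasks like these in parallel.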

Link: Airflow_Data_Pipelines

Project 6: API Data to Postgres

In this project, we build an ETL pipeline to fetch data from the Yelp API and insert it into a Postgres database. This project is a very basic example of fetching real-time data from a public API.
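The fetch-and-flatten half of such a pipeline might look like the sketch below. It assumes the Yelp Fusion business search endpoint with bearer-token authentication; the source does not say which endpoint or fields the project uses. The request is only constructed, never sent, and the sample payload, API key, and search values are placeholders.

```python
import urllib.parse
import urllib.request

# Assumed endpoint: Yelp Fusion business search.
SEARCH_URL = "https://api.yelp.com/v3/businesses/search"


def build_search_request(api_key, term, location):
    """Build (but do not send) an authenticated search request."""
    query = urllib.parse.urlencode({"term": term, "location": location})
    return urllib.request.Request(
        f"{SEARCH_URL}?{query}",
        headers={"Authorization": f"Bearer {api_key}"},
    )


def to_rows(payload):
    """Flatten a search response into tuples ready for a parameterized INSERT."""
    return [
        (b["id"], b["name"], b.get("rating"), b.get("review_count"))
        for b in payload.get("businesses", [])
    ]


# Hypothetical response body, trimmed to the fields used above.
SAMPLE_PAYLOAD = {
    "businesses": [
        {"id": "abc123", "name": "Example Cafe", "rating": 4.5, "review_count": 12},
    ]
}

request = build_search_request("YOUR_API_KEY", "coffee", "Montreal")
rows = to_rows(SAMPLE_PAYLOAD)
print(request.full_url)
print(rows)
```

Sending the request with `urllib.request.urlopen(request)` and parsing the body with `json.load` would produce a payload of the same shape, and the resulting rows could be inserted into Postgres with a parameterized `INSERT` statement.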

Link: API to Postgres

CAPSTONE PROJECT

Udacity provides its own Capstone project with datasets that include data on immigration to the United States, plus supplementary datasets on airport codes, U.S. city demographics, and temperature.

I worked on my own open-ended project instead.
Link: goodreads_etl_pipeline

More Repositories

1

goodreads_etl_pipeline

An end-to-end GoodReads Data Pipeline for Building Data Lake, Data Warehouse and Analytics Platform.
Python
1,259
star
2

Cloudera_Material

Cloudera_Material: Study material to help people prepare for the Cloudera CCA Spark and Hadoop Developer Exam (CCA175). Feel free to collaborate.
31
star
3

Optimizing-Public-Transportation

A real-time event pipeline built around the Kafka ecosystem for the Chicago Transit Authority.
Python
27
star
4

Big_Data_Project

Fake News Detection: feature extraction using vectorization techniques such as Count Vectorizer, TF-IDF Vectorizer, and Hash Vectorizer, followed by an ensemble model that classifies whether the news is fake or not.
Python
15
star
5

Spark_Packaged_project

This project contains PySpark jobs that create data pipelines and shows how to distribute the project package on a cluster.
Python
5
star
6

SF-Crime-Statistics

A Kafka and Spark Streaming integration project: SF Crime Statistics with Spark Streaming
Python
3
star
7

IPL-analysis-with-Python-Pandas

This project provides an analysis of IPL (Indian Premier League) stats from 2008 to 2017.
Jupyter Notebook
2
star
8

Uppaal_Model_Checking

Model Checking For Automated Machine Learning Models
q
2
star
9

Yelp_Project

This project creates a data lake for the Yelp dataset, then uses it to build an analytical sandbox for data science and a data warehouse for reporting.
Jupyter Notebook
2
star
10

SOEN_6441

A multiplayer board Risk Game.
Java
1
star
11

Black-Friday-Sales-Analysis

This project gives insight into a few statistics related to the Black Friday sale.
Jupyter Notebook
1
star