• This repository has been archived on 25/Mar/2024
  • Stars: 132
  • Rank 274,205 (Top 6 %)
  • Language: Python
  • License: MIT License
  • Created over 4 years ago
  • Updated over 4 years ago

Skytrax Data Warehouse

A full data warehouse infrastructure with ETL pipelines running inside Docker, using Apache Airflow for data orchestration, AWS Redshift as the cloud data warehouse, and Metabase for data visualization such as analytical dashboards.

Architecture

The data warehouse consists of various modules:

Overview

Data is obtained from here. The collected data is stored on local disk and is periodically moved to the landing bucket on AWS S3. ETL jobs are written in SQL and scheduled in Airflow to run every hour to keep the data in the cloud data warehouse fresh.

Data Modeling

The following fact and dimension tables are created:

Dimension Tables

aircrafts
airlines
passengers
airports
lounges

Fact Tables

fact_ratings
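
As a rough illustration of how these tables could fit together in a star schema, here is a minimal sketch. The README names the tables but not their columns, so every column name below is an assumption, and SQLite stands in for Redshift only so the example runs locally.

```python
import sqlite3

# Hypothetical columns: the README lists table names only, not their fields.
SCHEMA = """
CREATE TABLE airlines   (airline_id   INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE aircrafts  (aircraft_id  INTEGER PRIMARY KEY, model TEXT NOT NULL);
CREATE TABLE passengers (passenger_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE airports   (airport_id   INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE lounges    (lounge_id    INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE fact_ratings (
    rating_id     INTEGER PRIMARY KEY,
    airline_id    INTEGER REFERENCES airlines(airline_id),
    aircraft_id   INTEGER REFERENCES aircrafts(aircraft_id),
    passenger_id  INTEGER REFERENCES passengers(passenger_id),
    airport_id    INTEGER REFERENCES airports(airport_id),
    lounge_id     INTEGER REFERENCES lounges(lounge_id),
    overall_score INTEGER
);
"""


def build_schema(conn):
    """Create the five dimension tables and the single fact table."""
    conn.executescript(SCHEMA)


def average_score_by_airline(conn):
    """Typical star-schema query: join the fact table to one dimension."""
    return conn.execute(
        """
        SELECT a.name, AVG(f.overall_score)
        FROM fact_ratings AS f
        JOIN airlines AS a ON a.airline_id = f.airline_id
        GROUP BY a.name
        ORDER BY a.name
        """
    ).fetchall()


conn = sqlite3.connect(":memory:")
build_schema(conn)
conn.execute("INSERT INTO airlines VALUES (1, 'Example Air')")
conn.executemany(
    "INSERT INTO fact_ratings (airline_id, overall_score) VALUES (?, ?)",
    [(1, 8), (1, 6)],
)
print(average_score_by_airline(conn))  # [('Example Air', 7.0)]
```

The fact table holds one row per rating and points at each dimension by key, so dashboard queries reduce to a join plus an aggregation.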

ETL Flow

  • Data collected from here is moved to the landing zone S3 bucket.
  • The ETL job has an S3 module that copies data from the landing zone to staging tables in Redshift.
  • Once the data is in Redshift, an Airflow task is triggered that reads the data from the staging area and applies transformations.
  • Using the Redshift staging tables, an UPSERT operation is performed on the dimension and fact tables of the data warehouse to update the data.
  • ETL job execution is completed once the data warehouse is updated.
  • The Airflow DAG runs data quality checks on the warehouse tables between ETL tasks to verify the data.
  • DAG execution completes once the data warehouse is updated.
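
The upsert and quality-check steps above can be sketched as follows. This is not the repository's code: it shows one common way to merge a staging table into a target table (delete the matching rows, then insert everything from staging, inside one transaction), followed by a simple row-count and NULL-key check. SQLite stands in for Redshift, and the table names `airlines` and `staging_airlines`, their columns, and both helper functions are assumptions made for illustration.

```python
import sqlite3


def upsert_airlines(conn):
    """Merge staging rows into the target: delete matches, then insert."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute(
            "DELETE FROM airlines WHERE airline_id IN "
            "(SELECT airline_id FROM staging_airlines)"
        )
        conn.execute(
            "INSERT INTO airlines (airline_id, name) "
            "SELECT airline_id, name FROM staging_airlines"
        )


def check_quality(conn, table, key):
    """Raise if the table is empty or its key column contains NULLs."""
    rows = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    nulls = conn.execute(
        f"SELECT COUNT(*) FROM {table} WHERE {key} IS NULL"
    ).fetchone()[0]
    if rows == 0:
        raise ValueError(f"{table} is empty")
    if nulls:
        raise ValueError(f"{table}.{key} has {nulls} NULL values")
    return rows


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE airlines (airline_id INTEGER, name TEXT);
CREATE TABLE staging_airlines (airline_id INTEGER, name TEXT);
INSERT INTO airlines VALUES (1, 'Old Name'), (2, 'Unchanged Air');
INSERT INTO staging_airlines VALUES (1, 'New Name'), (3, 'Added Air');
""")

upsert_airlines(conn)
print(conn.execute("SELECT * FROM airlines ORDER BY airline_id").fetchall())
# [(1, 'New Name'), (2, 'Unchanged Air'), (3, 'Added Air')]
print(check_quality(conn, "airlines", "airline_id"))  # 3
```

Running both statements in a single transaction means readers never see the target table with the matching rows deleted but not yet reinserted.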

Environment Setup

Hardware Used

Redshift: I used a 2-node cluster with the dc2.large instance type.

Setting Up Infrastructure

Run the following commands in a terminal to set up the whole infrastructure locally:

  1. git clone https://github.com/iam-mhaseeb/Skytrax-Data-Warehouse
  2. cd Skytrax-Data-Warehouse
  3. With the Docker service installed and running, run docker-compose up. It will take some time to pull the latest images and install everything automatically in Docker.

Setting up Redshift

You can follow the AWS Guide to run a Redshift cluster.

How to run

Airflow

Make sure the Docker containers are running. Open the Airflow UI at http://localhost:8080 in a browser and set up the required connections.

You should be able to see the skytrax_etl_pipeline DAG, as in the pictures below:

Skytrax Pipeline DAG DAG View

You can explore the DAG further in different views, as shown below:

DAG View: DAG

DAG Tree View: DAG Tree

DAG Gantt View: DAG Gantt View

Metabase

Make sure the Docker containers are running. Open the Metabase UI at http://localhost:3000 in a browser and set up your Metabase account and database.

After the DAG has run successfully, you should be able to explore the data, as I did with the dashboards in the pictures below:

Dashboard1: DAG

Dashboard2: DAG Tree

Scenarios

  • Data increases by 100x (read-heavy or write-heavy workloads)

    • Redshift is an analytical database optimized for aggregation, and it also performs well on read-heavy workloads.
    • Introduce an EMR cluster sized to handle the larger volume of data.
  • Pipelines run daily at 7 AM. How would the dashboard be updated? Would it still work?

    • The DAG is scheduled to run every hour and can be reconfigured to run every morning at 7 AM if required.
    • Data quality operators are placed at appropriate points in the DAG. If the DAG fails, email alerts can be configured to notify the team.
  • Make it available to 100+ people

    • We can set the concurrency limit for the Amazon Redshift cluster. While the limit is 50 parallel queries at any one time, it applies per cluster, so you can launch as many clusters as your business needs.
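
To make the hourly versus 7 AM scheduling scenario above concrete, here is a small sketch. Airflow DAG schedules can be given as cron expressions, and the two constants below are the standard five-field forms for "every hour" and "every day at 07:00". The example does not import Airflow. The helper `next_daily_run` is a hypothetical function that only illustrates when a daily 7 AM schedule would next fire.

```python
from datetime import datetime, timedelta

# Standard five-field cron expressions for the two schedules discussed above.
HOURLY = "0 * * * *"     # minute 0 of every hour
DAILY_7AM = "0 7 * * *"  # minute 0 of hour 7, every day


def next_daily_run(now, hour=7):
    """Return the first run time strictly after `now` for a daily schedule."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:
        candidate += timedelta(days=1)
    return candidate


print(next_daily_run(datetime(2024, 1, 1, 6, 30)))  # 2024-01-01 07:00:00
print(next_daily_run(datetime(2024, 1, 1, 7, 0)))   # 2024-01-02 07:00:00
```

With a daily schedule the dashboards would still work, but they would reflect data that is up to a day old rather than up to an hour old.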

Authors

License

This project is licensed under the MIT License - see the LICENSE file for details
