Discover GabrielAmazonas/airflow-pyspark-emr Open Source project

Stars: 6
Rank: 2,539,965 (Top 51 %)
Language: Python
Created: over 4 years ago
Updated: over 2 years ago

GabrielAmazonas


This project demonstrates how to process data stored in a data lake and transform it into an OLAP-optimized structure using PySpark. The PySpark job runs on AWS EMR, and the data pipeline is orchestrated by Apache Airflow, which also handles infrastructure creation and EMR cluster termination.
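As a rough illustration of the flow described above, the sketch below builds the kind of step definition that EMR accepts for running a PySpark script through `spark-submit`, and lists the task order an orchestrator such as Airflow would follow. It is not taken from the repository: the S3 paths, the step name, the script arguments, and the task names are all hypothetical, and the project's actual DAG may be structured differently.

```python
from typing import Any, Dict, List

# Hypothetical S3 location of the PySpark script; not taken from the repository.
SCRIPT_URI = "s3://example-bucket/jobs/transform.py"


def build_spark_step(name: str, script_uri: str, *args: str) -> Dict[str, Any]:
    """Return one EMR step definition that runs a PySpark script via spark-submit."""
    return {
        "Name": name,
        # On failure, cancel any remaining steps and leave the cluster waiting,
        # so the orchestrator can terminate it explicitly afterwards.
        "ActionOnFailure": "CANCEL_AND_WAIT",
        "HadoopJarStep": {
            # command-runner.jar lets an EMR step execute a command such as spark-submit.
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster", script_uri, *args],
        },
    }


# Task order for the orchestrated pipeline (task names are illustrative only).
PIPELINE_ORDER: List[str] = [
    "create_emr_cluster",
    "submit_pyspark_step",
    "wait_for_step",
    "terminate_emr_cluster",
]

# Example step: the step name and output path are assumptions for this sketch.
step = build_spark_step(
    "lake_to_olap", SCRIPT_URI, "--output", "s3://example-bucket/olap/"
)
```

In an Airflow DAG, a dictionary like `step` would typically be passed to an operator that adds steps to the cluster, with a sensor waiting on the step before the termination task runs.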

hudi-on-glue-quick-start

AWS Glue PySpark - Apache Hudi Quick Start Guide

Python

datasprints-open-spaces

Repository for the code demoed in the talk

Jupyter Notebook

flame

Flame 🔥 Opinionated Flask & MongoDB backend boilerplate.

Python

delta-lake-on-glue-quickstart

This is a quick start guide for the Delta Lake (delta.io) Python Spark connector, running on AWS Glue.

Python