Beginner DE Project - Batch Edition

See the detailed explanation in the blog post Data engineering project: Batch edition.

Design

We will be using Airflow to orchestrate the following (a code sketch of this ordering follows the list):

  1. Classifying movie reviews with Apache Spark.
  2. Loading the classified movie reviews into the data warehouse.
  3. Extracting user purchase data from an OLTP database and loading it into the data warehouse.
  4. Joining the classified movie review data and user purchase data to get user behavior metrics.
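
Purely as an illustration of that ordering, here is a minimal Airflow 2.x sketch (assuming Airflow 2.3+ for EmptyOperator; the DAG and task names are hypothetical, and the real DAG in this repo is the source of truth):

from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator

# Hypothetical sketch of the ordering above; the repo's DAG defines
# its own operators, task names, and schedule.
with DAG(
    dag_id="user_behaviour",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    classify_movie_reviews = EmptyOperator(task_id="classify_movie_reviews")  # step 1
    load_classified_reviews = EmptyOperator(task_id="load_classified_reviews")  # step 2
    extract_user_purchases = EmptyOperator(task_id="extract_user_purchases")  # step 3
    generate_user_behavior_metric = EmptyOperator(task_id="generate_user_behavior_metric")  # step 4

    classify_movie_reviews >> load_classified_reviews
    [load_classified_reviews, extract_user_purchases] >> generate_user_behavior_metric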

Data pipeline design

Setup

Prerequisites

  1. git
  2. GitHub account
  3. Terraform
  4. AWS account
  5. AWS CLI installed and configured
  6. Docker with at least 4GB of RAM and Docker Compose v1.27.0 or later
  7. psql

Read this post for information on setting up CI/CD, DB migrations, IaC (Terraform), make commands, and automated testing.

Run these commands to set up your project locally and on the cloud.

# Clone and cd into the project directory.
git clone https://github.com/josephmachado/beginner_de_project.git
cd beginner_de_project

# Local run & test
make up # starts the Docker containers on your computer & runs migrations under ./migrations
make ci # Runs auto formatting, lint checks, & all the test files under ./tests

# Create AWS services with Terraform
make tf-init # Only needed on your first terraform run (or if you add new providers)
make infra-up # type in yes after verifying the changes TF will make

# Create Redshift Spectrum tables (tables with data in S3)
make spectrum-migration
# Create Redshift tables
make redshift-migration

# Wait until the EC2 instance is initialized; you can check this via your AWS UI
# See "Status Check" on the EC2 console; it should be "2/2 checks passed" before proceeding
# (an optional scripted wait is sketched after these commands)
# Wait another 5 mins; Airflow takes a while to start up

make cloud-airflow # forwards the Airflow port from the EC2 instance to your machine and opens it in the browser
# the username and password are both airflow

make cloud-metabase # forwards the Metabase port from the EC2 instance to your machine and opens it in the browser
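
If you would rather script the wait for the EC2 status checks than poll the console, here is a minimal sketch using boto3 (assumes the AWS CLI credentials from the prerequisites; the instance ID is a placeholder, get yours from the EC2 console or the terraform output):

import boto3

# Blocks until the instance reports "2/2 checks passed" on the EC2 console.
ec2 = boto3.client("ec2")
waiter = ec2.get_waiter("instance_status_ok")
waiter.wait(InstanceIds=["i-0123456789abcdef0"])  # placeholder instance ID
print("Status checks passed; give Airflow a few more minutes to start up.")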

To get the Redshift connection credentials for Metabase, use these commands.

make infra-config
# use redshift_dns_name as host
# use redshift_user & redshift_password
# use dev as the database
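
To sanity-check those credentials outside Metabase, here is a minimal sketch using psycopg2 (the angle-bracket values are placeholders for the make infra-config output; 5439 is Redshift's default port):

import psycopg2

# Substitute the values printed by `make infra-config`.
conn = psycopg2.connect(
    host="<redshift_dns_name>",
    port=5439,
    dbname="dev",
    user="<redshift_user>",
    password="<redshift_password>",
)
with conn.cursor() as cur:
    cur.execute("SELECT 1;")
    print(cur.fetchone())  # (1,) means the connection works
conn.close()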

Since we cannot replicate AWS components locally, we have not set them up here. To learn more about how to set up components locally, read this article.

Create database migrations as shown below.

make db-migration # enter a description, e.g., create some schema
# make your changes to the newly created file under ./migrations
make redshift-migration # to run the new migration on your warehouse

For continuous delivery to work, set up the infrastructure with Terraform & define the following repository secrets. You can set up the repository secrets by going to Settings > Secrets > Actions > New repository secret.

  1. SERVER_SSH_KEY: We can get this by running terraform -chdir=./terraform output -raw private_key in the project directory and pasting the entire content into a new Action secret called SERVER_SSH_KEY.
  2. REMOTE_HOST: Get this by running terraform -chdir=./terraform output -raw ec2_public_dns in the project directory.
  3. REMOTE_USER: The value for this is ubuntu.

We have a DAG validity test defined here.
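
The test in the repo is the source of truth; a common minimal form of a DAG validity test simply asserts that every DAG file imports cleanly (the dags folder path here is an assumption):

from airflow.models import DagBag

def test_no_import_errors():
    # Fails if any file in the dags folder raises an error on import.
    dag_bag = DagBag(dag_folder="dags", include_examples=False)
    assert len(dag_bag.import_errors) == 0, f"Import errors: {dag_bag.import_errors}"

Run it with pytest alongside the rest of the suite; make ci runs everything under ./tests.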

Tear down infra

After you are done, make sure to destroy your cloud infrastructure.

make down # Stops the Docker containers on your computer
make infra-down # type in yes after verifying the changes TF will make

This will destroy all the AWS services. Please double-check this by going to the AWS UI S3, EC2, EMR, & Redshift consoles.
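
If you want a scripted double check as well, here is a rough boto3 sketch (assumes configured AWS credentials; it only covers EC2 & Redshift, so still check EMR & S3 in the console):

import boto3

# Both should print "none" after `make infra-down` completes.
ec2 = boto3.client("ec2")
reservations = ec2.describe_instances(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
)["Reservations"]
instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
print("Running EC2 instances:", instances or "none")

redshift = boto3.client("redshift")
clusters = [c["ClusterIdentifier"] for c in redshift.describe_clusters()["Clusters"]]
print("Redshift clusters:", clusters or "none")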

Contributing

Contributions are welcome. If you would like to contribute, you can help by opening a GitHub issue or putting up a PR.
