# Reddit ETL Pipeline
A data pipeline to extract Reddit data from r/dataengineering.
The output is a Google Data Studio report, providing insights into the official Data Engineering subreddit.
## Motivation
This project was based on an interest in Data Engineering and the types of Q&A found on the official subreddit.
It also provided a good opportunity to develop skills and experience with a range of tools. As such, the project is more complex than strictly required, utilising dbt, Airflow, Docker, and cloud-based storage.
## Architecture
- Extract data using Reddit API
- Load into AWS S3
- Copy into AWS Redshift
- Transform using dbt
- Create a dashboard in PowerBI or Google Data Studio
- Orchestrate with Airflow in Docker
- Create AWS resources with Terraform
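As a rough illustration of the first two steps, each post returned by the Reddit API can be flattened into a simple record before being written to CSV and loaded into S3. This is only a sketch: the field list and `flatten_post` helper below are hypothetical, not the pipeline's actual code.

```python
from datetime import datetime, timezone

# Hypothetical subset of post fields to keep; the real pipeline
# may extract different columns.
FIELDS = ["id", "title", "score", "num_comments", "author", "created_utc"]

def flatten_post(post: dict) -> dict:
    """Flatten a raw Reddit API submission into a flat record for CSV/S3."""
    record = {field: post.get(field) for field in FIELDS}
    # Reddit timestamps are Unix epoch seconds; convert to ISO-8601 UTC
    # so they load cleanly into a Redshift TIMESTAMP column.
    record["created_utc"] = datetime.fromtimestamp(
        post["created_utc"], tz=timezone.utc
    ).isoformat()
    return record

# Example with a made-up submission:
raw = {
    "id": "abc123",
    "title": "How do you orchestrate dbt?",
    "score": 42,
    "num_comments": 7,
    "author": "some_user",
    "created_utc": 1640995200,  # 2022-01-01 00:00:00 UTC
    "selftext": "long body text",  # dropped: not in FIELDS
}
print(flatten_post(raw)["created_utc"])  # 2022-01-01T00:00:00+00:00
```

In the actual pipeline this kind of transformation is handled inside an Airflow task, with the resulting CSV uploaded to S3 and then loaded into Redshift via a COPY.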
## Output
- Final output from Google Data Studio. Link here. Note that the dashboard reads from a static CSV exported from Redshift. The Redshift database was deleted so as not to incur costs.
## Setup
Follow the steps below to set up the pipeline. I've tried to explain each step where I can. Feel free to make improvements or changes.
NOTE: This was developed using an M1 MacBook Pro. If you're on Windows or Linux, you may need to amend certain components if you run into issues.
As AWS offers a free tier, this shouldn't cost you anything unless you amend the pipeline to extract large amounts of data, or keep the infrastructure running for 2+ months. However, please check the AWS free tier limits, as these may change.
First, clone the repository into your home directory and follow the steps.

```bash
git clone https://github.com/ABZ-Aaron/Reddit-API-Pipeline.git
cd Reddit-API-Pipeline
```