Mara Example Project
A runnable app that demonstrates how to build a data warehouse with mara. Combines the data-integration and bigquery-downloader libraries with the mara-app framework into a project.
The example ETL integrates PyPi download stats and GitHub repo activitiy metrics into more general Python project activity stats.
The repository is intended to serve as a template for new projects.
Â
Example: Python Project Stats
The project uses two data sources:
-
The PyPI downloads BigQuery data set at https://bigquery.cloud.google.com/dataset/the-psf:pypi (Google login required). It contains each individual package download together with project and client attributes.
-
The Github archive BigQuery data set at https://bigquery.cloud.google.com/dataset/githubarchive:day. It contains nearly all events that happen to Github repositories.
From both data sources, a set of pre-aggregated and filtered CSVs is incrementally downloaded using the queries in app/bigquery_downloader:
$ gunzip --decompress --stdout data/2018/04/10/pypi/downloads-v1.csv.gz | grep "\tflask\t\|day_id" | head -n 11
day_id project project_version python_version installer number_of_downloads
20180410 flask 0.1 bandersnatch 1
20180410 flask 0.2 bandersnatch 1
20180410 flask 0.5 bandersnatch 1
20180410 flask 0.6 bandersnatch 1
20180410 flask 0.8 bandersnatch 1
20180410 flask 0.9 bandersnatch 1
20180410 flask 0.10 2.6 pip 1
20180410 flask 0.11 Browser 1
20180410 flask 0.11 bandersnatch 1
20180410 flask 0.5.1 bandersnatch 1
$ gunzip --decompress --stdout data/2018/04/10/github/repo-activity-v1.csv.gz | grep "\tflask\t\|day_id"
day_id user repository number_of_forks number_of_commits number_of_closed_pull_requests
20180410 liks79 flask 1
20180410 dengyifan flask 1
20180410 xeriok18600 flask 1
20180410 manhhomienbienthuy flask 10
20180410 davidism flask 49
20180410 pallets flask 10 6 3
The total size of these (compressed) csv files is 3.5GB for the time range from Jan 2017 to April 2018.
Â
Then there is the ETL in app/data_integration/pipelines that transforms this data into a classic Kimball-like star schema:
It shows 4 database schemas, each created by a different pipeline:
time
: All days from the beginning of data processing until yesterday,pypi_dim
: PyPI download counts per project version, installer and day,gh_dim
: The number of commits, forks and closed pull requests per Github repository and day,pp_dim
: PyPI and Github metrics merged by day and repository/ project name.
The overall database size of the data warehouse is roughly 100GB for the timerange mentioned above.
Â
With this structure in place, it is then possible to run queries like this:
SELECT
_date, project_name, number_of_downloads, number_of_forks,
number_of_commits, number_of_closed_pull_requests
FROM pp_dim.python_project_activity
JOIN pypi_dim.project ON project_id = project_fk
JOIN time.day ON day_fk = day_id
WHERE project_name = 'flask'
ORDER BY day_fk DESC
LIMIT 10;
_date | project_name | number_of_downloads | number_of_forks | number_of_commits | number_of_closed_pull_requests
------------+--------------+---------------------+-----------------+-------------------+--------------------------------
2018-04-10 | flask | 45104 | 11 | 67 | 3
2018-04-09 | flask | 57177 | 13 | 45 | 4
2018-04-08 | flask | 70392 | 13 | 7 |
2018-04-07 | flask | 65060 | 10 | 7 |
2018-04-06 | flask | 70779 | 7 | 11 | 2
2018-04-05 | flask | 62215 | 6 | 22 |
2018-04-04 | flask | 33116 | 11 | 23 |
2018-04-03 | flask | 39248 | 15 | 27 |
2018-04-02 | flask | 54517 | 14 | 17 |
2018-04-01 | flask | 68685 | 4 | 6 |
(10 rows)
Â
Mara data integration pipelines are visualized and debugged though a web ui. Here, the pipeline github
is run (locally on an old Mac with 2 days of data):
Â
On production, pipelines are run through a cli interface:
Â
Mara ETL pipelines are compeletely transparent, both to stakeholders in terms of applied business logic and to data engineers in terms of runtime behavior.
This is the page in the web ui that visualizes the pipeline pypi
:
It shows
- a graph of all pipeline children with dependencies between them,
- run times of the pipeline and top child nodes over time,
- a list of all child nodes with their average run time and cost,
- system statistics, a timeline and output of the last runs.
Â
Similarly, this the page for the task pypi/transform_project
:
It shows its
- direct upstreams and downstreams,
- run times over time,
- all commands of the task,
- system statistics, a timeline and output of the last runs.
Â
Getting started
Sytem requirements
Python >=3.6 and PostgreSQL >=10 and some smaller packages are required to run the example (and mara in general).
Mac:
$ brew install -v python3
$ brew install -v dialog
$ brew install -v coreutils
$ brew install -v graphviz
Ubuntu 16.04:
$ sudo apt install git dialog coreutils graphviz python3 python3-dev python3-venv
Â
Mara does not run Windows.
Â
On Mac, install Postgresql with brew install -v postgresql
. On Ubuntu, follow these instructions. Also, install the cstore_fdw with brew install cstore_fdw
and postgresql-hll extensions from source.
To optimize PostgreSQL for ETL workloads, update your postgresql.conf along this example.
Start a database client with sudo -u postgres psql postgres
and then create a user with CREATE ROLE root SUPERUSER LOGIN;
(you can use any other name).
Â
Installation
Clone the repository somewhere. Copy the file app/local_setup.py.example
to app/local_setup.py
and adapt to your machine.
Log into PostgreSQL with psql -U root postgres
and create two databases:
CREATE DATABASE example_project_dwh;
CREATE DATABASE example_project_mara;
Hit make
in the root directory of the project. This will
- create a virtual environment in
.venv
, - install all packages from
requirements.txt.freeze
(if you want to create a newrequirements.txt.freeze
fromrequirements.txt
, then runmake update-packages
), - create a number of tables that are needed for running mara.
You can now activate the virtual environment with
$ source .venv/bin/activate
To list all available flask cli commands, run flask
without parameters.
Â
Running the web UI
$ flask run --with-threads --reload --eager-loading
The app is now accessible at http://localhost:5000.
Â
Downloading PyPI and Github data from BigQuery
This step takes many hours to complete. If you don't have time for this (or don't want to go through the hassle of creating Google Cloud credentials), we provide a daily updated copy of the result data sets on s3. You can get (and update) the set of result CSVs with
$ make sync-bigquery-csv-data-sets-from-s3
Â
To download the data yourself, follow the instructions in the README of the mara/bigquery-downloader package to get a Google Cloud credentials file. Store it in app/bigquery_downloader/bigquery-credentials.json
.
Then run the downloader in an activated virtual environment with:
$ flask bigquery_downloader.download_data
Â
Running the ETL
For development, it is recommended to run the ETL from the web UI (see above). On production, use flask data_integration.ui.run
to run a pipeline or a set of its child nodes.
The command data_integration.ui.run_interactively
provides an ncurses-based menu for selecting and running pipelines.
Â
Documentation
Documentation is work in progress. But the code base is quite small and documented.