• Stars
    star
    176
  • Rank 216,940 (Top 5 %)
  • Language
    Python
  • License
    MIT License
  • Created over 6 years ago
  • Updated over 4 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

An example mini data warehouse for python project stats, template for new projects

Mara Example Project

A runnable app that demonstrates how to build a data warehouse with mara. Combines the data-integration and bigquery-downloader libraries with the mara-app framework into a project.

The example ETL integrates PyPi download stats and GitHub repo activitiy metrics into more general Python project activity stats.

The repository is intended to serve as a template for new projects.

 

Example: Python Project Stats

The project uses two data sources:

  1. The PyPI downloads BigQuery data set at https://bigquery.cloud.google.com/dataset/the-psf:pypi (Google login required). It contains each individual package download together with project and client attributes.

  2. The Github archive BigQuery data set at https://bigquery.cloud.google.com/dataset/githubarchive:day. It contains nearly all events that happen to Github repositories.

From both data sources, a set of pre-aggregated and filtered CSVs is incrementally downloaded using the queries in app/bigquery_downloader:

$ gunzip --decompress --stdout data/2018/04/10/pypi/downloads-v1.csv.gz | grep "\tflask\t\|day_id" | head -n 11
day_id	project	project_version	python_version	installer	number_of_downloads
20180410	flask	0.1		bandersnatch	1
20180410	flask	0.2		bandersnatch	1
20180410	flask	0.5		bandersnatch	1
20180410	flask	0.6		bandersnatch	1
20180410	flask	0.8		bandersnatch	1
20180410	flask	0.9		bandersnatch	1
20180410	flask	0.10	2.6	pip	1
20180410	flask	0.11		Browser	1
20180410	flask	0.11		bandersnatch	1
20180410	flask	0.5.1		bandersnatch	1
$ gunzip --decompress --stdout data/2018/04/10/github/repo-activity-v1.csv.gz | grep "\tflask\t\|day_id"
day_id	user	repository	number_of_forks	number_of_commits	number_of_closed_pull_requests
20180410	liks79	flask	1		
20180410	dengyifan	flask		1	
20180410	xeriok18600	flask		1	
20180410	manhhomienbienthuy	flask		10
20180410	davidism	flask		49	
20180410	pallets	flask	10	6	3

The total size of these (compressed) csv files is 3.5GB for the time range from Jan 2017 to April 2018.

 

Then there is the ETL in app/data_integration/pipelines that transforms this data into a classic Kimball-like star schema:

Star schema

It shows 4 database schemas, each created by a different pipeline:

  • time: All days from the beginning of data processing until yesterday,
  • pypi_dim: PyPI download counts per project version, installer and day,
  • gh_dim: The number of commits, forks and closed pull requests per Github repository and day,
  • pp_dim: PyPI and Github metrics merged by day and repository/ project name.

The overall database size of the data warehouse is roughly 100GB for the timerange mentioned above.

 

With this structure in place, it is then possible to run queries like this:

SELECT 
  _date, project_name, number_of_downloads, number_of_forks, 
  number_of_commits, number_of_closed_pull_requests 
FROM pp_dim.python_project_activity 
  JOIN pypi_dim.project ON project_id = project_fk 
  JOIN time.day ON day_fk = day_id 
WHERE project_name = 'flask' 
ORDER BY day_fk DESC 
LIMIT 10;
   _date    | project_name | number_of_downloads | number_of_forks | number_of_commits | number_of_closed_pull_requests 
------------+--------------+---------------------+-----------------+-------------------+--------------------------------
 2018-04-10 | flask        |               45104 |              11 |                67 |                              3
 2018-04-09 | flask        |               57177 |              13 |                45 |                              4
 2018-04-08 | flask        |               70392 |              13 |                 7 |                               
 2018-04-07 | flask        |               65060 |              10 |                 7 |                               
 2018-04-06 | flask        |               70779 |               7 |                11 |                              2
 2018-04-05 | flask        |               62215 |               6 |                22 |                               
 2018-04-04 | flask        |               33116 |              11 |                23 |                               
 2018-04-03 | flask        |               39248 |              15 |                27 |                               
 2018-04-02 | flask        |               54517 |              14 |                17 |                               
 2018-04-01 | flask        |               68685 |               4 |                 6 |                               
(10 rows)

 

Mara data integration pipelines are visualized and debugged though a web ui. Here, the pipeline github is run (locally on an old Mac with 2 days of data):

Mara web ui ETL run

 

On production, pipelines are run through a cli interface:

Mara cli ETL run

 

Mara ETL pipelines are compeletely transparent, both to stakeholders in terms of applied business logic and to data engineers in terms of runtime behavior.

This is the page in the web ui that visualizes the pipeline pypi:

Mara web UI for pipelines

It shows

  • a graph of all pipeline children with dependencies between them,
  • run times of the pipeline and top child nodes over time,
  • a list of all child nodes with their average run time and cost,
  • system statistics, a timeline and output of the last runs.

 

Similarly, this the page for the task pypi/transform_project:

Mara web ui for tasks

It shows its

  • direct upstreams and downstreams,
  • run times over time,
  • all commands of the task,
  • system statistics, a timeline and output of the last runs.

 

Getting started

Sytem requirements

Python >=3.6 and PostgreSQL >=10 and some smaller packages are required to run the example (and mara in general).

Mac:

$ brew install -v python3
$ brew install -v dialog
$ brew install -v coreutils
$ brew install -v graphviz

Ubuntu 16.04:

$ sudo apt install git dialog coreutils graphviz python3 python3-dev python3-venv

 

Mara does not run Windows.

 

On Mac, install Postgresql with brew install -v postgresql. On Ubuntu, follow these instructions. Also, install the cstore_fdw with brew install cstore_fdw and postgresql-hll extensions from source.

To optimize PostgreSQL for ETL workloads, update your postgresql.conf along this example.

Start a database client with sudo -u postgres psql postgres and then create a user with CREATE ROLE root SUPERUSER LOGIN; (you can use any other name).

 

Installation

Clone the repository somewhere. Copy the file app/local_setup.py.example to app/local_setup.py and adapt to your machine.

Log into PostgreSQL with psql -U root postgres and create two databases:

CREATE DATABASE example_project_dwh;
CREATE DATABASE example_project_mara;

Hit make in the root directory of the project. This will

  • create a virtual environment in .venv,
  • install all packages from requirements.txt.freeze (if you want to create a new requirements.txt.freeze from requirements.txt, then run make update-packages),
  • create a number of tables that are needed for running mara.

You can now activate the virtual environment with

$ source .venv/bin/activate

To list all available flask cli commands, run flask without parameters.

 

Running the web UI

$ flask run --with-threads --reload --eager-loading

The app is now accessible at http://localhost:5000.

 

Downloading PyPI and Github data from BigQuery

This step takes many hours to complete. If you don't have time for this (or don't want to go through the hassle of creating Google Cloud credentials), we provide a daily updated copy of the result data sets on s3. You can get (and update) the set of result CSVs with

$ make sync-bigquery-csv-data-sets-from-s3

 

To download the data yourself, follow the instructions in the README of the mara/bigquery-downloader package to get a Google Cloud credentials file. Store it in app/bigquery_downloader/bigquery-credentials.json.

Then run the downloader in an activated virtual environment with:

$ flask bigquery_downloader.download_data

 

Running the ETL

For development, it is recommended to run the ETL from the web UI (see above). On production, use flask data_integration.ui.run to run a pipeline or a set of its child nodes.

The command data_integration.ui.run_interactively provides an ncurses-based menu for selecting and running pipelines.

 

Documentation

Documentation is work in progress. But the code base is quite small and documented.

More Repositories

1

mara-pipelines

A lightweight opinionated ETL framework, halfway between plain scripts and Apache Airflow
Python
2,067
star
2

mara-schema

Mapping of DWH database tables to business entities, attributes & metrics in Python, with automatic creation of flattened tables
Python
70
star
3

mara-db

Lightweight configuration and access to multiple databases in a single project
Python
38
star
4

mara-etl-tools

Utilities for creating ETL pipelines with mara
PLpgSQL
36
star
5

mara-example-project-1

Runnable e-commerce mini data warehouse based on Python, PostgreSQL & Metabase, template for new projects
Python
27
star
6

mara-metabase

Configuration and schema sync for Metabase from Python
Python
19
star
7

mara-data-explorer

Flask based UI for displaying & segmenting a single database table
JavaScript
15
star
8

mara-app

A framework for distributing flask apps across separate packages with minimal dependencies
Python
15
star
9

mara-mondrian

A python integration for the Saiku ad hoc analysis tool
Python
9
star
10

mara-page

Minimal API for defining pages of Flask-based backends independently from the actual backend infrastructure
Python
4
star
11

mara-olist-ecommerce-data

Stores the Olist E-commerce Dataset in PostgreSQL
Python
4
star
12

mara-storage

Configuration of different storages
Python
3
star
13

mara-google-sheet-downloader

Download Google Sheets to a DB table in a mara ETL
Python
3
star
14

mara-markdown-docs

Display markdown documentation in mara
Python
2
star
15

mara-acl

Default ACL implementation for Mara
Python
2
star
16

mara-superset

Configuring Apache Superset from Mara Schema
Python
2
star
17

mara-dbt

Python
1
star