• Stars
    star
    333
  • Rank 126,599 (Top 3 %)
  • Language
    Jupyter Notebook
  • License
    BSD 3-Clause "New...
  • Created over 2 years ago
  • Updated almost 2 years ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

New Generation Opensource Data Stack Demo

ngods stock market demo

This repository contains a stock market analysis demo of the ngods data stack. The demo performs the following steps:

  1. Download selected stock symbols data from Yahoo Finance API.
  2. Store the stock data in ngods data warehouse (using Iceberg format).
  3. Transform the data (e.g. normalize stock prices) using dbt.
  4. Expose analytics data model using cube.dev.
  5. Visualize data as reports and dashboards using Metabase.
  6. Predicts stock prices using ARIMA in Apache Spark.

The demo is packaged as docker-compose script that downloads, installs, and runs all components of the data stack.

UPDATES

  • 2023-02-03:
    • Upgrade to Apache Iceberg 1.1.0
    • Upgrade to Trino 406
    • Migrated to the new JDBC catalog (removed the heavyweigt Hive Metastore)

ngods

ngods stands for New Generation Opensource Data Stack. It includes the following components:

ngods components

ngods is open-sourced under a BSD license and it is distributed as a docker-compose script that supports Intel and ARM architectures.

Running the demo

ngods requires a machine with at least 16GB RAM and Intel or Arm 64 CPU running Docker. It requires docker-compose.

  1. Clone the ngods repo
git clone https://github.com/zsvoboda/ngods-stocks.git
  1. Start the data stack with the docker-compose up command
cd ngods-stocks

docker-compose up -d

NOTE: This can take quite long depending on your network speed.

  1. Stop the data stack via the docker-compose down command
docker-compose down
  1. Execute the data pipeline from the Dagster console at http://localhost:3070/ with this yaml config file.

Dagster e2e

Cut and paste the content of the e2e.yaml file to this Dagster UI console page and start the data pipeline by clicking the Launch Run button.

NOTE: You can customize the list of stock symbols that will be downloaded.

  1. Review and customize the cube.dev metrics, and dimensions. Test these metrics in the cube.dev playground.

cube.dev playground

See the cube.dev documentation for more information.

  1. Check out the Metabase data visualizations that is connected to the cube.dev analytical model. You can run SQL queries on top of the cube.dev schema.

Use username [email protected] and password metabase1.

Metabase

You can create your own data visualizations and dashboards. See the Metabase documentation for more information.

  1. Predict stock close price. Run the ARIMA time-series prediction model notebook that is trained on 29 months of the Apple:AAPL stock data and predicts the next month.

Jupyter ARIMA

  1. Download DBeaver SQL tool.

  2. Connect to the Postgres database that contains the gold stage data. Use jdbc:postgresql://localhost:5432/ngods JDBC URL with username ngods and password ngods.

Postgres JDBC connection

  1. Connect to the Trino database that has access to all data stages (bronze, silver, and gold schemas of the warehouse database). Use jdbc:trino://localhost:8060 JDBC URL with username trino and password trino.

Trino JDBC connection

Trino schemas

  1. Connect to the Spark database that is used for data transformations. Use jdbc:hive2://localhost:10009 JDBC URL with no username and password.

Spark JDBC connection

Customizing the demo

This chapter contains useful information for customizing the demo.

ngods directories

Here are few distribution's directories that you may need to customize:

  • conf configuration of all data stack components
    • cube cube.dev schema (semantic model definition)
  • data main data directory
    • minio root data directory (contains buckets and file data)
    • spark Jupyter notebooks
    • stage file stage data. Spark can access this directory via /var/lib/ngods/stage path.
  • projects dbt, Dagster, and DataHub projects
    • dagster Dagster orchestration project
    • dbt dbt transformations (one project per each medallion stage: bronze, silver, and gold)

ngods endpoints

The data stack has the following endpoints

ngods databases: Spark, Trino, and Postgres

ngods stack includes three database engines: Spark, Trino, and Postgres. Both Spark and Trino have access to Iceberg tables in warehouse.bronze and warehouse.silver schemas. Trino engine can also access the analytics.gold schema in Postgres. Trino can federate queries between the Postgres and Iceberg tables.

The Spark engine is configured for ELT and pyspark data transformations.

Spark

The Trino engine is configured for data federation between the Iceberg and Postgres tables. Additional catalogs can be configured as needed.

Trino

The Postgres database has accesses only to the analytics.gold schema and it is used for executing analytical queries over the gold data.

Demo data pipeline

The demo data pipeline is utilizes the medallion architecture with bronze, silver, and gold data stages.

data pipeline

and consists of the following phases:

  1. Data are downloaded from Yahoo Finance REST API to the local Minio bucket (./data/stage) using this Dagster operation.
  2. The downloaded CSV file is loaded to the bronze stage Iceberg tables (warehouse.bronze Spark schema) using dbt models that are executed in Spark (./projects/dbt/bronze).
  3. Silver stage Iceberg tables (warehouse.silver Spark schema) are created using dbt models that are executed in Spark (./projects/dbt/silver).
  4. Gold stage Postgres tables (analytics.gold Trino schema) are created using dbt models that are executed in Trino (./projects/dbt/gold).

DBT models

All data pipeline phases are orchestrated by Dagster framework. Dagster operations, resources and jobs are defined in the Dagster project.

Dagster console

The pipeline is executed by running the e2e job from the Dagster console at http://localhost:3070/ using this yaml config file

ngods analytics layer

ngods includes cube.dev for semantic data model and Metabase for self-service analytics (dashboards, reports, and visualizations).

Analytics

Analytical (semantic) model is defined in cube.dev and is used for executing analytical queries over the gold data.

cube.dev

Metabase is connected to the cube.dev via SQL API. End users can use it for self-service creation of dashboards, reports, and data visualizations. Metabase is also directly connected to the gold schema in the Postgres database.

Metabase

ngods machine learning

Jupyter Notebooks with Scala, Java and Python backends can be used for machine learning.

Jupyter

Support

Create a github issue if you have any questions.