  • Stars: 325
  • Rank: 129,350 (Top 3 %)
  • Language: Python
  • License: Other
  • Created about 6 years ago
  • Updated 7 months ago

Repository Details

sgr (command line client for Splitgraph) and the splitgraph Python library

sgr

Overview

sgr is the CLI for Splitgraph, a serverless API for data-driven Web applications.

With the addition of the optional sgr Engine component, sgr can become a stand-alone tool for building, versioning and querying reproducible datasets. We use it as the storage engine for Splitgraph. It's inspired by Docker and Git, so it feels familiar. And it's powered by PostgreSQL, so it works seamlessly with existing tools in the Postgres ecosystem. Use sgr to package your data into self-contained Splitgraph data images that you can share with other sgr instances.

To install the sgr CLI or a local sgr Engine, see the Installation section of this readme.

Build and Query Versioned, Reproducible Datasets

Splitfiles give you a declarative language, inspired by Dockerfiles, for expressing data transformations in ordinary SQL familiar to any researcher or business analyst. You can reference other images, or even other databases, with a simple JOIN.

When you build data images with Splitfiles, you get provenance tracking of the resulting data: it's possible to find out what sources went into every dataset and know when to rebuild it if the sources ever change. You can easily integrate sgr into your existing CI pipelines to keep your data up-to-date and stay on top of changes to upstream sources.
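
To make the rebuild idea concrete, here is a minimal Python sketch of provenance tracking. It is not sgr's implementation: the `Provenance` record, the `stale_sources` and `needs_rebuild` helpers, and the repository names and hashes are all hypothetical, invented for illustration. It assumes each built dataset remembers the image hash of every source it read from.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Provenance:
    """Hypothetical record of the source images a dataset was built from."""
    sources: Dict[str, str]  # repository name -> image hash used at build time


def stale_sources(built: Provenance, upstream: Dict[str, str]) -> List[str]:
    """Return sources whose current upstream hash differs from the recorded one."""
    return sorted(
        repo
        for repo, used_hash in built.sources.items()
        if upstream.get(repo) != used_hash
    )


def needs_rebuild(built: Provenance, upstream: Dict[str, str]) -> bool:
    """A dataset needs rebuilding as soon as any of its sources has moved on."""
    return bool(stale_sources(built, upstream))


# Made-up repositories and hashes, purely for illustration.
built = Provenance(sources={"example/weather": "a1b2", "example/stations": "c3d4"})
upstream_now = {"example/weather": "e5f6", "example/stations": "c3d4"}
```

A CI job could run a check like `needs_rebuild(built, upstream_now)` on a schedule and trigger a rebuild only when it returns true.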

Splitgraph images are also version-controlled, and you can manipulate them with Git-like operations through a CLI. You can check out any image into a PostgreSQL schema and interact with it using any PostgreSQL client. sgr will capture your changes to the data, and then you can commit them as delta-compressed changesets that you can package into new images.
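
The changeset idea can be sketched in a few lines of Python. This is an illustration of the general technique only, not how sgr stores data: tables are modelled as plain dictionaries keyed by primary key, and the `diff` and `apply_changes` names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Row = Tuple  # the non-key column values of one row
Table = Dict[int, Row]  # primary key -> row
Changeset = Dict[int, Optional[Row]]  # primary key -> new row, or None if deleted


def diff(old: Table, new: Table) -> Changeset:
    """Record only the rows that were added, changed, or deleted."""
    changes: Changeset = {}
    for pk, row in new.items():
        if pk not in old or old[pk] != row:
            changes[pk] = row  # inserted or updated
    for pk in old:
        if pk not in new:
            changes[pk] = None  # deleted
    return changes


def apply_changes(base: Table, changes: Changeset) -> Table:
    """Rebuild the newer snapshot from the older one plus the changeset."""
    result = dict(base)
    for pk, row in changes.items():
        if row is None:
            result.pop(pk, None)
        else:
            result[pk] = row
    return result


# Two made-up snapshots of the same table.
v1 = {1: ("apple", 3), 2: ("pear", 5), 3: ("plum", 1)}
v2 = {1: ("apple", 4), 3: ("plum", 1), 4: ("fig", 9)}
delta = diff(v1, v2)
```

The unchanged row (key 3) does not appear in `delta`, which is why storing a new version as a changeset can be much smaller than storing a full copy.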

sgr supports PostgreSQL foreign data wrappers. We call this feature mounting. With mounting, you can query other databases (like PostgreSQL/MongoDB/MySQL) or open data providers (like Socrata) from your sgr instance with plain SQL. You can even snapshot the results or use them in Splitfiles.

Components

The code in this repository contains:

  • sgr CLI: sgr is the main command line tool used to work with Splitgraph "images" (data snapshots). Use it to ingest data, work with Splitfiles, and push data to Splitgraph.
  • sgr Engine: a Docker image of the latest Postgres with sgr and other required extensions pre-installed.
  • Splitgraph Python library: All sgr functionality is available in the Python API, offering first-class support for data science workflows including Jupyter notebooks and Pandas dataframes.

Docs

We also recommend reading our Blog.

Installation

Pre-requisites:

  • Docker is required to run the sgr Engine. sgr must have access to Docker. You either need to install Docker locally or have access to a remote Docker socket.

You can get the sgr single binary from the releases page. Optionally, you can run sgr engine add to create an engine.

For Linux and OSX, once Docker is running, install sgr with a single script:

$ bash -c "$(curl -sL https://github.com/splitgraph/sgr/releases/latest/download/install.sh)"

This will download the sgr binary and set up the sgr Engine Docker container.

See the installation guide for more installation methods.

Quick start guide

The quick start guide walks you through the basics of using sgr, either with Splitgraph or standalone.

Alternatively, sgr comes with plenty of examples to get you started.

If you're stuck or have any questions, check out the documentation or join our Discord channel!

Contributing

Setting up a development environment

  • sgr requires Python 3.7 or later.
  • Install Poetry to manage dependencies: curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
  • Install pre-commit hooks (we use Black to format code)
  • git clone --recurse-submodules https://github.com/splitgraph/sgr.git
  • poetry install
  • To build the engine Docker image: cd engine && make
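
Before working through the steps above, a quick preflight check can catch a missing prerequisite. The Python sketch below is not part of sgr: `check_environment` is a hypothetical helper, and the list of tools is only inferred from the commands mentioned in the steps (git, poetry, docker, make).

```python
import shutil
import sys
from typing import Callable, List, Optional, Sequence, Tuple

MIN_PYTHON = (3, 7)  # from the requirement stated in the steps above
REQUIRED_TOOLS = ("git", "poetry", "docker", "make")  # inferred, not authoritative


def check_environment(
    version: Tuple[int, ...] = tuple(sys.version_info[:2]),
    which: Callable[[str], Optional[str]] = shutil.which,
    tools: Sequence[str] = REQUIRED_TOOLS,
) -> List[str]:
    """Return a list of human-readable problems; an empty list means all good."""
    problems = []
    if tuple(version[:2]) < MIN_PYTHON:
        problems.append("Python %d.%d or later is required" % MIN_PYTHON)
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append("%s was not found on PATH" % tool)
    return problems


if __name__ == "__main__":
    for problem in check_environment():
        print(problem)
```

The `version` and `which` parameters exist only so the check can be exercised without changing the real environment.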

Running tests

The test suite requires docker-compose. You will also need to add these lines to your /etc/hosts or equivalent:

127.0.0.1       local_engine
127.0.0.1       remote_engine
127.0.0.1       objectstorage
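
If the tests cannot reach the engines, a missing hosts entry is a likely cause. The following Python sketch checks hosts-file text for the three names above; `missing_hosts` is a hypothetical helper written for this illustration, not something shipped with sgr.

```python
from typing import Iterable, List

REQUIRED_HOSTS = ("local_engine", "remote_engine", "objectstorage")


def missing_hosts(hosts_text: str, required: Iterable[str] = REQUIRED_HOSTS) -> List[str]:
    """Return the required names that are not mapped to 127.0.0.1."""
    mapped = set()
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0]  # ignore comments and commented-out entries
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "127.0.0.1":
            mapped.update(fields[1:])  # one line may list several names
    return [name for name in required if name not in mapped]


# Made-up hosts file content with one entry present and one commented out.
sample = """127.0.0.1 localhost
127.0.0.1       local_engine
# 127.0.0.1     remote_engine
"""
```

On a real machine, the contents of /etc/hosts would be passed in place of `sample`, for example `missing_hosts(open("/etc/hosts").read())`.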

To run the core test suite, do

docker-compose -f test/architecture/docker-compose.core.yml up -d
poetry run pytest -m "not mounting and not example"

To run the test suite related to "mounting" and importing data from other databases (PostgreSQL, MySQL, Mongo), do

docker-compose -f test/architecture/docker-compose.core.yml -f test/architecture/docker-compose.mounting.yml up -d
poetry run pytest -m mounting

Finally, to test the example projects, do

# Example projects spin up their own engines
docker-compose -f test/architecture/docker-compose.core.yml -f test/architecture/docker-compose.mounting.yml down -v
poetry run pytest -m example

All of these tests run in CI.
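
For local runs, the core and mounting suites follow the same pattern: bring up a set of compose files, then select tests by pytest marker. The Python sketch below only assembles the command lines quoted above; the `SUITES` table and `commands_for` function are hypothetical conveniences, not part of the repository. The example suite is left out because it tears the stack down instead of bringing it up.

```python
import shlex
from typing import Dict, List, Tuple

COMPOSE_DIR = "test/architecture"

# suite name -> (compose file stems, pytest marker expression)
SUITES: Dict[str, Tuple[List[str], str]] = {
    "core": (["core"], "not mounting and not example"),
    "mounting": (["core", "mounting"], "mounting"),
}


def _join(args: List[str]) -> str:
    """Quote each argument so the result can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in args)


def commands_for(suite: str) -> List[str]:
    """Return the two shell commands needed to run the given suite."""
    compose_files, marker = SUITES[suite]
    compose = ["docker-compose"]
    for stem in compose_files:
        compose += ["-f", "%s/docker-compose.%s.yml" % (COMPOSE_DIR, stem)]
    compose += ["up", "-d"]
    pytest_cmd = ["poetry", "run", "pytest", "-m", marker]
    return [_join(compose), _join(pytest_cmd)]
```

Each returned string could be printed for copy and paste, or split with `shlex.split` and handed to `subprocess.run`.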

More Repositories

1. seafowl (Rust, 425 stars)
   Analytical database for data-driven Web applications

2. seafowl-gcsfuse (Dockerfile, 39 stars)
   Scale to zero Seafowl hosting with Cloud Run

3. splitgraph-llm-demo (Python, 6 stars)

4. madatdata (TypeScript, 6 stars)
   😠 📈 Madatdata ("mad at data") is a TypeScript library for managing and querying SQL databases (so far including Seafowl and Splitgraph, but with an interface that makes it easy to add plugins for other databases).

5. experimental-datafusion-webassembly (4 stars)
   proof-of-concept: compile datafusion to `wasm32-wasi` (run in `wasmedge`) and `wasm32-unknown-unknown` (run in browser)

6. open-data-monitor (TypeScript, 4 stars)
   A newsfeed of open government datasets, tracking when they're added or deleted. Built with data from Splitgraph, deployed to Seafowl on Fly.io

7. engine (Dockerfile, 4 stars)
   Splitgraph engine Docker image packaging

8. dbt-transform-example (3 stars)
   Example of a dbt transform on Splitgraph Cloud with Github Actions

9. seafowl-udf-rust (Rust, 2 stars)
   Example User Defined Function (UDF) for Seafowl in Rust.

10. seafowl-udf-go (Go, 2 stars)
    Example User Defined Function (UDF) for Seafowl in Go

11. splitgraph.com (TypeScript, 2 stars)
    DEPRECATED: the marketing and documentation website is currently in a private repo. Please open any issues related to documentation in the relevant seafowl or sgr repos.

12. splitgraph-chatgpt-plugin (Python, 2 stars)
    A ChatGPT plugin for searching the Splitgraph Data Delivery Network using natural language questions.

13. lakehouse-loader (Rust, 1 star)
    CLI utility to load data into Delta Lake and other lakehouse formats

14. yarn-plugin-pin-deps (JavaScript, 1 star)
    Yarn plugin to pin dependencies to their currently resolved version. Available for Yarn v2 and Yarn v3