sgr

Overview

sgr is the CLI for Splitgraph, a serverless API for data-driven Web applications.
With the addition of the optional sgr Engine component, sgr can become a stand-alone tool for building, versioning and querying reproducible datasets. We use it as the storage engine for Splitgraph. It's inspired by Docker and Git, so it feels familiar. And it's powered by PostgreSQL, so it works seamlessly with existing tools in the Postgres ecosystem. Use sgr to package your data into self-contained Splitgraph data images that you can share with other sgr instances.
To install the sgr CLI or a local sgr Engine, see the Installation section of this readme.
Build and Query Versioned, Reproducible Datasets
Splitfiles give you a declarative language, inspired by Dockerfiles, for expressing data transformations in ordinary SQL familiar to any researcher or business analyst. You can reference other images, or even other databases, with a simple JOIN.
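As a sketch of what this looks like (the repository, file, and table names below are hypothetical; see the Splitfile documentation for the full grammar), you can write a Splitfile that imports a table from another image and transforms it, then build it with sgr build:

# Hypothetical Splitfile: import a table from another image, then transform it
cat > rainfall.splitfile <<'EOF'
FROM some_user/raw_weather IMPORT {SELECT * FROM rainfall WHERE year >= 2000} AS rainfall
SQL {
    CREATE TABLE yearly_summary AS
        SELECT year, AVG(amount) AS avg_rainfall
        FROM rainfall
        GROUP BY year
}
EOF

# Build the Splitfile into a new image
sgr build rainfall.splitfile -o some_user/rainfall_summary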
When you build data images with Splitfiles, you get provenance tracking of the resulting data: it's possible to find out what sources went into every dataset and know when to rebuild it if the sources ever change. You can easily integrate sgr into your existing CI pipelines to keep your data up-to-date and stay on top of changes to upstream sources.
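For example, a CI job could use an image's provenance to decide when to rebuild it. A minimal sketch, reusing the hypothetical names from the Splitfile above (check sgr rebuild --help for the exact flags in your version):

# List the source images that some_user/rainfall_summary was built from
sgr provenance some_user/rainfall_summary:latest

# Rebuild the image against the latest versions of its sources
sgr rebuild some_user/rainfall_summary:latest --update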
Splitgraph images are also version-controlled, and you can manipulate them with Git-like operations through a CLI. You can check out any image into a PostgreSQL schema and interact with it using any PostgreSQL client. sgr will capture your changes to the data, and then you can commit them as delta-compressed changesets that you can package into new images.
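A minimal sketch of that workflow, using a hypothetical my_user/demo repository:

# Create an empty repository and make some changes to it
sgr init my_user/demo
sgr sql --schema my_user/demo "CREATE TABLE fruit (id integer, name text)"
sgr sql --schema my_user/demo "INSERT INTO fruit VALUES (1, 'apple')"

# Package the pending changes into a new image and inspect the history
sgr commit my_user/demo
sgr log my_user/demo

# Check any image back out into a PostgreSQL schema
sgr checkout my_user/demo:latest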
sgr supports PostgreSQL foreign data wrappers. We call this feature mounting. With mounting, you can query other databases (like PostgreSQL/MongoDB/MySQL) or open data providers (like Socrata) from your sgr instance with plain SQL. You can even snapshot the results or use them in Splitfiles.
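As a rough sketch (the credentials and schema names below are made up, and each mount handler takes its own options, documented in the mounting reference):

# Mount a schema from a remote PostgreSQL database into the local engine
sgr mount postgres_fdw remote_data -c user:password@remote-host:5432 -o '{"dbname": "external_db", "remote_schema": "public"}'

# Query the mounted tables with plain SQL
sgr sql "SELECT * FROM remote_data.some_table LIMIT 10"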
Components
The code in this repository contains:
- sgr CLI: sgr is the main command line tool used to work with Splitgraph "images" (data snapshots). Use it to ingest data, work with Splitfiles, and push data to Splitgraph.
- sgr Engine: a Docker image of the latest Postgres with sgr and other required extensions pre-installed.
- Splitgraph Python library: all sgr functionality is available in the Python API, offering first-class support for data science workflows, including Jupyter notebooks and Pandas dataframes.
Docs
We also recommend reading our Blog, including some of our favorite posts:
- Supercharging dbt with sgr: versioning, sharing, cross-DB joins
- Querying 40,000+ datasets with SQL
- Foreign data wrappers: PostgreSQL's secret weapon?
Installation
Pre-requisites:
- Docker is required to run the sgr Engine. sgr must have access to Docker: you either need to install Docker locally or have access to a remote Docker socket.
You can get the sgr single binary from the releases page. Optionally, you can run sgr engine add to create an engine.
For Linux and OSX, once Docker is running, install sgr with a single script:
$ bash -c "$(curl -sL https://github.com/splitgraph/sgr/releases/latest/download/install.sh)"
This will download the sgr binary and set up the sgr Engine Docker container.
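To check that the installation worked (a quick sanity check; these commands assume a recent sgr version and a default local setup):

# The CLI should report its version, and the engine container should be listed
sgr --version
sgr engine list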
See the installation guide for more installation methods.
Quick start guide
You can follow the quick start guide, which walks you through the basics of using sgr with Splitgraph or standalone.

Alternatively, sgr comes with plenty of examples to get you started.
If you're stuck or have any questions, check out the documentation or join our Discord channel!
Contributing
Setting up a development environment
- sgr requires Python 3.7 or later.
- Install Poetry to manage dependencies:
curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/get-poetry.py | python
- Install pre-commit hooks (we use Black to format code).
- Clone the repository with submodules and install the dependencies:
git clone --recurse-submodules https://github.com/splitgraph/sgr.git
poetry install
- To build the engine Docker image:
cd engine && make
Running tests
The test suite requires docker-compose. You will also need to add these lines to your /etc/hosts or equivalent:
127.0.0.1 local_engine
127.0.0.1 remote_engine
127.0.0.1 objectstorage
To run the core test suite, do
docker-compose -f test/architecture/docker-compose.core.yml up -d
poetry run pytest -m "not mounting and not example"
To run the test suite related to "mounting" and importing data from other databases (PostgreSQL, MySQL, Mongo), do
docker-compose -f test/architecture/docker-compose.core.yml -f test/architecture/docker-compose.mounting.yml up -d
poetry run pytest -m mounting
Finally, to test the example projects, do
# Example projects spin up their own engines
docker-compose -f test/architecture/docker-compose.core.yml -f test/architecture/docker-compose.mounting.yml down -v
poetry run pytest -m example
All of these tests run in CI.