• This repository has been archived on 02/Oct/2024
  • Stars
    star
    236
  • Rank 170,480 (Top 4 %)
  • Language
    Rust
  • License
    Apache License 2.0
  • Created almost 2 years ago
  • Updated about 2 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Distributed SQL Query Engine in Python using Ray

RaySQL: DataFusion on Ray

This is a research project to evaluate performing distributed SQL queries from Python, using Ray and DataFusion.

Goals

  • Demonstrate how easily new systems can be built on top of DataFusion. See the design documentation to understand how RaySQL works.
  • Drive requirements for DataFusion's Python bindings.
  • Create content for an interesting blog post or conference talk.

Non Goals

  • Build and support a production system.

Example

Run the following example live in your browser using a Google Colab notebook.

import ray
from raysql.context import RaySqlContext
from raysql.worker import Worker

# Start our cluster
ray.init()

# create some remote Workers
workers = [Worker.remote() for i in range(2)]

# create a remote context and register a table
ctx = RaySqlContext.remote(workers)
ray.get(ctx.register_csv.remote('tips', 'tips.csv', True))

# Parquet is also supported
# ctx.register_parquet('tips', 'tips.parquet')

result_set = ray.get(ctx.sql.remote('select sex, smoker, avg(tip/total_bill) as tip_pct from tips group by sex, smoker'))
print(result_set)

Status

  • RaySQL can run all queries in the TPC-H benchmark

Features

  • Mature SQL support (CTEs, joins, subqueries, etc) thanks to DataFusion
  • Support for CSV and Parquet files

Limitations

  • Requires a shared file system currently

Performance

This chart shows the performance of RaySQL compared to Apache Spark for SQLBench-H at a very small data set (10GB), running on a desktop (Threadripper with 24 physical cores). Both RaySQL and Spark are configured with 24 executors.

Overall Time

RaySQL is ~1.9x faster overall for this scale factor and environment with disk-based shuffle.

SQLBench-H Total

Per Query Time

Spark is much faster on some queries, likely due to broadcast exchanges, which RaySQL hasn't implemented yet.

SQLBench-H Per Query

Performance Plan

I'm planning on experimenting with the following changes to improve performance:

  • Make better use of Ray futures to run more tasks in parallel
  • Use Ray object store for shuffle data transfer to reduce disk I/O cost
  • Keep upgrading to newer versions of DataFusion to pick up the latest optimizations

Building

# prepare development environment (used to build wheel / install in development)
python3 -m venv venv
# activate the venv
source venv/bin/activate
# update pip itself if necessary
python -m pip install -U pip
# install dependencies (for Python 3.8+)
python -m pip install -r requirements-in.txt

Whenever rust code changes (your changes or via git pull):

# make sure you activate the venv using "source venv/bin/activate" first
maturin develop
python -m pytest

Benchmarking

Create a release build when running benchmarks, then use pip to install the wheel.

maturin develop --release

How to update dependencies

To change test dependencies, change the requirements.in and run

# install pip-tools (this can be done only once), also consider running in venv
python -m pip install pip-tools
python -m piptools compile --generate-hashes -o requirements-310.txt

To update dependencies, run with -U

python -m piptools compile -U --generate-hashes -o requirements-310.txt

More details here

More Repositories

1

bdt

Boring Data Tool
Rust
207
star
2

datafusion-tui

Terminal based, extensible, interactive data analysis tool using SQL
Rust
74
star
3

datafusion-java

Java binding to Apache Arrow DataFusion
Java
63
star
4

datafusion-objectstore-s3

S3 as an ObjectStore for DataFusion
Rust
59
star
5

datafusion-dolomite

Experimental DataFusion Optimizer
Rust
47
star
6

datafusion-federation

Allow DataFusion to resolve queries across remote query engines while pushing down as much compute as possible down.
Rust
47
star
7

datafusion-table-providers

DataFusion TableProviders for reading data from other systems
Rust
46
star
8

datafusion-substrait

Experimental support for serializing DataFusion plans using substrait
Rust
44
star
9

datafusion-orc

Implementation of Apache ORC file format use Apache Arrow in-memory format
Rust
37
star
10

datafusion-functions-json

JSON support for DataFusion (unofficial)
Rust
28
star
11

tpctools

Tools for generating TPC-* datasets
Rust
26
star
12

datafusion-catalogprovider-glue

Rust
21
star
13

datafusion-streams

Rust
20
star
14

sqlbench-runners

SQLBench Runners
Rust
13
star
15

datafusion-hdfs-native

Connecting DataFusion to HDFS based on libhdfs3
Rust
12
star
16

datafusion-sprox

A DataFusion-powered Serverless S3 Proxy.
Rust
12
star
17

datafusion-objectstore-hdfs

HDFS based on Java implementation as a remote ObjectStore for DataFusion
Rust
9
star
18

arrow-zarr

Implementation of Zarr file format in Rust
Rust
8
star
19

datafusion-python

Rust
7
star
20

datafusion-cookbook

Cookbook with recipes for datafusion
Rust
7
star
21

datafusion-wasm-bindings

WASM bindings for DataFusion
Rust
6
star
22

fs-hdfs

C
5
star
23

datafusion-functions-extra

Various additional function packages for Apache DataFusion
Rust
5
star
24

hdfs-native-object-store

Native rust implementation of object_store HDFS
Rust
3
star
25

datafusion-jdbc

Run DataFusion as a JDBC server.
Rust
3
star
26

datafusion-index-provider

Prototype approaches to implementing Indexes in DataFusion
3
star
27

hdfs-native

Rust
2
star
28

datafusion-wasm-playground

Web based playground for Apache DataFusion via WASM
TypeScript
2
star
29

datafusion-async-parquet-index

Example of building external index on parquet files with an remote catalog
1
star