• This repository has been archived on 04/Jul/2023
  • Stars
    star
    222
  • Rank 179,123 (Top 4 %)
  • Language
    Rust
  • License
    Other
  • Created almost 3 years ago
  • Updated over 1 year ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

Streaming and Incremental Computation Framework

License: MIT CI workflow codecov

Latest News

The development for this project is now done at https://github.com/feldera/dbsp.

Database Stream Processor

Database Stream Processor (DBSP) is a framework for computing over data streams that aims to be more expressive and performant than existing streaming engines.

DBSP Mission Statement

Computing over streaming data is hard. Streaming computations operate over changing inputs and must continuously update their outputs as new inputs arrive. They must do so in real time, processing new data and producing outputs with a bounded delay.

We believe that software engineers and data scientists who build streaming data pipelines should not be exposed to this complexity. They should be able to express their computations as declarative queries and use a streaming engine to evaluate these queries correctly and efficiently. DBSP aims to be such an engine. To this end we set the following high-level objectives:

  1. Full SQL support and more. While SQL is just the first of potentially many DBSP frontends, it offers a reference point to characterize the expressiveness of the engine. Our goal is to support the complete SQL syntax and semantics, including joins and aggregates, correlated subqueries, window functions, complex data types, time series operators, UDFs, etc. Beyond standard SQL, DBSP supports recursive queries, which arise for instance in graph analytics problems.

  2. Scalability in multiple dimensions. The engine scales with the number and complexity of queries, streaming data rate and the amount of state the system maintains in order to process the queries.

  3. Performance out of the box. The user should be able to focus on the business logic of their application, leaving it to the system to evaluate this logic efficiently.

Theory

The above objectives can only be achieved by building on a solid mathematical foundation. The formal model that underpins our system, also called DBSP, is described in the accompanying paper:

The model provides two things:

  1. Semantics. DBSP defines a formal language of streaming operators and queries built out of these operators, and precisely specifies how these queries must transform input streams to output streams.

  2. Algorithm. DBSP also gives an algorithm that takes an arbitrary query and generates a dataflow program that implements this query correctly (in accordance with its formal semantics) and efficiently. Efficiency here means, in a nutshell, that the cost of processing a set of input events is proportional to the size of the input rather than the entire state of the database.

DBSP Concepts

DBSP unifies two kinds of streaming data: time series data and change data.

  • Time series data can be thought of as an infinitely growing log indexed by time.

  • Change data represents updates (insertions, deletions, modifications) to some state modeled as a table of records.

In DBSP, a time series is just a table where records are only ever added and never removed or modified. As a result, this table can grow unboundedly; hence most queries work with subsets of the table within a bounded time window. DBSP does not need to wait for all data within a window to become available before evaluating a query (although the user may choose to do so): like all queries, time window queries are updated on the fly as new inputs become available. This means that DBSP can work with arbitrarily large windows as long as they fit within available storage.

DBSP queries are composed of the following classes of operators that apply to both time series and change data:

  1. Per-record operators that parse, validate, filter, transform data streams one record at a time.

  2. The complete set of relational operators: select, project, join, aggregate, etc.

  3. Recursion: Recursive queries express iterative computations, e.g., partitioning a graph into strongly connected components. Like all DBSP queries, recursive queries update their outputs incrementally as new data arrives.

In addition, DBSP supports windowing operators that group time series data into time windows, including various forms of tumbling and sliding windows, windows driven by watermarks, etc.

Architecture

The following diagram shows the architecture of the DBSP platform. Solid blocks indicate components that we are currently working on; white blocks with dashed borders are on our TODO list.

DBSP architecture

The DBSP core engine is written in Rust and provides a Rust API for building data-parallel dataflow programs by instantiating and connecting streaming operators. Developers can use this API directly to implement complex streaming queries. We are also developing a compiler from SQL to DBSP that enables engineers and data scientists to use the engine via a familiar query language. In the future, we will add DBSP bindings for languages like Python and Scala.

At runtime, DBSP can consume inputs from and send outputs to event streams, e.g., Kafka, databases, e.g., Postgres, and data warehouses, e.g., Snowflake.

The distributed runtime will extend DBSP's data-parallel execution model to multiple nodes for high availability and throughput.

Applications

TODO

Documentation

The project is still in its early days. API and internals documentation is coming soon.

TODO

Contributing

Setting up git hooks

Execute the following command to make git commit check the code before commit:

GITDIR=$(git rev-parse --git-dir)
ln -sf $(pwd)/tools/pre-push ${GITDIR}/hooks/pre-push

Running Benchmarks against DBSP

The repository has a number of benchmarks available in the benches directory that provide a comparison of DBSP's performance against a known set of tests.

Each benchmark has its own options and behavior, as outlined below.

Nexmark Benchmark

You can run the complete set of Nexmark queries, with the default settings, with:

cargo bench --bench nexmark --features with-nexmark

By default this will run each query with a total of 100 million events emitted at 10M per second (by two event generator threads), using 2 CPU cores for processing the data.

To run just the one query, q3, with only 10 million events, but using 8 CPU cores to process the data and 6 event generator threads, you can run:

cargo bench --bench nexmark --features with-nexmark -- --query q3 --max-events 10000000 --cpu-cores 8 --num-event-generators 6

For further options that you can use with the Nexmark benchmark,

cargo bench --bench nexmark --features with-nexmark -- --help

An extensive blog post about the implementation of Nexmark in DBSP: https://liveandletlearn.net/post/vmware-take-3-experience-with-rust-and-dbsp/

More Repositories

1

kubeless

Kubernetes Native Serverless Framework
Go
6,867
star
2

clarity

Clarity is a scalable, accessible, customizable, open source design system built with web components. Works with any JavaScript framework, built for enterprises, and designed to be inclusive.
TypeScript
6,456
star
3

octant

Highly extensible platform for developers to better understand the complexity of Kubernetes clusters.
Go
6,247
star
4

kubewatch

Watch k8s events and trigger Handlers
Go
2,448
star
5

scripted

The Scripted code editor
JavaScript
1,564
star
6

eventrouter

A simple introspective kubernetes service that forwards events to a specified sink.
Go
873
star
7

tgik

Official repository for TGI Kubernetes (TGIK)!
Shell
828
star
8

kube-prod-runtime

A standard infrastructure environment for Kubernetes
Jsonnet
776
star
9

healthcheck

A library for implementing Kubernetes liveness and readiness probe handlers in your Go application.
Go
675
star
10

pivotal_workstation

A cookbook of recipes for an OSX workstation
662
star
11

cabin

The Mobile Dashboard for Kubernetes
JavaScript
659
star
12

dispatch

Dispatch is a framework for deploying and managing serverless style applications.
Go
535
star
13

buildkit-cli-for-kubectl

BuildKit CLI for kubectl is a tool for building container images with your Kubernetes cluster
Go
491
star
14

haret

A strongly consistent distributed coordination system, built using proven protocols & implemented in Rust.
Rust
462
star
15

concourse-pipeline-samples

Sample code and recipes for Concourse CI pipelines and deployments.
Shell
447
star
16

cascade

A Just-In-Time Compiler for Verilog from VMware Research
C++
430
star
17

projectmonitor

Big Visible Chart CI aggregator
Ruby
427
star
18

kubeless-ui

Graphical User Interface for Kubeless
JavaScript
417
star
19

halite

DEPRECATED: A client-side web application interface to a running Salt infrastructure
Python
413
star
20

gangway

An application that can be used to easily enable authentication flows via OIDC for a kubernetes cluster.
Go
407
star
21

salty-vagrant

Use Salt as a Vagrant provisioner.
Shell
373
star
22

liota

Python
336
star
23

lightwave

Identity services for traditional infrastructure, applications and containers.
C
321
star
24

rbvmomi

Ruby interface to the VMware vSphere API.
Ruby
302
star
25

git_scripts

Developer workflow convenience scripts
Ruby
279
star
26

pcfdev

This is the depricated version of PCF Dev - please visit the current Github repository https://github.com/cloudfoundry-incubator/cfdev for the latest updates
Go
273
star
27

springsource-cloudfoundry-samples

Samples for Cloud Foundry
Java
259
star
28

admiral

Container management solution with an accent on modeling containerized applications and provide placement based on dynamic policy allocation
Java
254
star
29

vsphere-storage-for-docker

vSphere Storage for Docker
Python
254
star
30

salt-vim

Vim files for editing Salt files
Vim Script
246
star
31

rvc

RVC is a Linux console UI for vSphere, built on the RbVmomi bindings to the vSphere API.
Ruby
238
star
32

xenon

Xenon - Decentralized Control Plane Framework
Java
226
star
33

raet

Reliable Asynchronous Event Transport Protocol
Python
208
star
34

salt-cloud

Salt Cloud Working group.
200
star
35

vsphere-automation-sdk-rest

REST (Postman and JavaScript) samples and API reference documentation for vSphere using the VMware REST API
197
star
36

cloud-init-vmware-guestinfo

A cloud-init datasource for VMware vSphere's GuestInfo interface
Python
192
star
37

jsunit

The original unit-testing framework for JavaScript. These days we use Jasmine (http://github.com/pivotal/jasmine) by default for JS testing; JsUnit is not actively developed or supported.
Java
173
star
38

sql_magic

Magic functions for using Jupyter Notebook with Apache Spark and a variety of SQL databases.
Jupyter Notebook
171
star
39

purser

Kubernetes Cloud Native Applications visibility
Go
171
star
40

pyvcloud

Python SDK for VMware vCloud Director
Python
170
star
41

salt-contrib

Salt Module Contributions
Python
170
star
42

powernsx

PowerShell module that abstracts the VMware NSX-v API to a set of easily used PowerShell functions
PowerShell
170
star
43

p4c-xdp

Backend for the P4 compiler targeting XDP
C
166
star
44

webcommander

Powerful, flexible, intuitive and most importantly simple. That is what a real automation solution should be. No matter how complicated the task is, we'd like to turn it into a single click. Is that possible? Not without webcommander :)
PowerShell
166
star
45

vcd-cli

Command Line Interface for VMware vCloud Director
Python
164
star
46

wardroom

A tool for creating Kubernetes-ready base operating system images.
Python
162
star
47

pcf-pipelines

PCF Pipelines
Shell
158
star
48

spring-boot-cities

A Spring Boot + Spring Data + Spring Cloud Connectors demo app
Java
149
star
49

kube-manifests

A collection of misc Kubernetes configs for various jobs, as used in Bitnami's production clusters.
Jsonnet
136
star
50

ktx

manage kubernetes cluster configs
Shell
133
star
51

pg_rewind

Tool for resynchronizing a Postgres database after failover
125
star
52

AndroidIntelliJStarter

An IntelliJ template project for android developers, pre-configured to work with Robolectric, Roboguice, an other common, useful Android libraries.
Java
125
star
53

vctl-docs

VMware vctl Docs
124
star
54

cimonitor

This project has been renamed to ProjectMonitor - http://github.com/pivotal/projectmonitor
121
star
55

tmux-config

Configuration and tools for tmux. Can be used as a Vim plugin.
Shell
121
star
56

PivotalMySQLWeb

PivotalMySQL*Web is a free Pivotal open source project, intended to handle the administration of a Pivotal MySQL Service Instance over the Web
JavaScript
120
star
57

salt-api

RETIRED: Generic, modular network access system
Python
112
star
58

atc

old - now lives in https://github.com/concourse/concourse
111
star
59

nsxansible

A set of example Ansible Modules using the above two projects as the basis
Python
110
star
60

pg2mysql

Tool for safely migrating from PostgreSQL to MySQL
Go
107
star
61

clarity-seed

This is a repository for a seed project that includes Clarity Design System's dependencies.
TypeScript
104
star
62

helm-crd

Experimental CRD controller for managing Helm releases
Go
103
star
63

fly

old - now lives in https://github.com/concourse/concourse
100
star
64

declarative-cluster-management

Declarative cluster management using constraint programming, where constraints are described using SQL.
Java
99
star
65

hillview

Big data spreadsheet
Java
99
star
66

cbapi

Carbon Black API Resources
Python
94
star
67

vmware-vcenter

VMware vCenter Module
Ruby
87
star
68

ModSecurity-envoy

ModSecurity V3 Envoy Filter
C++
86
star
69

springtrader

JavaScript
83
star
70

tic

Bit9 + Carbon Black Threat Intelligence
Python
80
star
71

runtimes

Kubeless function runtimes: https://kubeless.io/docs/runtimes/
C#
79
star
72

pyvmomi-tools

Additional community developed python packages to help you work with pyvmomi
Python
77
star
73

gpdb-sandbox-tutorials

76
star
74

salt-windows-install

Open source installer for Windows
75
star
75

vagrant-vmware-appcatalyst

Vagrant provider for VMware AppCatalyst®
Ruby
73
star
76

transport-go

Transport is a full stack, simple, fast, expandable application event bus for your applications. It provides a standardized and simple API, implemented in multiple languages, to allow any individual component inside your applications to talk to one another. This is the Golang implementation of the Transport library.
Go
72
star
77

concord

🧱⛓️ A scalable decentralized blockchain
C++
71
star
78

terraforming-gcp

use terraform, deploy yourself a pcf
HCL
71
star
79

functions

Functions Repository for Kubeless
Python
70
star
80

Pivotal-Preferences-RubyMine

This repo is deprecated. Use the "Pivotal IDE Prefs" repo instead.
70
star
81

IoT-ConnectedCar

HTML
69
star
82

ironclad

Web Application Firewall (WAF) on Kubernetes
Go
69
star
83

vsphere-automation-sdk-.net

[DEPRECATED] Please see README. C# samples, language bindings, and API reference documentation for vSphere, VMC, and NSX-T using the VMware REST API
C#
67
star
84

pymadlib

A Python wrapper for MADlib(http://madlib.net) - an open source library for scalable in-database machine learning algorithms
Jupyter Notebook
65
star
85

chaperone

Python
64
star
86

bin

old - now lives in https://github.com/concourse/concourse
64
star
87

terraforming-aws

Templates to deploy PCF and PKS
HCL
64
star
88

legacy-terraform-provider-vra7

Terraform provider for vRealize Automation 7
Go
62
star
89

tutorials

PHP
59
star
90

nsxraml

A RAML Specification Describing the NSX for vSphere API
HTML
59
star
91

vra-api-samples-for-postman

API use case samples in Postman Rest Client collection format.
58
star
92

simple-k8s-test-env

For developers building and testing Kubernetes and core Kubernetes components
Shell
58
star
93

vm-operator-api

A client API for the VM Operator project, designed to allow for integration with vSphere 7 with Kubernetes
Go
58
star
94

gcp-pcf-quickstart

Install Pivotal Cloud Foundry on Google Cloud Platform With One Command
Go
56
star
95

sunspot_matchers

RSpec matchers for testing Sunspot searches
Ruby
56
star
96

ansible-security-hardening

ansible playbooks for linux distro security hardening
56
star
97

lobot

This project has been renamed to ciborg. Please visit the ciborg page for more info.
Ruby
56
star
98

salt-pack

Salt Package Builder
Shell
55
star
99

sublime-text

Salt-related syntax highlighting and snippets for Sublime Text
JavaScript
54
star
100

pynsxv

PyNSXv is a high level python based library and CLI tool to control NSX for vSphere
Python
54
star