Readings in Stream Processing

A list of articles that are essential to understand stream processing.

Books

Designing Data Intensive Applications. The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Martine Kleppmann This is a book we've been waiting for 10 years. A definitive guide for entering the field of distirbuted systems and stream data processing. This book covers fundamental concepts, techniques, and challenges in keep processing large volumes of data continously.
- Japanese translation of the book 「データ指向アプリケーションデザイン」 is also available
Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing A book by the authors of Streaming 101, Spark Streaming. This book introduces how we can unify batch processing and stream processing within a single system and covers the basic ideas of Stream SQL.

Programming Models for Stream Processing

One SQL to Rule Them All. Edmon Begoli, Tyler Akidau, Fabian Hueske, Julian Hyde, Kathryn Knight, Kenneth Knowles SIGMOD 2019 A proposal to extend the current SQL semantics to support both batch and stream processing by adding time-varying relations (TVR), event time, and materialization control.
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. Y. Yu, et al. OSDI08 The origin of declarative data processing operators (e.g., map, filter, groupBy, etc.) in modern programming languages.
OpenMessaging:Common Use Cases Illustrations of typical stream processing patterns from OpenMessaging.
The world beyond batch: Streaming 101. An article written by the author of Google Dataflow. Streaming is actually a superset of batch processing for unbounded data. This article explains what is unbounded data and how to manage late coming data. You can also learn what streaming systems can do and can't do, and the differences of event times and processing times, and varieties of time-window based processing.
Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. SIGMOD2018. A good summary of challenges around continous stream processing and unifying APIs for batch and stream processing.
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale. NSDI 2013 An approacy for applying micro-batch style stream processing in Spark. This model has been redesigned in Spark 2.0 as Structured Streaming.
- Continuous Applications: Evolving Streaming in Apache Spark 2.0
Drizzle: Fast and Adaptable Stream Processing at Scale. SOSP 2017 An approach for reducing the overhead of the coordination between stream processing tasks.
Dataflow/Beam & Spark: A Programming Model Comparison
ReactiveX. Stream processing patterns for functional programming.
Akka Stream Stream processing DSL for Akka
A Practical Guide to Selecting A Stream Processing Technology. A good explanation of stream processing
Trill: A High-Performance Incremental Query Processor for Diverse Analytics. B. Chandramouli, et al. VLDB 2014

Table Catalog for Stream Processing

Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores Utilizing scalable object strage on the cloud (e.g., S3), Delta Lake provides a single table format with versioning (time-travel) and transaction support.
Iceberg
Big Metadata: When Metadata is Big Data. (VLDB 2021) Columnar table catalog with partition statistics used in Google's BigQuery.
Apache Hudi Apach Hudi provides a file layout for placing both streaming and batch data processing with a transaction support. You can merge fragmented partition data or use it as is for faster real-time data processing. Previously, it was called Uber Hoodie

Incremental Processing in DBMS

What’s the Difference? Incremental Processing with Change Queries in Snowflake (ACM Management of Data 2023) Snowflake introduces CHANGE queries and STREAM table objects to subscribe changes in the table.
dbt: Incremental Models dbt, a tool for compiling a sequence of queries from SQL templates, supports a simple incremental processing with conditional switch of SQL queries.

Watermark Management for Stream Processing

The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. Akidau et al., (Google) VLDB 2015 The original paper of Google Cloud Dataflow, which describes how we can cope with the delay of data arrival (late-coming data) and periodical data processing in a unified API for batch and stream processing. You can also find a summary of this paper at the morning paper
Watermarks in Stream Processing Systems: Semantics and Comparative Analysis of Apache Flink and Google Cloud Dataflow. VLDB 2021. Describes basic definitions of watermarks and shows challenges and trade-offs in managing watermarks.
Watermarking in stream processing | Course in Spark Structured Streaming 3.0 | Lesson 7 A good tutorial explaining the notions of stream processing and watermark management.

Workload Optimization

Towards a Learning Optimizer for Shared Clouds (VLDB 2019). Estimate cardinality models from the previous job executions in order to optimize the overall workloads. This work uses the multi-layer perceptron (MLP) neural network for learning models from query exeuction features (e.g., job name, input cardinality, average row length, input dataset names, etc.)
CrocodileDB: Efficient Database Execution through Intelligent Deferment (CIDR 2020) This paper introduces Intermittent Query Processing (IQP) approach for utilizing the knowledge about new data, query semantics, and users' expectation together to reduce the overall processing cost. It uses Deep Q-Materialization (DQM) to make a tradeoff under a certain resource constraint (e.g., memory, CPUs, storage) to decide how much data will be cached, pre-computed, pre-loaded, etc.
Peregrine: Workload Optimization for Cloud Query Engines (SOCC 2019) Analyzing the workload of historical queries and optimize recurrring queries, similar queries, and coordinating queries by extracing common subexpressions that can be materialized. To support various query engines including Spark, Microsoft has creaetd a common intermediate representation (IR) of workloads.

Iterative Data Processing

Olston, C. et al. 2011. Nova: continuous Pig/Hadoop workflows. (Jun. 2011)
Naiad: A Timely Dataflow system (SOSP13 best paper) Differential data processing developed in Microsoft. Niad Project Page
Apache Flink: Spinning Fast Iterative Data Flows. PVLDB 2012

Incremental Processing with Materialized Views

DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views (VLDB 2012). An approach for incremental view maintenance involving complex queries.
How to Win a Hot Dog Eating Contest: Distributed Incremental View Maintenance with Batch Updates M. Nicolik et al. (SIGMOD 2016) An efficient way for finding delta of delta queries for computing materialized views. For example, computing delta for select distinct x doesn't improve the query performance, so that we should avoid incremental processing for this kinds of queries.
Generalized Scale Independence Through Incremental Precomputation. M. Armbrust, et al. SIGMOD 2013 An approach for guaranteeing response time of queries by classifying query types and preparing materialized views if necessary.
Comet: batched stream processing for data intensive distributed computing (SoCC 10) A basic style of incremental processing
Continuous queries over append-only databases. SIGMOD 1992
Selecting Subexpressions to Materialize at Datacenter Scale. PVLDB 2018 Microsoftr SCOPE - Automatically finding common sub-expressions among queries and materializing their results for reducing the overhead of recurrent queries.
Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google. VLDB 2021 Control the timing of eager materialization of queries based on the user's requirements (Favor freshness or performance)

Stream Log Collection Systems

Fluentd A unified logging layer from various data sources.
Kafka is often used for providing replayable streaming data sources.
Apache Pulsar A distributed pub-sub messaging system originally created at Yahoo!
OpenMessaging Cloud-oriented, simple, flexible, vendor-neutral and language-independent standards for messaging
Uber Hoodie Hybrid storage: Avro for streaming import, Parquet for analysis. This project has been moved to Apache Hudi
MQTT A machine-to-machine (M2M)/"Internet of Things" connectivity protocol.
Robust, Scalable, Real-Time Event Time Series Aggregation at Twitter. SIGMOD2018
- Twitter Heron: Stream Processing at Scale. SIGMOD 2015
- The Unified Logging Infrastructure for Data Analytics at Twitter. VLDB 2012
Questioning the Lambda Architecture A commonly used architecture for managing recent data and archived data. However combining two types of systems for batch and streaming is still painful because analysts need to understand both systems (e.g, Hadoop + Storm, Spark + Spark Streaming, Kafka + other data store)

Real-Time Stream Processing

Real-time stream processing usually means ultra-low latency applications to satisfy SLAs for returning results in a few seconds.

The Stratosphere Platform for Big Data Analytics. Stratospher is a former name of Apache Flink.
The 8 Requirements of Real-Time Stream Processing. M. Stonebraker, et al. SIGMOD Record 2005. A summary is also available in the morning paper
MacroBase: Prioritizing Attention in Fast Data. P. Bailis, et al. SIGMOD 2017. A data analytics engine that prioritizes end-user attention in high-volume fast data streams.
- A prototype implementation on GitHub

Stream SQL

Foundations of Streaming SQL by Tyler Akidau. Good illustrations for understanding how regular table-based SQL and streaming SQL are different.
Microsoft Azure Stream Analytics
Norikra Schema-less Stream Processing with SQL
Esper
KSQL A Streaming SQL Engine for Apache Kafka

GitHub Projects

OpenMessaging
Apache Beam A unified model for defining both batch and streaming data-parallel processing pipelines.
- It was Google Dataflow Open source API for Java
spotify/scio: A Scala API for Google Cloud Dataflow
twitter/heron
Microsoft Naiad
Norikra
KSQL
Apache Apex Unified stream and batch processing engine.

Commercial Services

Stream Ingestion

External Lists

List of projects related to stream-processing: https://github.com/manuzhang/awesome-streaming

xerial/streamdb-readings

xerial

Reviews

Repository Details

Readings in Stream Processing

Books

Programming Models for Stream Processing

Table Catalog for Stream Processing

Incremental Processing in DBMS

Watermark Management for Stream Processing

Workload Optimization

Iterative Data Processing

Incremental Processing with Materialized Views

Stream Log Collection Systems

Real-Time Stream Processing

Stream SQL

GitHub Projects

Commercial Services

Stream Ingestion

External Lists

More Repositories