Readings in Stream Processing
A list of articles that are essential to understand stream processing.
Books
- Designing Data Intensive Applications. The Big Ideas Behind Reliable, Scalable, and Maintainable Systems. Martine Kleppmann This is a book we've been waiting for 10 years. A definitive guide for entering the field of distirbuted systems and stream data processing. This book covers fundamental concepts, techniques, and challenges in keep processing large volumes of data continously.
- Japanese translation of the book 「データ指向アプリケーションデザイン」 is also available
- Streaming Systems: The What, Where, When, and How of Large-Scale Data Processing A book by the authors of Streaming 101, Spark Streaming. This book introduces how we can unify batch processing and stream processing within a single system and covers the basic ideas of Stream SQL.
Programming Models for Stream Processing
- One SQL to Rule Them All. Edmon Begoli, Tyler Akidau, Fabian Hueske, Julian Hyde, Kathryn Knight, Kenneth Knowles SIGMOD 2019 A proposal to extend the current SQL semantics to support both batch and stream processing by adding time-varying relations (TVR), event time, and materialization control.
- DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. Y. Yu, et al. OSDI08 The origin of declarative data processing operators (e.g., map, filter, groupBy, etc.) in modern programming languages.
- OpenMessaging:Common Use Cases Illustrations of typical stream processing patterns from OpenMessaging.
- The world beyond batch: Streaming 101. An article written by the author of Google Dataflow. Streaming is actually a superset of batch processing for unbounded data. This article explains what is unbounded data and how to manage late coming data. You can also learn what streaming systems can do and can't do, and the differences of event times and processing times, and varieties of time-window based processing.
- Structured Streaming: A Declarative API for Real-Time Applications in Apache Spark. SIGMOD2018. A good summary of challenges around continous stream processing and unifying APIs for batch and stream processing.
- Discretized Streams: Fault-Tolerant Streaming Computation at Scale. NSDI 2013 An approacy for applying micro-batch style stream processing in Spark. This model has been redesigned in Spark 2.0 as Structured Streaming.
- Continuous Applications: Evolving Streaming in Apache Spark 2.0
- Drizzle: Fast and Adaptable Stream Processing at Scale. SOSP 2017 An approach for reducing the overhead of the coordination between stream processing tasks.
- Dataflow/Beam & Spark: A Programming Model Comparison
- ReactiveX. Stream processing patterns for functional programming.
- Akka Stream Stream processing DSL for Akka
- A Practical Guide to Selecting A Stream Processing Technology. A good explanation of stream processing
- Trill: A High-Performance Incremental Query Processor for Diverse Analytics. B. Chandramouli, et al. VLDB 2014
Table Catalog for Stream Processing
- Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores Utilizing scalable object strage on the cloud (e.g., S3), Delta Lake provides a single table format with versioning (time-travel) and transaction support.
- Iceberg
- Big Metadata: When Metadata is Big Data. (VLDB 2021) Columnar table catalog with partition statistics used in Google's BigQuery.
- Apache Hudi Apach Hudi provides a file layout for placing both streaming and batch data processing with a transaction support. You can merge fragmented partition data or use it as is for faster real-time data processing. Previously, it was called Uber Hoodie
Incremental Processing in DBMS
- What’s the Difference? Incremental Processing with Change Queries in Snowflake (ACM Management of Data 2023) Snowflake introduces CHANGE queries and STREAM table objects to subscribe changes in the table.
- dbt: Incremental Models dbt, a tool for compiling a sequence of queries from SQL templates, supports a simple incremental processing with conditional switch of SQL queries.
Watermark Management for Stream Processing
- The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. Akidau et al., (Google) VLDB 2015 The original paper of Google Cloud Dataflow, which describes how we can cope with the delay of data arrival (late-coming data) and periodical data processing in a unified API for batch and stream processing. You can also find a summary of this paper at the morning paper
- Watermarks in Stream Processing Systems: Semantics and Comparative Analysis of Apache Flink and Google Cloud Dataflow. VLDB 2021. Describes basic definitions of watermarks and shows challenges and trade-offs in managing watermarks.
- Watermarking in stream processing | Course in Spark Structured Streaming 3.0 | Lesson 7 A good tutorial explaining the notions of stream processing and watermark management.
Workload Optimization
- Towards a Learning Optimizer for Shared Clouds (VLDB 2019). Estimate cardinality models from the previous job executions in order to optimize the overall workloads. This work uses the multi-layer perceptron (MLP) neural network for learning models from query exeuction features (e.g., job name, input cardinality, average row length, input dataset names, etc.)
- CrocodileDB: Efficient Database Execution through Intelligent Deferment (CIDR 2020) This paper introduces Intermittent Query Processing (IQP) approach for utilizing the knowledge about new data, query semantics, and users' expectation together to reduce the overall processing cost. It uses Deep Q-Materialization (DQM) to make a tradeoff under a certain resource constraint (e.g., memory, CPUs, storage) to decide how much data will be cached, pre-computed, pre-loaded, etc.
- Peregrine: Workload Optimization for Cloud Query Engines (SOCC 2019) Analyzing the workload of historical queries and optimize recurrring queries, similar queries, and coordinating queries by extracing common subexpressions that can be materialized. To support various query engines including Spark, Microsoft has creaetd a common intermediate representation (IR) of workloads.
Iterative Data Processing
- Olston, C. et al. 2011. Nova: continuous Pig/Hadoop workflows. (Jun. 2011)
- Naiad: A Timely Dataflow system (SOSP13 best paper) Differential data processing developed in Microsoft. Niad Project Page
- Apache Flink: Spinning Fast Iterative Data Flows. PVLDB 2012
Incremental Processing with Materialized Views
- DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views (VLDB 2012). An approach for incremental view maintenance involving complex queries.
- How to Win a Hot Dog Eating Contest: Distributed Incremental View Maintenance with Batch Updates M. Nicolik et al. (SIGMOD 2016) An efficient way for finding delta of delta queries for computing materialized views. For example, computing delta for select distinct x doesn't improve the query performance, so that we should avoid incremental processing for this kinds of queries.
- Generalized Scale Independence Through Incremental Precomputation. M. Armbrust, et al. SIGMOD 2013 An approach for guaranteeing response time of queries by classifying query types and preparing materialized views if necessary.
- Comet: batched stream processing for data intensive distributed computing (SoCC 10) A basic style of incremental processing
- Continuous queries over append-only databases. SIGMOD 1992
- Selecting Subexpressions to Materialize at Datacenter Scale. PVLDB 2018 Microsoftr SCOPE - Automatically finding common sub-expressions among queries and materializing their results for reducing the overhead of recurrent queries.
- Napa: Powering Scalable Data Warehousing with Robust Query Performance at Google. VLDB 2021 Control the timing of eager materialization of queries based on the user's requirements (Favor freshness or performance)
Stream Log Collection Systems
- Fluentd A unified logging layer from various data sources.
- Kafka is often used for providing replayable streaming data sources.
- Apache Pulsar A distributed pub-sub messaging system originally created at Yahoo!
- OpenMessaging Cloud-oriented, simple, flexible, vendor-neutral and language-independent standards for messaging
- Uber Hoodie Hybrid storage: Avro for streaming import, Parquet for analysis. This project has been moved to Apache Hudi
- MQTT A machine-to-machine (M2M)/"Internet of Things" connectivity protocol.
- Robust, Scalable, Real-Time Event Time Series Aggregation at Twitter. SIGMOD2018
- Questioning the Lambda Architecture A commonly used architecture for managing recent data and archived data. However combining two types of systems for batch and streaming is still painful because analysts need to understand both systems (e.g, Hadoop + Storm, Spark + Spark Streaming, Kafka + other data store)
Real-Time Stream Processing
Real-time stream processing usually means ultra-low latency applications to satisfy SLAs for returning results in a few seconds.
- The Stratosphere Platform for Big Data Analytics. Stratospher is a former name of Apache Flink.
- The 8 Requirements of Real-Time Stream Processing. M. Stonebraker, et al. SIGMOD Record 2005. A summary is also available in the morning paper
- MacroBase: Prioritizing Attention in Fast Data. P. Bailis, et al. SIGMOD 2017. A data analytics engine that prioritizes end-user attention in high-volume fast data streams.
- A prototype implementation on GitHub
Stream SQL
- Foundations of Streaming SQL by Tyler Akidau. Good illustrations for understanding how regular table-based SQL and streaming SQL are different.
- Microsoft Azure Stream Analytics
- Norikra Schema-less Stream Processing with SQL
- Esper
- KSQL A Streaming SQL Engine for Apache Kafka
GitHub Projects
- OpenMessaging
- Apache Beam A unified model for defining both batch and streaming data-parallel processing pipelines.
- spotify/scio: A Scala API for Google Cloud Dataflow
- twitter/heron
- Microsoft Naiad
- Norikra
- KSQL
- Apache Apex Unified stream and batch processing engine.
Commercial Services
Stream Ingestion
External Lists
- List of projects related to stream-processing: https://github.com/manuzhang/awesome-streaming