Repository Details

An Awesome List of Open-Source Data Engineering Projects

Awesome Open-Source Data Engineering

Analytics

  • Apache Spark - A unified analytics engine for large-scale data processing. Includes APIs in Scala, Java, Python (known as PySpark), and R (SparkR).

  • Apache Beam - An open-source implementation of Google DataFlow. Provides batch and streaming data processing jobs that run on any supported execution engine, including Spark, Flink, or its own DirectRunner. Supports multiple APIs in Java, Python, and Go.

  • Apache Flink - Stateful computations over data streams.

  • Trino (formerly known as PrestoSQL) - Distributed SQL Query Engine for Big Data.

Business Intelligence

  • Apache Superset - A modern, enterprise-ready business intelligence web application.

  • HUE - The Hadoop User Interface. Similar to Superset, but interfaces between RDBMS, Hive, Impala, HBase, Spark, HDFS & S3, Oozie, Pig, YARN Job Explorer, and more. Offers an extensible Django environment for custom app integration.

  • Metabase - An easy way for everyone in your company to ask questions and learn from data.

  • Redash - All the tools to unlock your data.

Data Lakehouse

  • Delta Lake - Open-source storage framework that enables building a lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs for Scala, Java, Rust, Ruby, and Python.

  • Apache Hudi - Transactional data lake platform that brings database and data warehouse capabilities to the data lake. Hudi reimagines slow old-school batch data processing with a powerful new incremental processing framework for low latency minute-level analytics.

  • Apache Iceberg - High-performance format for huge analytic tables. Iceberg brings the reliability and simplicity of SQL tables to big data, while making it possible for engines like Spark, Trino, Flink, Presto, Hive and Impala to safely work with the same tables, at the same time.

Change Data Capture

  • Debezium - Change data capture for MySQL, Postgres, MongoDB, SQL Server and others.

  • Maxwell - Maxwell's daemon, a MySQL-to-JSON Kafka producer.

Datastores

  • Apache Calcite - SQL parser, building blocks for datastores.

  • Apache Cassandra - Open Source distributed wide column store, NoSQL database.

  • Apache Druid - A high performance real-time analytics database.

  • Apache HBase - Open Source non-relational distributed database.

  • Apache Pinot - A realtime distributed OLAP datastore.

  • ClickHouse - Open Source distributed column-oriented DBMS.

  • InfluxDB - Purpose-Built Open Source Time Series Database.

  • MinIO - A high-performance, distributed object storage system that is compatible with AWS S3.

  • Postgres - The World's Most Advanced Open Source Relational Database.

Data Governance and Registries

  • Amundsen - Metadata catalogue.

  • Apache Atlas - Data governance and metadata framework for Hadoop.

  • DataHub - A Generalized Metadata Search & Discovery Tool.

  • Metacat - Unified metadata exploration API service.

  • Elementary - Data reliability solution, starting with plug-and-play data lineage and datasets operational status.

  • Monosi - Data observability & monitoring platform.

  • OpenMetadata - Generalized metadata, search, and lineage tool.

Data Virtualization

  • Apache Drill - Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage.

  • Dremio - A data lake engine. Provides an Apache Arrow-based query and acceleration engine together with the ability to create an IT-governed self-service layer for data scientists and analysts.

  • Teiid - A relational abstraction of different information sources.

  • Presto - Distributed SQL Query Engine for Big Data.

Data Orchestration

  • Alluxio - Scalable, multi-tiered distributed caching for HDFS, S3, Ceph, NFS, and related filestores. Provides integrations for SQL queries into a Catalog from Spark, Hive, and Presto.

Formats

  • Apache Avro - A data serialization system.

  • Apache Parquet - A columnar storage format.

  • Apache ORC - Another columnar storage format.

  • Apache Thrift - Data type and service interface definitions and code generator.

  • Apache Arrow - A cross-language development platform for in-memory data. It specifies a standardized, language-independent, columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy IPC and streaming messaging.

  • Cap'n Proto - A data interchange format and capability-based RPC system.

  • FlatBuffers - An efficient cross platform serialization library for C++, C#, C, Go, Java, JavaScript, Lobster, Lua, TypeScript, PHP, Python, and Rust.

  • MessagePack - An efficient binary serialization format. It lets you exchange data among multiple languages like JSON.

  • Protocol Buffers - Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data.

Integration

  • Apache Camel - Easily integrate various systems consuming or producing data.

  • Kafka Connect - Reusable framework for moving data into and out of Apache Kafka.

  • Logstash - Open Source server-side data processing pipeline.

  • Telegraf - A plugin-driven server agent written in Go (deployed as a single binary with no external dependencies) for collecting and sending metrics and events from databases, systems, and IoT sensors. Offers hundreds of existing plugins.

Messaging Infrastructure

  • Apache ActiveMQ - Flexible & Powerful Multi-Protocol Messaging.

  • Apache Kafka - A distributed commit log with messaging capabilities.

  • Apache Pulsar - A distributed pub-sub messaging system.

  • Liiklus - An event gateway that provides reactive gRPC/RSocket access to Kafka-like systems.

  • Nakadi - A distributed event bus that implements a RESTful API abstraction on top of Kafka-like queues.

  • NATS - A simple, secure and high performance messaging system.

  • RabbitMQ - A message broker.

  • Waltz - A quorum-based distributed write-ahead log for replicating transactions.

  • ZeroMQ - An open-source universal, high-performance messaging library.

Specifications and Standards

  • CloudEvents - A specification for describing event data in a common way.

Stream Processing

  • Apache Heron - The "direct successor of Apache Storm", built to be backwards compatible with Storm's topology API but with a wide array of architectural improvements.

  • Apache Kafka Streams - A client library for building applications and microservices, where the input and output data are stored in Kafka.

  • Apache Samza - A distributed stream processing framework.

  • Apache Spark Structured Streaming - A scalable and fault-tolerant stream processing engine built on the Spark SQL engine.

  • Apache Storm - A distributed realtime computation system.

Testing

Versioning

  • lakeFS - Repeatable, atomic and versioned data lake on top of object storage.

Workflow Management

  • Awesome Workflow Engines - A curated list of awesome open source workflow engines.

  • Apache Airflow - A platform created by the community to programmatically author, schedule, and monitor workflows.

  • Apache NiFi - Supports powerful and scalable directed graphs of data routing, transformation, and system mediation logic.

  • KNIME - KNIME Analytics Platform offers a WYSIWYG editor for Spark-based workflows, with over 2,000 integrations. Offers visualization and flow analytics in-place. KNIME Server is a commercially licensed component that adds additional features.

  • Prefect - A workflow management system designed for modern infrastructure.

  • Dagster - A data orchestrator for machine learning, analytics, and ETL.

  • Kestra - Open source data orchestration and scheduling platform with declarative syntax.

only overview contents, no specific tools

Slide Decks, Recordings and Podcasts

Blog Posts and Articles

Collections

License

The contents of this repository are licensed under the "Creative Commons Attribution-ShareAlike 4.0 International License".
