Apache Gobblin
Apache Gobblin is a highly scalable data management solution for structured and byte-oriented data in heterogeneous data ecosystems.
Capabilities
- Ingestion and export of data from a variety of sources and sinks into and out of the data lake. Gobblin is designed and optimized for ELT patterns, with lightweight inline transformations on ingest (small t).
- Data Organization within the lake (e.g. compaction, partitioning, deduplication)
- Lifecycle Management of data within the lake (e.g. data retention)
- Compliance Management of data across the ecosystem (e.g. fine-grained data deletions)
Highlights
- Battle-tested at scale: Runs in production at petabyte scale at companies such as LinkedIn, PayPal, and Verizon.
- Feature-rich: Supports task partitioning, state management for incremental processing, atomic data publishing, data quality checking, job scheduling, fault tolerance, and more.
- Supports stream and batch execution modes
- Control Plane (Gobblin-as-a-service) supports programmatic triggering and orchestration of data plane operations.
Common Patterns used in production
- Stream / Batch ingestion of Kafka to Data Lake (HDFS, S3, ADLS)
- Bulk-loading serving stores from the Data Lake (e.g. HDFS -> Couchbase)
- Data sync across a federated data lake (HDFS <-> HDFS, HDFS <-> S3, S3 <-> ADLS)
- Integrating external vendor APIs (e.g. Salesforce, Dynamics, etc.) with data stores (HDFS, Couchbase, etc.)
- Enforcing Data retention policies and GDPR deletion on HDFS / ADLS
Apache Gobblin is NOT
- A general-purpose data transformation engine like Spark or Flink. Gobblin can delegate complex data-processing tasks to Spark, Hive, etc.
- A data storage system like Apache Kafka or HDFS. Gobblin integrates with these systems as sources or sinks.
- A general-purpose workflow execution system like Airflow, Azkaban, Dagster, or Luigi.
Requirements
- Java >= 1.8
If building the distribution with tests turned on:
- Maven version 3.5.3
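Before building, it can help to confirm that the installed Java meets the "Java >= 1.8" requirement. The following is a minimal sketch, not part of Gobblin: the helper names java_major, java_ok, and installed_java_version are hypothetical, and the parsing assumes the usual `java -version` output where the version appears in double quotes on the first line.

```shell
#!/bin/sh
# Sketch only: these helpers are hypothetical and not shipped with Gobblin.

# Map a Java version string to its major number:
#   "1.8.0_292" -> 8 (legacy 1.x scheme), "11.0.2" -> 11, "17" -> 17
java_major() {
  case "$1" in
    1.*) v=${1#1.}; echo "${v%%[!0-9]*}" ;;
    *)   echo "${1%%[!0-9]*}" ;;
  esac
}

# Succeed if the given version string satisfies Java >= 1.8.
java_ok() {
  [ "$(java_major "$1")" -ge 8 ]
}

# Read the version string of the java found on PATH.
# `java -version` writes to stderr, so stderr is redirected into the pipe.
installed_java_version() {
  java -version 2>&1 | sed -n '1s/.*version "\([^"]*\)".*/\1/p'
}
```

A possible use is `java_ok "$(installed_java_version)" || echo "Java 1.8 or newer is required" >&2`, run before invoking the build.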
Instructions to download gradle wrapper
If you build Gobblin from the source distribution, run one of the following commands to download gradle-wrapper.jar from the Gobblin git repository into the gradle/wrapper directory (replace GOBBLIN_VERSION in the URL with the version you downloaded).
wget --no-check-certificate -P gradle/wrapper https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar
(or)
curl --insecure -L https://github.com/apache/gobblin/raw/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar > gradle/wrapper/gradle-wrapper.jar
Alternatively, you can download it manually from:
https://github.com/apache/gobblin/blob/${GOBBLIN_VERSION}/gradle/wrapper/gradle-wrapper.jar
Make sure that you download it to the gradle/wrapper directory.
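The download step above can be wrapped in a small script so that the target directory always exists and a failed download is reported. This is a sketch under assumptions: wrapper_url and fetch_wrapper are hypothetical helper names, and the script is expected to run from the root of the extracted source distribution. Unlike the commands above, it leaves TLS certificate verification enabled.

```shell
#!/bin/sh
# Sketch only: wrapper_url and fetch_wrapper are hypothetical helpers.

# Build the download URL for a given Gobblin version.
wrapper_url() {
  printf 'https://github.com/apache/gobblin/raw/%s/gradle/wrapper/gradle-wrapper.jar\n' "$1"
}

# Download the jar into gradle/wrapper, creating the directory if needed.
# --fail makes curl return a non-zero status on HTTP errors,
# --location follows the redirect that GitHub issues for raw files.
fetch_wrapper() {
  mkdir -p gradle/wrapper &&
    curl --fail --location \
      --output gradle/wrapper/gradle-wrapper.jar \
      "$(wrapper_url "$1")"
}
```

With GOBBLIN_VERSION set to the version you downloaded, the call would be `fetch_wrapper "${GOBBLIN_VERSION}"`.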
Instructions to run Apache RAT (Release Audit Tool)
- Extract the archive file to your local directory.
- Run the following command. The report will be generated at build/rat/rat-report.html.
./gradlew rat
Instructions to build the distribution
- Extract the archive file to your local directory.
- Skip tests and build the distribution:
./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain
The distribution will be created in the build/gobblin-distribution/distributions directory.
(or)
- Run tests and build the distribution (requires Maven):
./gradlew build
The distribution will be created in the build/gobblin-distribution/distributions directory.
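The two build variants above differ only in the tasks they exclude, so a script can select between them with a single argument. The sketch below assumes a hypothetical helper named build_command and hypothetical mode names skip-tests and with-tests. It only prints the command, so it can be inspected before anything is run.

```shell
#!/bin/sh
# Sketch only: build_command and its mode names are hypothetical.

# Print the Gradle invocation for the requested build mode.
build_command() {
  case "$1" in
    skip-tests) echo "./gradlew build -x findbugsMain -x test -x rat -x checkstyleMain" ;;
    with-tests) echo "./gradlew build" ;;
    *) echo "usage: build_command skip-tests|with-tests" >&2; return 1 ;;
  esac
}
```

For example, `sh -c "$(build_command skip-tests)" && ls build/gobblin-distribution/distributions` would run the faster build and then list the generated archives.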