KalDB
KalDB is a cloud-native search and analytics engine for log, trace, and audit data. It is designed to be easy to operate, cost-effective, and to scale to petabytes of data.
Goals
- Native support for log, trace, audit use cases.
- Aggressively prioritize ingest of recent data over older data.
- Full-text search capability.
- First-class Kubernetes support for all components.
- Autoscaling of ingest and query capacity.
- Coordination-free ingestion, so the failure of a single node does not impact ingestion.
- Works out of the box with sensible defaults.
- Designed for zero data loss.
- First-class Grafana support with accompanying plugin.
- Built-in multi-tenancy, supporting several small use cases on a single cluster.
- Supports the majority of Apache Lucene features.
- Drop-in replacement for most OpenSearch log use cases.
Non-Goals
- General-purpose search use cases, such as powering an e-commerce site.
- Document mutability - records are expected to be append only.
- Additional storage engines other than Lucene.
- Support for JVM versions other than the current LTS.
- Supporting multiple Lucene versions.
Quick Start
IntelliJ: Import the project as a Maven project.
IntelliJ run configs are provided for all node types and execute using the provided config/config.yaml. These configurations are stored in the .run folder and should be detected automatically by IntelliJ when importing the project.
To start KalDB and its dependencies (Zookeeper, Kafka, S3), you can use the provided Docker Compose file:
docker-compose up
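Once the containers are up, you can optionally verify that Kafka is reachable before indexing any data. The following is a minimal sketch, assuming the Compose file exposes a broker at localhost:9092; adjust the bootstrap address if your local setup differs.

```java
import java.util.Properties;
import java.util.Set;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;

public class KafkaHealthCheck {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    // Assumed broker address for the local docker-compose setup.
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    try (AdminClient admin = AdminClient.create(props)) {
      // Listing topics succeeds only if the broker is reachable.
      Set<String> topics = admin.listTopics().names().get();
      System.out.println("Kafka is reachable; topics: " + topics);
    }
  }
}
```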
Index Data
- Data from the "test-topic-in" Kafka topic (preprocessorConfig/kafkaStreamConfig/upstreamTopics in config.yaml) is read as input by the preprocessor.
- The "json" data transformer (preprocessorConfig/dataTransformer in config.yaml) determines how the preprocessor parses the incoming data.
- Each document must contain two mandatory fields: "service_name" and "timestamp" (formatted as DateTimeFormatter.ISO_INSTANT).
- A dataset entry must exist for the incoming data, mapping the incoming service name to a dataset.
- To create a dataset entry, open the manager node's docs page (default http://localhost:8083/docs) and call CreateDatasetMetadata with name and owner set to "test" and serviceNamePattern set to "_all".
- Next, update the partition assignment: on the same manager docs page (default http://localhost:8083/docs), call UpdatePartitionAssignment with name="test", throughputBytes=1000000 (1 MB/s, beyond which messages are dropped), and partitionIds=["0"] (partition IDs are strings; this tells the preprocessor to read only from partition 0 of test-topic-in).
- Now we can start producing data to partition 0 of the "test-topic-in" Kafka topic (see the producer sketch after this list).
- The preprocessor applies rate limiting and writes the data to the downstream Kafka topic "test-topic" (preprocessorConfig/downstreamTopic in config.yaml).
- The indexer service is configured to read from "test-topic" (indexerConfig/kafkaConfig/kafkaTopic in config.yaml) and builds Lucene indexes locally.
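As a minimal sketch of the document format and the produce step above, the snippet below sends one JSON document containing the mandatory service_name and timestamp (ISO_INSTANT) fields to partition 0 of "test-topic-in". Only the topic, partition, and mandatory field names come from the steps above; the broker address localhost:9092 and the extra "message" field are assumptions for illustration.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class TestTopicProducer {
  public static void main(String[] args) {
    Properties props = new Properties();
    // Assumed broker address for the local docker-compose setup.
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

    // Mandatory fields: "service_name" (here "test", matched by the "_all"
    // serviceNamePattern) and "timestamp" in ISO_INSTANT format.
    String timestamp = DateTimeFormatter.ISO_INSTANT.format(Instant.now());
    String doc = String.format(
        "{\"service_name\":\"test\",\"timestamp\":\"%s\",\"message\":\"hello kaldb\"}",
        timestamp);

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Send to partition 0 of test-topic-in, matching partitionIds=["0"] above.
      producer.send(new ProducerRecord<>("test-topic-in", 0, null, doc));
      producer.flush();
    }
  }
}
```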
Query via Grafana
Once data is indexed, it can be queried from Grafana's Explore view: http://localhost:3000/explore
Contributing
If you are interested in reporting/fixing issues and contributing directly to the code base, please see CONTRIBUTING for more information on what we're looking for and how to get started.
Community
Presentations
KalDB: A k8s native log search platform
Licensing
Licensed under MIT. Copyright (c) 2021 Slack.