• This repository has been archived on 29/Sep/2021
  • Stars: 177
  • Rank: 215,985 (Top 5%)
  • Language: Java
  • License: Apache License 2.0
  • Created: about 4 years ago
  • Updated: about 3 years ago


Repository Details

Profile and monitor your ML data pipeline end-to-end

ARCHIVED: Migrated to whylogs repo

WhyLogs Java Library


This is a Java implementation of WhyLogs, with support for Apache Spark integration for large-scale datasets. The Python implementation can be found in the whylogs repository.

Understanding the properties of data as it moves through applications is essential to keeping your ML/AI pipeline stable and improving your user experience, whether your pipeline is built for production or experimentation. WhyLogs is an open source statistical logging library that allows data science and ML teams to effortlessly profile ML/AI pipelines and applications, producing log files that can be used for monitoring, alerts, analytics, and error analysis.

WhyLogs calculates approximate statistics for datasets of any size up to TB-scale, making it easy for users to identify changes in the statistical properties of a model's inputs or outputs. Using approximate statistics allows the package to run on minimal infrastructure and monitor an entire dataset, rather than miss outliers and other anomalies by only using a sample of the data to calculate statistics. These qualities make WhyLogs an excellent solution for profiling production ML/AI pipelines that operate on TB-scale data and with enterprise SLAs.

Key Features

  • Data Insight: WhyLogs provides complex statistics across different stages of your ML/AI pipelines and applications.

  • Scalability: WhyLogs scales with your system, from local development mode to live production systems in multi-node clusters, and works well with batch and streaming architectures.

  • Lightweight: WhyLogs produces small, mergeable outputs in a variety of formats, using sketching algorithms and summarizing statistics.

  • Unified data instrumentation: To enable data engineering pipelines and ML pipelines to share a common framework for tracking data quality and drifts, the WhyLogs library supports multiple languages and integrations.

  • Observability: In addition to supporting traditional monitoring approaches, WhyLogs data can support advanced ML-focused analytics, error analysis, and data quality and data drift detection.

Glossary/Concepts

Project: A collection of related datasets used for multiple models or applications.

Pipeline: One or more datasets used to build a single model or application. A project may contain multiple pipelines.

Dataset: A collection of records. WhyLogs v0.0.2 supports structured datasets, which represent data as a table where each row is a different record and each column is a feature of the record.

Feature: In the context of WhyLogs v0.0.2 and structured data, a feature is a column in a dataset. A feature can be discrete (like gender or eye color) or continuous (like age or salary).

WhyLogs Output: WhyLogs returns profile summary files for a dataset in JSON format. For convenience, these files are provided in flat table, histogram, and frequency formats.

Statistical Profile: A collection of statistical properties of a feature. Properties can be different for discrete and continuous features.

Statistical Profile

WhyLogs collects approximate statistics and sketches of data on a per-column basis into a statistical profile. These metrics include:

  • Simple counters: boolean values, null values, data types.
  • Summary statistics: sum, min, max, variance.
  • Unique value counter (cardinality): tracks the approximate number of unique values of a feature using the HyperLogLog algorithm (see the sketch after this list).
  • Histograms for numerical features. WhyLogs binary output can be queried with dynamic binning based on the shape of your data.
  • Frequent items (the top 128 by default). Note that this configuration affects the memory footprint, especially for text features.
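
To give a feel for the approximate counting behind the cardinality tracker, here is a minimal sketch using the Apache DataSketches HLL implementation directly (package names per recent datasketches-java releases). It illustrates the technique, not WhyLogs' internal API; WhyLogs uses its own fork of datasketches.

import org.apache.datasketches.hll.HllSketch;

public class CardinalityDemo {
    public static void main(String[] args) {
        // lgConfigK = 12 gives roughly 1.6% relative error with a few KB of state
        final HllSketch sketch = new HllSketch(12);
        for (int i = 0; i < 1_000_000; i++) {
            sketch.update("user-" + (i % 250_000)); // 250k distinct values
        }
        // the estimate is approximate, but the memory footprint stays constant
        System.out.println("estimated cardinality: " + sketch.getEstimate());
    }
}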

Performance

We tested WhyLogs Java performance on the following datasets to validate its memory footprint and output binary size.

We ran our profiler (the cli sub-module in this package) on each dataset and collected JMX metrics.

Dataset                 Size    No. of Entries   No. of Features   Est. Memory Consumption   Output Size (uncompressed)
Lending Club            1.6GB   2.2M             151               14MB                      7.4MB
NYC Tickets             1.9GB   10.8M            43                14MB                      2.3MB
Pain pills in the USA   75GB    178M             42                15MB                      2MB

Usage

To get started, add WhyLogs to your Maven POM:

<dependency>
  <groupId>ai.whylabs</groupId>
  <artifactId>whylogs-core</artifactId>
  <version>0.1.0</version>
</dependency>
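
If you build with Gradle instead, the same coordinates apply; a sketch of the equivalent dependency declaration:

dependencies {
    implementation 'ai.whylabs:whylogs-core:0.1.0'
}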

For the full Java API signature, see the Java Documentation.

Spark package (Scala 2.11 or 2.12 only):

<dependency>
  <groupId>ai.whylabs</groupId>
  <artifactId>whylogs-spark_2.11</artifactId>
  <version>0.1.0</version>
</dependency>
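
The equivalent Gradle declaration, matching your Scala version suffix (2.11 shown here):

dependencies {
    implementation 'ai.whylabs:whylogs-spark_2.11:0.1.0'
}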

For the full Scala API signature, see the Scala API Documentation.

Examples Repo

For examples in different languages, please check out our whylogs-examples repository.

Simple tracking

The following code is a simple tracking example that does not output data to disk:

import com.google.common.collect.ImmutableMap;
import com.whylogs.core.DatasetProfile;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class Demo {
    public void demo() {
        final Map<String, String> tags = ImmutableMap.of("tag", "tagValue");
        final DatasetProfile profile = new DatasetProfile("test-session", Instant.now(), tags);

        // track individual values for a feature; mixed types are supported
        profile.track("my_feature", 1);
        profile.track("my_feature", "stringValue");
        profile.track("my_feature", 1.0);

        // track a whole record at once as a map of feature name to value
        final HashMap<String, Object> dataMap = new HashMap<>();
        dataMap.put("feature_1", 1);
        dataMap.put("feature_2", "text");
        dataMap.put("double_type_feature", 3.0);
        profile.track(dataMap);
    }
}

Serialization and deserialization

WhyLogs uses Protobuf as the backing storage format. To write the data to disk, use the standard Protobuf serialization API as follows.

import com.whylogs.core.DatasetProfile;
import com.whylogs.core.message.DatasetProfileMessage;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class SerializationDemo {
    public void demo(DatasetProfile profile) throws IOException {
        // write the profile to disk using Protobuf's delimited encoding
        try (final OutputStream fos = Files.newOutputStream(Paths.get("profile.bin"))) {
            profile.toProtobuf().build().writeDelimitedTo(fos);
        }

        // read the profile back from disk
        try (final InputStream is = Files.newInputStream(Paths.get("profile.bin"))) {
            final DatasetProfileMessage msg = DatasetProfileMessage.parseDelimitedFrom(is);
            final DatasetProfile restored = DatasetProfile.fromProtobuf(msg);

            // continue tracking
            restored.track("feature_1", 1);
        }
    }
}

Merging dataset profiles

In enterprise systems, data is often partitioned across multiple machines for distributed processing. Online systems may also process data on multiple machines, which traditionally forces engineers to run ad-hoc analyses through an ETL-based system to build complex metrics, such as counting unique visitors to a website.

WhyLogs resolves this by allowing users to merge sketches from different machines. To merge two WhyLogs DatasetProfile files, those files must:

  • Have the same name
  • Have the same session ID
  • Have the same data timestamp
  • Have the same tags

The following example merges two files that meet these requirements.

import com.whylogs.core.DatasetProfile;
import com.whylogs.core.message.DatasetProfileMessage;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class MergeDemo {
    public void demo() throws IOException {
        try (final InputStream is1 = Files.newInputStream(Paths.get("profile1.bin"));
                final InputStream is2 = Files.newInputStream(Paths.get("profile2.bin"))) {
            final DatasetProfile profile1 = DatasetProfile.fromProtobuf(DatasetProfileMessage.parseDelimitedFrom(is1));
            final DatasetProfile profile2 = DatasetProfile.fromProtobuf(DatasetProfileMessage.parseDelimitedFrom(is2));

            // merge the two profiles into a single combined profile
            final DatasetProfile merged = profile1.merge(profile2);
        }
    }
}
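
Because merging is pairwise and associative, many profiles can be folded into one. A minimal sketch, assuming (as above) that merge returns the combined profile; mergeAll is a hypothetical helper, not part of the WhyLogs API:

import com.whylogs.core.DatasetProfile;
import java.util.List;

class MergeAllDemo {
    // hypothetical helper: reduce a list of compatible profiles into one
    static DatasetProfile mergeAll(List<DatasetProfile> profiles) {
        return profiles.stream()
                .reduce(DatasetProfile::merge)
                .orElseThrow(() -> new IllegalArgumentException("no profiles to merge"));
    }
}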

Apache Spark integration

Our integration is compatible with Apache Spark 2.x (Spark 3.0 support is planned).

This example shows how to use WhyLogs to profile a dataset based on time and categorical information. The data comes from the public Fire Department Calls for Service dataset.

import org.apache.spark.sql.functions._
// implicit import for WhyLogs to enable newProfilingSession API
import com.whylogs.spark.WhyLogs._

// load the data
val raw_df = spark.read.option("header", "true").csv("/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv")
val df = raw_df.withColumn("call_date", to_timestamp(col("Call Date"), "MM/dd/yyyy"))

val profiles = df.newProfilingSession("profilingSession") // start a new WhyLogs profiling job
                 .withTimeColumn("call_date") // split dataset by call_date
                 .groupBy("City", "Priority") // tag and group the data with categorical information
                 .aggProfiles() // run the aggregation; returns a dataframe of <timestamp, datasetProfile> entries

For further analysis, the resulting DataFrame can be written to a Parquet file, or collected to the driver if the number of entries is small enough.
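
For instance, a minimal sketch using Spark's Java API; the output path is illustrative, and profiles stands in for the DataFrame returned by aggProfiles above:

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

class ProfileSink {
    // persist aggregated profiles as Parquet for later analysis
    public void save(Dataset<Row> profiles) {
        profiles.write().mode("overwrite").parquet("/tmp/whylogs-profiles.parquet");
    }
}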

Building and Testing

Before building, update the proto submodule:

git submodule update --init --recursive

  • To build, run ./gradlew build
  • To test, run ./gradlew test

More Repositories

1. whylogs (Jupyter Notebook, 2,644 stars)
   An open-source data logging library for machine learning models and data pipelines. 📚 Provides visibility into data quality & model performance over time. 🛡️ Supports privacy-preserving data collection, ensuring safety & robustness. 📈

2. langkit (Jupyter Notebook, 833 stars)
   🔍 LangKit: An open-source toolkit for monitoring Large Language Models (LLMs). 📚 Extracts signals from prompts & responses, ensuring safety & security. 🛡️ Features include text quality, relevance metrics, & sentiment analysis. 📊 A comprehensive tool for LLM observability. 👀

3. whylogs-examples (Jupyter Notebook, 48 stars)
   A collection of WhyLogs examples in various languages

4. openllmtelemetry (Jupyter Notebook, 22 stars)
   Open LLM Telemetry package

5. whylogs-proto (14 stars)
   Protobuf definition for the WhyLogs format

6. datasketches (C++, 13 stars)
   A fork of datasketches for consumption in WhyLogs

7. whylabs-toolkit (Python, 12 stars)

8. whylabs-tutorials (Jupyter Notebook, 6 stars)
   Tutorials for WhyLabs

9. whylogs-container (Kotlin, 6 stars)
   Container code for WhyLogs

10. whylogs_action (Python, 5 stars)
    Repo for running WhyLogs as part of a CI workflow using GitHub Actions

11. llm-traceguard (Makefile, 5 stars)
    End-to-end observability with built-in security guardrails

12. whylabs-docs (JavaScript, 3 stars)
    WhyLabs documentation repository

13. whylabs-client-python (Python, 2 stars)
    Public Python client for the WhyLabs API

14. airflow-provider-whylogs (Python, 2 stars)
    whylogs operators for Apache Airflow

15. whylabs-ray-examples (Python, 2 stars)

16. monitor-schema (1 star)
    A repository for the WhyLabs monitor config schema

17. bigquery-dataflow-templates (Python, 1 star)

18. whylogs-container-python (Python, 1 star)

19. whylogs-container-python-client (Python, 1 star)
    Python Swagger client for the whylogs container

20. whylabs (Jupyter Notebook, 1 star)
    Python library for configuring and managing WhyLabs organizations

21. langkit-container-examples (1 star)