ARCHIVED: Migrated to the whylogs repo
WhyLogs Java Library
This is a Java implementation of WhyLogs, with support for Apache Spark integration for large scale datasets. The Python implementation can be found here.
Understanding the properties of data as it moves through applications is essential to keeping your ML/AI pipeline stable and improving your user experience, whether your pipeline is built for production or experimentation. WhyLogs is an open source statistical logging library that allows data science and ML teams to effortlessly profile ML/AI pipelines and applications, producing log files that can be used for monitoring, alerts, analytics, and error analysis.
WhyLogs calculates approximate statistics for datasets of any size up to TB-scale, making it easy for users to identify changes in the statistical properties of a model's inputs or outputs. Using approximate statistics allows the package to run on minimal infrastructure and monitor an entire dataset, rather than miss outliers and other anomalies by only using a sample of the data to calculate statistics. These qualities make WhyLogs an excellent solution for profiling production ML/AI pipelines that operate on TB-scale data and with enterprise SLAs.
Key Features
- Data Insight: WhyLogs provides complex statistics across different stages of your ML/AI pipelines and applications.
- Scalability: WhyLogs scales with your system, from local development mode to live production systems in multi-node clusters, and works well with batch and streaming architectures.
- Lightweight: WhyLogs produces small, mergeable outputs in a variety of formats, using sketching algorithms and summary statistics.
- Unified data instrumentation: To enable data engineering pipelines and ML pipelines to share a common framework for tracking data quality and drift, the WhyLogs library supports multiple languages and integrations.
- Observability: In addition to supporting traditional monitoring approaches, WhyLogs data can support advanced ML-focused analytics, error analysis, and data quality and data drift detection.
Glossary/Concepts
Project: A collection of related data sets used for multiple models or applications.
Pipeline: One or more datasets used to build a single model or application. A project may contain multiple pipelines.
Dataset: A collection of records. WhyLogs v0.0.2 supports structured datasets, which represent data as a table where each row is a different record and each column is a feature of the record.
Feature: In the context of WhyLogs v0.0.2 and structured data, a feature is a column in a dataset. A feature can be discrete (like gender or eye color) or continuous (like age or salary).
WhyLogs Output: WhyLogs returns profile summary files for a dataset in JSON format. For convenience, these files are provided in flat table, histogram, and frequency formats.
Statistical Profile: A collection of statistical properties of a feature. Properties can be different for discrete and continuous features.
Statistical Profile
WhyLogs collects approximate statistics and sketches of data on a column-basis into a statistical profile. These metrics include:
- Simple counters: boolean, null values, data types.
- Summary statistics: sum, min, max, variance.
- Unique value counter or cardinality: tracks the approximate number of unique values of a feature using the HyperLogLog algorithm.
- Histograms for numerical features. WhyLogs binary output can be queried with dynamic binning based on the shape of your data.
- Top frequent items (default is 128). Note that this configuration affects the memory footprint, especially for text features.
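As a rough illustration of the simple counters and summary statistics listed above, the following plain-Java sketch keeps a constant-memory running summary for one numeric column. The `ColumnSummary` class and its methods are hypothetical and not part of the WhyLogs API. The variance uses Welford's online algorithm, which is an assumption about one reasonable approach rather than a description of WhyLogs internals.

```java
// Hypothetical sketch: not part of the WhyLogs API.
final class ColumnSummary {
    private long count;      // non-null values seen
    private long nullCount;  // null values seen
    private double sum;
    private double min = Double.POSITIVE_INFINITY;
    private double max = Double.NEGATIVE_INFINITY;
    private double mean;     // running mean (Welford)
    private double m2;       // running sum of squared deviations from the mean

    // Updates every statistic in O(1) time and memory per value.
    void track(Double value) {
        if (value == null) {
            nullCount++;
            return;
        }
        count++;
        sum += value;
        min = Math.min(min, value);
        max = Math.max(max, value);
        double delta = value - mean;
        mean += delta / count;
        m2 += delta * (value - mean);
    }

    long count() { return count; }
    long nullCount() { return nullCount; }
    double sum() { return sum; }
    double min() { return min; }
    double max() { return max; }

    // Sample variance; defined as 0 when fewer than two values were tracked.
    double variance() { return count > 1 ? m2 / (count - 1) : 0.0; }

    public static void main(String[] args) {
        ColumnSummary summary = new ColumnSummary();
        for (Double v : new Double[] {2.0, 4.0, null, 6.0}) {
            summary.track(v);
        }
        System.out.println("count=" + summary.count()
                + " nulls=" + summary.nullCount()
                + " variance=" + summary.variance());
    }
}
```

Because each update only touches a handful of fields, the memory used by such a summary does not grow with the number of rows, which is the property the approximate statistics in this section rely on.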
Performance
We tested WhyLogs Java performance on the following datasets to validate the WhyLogs memory footprint and the size of the output binary.
- Lending Club Data: Kaggle Link
- NYC Tickets: Kaggle Link
- Pain Pills in the USA: Kaggle Link
We ran our profiler (in the cli sub-module of this package) on each dataset and collected JMX metrics.
| Dataset | Size | No. of Entries | No. of Features | Est. Memory Consumption | Output Size (uncompressed) |
|---|---|---|---|---|---|
| Lending Club | 1.6GB | 2.2M | 151 | 14MB | 7.4MB |
| NYC Tickets | 1.9GB | 10.8M | 43 | 14MB | 2.3MB |
| Pain Pills in the USA | 75GB | 178M | 42 | 15MB | 2MB |
Usage
To get started, add WhyLogs to your Maven POM:
<dependency>
  <groupId>ai.whylabs</groupId>
  <artifactId>whylogs-core</artifactId>
  <version>0.1.0</version>
</dependency>
For the full Java API signature, see the Java Documentation.
Spark package (Scala 2.11 or 2.12 only):
<dependency>
  <groupId>ai.whylabs</groupId>
  <artifactId>whylogs-spark_2.11</artifactId>
  <version>0.1.0</version>
</dependency>
For the full Scala API signature, see the Scala API Documentation.
Examples Repo
For examples in different languages, please check out our whylogs-examples repository.
Simple tracking
The following code is a simple tracking example that does not output data to disk:
import com.google.common.collect.ImmutableMap;
import com.whylogs.core.DatasetProfile;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

public class Demo {
    public void demo() {
        // Create a profile identified by a session ID, a timestamp, and tags
        final Map<String, String> tags = ImmutableMap.of("tag", "tagValue");
        final DatasetProfile profile = new DatasetProfile("test-session", Instant.now(), tags);

        // Track individual values for a single feature
        profile.track("my_feature", 1);
        profile.track("my_feature", "stringValue");
        profile.track("my_feature", 1.0);

        // Track a whole record at once, keyed by feature name
        final HashMap<String, Object> dataMap = new HashMap<>();
        dataMap.put("feature_1", 1);
        dataMap.put("feature_2", "text");
        dataMap.put("double_type_feature", 3.0);
        profile.track(dataMap);
    }
}
Serialization and deserialization
WhyLogs uses Protobuf as the backing storage format. To write the data to disk, use the standard Protobuf serialization API as follows.
import com.whylogs.core.DatasetProfile;
import com.whylogs.core.message.DatasetProfileMessage;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class SerializationDemo {
    public void demo(DatasetProfile profile) throws IOException {
        // Write the profile to disk as a length-delimited Protobuf message
        try (final OutputStream fos = Files.newOutputStream(Paths.get("profile.bin"))) {
            profile.toProtobuf().build().writeDelimitedTo(fos);
        }

        // Read the message back and rebuild the profile from it
        try (final InputStream is = Files.newInputStream(Paths.get("profile.bin"))) {
            final DatasetProfileMessage msg = DatasetProfileMessage.parseDelimitedFrom(is);
            final DatasetProfile restored = DatasetProfile.fromProtobuf(msg);

            // continue tracking
            restored.track("feature_1", 1);
        }
    }
}
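The `writeDelimitedTo` and `parseDelimitedFrom` calls are Protobuf's length-delimited variants: each message is preceded by its size, encoded as a varint, so several messages can share one stream. The plain-Java sketch below imitates that framing on raw byte arrays to show the idea. `DelimitedFraming` and its methods are hypothetical names, the payloads are stand-in strings rather than real profiles, and real code should use the Protobuf API directly.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of varint length-delimited framing; not the Protobuf implementation.
final class DelimitedFraming {

    // Appends one frame: the payload length as a base-128 varint, then the payload bytes.
    static void writeDelimited(ByteArrayOutputStream out, byte[] payload) {
        int length = payload.length;
        while ((length & ~0x7F) != 0) {
            out.write((length & 0x7F) | 0x80); // low 7 bits with the continuation bit set
            length >>>= 7;
        }
        out.write(length);
        out.write(payload, 0, payload.length);
    }

    // Splits a buffer of concatenated frames back into the individual payloads.
    // Truncated input fails with an unchecked BufferUnderflowException.
    static List<byte[]> readAllDelimited(byte[] data) {
        List<byte[]> payloads = new ArrayList<>();
        ByteBuffer buf = ByteBuffer.wrap(data);
        while (buf.hasRemaining()) {
            int length = 0;
            int shift = 0;
            byte b;
            do {
                b = buf.get();
                length |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            byte[] payload = new byte[length];
            buf.get(payload);
            payloads.add(payload);
        }
        return payloads;
    }

    public static void main(String[] args) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeDelimited(out, "profile-1".getBytes(StandardCharsets.UTF_8));
        writeDelimited(out, "profile-2".getBytes(StandardCharsets.UTF_8));
        for (byte[] payload : readAllDelimited(out.toByteArray())) {
            System.out.println(new String(payload, StandardCharsets.UTF_8));
        }
    }
}
```

The length prefix is what lets a reader know where one message ends and the next begins without any separator bytes.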
Merging dataset profiles
In enterprise systems, data is often partitioned across multiple machines for distributed processing. Online systems may also process data on multiple machines, requiring engineers to run ad-hoc analysis using an ETL-based system to build complex metrics, such as counting unique visitors to a website.
WhyLogs resolves this by allowing users to merge sketches from different machines. To merge two WhyLogs DatasetProfile files, those files must:
- Have the same name
- Have the same session ID
- Have the same data timestamp
- Have the same tags
The following is an example of the code for merging files that meet these requirements.
import com.whylogs.core.DatasetProfile;
import com.whylogs.core.message.DatasetProfileMessage;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

class MergeDemo {
    public void demo() throws IOException {
        try (final InputStream is1 = Files.newInputStream(Paths.get("profile1.bin"));
             final InputStream is2 = Files.newInputStream(Paths.get("profile2.bin"))) {
            // Rebuild both profiles from their serialized Protobuf messages
            final DatasetProfile profile1 = DatasetProfile.fromProtobuf(DatasetProfileMessage.parseDelimitedFrom(is1));
            final DatasetProfile profile2 = DatasetProfile.fromProtobuf(DatasetProfileMessage.parseDelimitedFrom(is2));

            // merge
            profile1.merge(profile2);
        }
    }
}
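The four merge requirements can be pictured as a key that both profiles must share. The sketch below uses a hypothetical `MergeableSummary` class with a toy count/sum/min/max summary, not the WhyLogs classes, to show why summaries built on different machines can be combined without revisiting the raw data. How WhyLogs itself reports a mismatch is not covered here; throwing `IllegalArgumentException` is an assumption made for the illustration.

```java
import java.time.Instant;
import java.util.Collections;
import java.util.Map;

// Hypothetical sketch; this class is not part of the WhyLogs API.
final class MergeableSummary {
    final String name;
    final String sessionId;
    final Instant dataTimestamp;
    final Map<String, String> tags;
    final long count;
    final double sum;
    final double min;
    final double max;

    private MergeableSummary(String name, String sessionId, Instant dataTimestamp,
                             Map<String, String> tags, long count, double sum,
                             double min, double max) {
        this.name = name;
        this.sessionId = sessionId;
        this.dataTimestamp = dataTimestamp;
        this.tags = tags;
        this.count = count;
        this.sum = sum;
        this.min = min;
        this.max = max;
    }

    // Summarizes the values seen on one machine or partition.
    static MergeableSummary of(String name, String sessionId, Instant dataTimestamp,
                               Map<String, String> tags, double... values) {
        long count = 0;
        double sum = 0.0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double v : values) {
            count++;
            sum += v;
            min = Math.min(min, v);
            max = Math.max(max, v);
        }
        return new MergeableSummary(name, sessionId, dataTimestamp, tags, count, sum, min, max);
    }

    // Mirrors the four requirements: same name, session ID, data timestamp, and tags.
    boolean canMergeWith(MergeableSummary other) {
        return name.equals(other.name)
                && sessionId.equals(other.sessionId)
                && dataTimestamp.equals(other.dataTimestamp)
                && tags.equals(other.tags);
    }

    // Combines two summaries using only their aggregated fields.
    MergeableSummary merge(MergeableSummary other) {
        if (!canMergeWith(other)) {
            throw new IllegalArgumentException("name, session ID, timestamp, and tags must match");
        }
        return new MergeableSummary(name, sessionId, dataTimestamp, tags,
                count + other.count, sum + other.sum,
                Math.min(min, other.min), Math.max(max, other.max));
    }

    public static void main(String[] args) {
        Instant ts = Instant.parse("2020-01-01T00:00:00Z");
        Map<String, String> tags = Collections.singletonMap("env", "prod");
        MergeableSummary a = of("calls", "session-1", ts, tags, 1.0, 2.0, 3.0);
        MergeableSummary b = of("calls", "session-1", ts, tags, 4.0, 5.0);
        MergeableSummary merged = a.merge(b);
        System.out.println(merged.count + " values, sum " + merged.sum);
    }
}
```

Counts and sums add, and minima and maxima combine with `min` and `max`, so the merged result matches what a single machine would have computed over all of the values.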
Apache Spark integration
Our integration is compatible with Apache Spark 2.x (3.0 support is to come).
This example shows how we use WhyLogs to profile a dataset based on time and categorical information. The data is from the public dataset for Fire Department Calls & Incident.
import org.apache.spark.sql.functions._
// implicit import for WhyLogs to enable newProfilingSession API
import com.whylogs.spark.WhyLogs._

// load the data
val raw_df = spark.read.option("header", "true").csv("/databricks-datasets/timeseries/Fires/Fire_Department_Calls_for_Service.csv")
// parse the call date; lowercase yyyy is the calendar year (YYYY would be the week-based year)
val df = raw_df.withColumn("call_date", to_timestamp(col("Call Date"), "MM/dd/yyyy"))

val profiles = df.newProfilingSession("profilingSession") // start a new WhyLogs profiling job
  .withTimeColumn("call_date") // split dataset by call_date
  .groupBy("City", "Priority") // tag and group the data with categorical information
  .aggProfiles() // runs the aggregation. returns a dataframe of <timestamp, datasetProfile> entries
For further analysis, dataframes can be stored in a Parquet file, or collected to the driver if the number of entries is small enough.
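Conceptually, the profiling session above partitions rows by the time column and the group-by columns and builds one profile per partition. The plain-Java sketch below mimics that partitioning on an in-memory list, counting rows per (day, city, priority) key instead of building a full profile. The `Call` class and `countByDayAndGroup` are hypothetical, and bucketing by calendar day is an assumption; the actual Spark integration may split time differently.

```java
import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical sketch; it does not use Spark or the WhyLogs API.
final class GroupedProfilingSketch {

    static final class Call {
        final LocalDate callDate;
        final String city;
        final String priority;

        Call(LocalDate callDate, String city, String priority) {
            this.callDate = callDate;
            this.city = city;
            this.priority = priority;
        }
    }

    // Groups rows by day, city, and priority, and keeps one small summary (a row count) per group.
    static Map<String, Long> countByDayAndGroup(List<Call> calls) {
        return calls.stream().collect(Collectors.groupingBy(
                c -> c.callDate + "|" + c.city + "|" + c.priority,
                TreeMap::new,
                Collectors.counting()));
    }

    public static void main(String[] args) {
        List<Call> calls = Arrays.asList(
                new Call(LocalDate.of(2020, 1, 1), "San Francisco", "3"),
                new Call(LocalDate.of(2020, 1, 1), "San Francisco", "3"),
                new Call(LocalDate.of(2020, 1, 2), "San Francisco", "2"));
        countByDayAndGroup(calls).forEach((key, rows) -> System.out.println(key + " -> " + rows));
    }
}
```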
Building and Testing
Before building, update the proto submodule:
git submodule update --init --recursive
- To build, run
./gradlew build
- To test, run
./gradlew test