• Stars
    star
    1,152
  • Rank 38,948 (Top 0.8 %)
  • Language
    Java
  • License
    Other
  • Created almost 9 years ago
  • Updated 6 months ago

Reviews

There are no reviews yet. Be the first to send feedback to the community and the maintainers!

Repository Details

A Java package to automatically detect anomalies in large scale time-series data

Build Status

EGADS Java Library

EGADS (Extensible Generic Anomaly Detection System) is an open-source Java package to automatically detect anomalies in large scale time-series data. EGADS is meant to be a library that contains a number of anomaly detection techniques applicable to many use-cases in a single package with the only dependency being Java. EGADS works by first building a time-series model which is used to compute the expected value at time t. Then a number of errors E are computed by comparing the expected value with the actual value at time t. EGADS automatically determines thresholds on E and outputs the most probable anomalies. EGADS library can be used in a wide variety of contexts to detect outliers and change points in time-series that can have a various seasonal, trend and noise components.

How to get started

EGADS was designed as a self contained library that has a collection of time-series and anomaly detection models that are applicable to a wide-range of use cases. To compile the library into a single jar, clone the repo and type the following:

mvn clean compile assembly:single

You may have to set your JAVA_HOME variable to the appropriate JVM. To do this run:

export JAVA_HOME=/usr/lib/jvm/{JVM directory for desired version}

Usage

To run a simple example type:

java -Dlog4j.configurationFile=src/test/resources/log4j2.xml -cp target/egads-*-jar-with-dependencies.jar com.yahoo.egads.Egads src/test/resources/sample_config.ini src/test/resources/sample_input.csv

which produces the following picture (Note that you can enable this UI by setting OUTPUT config key to GUI in sample_config.ini).

gui

One can also specify config parameters on a command line. For example to do anomaly detection using Olympic Scoring as a time-series model and a density based method as an anomaly detection model use the following.

java -Dlog4j.configurationFile=src/test/resources/log4j2.xml -cp target/egads-*-jar-with-dependencies.jar com.yahoo.egads.Egads "MAX_ANOMALY_TIME_AGO:999999999;AGGREGATION:1;OP_TYPE:DETECT_ANOMALY;TS_MODEL:OlympicModel;AD_MODEL:ExtremeLowDensityModel;INPUT:CSV;OUTPUT:STD_OUT;BASE_WINDOWS:168;PERIOD:-1;NUM_WEEKS:3;NUM_TO_DROP:0;DYNAMIC_PARAMETERS:0;TIME_SHIFTS:0" src/test/resources/sample_input.csv

To run anomaly detection using no time-series model with an auto static threshold for anomaly detection, use the following:

java -Dlog4j.configurationFile=src/test/resources/log4j2.xml -cp target/egads-*-jar-with-dependencies.jar com.yahoo.egads.Egads "MAX_ANOMALY_TIME_AGO:999999999;AGGREGATION:1;OP_TYPE:DETECT_ANOMALY;TS_MODEL:NullModel;AD_MODEL:SimpleThresholdModel;SIMPLE_THRESHOLD_TYPE:AdaptiveMaxMinSigmaSensitivity;INPUT:CSV;OUTPUT:STD_OUT;AUTO_SENSITIVITY_ANOMALY_PCNT:0.2;AUTO_SENSITIVITY_SD:2.0" src/test/resources/sample_input.csv

To embed the EGADs library in an application, pull the compiled JAR from JCenter by adding the proper repository. For example in a Maven POM file add:

<repositories>
  <repository>
    <id>jcenter</id>
    <url>https://jcenter.bintray.com/</url>
  </repository>
</repositories>

Then import the dependency, e.g.:

<dependency>
  <groupId>com.yahoo.egads</groupId>
  <artifactId>egads</artifactId>
  <version>0.4.0</version>
</dependency>

Overview

While rapid advances in computing hardware and software have led to powerful applications, still hundreds of software bugs and hardware failures continue to happen in a large cluster compromising user experience and subsequently revenue. Non-stop systems have a strict uptime requirement and continuous monitoring of these systems is critical. From the data analysis point of view, this means non-stop monitoring of large volume of time-series data in order to detect potential faults or anomalies. Due to the large scale of the problem, human monitoring of this data is practically infeasible which leads us to automated anomaly detection. An anomaly, or an outlier, is a data point which is significantly different from the rest of the data. Generally, the data in most applications is created by one or more generating processes that reflect the functionality of a system.

When the underlying generating process behaves in an unusual way, it creates outliers. Fast and efficient identification of these outliers is useful for many applications including: intrusion detection, credit card fraud, sensor events, medical diagnoses, law enforcement and others. Current approaches in automated anomaly detection suffer from a large number of false positives which prohibit the usefulness of these systems in practice. Use-case, or category specific, anomaly detection models may enjoy a low false positive rate for a specific application, but when the characteristics of the time-series change, these techniques perform poorly without proper retraining.

EGADS (Extensible Generic Anomaly Detection System) enables the accurate and scalable detection of time-series anomalies. EGADS separates forecasting and anomaly detection two separate components which allows the person to add her own models into any of the components.

Architecture

The EGADS framework consists of two main components: the time-series modeling module (TMM), the anomaly detection module (ADM). Given a time-series the TMM component models the time-series producing an expected value later consumed by the ADM that computes anomaly scores. EGADS was built as a framework to be easily integrated into an existing monitoring infrastructure. At Yahoo, our internal Yahoo Monitoring Service (YMS) processes millions of data-points every second. Therefore, having a scalable, accurate and automated anomaly detection for YMS is critical. For this reason, EGADS can be compiled into a single light-weight jar and deployed easily at scale.

The TMM and ADM can be found under main/java/com/yahoo/egads/models.

The example of the models supported by TMM and ADM can be found in in the two table below. We expect this collection of models to grow as more contribution is put forward by the community.

List of current TimeSeries Models

models

List of current Anomaly Detection Models

admodels

Configuration

Below are the various configuration parameters supported by EGADS.

# Only show anomalies no older than this.
# If this is set to 0, then only output an anomaly
# if it occurs on the last time-stamp.
MAX_ANOMALY_TIME_AGO  99999

# Denotes how much should the time-series be aggregated by.
# If set to 1 or less, this setting is ignored.
AGGREGATION	1

# OP_TYPE specifies the operation type.
# Options: DETECT_ANOMALY,
#          UPDATE_MODEL,
#	   TRANSFORM_INPUT
OP_TYPE	DETECT_ANOMALY

# TS_MODEL specifies the time-series
# model type.
# Options: AutoForecastModel
#          DoubleExponentialSmoothingModel
#          MovingAverageModel
#          MultipleLinearRegressionModel
#          NaiveForecastingModel
#          OlympicModel
#          PolynomialRegressionModel
#          RegressionModel
#          SimpleExponentialSmoothingModel
#          TripleExponentialSmoothingModel
#          WeightedMovingAverageModel
# 	   SpectralSmoother
# 	   NullModel
TS_MODEL	OlympicModel

# AD_MODEL specifies the anomaly-detection
# model type.
# Options: ExtremeLowDensityModel
#          AdaptiveKernelDensityChangePointDetector
#          KSigmaModel
#          NaiveModel
#          DBScanModel
#          SimpleThresholdModel
AD_MODEL	ExtremeLowDensityModel

# Type of the simple threshold model.
# Options: AdaptiveMaxMinSigmaSensitivity
#          AdaptiveKSigmaSensitivity
# SIMPLE_THRESHOLD_TYPE

# Specifies the input src.
# Options: STDIN
#          CSV
INPUT	CSV

# Specifies the output src.
# Options: STD_OUT,
#          ANOMALY_DB
#          GUI
#          PLOT
OUTPUT  STD_OUT

# THRESHOLD specifies the threshold for the
# anomaly detection model.
# Comment to auto-detect all thresholds.
# Options: mapee,mae,smape,mape,mase.
# THRESHOLD mape#10,mase#15

#####################################
### Olympic Forecast Model Config ###
#####################################

# The possible time-shifts for Olympic Scoring.
TIME_SHIFTS 0,1

# The possible base windows for Olympic Scoring.
BASE_WINDOWS  24,168

# Period specifies the periodicity of the
# time-series (e.g., the difference between successive time-stamps).
# Options: (numeric)
#          0 - auto detect.
#          -1 - disable.
PERIOD	-1


# NUM_WEEKS specifies the number of weeks
# to use in OlympicScoring.
NUM_WEEKS 8

# NUM_TO_DROP specifies the number of
# highest and lowest points to drop.
NUM_TO_DROP 0

# If dynamic parameters is set to 1, then
# EGADS will dynamically vary parameters (NUM_WEEKS)
# to produce the best fit.
DYNAMIC_PARAMETERS  0

###################################################
### ExtremeLowDensityModel & DBScanModel Config ###
###################################################

# Denotes the expected % of anomalies
# in your data.
AUTO_SENSITIVITY_ANOMALY_PCNT	0.01

# Refers to the cluster standard deviation.
AUTO_SENSITIVITY_SD	3.0

############################
### NaiveModel Config ###
############################

# Window size where the spike is to be found.
WINDOW_SIZE	0.1

#######################################################
### AdaptiveKernelDensityChangePointDetector Config ###
#######################################################

# Change point detection parameters
PRE_WINDOW_SIZE	48
POST_WINDOW_SIZE	48
CONFIDENCE	0.8

###############################
### SpectralSmoother Config ###
###############################

# WINDOW_SIZE should be greater than the size of longest important seasonality.
# By default it is set to 192 = 8 * 24 which is worth of 8 days (> 1 week) for hourly time-series.
WINDOW_SIZE 192

# FILTERING_METHOD specifies the filtering method for Spectral Smoothing
# Options:  		GAP_RATIO		(Recommended: FILTERING_PARAM = 0.01)
#			EIGEN_RATIO		(Recommended: FILTERING_PARAM = 0.1)
#			EXPLICIT		(Recommended: FILTERING_PARAM = 10)
#			K_GAP			(Recommended: FILTERING_PARAM = 8)
#			VARIANCE		(Recommended: FILTERING_PARAM = 0.99)
#			SMOOTHNESS		(Recommended: FILTERING_PARAM = 0.97)
FILTERING_METHOD GAP_RATIO

FILTERING_PARAM 0.01

Contributions

  1. Clone your fork
  2. Hack away
  3. If you are adding new functionality, document it in the README
  4. Verify your code by running mvn package and adding additional tests.
  5. Push the branch up to GitHub
  6. Send a pull request to the yahoo/egads project.

We actively welcome contributions. If you don't know where to start, try checking out the issue list and fixing up the place. Or, you can add a model - a goal of this project is to have a robust, lightweight and dependency-free set of models to choose from that are ready to be deployed in production.

References

Generic and Scalable Framework for Automated Time-series Anomaly Detection by Nikolay Laptev, Saeed Amizadeh, Ian Flint , KDD 2015 (August 10, 2015)

Citation

If you use EGADS in your projects, please cite: Generic and Scalable Framework for Automated Time-series Anomaly Detection by Nikolay Laptev, Saeed Amizadeh, Ian Flint , KDD 2015

BibTeX:

@inproceedings{laptev2015generic,
		title={Generic and Scalable Framework for Automated Time-series Anomaly Detection},
		author={Laptev, Nikolay and Amizadeh, Saeed and Flint, Ian},
		booktitle={Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining},
		pages={1939--1947},
		year={2015},
		organization={ACM}
}

License

Code licensed under the GPL License. See LICENSE file for terms.

More Repositories

1

CMAK

CMAK is a tool for managing Apache Kafka clusters
Scala
11,625
star
2

open_nsfw

Not Suitable for Work (NSFW) classification using deep neural network Caffe models.
Python
5,791
star
3

TensorFlowOnSpark

TensorFlowOnSpark brings TensorFlow programs to Apache Spark clusters.
Python
3,860
star
4

serialize-javascript

Serialize JavaScript to a superset of JSON that includes regular expressions and functions.
JavaScript
2,785
star
5

gryffin

Gryffin is a large scale web security scanning platform.
Go
2,075
star
6

fluxible

A pluggable container for universal flux applications.
JavaScript
1,815
star
7

AppDevKit

AppDevKit is an iOS development library that provides developers with useful features to fulfill their everyday iOS app development needs.
Objective-C
1,439
star
8

mysql_perf_analyzer

MySQL performance monitoring and analysis.
Java
1,436
star
9

squidb

SquiDB is a SQLite database library for Android and iOS
Java
1,313
star
10

CaffeOnSpark

Distributed deep learning on Hadoop and Spark clusters.
Jupyter Notebook
1,262
star
11

react-stickynode

A performant and comprehensive React sticky component.
JavaScript
1,227
star
12

blink-diff

A lightweight image comparison tool.
JavaScript
1,191
star
13

elide

Elide is a Java library that lets you stand up a GraphQL/JSON-API web service with minimal effort.
Java
985
star
14

vssh

Go Library to Execute Commands Over SSH at Scale
Go
930
star
15

webseclab

set of web security test cases and a toolkit to construct new ones
Go
915
star
16

kubectl-flame

Kubectl plugin for effortless profiling on kubernetes
Go
746
star
17

streaming-benchmarks

Benchmarks for Low Latency (Streaming) solutions including Apache Storm, Apache Spark, Apache Flink, ...
Jupyter Notebook
621
star
18

lopq

Training of Locally Optimized Product Quantization (LOPQ) models for approximate nearest neighbor search of high dimensional data in Python and Spark.
Python
558
star
19

redislite

Redis in a python module.
Python
556
star
20

HaloDB

A fast, log structured key-value store.
Java
486
star
21

hecate

Automagically generate thumbnails, animated GIFs, and summaries from videos
C++
468
star
22

fetchr

Universal data access layer for web applications.
JavaScript
447
star
23

storm-yarn

Storm-yarn enables Storm clusters to be deployed into machines managed by Hadoop YARN.
Java
418
star
24

react-i13n

A performant, scalable and pluggable approach to instrumenting your React application.
JavaScript
383
star
25

FEL

Fast Entity Linker Toolkit for training models to link entities to KnowledgeBase (Wikipedia) in documents and queries.
Java
334
star
26

monitr

A Node.js process monitoring tool.
C++
312
star
27

Oak

A Scalable Concurrent Key-Value Map for Big Data Analytics
Java
266
star
28

TDOAuth

A BSD-licensed single-header-single-source OAuth1 implementation.
Swift
250
star
29

routr

A component that provides router related functionalities for both client and server.
JavaScript
246
star
30

mysql_partition_manager

MySQL Partition Manager
SQLPL
210
star
31

l3dsr

Direct Server Return load balancing across Layer 3 boundaries.
Shell
190
star
32

dnscache

dnscache for Node
JavaScript
184
star
33

object_relation_transformer

Implementation of the Object Relation Transformer for Image Captioning
Python
174
star
34

check-log4j

To determine if a host is vulnerable to log4j CVE‐2021‐44228
Shell
173
star
35

fili

Easily make RESTful web services for time series reporting with Big Data analytics engines like Druid and SQL Databases.
Java
171
star
36

sherlock

Sherlock is an anomaly detection service built on top of Druid
Java
149
star
37

YMTreeMap

High performance Swift treemap layout engine for iOS and macOS.
Swift
129
star
38

maha

A framework for rapid reporting API development; with out of the box support for high cardinality dimension lookups with druid.
Scala
127
star
39

covid-19-data

COVID-19 datasets are constructed entirely from primary (government and public agency) sources
110
star
40

subscribe-ui-event

Subscribe-ui-event provides a cross-browser and performant way to subscribe to browser UI Events.
JavaScript
109
star
41

jafar

🌟!(Just another form application renderer)
JavaScript
109
star
42

panoptes

A Global Scale Network Telemetry Ecosystem
Python
98
star
43

reginabox

Registry In A Box
JavaScript
97
star
44

preceptor

Test runner and aggregator
JavaScript
85
star
45

hive-funnel-udf

Hive UDFs for funnel analysis
Java
85
star
46

SparkADMM

Generic Implementation of Consensus ADMM over Spark
Python
82
star
47

react-cartographer

Generic component for displaying Yahoo / Google / Bing maps.
JavaScript
82
star
48

graphkit

A lightweight Python module for creating and running ordered graphs of computations.
Python
80
star
49

storm-perf-test

A simple storm performance/stress test
Java
76
star
50

UDPing

UDPing measures latency and packet loss across a link.
C++
72
star
51

bgjs

TypeScript
66
star
52

YMCache

YMCache is a lightweight object caching solution for iOS and Mac OS X that is designed for highly parallel access scenarios.
Objective-C
63
star
53

ycb

A multi-dimensional configuration library that builds bundles from resource files describing a variety of values.
JavaScript
63
star
54

ariel

Ariel is an AWS Lambda designed to collect, analyze, and make recommendations about Reserved Instances for EC2.
Python
62
star
55

validatar

Functional testing framework for Big Data pipelines.
Java
58
star
56

imapnio

Java imap nio client that is designed to scale well for thousands of connections per machine and reduce contention when using large number of threads and cpus.
Java
54
star
57

serviceping

A ping like utility for tcp services
Python
50
star
58

express-busboy

A simple body-parser like module for express that uses connect-busboy under the hood.
JavaScript
44
star
59

covid-19-api

Yahoo Knowledge COVID-19 API provides JSON-API and GraphQL interfaces to access COVID-19 publicly sourced data
JavaScript
43
star
60

proxy-verifier

Proxy Verifier is an HTTP replay tool designed to verify the behavior of HTTP proxies. It builds a verifier-client binary and a verifier-server binary which each read a set of YAML or JSON files that specify the HTTP traffic for the two to exchange.
C++
42
star
61

panoptes-stream

A cloud native distributed streaming network telemetry.
Go
41
star
62

yql-plus

The YQL+ parser, execution engine, and source SDK.
Java
40
star
63

context-parser

A robust HTML5 context parser that parses HTML 5 web pages and reports the execution context of each character.
HTML
40
star
64

cocoapods-blocklist

A CocoaPods plugin used to check a project against a list of pods that you do not want included in your build. Security is the primary use, but keeping specific pods that have conflicting licenses is another possible use.
Ruby
39
star
65

covid-19-dashboard

Source code for the Yahoo Knowledge Graph COVID-19 Dashboard
JavaScript
38
star
66

FmFM

Python
36
star
67

ember-gridstack

Ember components to build drag-and-drop multi-column grids powered by gridstack.js
JavaScript
36
star
68

VerizonVideoPartnerSDK-controls-ios

Public iOS implementation of the OneMobileSDK default custom controls interface... demonstrating how customers can implement their own custom video player controls.
Swift
35
star
69

k8s-namespace-guard

K8s - Admission controller for guarding namespace
Go
34
star
70

fluxible-action-utils

Utility methods to aid in writing actions for fluxible based applications.
JavaScript
34
star
71

parsec

A collection of libraries and utilities to simplify the process of building web service applications.
Java
34
star
72

mod_statuspage

Simple express/connect middleware to provide a status page with following details of the nodejs host.
JavaScript
32
star
73

bftkv

A distributed key-value storage that's tolerant to Byzantine fault.
JavaScript
30
star
74

protractor-retry

Use protractor features to automatically re-run failed tests with a specific configurable number of attempts.
JavaScript
28
star
75

cubed

Data Mart As A Service
Java
27
star
76

spivak

Python
27
star
77

jsx-test

An easy way to test your React Components (`.jsx` files).
JavaScript
27
star
78

ycb-java

YCB Java
Java
27
star
79

fluxible-immutable-utils

A mixin that provides a convenient interface for using Immutable.js inside react components.
JavaScript
25
star
80

SubdomainSleuth

Scanner to identify dangling DNS records and subdomain takeovers
Go
25
star
81

maaf

Modality-Agnostic Attention Fusion for visual search with text feedback
Python
25
star
82

node-limits

Simple express/connect middleware to set limit to upload size, set request timeout etc.
JavaScript
24
star
83

GitHub-Security-Alerts-Workflow

Automation to Incorporate GitHub Security Alerts Into your Business Workflow
Python
23
star
84

bandar-log

Monitoring tool to measure flow throughput of data sources and processing components that are part of Data Ingestion and ETL pipelines.
Scala
21
star
85

fumble

Simple error objects in node. Created specifically to be used with https://github.com/yahoo/fetchr and based on https://github.com/hapijs/boom
JavaScript
21
star
86

express-csp

Express extension for Content Security Policy
JavaScript
19
star
87

elide-js

Elide is a library that makes it easy to talk to a JSON API compliant backend.
JavaScript
18
star
88

Zake

A python package that works to provide a nice set of testing utilities for the kazoo library.
Python
18
star
89

npm-auto-version

Automatically generate new NPM versions based on Git tags when publishing
JavaScript
18
star
90

httpmi

An HTTP proxy for IPMI commands.
Python
17
star
91

hodman

Selenium object library
JavaScript
17
star
92

cerebro

JavaScript
17
star
93

SongbirdCharts

Allows for other apps to render accessible audio charts
Kotlin
17
star
94

Override

In app feature flag management
Swift
16
star
95

ychaos

YChaos - The Resilience Framework by Yahoo!
Python
16
star
96

elide-spring-boot-example

Spring Boot example using the Elide framework.
Java
15
star
97

parsec-libraries

Tools to simplify deploying web services with Parsec.
Java
15
star
98

node-info

Node environment information
JavaScript
14
star
99

NetCHASM

An Automated health checking and server status verification system.
C++
13
star
100

invirtualenv

Tool to deploy python virtualenvs
Python
13
star