A simple demonstration of sub-sequence sampling as used for anomaly detection with EKG signals

Anomaly Detection using Sub-sequence Clustering

This project provides a demonstration of a simple time-series anomaly detector.

The idea is to use sub-sequence clustering of an EKG signal to reconstruct the EKG. The difference between the original and the reconstruction can be used as a measure of how closely the signal resembles a prototypical EKG. Poor reconstruction can thus be used to find anomalies in the original signal.
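
The reconstruction step can be illustrated with a minimal sketch. It assumes that a dictionary of unit-length patterns has already been learned by clustering, and it approximates a single window of the signal by the one pattern whose scaled copy fits best. The class and method names (`NearestPattern`, `approximate`, `error`) are hypothetical and are not taken from `com.tdunning.sparse.Learn`.

```java
class NearestPattern {
    // Inner product of two equal-length vectors.
    static double dot(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Best approximation of the window by one scaled dictionary pattern.
    // Assumption: every pattern has unit length, so the least-squares scale
    // is the dot product and the largest |scale| gives the smallest error.
    static double[] approximate(double[] window, double[][] dictionary) {
        double bestScale = 0;
        double[] best = dictionary[0];
        for (double[] pattern : dictionary) {
            double scale = dot(window, pattern);
            if (Math.abs(scale) > Math.abs(bestScale)) {
                bestScale = scale;
                best = pattern;
            }
        }
        double[] result = new double[window.length];
        for (int i = 0; i < window.length; i++) {
            result[i] = bestScale * best[i];
        }
        return result;
    }

    // Euclidean distance between the window and its reconstruction.
    static double error(double[] window, double[][] dictionary) {
        double[] r = approximate(window, dictionary);
        double sum = 0;
        for (int i = 0; i < window.length; i++) {
            double d = window[i] - r[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Toy dictionary with two unit-length patterns (illustrative only).
        double[][] dictionary = {{1, 0}, {0, 1}};
        System.out.println(error(new double[]{3, 0.5}, dictionary));
    }
}
```

A window that looks like one of the learned patterns gives a small error, while a window unlike anything seen in training gives a large one. That error is the quantity used to flag anomalies.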

The data for this demo are taken from PhysioNet. See http://physionet.org/physiobank/database/#ecg-databases

The particular data set used for this demo is the Apnea ECG database, which can be found at

http://physionet.org/physiobank/database/apnea-ecg/

All necessary data for this demo is included as a resource in the source code (see src/main/resources/a02.dat). You can find the original version of the training data at

http://physionet.org/physiobank/database/apnea-ecg/a02.dat

This file is 6.1 MB in size and contains several hours of recorded EKG data from a patient in a sleep apnea study. It holds 3.2 million samples, of which we use the first 200,000 for training.

Installing and Running the Demo

The class com.tdunning.sparse.Learn goes through the steps required to read and process this data to produce a simple anomaly detector. The output of this program consists of the clustering itself (in dict.tsv) as well as a reconstruction of the test signal (in trace.tsv). These outputs can be visualized using the provided R script.

To compile and run the demo:

mvn -q exec:java -Dexec.mainClass=com.tdunning.sparse.Learn

To produce the figures showing how the anomalies are detected:

rm *.pdf ; Rscript figures.r

What the Figures Show

Figure 1 shows how an ordinary, non-anomalous signal (top line) is reconstructed (middle line) with relatively small errors. Figures 2, 3 and 4 show magnified views of successive 5-second periods.

The distribution of the reconstruction error in Figure 5 is distinctly not normal. Instead, it has longer tails than a normal distribution would have.

Figure 6 shows a histogram of the error. The standard deviation of the error magnitude is about 5, but nearly 2% of the errors are larger than 15 (3 standard deviations). That is far too many for a normal distribution, which would put fewer than 0.3% of the errors beyond 3 standard deviations. Even more extreme, 50 samples per million are larger than 20 standard deviations.
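
For reference, the corresponding two-sided tail probabilities of a standard normal variable Z can be worked out as follows. This is a rough check rather than part of the demo. Here Φ is the standard normal cumulative distribution function and φ its density, and the second line uses the usual large-argument approximation of the tail.

```latex
\begin{align*}
P(|Z| > 3)  &= 2\,\bigl(1 - \Phi(3)\bigr) \approx 0.0027 \quad (0.27\%) \\
P(|Z| > 20) &= 2\,\bigl(1 - \Phi(20)\bigr) \approx \frac{2\,\varphi(20)}{20} \approx 5.5 \times 10^{-89}
\end{align*}
```

Against these values, the observed rates of nearly 2% beyond 3 standard deviations and 50 per million beyond 20 standard deviations show how heavy the tails of the error distribution are.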

Scanning for errors greater than 100 takes us to a point 100 seconds into the recording where the error spikes sharply. Figure 7 shows the error and Figure 8 shows the original and reconstructed signal for this 5-second period. The reconstruction clearly isn't capturing the negative excursion of the original signal, but it isn't clear why. Figure 9 shows a magnified view of the 1 second around the anomaly, where we can see that the problem is a double beat.
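
A scan of this kind can be sketched in a few lines. The example below reports the time at which each run of over-threshold errors begins. The class name `ErrorScan`, the method `anomalyTimes`, and the sampling rate of 100 samples per second are assumptions made for illustration. Check the rate against the record header before relying on the reported times.

```java
import java.util.ArrayList;
import java.util.List;

class ErrorScan {
    // Returns the start time, in seconds, of every run of consecutive samples
    // whose absolute reconstruction error exceeds the threshold.
    static List<Double> anomalyTimes(double[] error, double threshold, double samplesPerSecond) {
        List<Double> times = new ArrayList<>();
        boolean inRun = false;
        for (int i = 0; i < error.length; i++) {
            boolean over = Math.abs(error[i]) > threshold;
            if (over && !inRun) {
                // First sample of a new run: convert its index to seconds.
                times.add(i / samplesPerSecond);
            }
            inRun = over;
        }
        return times;
    }

    public static void main(String[] args) {
        // Toy error trace (illustrative values, not taken from trace.tsv).
        double[] error = {0, 5, 150, 200, 3, -120};
        // Assumed sampling rate: 100 samples per second.
        System.out.println(anomalyTimes(error, 100, 100));
    }
}
```

Reporting only the start of each run keeps a single spike that spans many samples from showing up as many separate anomalies.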

Scanning for more anomalies takes us to 240 seconds into the trace, where there is a clear signal acquisition malfunction, as shown in Figures 10 and 11.

The 64 most commonly used sub-sequence clusters are shown in Figure 12. The left-most column shows how translations of the same portion of the heartbeat show up as clusters in the signal dictionary. These patterns are scaled, shifted and added to reconstruct the original signal.
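
One way to picture the "shifted and added" part is overlap-add with tapered windows. The README does not spell out the windowing, so the sketch below assumes half-overlapping windows with a sin² taper, chosen because shifted copies of that taper sum to exactly 1. The names `OverlapAdd`, `taper` and `reconstruct` are hypothetical, and the per-window approximation is passed in as a function so that any pattern-matching step can be plugged in.

```java
import java.util.function.UnaryOperator;

class OverlapAdd {
    // sin^2 taper of length n. With a hop of n/2, the taper and its shifted
    // copy add up to sin^2 + cos^2 = 1 at every interior sample.
    static double[] taper(int n) {
        double[] w = new double[n];
        for (int i = 0; i < n; i++) {
            double s = Math.sin(Math.PI * i / n);
            w[i] = s * s;
        }
        return w;
    }

    // Cuts the signal into half-overlapping tapered windows, approximates
    // each window, and adds the approximations back at their positions.
    static double[] reconstruct(double[] signal, int n, UnaryOperator<double[]> approximate) {
        if (n % 2 != 0) {
            throw new IllegalArgumentException("window length must be even");
        }
        double[] w = taper(n);
        double[] out = new double[signal.length];
        for (int start = 0; start + n <= signal.length; start += n / 2) {
            double[] piece = new double[n];
            for (int i = 0; i < n; i++) {
                piece[i] = w[i] * signal[start + i];
            }
            double[] approx = approximate.apply(piece);
            for (int i = 0; i < n; i++) {
                out[start + i] += approx[i];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] signal = {1, 2, 3, 4, 5, 6, 7, 8};
        // Identity approximation: shows that the windowing itself loses nothing
        // away from the edges of the signal.
        double[] out = reconstruct(signal, 4, piece -> piece);
        System.out.println(java.util.Arrays.toString(out));
    }
}
```

With an exact approximation the interior of the signal is recovered unchanged, so any remaining error in a real run comes from how well the dictionary patterns match each window rather than from the windowing.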
