(MLSys' 21) An Acceleration System for Large-scale Unsupervised Heterogeneous Outlier Detection (Anomaly Detection)

SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection


News: SUOD is now integrated into PyOD and can be invoked there by following the SUOD example. In a nutshell, you initialize a few outlier detectors and then use SUOD for collective training and prediction:

from pyod.models.copod import COPOD
from pyod.models.iforest import IForest
from pyod.models.lof import LOF
from pyod.models.suod import SUOD

# initialize a group of outlier detectors for acceleration
detector_list = [LOF(n_neighbors=15), LOF(n_neighbors=20),
                 LOF(n_neighbors=25), LOF(n_neighbors=35),
                 COPOD(), IForest(n_estimators=100),
                 IForest(n_estimators=200)]

# decide the number of parallel processes and the combination method;
# clf can then be used like any other outlier detection model
clf = SUOD(base_estimators=detector_list, n_jobs=2, combination='average',
           verbose=False)
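
To give a feel for what combination='average' implies, here is a minimal sketch of score averaging across detectors. It is an illustration only, not PyOD's or SUOD's implementation: the helper names zscore and combine_average are hypothetical, and standardizing each detector's scores before averaging is an assumption made so that detectors with different score scales contribute comparably.

```python
from statistics import mean, pstdev


def zscore(scores):
    """Standardize one detector's scores to zero mean and unit variance."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # A constant detector carries no ranking information.
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def combine_average(score_lists):
    """Average standardized scores sample by sample across detectors."""
    standardized = [zscore(scores) for scores in score_lists]
    return [mean(column) for column in zip(*standardized)]


# Hypothetical raw scores from two detectors on the same three samples.
scores_a = [0.1, 0.2, 0.9]
scores_b = [10.0, 12.0, 30.0]
combined = combine_average([scores_a, scores_b])

# The third sample is the most outlying under both detectors,
# so it also receives the highest combined score.
most_outlying = combined.index(max(combined))
```

Without the standardization step, the second detector would dominate the average simply because its scores are larger in magnitude.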

Background: Outlier detection (OD) is a key data mining task for identifying abnormal objects among general samples, with numerous high-stakes applications including fraud detection and intrusion detection. Due to the lack of ground truth labels, practitioners often have to build a large number of unsupervised models that are heterogeneous (i.e., different algorithms and hyperparameters) for further combination and analysis with ensemble learning, rather than relying on a single model. However, this yields severe scalability issues on high-dimensional, large datasets.

SUOD (Scalable Unsupervised Outlier Detection) is an acceleration framework for large-scale unsupervised heterogeneous outlier detector training and prediction. It accelerates three complementary aspects: dimensionality reduction for high-dimensional data, model approximation for complex models, and execution efficiency improvement for workload imbalance within distributed systems, all while controlling detection performance degradation.
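
The workload imbalance aspect can be illustrated with a generic greedy heuristic. The sketch below is not SUOD's scheduler: it assumes each detector's training cost has already been estimated (the costs values are hypothetical), and it assigns detectors, from most to least expensive, to whichever worker currently has the smallest total load.

```python
import heapq


def balance_tasks(costs, n_workers):
    """Greedy assignment: most expensive task first, to the least-loaded worker."""
    heap = [(0.0, worker) for worker in range(n_workers)]
    heapq.heapify(heap)
    assignment = [[] for _ in range(n_workers)]

    by_cost = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    for task in by_cost:
        load, worker = heapq.heappop(heap)
        assignment[worker].append(task)
        heapq.heappush(heap, (load + costs[task], worker))

    loads = [sum(costs[t] for t in tasks) for tasks in assignment]
    return assignment, loads


# Hypothetical estimated costs: two expensive detectors, four cheap ones.
costs = [9.0, 8.0, 1.0, 1.0, 1.0, 1.0]
assignment, loads = balance_tasks(costs, n_workers=2)
# assignment == [[0, 3, 5], [1, 2, 4]], loads == [11.0, 10.0]

# Splitting the list in its given order would instead give loads of 18 and 3.
naive_loads = [sum(costs[:3]), sum(costs[3:])]
```

With heterogeneous detectors, an even split by model count can leave one worker with most of the runtime, which is the imbalance this kind of scheduling tries to avoid.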

Since its inception in Sep 2019, SUOD has been successfully used in various academic research projects and industry applications with more than 700,000 downloads, including PyOD [2] and IQVIA medical claim analysis.

SUOD System

SUOD features:

  • Unified APIs, detailed documentation, and examples for ease of use.
  • Optimized performance with JIT compilation and parallelization where possible, using numba and joblib.
  • Full compatibility with the models in PyOD.
  • Customizable modules and flexible design: each module may be turned on/off or replaced entirely by custom functions.

Roadmap:

  • Provide more choices of distributed schedulers (adapted for SUOD), e.g., batch sampling, Sparrow (SOSP'13), Pigeon (SoCC'19) etc.
  • Enable the flexibility of selecting data projection methods.

API Demo:

from pyod.models.hbos import HBOS
from pyod.models.knn import KNN
from pyod.models.lof import LOF
from pyod.models.ocsvm import OCSVM
from pyod.models.pca import PCA
from suod.models.base import SUOD

# expected fraction of outliers; 0.1 is an illustrative value
contamination = 0.1

# X_train and X_test are feature matrices you supply (n_samples x n_features)

# initialize a set of base outlier detectors to train and predict on
base_estimators = [
    LOF(n_neighbors=5, contamination=contamination),
    LOF(n_neighbors=15, contamination=contamination),
    LOF(n_neighbors=25, contamination=contamination),
    HBOS(contamination=contamination),
    PCA(contamination=contamination),
    OCSVM(contamination=contamination),
    KNN(n_neighbors=5, contamination=contamination),
    KNN(n_neighbors=15, contamination=contamination),
    KNN(n_neighbors=25, contamination=contamination)]

# initialize a SUOD model
model = SUOD(base_estimators=base_estimators, n_jobs=6,  # number of workers
             rp_flag_global=True,  # global flag for random projection
             bps_flag=True,  # global flag for balanced parallel scheduling
             approx_flag_global=False,  # global flag for model approximation
             contamination=contamination)

model.fit(X_train)  # fit all base models on X_train
model.approximate(X_train)  # conduct model approximation if it is enabled
predicted_labels = model.predict(X_test)  # predict labels
predicted_scores = model.decision_function(X_test)  # predict scores
predicted_probs = model.predict_proba(X_test)  # predict outlying probability

The corresponding paper is published at the Conference on Machine Learning and Systems (MLSys). See https://mlsys.org/ for more information.

If you use SUOD in a scientific publication, we would appreciate citations to the following paper:

@inproceedings{zhao2021suod,
  title={SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection},
  author={Zhao, Yue and Hu, Xiyang and Cheng, Cheng and Wang, Cong and Wan, Changlin and Wang, Wen and Yang, Jianing and Bai, Haoping and Li, Zheng and Xiao, Cao and others},
  journal={Proceedings of Machine Learning and Systems},
  year={2021}
}

Zhao, Y., Hu, X., Cheng, C., Wang, C., Wan, C., Wang, W., Yang, J., Bai, H., Li, Z., Xiao, C. and Wang, Y., 2021. SUOD: Accelerating Large-scale Unsupervised Heterogeneous Outlier Detection. Proceedings of Machine Learning and Systems (MLSys).

Installation

It is recommended to use pip for installation. Please make sure the latest version is installed, as suod is updated frequently:

pip install suod            # normal install
pip install --upgrade suod  # or update if needed
pip install --pre suod      # or include pre-release version for new features

Alternatively, you can clone the repository and install it from source with pip:

git clone https://github.com/yzhao062/suod.git
cd suod
pip install .

Required Dependencies:

  • Python 3.5, 3.6, or 3.7
  • joblib
  • numpy>=1.13
  • pandas (optional for building the cost forecast model)
  • pyod
  • scipy>=0.19.1
  • scikit_learn>=0.19.1

Note on Python 2: Maintenance of Python 2.7 ended on January 1, 2020 (see official announcement). To stay consistent with this change and with suod's dependencies, e.g., scikit-learn, SUOD only supports Python 3.5+. We encourage you to use Python 3.5 or newer for the latest functions and bug fixes. More information can be found at Moving to require Python 3.


API Cheatsheet & Reference

Full API Reference: https://suod.readthedocs.io/en/latest/api.html

  • fit(X, y): Fit the estimator. y is optional for unsupervised methods.
  • approximate(X): Use supervised models to approximate unsupervised base detectors. fit should be invoked first.
  • predict(X): Predict outlier labels for the samples in X once the estimator is fitted.
  • decision_function(X): Predict outlier scores for the samples in X once the estimator is fitted.
  • predict_proba(X): Predict the outlying probability of the samples in X once the estimator is fitted.
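
To show how scores, labels, and probabilities can relate to one another, here is a minimal sketch under stated assumptions. It is not SUOD's or PyOD's code: the helper names are hypothetical, the label threshold is assumed to be chosen so that roughly a contamination fraction of the training scores are flagged, and probabilities are assumed to come from min-max scaling against the training score range.

```python
def fit_threshold(train_scores, contamination=0.1):
    """Return the lowest training score that is still flagged as an outlier."""
    ordered = sorted(train_scores)
    n_outliers = max(1, int(len(ordered) * contamination))
    return ordered[-n_outliers]


def to_labels(scores, threshold):
    """Label a sample 1 (outlier) if its score reaches the threshold, else 0."""
    return [1 if s >= threshold else 0 for s in scores]


def to_probabilities(scores, train_scores):
    """Min-max scale scores against the training range, clipped to [0, 1]."""
    low, high = min(train_scores), max(train_scores)
    span = high - low
    if span == 0:
        return [0.0 for _ in scores]
    return [min(1.0, max(0.0, (s - low) / span)) for s in scores]


# Hypothetical outlier scores: higher means more abnormal.
train_scores = [0.1, 0.2, 0.15, 0.3, 0.25, 0.2, 0.1, 0.35, 0.3, 2.0]
test_scores = [0.2, 2.5]

threshold = fit_threshold(train_scores, contamination=0.1)
labels = to_labels(test_scores, threshold)
probs = to_probabilities(test_scores, train_scores)
```

The point of the sketch is only that labels depend on the contamination setting while raw scores do not, so it is often worth inspecting the scores before trusting the labels.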

Examples

All three modules can be executed separately; the demo code is in /examples/module_examples/{M1_RP, M2_BPS, and M3_PSA}. For instance, you could navigate to /M1_RP/demo_random_projection.py. All demo file names follow the pattern "demo_*.py".

Examples for the full framework can be found under the /examples folder: run "demo_base.py" for a simplified example and "demo_full.py" for a full example.

Note that SUOD performs best when multiple cores are available.
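
As a rough illustration of the idea behind the random projection module, the sketch below projects data onto fewer dimensions with a Gaussian random matrix, in the spirit of the Johnson-Lindenstrauss lemma [1]. It is not the code in demo_random_projection.py: the function name is hypothetical, and it uses only the standard library to stay self-contained, whereas a practical implementation would use numpy or scikit-learn.

```python
import math
import random


def gaussian_random_projection(X, n_components, seed=0):
    """Project rows of X onto n_components dimensions with a random Gaussian matrix."""
    rng = random.Random(seed)
    n_features = len(X[0])
    # Scaling by 1/sqrt(n_components) keeps squared distances unbiased in expectation.
    scale = 1.0 / math.sqrt(n_components)
    R = [[rng.gauss(0.0, 1.0) * scale for _ in range(n_components)]
         for _ in range(n_features)]
    return [[sum(value * R[j][c] for j, value in enumerate(row))
             for c in range(n_components)]
            for row in X]


# Hypothetical data: 5 samples with 50 features, reduced to 10 features.
data_rng = random.Random(42)
X = [[data_rng.random() for _ in range(50)] for _ in range(5)]
X_low = gaussian_random_projection(X, n_components=10)
print(len(X_low), len(X_low[0]))  # 5 10
```

Detectors trained on X_low see far fewer features, which is where the speedup for high-dimensional data comes from; the price is some distortion of pairwise distances.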


Model Save & Load

SUOD takes a similar approach to sklearn regarding model persistence. See sklearn's model persistence documentation for details.

We recommend using joblib or pickle for saving and loading SUOD models. See "examples/demo_model_save_load.py" for an example. In short, it is as simple as below:

from joblib import dump, load

# save the fitted model
dump(model, 'model.joblib')
# load the model
model = load('model.joblib')
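
The same round trip works with the standard library's pickle module. The sketch below uses a plain dictionary as a stand-in for a fitted model (an assumption made so the example runs without SUOD installed); with a real model you would pass the fitted SUOD object instead. The file name model.pkl is arbitrary.

```python
import os
import pickle
import tempfile

# Stand-in for a fitted model; replace with your fitted SUOD object.
state = {"n_jobs": 2, "combination": "average", "threshold_": 0.5}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "model.pkl")

    # save the object
    with open(path, "wb") as f:
        pickle.dump(state, f)

    # load it back
    with open(path, "rb") as f:
        restored = pickle.load(f)
```

Only load pickle or joblib files from sources you trust, since unpickling can execute arbitrary code.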

More to come... Last updated on Jan 14th, 2021.

Feel free to star and watch the repository for future updates :)


References

[1] Johnson, W.B. and Lindenstrauss, J., 1984. Extensions of Lipschitz mappings into a Hilbert space. Contemporary mathematics, 26(189-206), p.1.
[2] Zhao, Y., Nasrullah, Z. and Li, Z., 2019. PyOD: A Python Toolbox for Scalable Outlier Detection. Journal of Machine Learning Research, 20, pp.1-7.
