  • Stars: 198
  • Rank: 196,898 (Top 4%)
  • Language: Jupyter Notebook
  • License: MIT License
  • Created over 3 years ago
  • Updated over 1 year ago

Repository Details

Data stream analytics: Implement online learning methods to address concept drift and model drift in data streams using the River library. Code for the paper entitled "PWPAE: An Ensemble Framework for Concept Drift Adaptation in IoT Data Streams" published in IEEE GlobeCom 2021.

PWPAE-Concept-Drift-Detection-and-Adaptation

This is the code for the paper entitled "PWPAE: An Ensemble Framework for Concept Drift Adaptation in IoT Data Streams" published in 2021 IEEE Global Communications Conference (GLOBECOM), doi: 10.1109/GLOBECOM46510.2021.9685338.
Authors: Li Yang, Dimitrios Michael Manias, and Abdallah Shami
Organization: The Optimized Computing and Communications (OC2) Lab, ECE Department, Western University

This repository also introduces concept drift definitions and online machine learning methods for data stream analytics using the River library.

A complete tutorial covering the full pipeline for concept drift, online machine learning, and data stream analytics, including dynamic data pre-processing, drift-based dynamic feature selection, dynamic model learning & selection, and online ensemble models, can be found in: MSANA-Online-Data-Stream-Analytics-And-Concept-Drift-Adaptation

Another code tutorial on concept drift, online machine learning, and data stream analytics can be found in: OASW-Concept-Drift-Detection-and-Adaptation

Concept Drift

In non-stationary and dynamic environments, such as IoT environments, the distribution of input data often changes over time, a phenomenon known as concept drift. Concept drift degrades the performance of a model trained on earlier data. Traditional offline machine learning (ML) models cannot deal with concept drift, making it necessary to develop online adaptive analytics models that can adapt to both predictable and unpredictable changes in data streams.

To address concept drift, effective methods should be able to detect concept drift and adapt to the changes accordingly. Therefore, concept drift detection and adaptation are the two major steps for online learning on data streams.
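
The two steps can be illustrated with a library-free, test-then-train (prequential) loop. This is only a minimal sketch: the `MajorityClass` learner, the fixed-size error window, and the `threshold` rule are hypothetical stand-ins chosen for brevity, not the detectors or models used in this repository.

```python
from collections import Counter, deque


class MajorityClass:
    """Toy incremental learner (hypothetical): predicts the most frequent label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        # Default to label 0 before any sample has been seen.
        return self.counts.most_common(1)[0][0] if self.counts else 0

    def learn(self, x, y):
        self.counts[y] += 1


def prequential(stream, window=50, threshold=0.5):
    """Test-then-train loop with a naive windowed-error drift check (assumed rule)."""
    model = MajorityClass()
    recent_errors = deque(maxlen=window)
    drift_points, correct, n = [], 0, 0
    for i, (x, y) in enumerate(stream):
        # 1) Test: predict before the label is used for training.
        error = int(model.predict(x) != y)
        correct += 1 - error
        n += 1
        recent_errors.append(error)
        # 2) Detect: flag drift when the recent error rate exceeds the threshold.
        if len(recent_errors) == window and sum(recent_errors) / window > threshold:
            drift_points.append(i)
            # 3) Adapt: discard the model fitted to the old concept.
            model = MajorityClass()
            recent_errors.clear()
        # 4) Train on the sample that was just tested.
        model.learn(x, y)
    accuracy = correct / n if n else 0.0
    return accuracy, drift_points


# Synthetic stream: the label flips from 0 to 1 halfway through (abrupt drift).
stream = [({"f": 0.0}, 0)] * 200 + [({"f": 1.0}, 1)] * 200
accuracy, drift_points = prequential(stream)
print(accuracy, drift_points)
```

Resetting the model is the bluntest possible adaptation; the ensemble methods listed later in this README replace only the poorly performing base learners instead.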

Drift Detection

  • Adaptive Windowing (ADWIN) is a distribution-based method that uses an adaptive sliding window to detect concept drift based on data distribution changes. ADWIN identifies concept drift by calculating and analyzing the average of certain statistics over the two sub-windows of the adaptive window. The occurrence of concept drift is indicated by a large difference between the averages of the two sub-windows. Once a drift point is detected, all the old data samples before that drift time point are discarded.

    • Albert Bifet and Ricard Gavalda. "Learning from time-changing data with adaptive windowing." In Proceedings of the 2007 SIAM international conference on data mining, pp. 443-448. Society for Industrial and Applied Mathematics, 2007.
    from river.drift import ADWIN
    adwin = ADWIN()
  • Drift Detection Method (DDM) is a popular model performance-based method that defines two thresholds, a warning level and a drift level, and monitors changes in the model's error rate and its standard deviation to detect drift.

    • João Gama, Pedro Medas, Gladys Castillo, Pedro Pereira Rodrigues: Learning with Drift Detection. SBIA 2004: 286-295
    from river.drift import DDM
    ddm = DDM()
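
To make the warning and drift levels concrete, the following is a simplified, library-free sketch of the DDM rule as described in the cited paper. The class name `SimpleDDM`, its parameter names, and the default multipliers (2 and 3 standard deviations, 30 warm-up samples) are assumptions for illustration; River's `DDM` class may differ in its details.

```python
import math


class SimpleDDM:
    """Simplified DDM sketch: tracks the running error rate p and its std s."""

    def __init__(self, min_samples=30, warning_level=2.0, drift_level=3.0):
        self.min_samples = min_samples
        self.warning_level = warning_level
        self.drift_level = drift_level
        self.reset()

    def reset(self):
        self.n = 0
        self.p = 0.0            # running error rate
        self.p_min = math.inf   # error rate at the best point seen so far
        self.s_min = math.inf   # std at the best point seen so far

    def update(self, error):
        """error is 1 for a misclassification, 0 otherwise."""
        self.n += 1
        self.p += (error - self.p) / self.n
        s = math.sqrt(max(0.0, self.p * (1.0 - self.p)) / self.n)
        if self.n < self.min_samples:
            return "stable"
        # Remember the point where p + s was lowest.
        if self.p + s <= self.p_min + self.s_min:
            self.p_min, self.s_min = self.p, s
        if self.p + s > self.p_min + self.drift_level * self.s_min:
            self.reset()
            return "drift"
        if self.p + s > self.p_min + self.warning_level * self.s_min:
            return "warning"
        return "stable"


# Synthetic error stream: 10% errors for 300 samples, then 50% errors.
errors = [int(i % 10 == 0) for i in range(300)] + [int(i % 2 == 0) for i in range(200)]
ddm_sketch = SimpleDDM()
states = [ddm_sketch.update(e) for e in errors]
first_warning = states.index("warning")
first_drift = states.index("drift")
print(first_warning, first_drift)
```

In this sketch the warning state appears a few samples before the drift state, which is the window in which a replacement model could start training.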

Drift Adaptation

  • Hoeffding tree (HT) is a type of decision tree (DT) that uses the Hoeffding bound to incrementally adapt to data streams. Whereas a conventional DT chooses the best split from the full training set, the HT uses the Hoeffding bound to calculate the number of samples needed before a split can be selected with a given confidence. Thus, the HT can update its nodes to adapt to newly incoming samples.

    • G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In KDD’01, pages 97–106, San Francisco, CA, 2001. ACM Press.
    from river import tree
    model = tree.HoeffdingTreeClassifier(
         grace_period=100,
         split_confidence=1e-5,
         ...
    )
  • Extremely Fast Decision Tree (EFDT), also named Hoeffding Anytime Tree (HATT), is an improved version of the HT that splits a node as soon as a split reaches the confidence level, instead of waiting to identify the best split as the HT does.

    • C. Manapragada, G. Webb, and M. Salehi. Extremely Fast Decision Tree. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD '18). ACM, New York, NY, USA, 1953-1962, 2018.
    from river import tree
    model = tree.ExtremelyFastDecisionTreeClassifier(
         grace_period=100,
         split_confidence=1e-5,
         min_samples_reevaluate=100,
         ...
     )
  • Adaptive random forest (ARF) algorithm uses HTs as base learners and ADWIN as the drift detector for each tree to address concept drift. Through the drift detection process, the poor-performing base trees are replaced by new trees to fit the new concept.

    • Heitor Murilo Gomes, Albert Bifet, Jesse Read, Jean Paul Barddal, Fabricio Enembreck, Bernhard Pfahringer, Geoff Holmes, Talel Abdessalem. Adaptive random forests for evolving data stream classification. In Machine Learning, DOI: 10.1007/s10994-017-5642-8, Springer, 2017.
    from river import ensemble
    from river.drift import ADWIN
    model = ensemble.AdaptiveRandomForestClassifier(
         n_models=3,
         drift_detector=ADWIN(),
         ...
    )
  • Streaming Random Patches (SRP) uses a technique similar to ARF, but it applies a global subspace randomization strategy instead of the local subspace randomization technique used by ARF. Global subspace randomization is a more flexible method that improves the diversity of base learners.

    • Heitor Murilo Gomes, Jesse Read, Albert Bifet. Streaming Random Patches for Evolving Data Stream Classification. IEEE International Conference on Data Mining (ICDM), 2019.
    from river import ensemble
    from river import tree
    from river.drift import ADWIN
    base_model = tree.HoeffdingTreeClassifier(
        grace_period=50, split_confidence=0.01,
        ...
    )
    model = ensemble.SRPClassifier(
        model=base_model, n_models=3, drift_detector=ADWIN(),
        ...
    )
  • Leveraging bagging (LB) is another popular online ensemble that uses bootstrap samples to construct base learners. It uses a Poisson distribution to resample the data, which increases data diversity and improves bagging performance.

    • Bifet A., Holmes G., Pfahringer B. (2010) Leveraging Bagging for Evolving Data Streams. In: Balcázar J.L., Bonchi F., Gionis A., Sebag M. (eds) Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2010. Lecture Notes in Computer Science, vol 6321. Springer, Berlin, Heidelberg.
    from river import ensemble
    from river import linear_model
    from river import preprocessing
    model = ensemble.LeveragingBaggingClassifier(
       model=(
           preprocessing.StandardScaler() |
           linear_model.LogisticRegression()
       ),
       n_models=3,
       ...
    )
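
The Hoeffding bound behind the HT and EFDT entries above can be computed directly. The sketch below assumes the standard form of the bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)), where R is the range of the split metric, delta corresponds to a confidence parameter such as `split_confidence`, and n is the number of samples seen at the leaf. The helper names `hoeffding_bound` and `should_split` are hypothetical, and real implementations add details such as tie-breaking that are omitted here.

```python
import math


def hoeffding_bound(value_range, delta, n):
    """Largest likely gap between an observed mean and the true mean after n samples."""
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain, second_gain, value_range, delta, n):
    """Split once the best candidate beats the runner-up by more than the bound."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


# The bound shrinks as more samples arrive, so smaller gain gaps become decisive.
for n in (100, 1000, 10000):
    print(n, round(hoeffding_bound(1.0, 1e-5, n), 4))
```

With a gain gap of 0.1 and delta = 1e-5, this rule would not split after 100 samples but would after 1,000, which is why waiting for more samples between split attempts trades reaction speed for fewer wasted evaluations.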

Abstract of The Paper

As the number of Internet of Things (IoT) devices and systems has surged, IoT data analytics techniques have been developed to detect malicious cyber-attacks and secure IoT systems; however, concept drift issues often occur in IoT data analytics, as IoT data is often dynamic data streams that change over time, causing model degradation and attack detection failure. This is because traditional data analytics models are static models that cannot adapt to data distribution changes. In this paper, we propose a Performance Weighted Probability Averaging Ensemble (PWPAE) framework for drift adaptive IoT anomaly detection through IoT data stream analytics. Experiments on two public datasets show the effectiveness of our proposed PWPAE method compared against state-of-the-art methods.

Implementation

Online Learning/Concept Drift Adaptation Algorithms

  • Adaptive Random Forest (ARF)
  • Streaming Random Patches (SRP)
  • Extremely Fast Decision Tree (EFDT)
  • Hoeffding Tree (HT)
  • Leveraging Bagging (LB)
  • Performance Weighted Probability Averaging Ensemble (PWPAE)
    • Proposed Method
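
As its name suggests, PWPAE averages the class probabilities of its base learners, weighted by their performance. The sketch below shows one way this could work. It assumes each weight is the normalized reciprocal of the learner's recent error rate; this README does not spell out that choice, so consult the paper and notebooks for the exact scheme. All function names are hypothetical.

```python
def performance_weights(error_rates, eps=1e-3):
    """Assumed weighting: reciprocal of each learner's error rate, normalized to sum to 1."""
    inverse = [1.0 / (e + eps) for e in error_rates]
    total = sum(inverse)
    return [w / total for w in inverse]


def weighted_probability_average(probas, weights):
    """Combine per-learner class probabilities (dicts of label -> probability)."""
    combined = {}
    for proba, weight in zip(probas, weights):
        for label, p in proba.items():
            combined[label] = combined.get(label, 0.0) + weight * p
    return combined


def ensemble_predict(probas, error_rates):
    """Return the label with the highest performance-weighted probability."""
    combined = weighted_probability_average(probas, performance_weights(error_rates))
    return max(combined, key=combined.get)


# Three base learners: one accurate learner disagrees with two weaker ones.
probas = [{0: 0.9, 1: 0.1}, {0: 0.4, 1: 0.6}, {0: 0.45, 1: 0.55}]
error_rates = [0.05, 0.30, 0.30]
print(ensemble_predict(probas, error_rates))
```

A plain majority vote would pick label 1 here, while the weighted average follows the learner with the lowest recent error rate. Recomputing the weights as error rates change lets such an ensemble shift toward whichever base learner currently fits the stream.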

Drift Detection Algorithms

  • Adaptive Windowing (ADWIN)
  • Drift Detection Method (DDM)
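
The adaptive-window idea behind ADWIN can be sketched without any library: keep a window of recent values, compare the means of every older/newer pair of sub-windows, and drop the older part when the gap exceeds a Hoeffding-style cut threshold. This simplified version stores raw values in a plain deque and scans all split points, whereas the original algorithm uses a compressed bucket structure. The class name, defaults, and `min_sub` guard are assumptions made for illustration.

```python
import math
from collections import deque


class SimpleAdaptiveWindow:
    """Simplified ADWIN-style detector for values in [0, 1] (e.g. 0/1 errors)."""

    def __init__(self, delta=0.002, max_size=1000, min_sub=5):
        self.delta = delta
        self.min_sub = min_sub          # smallest sub-window worth comparing
        self.window = deque(maxlen=max_size)

    def update(self, value):
        """Add a value; return True if the window was cut because of drift."""
        self.window.append(value)
        values = list(self.window)
        n = len(values)
        total = sum(values)
        head = 0.0
        for split in range(1, n):
            head += values[split - 1]
            n0, n1 = split, n - split
            if n0 < self.min_sub or n1 < self.min_sub:
                continue
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps_cut = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(head / n0 - (total - head) / n1) > eps_cut:
                # Discard the older sub-window; keep only the recent samples.
                for _ in range(split):
                    self.window.popleft()
                return True
        return False


# Synthetic signal: mean jumps from 0 to 1 at index 200.
detector = SimpleAdaptiveWindow()
signal = [0.0] * 200 + [1.0] * 200
detections = [i for i, v in enumerate(signal) if detector.update(v)]
print(detections, len(detector.window))
```

After the cut, the window contains only post-change samples, so any statistic computed from it reflects the new concept.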

Dataset

  1. IoTID20 dataset, a novel IoT botnet dataset

  2. CICIDS2017 dataset, a popular network traffic dataset for intrusion detection problems

So that the experimental results can be displayed in Jupyter Notebook, the sample code uses sampled subsets of the two datasets. The subsets are in the "data" folder.

Code

Requirements & Libraries

Contact-Info

Please feel free to contact us with any questions or cooperation opportunities. We will be happy to help.

Citation

If you find this repository useful in your research, please cite this article as:

L. Yang, D. M. Manias and A. Shami, "PWPAE: An Ensemble Framework for Concept Drift Adaptation in IoT Data Streams," 2021 IEEE Global Communications Conference (GLOBECOM), 2021, pp. 1-6, doi: 10.1109/GLOBECOM46510.2021.9685338.

@INPROCEEDINGS{9685338,
  author={Yang, Li and Manias, Dimitrios Michael and Shami, Abdallah},
  booktitle={2021 IEEE Global Communications Conference (GLOBECOM)},
  title={PWPAE: An Ensemble Framework for Concept Drift Adaptation in IoT Data Streams},
  year={2021},
  pages={1-6},
  doi={10.1109/GLOBECOM46510.2021.9685338}
}

More Repositories

  1. AutoML-Implementation-for-Static-and-Dynamic-Data-Analytics (Jupyter Notebook, 614 stars)
     Implementation/Tutorial of using Automated Machine Learning (AutoML) methods for static/batch and online/continual learning.

  2. Intrusion-Detection-System-Using-Machine-Learning (Jupyter Notebook, 385 stars)
     Code for IDS-ML: intrusion detection system development using machine learning algorithms (Decision tree, random forest, extra trees, XGBoost, stacking, k-means, Bayesian optimization, ...).

  3. Intrusion-Detection-System-Using-CNN-and-Transfer-Learning (Jupyter Notebook, 126 stars)
     Code for intrusion detection system (IDS) development using CNN models and transfer learning.

  4. Vibration-Based-Fault-Diagnosis-with-Low-Delay (Jupyter Notebook, 53 stars)
     Python code (Jupyter notebooks) for the paper entitled "A Hybrid Method for Condition Monitoring and Fault Diagnosis of Rolling Bearings With Low System Delay," IEEE Trans. on Instrumentation and Measurement, Aug. 2022. Techniques used: Wavelet Packet Transform (WPT) & Fast Fourier Transform (FFT). Application: vibration-based fault diagnosis.

  5. OASW-Concept-Drift-Detection-and-Adaptation (Jupyter Notebook, 47 stars)
     An online learning method used to address concept drift and model drift. Code for the paper entitled "A Lightweight Concept Drift Detection and Adaptation Framework for IoT Data Streams" published in IEEE Internet of Things Magazine.

  6. MSANA-Online-Data-Stream-Analytics-And-Concept-Drift-Adaptation (Jupyter Notebook, 30 stars)
     Data stream analytics: Implement online learning methods to address concept drift and model drift in dynamic data streams. Code for the paper entitled "A Multi-Stage Automated Online Network Data Stream Analytics Framework for IIoT Systems" published in IEEE Transactions on Industrial Informatics.

  7. FL-IOV-ITS (Jupyter Notebook, 22 stars)
     Code for the case study presented in "Making a Case for Federated Learning in the Internet of Vehicles and Intelligent Transportation Systems" accepted for publication in the IEEE Network Magazine May 2021 Special Issue on AI-empowered Mobile Edge Computing in the Internet of Vehicles.

  8. AutoML-and-Adversarial-Attack-Defense-for-Zero-Touch-Network-Security (Jupyter Notebook, 21 stars)
     This repository includes code for the AutoML-based IDS and adversarial attack defense case studies presented in the paper "Enabling AutoML for Zero-Touch Network Security: Use-Case Driven Analysis" published in IEEE Transactions on Network and Service Management.

  9. 5G-Core-Networks-Datasets (13 stars)

  10. Signal-Processing-for-Machine-Learning (Jupyter Notebook, 11 stars)
      This repository serves as a platform for posting a diverse collection of Python codes for signal processing, facilitating various operations within a typical signal processing pipeline (pre-processing, processing, and application).

  11. Student-Performance-and-Engagement-Prediction-eLearning-datasets (10 stars)
      This repository contains the datasets used as part of the OC2 lab's work on student performance prediction and student engagement prediction in eLearning environments using machine learning methods.

  12. Similarity-Based-Predictive-Maintenance-Framework-for-Rotating-Machinery (Jupyter Notebook, 9 stars)
      Python code (Jupyter notebooks) for the paper entitled "Similarity-Based Predictive Maintenance Framework for Rotating Machinery," presented at the Fifth International Conference on Communications, Signal Processing, and their Applications (ICCSPA'22), Cairo, Egypt, 27-29 December 2022. Techniques used: statistical analysis, FFT, and STFT.

  13. Wireless-Resource-Virtualization-with-Device-to-Device-Communication-Underlaying-LTE-Networks (MATLAB, 7 stars)
      Implementation of Wireless Resource Virtualization with Device-to-Device Communication Underlaying LTE Networks.

  14. Data-driven-Methods-for-the-Reduction-of-Energy-Consumption-in-Warehouses-Use-Case (Jupyter Notebook, 4 stars)
      This repository includes the code of the use case in the paper titled "Data-driven Methods for the Reduction of Energy Consumption in Warehouses: Use-Case Driven Analysis".

  15. CorrFL (Python, 3 stars)
      This repository includes the code used in the paper titled "CorrFL: Correlation-based Neural Network Architecture for Unavailability Concerns in a Heterogeneous IoT Environment".

  16. SB-PdM-a-tool-for-predictive-maintenance-of-rolling-bearings-based-on-limited-labeled-data (Jupyter Notebook, 3 stars)
      SB-PdM is a non-machine learning code to perform Predictive Maintenance (PdM) of rolling bearings without the need to train a classifier. In SB-PdM, the classification task is performed by applying a similarity measure between the test sample and class-reference labeled samples in the feature space.

  17. DNS_Typosquatting_Detection_Datasets (MATLAB, 2 stars)
      This repository contains the datasets used as part of the OC2 lab's work on DNS Typosquatting Detection using machine learning methods.

  18. FDE (Jupyter Notebook, 1 star)

  19. TRL-HPO (Python, 1 star)

  20. Joint-Instantaneous-Amplitude-Frequency-Analysis-for-Vibration-Based-Condition-Monitoring (Jupyter Notebook, 1 star)

  21. hierarchical-CO2 (Python, 1 star)
      This repository includes the code used in the paper titled "Hierarchical Modelling for CO2 Variation Prediction for HVAC System Operation".

  22. TinyML_EVCI (Python, 1 star)
      This repository contains code for comparing traditional Machine Learning (ML) and Tiny Machine Learning (TinyML) in terms of time, memory usage, and performance, specifically in the context of electric vehicle charging infrastructure. It also offers practical insights by implementing TinyML on the ESP32 microcontroller.