Adaptive Synthetic Sampling Approach for Imbalanced Learning
ADASYN is a Python module that implements an adaptive oversampling technique for skewed datasets.
Many ML algorithms have trouble dealing with heavily skewed datasets. If your dataset has 1000 examples, 950 of which belong to class 'Haystack' and the remaining 50 to class 'Needle', it becomes hard to predict new, unseen data that belong to 'Needle'. What this algorithm does is create new artificial data belonging to the minority class by adding semi-random noise to existing examples. For more information, read the full paper (see the reference at the bottom).
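The core idea can be sketched in a few lines (a simplified illustration, not this module's actual implementation; the helper name and array values are made up): a synthetic point is placed at a random position on the segment between a minority example and one of its minority-class neighbours.

import numpy as np

def synthetic_sample(x_i, x_neighbor, rng=np.random):
    # New point somewhere on the line segment between a minority example
    # and one of its minority-class neighbours.
    lam = rng.uniform(0, 1)
    return x_i + lam * (x_neighbor - x_i)

x_i = np.array([1.0, 2.0])          # an existing 'Needle' example
x_neighbor = np.array([1.5, 2.5])   # a nearby 'Needle' example
print(synthetic_sample(x_i, x_neighbor))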
Dependencies
- pip (needed for install)
- numpy
- scipy
- scikit-learn
Installation
To use ADASYN, you will need to run the following:
pip install git+https://github.com/stavskal/ADASYN
After you have installed the package, you can use it as follows:
from adasyn import ADASYN
adsn = ADASYN(k=7, imb_threshold=0.6, ratio=0.75)
new_X, new_y = adsn.fit_transform(X,y) # your imbalanced dataset is in X,y
# In many applications you may want to keep artificial data separately
# adsn.index_new is a list that holds the indexes of these examples
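If you want to keep the artificial data separately, something along the following lines should work, assuming index_new holds the positions of the synthetic examples within new_X and new_y and that these are NumPy arrays:

import numpy as np

synthetic_idx = np.asarray(adsn.index_new)
mask = np.zeros(len(new_y), dtype=bool)
mask[synthetic_idx] = True

X_synthetic, y_synthetic = new_X[mask], new_y[mask]    # artificial examples only
X_original, y_original = new_X[~mask], new_y[~mask]    # original examples only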
This module implements the idea presented in the original paper by Haibo He et al. (see the reference below) and also includes oversampling for multiclass classification problems. It is designed to be compatible with [scikit-learn](https://github.com/scikit-learn/scikit-learn). It focuses on oversampling the examples that are harder to classify and has shown results that sometimes outperform SMOTE or SMOTEBoost.
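The adaptive part works roughly as follows: minority examples surrounded by many majority-class neighbours are considered harder to classify and receive more synthetic samples. A rough sketch of that weighting, following the paper's formulation (the helper below is purely illustrative and is not part of this module's API):

import numpy as np
from sklearn.neighbors import NearestNeighbors

def synthetic_counts(X, y, minority_label, k=5, beta=1.0):
    # G: total number of synthetic samples needed to (roughly) balance the classes.
    X_min = X[y == minority_label]
    G = int((np.sum(y != minority_label) - len(X_min)) * beta)

    # r_i: fraction of majority-class points among the k nearest neighbours of
    # each minority example (the first neighbour is the point itself, so skip it).
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X_min)
    r = np.array([np.mean(y[row[1:]] != minority_label) for row in idx])

    # Normalise and distribute G proportionally: harder examples get more samples.
    r_hat = r / r.sum() if r.sum() > 0 else np.full(len(r), 1.0 / len(r))
    return np.rint(r_hat * G).astype(int)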
An example can be seen below:
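For instance, a minimal end-to-end run might look like this (the toy dataset is generated with scikit-learn's make_classification purely for illustration, and the parameter values are arbitrary):

from collections import Counter
from sklearn.datasets import make_classification
from adasyn import ADASYN

# A 95/5 imbalanced toy dataset: class 0 ('Haystack') vs class 1 ('Needle').
X, y = make_classification(n_samples=1000, n_features=10, n_informative=3,
                           n_classes=2, weights=[0.95, 0.05], random_state=0)
print('before:', Counter(y))

adsn = ADASYN(k=7, imb_threshold=0.6, ratio=0.75)
new_X, new_y = adsn.fit_transform(X, y)
print('after: ', Counter(new_y))
print('synthetic examples added:', len(adsn.index_new))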
Props to fmfn, whose implementations of different oversampling techniques, clean code structure, and documentation highly influenced this module.
Reference:
- H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning,” in Proc. Int. Joint Conf. Neural Networks (IJCNN’08), pp. 1322-1328, 2008.