A Hybrid Sampling Approach for Imbalanced Binary and Multi-Class Data Using Clustering Analysis

Bibliographic Details
Published in: IEEE Access, Vol. 10, pp. 118639-118653
Main Authors: Palli, Abdul Sattar; Jaafar, Jafreezal; Hashmani, Manzoor Ahmed; Gomes, Heitor Murilo; Gilal, Abdul Rehman
Format: Journal Article
Language: English
Published: Piscataway IEEE 2022
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Description
Summary: Unequal data distribution among classes usually causes a class imbalance problem. Because of the imbalance, classification models become biased toward the majority class and misclassify the minority class. The class imbalance issue becomes more complex when it occurs in multi-class data. The most common method for handling class imbalance is data resampling, which involves either over-sampling minority class instances or under-sampling majority class instances. Under-sampling risks losing crucial information, whereas over-sampling can cause overfitting. Therefore, we propose a novel Cluster-based Hybrid Sampling for Imbalance Data (CBHSID) approach to address these issues. CBHSID calculates the mean of the data observations based on the number of classes and uses that mean as a threshold to separate majority classes from minority classes. It applies affinity propagation cluster analysis to each class to create sub-clusters and calculates the distance of each data item in a sub-cluster using the centroid mean. During under-sampling, CBHSID removes data observations that are far from the center of their sub-cluster. During over-sampling, it generates synthetic samples from data observations near the center of the sub-cluster. We compared CBHSID with a few state-of-the-art data balancing methods on 12 binary and 4 multi-class benchmark datasets. Based on Geometric-Mean (G-Mean), Recall, and F1-score, our method outperformed the other compared methods on 14 of the 16 datasets. The results also show that CBHSID is suitable for addressing class imbalance in both binary and multi-class classification. So far, we have validated CBHSID only on stationary data streams. CBHSID can therefore be tested further on non-stationary data streams in online learning environments.
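The resampling steps described in the summary can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`hybrid_resample`, `single_cluster`, `centroid`) are hypothetical, the per-class target is assumed to be the mean class size, and synthetic samples are assumed to be drawn by interpolating between a near-center observation and its sub-cluster centroid. To stay dependency-free, the clustering step is a pluggable function whose default places each class in a single sub-cluster; the paper uses affinity propagation at that step.

```python
import random
from collections import defaultdict
from math import dist


def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def single_cluster(points):
    """Placeholder clustering (assumption): one sub-cluster per class.
    The paper applies affinity propagation here instead."""
    return [0] * len(points)


def hybrid_resample(X, y, cluster_fn=single_cluster, seed=0):
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for point, label in zip(X, y):
        by_class[label].append(tuple(point))

    # Threshold (assumption): mean number of observations per class.
    target = round(len(X) / len(by_class))

    X_out, y_out = [], []
    for label, points in by_class.items():
        # Group the class into sub-clusters.
        clusters = defaultdict(list)
        for p, c in zip(points, cluster_fn(points)):
            clusters[c].append(p)

        # Rank every point by distance to its own sub-cluster centroid.
        ranked = []
        for members in clusters.values():
            center = centroid(members)
            ranked.extend((dist(p, center), p, center) for p in members)
        ranked.sort(key=lambda item: item[0])

        if len(points) > target:
            # Majority class: drop the observations farthest from the center.
            kept = [p for _, p, _ in ranked[:target]]
        else:
            # Minority class: synthesize points near the center (assumption:
            # linear interpolation between a close point and its centroid).
            kept = list(points)
            near = ranked[: max(1, len(ranked) // 2)]
            while len(kept) < target:
                _, p, center = rng.choice(near)
                t = rng.random()
                kept.append(tuple(a + t * (b - a) for a, b in zip(p, center)))

        X_out.extend(kept)
        y_out.extend([label] * len(kept))
    return X_out, y_out
```

With a clustering library available, `cluster_fn` could wrap an affinity propagation implementation that returns one sub-cluster label per point; the rest of the sketch would stay unchanged.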
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2022.3218463