Feature Selection and Sensitivity Analysis of Oversampling in Big and Highly Imbalanced Bank's Credit Data

Bibliographic Details
Published in: 2022 10th International Conference on Information and Communication Technology (ICoICT), pp. 35-40
Main Authors: Kurniawan, Aznovri; Rifa'i, Ahmad; Nafis, Moch Abdillah; Andriaswuri, Nimas Sefrida; Patria, Harry; Purwitasari, Diana
Format: Conference Proceeding
Language: English
Published: IEEE 02-08-2022
Description
Summary: Machine learning has evolved into a multidisciplinary field of study in the last few years and has gained popularity in big data analytics, including in the banking industry. Numerous methods can be used in predictive analytics through supervised machine learning, for either regression or classification problems. In the banking industry, credit quality is one of the core focuses, since it is one of the main areas reviewed regularly by regulators and it affects banks' profitability. This research is intended to give recommendations on how to select an appropriate machine learning technique and how to perform feature selection and sensitivity analysis on a bank's credit data that contains more than one million records and is highly imbalanced, with 97.5% of the data in one category. Several supervised machine learning classification methods are applied, including SMOTE (synthetic minority oversampling technique), and the computational results are compared and summarized. The resulting recommendation for big and extremely imbalanced datasets is the Tree Ensemble method with SMOTE, with the computational issue solved through data sampling without significantly reducing accuracy. It is also concluded that an optimum number of features increases model accuracy; however, a significant reduction in the number of features does not necessarily increase model accuracy. The research is expected to be useful for the banking industry, especially in credit portfolio analytics, and for other industries with big and imbalanced datasets that perform predictive analytics to support business objectives. Further research could cover more in-depth analytics for the decision-making process in banking.
DOI: 10.1109/ICoICT55009.2022.9914889
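
Illustration: The summary names SMOTE as the oversampling step for a dataset where 97.5% of records fall in one category. The sketch below shows the core idea of SMOTE (interpolating between a minority sample and one of its nearest minority-class neighbours) in plain Python. It is not the authors' implementation: the function name `smote`, the parameters `k` and `seed`, and the toy two-feature data are all assumptions made for illustration, and the brute-force neighbour search would not scale to a million records without the kind of data sampling the summary mentions.

```python
import random


def smote(minority, n_synthetic, k=5, seed=0):
    """Create n_synthetic points by interpolating between minority samples
    and their k nearest minority-class neighbours (squared Euclidean distance).

    Hypothetical helper for illustration; not taken from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Brute-force search for the k nearest other minority samples.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(base, other)), j)
            for j, other in enumerate(minority)
            if j != i
        )
        _, j = rng.choice(dists[:k])
        neighbour = minority[j]
        # New point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy data (assumed): 390 majority vs 10 minority rows, i.e. 97.5% in one class.
data_rng = random.Random(42)
majority = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(390)]
minority = [[data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)] for _ in range(10)]

# Oversample the minority class until both classes have the same size.
new_points = smote(minority, n_synthetic=len(majority) - len(minority), k=5, seed=1)
balanced_minority = minority + new_points
print(len(majority), len(balanced_minority))
```

In practice the oversampling would be applied only to the training split, so that synthetic records do not leak into the evaluation data, and the balanced training set would then be passed to a tree ensemble classifier.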