An Empirical Comparison of Classification Algorithms for Imbalanced Credit Scoring Datasets

Bibliographic Details
Published in: 2019 18th IEEE International Conference On Machine Learning And Applications (ICMLA), pp. 747-754
Main Authors: Soares de Melo Junior, Leopoldo; Nardini, Franco Maria; Renso, Chiara; Fernandes de Macedo, Jose Antonio
Format: Conference Proceeding
Language: English
Published: IEEE 01-12-2019
Description
Summary: The profitability of banks depends heavily on credit scoring models, which support the decision to approve a loan to a customer. State-of-the-art credit scoring models are based on learning methods. These methods must cope with imbalanced classes, since credit scoring datasets usually contain mainly paid loans and few defaults (unpaid ones). New imbalanced learning techniques have recently been proposed in the literature, and they can improve credit scoring results. Motivated by this scenario, we evaluate several classification approaches to credit scoring. We also assess some preprocessing methods for handling skewed datasets. To this end, we use three public real-world credit scoring datasets. In our experiments, we progressively increase the class imbalance in each of these datasets by randomly undersampling the minority class of defaulters, to identify how the predictive power is affected. The results indicate that random forest and extreme gradient boosting perform very well at all imbalance levels. We also find that a complete grid search step can increase the predictive power of classification approaches on highly imbalanced datasets.
DOI: 10.1109/ICMLA.2019.00133
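
The summary describes increasing class imbalance by randomly undersampling the minority class of defaulters. The following is a minimal sketch of that idea only, not the authors' code. The function name `undersample_minority`, the synthetic data, and the target fractions are all hypothetical, since this record does not give the datasets' details or the imbalance levels used in the paper.

```python
import random


def undersample_minority(rows, labels, minority_label, target_fraction, seed=0):
    """Randomly drop minority-class rows until they make up roughly
    `target_fraction` of the returned dataset. Majority rows are all kept."""
    rng = random.Random(seed)
    minority_idx = [i for i, lab in enumerate(labels) if lab == minority_label]
    majority_idx = [i for i, lab in enumerate(labels) if lab != minority_label]

    # Solve n_min / (n_min + n_maj) = target_fraction for n_min.
    n_keep = round(target_fraction * len(majority_idx) / (1.0 - target_fraction))
    # Never keep more minority rows than exist, and keep at least one.
    n_keep = max(1, min(n_keep, len(minority_idx)))

    kept_minority = rng.sample(minority_idx, n_keep)
    keep = sorted(majority_idx + kept_minority)
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Synthetic stand-in data (assumption): 700 paid loans (0), 300 defaults (1).
data_rng = random.Random(42)
rows = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(1000)]
labels = [1] * 300 + [0] * 700

# Illustrative imbalance levels, from mild to severe.
for fraction in (0.20, 0.10, 0.05, 0.01):
    sub_rows, sub_labels = undersample_minority(rows, labels, 1, fraction)
    print(f"target={fraction:.2f} rows={len(sub_rows)} defaults={sum(sub_labels)}")
```

Each reduced dataset could then be passed to a classifier of choice, such as a random forest or a gradient boosting model, to observe how a chosen metric changes as defaults become rarer.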