Classification of Diabetes Data Set from Iraq via Different Machine Learning Techniques

Bibliographic Details
Published in: Iraqi Journal of Statistical Sciences (المجلة العراقية للعلوم الاحصائية) Vol. 21; no. 1; pp. 170-189
Main Authors: Dilshad Altalabani, Fevzi Erdogan
Format: Journal Article
Language: Arabic, English
Published: College of Computer Science and Mathematics, University of Mosul 01-06-2024
Description
Summary: Diabetes has become one of the most prevalent diseases in Iraq and is listed as one of the leading causes of death. Machine learning can extract useful information from diagnostic medical datasets collected from diabetes patients in Iraq by building predictive models. In this study, we compared the performance of five machine learning classifiers: classification and regression trees (CART), support vector machines (SVM), random forests (RF), linear discriminant analysis (LDA), and K-nearest neighbors (KNN). We sought to design a model that predicts, as accurately as possible, whether a person has diabetes, is healthy, or is expected to develop diabetes in the future, using two evaluation metrics: accuracy and kappa. On the training data, the algorithms ranked from highest to lowest accuracy as follows: RF, CART, SVM, LDA, and KNN. The test data results differed somewhat, with the ranking from highest to lowest being SVM, RF, CART, LDA, and KNN. The training data set refers to the samples used to construct the model, whereas the test data set is used to evaluate the model's performance. Based on these assessment criteria, we chose the best-performing machine learning approach for predicting diabetes mellitus in Iraq. All of the methods listed above were evaluated on a labeled diabetes test dataset, and the approach that achieves the highest accuracy and kappa is regarded as the best option. The results show that the SVM and RF algorithms predicted diabetes more accurately than the other methods.
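The two evaluation metrics named in the summary, accuracy and kappa, can be illustrated with a minimal sketch. It assumes that "kappa" means Cohen's kappa, which the record does not state explicitly. Cohen's kappa is (p_o - p_e) / (1 - p_e), where p_o is the observed agreement between true and predicted labels and p_e is the agreement expected by chance. The function names, the three class labels, and the toy label lists below are hypothetical and are not taken from the study's data.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement between labels, corrected for chance."""
    n = len(y_true)
    p_o = accuracy(y_true, y_pred)  # observed agreement
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over classes of P(true = c) * P(pred = c).
    p_e = sum(true_counts[c] * pred_counts[c] for c in true_counts) / (n * n)
    if p_e == 1.0:
        # Only one class appears in both lists, so kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical toy labels: D = diabetic, N = non-diabetic, P = predicted diabetic.
y_true = ["D", "D", "D", "D", "N", "N", "N", "P", "P", "P"]
y_pred = ["D", "D", "D", "N", "N", "N", "P", "P", "P", "D"]

print(f"accuracy = {accuracy(y_true, y_pred):.3f}")
print(f"kappa    = {cohen_kappa(y_true, y_pred):.3f}")
```

Kappa is typically lower than accuracy because it discounts the agreement a classifier would reach by guessing according to the class frequencies, which matters when the classes are imbalanced.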
ISSN: 1680-855X, 2664-2956
DOI: 10.33899/iqjoss.2024.183258