MD-HIT: Machine learning for material property prediction with dataset redundancy control

Bibliographic Details
Published in:	npj Computational Materials Vol. 10; no. 1; article 245 (11 pages)
Main Authors:	Li, Qin; Fu, Nihang; Omee, Sadman Sadeed; Hu, Jianjun
Format:	Journal Article
Language:	English
Published:	London: Nature Publishing Group UK; Nature Publishing Group; Nature Portfolio, 18-10-2024
Subjects:	639/301; 639/638/298; Accuracy; Algorithms; Amino acid sequence; Bioinformatics; Characterization and Evaluation of Materials; Chemistry and Materials Science; Computational Intelligence; Datasets; Energy; Energy gap; Free energy; Heat conductivity; Heat of formation; Learning algorithms; Machine learning; Material properties; Materials Science; Mathematical and Computational Engineering; Mathematical and Computational Physics; Mathematical Modeling and Industrial Mathematics; Neural networks; Performance evaluation; Performance prediction; Predictions; Property values; Redundancy; Theoretical

Description
Summary:	Materials datasets usually contain many redundant (highly similar) materials because of the tinkering approach historically used in material design. With random splitting, this redundancy skews the performance evaluation of machine learning (ML) models: predictive performance is overestimated, and performance on out-of-distribution samples is poor. The issue is well known in bioinformatics for protein function prediction, where tools like CD-HIT reduce redundancy by ensuring that no two retained samples have sequence similarity greater than a given threshold. In this paper, we survey overestimated ML performance in material property prediction and propose MD-HIT, a redundancy reduction algorithm for material datasets. Applying MD-HIT to composition- and structure-based formation energy and band gap prediction problems, we demonstrate that, with redundancy control, ML models tend to score lower on test sets than models evaluated with high redundancy, but the results better reflect the models’ true prediction capability.
ISSN:	2057-3960
DOI:	10.1038/s41524-024-01426-z
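
The summary describes redundancy reduction in the style of CD-HIT: keep a sample only if it is not too similar to any sample already kept. The following is a minimal sketch of that greedy idea, not the MD-HIT implementation itself. The function names, the use of Euclidean distance over element-fraction vectors, the threshold value, and the toy data are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_reduce(features: Sequence[Sequence[float]], threshold: float) -> List[int]:
    """Greedy redundancy reduction (hypothetical helper, not the MD-HIT API).

    Walks the samples in input order and keeps a sample only if its
    distance to every previously kept sample is at least `threshold`.
    Returns the indices of the kept samples.
    """
    kept: List[int] = []
    for i, vec in enumerate(features):
        if all(euclidean(vec, features[j]) >= threshold for j in kept):
            kept.append(i)
    return kept


# Toy data (assumed): element fractions over (Sr, Ti, O).
toy = [
    (0.20, 0.20, 0.60),    # SrTiO3
    (0.19, 0.21, 0.60),    # near-duplicate of SrTiO3
    (0.00, 1 / 3, 2 / 3),  # TiO2
]

kept = greedy_reduce(toy, threshold=0.05)
print(kept)
```

With this toy input the near-duplicate is dropped because it lies closer than the threshold to the first sample, while TiO2 is kept. A train/test split drawn from the reduced set then contains no pair of samples closer than the threshold, which is the property the summary credits with giving a more realistic performance estimate. The result of a greedy pass depends on the input order, so a real tool would need a deliberate ordering rule.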