Stress testing reveals gaps in clinic readiness of image-based diagnostic artificial intelligence models

Bibliographic Details
Published in:	npj Digital Medicine, Vol. 4, no. 1, p. 10
Main Authors:	Young, Albert T.; Fernandez, Kristen; Pfau, Jacob; Reddy, Rasika; Cao, Nhat Anh; von Franque, Max Y.; Johal, Arjun; Wu, Benjamin V.; Wu, Rachel R.; Chen, Jennifer Y.; Fadadu, Raj P.; Vasquez, Juan A.; Tam, Andrew; Keiser, Michael J.; Wei, Maria L.
Format:	Journal Article
Language:	English
Published:	London: Nature Publishing Group UK; Nature Publishing Group; Nature Portfolio, 21-01-2021
Subjects:	692/308/575; 692/699/67/1813/1634; 692/699/67/2322; 692/700/139; 692/700/1421; Artificial intelligence; Biomedicine; Biotechnology; Digital technology; Health informatics; Medical diagnosis; Medicine; Medicine & Public Health; Melanoma

Description
Summary:	Artificial intelligence models match or exceed dermatologists in melanoma image classification. Less is known about their robustness against real-world variations, and clinicians may incorrectly assume that a model with an acceptable area under the receiver operating characteristic curve or related performance metric is ready for clinical use. Here, we systematically assessed the performance of dermatologist-level convolutional neural networks (CNNs) on real-world non-curated images by applying computational “stress tests”. Our goal was to create a proxy environment in which to comprehensively test the generalizability of off-the-shelf CNNs developed without training or evaluation protocols specific to individual clinics. We found inconsistent predictions on images captured repeatedly in the same setting or subjected to simple transformations (e.g., rotation). Such transformations resulted in false positive or negative predictions for 6.5–22% of skin lesions across test datasets. Our findings indicate that models meeting conventionally reported metrics need further validation with computational stress tests to assess clinic readiness.
ISSN:	2398-6352
DOI:	10.1038/s41746-020-00380-6
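
Illustration:	The summary describes checking whether a classifier's prediction changes when an image undergoes a simple transformation such as rotation. The following is a minimal sketch of that idea, not the authors' protocol. The function names, the 0.5 decision threshold, the inclusion of mirror flips alongside rotations, and the `toy_model` stand-in are all assumptions made for illustration; a real test would substitute a trained CNN that returns a melanoma probability.

```python
import numpy as np


def transformed_views(image):
    """Yield 8 views of an H x W x C image: four 90-degree rotations,
    each with and without a horizontal flip (flips are an assumption;
    the summary names rotation only as an example)."""
    for k in range(4):
        rotated = np.rot90(image, k=k, axes=(0, 1))
        yield rotated
        yield np.flip(rotated, axis=1)


def is_inconsistent(model, image, threshold=0.5):
    """Return True if the binary call (score >= threshold) differs
    across the transformed views of a single image."""
    labels = {bool(model(view) >= threshold) for view in transformed_views(image)}
    return len(labels) > 1


def inconsistency_rate(model, images, threshold=0.5):
    """Fraction of images whose binary call flips under some transformation."""
    flags = [is_inconsistent(model, img, threshold) for img in images]
    return sum(flags) / len(flags)


def toy_model(image):
    """Hypothetical stand-in for a CNN: scores an image by the mean
    brightness of its top-left quadrant, so it is orientation-sensitive."""
    h, w = image.shape[:2]
    return float(image[: h // 2, : w // 2].mean())


if __name__ == "__main__":
    # One image with a bright top-left corner, one uniformly dim image.
    corner = np.zeros((4, 4, 3))
    corner[:2, :2] = 1.0
    uniform = np.full((4, 4, 3), 0.3)
    print(inconsistency_rate(toy_model, [corner, uniform]))
```

In this sketch an image counts as inconsistent if any transformed view crosses the decision threshold. Only the per-lesion fraction is in the spirit of the 6.5–22% range reported in the summary, whose exact computation is not given in this record.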