Integration of a deep learning system for automated chest x-ray interpretation in the emergency department: A proof-of-concept

Bibliographic Details
Published in:Intelligence-based medicine Vol. 5; p. 100039
Main Authors: Mosquera, Candelaria, Binder, Fernando, Diaz, Facundo Nahuel, Seehaus, Alberto, Ducrey, Gabriel, Ocantos, Jorge Alberto, Aineseder, Martina, Rubin, Luciana, Rabinovich, Diego Ariel, Quiroga, Angel Ezequiel, Martinez, Bernardo, Beresñak, Alejandro Daniel, Benitez, Sonia Elizabeth, Luna, Daniel Roberto
Format: Journal Article
Language:English
Published: Elsevier B.V. 2021
Description
Summary:The translation of deep learning (DL) techniques from research to effective clinical implementations must overcome an important gap between the DL development setting and daily clinical practice. The purpose of this work was to carry out a proof-of-concept study of a DL tool for chest x-rays (CXR) at the emergency department (ED) of our health institution, to measure changes in performance compared to the retrospective test reported in a prior study, and to compare the model with ED physicians, as they are the intended users of this system. We collected all CXR studies performed during April 2020 in the ED of a 650-bed university hospital, obtaining 508 CXRs from 499 patients. No manual selection or enrichment method was applied. We built a reference standard based on the diagnosis of three senior radiologists and used it to compare the DL model with ED physicians. The model showed a sensitivity of 0.853 and a specificity of 0.715 for the detection of abnormal findings, and an area under the ROC curve of 0.784 (95% CI: 0.746–0.822), which is significantly lower than the value of the prior retrospective test. However, it is significantly higher than the 0.598 (95% CI: 0.54–0.62) value obtained by ED physicians (p < 0.001). For abnormality detection, the DL model showed significantly higher sensitivity, specificity, and predictive values. The DL model showed lower evaluation metrics on real-world emergency department images than what was previously observed in digital tests. These findings caution against overconfident in-silico performance estimates and highlight the importance of proof-of-concept studies of AI-based diagnostic tools to better approach real clinical settings. Despite its suboptimal performance, the algorithm performed significantly better than ED physicians in detecting abnormal findings, which suggests there is room for productive human-AI collaboration in real-world clinical scenarios.
•Model performance declined in the proof-of-concept study compared to the retrospective test.
•This cautions against overconfident performance metrics and highlights the importance of tests that emulate clinical settings.
•The source setting of the images and the prevalence of findings affect performance estimates.
•Model performance was significantly higher than that of emergency physicians for abnormality detection.
•This encourages future human-machine collaboration in emergency settings.
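The highlight on prevalence can be illustrated with a short worked example. The sketch below is not taken from the article: it applies Bayes' rule to the sensitivity (0.853) and specificity (0.715) reported in the summary to show how the predictive values of a fixed classifier shift as the share of abnormal studies changes. The function name `predictive_values` and the three prevalence values are hypothetical, chosen only for illustration; the record does not state the prevalence observed in the study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test via Bayes' rule.

    All three arguments are proportions; prevalence must lie strictly
    between 0 and 1 so that both denominators are non-zero.
    """
    if not 0.0 < prevalence < 1.0:
        raise ValueError("prevalence must be strictly between 0 and 1")

    # Expected fraction of all studies falling in each confusion-matrix cell.
    true_pos = sensitivity * prevalence
    false_neg = (1.0 - sensitivity) * prevalence
    true_neg = specificity * (1.0 - prevalence)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)

    ppv = true_pos / (true_pos + false_pos)  # P(abnormal | flagged abnormal)
    npv = true_neg / (true_neg + false_neg)  # P(normal | flagged normal)
    return ppv, npv


# Operating point reported in the summary for the DL model.
SENSITIVITY = 0.853
SPECIFICITY = 0.715

# Hypothetical prevalence values, NOT figures from the study.
for prevalence in (0.1, 0.3, 0.5):
    ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence={prevalence:.1f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

With sensitivity and specificity held fixed, the positive predictive value rises and the negative predictive value falls as prevalence increases. This is one reason estimates from an enriched retrospective test set may not carry over to an unselected emergency department population.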
ISSN:2666-5212
DOI:10.1016/j.ibmed.2021.100039