Integration of a deep learning system for automated chest x-ray interpretation in the emergency department: A proof-of-concept
Published in: | Intelligence-based medicine Vol. 5; p. 100039 |
---|---|
Format: | Journal Article |
Language: | English |
Published: | Elsevier B.V., 2021 |
Summary: | The translation of deep learning (DL) techniques from research to effective clinical implementations has to overcome an important gap between the DL-development setting and the daily clinical practice. The purpose of this work was to carry out a proof-of-concept study of a DL tool for chest x-rays (CXR) at the emergency department (ED) of our health institution, to measure changes in the performance compared to the retrospective test reported in a prior study; and to compare the model with ED physicians, as they are the intended users of this system.
We collected all CXR studies performed during April 2020 in the ED of a 650-bed university hospital, obtaining 508 CXRs from 499 patients. No manual selection or enrichment methods were applied. We built a reference standard based on the diagnosis of three senior radiologists and used it to compare the DL model with ED physicians.
The model showed a sensitivity of 0.853 and a specificity of 0.715 for detecting abnormal findings, and an area under the ROC curve of 0.784 (95% CI: 0.746–0.822), which is significantly lower than the value obtained in the prior retrospective test. However, it is significantly higher than the 0.598 (95% CI: 0.54–0.62) value obtained by ED physicians (p < 0.001). For abnormality detection, the DL model showed significantly higher sensitivity, specificity, and predictive values than ED physicians.
The DL model showed lower evaluation metrics on real-world emergency department images than previously observed in digital tests. These findings caution against overconfident in-silico performance estimates and highlight the importance of proof-of-concept studies of AI-based diagnostic tools that better approximate real clinical settings. Despite its suboptimal performance, the algorithm performed significantly better than Emergency Department physicians at detecting abnormal findings, which suggests there is room for productive human-AI collaboration in real-world clinical scenarios.
•Model performance declined in the proof-of-concept study compared to the retrospective test.
•This cautions against overconfident performance metrics and highlights the importance of tests that emulate clinical settings.
•The source setting of the images and the prevalence of findings affect performance estimates.
•Model performance was significantly higher than that of emergency physicians for abnormality detection.
•This encourages future human-machine collaboration in emergency settings. |
---|---|
ISSN: | 2666-5212 |
DOI: | 10.1016/j.ibmed.2021.100039 |
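
The summary notes that the prevalence of findings affects performance estimates. A minimal sketch of that dependence follows, using Bayes' rule to derive predictive values from the sensitivity (0.853) and specificity (0.715) reported above. The prevalence values are hypothetical and chosen only for illustration, the function name `predictive_values` is not from the paper, and the sketch assumes sensitivity and specificity stay fixed as prevalence changes, which may not hold in practice.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a binary test via Bayes' rule.

    All arguments are proportions in the open interval (0, 1).
    """
    true_pos = sensitivity * prevalence
    false_neg = (1 - sensitivity) * prevalence
    true_neg = specificity * (1 - prevalence)
    false_pos = (1 - specificity) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Operating point reported in the summary for abnormality detection.
SENSITIVITY = 0.853
SPECIFICITY = 0.715

if __name__ == "__main__":
    # Hypothetical prevalence values; the record does not state the
    # prevalence of abnormal findings in the study sample.
    for prevalence in (0.1, 0.3, 0.5):
        ppv, npv = predictive_values(SENSITIVITY, SPECIFICITY, prevalence)
        print(f"prevalence={prevalence:.1f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

Under these assumptions, the positive predictive value rises and the negative predictive value falls as prevalence increases, which is one reason the same model can look different on an enriched retrospective set and on an unselected ED sample.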