ChatGPT vs Gemini: Comparative Accuracy and Efficiency in CAD-RADS Score Assignment from Radiology Reports

Bibliographic Details
Published in:	Journal of imaging informatics in medicine
Main Authors:	Silbergleit, Matthew, Tóth, Adrienn, Chamberlin, Jordan H., Hamouda, Mohamed, Baruah, Dhiraj, Derrick, Sydney, Schoepf, U. Joseph, Burt, Jeremy R., Kabakus, Ismail M.
Format:	Journal Article
Language:	English
Published:	11-11-2024

Description
Summary:	This study aimed to evaluate the accuracy and efficiency of ChatGPT-3.5, ChatGPT-4o, Google Gemini, and Google Gemini Advanced in generating CAD-RADS scores from radiology reports. This retrospective study analyzed 100 consecutive coronary computed tomography angiography reports performed between March 15, 2024, and April 1, 2024, at a single tertiary center. Each report containing a radiologist-assigned CAD-RADS score was processed using four large language models (LLMs) without fine-tuning. The findings section of each report was input into the LLMs, and the models were tasked with generating CAD-RADS scores. The LLM-generated scores were compared with the radiologist's score to determine accuracy. Additionally, the time taken by each model to complete the task was recorded. Statistical analyses included the Mann–Whitney U test and interobserver agreement using unweighted Cohen's kappa (κ) and Krippendorff's alpha (α). ChatGPT-4o demonstrated the highest accuracy, correctly assigning CAD-RADS scores in 87% of cases (κ = 0.838, α = 0.886), followed by Gemini Advanced with 82.6% accuracy (κ = 0.784, α = 0.897). ChatGPT-3.5, although the fastest (median time = 5 s), was the least accurate (50.5% accuracy, κ = 0.401, α = 0.787). Gemini exhibited a higher failure rate (12%) than the other models, with Gemini Advanced slightly improving upon its predecessor. ChatGPT-4o outperformed the other LLMs in both accuracy and agreement with radiologist-assigned CAD-RADS scores, though ChatGPT-3.5 was significantly faster. Despite their potential, current publicly available LLMs require further refinement before being deployed for clinical decision-making in CAD-RADS scoring.
ISSN:	2948-2933
DOI:	10.1007/s10278-024-01328-y
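
Note on the agreement statistic: the summary reports unweighted Cohen's kappa between each model and the radiologist. The sketch below shows one way such a value could be computed from two lists of scores. It is an illustration only, not the authors' analysis code; the function names and the sample scores are hypothetical and are not taken from the study.

```python
from collections import Counter


def accuracy(reference, predicted):
    """Fraction of cases where the predicted score equals the reference score."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same cases."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(rater_a)
    # Observed agreement: share of cases with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the two raters' marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivially perfect.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical CAD-RADS scores for six reports (illustrative values only).
radiologist_scores = [0, 1, 2, 3, 3, 4]
model_scores = [0, 1, 2, 3, 2, 4]

print(f"accuracy = {accuracy(radiologist_scores, model_scores):.3f}")
print(f"kappa    = {cohen_kappa(radiologist_scores, model_scores):.3f}")
```

Because the kappa here is unweighted, a model answer that is off by one CAD-RADS category counts the same as one that is off by four; a weighted variant would be a different statistic from the one the summary names.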