Can large language models provide accurate and quality information to parents regarding chronic kidney diseases?

Bibliographic Details
Published in: Journal of Evaluation in Clinical Practice, Vol. 30, No. 8, pp. 1556–1564
Main Authors: Naz, Rüya; Akacı, Okan; Erdoğan, Hakan; Açıkgöz, Ayfer
Format: Journal Article
Language:English
Published: England: Wiley Subscription Services, Inc., 01-12-2024
Summary:

Rationale: Artificial Intelligence (AI) large language models (LLMs) are tools capable of generating human-like text responses to user queries across topics. Their use in various medical contexts is currently being studied; however, their performance and content quality have not been evaluated in specific medical fields.

Aims and objectives: This study aimed to compare the performance of the AI LLMs ChatGPT, Gemini and Copilot in providing information to parents about chronic kidney disease (CKD), and to compare the accuracy and quality of that information with a reference source.

Methods: Forty frequently asked questions about CKD were identified. The accuracy and quality of the answers were evaluated against the Kidney Disease: Improving Global Outcomes (KDIGO) guidelines. The accuracy of the LLM-generated responses was assessed using F1, precision and recall scores; their quality was rated with a five-point global quality score (GQS).

Results: ChatGPT and Gemini achieved high F1 scores of 0.89 and 1, respectively, in the diagnosis and lifestyle categories, demonstrating significant success in generating accurate responses. Both models also achieved high precision in these categories, and all three LLMs showed strong recall in the diagnosis, treatment and lifestyle categories. Mean GQS values were 3.46 ± 0.55 for Gemini, 1.93 ± 0.63 for ChatGPT 3.5 and 2.02 ± 0.69 for Copilot; Gemini outperformed ChatGPT and Copilot in every category.

Conclusion: Although LLMs provide parents with high-accuracy information about CKD, their usefulness is limited compared with a reference source. These performance limitations can lead to misinformation and potential misinterpretation, so patients and parents should exercise caution when using these models.
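For readers unfamiliar with the accuracy metrics named in the Methods, the sketch below shows how precision, recall and F1 are conventionally derived from counts of true positives, false positives and false negatives. This is a generic Python illustration under those standard definitions, not the authors' scoring code; the function name and the example counts are hypothetical.

    def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
        # tp: statements in a response judged accurate against the reference
        # fp: statements the reference contradicts or does not support
        # fn: reference facts the response omitted
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        return precision, recall, f1

    # Hypothetical example: 8 accurate statements, 1 unsupported, 0 omissions.
    p, r, f1 = precision_recall_f1(tp=8, fp=1, fn=0)
    print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")  # 0.89, 1.00, 0.94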
ISSN: 1356-1294
EISSN: 1365-2753
DOI: 10.1111/jep.14084