Can large language models provide accurate and quality information to parents regarding chronic kidney diseases?
Published in: Journal of Evaluation in Clinical Practice, Vol. 30, No. 8, pp. 1556-1564
Main Authors:
Format: Journal Article
Language: English
Published: England: Wiley Subscription Services, Inc., 01-12-2024
Summary:
Rationale
Artificial intelligence (AI) large language models (LLMs) are tools capable of generating human‐like text responses to user queries across a wide range of topics. The use of these models in various medical contexts is currently being studied; however, their performance and content quality have not been evaluated in specific medical fields.
Aims and objectives
This study aimed to compare the performance of the AI LLMs ChatGPT, Gemini and Copilot in providing information to parents about chronic kidney diseases (CKD), and to compare the accuracy and quality of that information against a reference source.
Methods
In this study, 40 frequently asked questions about CKD were identified. The accuracy and quality of the answers were evaluated against the Kidney Disease: Improving Global Outcomes (KDIGO) guidelines as the reference source. The accuracy of the responses generated by the LLMs was assessed using F1, precision and recall scores, and the quality of the responses was evaluated using a five‐point global quality score (GQS).
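As a rough illustration of the accuracy metrics described above (a minimal sketch, not the authors' published code), the Python snippet below shows how precision, recall and F1 could be computed once each answer's statements have been labelled against the KDIGO reference as correct (true positives), incorrect or unsupported (false positives), or omitted reference facts (false negatives). The labelling procedure and the counts are assumptions for illustration only; the abstract does not specify how statements were matched to the guideline.

def precision(tp: int, fp: int) -> float:
    # Fraction of generated statements that match the reference.
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp: int, fn: int) -> float:
    # Fraction of reference statements the answer recovered.
    return tp / (tp + fn) if (tp + fn) else 0.0

def f1(tp: int, fp: int, fn: int) -> float:
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if (p + r) else 0.0

# Hypothetical counts for a single answer in the diagnosis category:
print(f"F1 = {f1(tp=8, fp=1, fn=1):.2f}")  # prints F1 = 0.89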
Results
ChatGPT and Gemini achieved high F1 scores of 0.89 and 1, respectively, in the diagnosis and lifestyle categories, demonstrating notable success in generating accurate responses; both models also attained high precision values in these categories. In terms of recall, all three LLMs performed strongly in the diagnosis, treatment and lifestyle categories. The average GQS values of the generated responses were 3.46 ± 0.55 for Gemini, 1.93 ± 0.63 for ChatGPT 3.5 and 2.02 ± 0.69 for Copilot; Gemini outperformed ChatGPT and Copilot in all categories.
Conclusion
Although LLMs can provide parents with highly accurate information about CKD, their usefulness is limited compared with that of a reference source. The limitations in the performance of LLMs can lead to misinformation and potential misinterpretation; therefore, patients and parents should exercise caution when using these models.
ISSN: 1356-1294 (print); 1365-2753 (electronic)
DOI: 10.1111/jep.14084