Evaluating large language models on medical evidence summarization

Bibliographic Details
Published in:	NPJ digital medicine Vol. 6; no. 1; pp. 158 - 8
Main Authors:	Tang, Liyan; Sun, Zhaoyi; Idnay, Betina; Nestor, Jordan G.; Soroush, Ali; Elias, Pierre A.; Xu, Ziyang; Ding, Ying; Durrett, Greg; Rousseau, Justin F.; Weng, Chunhua; Peng, Yifan
Format:	Journal Article
Language:	English
Published:	London: Nature Publishing Group UK, 24-08-2023; Nature Publishing Group; Nature Portfolio
Subjects:	631/114/2406; 692/308/575; 692/700/3935; Artificial intelligence; Biomedicine; Biotechnology; Chatbots; Evidence-based medicine; Language; Large language models; Medicine; Medicine & Public Health; Quality control

Description
Summary:	Recent advances in large language models (LLMs) have demonstrated remarkable successes in zero- and few-shot performance on various downstream tasks, paving the way for applications in high-stakes domains. In this study, we systematically examine the capabilities and limitations of LLMs, specifically GPT-3.5 and ChatGPT, in performing zero-shot medical evidence summarization across six clinical domains. We conduct both automatic and human evaluations, covering several dimensions of summary quality. Our study demonstrates that automatic metrics often do not strongly correlate with the quality of summaries. Furthermore, informed by our human evaluations, we define a terminology of error types for medical evidence summarization. Our findings reveal that LLMs could be susceptible to generating factually inconsistent summaries and making overly convincing or uncertain statements, leading to potential harm due to misinformation. Moreover, we find that models struggle to identify the salient information and are more error-prone when summarizing over longer textual contexts.
ISSN:	2398-6352
DOI:	10.1038/s41746-023-00896-7