Performance of ChatGPT, Bard, Claude, and Bing on the Peruvian National Licensing Medical Examination: a cross-sectional study

Bibliographic Details
Published in:	Journal of educational evaluation for health professions Vol. 20; p. 30
Main Authors:	Torres-Zegarra, Betzy Clariza; Rios-Garcia, Wagner; Ñaña-Cordova, Alvaro Micael; Arteaga-Cisneros, Karen Fatima; Chalco, Xiomara Cristina Benavente; Ordoñez, Marina Atena Bustamante; Rios, Carlos Jesus Gutierrez; Godoy, Carlos Alberto Ramos; Quezada, Kristell Luisa Teresa Panta; Gutierrez-Arratia, Jesus Daniel; Flores-Cohaila, Javier Alejandro
Format:	Journal Article
Language:	English
Published:	Korea (South): Korea Health Personnel Licensing Examination Institute (한국보건의료인국가시험원), 2023
Subjects:	Artificial Intelligence; Cross-Sectional Studies; Educational Measurement; Educational Status; Humans; Knowledge; Medical Education; Peru; Education

Description
Summary:	We aimed to describe the performance and evaluate the educational value of justifications provided by artificial intelligence chatbots, including GPT-3.5, GPT-4, Bard, Claude, and Bing, on the Peruvian National Medical Licensing Examination (P-NLME). This was a cross-sectional analytical study. On July 25, 2023, each multiple-choice question (MCQ) from the P-NLME was entered into each chatbot (GPT-3.5, GPT-4, Bing, Bard, and Claude) 3 times. Then, 4 medical educators categorized the MCQs by medical area, item type, and whether the MCQ required Peru-specific knowledge. They also assessed the educational value of the justifications from the 2 top performers (GPT-4 and Bing). GPT-4 scored 86.7% and Bing scored 82.2%, followed by Bard and Claude; by comparison, the historical performance of Peruvian examinees was 55%. Among the factors examined for association with correct answers, only the requirement for Peru-specific knowledge was associated with lower odds of a correct answer (odds ratio, 0.23; 95% confidence interval, 0.09-0.61); the remaining factors showed no association. In the assessment of the educational value of justifications provided by GPT-4 and Bing, no significant differences were observed in certainty, usefulness, or potential use in the classroom. Among chatbots, GPT-4 and Bing were the top performers, with Bing performing better on Peru-specific MCQs. Moreover, the educational value of the justifications provided by GPT-4 and Bing could be deemed appropriate. However, it is essential to start addressing the educational value of these chatbots, rather than merely their performance on examinations.
ISSN:	1975-5937
DOI:	10.3352/jeehp.2023.20.30