Examining ChatGPT Performance on USMLE Sample Items and Implications for Assessment

Bibliographic Details
Published in: Academic Medicine, Vol. 99, No. 2, pp. 192-197
Main Authors: Yaneva, Victoria; Baldwin, Peter; Jurich, Daniel P; Swygert, Kimberly; Clauser, Brian E
Format: Journal Article
Language: English
Published: United States: Lippincott Williams & Wilkins, 01-02-2024
Subjects: Artificial Intelligence; Computer Simulation; Humans; Knowledge; Language; Learning; Research Reports
Abstract
Purpose: In late 2022 and early 2023, reports that ChatGPT could pass the United States Medical Licensing Examination (USMLE) generated considerable excitement, and media coverage suggested that ChatGPT has credible medical knowledge. This report analyzes the extent to which an artificial intelligence (AI) agent's performance on these sample items can generalize to performance on an actual USMLE examination, with ChatGPT used as an illustration.
Method: As with earlier investigations, analyses were based on publicly available USMLE sample items. Each item was submitted to ChatGPT (version 3.5) 3 times to evaluate stability. Responses were scored following rules that match operational practice, and a preliminary analysis explored the characteristics of items that ChatGPT answered correctly. The study was conducted between February and March 2023.
Results: For the full sample of items, ChatGPT scored above 60% correct except for one replication for Step 3. Response success varied across replications for 76 items (20%). There was a modest correspondence with item difficulty, wherein ChatGPT was more likely to respond correctly to items that examinees found easier. ChatGPT performed significantly worse (P < .001) on items relating to practice-based learning.
Conclusions: Achieving 60% accuracy is only an approximate indicator of meeting the passing standard, and direct comparison requires statistical adjustments. Hence, this assessment can only suggest consistency with the passing standards for Steps 1 and 2 Clinical Knowledge, with further limitations in extrapolating this inference to Step 3. These limitations are due to variance in item difficulty and to the exclusion of the simulation component of Step 3 from the evaluation, limitations that would apply to any AI system evaluated on the Step 3 sample items. It is also crucial to note that responses from large language models vary notably when faced with repeated inquiries, underscoring the need for expert validation to ensure their utility as a learning tool.
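The Method paragraph above describes a simple replication protocol: each publicly available sample item is submitted to the model 3 times, responses are scored against the key under rules matching operational practice, and items whose answers change across replications are flagged. The following is a minimal sketch of that kind of replication-and-scoring loop, not the study's actual code; query_model is a hypothetical placeholder for whatever call returns a single option letter from the model.

```python
# Sketch of the replication-and-scoring procedure described in the Method
# paragraph (not the study's actual code). Each item is submitted 3 times,
# each response is scored against the answer key, and items whose answers
# vary across replications are flagged as unstable.
from typing import Callable

def evaluate_items(
    items: list[dict],                  # each item: {"id", "stem", "options", "key"}
    query_model: Callable[[str], str],  # hypothetical placeholder returning one option letter, e.g. "C"
    n_replications: int = 3,
) -> dict:
    correct_per_replication = [0] * n_replications
    unstable_item_ids = []

    for item in items:
        prompt = item["stem"] + "\n" + "\n".join(
            f"{letter}. {text}" for letter, text in item["options"].items()
        )
        answers = [query_model(prompt) for _ in range(n_replications)]

        # Score each replication against the key.
        for rep, answer in enumerate(answers):
            if answer == item["key"]:
                correct_per_replication[rep] += 1

        # Flag items whose answer changes across the repeated submissions.
        if len(set(answers)) > 1:
            unstable_item_ids.append(item["id"])

    n_items = len(items)
    return {
        "percent_correct_per_replication": [c / n_items for c in correct_per_replication],
        "unstable_item_ids": unstable_item_ids,
        "percent_unstable": len(unstable_item_ids) / n_items,
    }
```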
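The Results paragraph reports a modest correspondence between item difficulty for examinees and ChatGPT's success. One way such a correspondence could be quantified is a point-biserial correlation between a binary ChatGPT-correct indicator and the per-item proportion of examinees answering correctly; the arrays below are invented illustrative values, since no item-level data appear in this record.

```python
# Illustrative only: correlate ChatGPT's per-item success (binary) with
# examinee item difficulty (proportion correct). The values are invented
# for demonstration; the paper does not publish item-level data here.
import numpy as np
from scipy.stats import pointbiserialr

examinee_p_values = np.array([0.92, 0.75, 0.61, 0.43, 0.88, 0.56])  # proportion of examinees correct per item
chatgpt_correct = np.array([1, 1, 0, 0, 1, 1])                      # 1 = ChatGPT answered the item correctly

r, p = pointbiserialr(chatgpt_correct, examinee_p_values)
print(f"point-biserial r = {r:.2f}, p = {p:.3f}")
```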
Author Affiliations
V. Yaneva is manager, Natural Language Processing Research, Office of Research Strategy, National Board of Medical Examiners, Philadelphia, Pennsylvania.
K. Swygert is director, Test Development Innovations, National Board of Medical Examiners, Philadelphia, Pennsylvania.
D.P. Jurich is associate vice president, United States Medical Licensing Examination, National Board of Medical Examiners, Philadelphia, Pennsylvania.
P. Baldwin is principal measurement scientist, Office of Research Strategy, National Board of Medical Examiners, Philadelphia, Pennsylvania.
B.E. Clauser is distinguished research scientist, Office of Research Strategy, National Board of Medical Examiners, Philadelphia, Pennsylvania.
Copyright © 2023 The Author(s). Published by Wolters Kluwer Health, Inc. on behalf of the Association of American Medical Colleges.
DOI 10.1097/ACM.0000000000005549
Discipline Medicine
Education
EISSN 1938-808X
ISSN 1040-2446
1938-808X
License Copyright © 2023 The Author(s). Published by Wolters Kluwer Health, Inc. on behalf of the Association of American Medical Colleges. This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial-NoDerivatives License 4.0 (CC BY-NC-ND), which permits downloading and sharing the work provided it is properly cited. The work cannot be changed in any way or used commercially without permission from the journal.
OpenAccessLink https://pubmed.ncbi.nlm.nih.gov/PMC11444356
PMID 37934828
PageCount 6
URI https://www.ncbi.nlm.nih.gov/pubmed/37934828
https://www.proquest.com/docview/2887477155
https://pubmed.ncbi.nlm.nih.gov/PMC11444356