Examining ChatGPT Performance on USMLE Sample Items and Implications for Assessment
Published in: | Academic medicine Vol. 99; no. 2; pp. 192 - 197 |
Main Authors: | Victoria Yaneva, Peter Baldwin, Daniel P. Jurich, Kimberly Swygert, Brian E. Clauser |
Format: | Journal Article |
Language: | English |
Published: | United States: Lippincott Williams & Wilkins, 01-02-2024 |
Subjects: | Artificial Intelligence; Computer Simulation; Humans; Knowledge; Language; Learning; Research Reports |
Online Access: | Full text available at https://pubmed.ncbi.nlm.nih.gov/PMC11444356 |
Abstract | In late 2022 and early 2023, reports that ChatGPT could pass the United States Medical Licensing Examination (USMLE) generated considerable excitement, and media response suggested ChatGPT has credible medical knowledge. This report analyzes the extent to which an artificial intelligence (AI) agent's performance on these sample items can generalize to performance on an actual USMLE examination, and an illustration is given using ChatGPT.
As with earlier investigations, analyses were based on publicly available USMLE sample items. Each item was submitted to ChatGPT (version 3.5) 3 times to evaluate stability. Responses were scored following rules that match operational practice, and a preliminary analysis explored the characteristics of items that ChatGPT answered correctly. The study was conducted between February and March 2023.
For the full sample of items, ChatGPT scored above 60% correct except for one replication for Step 3. Response success varied across replications for 76 items (20%). There was a modest correspondence with item difficulty wherein ChatGPT was more likely to respond correctly to items found easier by examinees. ChatGPT performed significantly worse (P < .001) on items relating to practice-based learning.
Achieving 60% accuracy is an approximate indicator of meeting the passing standard, requiring statistical adjustments for comparison. Hence, this assessment can only suggest consistency with the passing standards for Steps 1 and 2 Clinical Knowledge, with further limitations in extrapolating this inference to Step 3. These limitations are due to variances in item difficulty and the exclusion of the simulation component of Step 3 from the evaluation, limitations that would apply to any AI system evaluated on the Step 3 sample items. It is crucial to note that responses from large language models exhibit notable variations when faced with repeated inquiries, underscoring the need for expert validation to ensure their utility as a learning tool. |
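The method paragraph of the abstract describes a replication protocol: each publicly available sample item was submitted to ChatGPT (version 3.5) three times and the responses were scored against the answer key. The record does not give the exact prompting or scoring rules, so the sketch below is only an illustrative approximation using the OpenAI chat API; the file name `usmle_sample_items.json`, the prompt wording, the temperature setting, and the single-letter scoring rule are assumptions, not details from the article.

```python
"""Minimal sketch of the replication protocol described in the abstract:
submit each sample item to a GPT-3.5-class model three times and record
whether the keyed answer was selected. Item file, prompt wording, and
scoring rule are assumptions; the article does not specify them."""
import json

from openai import OpenAI  # OpenAI Python SDK v1+

client = OpenAI()  # reads OPENAI_API_KEY from the environment
N_REPLICATIONS = 3  # each item was submitted 3 times to gauge stability


def ask_once(stem: str, options: dict[str, str]) -> str:
    """Ask the model one multiple-choice item and return the option letter it picks."""
    option_text = "\n".join(f"{letter}. {text}" for letter, text in sorted(options.items()))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"{stem}\n\n{option_text}\n\nAnswer with the single best option letter.",
        }],
        temperature=0,  # assumption: the record does not state decoding settings
    )
    reply = (response.choices[0].message.content or "").strip()
    return reply[0].upper() if reply else ""  # crude extraction of the chosen letter


# Hypothetical local file with fields "stem", "options" (letter -> text), and "key".
with open("usmle_sample_items.json") as f:
    items = json.load(f)

# One row per item: 0/1 correctness across the three replications.
results = [
    [int(ask_once(item["stem"], item["options"]) == item["key"])
     for _ in range(N_REPLICATIONS)]
    for item in items
]
```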
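Continuing that sketch, the results paragraph reports per-replication accuracy above 60%, 76 items (20%) with unstable responses across replications, and a modest correspondence with examinee item difficulty. One minimal way to compute those summaries from the `results` list above is shown here; the examinee difficulty values (proportion of examinees answering each item correctly) are hypothetical, since operational difficulty data are not public.

```python
"""Summarize per-replication accuracy, response stability, and the
difficulty correspondence described in the abstract. `results` comes from
the sketch above; `examinee_p_values` is hypothetical."""
from math import sqrt
from statistics import mean


def percent_correct(results: list[list[int]], replication: int) -> float:
    """Percent of items answered correctly in one replication (index 0, 1, or 2)."""
    return 100 * mean(scores[replication] for scores in results)


def unstable_items(results: list[list[int]]) -> int:
    """Count items whose correctness differed across replications (reported as 76, or 20%)."""
    return sum(1 for scores in results if len(set(scores)) > 1)


def point_biserial(binary: list[int], continuous: list[float]) -> float:
    """Pearson r between a 0/1 variable and a continuous one (the point-biserial correlation)."""
    mb, mc = mean(binary), mean(continuous)
    cov = sum((b - mb) * (c - mc) for b, c in zip(binary, continuous))
    return cov / sqrt(sum((b - mb) ** 2 for b in binary) * sum((c - mc) ** 2 for c in continuous))


for rep in range(3):
    print(f"Replication {rep + 1}: {percent_correct(results, rep):.1f}% correct")

n_unstable = unstable_items(results)
print(f"Items with unstable responses: {n_unstable} ({100 * n_unstable / len(results):.0f}%)")

# With examinee difficulty data one would check whether the model tends to get
# the easier items right, e.g.:
#   r = point_biserial([scores[0] for scores in results], examinee_p_values)
```

The point-biserial form is simply a Pearson correlation in which one variable is dichotomous, which is why the plain Pearson formula above suffices.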
Author | Yaneva, Victoria; Baldwin, Peter; Jurich, Daniel P; Swygert, Kimberly; Clauser, Brian E
AuthorAffiliation | V. Yaneva is manager, Natural Language Processing Research, Office of Research Strategy, National Board of Medical Examiners, Philadelphia, Pennsylvania; P. Baldwin is principal measurement scientist, Office of Research Strategy, National Board of Medical Examiners, Philadelphia, Pennsylvania; D.P. Jurich is associate vice president, United States Medical Licensing Examination, National Board of Medical Examiners, Philadelphia, Pennsylvania; K. Swygert is director, Test Development Innovations, National Board of Medical Examiners, Philadelphia, Pennsylvania; B.E. Clauser is distinguished research scientist, Office of Research Strategy, National Board of Medical Examiners, Philadelphia, Pennsylvania |
ContentType | Journal Article |
Copyright | Copyright © 2023 The Author(s). Published by Wolters Kluwer Health, Inc. on behalf of the Association of American Medical Colleges. Lippincott Williams & Wilkins |
DOI | 10.1097/ACM.0000000000005549 |
Discipline | Medicine; Education |
EISSN | 1938-808X |
EndPage | 197 |
Genre | Journal Article |
ISSN | 1040-2446 (print); 1938-808X (electronic) |
Issue | 2 |
Language | English |
License | Copyright © 2023 The Author(s). Published by Wolters Kluwer Health, Inc. on behalf of the Association of American Medical Colleges. This is an open-access article distributed under the terms of the Creative Commons Attribution-Non Commercial-No Derivatives License 4.0 (CC BY-NC-ND), where it is permissible to download and share the work provided it is properly cited. The work cannot be changed in any way or used commercially without permission from the journal. |
OpenAccessLink | https://pubmed.ncbi.nlm.nih.gov/PMC11444356 |
PMID | 37934828 |
PageCount | 6 |
PublicationDate | 2024-02-01 |
PublicationPlace | United States |
PublicationTitle | Academic medicine |
PublicationTitleAlternate | Acad Med |
PublicationYear | 2024 |
Publisher | Lippincott Williams & Wilkins |
StartPage | 192 |
SubjectTerms | Artificial Intelligence; Computer Simulation; Humans; Knowledge; Language; Learning; Research Reports |
Title | Examining ChatGPT Performance on USMLE Sample Items and Implications for Assessment |
URI | https://www.ncbi.nlm.nih.gov/pubmed/37934828; https://www.proquest.com/docview/2887477155; https://pubmed.ncbi.nlm.nih.gov/PMC11444356
Volume | 99 |