Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study

Background: ChatGPT has shown impressive performance in national medical licensing examinations, such as the United States Medical Licensing Examination (USMLE), even passing it with expert-level performance. However, there is a lack of research on its performance in low-income countries’ national licensing medical examinations. In Peru, where almost one out of three examinees fails the national licensing medical examination, ChatGPT has the potential to enhance medical education. Objective: We aimed to assess the accuracy of ChatGPT using GPT-3.5 and GPT-4 on the Peruvian National Licensing Medical Examination (Examen Nacional de Medicina [ENAM]). Additionally, we sought to identify factors associated with incorrect answers provided by ChatGPT.


Introduction
ChatGPT (OpenAI), a large language model (LLM) trained with over 175 billion parameters, has gained growing attention owing to its performance in different tasks, including mathematics, economics, and medicine [1]. During the first trimester of 2023, its performance in the United States Medical Licensing Examination (USMLE) improved rapidly, from almost passing the USMLE Step 1 and Step 2 Clinical Knowledge with 40%-60% accuracy [2] to passing both with expert-level performance, achieving 80%-90% accuracy in a recent study with the latest ChatGPT version [3]. Despite recent communications from different organizations and authors on the potential of ChatGPT to improve accessibility to high-quality education [4], including medical education [5-7], more research is required on the performance of ChatGPT on the national licensing medical examinations (NLMEs) of low-income countries.
In the Peruvian context, low-quality medical education is evidenced by high failure rates (42.8%) in the Peruvian NLME (Examen Nacional de Medicina [ENAM] in Spanish) [8]. This translates into low-to-medium self-perceived competencies of Peruvian doctors in the treatment of mental health disorders [9], leadership and management skills [10], evidence-based medicine [11], and clinical practices [12]. Furthermore, the pupil-to-teacher ratio in tertiary education in Peru is 19:1, according to the World Bank, which is higher than the recommended 16:1. Although there are no studies on the training of clinical educators or medical teachers, we believe that the situation in Peru may be similar to that described in a study conducted on Israeli physicians, in which 65% reported that they did not receive any training in medical education [13]. In this context, ChatGPT may enhance Peruvian medical education, especially from students' perspectives.
ENAM is a professional requirement for Peruvian medical doctors and international physicians who aspire to practice medicine within Peruvian borders. Since its introduction in 2003 by the Peruvian Society of Medical Schools, this examination has served as a key evaluation of doctors' readiness to practice medicine in the country [14]. ENAM is a written assessment conducted in Spanish that follows a multiple-choice question format. The test, comprising 180 questions, is primarily based on clinical vignettes related to the most common diseases and health issues prevalent in Peru in clinical, surgical, and public health areas. For Peruvian doctors, this crucial exam is conducted at the end of their internship, culminating in their 7-year undergraduate medical training [14,15].
The passing score on the ENAM is 10.5 on a vigesimal scale (95/180). Over the years, the examination has gained even more significance owing to the regulatory measures that have made it a critical element in the selection process for Rural Service positions [16]. Additionally, ENAM scores heavily influence the allocation of medical specialties, further underlining the role of the exam in shaping the professional paths of aspiring doctors in Peru. Therefore, passing the ENAM is not just about obtaining a license to practice medicine but also plays a considerable role in the professional trajectory of medical practitioners in the country.
Bearing this in mind, we hypothesized that if ChatGPT can pass the ENAM, it may be used as a medical tutor to enhance medical students' experience. Thus, in this study, we aimed to assess the accuracy of ChatGPT (GPT-3.5 and GPT-4) on the ENAM and identify factors associated with incorrect answers provided by ChatGPT.
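As a quick check of the passing threshold, the sketch below converts between raw correct answers and the vigesimal (0-20) scale. It assumes a simple linear scaling (score = 20 × correct / 180), which is our assumption; the text only states that 10.5 corresponds to 95/180. The function names are illustrative.

```python
import math

TOTAL_QUESTIONS = 180  # number of ENAM questions, as stated in the text
VIGESIMAL_MAX = 20     # top of the vigesimal scale


def to_vigesimal(correct, total=TOTAL_QUESTIONS):
    """Map a raw count of correct answers to the 0-20 scale (assumed linear)."""
    return VIGESIMAL_MAX * correct / total


def min_correct_to_pass(passing_score=10.5, total=TOTAL_QUESTIONS):
    """Smallest whole number of correct answers reaching the passing score."""
    return math.ceil(passing_score * total / VIGESIMAL_MAX)
```

Under this assumption, a score of 10.5 corresponds to 94.5 raw points, so 95 correct answers is the smallest whole number that passes, in line with the 95/180 given above.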

Data Set
Our primary data source was the 2022 ENAM question set obtained directly from the official website of the Peruvian Society of Medical Schools (ASPEFAM) [15]. The data set, comprising 180 multiple-choice questions, was subsequently uploaded to a Google Spreadsheet for evaluation. We did not translate the questions into English, keeping them in the original Spanish for authenticity and accuracy.
The 2022 data set was chosen for two main reasons: first, the ENAM blueprint ensures that each examination evaluates the same construct, thereby allowing a single year's data to be representative; second, since ChatGPT's training information only covers knowledge up to September 2021, the 2022 data set assures that the selected questions were not part of the model's training data. Therefore, we assert that our data set selection strategy offers a degree of generalizability to the ENAM. The ENAM 2022 data set is available in Multimedia Appendix 1.
We carefully collected the exam questions and divided them into four parts: (1) stem, the main problem or story (for example, "A 75-year-old man..."); (2) lead-in, the question asked (for example, "What is the most probable diagnosis?"); (3) response options, the different answers provided for each question; and (4) the correct answer, as given by the exam creators [17].

Procedures
Two ChatGPT versions were used, namely, GPT-3.5 and GPT-4. Our approach involved the development of three distinct prompts to guide the artificial intelligence (AI) response. To create these prompts, two authors (JAF-C and JG-A) engaged in discussions to ensure they accurately represented the cognitive processes an examinee would typically use when answering a multiple-choice question. After reaching a consensus, we designed a three-step prompt that, to the best of our understanding, mimics this thought process effectively.
The prompt was, "Analyze the following question, determine what is being assessed, and provide the correct answer/explanation." With this prompt, we followed the same process as Kung et al [18], inputting questions in three formats:
1. Open-ended prompt: We removed response options, thus providing only the stem and lead-in with the prompt.
2. Multiple-choice question with no justification: We provided the whole question with a stem, lead-in, and response options. In the prompt, we asked only to provide the correct answer with no further explanation.
3. Multiple-choice question with justification: We provided the whole question with stem, lead-in, and response options. In the prompt, we asked for a lengthy explanation.
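The three input formats can be illustrated with a minimal sketch. The base prompt is quoted from the text; the two format-specific instructions (`NO_JUSTIFICATION`, `WITH_JUSTIFICATION`), the `Item` class, and `build_prompt` are hypothetical, since the exact wording and tooling used in the study are not reported.

```python
from dataclasses import dataclass
from typing import List

# Base prompt quoted in the text.
BASE_PROMPT = (
    "Analyze the following question, determine what is being assessed, "
    "and provide the correct answer/explanation."
)

# Hypothetical wording: the study does not quote these instructions.
NO_JUSTIFICATION = "Provide only the correct answer, with no explanation."
WITH_JUSTIFICATION = "Provide the correct answer with a detailed explanation."


@dataclass
class Item:
    stem: str            # clinical vignette
    lead_in: str         # question asked
    options: List[str]   # response options


def build_prompt(item: Item, fmt: str) -> str:
    """Assemble one of the three input formats for a single question."""
    question = f"{item.stem} {item.lead_in}"
    if fmt == "open_ended":
        # Response options are removed; only stem and lead-in are sent.
        return f"{BASE_PROMPT}\n\n{question}"
    lettered = "\n".join(
        f"{chr(ord('A') + i)}. {opt}" for i, opt in enumerate(item.options)
    )
    if fmt == "mcq_no_justification":
        extra = NO_JUSTIFICATION
    elif fmt == "mcq_with_justification":
        extra = WITH_JUSTIFICATION
    else:
        raise ValueError(f"unknown format: {fmt}")
    return f"{BASE_PROMPT} {extra}\n\n{question}\n{lettered}"
```

Keeping the question parts separate in this way makes it straightforward to generate all three formats from the same stored item.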
Five of us (four medical students and one medical doctor) entered the questions into ChatGPT. Students received training on how to use ChatGPT through a prerecorded video, and their proficiency was assessed to ensure consistency in the application of prompts. A new chat session was initiated for each question to eliminate any potential memory retention bias. In situations where ChatGPT initially failed to deliver a clear response, we reattempted the question up to three times. The responses were then transferred to a structured Google Spreadsheet for further examination. The first (GPT-3.5) data extraction process was conducted between March 15 and 20, 2023, and the second (GPT-4) was conducted on May 5, 2023.
On May 20, 2023, we conducted a second run, which incorporated three prompts following incorrect answers in GPT-3.5 and GPT-4. After providing the question and lead-in without instructions, if an incorrect answer was provided, we asked, "Are you sure? Pretend to be a junior doctor with expertise in clinical practice and exam solving and retry." If an incorrect answer was provided again, the following final prompt was provided: "Are you sure? Re-assess the question and pretend to be a Peruvian junior doctor with expertise in clinical practice and exam solving and retry."
Additionally, we obtained the results of 1025 examinees who took the ENAM as a progress test in a national preparation course. The examinees comprised final-year medical students and medical doctors preparing to undertake the ENAM in 2023. Using this data set, we analyzed questions using classical test theory to calculate the difficulty and discrimination index using the psychometrics package in RStudio (version 4.2.1, RStudio, PBC). The difficulty index was calculated as the proportion of examinees answering each question correctly, estimating the individual question's difficulty level. The discrimination index refers to the question's capacity to differentiate between high and low performers on the overall test [19]. These two metrics were used to assess the validity of an assessment and to distinguish between examinees, thus enabling us to evaluate the performance of ChatGPT more accurately.
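To make the two classical test theory indices concrete, the following sketch computes them from a matrix of scored responses (1 = correct, 0 = incorrect). The discrimination index shown is the upper-lower group variant with 27% groups, a common textbook definition that we assume here for illustration; it is not necessarily the formula used by the R package in the study.

```python
def difficulty_index(responses):
    """Proportion of examinees answering each item correctly.

    responses: list of examinee rows, each a list of 0/1 item scores.
    """
    n = len(responses)
    n_items = len(responses[0])
    return [sum(row[j] for row in responses) / n for j in range(n_items)]


def discrimination_index(responses, fraction=0.27):
    """Upper-lower discrimination: proportion correct in the top-scoring
    group minus proportion correct in the bottom-scoring group.

    The 27% group size is an assumed convention.
    """
    ranked = sorted(responses, key=sum, reverse=True)  # best total score first
    k = max(1, round(len(ranked) * fraction))          # examinees per group
    upper, lower = ranked[:k], ranked[-k:]
    n_items = len(responses[0])
    return [
        (sum(r[j] for r in upper) - sum(r[j] for r in lower)) / k
        for j in range(n_items)
    ]
```

Note that a higher difficulty index means an easier item, because it is the proportion of correct answers.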

Variables
The outcome was the performance of ChatGPT (GPT-3.5 and GPT-4) on the ENAM, measured as correct or incorrect answers. For each version, we classified an answer as correct if it matched the official answer provided by ASPEFAM.
Independent variables were as follows: (1) type of objective, which was categorized as recall, whenever a question only required factual knowledge, or application, whenever a question required application of knowledge through clinical, therapeutic, communication, or professional decision-making; (2) Peruvian-specific knowledge (ie, if the question required knowledge specific to Peru, such as documentation or specific guidelines used in the country); (3) discrimination index; (4) difficulty index; (5) quality of questions; and (6) subject, which was categorized into basic sciences, internal medicine, surgery, obstetrics and gynecology, pediatrics, emergency medicine and critical care, and public health by two physicians with experience in assessing and preparing candidates for the ENAM.
Both the discrimination and difficulty indices were calculated using classic test theory for the sample of 1025 examinees. For the discrimination index, we considered the question to provide good discrimination if the index was ≥0.25. For difficulty, questions were classified as hard (<0.30), moderate (0.30-0.70), or easy (>0.70). The quality of questions was measured by JAF-C and JG-A using a 5-point Likert scale with the question, "What is the quality of this question?" Using this approach, we estimated the overall quality of the questions, including the stem, lead-in, and response options, using a tool based on the National Board of Medical Examiners' item writing flaws [17].
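The cutoffs above can be written as a small classification sketch. The function names are illustrative; the thresholds are those stated in the text, with the boundary values 0.30 and 0.70 both falling in the moderate category.

```python
def classify_difficulty(p):
    """Classify an item by its difficulty index (proportion correct)."""
    if p < 0.30:
        return "hard"
    if p <= 0.70:
        return "moderate"
    return "easy"


def has_good_discrimination(d):
    """An item discriminates well if its index is at least 0.25."""
    return d >= 0.25
```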

Statistical Analysis
We downloaded the data as Microsoft Excel files and exported the data to RStudio for analysis.
For descriptive analyses, we used absolute and relative frequencies for categorical variables and measures of central tendency and dispersion for numerical variables.
To compare the agreement between GPT-3.5 and GPT-4, we used Cohen κ. To evaluate factors associated with incorrect answers from GPT-3.5 and GPT-4, we used a logistic regression model to calculate the odds ratio (OR) and 95% CI.
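For readers unfamiliar with Cohen κ, the sketch below computes it for two sets of paired labels (for example, correct/incorrect for each question under each model). It is an illustrative implementation, not the study's R script. It returns `None` when chance-expected agreement equals 1, one situation in which κ is undefined; whether this is the case in which the study fell back on the raw proportion of agreement is our assumption.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    n = len(a)
    categories = set(a) | set(b)
    # Observed agreement: share of items with identical labels.
    p_o = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement under independence of the two raters.
    p_e = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    if p_e == 1:
        return None  # kappa undefined; report raw agreement instead
    return (p_o - p_e) / (1 - p_e)


def proportion_agreement(a, b):
    """Raw share of items on which both raters agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```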
We used the variance inflation factor (VIF) to assess multicollinearity among predictors and the Hosmer-Lemeshow test to assess goodness of fit. All variables of interest were entered into the multivariable model, and this process was conducted for GPT-3.5 and GPT-4. The predictive accuracy of each model was assessed using the receiver operating characteristic (ROC) curve, from which we calculated the area under the curve (AUC). The data set and the RStudio script are available in Multimedia Appendices 2 and 3, respectively.
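The AUC has a direct probabilistic reading: it is the probability that a randomly chosen positive case (here, an incorrectly answered question) receives a higher predicted probability than a randomly chosen negative case, with ties counted as one half. The sketch below computes it by pairwise comparison; it is illustrative only and is not the authors' script.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: predicted probabilities from the model.
    labels: 1 for the event of interest (incorrect answer), 0 otherwise.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count positive-negative pairs ranked correctly; ties count one half.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation.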

Ethical Considerations
This study adhered to the Helsinki Declaration. No humans were involved during the study. Therefore, evaluation by the ethics committee was not considered necessary.

Comparison of GPT-3.5 and GPT-4
As shown in Figure 2, GPT-4 outperformed GPT-3.5 in almost all medical areas except surgery (GPT-4, 81.8%; GPT-3.5, 84.8%) and emergency medicine (GPT-4, 87.5%; GPT-3.5, 100%); however, these differences were not significant. When conducting a subanalysis for each subcategory, we found that GPT-4 outperformed GPT-3.5 in all categories except for medium-quality questions, as shown in Table 1.
We used Cohen κ to assess the agreement between GPT-3.5 and GPT-4; the overall agreement was κ=0.38 (Table 1). The agreement was higher for questions that required Peruvian knowledge (κ=0.76), questions that assessed recall of knowledge (κ=0.65), and questions from obstetrics and gynecology (κ=0.57). When calculating Cohen κ was not feasible, we calculated the proportion of agreement between raters, which was highest for high-difficulty questions (100%), low-difficulty questions (90%), and questions from emergency and critical care (87.5%). A more in-depth analysis is portrayed in Figure 3.

Factors Associated With ChatGPT Incorrect Answers
When analyzing the odds for incorrect answers on GPT-3.5 and GPT-4, we found that high- and moderate-difficulty questions presented higher odds for incorrect answers in the adjusted model both for GPT-3.5 (OR 6.6, 95% CI 2.73-15.95) and GPT-4 (OR 33.23, 95% CI 4.3-257.12), and low-quality questions were associated with correct answers in the GPT-3.5 adjusted model (OR 0.14, 95% CI 0.02-0.87), as shown in Table 2. Furthermore, the GPT-3.5 and GPT-4 adjusted models had AUCs of 0.782 and 0.851, respectively. None of the variables included had a VIF>5.
a The area under the curve was 0.782 for GPT-3.5 and 0.851 for GPT-4. The variance inflation factor was <5 for all variables.
b OR: odds ratio.
c Model adjusted for Peru-specific knowledge requirement, area, quality of questions, Bloom taxonomy, discrimination, and difficulty.
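The reported intervals can be sanity-checked with a short sketch. It assumes Wald-type intervals, that is, intervals symmetric around the log odds ratio (log OR ± z × SE); this is our assumption, as the text does not state how the intervals were computed. Under that assumption, the geometric mean of the two bounds should approximately recover the point estimate.

```python
import math

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def implied_or_and_se(ci_low, ci_high, z=Z_95):
    """Recover the point estimate and the standard error of log(OR)
    implied by a confidence interval that is symmetric on the log scale."""
    log_low, log_high = math.log(ci_low), math.log(ci_high)
    point = math.exp((log_low + log_high) / 2)  # geometric mean of bounds
    se = (log_high - log_low) / (2 * z)
    return point, se
```

Applied to the GPT-3.5 difficulty interval (2.73-15.95), this yields a point estimate close to the reported 6.6; for GPT-4 (4.3-257.12), it yields a value close to the reported 33.23. The very wide GPT-4 interval reflects a large standard error, which suggests the estimate rests on few events.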

Reinput of Prompts for Incorrect Answers
Finally, we reinput prompts for incorrect answers following a three-step process, as shown in Figure 4. After reinputting prompts, GPT-3.5 provided 12 (29%) persistent incorrect answers, and GPT-4 provided 4 (16%), showing that scores improved when the prompts were varied.
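The three-step reinput process can be sketched as an escalation loop. The two follow-up prompts are quoted from the Procedures section; the `ask` callable is a hypothetical stand-in for one turn of a ChatGPT conversation and does not correspond to any real API. The helper `persistent_rate` reproduces the percentages above, taking the initial incorrect counts (41 for GPT-3.5 and 25 for GPT-4) as denominators.

```python
FOLLOW_UPS = [
    "Are you sure? Pretend to be a junior doctor with expertise in "
    "clinical practice and exam solving and retry.",
    "Are you sure? Re-assess the question and pretend to be a Peruvian "
    "junior doctor with expertise in clinical practice and exam solving "
    "and retry.",
]


def reinput(question, correct, ask):
    """Return the step (1-3) at which the correct answer appears, or None.

    ask(prompt) -> answer is a hypothetical callable wrapping one turn
    of the same chat session.
    """
    prompts = [question] + FOLLOW_UPS
    for step, prompt in enumerate(prompts, start=1):
        if ask(prompt) == correct:
            return step
    return None  # persistent incorrect answer


def persistent_rate(initial_incorrect, still_incorrect):
    """Percentage of initially incorrect answers that stay incorrect."""
    return round(100 * still_incorrect / initial_incorrect, 1)
```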

Principal Findings
Here we showed that ChatGPT (GPT-3.5 and GPT-4) can pass the ENAM with expert-level performance. Furthermore, GPT-4 surpassed almost 90% of examinees in our data set with an accuracy of 86.1%, and GPT-3.5 surpassed 80% of examinees with an accuracy of 77.2%. These results are in concordance with the findings of Nori et al [3], who reported an accuracy of 84.75% and 48.12% for GPT-4 and GPT-3.5, respectively, in the USMLE Step 2 Clinical Knowledge. Another study on the Neurosurgery Oral Board Preparation Question Bank showed that GPT-4 performed with an accuracy of 82.6%, while GPT-3.5 achieved an accuracy of 62.4% [20]. However, in our study, GPT-3.5 performed better on the NLME compared to previous studies where it failed examinations, including the USMLE and Spanish, Japanese, and Chinese NLMEs [2,21-23]. This can be explained by our use of a prompt that resembles the "chain-of-thought prompting approach," in which ChatGPT decomposes multistep problems into smaller and manageable steps to enhance accuracy [24]. However, more studies are needed to understand whether this prompt structure improves performance in health care-related tasks.
When analyzing differences between the two versions, GPT-4 outperformed GPT-3.5 in almost all areas; however, we observed fair agreement between versions. The agreement was higher for high-difficulty questions, for which both versions failed all questions, and low-difficulty questions, for which both versions answered all questions correctly. These results suggest that the improvement in performance from GPT-3.5 to GPT-4 is due to enhanced reasoning rather than randomness [1].

Although previous studies reported the likelihood of lower accuracy in GPT-3.5 for higher-order problem-solving [20], we found that when adjusting for all variables, moderate-to-high difficulty questions were associated with incorrect answers for both GPT-3.5 and GPT-4 and that low-quality questions were associated with correct answers for only GPT-3.5. Notably, our findings differ from those of another study that did not find a correlation between question difficulty and accuracy using GPT-3.5 [25]; however, in that study, difficulty was measured through perception rather than through classic test theory. Lastly, we showed that when reinputting questions, ChatGPT provided new and more accurate responses and that role-play and context-setting in prompts effectively improved performance, reducing GPT-3.5's incorrect answers from 41 to 12 and GPT-4's incorrect answers from 25 to 4. Our findings resemble those of a previous study that showed that novel explanations provided when reinputting questions improved performance from 8.61% to 9.79% [25].
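The accuracy figures and the incorrect-answer counts reported above are mutually consistent, as the following arithmetic sketch shows. It assumes that accuracy is simply the share of the 180 questions answered correctly on the first run; the function name is illustrative.

```python
TOTAL_QUESTIONS = 180  # number of ENAM questions


def accuracy_pct(incorrect, total=TOTAL_QUESTIONS):
    """Accuracy (%) implied by a count of incorrect answers."""
    return round(100 * (total - incorrect) / total, 1)
```

With 41 incorrect answers, GPT-3.5 answered 139 of 180 questions correctly (77.2%); with 25 incorrect answers, GPT-4 answered 155 of 180 correctly (86.1%).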

Strengths and Limitations
To our knowledge, this is the first study to assess the agreement between GPT-3.5 and GPT-4 in the context of medical education and to examine factors linked to incorrect answers. We demonstrated that reinputting incorrectly answered questions with varied prompts and changed roles and contexts improved the accuracy of ChatGPT.
However, certain limitations of this study should be considered when interpreting our results. First, our study was confined to the Peruvian medical education system and involved a relatively limited number of questions. Therefore, the results may not be generalizable to other educational settings or a wider range of questions. We recommend future research with larger sample sizes, more diverse examinations, broader question sets, and different factors to identify reasons for wrong answers, such as the date of the questions.
Second, while GPT-4 exhibited expert-level performance on the ENAM, this finding must be cautiously interpreted. The competencies required by a medical professional, as defined by frameworks such as CanMEDS or the Accreditation Council for Graduate Medical Education core competencies, extend beyond the confines of a licensing examination. These examinations assess knowledge and its application under controlled conditions, which may differ substantially from real-world clinical scenarios. Furthermore, more valid assessment tools, such as entrustable professional activities, represent the gold standard in medical education. Consequently, despite GPT-4's promising performance, it is premature to suggest that it could replace human doctors. We encourage additional research to assess the potential use of ChatGPT in different roles or as a supportive tool for medical practitioners.
Finally, our study did not evaluate the use of "mega-prompts" (large, intricate prompts detailing specific roles, contexts, and tasks, which might elicit more sophisticated and targeted responses) or other novel methods, such as chain-of-thought prompts [24] or tree-of-thoughts [26]. Therefore, our findings may not fully encompass the range and depth of responses that GPT-3.5 and GPT-4 can achieve. We recommend that future studies explore the effects of different prompts on the performance of ChatGPT in medical education.

Implications
This study has several implications for both medical education and research on ChatGPT and AI. First, we demonstrated that ChatGPT can pass the ENAM with expert-level performance, surpassing 9 out of 10 examinees. Although our sample does not represent the real score in the ENAM, a previous study [9] found that high ENAM scores from examinees from 2009 to 2019 ranged from 16.58 to 17.63, which is on par with GPT-4's score of 17.2. Using a variety of LLMs, we can begin to tailor assessments for different students' needs, as each LLM (InstructGPT, GPT-3.5, GPT-4, or others) may be representative of a cluster of subjects or performance levels from novice to expert. Thus, assessments may be inputted into LLMs, and an easy, rapid, and valid evaluation of the level of the assessment may be estimated using the percentage of correct answers obtained by the selected LLM.
Second, we found that incorrect answers provided by ChatGPT using GPT-3.5 and GPT-4 were associated with question difficulty, which opens further research directions to identify reasons why ChatGPT fails some questions and to inform new directions for understanding the behavior of LLMs. Also, to our knowledge, this study is the first to apply psychometrics to ChatGPT, and further studies could explore different theories, such as cognitive diagnostic modeling or other diagnostic classification models with larger data sets, searching for a more in-depth understanding of the reasoning process of ChatGPT. Third, by reinputting incorrectly answered questions and adjusting prompts with more complexity (ie, adding roles and context), we found that ChatGPT may perform better. This requires further research on prompt engineering in medical education with tailored prompts for specific tasks, such as the development of assessment tools, curriculum development, communication with patients, or tutoring students. Additionally, tailored LLMs trained with specific and curated medical knowledge are needed for these different applications.
Finally, despite the outstanding performance of ChatGPT in the ENAM, as previously stated by Thirunavukarasu [27], practicing medicine requires more than just responding correctly to a set of multiple-choice questions.Thus, being a doctor is a complex and never-ending process that requires us to wear several hats as medical experts, communicators, collaborators, academics, and several other roles.Consequently, we recommend that future research be aligned with medical competencies and roles; this will allow us to guide research on ChatGPT and LLMs to answer more specific questions that may aid us in spending time on more meaningful tasks.

Conclusions
Our study found that ChatGPT (GPT-3.5 and GPT-4) can achieve expert-level performance on the ENAM, outperforming most of our examinees. We found fair agreement between both versions.
There was an association between high-to-moderate-difficulty questions and wrong answers in both versions of ChatGPT. Furthermore, we observed enhanced performance by reinputting new prompts for incorrectly answered questions and adding roles and context for ChatGPT. Despite the outstanding performance of ChatGPT, we note that being a doctor goes beyond passing a licensing examination.

©Javier A Flores-Cohaila, Abigaíl García-Vicente, Sonia F Vizcarra-Jiménez, Janith P De la Cruz-Galán, Jesús D Gutiérrez-Arratia, Blanca Geraldine Quiroga Torres, Alvaro Taype-Rondan. Originally published in JMIR Medical Education (https://mededu.jmir.org), 28.09.2023. This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in JMIR Medical Education, is properly cited. The complete bibliographic information, a link to the original publication on https://mededu.jmir.org/, as well as this copyright and license information must be included.

d Ref: reference category.
e N/A: not applicable.
f Clinical areas include internal medicine and pediatrics.
g Surgical areas include obstetrics and gynecology and surgery.
h Longitudinal areas include public health, basic sciences, and emergency and critical care.
i Not available.
*P<.05.

Figure 4. Flowchart of the reinput process for incorrect answers provided by GPT-3.5 and GPT-4.
a Prompts were formatted as multiple-choice questions with justification.

Table 2. Factors associated with incorrect answers given by GPT-3.5 and GPT-4 (footnote a).