AI chatbots show promise but limitations on UK medical exam questions: a comparative performance study

Large language models (LLMs) such as ChatGPT have potential applications in medical education, for example helping students prepare for licensing examinations by discussing unclear questions with them. However, they require evaluation on these complex tasks. The purpose of this study was to evaluate how well publicly accessible LLMs perform on simulated UK medical board examination questions. Seven LLMs (ChatGPT-3.5, ChatGPT-4, Bard, Perplexity, Claude, Bing, and Claude Instant) answered 423 board-style questions from nine UK examinations (MRCS, MRCP, and others): 406 multiple-choice, 13 true/false, and 4 "choose N" questions covering surgery, pediatrics, and other disciplines. The accuracy of each output was graded, questions leaked online were excluded from the primary analysis, and differences among LLMs were assessed with Cochran's Q test and Bonferroni-corrected pairwise comparisons. ChatGPT-4 scored highest (78.2%), followed by Bing (67.2%), Claude (64.4%), and Claude Instant (62.9%); Perplexity scored lowest (56.1%). Scores differed significantly between LLMs overall (p < 0.001) and in pairwise comparisons. All LLMs scored higher on multiple-choice questions than on true/false or "choose N" questions. LLMs showed clear limitations on certain question types, indicating that refinement is needed before they can be relied on as a primary resource in medical education. However, their expanding capabilities suggest potential to improve training if thoughtfully implemented. Further research should explore specialty-specific LLMs and their optimal integration into medical curricula.


Input source and data abstraction
In this prospective study, carried out from July 1st to July 31st, 2023, 440 publicly accessible multidisciplinary board-style test questions drawn from the sample questions provided by the official board examination sites were used to assess the performance of multiple artificial intelligence language models trained on large datasets. Included sample questions were retrieved from board examination websites including MRCS, MRCP, RCPCH, RCOG, RCOphth, MRCPsych, FRCR (physics), FRCA, and MCEM, in addition to sample obstetrics and gynecology questions provided by the BMJ. All inputs were true representations of real-exam scenarios, assessing the performance of these AI models across a wide range of advanced medical disciplines. The inputs were then systematically screened to ensure that none of the test answers, explanations, or exam-related content were present in the chatbots' training data. Two researchers independently assessed each question for leakage by searching for both a sentence and the full question on Google and in the C4 database13, which is included in the training data of most chatbots14. The Google search was restricted with the date filter "before:2022,1,1" (representing the latest date accessible to ChatGPT's training) and with quotation marks around a sentence of the question and around the full question. All questions leaked to Google, whether before or after 2022, or to the C4 database were excluded from the primary analysis. Furthermore, all sample test questions were screened to remove items containing visual or audiological inputs such as clinical images, graphs, and clinical audio. After screening and excluding 17 questions containing images (all from the pediatrics section), 423 board-style items spanning multiple medical disciplines were advanced to data extraction and analysis. While using ChatGPT-4 and Bard, we made sure not to activate the web-search feature in these chatbots.
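
For illustration only, the sketch below shows one way the leakage-screening query strings could be assembled programmatically; the example question text is hypothetical, and in the study the queries were run by the researchers themselves through Google and a C4 search tool.

```python
# Minimal sketch (not the study's actual tooling): build the quoted Google
# queries used for leakage screening, restricted by the date filter reported
# in the Methods. The example question text below is hypothetical.
DATE_FILTER = "before:2022,1,1"  # latest date accessible to ChatGPT's training, as used in the study

def leakage_queries(full_question: str, sentence: str) -> list[str]:
    """Return the two quoted query strings checked per item:
    one for a single sentence of the question and one for the full question."""
    return [f'"{sentence}" {DATE_FILTER}',
            f'"{full_question}" {DATE_FILTER}']

# Hypothetical item; real items came from the board examination websites.
question = ("A 45-year-old woman presents with a 2-cm thyroid nodule. "
            "What is the most appropriate next investigation?")
first_sentence = "A 45-year-old woman presents with a 2-cm thyroid nodule."

for q in leakage_queries(question, first_sentence):
    print(q)  # each query was then checked on Google and in the C4 search tool
```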

Statistical analysis
The extracted data were clustered into two categories, with an output of 1 indicating that the AI model answered the question correctly and an output of 0 indicating an incorrect answer or no answer. The data were then analyzed with Cochran's Q test to assess differences between chatbots at a significance level of p = 0.05. Further pairwise analysis was conducted with a Bonferroni correction at a significance level of p = 0.002. Statistical analysis was carried out using Jamovi15 and SPSS16. Whenever a chatbot refused to answer on the grounds of not giving medical advice, that datum was treated as missing.
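
The analysis itself was run in Jamovi and SPSS; as a rough illustration of the same procedure, the sketch below applies Cochran's Q and Bonferroni-corrected pairwise tests to a binary score matrix in Python with statsmodels. The score matrix here is random placeholder data, and McNemar tests are used as the pairwise procedure, which may differ in detail from the SPSS routine used in the study.

```python
# Illustrative sketch: Cochran's Q across seven chatbots, then
# Bonferroni-corrected pairwise McNemar tests on a binary score matrix
# (1 = correct, 0 = incorrect/no answer). Placeholder data only.
from itertools import combinations

import numpy as np
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar

rng = np.random.default_rng(0)
chatbots = ["Perplexity", "GPT-3.5", "Bard", "Claude Instant",
            "Claude", "Bing", "GPT-4"]
scores = rng.integers(0, 2, size=(423, len(chatbots)))  # placeholder scores

q = cochrans_q(scores)
print(f"Cochran's Q = {q.statistic:.2f}, p = {q.pvalue:.4f}")

# 7 models give 21 pairs, so the Bonferroni threshold is 0.05 / 21 ≈ 0.002.
pairs = list(combinations(range(len(chatbots)), 2))
alpha = 0.05 / len(pairs)
for i, j in pairs:
    a, b = scores[:, i], scores[:, j]
    table = [[np.sum((a == 1) & (b == 1)), np.sum((a == 1) & (b == 0))],
             [np.sum((a == 0) & (b == 1)), np.sum((a == 0) & (b == 0))]]
    res = mcnemar(table, exact=False, correction=True)
    flag = "significant" if res.pvalue < alpha else "n.s."
    print(f"{chatbots[i]} vs {chatbots[j]}: p = {res.pvalue:.4f} ({flag})")
```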

Results
In this study, we assessed the performance of various AI models in solving board-style questions from the MRCS, MRCP, RCPCH, RCOG, RCOphth, MRCPsych, FRCR (physics), FRCA, and MCEM examinations. A total of 423 questions were included in the final analysis; each chatbot's output was recorded and compared with the standardized answer key.

Assessment of test set leakage
We found 7 questions leaked to the C4 database, all of which were from the obstetrics and gynecology specialty and came from the MRCOG website. We also found 18 MCQ questions leaked to Google when filtering by date.
A Cochran's Q test was conducted to assess whether there were differences in performance between the seven samples: Perplexity, GPT-3.5, Bard, Claude Instant, Claude, Bing, and GPT-4. The result was statistically significant, χ²(6) = 68.64, p < 0.001, indicating significant differences in performance between the samples overall (Table 6).
Further pairwise comparisons were conducted with a Bonferroni correction to pinpoint where the differences existed between pairs of samples. The analysis revealed that ChatGPT-4 significantly outperformed all other samples, scoring higher than Perplexity (p < 0.001), ChatGPT-3.5 (p < 0.001), Bard (p < 0.001), Claude Instant (p < 0.001), Claude (p < 0.001), and Bing (p < 0.001), suggesting that ChatGPT-4 was superior to all other models tested. Moreover, Perplexity scored significantly lower than several other models; it performed worse than ChatGPT-4 (p < 0.001) and Bing (p < 0.001). A summary of the pairwise comparisons is presented in Table 7 and in Figs. 2 and 3.

Assessment results for leaked questions
All chatbots scored higher on questions leaked to the C4 database except for Claude Instant, which performed worse on the seven questions leaked to the Common Crawl-derived database (0.571 ± 0.535) than on the other questions (0.631 ± 0.483). Bard answered all questions leaked to the C4 database correctly, compared with a lower score of 0.585 ± 0.493 on the other questions. Questions leaked to Google before the predetermined date of 1/1/2022, however, showed no such association with chatbot performance; in fact, all chatbots performed worse on these questions than on the other questions. The breakdown of the chatbots' results on leaked questions is presented in Table 8.

Discussion
In this study, we examined the performance of various publicly available LLMs on questions derived from standardized United Kingdom medical board examinations. This was done to explore their potential use as educational and test-preparation tools for medical students and doctors in the United Kingdom. The seven AI models used in the study were ChatGPT-3.5, ChatGPT-4, Bard, Perplexity, Claude, Bing, and Claude Instant. Three question formats were given to the AI models: multiple choice, true/false, and "choose N from many".
Our results showed statistically significant variation in the average scores of the AI models. ChatGPT-4 had the best overall performance, while Perplexity and Bard performed worst among the seven models. The remaining four models performed at an intermediate level, with no significant difference between them. Despite ChatGPT-4 scoring the highest average across multiple-choice and true/false questions, it scored the lowest on "choose N from many" questions (25%). In terms of average scores by question format, multiple-choice questions yielded the highest scores overall, with the different LLMs averaging between 60 and 81% correct (overall average 66%). In comparison, performance was lower on the true/false and "choose N from many" formats. The true/false questions proved the most challenging: LLMs scored between 0 and 31% correct, with Perplexity unable to answer any question correctly. On the "choose N from many" questions, performance was better than on true/false but worse than on multiple choice; LLMs averaged 25-50% correct, with Claude Instant and Bing scoring 50%, the highest of any model in this format.

These results highlight differences in how well LLMs handle various question types. Even an LLM that scores highly on one format, as GPT-4 does on multiple choice, does not necessarily perform as well on another format such as "choose N from many". This suggests that the models have strengths and weaknesses that depend on the prompt structure. Overall, their ability to reason through and answer medical exam questions accurately across different formats remains limited compared with that of human experts. However, performance is steadily improving, underscoring the importance of continued research on refining LLM skills for complex tasks.

Similar to our study, many other papers have shown the remarkable ability of LLMs to pass reputable exams. Antaki et al. demonstrated the ability of ChatGPT to pass ophthalmology examinations at the level of a first-year resident17. Furthermore, it was found to pass the United States Medical Licensing Examination with a score equal to that of an average third-year medical student11. However, most of these studies were limited to OpenAI's ChatGPT alone. In contrast, our study explored seven LLMs including ChatGPT, allowing a more comprehensive performance analysis of currently available LLMs. Moreover, this is the first study to explore the performance of LLMs across various United Kingdom medical board examinations. Our findings can be summarized in three major themes: (1) ChatGPT-4 remains the best average performer among AI models; (2) the performance of AI models may differ depending on the formulation of the prompt; and (3) the use of AI models as a secondary educational tool is promising, but using such models as a primary source is not recommended before further refinement.
Recent advancements in LLMs, specifically ChatGPT, seem poised to disrupt current models of medical education and assessment. Trends in AI improvement indicate that the implementation of this technology in all fields, including medicine, is inevitable. The continuous improvement of these models can be seen in the documented rise of ChatGPT's performance on the United States Medical Licensing Examination to 60%, compared with the much lower accuracy rates found in previous studies on comparable tests11,18. Additionally, in our study ChatGPT-4 scored 78% correct overall, 18 percentage points higher than the previously reported USMLE score. Considering that these exams are intended to test medical personnel at a similar level, it is reasonable to assume that this reflects the continuous improvement of such models. Therefore, these models should be treated as opportunities to improve all aspects of medical education in an ethical and responsible manner. Efforts must be directed at exploring further methods to enhance the ability of LLMs to answer prompts with higher accuracy. Currently, the performance of LLMs suggests that they should be used as an adjunct within a comprehensive educational approach rather than as the primary source19. This takes into consideration the current limitations of LLMs in scientific and mathematical knowledge and applications19.
An important aspect to consider with the rise of these models is the ethical concern of potential misuse with malicious intent, such as cheating. The risk of such misuse should be weighed against the expected gains from adopting these models in medical education.

Factors affecting chatbot accuracy
The varying accuracy of the chatbots' answers can partly be attributed to the small number of questions we were able to acquire in some formats. For instance, all chatbots performed worse on the 13 true/false questions (0.187 ± 0.392) and the four "choose N from many" questions (0.286 ± 0.46) than on the 406 MCQs (0.664 ± 0.473). This unbalanced sample may limit the generalizability of our results to questions other than MCQs. As for the questions leaked to Google, the websites hosting them varied: some were locked behind a paywall, such as on the Scribd website20, others were in PDF format as part of question samples20,21, and others were on flashcards on websites such as Quizlet22. Investigating the leakage of exam questions into the databases included in publicly available LLMs can be very advantageous for academic or research purposes. It can be done, akin to our approach, by searching the C4 database or by implementing guided prompting to answer medical questions from a specific dataset23.

Prompting
As mentioned previously, the disparity between the percentage of correct answers on MCQ questions and on true/false questions can be explained by multiple factors. The first factor is prompting. Prompt engineering refers to the practice of carefully designing and optimizing the prompts or instructions given to AI systems (such as ChatGPT) to improve their performance on specific tasks. This can help communicate user intent and desired outputs to LLMs. It also improves performance, provides customizable interaction, allows incorporation of external knowledge, controls output features, and mitigates biases. Published research on prompt engineering for medical users is scarce; however, several preprints24-27 have suggested practices for good prompt engineering.
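
As a purely illustrative example of the "clear, specific instructions" practice discussed here (the study does not report its exact prompts, so the wording below is hypothetical), a structured prompt for a multiple-choice item might be assembled as follows.

```python
# Hypothetical example of a structured prompt for a multiple-choice board
# question; not the prompt used in this study.
question = "A 67-year-old man presents with sudden painless loss of vision in one eye. ..."
options = {"A": "Central retinal artery occlusion", "B": "Acute angle-closure glaucoma",
           "C": "Optic neuritis", "D": "Retinal detachment", "E": "Vitreous haemorrhage"}

prompt = (
    "You are answering a UK postgraduate medical examination question.\n"
    "Choose exactly one option and reply with the letter only.\n\n"
    f"Question: {question}\n"
    + "\n".join(f"{letter}. {text}" for letter, text in options.items())
)
print(prompt)  # this string would then be pasted into the chatbot interface
```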

Leakage
Data leakage significantly impacts the apparent accuracy of chatbots, particularly in the domain of medical question answering. Data leakage occurs when the training data of a model inadvertently includes information from the test set, leading to an overestimation of the model's true performance. Brookshire et al.28 explored this by studying the effect of data leakage on neural networks' ability to correctly identify a range of disorders using EEG; in their example, leakage of EEG segments into the training set and their reappearance in the test set led to inflated model accuracy. Leakage can create a false sense of reliability and inflated accuracy29. A model trained on leaked data may appear to perform exceptionally well during testing, but this performance does not translate to real-world scenarios where the model must answer previously unseen questions. This discrepancy is particularly concerning for medical students who rely on chatbots for studying and acquiring accurate medical knowledge.
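
To make the leakage concept concrete, the sketch below flags test questions whose text already appears verbatim in a training corpus. This is a simplified, hypothetical check for illustration only; it is neither the procedure of Brookshire et al. nor the screening used in this study.

```python
# Simplified illustration of test-set leakage detection: flag any test
# question that appears verbatim (after light normalisation) in the
# training corpus. Both corpora below are hypothetical stand-ins.
def normalise(text: str) -> str:
    return " ".join(text.lower().split())

def leaked_items(test_questions: list[str], training_corpus: str) -> list[str]:
    corpus = normalise(training_corpus)
    return [q for q in test_questions if normalise(q) in corpus]

training_corpus = "... large scraped text that happens to contain Question B ..."
test_questions = ["Question A about beta-blockers?", "Question B"]
print(leaked_items(test_questions, training_corpus))  # items to exclude or report separately
```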

Table 1 .
Frequencies of questions by leakage and specialty.

Table 2 .
Frequencies of questions by type of question and leakage.

Table 3 .
Mean score (after removal of leaked questions) per chatbot.

Table 4 .
Mean score by chatbot and question type.

Table 5 .
Mean score by chatbot and specialty.

Table 7 .
Pairwise comparisons. Each row tests the null hypothesis that the Sample 1 and Sample 2 distributions are the same. Asymptotic significances (two-sided tests) are displayed. The significance level is 0.050. aSignificance values have been adjusted by the Bonferroni correction for multiple tests.
Users are advised24,26,27 to provide clear, specific instructions, as ambiguous prompts can lead to unclear or irrelevant responses. Moreover, users are encouraged to continuously test and tweak prompts based on model responses in order to improve them24-27. While all chatbots included in this study can be described as LLMs that generate text from user-developed prompts, it is better to treat the available options as specialized tools for different tasks. Although more research is needed as medically oriented LLMs develop, we can deduce from each chatbot's description and characteristics the uses in which it may excel over its peers. For instance, of the included chatbots, only Perplexity and Bing AI provide sources, with Perplexity able to refine sources more accurately to academic ones. Moreover, Perplexity has a GPT-4 copilot that may enhance its answers to medical questions, but we did not assess it. On the other hand, only ChatGPT-4 (paid version) and Claude have file-analysis features that enable them to summarize texts and analyze spreadsheets and code. Claude and ChatGPT (both free and paid versions) are not currently available in some regions, which may prevent users (researchers, medical practitioners, and medical students) in numerous countries from accessing them. It will be interesting to see how the current AI revolution unfolds and what new tools will contribute to medical education and medical decision making.

Table 8 .
Assessment of chatbot answer accuracy in answering leaked questions.