Assessment of ChatGPT-4 in Family Medicine Board Examinations: An Observational Study Using Advanced AI Learning and Analytical Methods

Background: This research explores the capabilities of ChatGPT-4 in passing the American Board of Family Medicine (ABFM) Certification Examination. Addressing a gap in existing literature, where earlier Artificial Intelligence (AI) models showed limitations in medical board exams, this study evaluates the enhanced features and potential of ChatGPT-4, especially in document analysis and information synthesis.

Objective: The primary goal is to assess whether ChatGPT-4, when provided with extensive preparation resources and using sophisticated data analysis, can achieve a score equal to or above the passing threshold for the Family Medicine Board Examinations.

Methods: In this study, ChatGPT-4 was embedded in a specialized subenvironment, "AI Family Medicine Board Exam Taker," designed to closely mimic the conditions of the ABFM Certification Examination. This subenvironment enabled the AI to access and analyze a range of relevant study materials, including a primary medical textbook and supplementary online resources. The AI was presented with a series of past ABFM exam questions, reflecting the breadth and complexity typical of the exam. Emphasis was placed on assessing the AI's ability to interpret and respond to these questions accurately, leveraging its advanced data processing and analysis capabilities within this controlled subenvironment.

Results: ChatGPT-4's performance was quantitatively assessed on 300 practice ABFM exam questions. The AI achieved a correct response rate of 88.67% (95% CI 85.08% to 92.25%) for the Custom Robot version and 87.33% (95% CI 83.57% to 91.10%) for the Regular version. Statistical analysis, including the McNemar test (P=0.4533), indicated no significant difference in accuracy between the two versions. Additionally, the Chi-square test for error type distribution (P=0.3163) revealed no significant variation in the pattern of errors across versions. These results highlight ChatGPT-4's capacity for high-level performance and consistency in responding to complex medical examination questions under controlled conditions.

Conclusions: The study demonstrates that ChatGPT-4, particularly when equipped with specialized preparation and operating in a tailored subenvironment, shows promising potential in handling the intricacies of medical board examinations. While its performance is comparable to the expected standards for passing the ABFM Certification Examination, further enhancements in AI technology and tailored training methods could push these capabilities to new heights. This exploration opens avenues for integrating AI tools like ChatGPT-4 in medical education and assessment, emphasizing the importance of continuous advancement and specialized training in AI applications in healthcare.



Introduction
Family physicians in the United States are required to complete the American Board of Family Medicine (ABFM) Certification Examination following residency, and every ten years thereafter, to maintain board-certified status. This exam consists of 300 questions scored on a scale from 200 to 800; the minimum passing score corresponds to a percent correct of approximately 57.7% to 61.0% (1). Extensive online review materials, such as textbooks and question banks, are available to help candidates prepare for this examination. Several studies have examined the performance of advanced Artificial Intelligence (AI) language models (e.g., ChatGPT) in attempting, and failing, similar board examinations (2,3). Many of these studies used ChatGPT version 3.5; however, a study examining the newer and more powerful ChatGPT-4 found that it significantly outperformed both its predecessor and medical residents on a University of Toronto family medicine examination (4).
ChatGPT-4 can now analyze documents in several file formats, such as PDF. This allows a user to simulate the process of learning and studying by providing learning material for the AI to consult before it is tested. With this approach, the AI can be given material targeted to a specific region's regulations, ensuring it has access to the most up-to-date clinical guidelines.
Users engage with ChatGPT through text inputs called prompts. The content of the prompt dictates the output. Prompt engineering is the purposeful structuring of this input, and it significantly affects the output. The four core elements of a prompt are the instruction, the context, the input data, and the output indicator (5). This means that, for the best result, the user must assign a task, provide context and background knowledge, ask a specific question, and specify the type of output desired, as sketched in the example below.
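To make the four elements concrete, the following minimal Python sketch assembles them into a single prompt. The field wording and sample question are hypothetical illustrations, not the prompts used in this study.

```python
# Minimal sketch: assembling the four core prompt elements into one input.
# All wording here is illustrative, not the study's actual prompt.
def build_prompt(instruction: str, context: str, input_data: str,
                 output_indicator: str) -> str:
    """Combine instruction, context, input data, and output indicator."""
    return "\n\n".join([
        f"Instruction: {instruction}",          # the task assigned to the model
        f"Context: {context}",                  # background knowledge and constraints
        f"Question: {input_data}",              # the specific item to answer
        f"Output format: {output_indicator}",   # the desired shape of the response
    ])

prompt = build_prompt(
    instruction="Answer the following family medicine board review question.",
    context="Prioritize current clinical guidelines and explain your reasoning.",
    input_data="A 54-year-old man with hypertension presents for follow-up... "
               "Which one of the following is the most appropriate next step? (A)... (B)...",
    output_indicator="Reply with the single best answer letter and a brief explanation.",
)
```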
Both humans and AI can make errors when answering questions. These errors can be classified into three categories: logical, informational, or explicit fallacy (6). This classification helps explain why the AI struggles to ascertain the correct answer and would allow comparison with humans if such data were collected.
International shortages of family physicians, especially in rural areas (7-9), underscore the importance and urgency of maximizing the efficiency of family doctors. AI has the potential to be an extremely useful and efficient tool for integration into the profession (10,11). However, before any integration of AI into patient care is possible, it must be demonstrated to function in collaboration with human input and to provide accurate and reliable information that can help reduce physician error.
This research is predicated on the hypothesis that the AI's performance may significantly improve when it is provided with comprehensive preparatory material and uses sophisticated data analysis functions.

Research Questions
1. Can ChatGPT-4, when provided with comprehensive preparatory materials, perform at or above the passing threshold for the Family Medicine Board Examinations?
2. Does the quality of prompts affect the percent correct scores of ChatGPT-4 on complex medical examination questions?
3. What are the limitations of ChatGPT-4's data analysis functions when applied to medical knowledge assessment, and how can these be mitigated?

Creation and Programming of AI Family Medicine Board Exam Taker
The specialized artificial intelligence, named "AI Family Medicine Exam Expert" (Appendix 1), a custom version of ChatGPT, was built specifically to take the American Family Medicine Board Exam. It was programmed with the following instructions and capabilities:

1. Instruction: The AI model, ChatGPT-4 "AI Family Medicine Exam Expert", was programmed to operate under a specific set of instructions designed to guide its behavior toward producing outputs relevant to the American Family Medicine Board Exam. These instructions mandated the model to prioritize information from three key textbooks: "Textbook of Family Medicine" (Ninth Edition) by Rakel and Rakel (12), "The Family Medicine Board Review Book" by Baldor (13), and "Family Medicine: A Practical Approach" (Second Edition) by Al-Gelban, Al-Khaldi, and Diab (14). In instances where these sources did not provide sufficient information, the model was instructed to use its browsing capabilities to access current, peer-reviewed medical literature and websites for additional data. The instruction set explicitly directed the AI to provide answers with clear explanations, referencing either the textbooks, internet sources, or its built-in medical knowledge. In cases where neither the textbooks nor the internet provided a definitive answer, the AI was directed to apply its medical knowledge to give the best possible educated guess.

2. Context: To enhance the model's performance, additional context was provided. This included the latest guidelines and protocols in family medicine, updates in medical research, and changes in examination formats and criteria. The model was also informed about the typical structure of board examination questions, encompassing multiple-choice questions, case studies, and diagnostic interpretations. This contextual knowledge was crucial in enabling the AI to align its responses more closely with the expectations of the Family Medicine Board Examination.

3. Input Data: The input data consisted of a diverse set of questions from the AAFP's "Family Medicine Board Review Questions", modeled after past Family Medicine Board Examinations (15). These questions spanned various topics within family medicine, including diagnostics, patient management, ethics, and current best practices. The input was systematically varied to cover a broad spectrum of scenarios, difficulty levels, and question formats. Each question was presented to the AI model as a standalone task, ensuring that responses were generated independently, without influence from previous queries.

4. Output Indicator: The desired output was a selection from the multiple-choice answer options for each question. Incorrect answers were labeled by error type: Logical, Informational, or Explicit Fallacy. Logical errors involved problems in the AI's solving process; informational errors involved deriving answers from incorrect information or facts; and explicit fallacies covered errors that did not fall into either of the first two categories, including cases where the AI made an incorrect assumption.
This methodological framework was designed to rigorously evaluate the AI's capability to mimic the performance of a final-year Family Medicine Resident in answering board exam questions, providing a structured approach for assessing its effectiveness in this specific application.
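For illustration, one way to represent each graded response and the study's three error categories in code is sketched below. The field names and structure are assumptions made for this example, not the study's actual data schema.

```python
# Sketch of a per-question record using the study's three error categories.
# Field names are illustrative assumptions, not the study's actual schema.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class ErrorType(Enum):
    LOGICAL = "Logical"                    # flaw in the solving process
    INFORMATIONAL = "Informational"        # answer derived from incorrect facts
    EXPLICIT_FALLACY = "Explicit Fallacy"  # incorrect assumption or other error

@dataclass
class GradedResponse:
    question_id: int
    chosen_option: str                      # e.g. "C"
    correct_option: str
    error_type: Optional[ErrorType] = None  # None when the answer is correct

    @property
    def is_correct(self) -> bool:
        return self.chosen_option == self.correct_option
```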

Operational Procedure
The AI was presented with a series of questions from the AAFP's Family Medicine Board Review Questions, encompassing a broad range of topics pertinent to family medicine. For each question, the AI used its primary knowledge sources, browsing capabilities, and medical understanding to formulate answers. The responses were then recorded in an Excel sheet for analysis. All questions were entered into the ChatGPT-4 Default Version and the Custom Version exactly as they appeared on the AAFP practice tests.
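The study recorded responses in an Excel sheet; as a self-contained illustration, the snippet below logs paired responses from both versions to a CSV file. The column names and sample rows are hypothetical.

```python
# Illustrative sketch: logging paired responses from both versions.
# The study used an Excel sheet; CSV is shown for a self-contained example,
# and the sample rows are hypothetical.
import csv

rows = [
    # (question_id, default_version_answer, custom_version_answer, correct_answer)
    (1, "B", "B", "B"),
    (2, "A", "C", "C"),
]

with open("responses.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question_id", "default_version", "custom_version", "correct"])
    writer.writerows(rows)
```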

Data Analysis
The AI's responses were evaluated against the correct answers given in the AAFP's Family Medicine Board Review Questions. The minimum passing threshold for the 2009 certification examination was a scaled score of 390, corresponding to a percent correct of 57.7% to 61.0% (1,16).
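As a simple worked check against the published threshold, the snippet below compares a percent-correct score with the upper end of the reported passing range (57.7% to 61.0%); the count shown is the one implied by the Results (88.67% of 300 questions).

```python
# Worked check of a score against the reported passing range (1,16).
n_correct, n_total = 266, 300            # Custom Robot version (88.67% correct)
percent_correct = 100 * n_correct / n_total
passing_upper = 61.0                     # top of the 57.7%-61.0% passing range
print(f"{percent_correct:.2f}% correct; "
      f"clears the {passing_upper}% threshold: {percent_correct >= passing_upper}")
```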

Ethical Considerations
As an observational study involving an AI system, this work included no human or animal subjects, thus minimizing ethical concerns. However, the study was conducted with an emphasis on the responsible use of AI in medical education and exam preparation, adhering to ethical standards in educational research. Ethical approval was not required for this study.

Statistical Analysis
In this investigation, we evaluated the performance of two language model versions, ChatGPT-4 Custom Robot and ChatGPT-4 Regular, by comparing their responses to a set of 300 questions on a question-by-question basis. We estimated the percentage of correct responses for each version and calculated 95% confidence intervals (CIs) using the normal approximation method to assess the precision of these estimates.
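The confidence intervals can be reproduced with statsmodels' normal-approximation interval, as sketched below using the correct-response counts implied by the reported percentages (266/300 and 262/300).

```python
# Sketch: 95% CIs for each version via the normal approximation (statsmodels).
from statsmodels.stats.proportion import proportion_confint

for label, correct in [("Custom Robot", 266), ("Regular", 262)]:
    low, high = proportion_confint(count=correct, nobs=300,
                                   alpha=0.05, method="normal")
    print(f"{label}: {100 * correct / 300:.2f}% "
          f"(95% CI {100 * low:.2f}% to {100 * high:.2f}%)")
```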
Given the paired nature of our data, we applied the McNemar test to assess the difference in performance between the two versions in terms of correct/incorrect responses. This test is particularly suited for paired categorical data and provides a robust comparison of the two versions' accuracy. The results of the McNemar test indicated no statistically significant difference in performance, suggesting that the accuracy of the two versions is statistically similar. Additionally, we conducted a Chi-square test to compare the distribution of error types (Logical, Informational, Explicit Fallacy) between the two versions. This test aimed to identify significant variations in error patterns. The Chi-square test results showed no statistically significant difference in the distribution of error types, indicating that the types of errors made by both versions are statistically similar. All statistical analyses were conducted using Python (version 3.8), employing the statsmodels and numpy libraries for statistical computations and data handling. This comprehensive approach allowed for a nuanced comparison of the ChatGPT-4 Custom Robot and ChatGPT-4 Regular versions, providing insights into their accuracies and error tendencies.
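A sketch of both tests follows. The paired 2x2 table uses discordant counts (10 and 6) chosen to be consistent with the reported totals (266 vs 262 correct) and the reported McNemar P value, and the error-type table contains placeholder counts; neither is the study's raw data. The chi-square test is shown via scipy for brevity, whereas the paper's computations used statsmodels and numpy.

```python
# Sketch of the McNemar and Chi-square tests. The 2x2 counts are consistent
# with the reported totals and P value, but the raw data are not published
# here; the error-type counts are placeholders.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar
from scipy.stats import chi2_contingency

# Rows: Custom Robot correct/incorrect; columns: Regular correct/incorrect.
paired = np.array([[256, 10],
                   [  6, 28]])
print(mcnemar(paired, exact=False))   # continuity-corrected chi-square McNemar

# Error counts (Logical, Informational, Explicit Fallacy) per version.
errors = np.array([[12, 14,  8],      # Custom Robot (34 incorrect, placeholder)
                   [10, 15, 13]])     # Regular (38 incorrect, placeholder)
chi2, p, dof, _ = chi2_contingency(errors)
print(f"Chi-square = {chi2:.3f}, dof = {dof}, p = {p:.4f}")
```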

Results
Compared Performance
On the 300 practice questions, the Custom Robot version answered 88.67% correctly (95% CI 85.08% to 92.25%) and the Regular version answered 87.33% correctly (95% CI 83.57% to 91.10%) (Table 1).

Error Type Analysis
Distribution of Error Types: The distribution of error types across the two versions was evaluated using a Chi-square test, with errors categorized as Logical, Informational, or Explicit Fallacy. The test resulted in a p-value of 0.32.

Statistical Significance
The McNemar test, applied to assess the significance of the difference in performance between the two versions, yielded a p-value of 0.45.

Principal Results
The accuracy assessment, based on the McNemar test, suggested that the observed differences in correct response rates between the Custom Robot and Regular versions are not statistically significant, implying comparable accuracy. The error type analysis likewise indicated no statistically significant difference in the distribution of error types between the two versions.

Evaluation Outcomes
The lack of a significant difference in performance indicates that the targeted prompts and resources given to the Custom Robot, "AI Family Medicine Exam Expert", produced at most a modest, statistically nonsignificant improvement in ChatGPT-4's performance. However, the accuracy rates of both versions indicate a passing level of proficiency in understanding and responding to the complex medical scenarios presented in the exam questions (1,16). This observation aligns with previous research showing that large language models like ChatGPT can perform at or near passing thresholds in medical examinations without specialized training or reinforcement, as demonstrated in a study on the United States Medical Licensing Exam (USMLE) (17).

Implications for AI Performance
The lack of significant variation in error types shows that both versions of ChatGPT-4 exhibit similar patterns in processing and interpreting medical information. This finding is important because it underscores the AI's consistent performance across different configurations, regardless of the resources and prompts provided.

Limitations
One key limitation of our study is the custom model's reliance on textbooks, which may not fully capture the nuanced and evolving nature of medical knowledge.
Given the static nature of the AI's textbook knowledge base, which does not account for rapid advancements in medical research and practice, we hypothesize that the Custom Robot had to depend on its dynamic browsing capabilities to stay current with medical knowledge and guidelines when answering the questions. Because this ability was shared by both the Custom and Regular versions, it may explain the lack of significant improvement for the textbook-resourced Custom Robot.

Comparison with Prior Work
Comparing our findings with prior work, we observe a progression in the capabilities of AI models in medical knowledge assessment for family medicine board exams. Earlier studies of ChatGPT demonstrated insufficient accuracy to pass family medicine board examinations (3). In contrast, our study showed that the Custom and Regular versions of ChatGPT-4 achieved passing marks of 88.67% and 87.33%, respectively, suggesting the potential of AI as a resource in medical education and clinical decision-making.

Conclusions
Our study provides compelling evidence that ChatGPT-4, in both its Custom Robot and Regular versions, exhibits a high level of proficiency in tackling the complex questions typical of the Family Medicine Board Examinations. The performance of these AI models, with correct response rates of 88.67% and 87.33%, respectively, demonstrates their potential utility as reliable study material in medical education and examination preparation.
Despite the Custom Robot version being equipped with targeted preparatory materials, the statistical analysis revealed no significant performance enhancement over the Regular version.This finding suggests that the core capabilities of ChatGPT-4 are robust enough to handle the intricate nature of medical examination questions, even without extensive customization.
The similarity in error types between the two versions underscores a consistent performance characteristic of ChatGPT-4, regardless of its programming nuances. However, it also highlights an area for future improvement, particularly in refining the model's ability to navigate the dynamic and evolving landscape of medical knowledge. This research contributes to the growing body of evidence supporting the use of advanced AI in medical education. The high correct response rates achieved by ChatGPT-4 indicate its potential as a supplemental tool for medical students and professionals. Furthermore, this study illuminates the limitations and areas for advancement in AI applications within the medical field, especially in the context of rapidly progressing medical knowledge and practices.
In conclusion, while the integration of AI like ChatGPT-4 into clinical practice and education shows promising prospects, it is crucial to continue exploring its capabilities, limitations, and ethical implications. The evolution of AI in medicine demands ongoing evaluation and adaptation to ensure it complements and enhances, rather than replaces, human expertise in healthcare.

Table 1. Summary of statistical analysis comparing the two versions of ChatGPT-4

Version         Correct responses     95% CI
Custom Robot    88.67% (266/300)      85.08% to 92.25%
Regular         87.33% (262/300)      83.57% to 91.10%

McNemar test (accuracy): P=0.4533. Chi-square test (error type distribution): P=0.3163.