Enhancing Medical Interview Skills through AI Simulated Patient Interactions: Non-Randomized Controlled Trial

Background: Medical interviewing is a critical skill in clinical practice, yet opportunities for practical training are limited in Japanese medical schools, making countermeasures urgent. Given advancements in AI technology, its application in the medical field is expanding; however, reports on its use for medical interview training in medical education are scarce. Objective: This study aimed to investigate whether medical students' interview skills could be improved by engaging with AI-simulated patients powered by large language models (LLMs), including the provision of feedback. Methods: A simulation program using LLMs was provided to 35 fourth-year medical students in Japan in 2023. As a control group, 110 fourth-year medical students from 2022 who did not participate in the intervention were selected. The primary outcome was the medical interviewing score on the pre-Clinical Clerkship Objective Structured Clinical Examination (pre-CC OSCE), a national standardized clinical skills examination. Secondary outcomes included surveys such as the Simulation-Based Training Quality Assurance Tool (SBT-QA10). Results: The AI intervention group scored significantly higher on medical interviews than the control group (AI group vs. control group: median 28.0 vs. 27.0, p = 0.01). There was a trend of inverse correlation between SBT-QA10 and pre-CC OSCE scores (regression coefficients -2.0 to -2.1). No significant safety concerns were observed. Conclusions: Medical interview education using AI-simulated patients demonstrated safety and a certain level of educational effectiveness, suggesting promising prospects for multifaceted applications in medical education.



Introduction
Medical interviews play a crucial role not only in the diagnostic process but also in building trust and rapport with patients (1). Medical interviewing skills are a necessary competency in medical practice and are categorized in the Japanese Model Core Curriculum for Medical Education under "Comprehensive Patient and Community Perspective" and "Clinical Competencies for Patient Care" (2). In Japan, the Pre-Clinical Clerkship Objective Structured Clinical Examination (pre-CC OSCE), provided by the Common Achievement Tests Organization (a public interest incorporated association), assesses fourth-year medical students for their competence and aptitude to participate in clinical clerkships (3). This examination evaluates basic clinical skills, including medical interviewing. Upon passing, medical students are expected to acquire, through participatory clinical clerkships before graduation, the skill to conduct medical interviews with proper communication and to gather the necessary information. The standard practice method involves learning medical interviewing in lectures, followed by practice sessions under the supervision of instructors and simulated patients (4).
However, opportunities for Japanese medical students to practice medical interviewing within the medical education curriculum are limited (5). Simulation education is effective for learners across many fields, not just medical interviewing, but opportunities to utilize such education are limited in terms of both location and time (6). Additionally, from educators' perspective, introducing medical interview education through simulation faces numerous barriers, including a lack of tutors, staff, and simulated patients (including mannequins), as well as budget constraints (7).
Since the release of ChatGPT by OpenAI in the fall of 2022 (8), generative artificial intelligence (AI) technologies such as large language models (LLMs) have evolved rapidly and been applied across various fields. In the medical domain, their integration is being considered in both clinical and research contexts (9). One study showed that LLMs can accurately answer questions from the United States Medical Licensing Examination (USMLE), demonstrating their utility in medical education and assessment (10). The COVID-19 pandemic accelerated the digital transformation from traditional bedside teaching to simulation education, including research into remote education models using chatbots (11,12). However, research integrating LLMs into simulation education remains in its developmental phase (13).
In the field of medical interviewing, a survey of 3,018 medical students revealed mixed feelings regarding the integration of LLMs. While some expressed concerns that it might deteriorate the patient-physician relationship, others were hopeful about the potential of AI technology in education, recognizing its dual value (14). LLMs, which are distinct from previous deep learning-based algorithms, predict the likelihood of a sequence of words based on the context of the preceding words; natural and meaningful language sequences can be generated by learning from sufficient textual data. This capability led us to consider their application in practicing medical interviews.
In response to new advances in artificial intelligence technology and the ongoing digital transformation, and to alleviate the lack of educational resources for medical interview training, our team designed a simulation program to improve students' medical interview skills. This program utilizes the generative pretrained transformer (GPT)-4 Turbo to fulfill two roles: simulated patient and instructor providing feedback. To assess the educational impact of AI-assisted medical interview training on novice learners, specifically fourth-year medical students, we compared scores from the pre-CC OSCE clinical skills examination between the control group, which practiced medical interviews only through traditional methods under the supervision of simulated patients and instructors, and the AI group, which received additional training through AI-simulated patient interviews.

Recruitment
As of November 2023, 35 fourth-year medical students at a national university in Japan who consented to participate in this educational research, approved by the Institutional Review Board (IRB No. 2312-006), and had completed medical interview practice using our AI-simulated patient were designated as the intervention group (AI group, n = 35). Fourth-year medical students at Okayama University as of November 2022 who had received only the traditional educational program and did not participate in the intervention were selected as the control group (control group, n = 110). The practice period was set to one month, and students were provided with an educational environment that allowed them to practice at any time using their own laptops or smartphones.

Educational Platforms
The responses of the AI-simulated patients were powered by GPT-4 Turbo, released in November 2023. We integrated it through an API with the service "miibo" (miibo Corporation), which allows conversations with a specified generative AI in a chat format. In this service, learners cannot see the prompts but can interact with fixed texts, such as case selections and questionnaires that do not involve AI, and choose from options and branch scenarios. While miibo is accessible via a web browser, it was also linked with LINE (LINE Corporation), which is widely used among students in Japan, for enhanced usability, allowing students to practice medical interviews via LINE as well. Learners could conduct interviews in chat format on either platform.
The GPT prompts were primarily composed of three elements: basic structure, case information, and feedback. The basic structure designated the GPT to act as the simulated patient and the learner as the physician practicing medical interviewing, with emotional parameters fluctuating in response to the physician's statements. All outputs were in Japanese. The emotional parameters were set from 1 to 10 for eight emotions (joy, sadness, anticipation, surprise, fear, disgust, trust, and anger) based on Ekman's theory (15) and Robert Plutchik's work (16). Initially, we loaded the case information into ChatGPT-4, ran a common prompt three times to estimate the initial emotional parameters, and set the average values. Case information included basic patient details, such as name, age, date of birth, and sex, along with relevant medical history. We prepared cases based on eight primary symptoms (chest pain, abdominal pain, cough, heartburn, fatigue, fever, dizziness, and shortness of breath), which were developed and revised by multiple specialists. The feedback prompt was designed to provide feedback on general communication skills, elicitation of medically important information, and changes in patient emotions based on the conversation logs after the start of the medical interview. An example of a GPT prompt set on miibo is shown in Supplementary Text 1.
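The averaging step for the initial emotional parameters can be sketched as follows. This is a minimal illustration, not the study's actual implementation: the eight emotions and the 1-10 scale come from the text above, while the function name, the dictionary format, and the example values are assumptions.

```python
# Sketch of the initial emotion-parameter setup: the eight emotions and the
# 1-10 scale follow the description above; the data format and the values
# in `runs` are illustrative assumptions.
EMOTIONS = ["joy", "sadness", "anticipation", "surprise",
            "fear", "disgust", "trust", "anger"]

def initial_parameters(runs):
    """Average several model-estimated parameter sets (one dict per run)."""
    return {
        emotion: round(sum(run[emotion] for run in runs) / len(runs), 1)
        for emotion in EMOTIONS
    }

# Three hypothetical estimation runs for one case (the study averaged three).
runs = [
    {"joy": 2, "sadness": 6, "anticipation": 5, "surprise": 4,
     "fear": 7, "disgust": 2, "trust": 5, "anger": 2},
    {"joy": 3, "sadness": 5, "anticipation": 6, "surprise": 3,
     "fear": 8, "disgust": 1, "trust": 4, "anger": 2},
    {"joy": 2, "sadness": 7, "anticipation": 5, "surprise": 5,
     "fear": 6, "disgust": 2, "trust": 6, "anger": 2},
]
params = initial_parameters(runs)
```

Averaging over several runs smooths out the run-to-run variability of the model's estimates before the values are fixed in the case prompt.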
Consenting students could access the miibo platform page or a dedicated LINE account via a specified URL, where they entered their name and selected a case. After case selection, they were presented with a scenario starting with the patient entering the consultation room and initiating a greeting, marking the beginning of the medical interview. The conversations were primarily text-based, although voice input was also possible. After completing the medical interview practice, the session could be ended by clicking a button on the screen labeled "End medical interview" or by declaring it, followed by a transition to feedback within the miibo scenario. After the feedback, the conversation log was deleted, and the session proceeded to a questionnaire. After completing the questionnaire, participants were redirected to the case selection section, allowing them to repeat the practice of medical interviews (Figure 1).
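The session flow just described (name entry, case selection, interview, feedback, questionnaire, and back to case selection) can be modeled as a simple state machine. The sketch below is illustrative only; the state names are our own and are not identifiers from the miibo platform.

```python
# Illustrative state machine for the practice-session flow; state names
# are assumptions, not actual platform identifiers.
TRANSITIONS = {
    "name_entry": "case_selection",
    "case_selection": "interview",
    "interview": "feedback",            # after "End medical interview"
    "feedback": "questionnaire",        # conversation log deleted here
    "questionnaire": "case_selection",  # loop back for repeated practice
}

def run_session(start, steps):
    """Follow the transition table for a given number of steps."""
    state = start
    visited = [state]
    for _ in range(steps):
        state = TRANSITIONS[state]
        visited.append(state)
    return visited

path = run_session("name_entry", 5)
```

Because the questionnaire state loops back to case selection, a student can cycle through as many practice cases as desired within the one-month period.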

Questionnaire
After completing each case, the questionnaire asked participants to rate the difficulty of the case on a 5-point scale and to assess the realism of the AI-simulated patient, the sense of presence (interaction through emotions with the AI-simulated patient), and their levels of tension and anxiety on a 10-point scale. Participants were also asked to provide open-ended feedback on what they found good and bad about the experience. After completing the first practice session, students were asked to complete a questionnaire based on the Simulation-Based Training Quality Assurance Tool (SBT-QA10, pre-questionnaire) (17) to evaluate the quality of the simulation training program. This questionnaire was administered again at the end of the study (post-questionnaire), along with open-ended feedback on the overall positive and negative experiences throughout the study.

Statistical Analysis
The primary outcome measure was the scores related to medical interviewing in the pre-CC OSCE.
The pre-CC OSCE consisted of two evaluation formats: an overall performance evaluation (summary evaluation) and a score assessment based on individual skills according to a checklist (total score evaluation), both of which were targeted for assessment. As secondary outcome measures, we evaluated the SBT-QA10 and post-case practice questionnaires, specifically assessing the difficulty of the case, the realism of the simulated patient, interaction through emotions, and levels of anxiety and tension. The conversation logs from each practice session were also reviewed.
Statistical analysis was performed using Prism 9 for macOS (Version 9.5.1). The scores from the pre-CC OSCE were treated as interval data. In addition to the open-ended responses, the questionnaire utilized a Likert scale. The Mann-Whitney U test was used for comparisons between two unrelated groups. Fisher's exact test was applied to compare sex ratios, and Student's t test was used to compare group backgrounds based on GPA. Multiple regression analysis was conducted with the pre-CC OSCE scores as the dependent variable and the questionnaire items as independent variables. Missing values in the questionnaire items were excluded from the analysis.
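For readers unfamiliar with the Mann-Whitney U test used for the group comparisons, its statistic can be computed by counting pairwise "wins" between samples, with ties counted as half. The sketch below uses synthetic scores, not the study data, and omits the p-value computation that Prism performs.

```python
# Pure-Python Mann-Whitney U statistic (ties counted as 0.5).
# The score lists are synthetic examples, not the study's data.
def mann_whitney_u(x, y):
    """U statistic for sample x versus sample y."""
    u = sum(1.0 for xi in x for yj in y if xi > yj)
    u += sum(0.5 for xi in x for yj in y if xi == yj)
    return u

ai_scores = [28, 29, 27, 30, 28]        # hypothetical total-score values
control_scores = [27, 26, 28, 25, 27]
u = mann_whitney_u(ai_scores, control_scores)
# The U statistics for the two directions always sum to len(x) * len(y).
```

Because the test is based on ranks rather than raw values, it is appropriate for the ordinal, non-normally distributed examination scores compared here.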

Results
Ultimately, the AI group that received LLM-based simulation education consisted of 35 of the 87 students who had consented to participate in this study, whereas the control group comprised 110 students, all of whom had the opportunity to decline participation but none chose to do so.
No significant differences were observed between the AI and control groups in age, sex, or GPA of medically related subjects (Table 1). Regarding the medical interview practice, an abbreviated version of a representative conversation log and AI feedback, translated from Japanese to English, is shown in Table 2.
Regarding the evaluation of educational effects, when comparing the scores for medical interviews in the pre-CC OSCE, the AI group scored significantly higher than the control group in both the summary evaluation (AI vs. control: median 5.0 vs. 4.5, two-tailed, p = 0.007) and the total score evaluation (AI vs. control: median 28.0 vs. 27.0, two-tailed, p = 0.01) (Figure 2).
The questionnaire results for each case regarding the realism of the AI-simulated patient, interaction through emotions, levels of anxiety and tension, and difficulty of the case are shown in Figure 3.
The responses regarding the AI-simulated patients' reproducibility and interaction through emotions remained stable throughout, with median scores ranging from 7 to 9 for reproducibility and from 7 to 8 for emotional interaction. Regarding the levels of anxiety and tension, participants experienced them to some degree, but without significant stress. Lastly, for case difficulty in the first instance of a case (n = 32), 75% of the responses indicated it was "appropriate," 19% found it "difficult," 3% each considered it "easy" and "very easy," and 0% found it "very difficult." The response "appropriate" was the most common throughout the entire training period, ranging from 50% to 100%.
The scores for the SBT-QA10 in both the pre- and post-questionnaires were relatively high across all items, with median scores ranging between 4 and 5 (Table 3). No significant changes were observed in any item from the pre-questionnaire to the post-questionnaire.
Next, we evaluated the group that received AI education to determine which subgroups achieved higher scores on the pre-CC OSCE. Given the high correlation coefficient of 0.75 between the total score evaluation and the summary evaluation of the pre-CC OSCE, and considering multicollinearity, we focused solely on the total score evaluation for further analysis, incorporating various questionnaire items, GPA, and age in a multiple regression analysis. Among these, a consistent trend was observed with the SBT-QA10, where many items showed a negative correlation with pre-CC OSCE scores. Specifically, the item "I felt part of the team" showed this trend with statistical significance in both the pre- (coefficient -1.8, standard error 0.77, p = 0.047, R-squared 0.41) and post- (coefficient -3.2, standard error 0.54, p = 0.0004, R-squared 0.81) evaluations (Figure 4). When analyzing the total scores of each item in relation to pre-CC OSCE scores to illustrate the overall trend, a negative correlation was observed; however, none were statistically significant. The analysis of the combined pre- and post-SBT-QA10 scores, including the item "I felt part of the team," is presented in Table 4.
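To make the sign of the reported coefficients concrete, a single-predictor least-squares slope can be computed as below. The numbers are synthetic, chosen only to reproduce an inverse relationship like the one reported; the study itself ran a multiple regression on the actual questionnaire and examination data.

```python
# Least-squares slope for one predictor; synthetic data illustrating a
# negative association, not the study's actual values.
def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sum((xi - mx) ** 2 for xi in x)
    return num / den

item_scores = [5, 4, 5, 3, 4]          # hypothetical SBT-QA10 item ratings
osce_scores = [26, 28, 25, 30, 29]     # hypothetical total-score evaluations
slope = ols_slope(item_scores, osce_scores)
# A negative slope means higher item ratings pair with lower OSCE scores here.
```

A negative coefficient, as in this toy example, is what the inverse correlation between SBT-QA10 items and pre-CC OSCE scores looks like numerically.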
The AI group submitted responses that detailed the advantages and disadvantages of the simulation system. The positives were grouped into four categories: "Practicality of training with AI," "Convenience and accessibility," "Increased confidence through practice," and "Educational value and skill improvement." The negatives were consolidated into three categories: "Dialogue and communication with AI," "Technical and functional aspects," and "Comparison of medical interviews with pre-CC OSCE" (Table 5).

Evaluation Outcomes
Table 5. Open-ended feedback from students.

Good points
1. Practicality of training with AI: I was able to practice with an emotional AI in a realistic manner, just like interacting with an actual patient.
2. Convenience and accessibility: I was able to practice alone (without a supervisor). Training was available on an easy-to-use platform such as LINE. Flexibility in time and frequency of practice.
3. Increased confidence through practice: Understood the flow of a medical interview. Learned important questions for clinical sites.
4. Educational value and skill improvement: The participants were able to gain practical experience in medical interviewing. The feedback helped me to improve my skills. Learned what to ask in various clinical situations.

Bad points
1. Dialogue and communication with AI: Unnatural responses (e.g., "I'm worried" all the time). Instability of feedback.
2. Technical and functional aspects: Typing and response time are issues. System errors sometimes occurred. Inconvenience due to system operation (e.g., incorrectly sent messages). Voice input should be used.
3. Comparison of medical interviews with pre-CC OSCE: There were differences in the information provided to simulated patients in the pre-CC OSCE. I felt the superiority of AI in conversation. There were some cases where the AI's casual way of speaking made it difficult to concentrate.

Discussion
This study is the first to quantitatively verify the effectiveness of entrusting all aspects of medical interview education to AI, from acting as the simulated patient to providing feedback as the evaluator. The AI group, for which medical interview practice with LLM-based simulated patients was added to traditional medical interview education with simulated human patients, scored higher on the pre-CC OSCE medical interviews than the control group, which practiced only with simulated human patients (4). As previously reported, medical students do not resist the use of AI in medical education (18), which was evident in this study. The educational style of this study, which allowed students to practice using their own smartphones and PCs, enabled them to practice repeatedly at their convenience, as mentioned in the open-ended feedback. This measure not only improved medical interview skills but also reduced anxiety due to a lack of practice and enhanced self-efficacy, suggesting a positive impact on the examination results. Although the two groups were from different academic years and might have confounding background factors, basic information such as GPA remained consistent between the groups. This educational method supported by LLMs has the potential to reduce financial and time costs for instructors and simulated patients. This study demonstrates that incorporating this method can effectively supplement the existing shortcomings in medical interview education, thus proving beneficial.

Association between Pre-CC OSCE scores and AI educational interventions
When exploring which subgroups within the AI group tended to score higher on the pre-CC OSCE, an inverse correlation with SBT-QA10 scores was found. Educators use the SBT-QA10 to understand the various perceptions experienced by learners during simulation education. High SBT-QA10 scores are generally thought to reflect positive experiences during simulations, leading to subsequent learning. The overall trend of high scores in this study suggests that the training had a positive impact on learners. However, the subgroup analysis revealed results that contradict this implication. Unlike traditional simulation education with human-simulated patients, simulations conducted on one's smartphone or laptop allow for learning in a mentally safe state, potentially resulting in effortless learning within students' comfort zone, thereby diminishing its effectiveness (19). Conversely, for students who felt challenged, this may have created a learning zone that enhanced the learning effect. Additionally, the SBT-QA10 is based on research in Western cultures, and this study, targeting learners in a Japanese cultural context, may require a different interpretation. People from Asian cultures have been reported to be stricter in self-evaluations, and this cultural difference may have influenced the results significantly (20,21). It is therefore important to adjust the learning environment, such as the difficulty level of cases, while constantly checking learner feedback and educational outcomes, because a good learning environment can vary among learners.

Fabrications by LLMs
Although concerns about fabrication by LLMs have been raised in various contexts (22), its occurrence in this study was limited, and no expressions deviating from the case settings were observed. During the alpha-testing phase with GPT-3.5 Turbo, fabrications were somewhat common, especially instances in which the AI began playing the role of the doctor instead of the simulated patient early in the conversation. Although modifications to the prompts somewhat mitigated this issue with GPT-3.5 Turbo, the change to GPT-4 and GPT-4 Turbo reduced fabrications significantly, to a practical level (23).
The behavioral anomalies of the AI in this study can be summarized as follows. The first concerns violations related to public order and morals based on OpenAI's guidelines. Upon analyzing the conversation logs, it was evident that the students' inputs did not contain any issues, indicating that the observed discrepancies were due to inaccuracies in the AI output. As the students were informed in advance about the possibility of such errors, they could continue their medical interviews by starting another consultation, preventing this from becoming a significant issue. The second point is related to fabrication in the feedback. For instance, despite the student confirming the patient's date of birth and name, there were a few cases in which the feedback suggested that these had not been confirmed. This issue was thought to be caused by the prompts treating "confirming the patient's date of birth and name" as a single continuous stream of information, and it was resolved by breaking the information down into separate elements. While prompt adjustments could improve some aspects, the specifications of GPT, which only allow reference to a limited context window and cap the amount of conversation that can be stored, are also considered contributing factors (24).

Safety
No excessive tension or anxiety associated with learning was observed during the simulations. Furthermore, responses from the GPT throughout the study period did not contain any statements that could harm learners' safety, and no students reported such concerns.

Limitations
This study was conducted with voluntary participation in educational research without employing more desirable intervention methods, such as a randomized controlled trial. The emphasis was on equality of educational opportunities, preserving the opportunities for traditional practice with simulated patients. The study was carried out as a pilot study for the future use of LLMs in medical interview education. In addition, this study employed LLM-simulated patient interventions and evaluated their effectiveness through a simulation-based assessment, the pre-CC OSCE. However, reports suggest that qualitative improvements in simulators do not directly cause clinical skill enhancement, underscoring the importance of conducting clinical skill assessments in real-world settings as much as possible (25).

Plan
This study suggests the potential for a significant reduction in the workload of instructors and simulated patients in medical interview practice while maintaining educational effects for medical students. Furthermore, the introduction of LLM-simulated patients into clinical skill examinations such as the pre-CC OSCE is conceivable. It holds promise not only for educating young doctors but also for the lifelong education of doctors, including simulations for handling complex cases in clinical settings. However, when introducing LLM simulations into medical education, caution is necessary regarding ethical considerations and accuracy, as previously pointed out. Completely replacing traditional instructor-led training with AI carries risks, and further studies are required (13,18).
Improvements in prompts and the evolution of AI technology suggest that more realistic and accurate simulation education can be expected in the future. The integration of AI into medical education is inevitable; however, it has the potential to disrupt traditional medical education practices. Educators must remain vigilant regarding the potential positive and negative impacts of this integration (26). Concurrently, it is essential to continue research on AI-mediated medical education to explore its applicability and limitations.

Conclusions
Education on medical interviewing using LLM-simulated patients demonstrated superior educational effectiveness while maintaining safety.This platform holds promise for multifaceted applications in the field of medical education in the future.

Feedback
AI: Thank you for your effort in the medical interview practice. Now, I will provide feedback on the medical interview. ...

Table 3. Scores of the SBT-QA10 administered after the first session (pre) and at the pre-CC OSCE (post) are presented for each item (e.g., "I was not distracted by non-relevant objects and events during the scenario"). Before-and-after comparisons were analyzed using the Wilcoxon test. Questionnaire after the first session (n = 14); pre-CC OSCE (n = 24). SBT-QA10: Simulation-Based Training Quality Assurance Tool. IQR: Interquartile Range.

Figure 3. Questionnaire taken at the end of each case. A: AI-simulated patient reproducibility is rated on a scale of 1-10, with 10 indicating "Very High Accuracy" and 1 indicating "No Reproduction." B: Interaction through emotions is rated on a scale from 1 to 10, where 10 signifies "Very Effective" and 1 signifies "Not Effective at All." C: Anxiety and nervousness are rated on a scale from 1 to 10, with 10 indicating "Not Felt at All" and 1 meaning "Felt Very Strongly." D: The difficulty of the case is rated on a scale from 1 to 5, where 1 represents "Very Easy" and 5 represents "Very Difficult."

Figure 4. Multiple regression analysis of pre-CC OSCE and SBT-QA10. Multiple regression analysis of the "I felt part of the team" item in the SBT-QA10 questionnaire. A: pre, B: post. Pre-CC OSCE: Pre-Clinical Clerkship Objective Structured Clinical Examination. SBT-QA10: Simulation-Based Training Quality Assurance Tool.

Supplemental Text 1. Prompt for Medical Interview Practice. This supplemental material presents a structured prompt developed for GPT-4 Turbo, designed to facilitate medical interview practice. It is organized into three primary sections: Basic Structure, Case Information, and Feedback.

Table 1. Background. Fisher's exact test was used for the sex ratio, the Mann-Whitney U test for age, and Student's t test for the GPA analysis. GPA: Grade Point Average, SD: Standard Deviation, IQR: Interquartile Range.

Table 2. Representative communication log and feedback in medical interview practice (abbreviated; translated from Japanese).

Case Selection
AI: Please tell me your name.
Student: My name is (name).
AI: Please choose a scenario below: chest pain, abdominal pain, cough, heartburn, fatigue, fever, dizziness, and shortness of breath.
Student: [Cough]
AI: Then, Dr. (name), let's start the role-play from the point where the patient with [cough] enters the consultation room. First, you, as the doctor, should greet them.
...
Student: Could you point to where it hurts?
AI: Here, around the middle of my chest is especially painful. It throbs every time I cough ...
Student: Understood. So, it's a throbbing pain. Has the intensity of the pain changed since last night?
AI: No, the pain hasn't really changed since yesterday. But I feel like the cough and phlegm are getting worse.
Student: That makes sense. Have you been able to sleep well since the symptoms appeared?
AI: No, the cough is so bad at night that I can't sleep well. And that makes me very tired during the day ...
Student: I understand. Have you noticed any changes in your weight?
AI: I haven't weighed myself, so I can't say for sure, but my appetite has decreased, so I might have lost some weight.
Student: How about your bowel movements?
AI: No problems there. It's as usual.
Student: Thank you. I would like to ask a bit more about your university life. Is it busy?
AI: Yes, it's quite busy. There are many classes, and preparing for reports and tests is hard. But right now, I'm more worried about my health and can't concentrate much.
Student: That's understandable. Do you live alone?
AI: Yes, I live alone. So, even when I'm feeling like this, I have to manage on my own.
Student: Understood. Besides COVID-19, are there any other diseases you are worried about?
AI: Well, because I coughed up blood, I'm worried it might be some lung disease. Also, since my mother had cancer, that's a bit of a concern too.
Student: Understood. May I ask about any tests or treatments you are hoping for?
AI: Yes, I want to find out the cause of my condition and undergo any necessary tests. I want to get treated and return to my normal life as soon as possible.
Student: Is there anything that makes the symptoms better or worse?
AI: Hmm, nothing really makes it better, but when I take a deep breath or lie down, the cough gets stronger and the pain increases.
Student: Do you have any symptoms other than cough, phlegm, and chest pain?
AI: Yes, I also have a slight headache. And I feel extremely lethargic. Seeing the blood in my phlegm really shocked me ...
Student: Understood. Let me summarize our conversation. Please let me know if there's anything I've missed or you'd like to add. You started coughing two days ago, and last night you coughed up blood-mixed phlegm. In addition to the throbbing chest pain, you feel lethargic and likely have a fever. You want to undergo all necessary tests to address the symptoms. Is that correct?
AI: Yes, that's right. Also, I have a slight headache. I hope to find the cause soon and receive appropriate treatment.
Student: My apologies. Thank you. Now, we'll move on to the physical examination, so please wait in the outer waiting room. Thank you for today. Please take care.
AI: Understood. Thank you for the consultation. I appreciate your help.
Student: [Ends the medical interview]

Table 4. Multiple regression analysis for analyzing pre-CC OSCE scores and group characteristics.