Evaluating the Utility of a Large Language Model in Answering Common Patients’ Gastrointestinal Health-Related Questions: Are We There Yet?

Background and aims: Patients frequently have concerns about their disease and find it challenging to obtain accurate Information. OpenAI’s ChatGPT chatbot (ChatGPT) is a new large language model developed to provide answers to a wide range of questions in various fields. Our aim is to evaluate the performance of ChatGPT in answering patients’ questions regarding gastrointestinal health. Methods: To evaluate the performance of ChatGPT in answering patients’ questions, we used a representative sample of 110 real-life questions. The answers provided by ChatGPT were rated in consensus by three experienced gastroenterologists. The accuracy, clarity, and efficacy of the answers provided by ChatGPT were assessed. Results: ChatGPT was able to provide accurate and clear answers to patients’ questions in some cases, but not in others. For questions about treatments, the average accuracy, clarity, and efficacy scores (1 to 5) were 3.9 ± 0.8, 3.9 ± 0.9, and 3.3 ± 0.9, respectively. For symptoms questions, the average accuracy, clarity, and efficacy scores were 3.4 ± 0.8, 3.7 ± 0.7, and 3.2 ± 0.7, respectively. For diagnostic test questions, the average accuracy, clarity, and efficacy scores were 3.7 ± 1.7, 3.7 ± 1.8, and 3.5 ± 1.7, respectively. Conclusions: While ChatGPT has potential as a source of information, further development is needed. The quality of information is contingent upon the quality of the online information provided. These findings may be useful for healthcare providers and patients alike in understanding the capabilities and limitations of ChatGPT.


Introduction
Gastrointestinal (GI) complaints and symptoms account for approximately 10% of all general practice consultations [1,2] and are apparently very common in the general population. As healthcare providers specializing in this field, we are frequently called upon to answer a wide range of patients' gastrointestinal health questions. In recent years, large language models, such as OpenAI's recent release of the ChatGPT chatbot [3], have been developed to provide answers to a wide range of questions in various fields, including healthcare [4].
AI chatbots employ deep learning-based natural language processing (NLP) that evaluates natural human language input and replies accordingly in a conversational mode [5]. These models have the potential to provide patients with quick and easy access to accurate and reliable information about their gastrointestinal health with 24/7 availability, high cost-effectiveness, and potentially less bias based on patients' demographic characteristics such as gender, race, or age [6]. A recent systematic review assessing the value of AI chatbots in healthcare showed promising results in terms of the effectiveness provided [7].
The recent release of OpenAI's ChatGPT in November 2022 [3] has garnered immense popularity, as its technical abilities are believed to be superior to previous chatbot versions [8]. In a recent article [8], the chatbot was described as "astonishingly skilled at mimicking authentic writing" and was regarded as "So Good It Can Fool Humans". It was believed to have passed the ultimate" Turing test"-which states that a machine will be regarded as intelligent if its responses are indistinguishable from those given by a human comparator [9].
The effectiveness of large language models in answering patients' questions in the field of gastroenterology has not yet been thoroughly evaluated. In this paper, we aim to evaluate the performance of OpenAI's ChatGPT chatbot in answering patients' questions concerning gastrointestinal (GI) topics. This evaluation will focus on several key areas, including the accuracy and clarity of the information provided, the ability of the model to handle a wide range of questions, and the overall effectiveness of the model in addressing patients' concerns and providing them with the information they need to make informed decisions about their health.
Our evaluation aims to provide insights into the capabilities and limitations of large language models in answering patients' questions in various gastroenterology subjects. Ultimately, our goal is to contribute to the ongoing development of large language models and their use in the field of gastroenterology, with the aim of improving the quality of care and information available to patients.

Methods
To evaluate the performance of the newly-released large language model, OpenAI's ChatGPT chatbot, in answering patients' GI-related questions, we conducted a comprehensive study using a representative sample of real-life questions from patients in this field. The study was designed to assess the accuracy and clarity of the information provided by the chatbot, as well as its overall effectiveness in addressing patients' concerns and providing them with the information they need to make informed decisions about their health.
The study sample consisted of 110 real-life questions from patients in the field of gastroenterology, gathered from open internet sites providing medical information to diverse patients' questions. These questions were selected to cover a wide range of topics, including common symptoms, diagnostic tests, and treatments for various gastrointestinal conditions. The questions were selected to reflect the types of questions typically asked by patients in gastroenterology and to provide a representative sample of the types of questions that the chatbot would encounter in a real-world setting.
The questions were provided to the OpenAI chatbot and the answers were recorded. The answers provided by the chatbot were then assessed in consensus by three experienced gastroenterologists, each with more than 20 years of experience. All gastroenterologists work in a tertiary medical center, as well as in community clinics, and together cover all sub-specializations of gastroenterology: IBD experts, motility, hepatology, nutrition, and advanced endoscopy.

Statistical Analysis
Python (Ver. 3.9) was used for all statistical calculations. The average score ± standard deviation (SD) was calculated for each category and is shown in a table. A distribution graph of the answer grades for all groups by answer category was plotted. A Kruskal-Wallis one-way analysis of variance was used to determine the statistical difference between the groups.

Statistical Analysis
Python (Ver. 3.9) was used for all statistical calculations. The average score ± standard deviation (SD) was calculated for each category and is shown in a table. A distribution graph of the answer grades for all groups by answer category was plotted. A Kruskal-Wallis one-way analysis of variance was used to determine the statistical difference between the groups.

Statistical Analysis
Python (Ver. 3.9) was used for all statistical calculations. The average score ± standard deviation (SD) was calculated for each category and is shown in a table. A distribution graph of the answer grades for all groups by answer category was plotted. A Kruskal-Wallis one-way analysis of variance was used to determine the statistical difference between the groups.

Results
Overall, 110 various typical patient questions were presented to the OpenAI chatbot. Forty-two included various questions regarding treatment options in diverse gastroenterology fields. Questions were chosen to include the whole spectrum of specification from very specific questions (e.g., treatment for GI cancers according to the stage) to wider questions not referring to definite disease staging or severity (e.g., "what is the treatment for ulcerative colitis"?). All questions chosen were short and clear for better standardization. The results of the evaluation for the treatment questions are shown in Table 1.
A full list of all questions and answers is presented in the Supplementary Materials. Overall, the OpenAI chatbot answers to the questions dealing with treatment options varied wildly between the different topics and were inconsistent in their level. Notably, at the end of almost all the answers, the patient is referred to a doctor/healthcare provider for further evaluation.
The results of the evaluation for questions related to symptoms are shown in Table 2. In the answers to the symptom-related questions, the OpenAI chatbot shows the same pattern, and the majority of answers referred the patient to consult a physician.
The results of the evaluation for questions related to diagnostic examinations are shown in Table 3.
Diversity in the quality of answers to the diagnostic examination was the highest of all groups. While many answers were correct and informative, other medical terms were simply not recognized by the chatbot (Table 3). Figures 1-3 graphically display the distribution pattern of grades assigned to the responses provided by the OpenAI chatbot in relation to treatment, symptoms, and diagnosticrelated questions, respectively.
As shown graphically in Figures 1-3, the distribution of grades by answer group in the treatment-related and symptoms-related groups are similar and relatively consistent, while the distribution of the diagnostic-related group is wide and inconsistent.
Statistical difference between the groups was shown for the subjects of accuracy (p = 0.019) and up-to-date (p = 0.009), but not for the subjects of clarity (p = 0.071) and efficacy (p = 0.071).

Discussion
Overall, our study provides a comprehensive evaluation of the performance of a large language model, OpenAI's ChatGPT chatbot, in answering patients' questions in the field of gastroenterology. The results of the study will hopefully be valuable for healthcare providers and patients alike, as they will provide important insights into the capabilities and limitations of these models in providing accurate and reliable information about gastrointestinal health.
Generally, the results of our evaluation of OpenAI's ChatGPT large language model, as a potential tool for answering patients' GI-related questions, are concerning. Despite the impressive capabilities of this model in other applications, our study found that it was insufficient for reliably answering patients' questions in this domain.
In our study, we presented the model with a range of common gastroenterological questions and assessed its ability to provide accurate and helpful responses. While the model could generate responses for all questions, the accuracy and helpfulness of these responses varied significantly. In many cases, the model's responses were either incomplete or entirely incorrect, indicating a lack of understanding of the subject matter.
These findings are particularly concerning given the importance of accurate information in the medical field. Patients often rely on Internet information to make decisions about their health and well-being, and inaccurate information can have serious consequences. As such, it is essential that any tool used to provide medical information to patients be able to provide accurate and reliable responses.
Interestingly, while the answers to treatment-based questions and symptom-based questions showed a similar and relatively steady quality and effectivity (Tables 1 and 2, Figure 1), diagnostic-related results diverged between high performance to incorrect/missing data (Table 3, Figure 1). These results may improve with time as the OpenAI chatbot acquires more data.
Furthermore, our study also highlighted the limitations of OpenAI's ChatGPT chatbot in the medical domain. The ability to understand and accurately respond to complex medical questions requires a deep understanding of the subject matter, as well as the ability to process and integrate information from a variety of sources [10]. While ChatGPT is an impressive model with many capabilities, it is not yet advanced enough to fulfill this role. Notably, it should be acknowledged that the utilization of real-world questions in our study resulted in the inclusion of numerous acronyms, which could potentially pose challenges in web-based research settings. However, due to our intention to assess the effectiveness of ChatGPT in addressing authentic patient inquiries, we refrained from including the complete expressions, despite the potential difficulties associated with the acronyms.
Our results are in line with previous data assessing other artificial intelligence conversational agents in healthcare [6,7,11]. As a group, the bot's function was found to be relatively satisfactory and useful, with moderate effectiveness. It is, therefore, important to characterize specific strengths and weaknesses of the chatbot in order to maximize its benefits.
Recent data from the last months suggest that LLMs hold the potential for significant impacts in medicine [12,13].
In our previous work, we examined the utility of ChatGPT in identifying top research questions in gastroenterology. Our conclusion was that ChatGPT may be a useful tool for identifying research priorities in the field of GI, yet we noted the need to improve the originality of the research questions identified [12].
In telehealth, LLMs such as ChatGPT could improve efficiency and scalability, and provide unbiased care [14]. Despite its proficiency in generating differential diagnosis lists, ChatGPT does not surpass human physicians [15]. Moreover, LLMs can offer innovative methods in medical education [16]. Yet, it is crucial to address LLM's limitations to maintain accuracy and reliability in patient care. Furthermore, Rasmussen et al. [17] and Samaan et al. [18] evaluated the accuracy of ChatGPT in generating responses to typical questions related to vernal keratoconjunctivitis and bariatric surgery, respectively. Both found the model to be broadly accurate. Xie et al. [19] conducted a similar study using a simulated rhinoplasty consultation, and while they observed that ChatGPT provided easily comprehensible answers, it was less effective at providing detailed, personalized advice. Yeo et al. [20] tested ChatGPT on questions about cirrhosis and hepatocellular carcinoma, finding the AI capable but not comprehensive in its knowledge. Johnson et al. [21] investigated the accuracy of ChatGPT in generating responses to common cancer myths and misconceptions, concluding that it often provided accurate information. Lastly, a comprehensive pilot study on ChatGPT by Johnson et al. [22] discovered that its responses were largely accurate across a wide range of medical questions.
Significantly, it is worth mentioning that all of these studies have been published within the recent months subsequent to the initial release of the inaugural version of ChatGPT. This accumulation of research findings attests to the remarkable potential impact and significance that ChatGPT holds for the field of medicine in the foreseeable future.
Our study had a few limitations. First, the choice of questions naturally influenced the answers obtained by the OpenAI chatbot. However, the wide variety of questions and their relatively high number minimized the chance for bias. Furthermore, the answers were graded with the consensus of three physicians; there is a possibility that other physicians might have had a different point of view. Nevertheless, to minimize that risk, we chose highly experienced gastroenterologists with wide clinical experience and more than 70 years of practicing gastroenterology both in a tertiary medical center and in community practice. Lastly, there were no comparisons with other chatbots to gauge the relative merits of ChatGPT against other similar technologies, and no blinded comparison between human