Learning the Randleman Criteria in Refractive Surgery: Utilizing ChatGPT-3.5 Versus Internet Search Engine

Introduction: Large language models such as OpenAI's (San Francisco, CA) ChatGPT-3.5 hold immense potential to augment self-directed learning in medicine, but concerns have arisen regarding their accuracy in specialized fields. This study compares ChatGPT-3.5 with an internet search engine in their ability to define the Randleman criteria and its five parameters within a self-directed learning environment.

Methods: Twenty-three medical students gathered information on the Randleman criteria. Each student was allocated 10 minutes to interact with ChatGPT-3.5, followed by 10 minutes to search the internet independently. Each ChatGPT-3.5 conversation, student summary, and internet reference was subsequently analyzed for accuracy, efficiency, and reliability.

Results: ChatGPT-3.5 provided the correct definition for 26.1% of students (6/23, 95% CI: 12.3% to 46.8%), while an independent internet search resulted in sources containing the correct definition for 100% of students (23/23, 95% CI: 87.5% to 100%, p = 0.0001). ChatGPT-3.5 incorrectly identified the Randleman criteria as a corneal ectasia staging system for 17.4% of students (4/23), fabricated a "Randleman syndrome" for 4.3% of students (1/23), and gave no definition for 52.2% of students (12/23). When a definition was given (47.8%, 11/23), a median of two of the five correct parameters was provided, along with a median of two additional falsified parameters.

Conclusion: The internet search engine outperformed ChatGPT-3.5 in providing accurate and reliable information on the Randleman criteria. ChatGPT-3.5 gave false information, required excessive prompting, and propagated misunderstandings. Learners should exercise discernment when using ChatGPT-3.5. Future initiatives should evaluate the implementation of prompt engineering and updated large language models.


Introduction
Keeping up with the rapidly advancing field of refractive surgery can prove to be a challenge for entry-level learners. Residents and medical students hoping to engage in the field must identify reliable sources of information for self-directed learning [1]. The future of refractive surgery depends upon the ability to identify accurate, reliable, and efficient learning tools to educate the next generation of refractive surgeons.
Recent advances in artificial intelligence (AI) have introduced new avenues for medical education. An AI-powered large language model (LLM) can consolidate vast swaths of information into a singular, interactive chatbot [2]. Functioning as a personal tutor, an LLM can respond to follow-up questions, rephrase concepts, and generate unlimited practice cases. These novel capacities suggest that LLMs hold promise in augmenting self-directed medical learning. Therefore, steps must be taken to ensure that information provided by an LLM is accurate, reliable, and efficient within a self-directed learning environment.
The present study compares the performance of ChatGPT-3.5 and an internet search engine in providing information on the Randleman criteria. First introduced in 2008, the Randleman criteria represent a model for identifying patients at high risk of developing corneal ectasia following laser in situ keratomileusis (LASIK) surgery [13][14][15][16][17]. Its five parameters are corneal topography, residual stromal bed thickness, patient age, preoperative corneal thickness, and preoperative manifest refraction spherical equivalent (MRSE). As a well-defined, thoroughly discussed topic in refractive surgery, the Randleman criteria present an ideal challenge for ChatGPT-3.5.
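To make the cumulative, score-based logic of such a risk model concrete, the following Python sketch mimics a Randleman-style calculator. The point thresholds and risk categories shown are illustrative placeholders only, not the validated cut-offs; readers should take the published values from Randleman et al. [13].

```python
# Minimal sketch of a cumulative risk-score calculator in the spirit of the
# Randleman criteria. All thresholds and point values below are illustrative
# placeholders, NOT the published cut-offs from Randleman et al.

def ectasia_risk_score(topography_points: int,
                       rsb_um: float,
                       age_years: int,
                       ct_um: float,
                       mrse_d: float) -> str:
    """Sum per-parameter points and map the total to a risk category."""
    score = topography_points                 # graded from the topography pattern
    score += 0 if rsb_um >= 300 else 4        # residual stromal bed thickness (placeholder)
    score += 0 if age_years >= 30 else 2      # younger age confers more points (placeholder)
    score += 0 if ct_um >= 510 else 2         # thinner preoperative cornea (placeholder)
    score += 0 if mrse_d >= -8 else 2         # higher myopia; MRSE in diopters (placeholder)
    if score <= 2:
        return f"score {score}: low risk"
    if score == 3:
        return f"score {score}: moderate risk"
    return f"score {score}: high risk"

# Example: a hypothetical 24-year-old candidate with normal topography.
print(ectasia_risk_score(topography_points=0, rsb_um=310,
                         age_years=24, ct_um=520, mrse_d=-4.5))
```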
The study's primary outcome is the ability of ChatGPT-3.5 or an internet search engine to define the Randleman criteria and its five parameters within a self-directed learning environment. In addition, a qualitative analysis of the ChatGPT-3.5 conversation threads is provided, yielding valuable insights for learners of refractive surgery.

Materials and methods

Data collection
Twenty-three medical students participated in the study between May 2023 and December 2023. Participation was restricted to currently enrolled medical students with an expressed interest in ophthalmology and no prior knowledge of the Randleman criteria. Each author collaborated with the Hoopes Vision Research Center and recruited eligible participants from their respective institutions.
Participants were informed that the study aimed to assess the effectiveness of two different learning tools: ChatGPT-3.5 and an internet search engine. To mimic a self-directed learning situation, the topic "the Randleman criteria in ophthalmology" was provided with no additional contextual information.
Each participant was instructed to interact with ChatGPT-3.5 within a single conversation thread. Participants were allowed to use an unlimited number of prompts within a 10-minute time limit. Participants were not allowed to search the internet during the conversation with ChatGPT-3.5. Afterwards, participants were asked to submit a copy of the conversation thread and a summary of their findings.
Following the session with ChatGPT-3.5, participants were instructed to learn about the Randleman criteria using an internet search engine of their choice for a maximum duration of 10 minutes. Participants then provided a summary of their findings with a list of references.
Each entry was collected via e-mail and assigned an entry number between 1 and 23.

Data analysis
Each ChatGPT-3.5 conversation thread was examined, with the provided definition of the Randleman criteria and its five parameters being recorded, if any were given. Discrepancies from the definition and parameters described by Randleman et al. were noted [13]. Prompts given to ChatGPT-3.5 were counted and characterized. Innovative prompts were identified, including requests for mnemonic aids, tables, case examples, and comparative explanations. Instances where ChatGPT-3.5 requested more context, advised the student, or recognized a knowledge cut-off date were documented.
References discovered via internet search were categorized by publication type and inspected for accuracy. Each student's reflective summary was evaluated and compared against the information found in their accompanying references.
The McNemar test was run using R software (R Foundation for Statistical Computing, Vienna, Austria) to compare the paired samples with dichotomous outcomes of incorrect and correct definitions of the Randleman criteria. Post-hoc power analysis suggests that a sample size of 23 yields a power of 0.92 with an alpha of 0.00001. Confidence intervals (CI) were calculated using the modified Wald method.
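For illustration, the paired analysis can be reconstructed in Python (the authors used R) from the counts reported in this study: 6/23 correct definitions with ChatGPT-3.5 versus 23/23 with the internet search. The 2x2 cell layout below is our assumption about how the paired outcomes fall.

```python
# Reconstruction of the paired McNemar analysis from the reported counts.
# The cell layout (all ChatGPT-correct students also found correct sources
# online) is an assumption consistent with the reported 6/23 and 23/23.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

#                  search correct  search incorrect
table = np.array([[6,              0],    # ChatGPT-3.5 correct
                  [17,             0]])   # ChatGPT-3.5 incorrect

result = mcnemar(table, exact=False, correction=True)  # chi-square with continuity correction
print(f"statistic = {result.statistic:.2f}, p = {result.pvalue:.4f}")
# With 17 discordant pairs this yields p ~= 0.0001, matching the reported value.
```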

Results

Prompt analysis
ChatGPT-3.5 conversation threads contained a median of four total prompts, with a minimum of two and a maximum of 10 prompts. The majority of initial queries (78.2%, 18/23) did not provide any contextual information to ChatGPT-3.5, with the most common prompt (22.2%, 7/23) being "What is/are the Randleman criteria?". The term "ophthalmology" was included in 21.7% (5/23) of initial prompts. Multi-sentence prompts were not utilized.
ChatGPT-3.5 was challenged to provide definitions, case applications, value cut-offs, tables, mnemonics, and comparisons to other clinical models. ChatGPT-3.5 successfully defined LASIK, criteria, ectasia, and corneal ectasia for students. When asked to apply the Randleman criteria to a clinical case (17.4%, 4/23 conversations), partially accurate examples were supplied in each case. ChatGPT-3.5 did not provide specific cut-off values for parameters when requested (13.0%, 3/23 conversations). When ChatGPT-3.5 was asked to compare the Randleman criteria to other criteria and classification systems (30.4%, 7/23 conversations), a response was provided for 71.4% of requests (5/7 requests), with partially accurate information being provided in each case. When directly prompted for a table (4.3%, 1/23 conversations), ChatGPT-3.5 successfully provided a summary of the Randleman criteria in tabulated format. In another instance (4.3%, 1/23 conversations), ChatGPT-3.5 provided a mnemonic to assist in remembering the criteria.

Secondary outcomes: efficiency, accuracy, and reliability
For students who received a usable definition (11/23), a median of two prompts was required to elicit a definition (min: 1; max: 5). Falsified information was present in every ChatGPT-3.5 conversation thread.

Discussion
Our study found that the internet search engine outperformed ChatGPT-3.5 in providing accurate and reliable information on the Randleman criteria. ChatGPT-3.5 gave false information, required excessive prompting, and propagated misunderstandings. Dangerously, two student summaries contained false information from ChatGPT-3.5 even after the students had consulted the correct internet sources. Thus, exposing students to inaccurate information may result in insidious misunderstandings that resist correction. Learners of refractive surgery must be made aware of the risks associated with using ChatGPT-3.5 as an independent learning tool.
A common criticism of LLMs, ChatGPT-3.5 included, has been their inability to indicate the source of provided information. A surrogate for credibility is to reference the dataset used to train the LLM. We found that ChatGPT-3.5 consistently identified a knowledge cut-off date for its training data. However, when ChatGPT-3.5 failed to provide a definition for the Randleman criteria, students mistakenly assumed that the topic must have been introduced after the knowledge cut-off date. For an LLM to be consistently used as a learning tool, credibility and reliability must be established, a current pitfall of ChatGPT-3.5.
Since the release of ChatGPT-3.5, newer, more effective LLMs have already begun to surface. Released in March 2023, Google Bard has outperformed ChatGPT-3.5 on certain topics [18]. In addition, GPT-4, OpenAI's successor to ChatGPT-3.5, has demonstrated superiority on subspecialty practice questions [14,19,20]. However, GPT-4 requires a paid subscription, limiting its accessibility to learners. Additional LLMs are in production and likely to emerge soon, including those trained on more reliable datasets [21]. Further evaluations will be necessary to ensure the validity of such learning tools.
As LLMs progress, the development of effective "prompt engineering" has the potential to augment learning efficiency [22]. Prompt engineering refers to the crafting of effective prompts with abundant context [23]. In our study, students failed to use multi-sentence prompts and inconsistently provided background information, even when requested directly by ChatGPT-3.5. While a scarcity of such information is expected for self-directed learners, training in prompt engineering may be a valuable investment, especially as LLMs improve.
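As a hypothetical illustration, the sketch below contrasts a bare prompt of the kind observed in the study with a context-rich, multi-sentence prompt; the engineered wording and the OpenAI Python client setup are our own assumptions, not prompts drawn from the study data.

```python
# Hypothetical contrast between a bare prompt (typical of the study's
# participants) and a context-rich, engineered prompt. Prompt wording and
# client configuration are illustrative assumptions, not study data.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

bare_prompt = "What is the Randleman criteria?"

engineered_prompt = (
    "You are an ophthalmology educator. I am a medical student studying "
    "refractive surgery. Explain the Randleman criteria as described by "
    "Randleman et al. (2008) for assessing the risk of corneal ectasia "
    "after LASIK. List all five parameters with their cut-off values, and "
    "state explicitly if you are unsure about any value."
)

for prompt in (bare_prompt, engineered_prompt):
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content, "\n---")
```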
This study was limited in that ChatGPT-3.5 was queried on only one refractive surgery topic. ChatGPT-3.5's performance in this study may not represent its expertise on all topics within refractive surgery. Furthermore, the efficacy of an LLM is dependent upon the prompts provided. The prompting capabilities of the medical students in this study may not be representative of all medical students. Finally, learner comprehension was not assessed in this study. Future evaluations could assess learning outcomes when using an LLM or a combination of resources.

Conclusions
Although LLMs hold promise in augmenting medical learning, findings from this study raise significant concerns regarding the accuracy of ChatGPT-3.5 in providing information on refractive surgery. In the self-directed learning environment, where context is limited and efficiency is paramount, ChatGPT-3.5's role as an independent resource is currently limited. Providing training in prompt engineering to medical learners may improve their ability to extract quality information from ChatGPT-3.5 and other LLMs. The development of more accurate, efficient, and reliable LLMs would prove to be an asset in advancing medical education. In the interim, these results suggest that learners in refractive surgery should exercise caution when using ChatGPT-3.5 as an independent learning resource.

FIGURE 1: Comparing the ability to accurately define the Randleman criteria. *** indicates statistical significance (p = 0.0001) as calculated by the McNemar test of paired proportions.

FIGURE 2: Definitions of the Randleman criteria provided by ChatGPT-3.5. Correct definition: ChatGPT-3.5 identified the Randleman criteria as a model used to predict the development of postoperative corneal ectasia. Corneal ectasia staging: ChatGPT-3.5 erroneously defined the Randleman criteria as a system to stage the development of corneal ectasia. "Randleman syndrome": ChatGPT-3.5 fabricated a novel disease. None: ChatGPT-3.5 failed to provide a definition of the Randleman criteria.

TABLE 1: Results of each ChatGPT-3.5 conversation and internet search.
* ChatGPT-3.5 initially supplied the listed definition; however, over the course of the conversation, the definition was retracted. ** Participant provided an incorrect definition of the Randleman criteria despite citing the original paper describing the criteria. N/A: non-applicable, parameters were not provided through ChatGPT-3.5; CT: preoperative corneal thickness; MRSE: preoperative manifest refraction spherical equivalent; VA: visual acuity; IOP: intraocular pressure; BCVA: best corrected visual acuity.

TABLE 2: Initial query, access date, and total number of prompts for each entry.