Evaluation and mitigation of the limitations of large language models in clinical decision-making

Clinical decision-making is one of the most impactful parts of a physician’s responsibilities and stands to benefit greatly from artificial intelligence solutions and large language models (LLMs) in particular. However, while LLMs have achieved excellent performance on medical licensing exams, these tests fail to assess many skills necessary for deployment in a realistic clinical decision-making environment, including gathering information, adhering to guidelines, and integrating into clinical workflows. Here we have created a curated dataset based on the Medical Information Mart for Intensive Care database spanning 2,400 real patient cases and four common abdominal pathologies as well as a framework to simulate a realistic clinical setting. We show that current state-of-the-art LLMs do not accurately diagnose patients across all pathologies (performing significantly worse than physicians), follow neither diagnostic nor treatment guidelines, and cannot interpret laboratory results, thus posing a serious risk to the health of patients. Furthermore, we move beyond diagnostic accuracy and demonstrate that they cannot be easily integrated into existing workflows because they often fail to follow instructions and are sensitive to both the quantity and order of information. Overall, our analysis reveals that LLMs are currently not ready for autonomous clinical decision-making while providing a dataset and framework to guide future studies.


Diagnosis Error
Final Diagnosis: Based on the patient's symptoms, laboratory results, and imaging studies, the final diagnosis is: Appendicitis  format.Keep all reasoning in the Thought section.The Action, Action Input, Final Diagnosis, and Treatment sections should be direct and to the point.The results of the action will be returned directly after the Action Input field in the "Observation:" field.This places an unnecessary burden upon clinicians who would need to consider and monitor the order in which they feed the models with information.The boxplots are calculated using the possible permutations of ordering physical examinations, laboratory results and imaging results which are shown as colored dots (n=6).The center line marks the median of the data.The edges of the box are the lower and upper bound demarking 25% and 75% of the data points.
The whiskers show the minimum and maximum values that are a maximum of 1.5 times the interquartile range from the edges of the box.Outliers are shown separately.The value above each top-whisker shows the difference between the best performing and worst performing order.
E LLMs are sensitive to information order

F Removing Summarization Degrades CDM Diagnostic Accuracy
Figure 2: Not summarizing progress degrades LLM diagnostic accuracy on MIMIC-CDM.When an LLM approaches its input limit, we ask it to summarize the information gathered thus far to allow for more context.Increased input sizes allows it more opportunities to ask for information and increases the chances of requesting information that is important for the diagnosis.

J Example Exchange using Synthetic Data
In this section we show an example exchange between our simulated clinical environment provided by our evaluation framework using the MIMIC-CDM dataset and the WizardLM model.The first prompt is the system prompt given by the framework to the model explaining the task and its options.The rest of the exchange pulls (synthetic) data from MIMIC-CDM and records the LLM's response.Seen in this exchange is an example of the model requesting an imaging modality that is not available in the dataset and also hallucinating a surgery tool that does not exist.Upon reminding it of the available tools it correctly diagnoses the patient.This example exchange uses synthetic data to further protect patient privacy.

Evaluation Framework using MIMIC-CDM
You are a medical artificial intelligence assistant.You give helpful, detailed and factually correct answers to the doctors questions to help him in his clinical duties.Your goal is to correctly diagnose the patient and provide treatment advice.You will consider information about a patient and provide a final diagnosis.
You can only respond with a single complete

LLM
Thought: Given the acute onset of right lower quadrant pain, along with anorexia and right-sided flank pain, there could be several possibilities such as appendicitis, kidney stones, or inflammatory bowel disease like Crohn's disease.However, since the patient does not have any past medical history or family history of these conditions, it might be less likely.Therefore, we need to perform further investigations to determine the exact cause of the pain.Stomach and small bowel maintain normal caliber.A hyperdense, dilated structure, likely the appendix, arises from the cecal base, measuring up to 10 mm and exhibiting mucosal hyperenhancement compared to adjacent bowel.Predominant features include hyperenhancement and wall thickening, with less emphasis on fluid content.Inflammatory changes are evident at the cecal base, around the appendix, especially at its indistinct tip.
Adjacent to the ascending colon's mesentery, two enlarged lymph nodes, 1.5 cm and 1.9 cm in size, show brisk enhancement.The smaller node contains central low-density material.Both nodes are surrounded by marked fat stranding and suspected necrotic or suppurative changes.
The ascending, transverse, and descending colon appear normal.No mesenteric free air is observed.

PELVIS:
Urinary bladder and adnexa are unremarkable.Uterus contains an intrauterine device in standard position.A small volume of simple free fluid is present in the pelvis, likely physiologic.

VESSELS:
Aorta and its major branches are normal in caliber and patent.

OSSEOUS STRUCTURES:
No focal lytic or sclerotic lesions suggestive of malignancy are identified.

LLM
Thought: The CT scan supports the diagnosis of acute appendicitis due to the enlarged and inflamed appendix observed.I will recommend surgical intervention for removal of the affected organ.Action: Surgery Action Input: Appendectomy  9: Additional metrics.This table shows additional performance metrics calculated on the MIMIC-CDM-FI dataset for the three primary models investigated in this study.Sensitivity is equal to per-class accuracy used throughout this work.To calculate these additional metrics, false positives and true negatives were counted over the diseases included in this dataset which is not representative of a general population and the metrics should thus be interpreted with caution.

FinalFinal
Diagnosis: AppendicitisFinal Diagnosis: Acute Gallstone Pancreatitis.The patient's symptoms of abdominal pain, nausea, vomiting, and fever, along with the laboratory results of elevated serum amylase and lipase levels, suggest a diagnosis of acute pancreatitis.
reflect on your progress and decide what to do next) Action: (the action name, should be one of [{tool_names}]) Action Input: (the input string to the action) Observation: (the observation from the action will be returned here) OR Format 2: Thought: (reflect on the gathered information and explain the reasoning for the final diagnosis) Final Diagnosis: (the final diagnosis to the original case) Treatment: (the treatment for the given diagnosis) The tools you can use are: Physical Examination: Perform physical examination of patient and receive the observations.Laboratory Tests: Run specific laboratory tests and receive their values.The specific tests must be specified in the 'Action Input' field.Imaging: Do specific imaging scans and receive the radiologist report.Scan region AND modality must be specified in the 'Action Input' field.{add_tool_descr}{system_tag_end}{user_tag_start}{examples}Consider the following case and come to a final diagnosis and treatment by thinking, planning, and using the aforementioned tools and format.Patient History: {input}{user_tag_end}{ai_tag_start}Thought:{agent_scratchpad}

Figure 1 :
Figure1: Boxplots showing LLMs are sensitive to the order of information.By mixing the order in which information is presented to LLMs, their diagnostic accuracy changes despite the information included staying the same.This places an unnecessary burden upon clinicians who would need to consider and monitor the order in which they feed the models with information.The boxplots are calculated using the possible permutations of ordering physical examinations, laboratory results and imaging results which are shown as colored dots (n=6).The center line marks the median of the data.The edges of the box are the lower and upper bound demarking 25% and 75% of the data points.The whiskers show the minimum and maximum values that are a maximum of 1.5 times the interquartile range from the edges of the box.Outliers are shown separately.The value above each top-whisker shows the difference between the best performing and worst performing order.

Figure 3 :
Figure 3: Expanding medical abbreviations does not improve LLM performance.Diagnostic accuracy on the clinician subset of MIMIC-CDM-FI stays the same or degrades when medical abbreviations are written out.
all reasoning in the Thought section.The Action, Action Input, Final Diagnosis, and Treatment sections should be direct and to the point.The results of the action will be returned directly after the Action Input field in the "Observation:" field.Format 1: Thought: (reflect on your progress and decide what to do next) Action: (the action name, should be one of [Physical Examination, Laboratory Tests, Imaging]) Action Input: (the input string to the action) Observation: (the observation from the action will be returned here) OR Format 2: Thought: (reflect on the gathered information and explain the reasoning for the final diagnosis) Final Diagnosis: (the final diagnosis to the original case) Treatment: (the treatment for the given diagnosis) The tools you can use are: Physical Examination: Perform physical examination of patient and receive the observations.Laboratory Tests: Run specific laboratory tests and receive their values.The specific tests must be specified in the 'Action Input' field.Imaging: Do specific imaging scans and receive the radiologist report.Scan region AND modality must be specified in the 'Action Input' field.Consider the following case and come to a final diagnosis and treatment by thinking, planning, and using the aforementioned tools and format.Evaluation Framework using MIMIC-CDM Patient History: ___ presenting with 3 days since acute onset of pain in RLQ.Pain started ___ afternoon but worsened significantly overnight.Intense episodes woke her from sleep.Patient proceeds to experience nausea and chills.No vomiting.Patient endorses anorexia.Past Medical History: Anorexia, Anxiety.Family History: Non-contributory.

Table 2 :
Example errors.Examples of the types of errors commonly made by models when providing actions and diagnoses.The corrected example in the desired format is also provided.Note that tool hallucination examples are simply not valid actions and so there are no corrected examples provided.{system_tag_start}Youare a medical artificial intelligence assistant.You give helpful, detailed and factually correct answers to the doctors questions to help him in his clinical duties.Your goal is to correctly diagnose the patient and provide treatment advice.You will consider information about a patient and provide a final diagnosis.
C.2 CDM Observation Summarize Template{system_tag_start}You are a medical artificial intelligence assistant.Your goal is to effectively, efficiently and accurately reduce text without inventing information.You want to return verbatim observations that are abnormal and of interest to a possible diagnosis of the patient.Normal observations can be combined.Do not invent information.on the provided information to assist a doctor in his clinical duties.Your goal is to correctly diagnose the patient.Based on the provided information you will provide the primary diagnosis.Don't write any further information.Give only a single diagnosis.{system_tag_end}{fewshot_examples}{user_tag_start}Providethe primary diagnosis of the following patient.
{input}{diagnostic_criteria}{user_tag_end}{ai_tag_start}Final Diagnosis:""" C.9 CDM-FI No User Template {system_tag_start}You are a medical artificial intelligence assistant.You directly diagnose patients based on the provided information to assist a doctor in his clinical duties.Your goal is to correctly diagnose the patient.Based on the provided information you will provide a final diagnosis of the most severe pathology.Don't write any further information.Give only a single diagnosis.{system_tag_end}{fewshot_examples}{user_tag_start}{input}{diagnostic_criteria} {user_tag_end}{ai_tag_start}Final Diagnosis:""" C.10 CDM-FI No Medical Template {system_tag_start}You are an artificial intelligence assistant.You answer questions to the best of your abilities.Think hard about the following problem and then provide an answer.{system_tag_end}{fewshot_examples}{user_tag_start}Provide the most likely final diagnosis of the following patient.{input}{diagnostic_criteria}{user_tag_end}{ai_tag_start}Final Diagnosis:""" C.11 CDM-FI Serious Final Template {system_tag_start}You are a medical artificial intelligence assistant.You directly diagnose patients based on the provided information to assist a doctor in his clinical duties.Your goal is to correctly diagnose the patient.Based on the provided information you will provide a final diagnosis of the most severe pathology.Don't write any further information.Give only C.12 CDM-FI Minimal System Template {system_tag_start}You are a medical artificial intelligence assistant.You diagnose patients based on the provided information to assist a doctor in his clinical D LLMs are Sensitive to the Order of Information

Table 3 :
Llama 2 Chat diagnostic performance with different diagnostic information orderings.Changing the order of the presented information changes diagnostic accuracy.Crucially, the best order (bold) is disease specific, meaning a clinician must deliver a preliminary diagnosis to ensure proper model performance, eliminating many of the benefits of an AI clinical decision making system.

Table 4 :
OASST diagnostic performance with different diagnostic information orderings.Changing the order of the presented information changes diagnostic accuracy.Crucially, the best order (bold) is disease specific, meaning a clinician must deliver a preliminary diagnosis to ensure proper model performance, eliminating many of the benefits of an AI clinical decision making system.

Table 5 :
WizardLM diagnostic performance with different diagnostic information orderings.Changing the order of the presented information changes diagnostic accuracy.Crucially, the best order (bold) is disease specific, meaning a clinician must deliver a preliminary diagnosis to ensure proper model performance, eliminating many of the benefits of an AI clinical decision making system.

Table 6 :
Accepted diagnostic phrasings for each pathology.For each pathology, alternative diagnostic phrasings were accepted.If within one diagnostic phrase (delineated through e.g.periods, commas, 'vs', or other separators) the location and modifier occurred without negation, the diagnosis was marked as correct.For example, for cholecystitis a diagnosis of 'Perforated gallbladder' or 'Gallbladder infection' were both accepted.A diagnosis of 'Gallbladder disease vs perforated appendix' would not be accepted.Shortened substrings of the locations and modifiers were used to allow for alternative endings to match such as both 'An infect of the gallbladder' and 'Infected gallbladder', or 'Periappendicular abscess' and 'Perforated appendix'.Manual controls were done to verify the specificity and sensitivity of the definitions.

Table 7 :
Demographic Statistics of Patients with Different Diseases

Table 8 :
Demographic Statistics of Physician Comparison Subset of 80 patients I LLMs Diagnostic Accuracy Without Medical Abbreviations After performing the physical examination, the vital signs appear normal except for a slightly elevated blood pressure.Findings of tenderness to palpation in the right lower quadrant of the abdomen supports the forementioned possibilities of the patient's complaints, such as appendicitis, kidney stones, urinary tract infection, ovarial cysts or inflammatory bowel disease.We need to investigate further to rule out potential causes.Since ultrasound is not suitable for this case, we should proceed with another imaging modality.Given the possibility of hepatobiliary involvement, a computed tomography (CT) scan would be more appropriate to visualize the internal organs better.
Action: Laboratory Tests Action Input: Complete Blood Count (CBC), Liver Function Tests (LFTs), Renal panel, Urinalysis LLM Thought: ABDOMEN: Liver and gallbladder present normally; portal and hepatic veins are patent.Pancreas and spleen are normal; bilateral adrenal glands are unremarkable.Kidneys show symmetrical enhancement without hydronephrosis.