Can the application of machine learning to electronic health records guide antibiotic prescribing decisions for suspected urinary tract infection in the Emergency Department?

Urinary tract infections (UTIs) are a major cause of emergency hospital admissions, but it remains challenging to diagnose them reliably. Application of machine learning (ML) to routine patient data could support clinical decision-making. We developed an ML model predicting bacteriuria in the emergency department (ED) and evaluated its performance in key patient groups to determine scope for its future use to improve UTI diagnosis and thus guide antibiotic prescribing decisions in clinical practice. We used retrospective electronic health records from a large UK hospital (2011–2019). Non-pregnant adults who attended the ED and had a urine sample cultured were eligible for inclusion. The primary outcome was predominant bacterial growth ≥10^4 cfu/mL in urine. Predictors included demography, medical history, ED diagnoses, blood tests, and urine flow cytometry. Linear and tree-based models were trained via repeated cross-validation, re-calibrated, and validated on data from 2018/19. Changes in performance were investigated by age, sex, ethnicity, and suspected ED diagnosis, and compared to clinical judgement. Among 12,680 included samples, 4,677 (36.9%) showed bacterial growth. Relying primarily on flow cytometry parameters, our best model achieved an area under the ROC curve (AUC) of 0.813 (95% CI 0.792–0.834) in the test data, and achieved both higher sensitivity and specificity compared to proxies of clinicians' judgement. Performance remained stable for white and non-white patients but was lower during a period of laboratory procedure change in 2015, in patients ≥65 years (AUC 0.783, 95% CI 0.752–0.815), and in men (AUC 0.758, 95% CI 0.717–0.798). Performance was also slightly reduced in patients with recorded suspicion of UTI (AUC 0.797, 95% CI 0.765–0.828). Our results suggest scope for use of ML to inform antibiotic prescribing decisions by improving diagnosis of suspected UTI in the ED, but performance varied with patient characteristics.
Clinical utility of predictive models for UTI is therefore likely to differ for important patient subgroups including women <65 years, women ≥65 years, and men. Tailored models and decision thresholds may be required that account for differences in achievable performance, background incidence, and risks of infectious complications in these groups.

Background
Urinary tract infections (UTIs) are a major cause of emergency admissions in high-income countries [1,2] with annual costs estimated in excess of $2.8 billion in the US alone [2]. However, the ability to diagnose UTI reliably in the emergency department (ED) and differentiate it from other reasons for attendance is undermined by a lack of rapid and accurate diagnostic tests for UTI [3], the fact that patients often present with non-specific symptoms (particularly in the elderly) [4], and the need to make rapid diagnostic decisions. Previous studies have therefore repeatedly reported both over- and undertreatment of suspected UTI in the ED [5,6].
Recently, researchers have started investigating whether the application of risk models to data that are routinely collected during ED visits may be used to support earlier diagnosis of UTI and guide antibiotic initiation [7-10]. In the largest study to date, Taylor et al. [7] showed that machine learning can predict bacteriuria with high accuracy using data from 80,000 ED patients who presented with symptoms that were broadly compatible with suspected UTI. Their model achieved both higher sensitivity and specificity when compared to retrospective proxies of clinicians' judgement. Similar results were reported by Müller et al. [8] on a smaller Swiss cohort. However, average performance measures alone may be insufficient to judge the utility of these models in clinical practice.
Due to the need for large sample sizes, previous models were developed using data from heterogeneous patient groups. Many patients included in these studies are actually at very low risk of UTI, attending the ED for other reasons (including non-specific symptoms like altered mental status, other infections such as pneumonia, or even non-infectious conditions like heart disease) and receiving routine investigations for UTI [6]. This makes it difficult to determine the value of these models in the primary target population of patients with suspected UTI. Successful deployment of predictive models for UTI requires good performance in this more narrowly defined target population, and may further need to distinguish between clinically important subgroups such as younger women (<65 years), older women (≥65 years), and men. These groups differ in their background incidence of UTI, prevalence of asymptomatic bacteriuria (which does not usually require treatment), and risk of complications [11], and model performance and interpretation may vary as a result. Finally, predictive models need to show that they can achieve satisfactory performance without reinforcing existing healthcare inequalities originating, for example, from race or ethnicity ("fair AI") [12].
Data availability: restrictions apply to the availability of the study data to protect individual confidentiality; they are not publicly available. Data are however available from Suzy Gallier (Head of Informatics Research & Commercial Development at University Hospitals Birmingham NHS Foundation Trust; email: staar@uhb.nhs.uk) upon reasonable request and with permission of University Hospitals Birmingham NHS Foundation Trust. The code used for all model development and evaluation in this study can be found at https://github.com/prockenschaub/uti-prediction.
In this study, we built on the work of Taylor et al. [7] to develop a model to predict bacteriuria in samples obtained from patients attending the ED in a large English hospital, which we evaluated in a temporally independent dataset. To explore scope to deploy such a model in clinical practice, we evaluated its performance in key patient subgroups defined by age (<65 and ≥65 years), sex, ethnicity (white, non-white), and UTI syndrome at presentation (urinary symptoms, lower UTI, pyelonephritis, urosepsis).

Data and study population
We used electronic health record (EHR) data collected routinely in the ED at Queen Elizabeth Hospital Birmingham (QEHB), which serves an ethnically diverse population in southwest Birmingham. Approximately 115,000 ED patients attend QEHB each year. A detailed explanation of the study data was published previously [13]. In short, we included all adult patients who attended the ED at QEHB between November 1st 2011 and March 31st 2019 and who had a urine sample sent for microbiological testing within 24 hours of arrival (S1 Fig). After being seen by the ED physician, each patient at QEHB is routinely assigned one or more symptoms and/or suspected diagnoses out of a predefined list of ~800 ED diagnostic codes (e.g., pyelonephritis, haematuria, or acute confusion; S1 Text) [14]. We excluded patients without a valid record of age or sex, patients aged <18 years, pregnant women (identified via a pregnancy-related discharge diagnosis within ±9 months of arrival), and those who had a urine sample that was not ultimately cultured. As our focus was community-onset UTI, we excluded patients whose earliest urine sample was taken more than 24 hours after their recorded arrival in the ED (to account for delays in delivering samples to the laboratory but exclude hospital-acquired infection), and those with a previous diagnosis of UTI recorded ≤30 days before the date of ED attendance.

Ethics approval and consent to participate
This research study was deemed exempt from NHS Research Ethics Committee review as there is no change to treatment or services or any study randomisation of patients into different treatment groups, and the study uses de-identified routinely collected data. Approval to undertake the study was obtained from the UK Health Research Authority ref: 17/HRA/3427.

Outcome
The binary primary outcome was bacterial growth in the ED urine sample, defined as growth of a predominant pathogen ≥10^4 colony-forming units per millilitre (cfu/mL) during microbiological culture. Growth of several different organisms at once was considered mixed growth (unless growth of Escherichia coli was explicitly recorded) and, following standard procedure at QEHB, was considered sample contamination [15]. In order not to miss bloodstream infections from a urinary source, urine samples were also considered positive if they showed bacterial growth <10^4 cfu/mL but the same pathogen was also grown from a blood sample taken within 24 hours of arrival.
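The outcome definition above can be read as a small labelling rule. The following Python sketch is illustrative only and is not taken from the study code: the record type and its field names are hypothetical, the E. coli exception is assumed to have been resolved upstream (so `mixed_growth` is false whenever E. coli was explicitly recorded), and mixed growth is assumed to take precedence over the blood culture rule.

```python
from dataclasses import dataclass

GROWTH_THRESHOLD_CFU_PER_ML = 10**4  # predominant-growth threshold stated in the text


@dataclass
class UrineCulture:
    # All field names are hypothetical, chosen for illustration.
    predominant_cfu_per_ml: float     # count for the predominant organism (0 if no growth)
    mixed_growth: bool                # several organisms, E. coli not explicitly recorded
    same_pathogen_in_blood_24h: bool  # same organism grown from blood within 24 h of arrival


def is_positive(culture: UrineCulture) -> bool:
    """Return True if the sample counts as bacterial growth in the main analysis."""
    if culture.mixed_growth:
        # Mixed growth is treated as contamination (assumed to override other rules).
        return False
    if culture.predominant_cfu_per_ml >= GROWTH_THRESHOLD_CFU_PER_ML:
        return True
    # Growth below the threshold still counts if the same pathogen grew from blood.
    return culture.predominant_cfu_per_ml > 0 and culture.same_pathogen_in_blood_24h
```

For the sensitivity analyses on mixed growth, the first branch would instead return True or the sample would be dropped before labelling.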
Notably, whether sent urines were actually cultured in the laboratory depended on the number of bacteria and white blood cells (WBCs) estimated from urine flow cytometry. The threshold values for proceeding to culture were WBC > 40/μL or bacteria > 4000/μL before October 2015 and were increased to WBC > 80/μL or bacteria > 8000/μL thereafter [13].
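As a minimal sketch, the laboratory screening rule described above could be written as follows. The function and constant names are hypothetical, and because the text only states that the thresholds changed in October 2015, the use of 1 October as the exact cut-over date is an assumption.

```python
from datetime import date

# Assumption: the text says "before October 2015"; 1 October is used as the cut-over.
THRESHOLD_CHANGE = date(2015, 10, 1)


def proceeds_to_culture(wbc_per_ul: float, bacteria_per_ul: float, received: date) -> bool:
    """Apply the flow cytometry screening rule in force when the sample was received."""
    if received < THRESHOLD_CHANGE:
        wbc_cutoff, bacteria_cutoff = 40, 4000
    else:
        wbc_cutoff, bacteria_cutoff = 80, 8000
    # A sample is cultured if either count strictly exceeds its cut-off.
    return wbc_per_ul > wbc_cutoff or bacteria_per_ul > bacteria_cutoff
```

Because only samples passing this rule were cultured and therefore eligible, the same flow cytometry reading could lead to inclusion before the change and exclusion after it, which is relevant to the performance shift seen around 2015.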

Candidate predictors
Candidate predictors were selected based on clinical expertise, previous literature, and availability of data within the EHR system [13]. Considered information included age at arrival (in ten-year age-bands), sex, ethnicity (Asian, Black, White, Other), Charlson Comorbidity Index (CCI), presence of underlying renal/urological conditions, previous hospital or emergency visits for UTI or other reasons, blood tests (WBC, platelets, C-reactive protein [CRP], creatinine, alkaline phosphatase [ALP], bilirubin), urine flow cytometry (bacteria, WBC, red blood cells [RBC], epithelial cells, casts, crystals), and calendar time (month, day of year, day of week, time of day). Suspected ED diagnosis was also included and grouped into UTI syndromes (lower UTI, pyelonephritis, urosepsis), UTI symptoms (urinary symptoms, abdominal pain, altered mental status), other infections (LRTI, sepsis of other origin, other infections), or non-infectious. A detailed list of the definition of each variable can be found in Rockenschaub et al. [13]. If more than one value was recorded for a variable during a patient's time in the ED, the mean value was included. Immunosuppression, vital signs, and previous antibiotic-resistant urine organisms were excluded from the analysis because they were recorded in <10% of patients. Following Taylor et al. [7], we also considered a reduced set of predictors (which could be more easily implemented as a model in the ED) using only age, sex, history of positive urine culture, and all available urine flow cytometry measurements.

Statistical analysis
Patient characteristics. Predictors were summarised for all patients and separately for those who did and did not have a positive urine culture. Continuous variables were described using mean and standard deviation (if approximately normally distributed) or median and interquartile range (if non-normal). Categorical variables were described via counts and percentages. Differences in variable distribution by culture status were tested via t-test (normal continuous variables), Wilcoxon rank-sum test (non-normal continuous variables), and χ² test (categorical variables).
Predictive modelling. For the predictive modelling, continuous predictors were capped at the 1st and 99th percentile ("winsorised") and transformed to approximate normality using Yeo-Johnson transformations. Since we observed at least some missingness for most of our variables, we considered three increasingly complex imputation strategies: mean imputation, k-nearest neighbour imputation, and multivariate imputation by chained equations (see S2 Text for a detailed description of each).
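The two preprocessing steps for continuous predictors can be sketched in plain Python as below. This is an illustration under stated assumptions rather than the study's R implementation: all function names are hypothetical, the percentile uses linear interpolation between order statistics (other conventions exist), and the Yeo-Johnson parameter `lam` is passed in directly, whereas in practice it would be estimated from the training data.

```python
import math


def percentile(values, q):
    """Percentile (q in [0, 100]) with linear interpolation between order statistics."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lower, upper = math.floor(pos), math.ceil(pos)
    frac = pos - lower
    return ordered[lower] * (1 - frac) + ordered[upper] * frac


def fit_caps(train_values, low_q=1, high_q=99):
    """Learn the winsorising limits from training data only."""
    return percentile(train_values, low_q), percentile(train_values, high_q)


def apply_caps(values, caps):
    """Cap values at previously learned limits (usable on held-out data)."""
    low, high = caps
    return [min(max(v, low), high) for v in values]


def yeo_johnson(y, lam):
    """Yeo-Johnson power transformation of a single value for a given lambda."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)
        return ((y + 1) ** lam - 1) / lam
    if abs(lam - 2) < 1e-12:
        return -math.log1p(-y)
    return -(((1 - y) ** (2 - lam)) - 1) / (2 - lam)
```

Separating `fit_caps` from `apply_caps` is a design choice that makes it easier to learn limits within each resampling run and apply them unchanged to held-out samples.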
We considered the following predictive algorithms: standard logistic regression (LR), logistic regression with fractional polynomials (LR-FP), elastic net (E-NET), random forest (RF), and extreme gradient boosting (XGB). For LR-FP, up to four degrees of freedom (equivalent to two polynomial terms) were considered and the best fitting one chosen via the Akaike Information Criterion (AIC) [16]. For E-NET, RF, and XGB, 30 hyperparameter combinations were randomly chosen (see S1 Table) [17] and the best performing combination was chosen after internal validation.
All models were trained on data up to December 2017. Data from January to March 2018 were set aside for recalibration, and data from April 2018 to March 2019 were reserved as a temporally external test set. Training and internal validation were performed on the training data via 10-times repeated 10-fold cross-validation, with all transformations and imputations being performed separately for each run to avoid data leakage. The best model of each algorithm class was externally validated on the test set (with and without recalibration using Platt scaling). Discriminative performance was evaluated using the area under the receiver operating characteristic curve (AUC), specificity, and negative predictive value (NPV). Thresholds for the calculation of specificity and NPV were chosen such that a predefined sensitivity of 95% was achieved [13]. Difference in performance between models was tested via resampling (S3 Text). Calibration was assessed using calibration plots with locally estimated scatterplot smoothing.
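To make the threshold-based metrics concrete, the sketch below picks the highest probability cut-off that still flags at least 95% of culture-positive samples and then computes specificity and NPV at that cut-off. Function names are hypothetical, and which dataset the threshold is derived from is not specified in this section (see the protocol [13]); the sketch simply takes scores and labels as given.

```python
import math


def threshold_for_sensitivity(scores, labels, target=0.95):
    """Highest cut-off such that at least `target` of positives have score >= cut-off."""
    positives = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    if not positives:
        raise ValueError("need at least one positive sample")
    needed = math.ceil(target * len(positives))  # positives that must be flagged
    return positives[needed - 1]


def specificity_and_npv(scores, labels, threshold):
    """Specificity and negative predictive value when score >= threshold is 'positive'."""
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    specificity = tn / (tn + fp)
    npv = tn / (tn + fn) if (tn + fn) else float("nan")
    return specificity, npv
```

With the sensitivity fixed, specificity describes the share of culture-negative samples that could be ruled out early, and NPV the share of ruled-out samples that were truly negative.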
Sensitivity analyses. In addition to the main analysis, we performed a broad set of sensitivity analyses to assess the robustness of our best model in specific situations and patient subgroups and to determine if there may be scope to deploy the model in clinical practice. We investigated changes in performance over time by re-running all analyses only on data before 2013 and testing the resulting model on data from 2013. We repeated this process for the years 2014, 2015, etc. Next, we evaluated the performance by age (<65 and ≥65 years), sex, ethnicity (white, non-white), and in the subgroup of patients with recorded suspicion of UTI indicated by an ED diagnosis of urinary symptoms, lower UTI, pyelonephritis, or urosepsis. The effect of mixed culture growth on our results was assessed by considering it as a positive culture or by excluding it from the analysis altogether. Finally, performance of our model was compared to two previously used proxies of clinicians' judgement [7]: ED diagnosis of UTI (lower UTI, pyelonephritis, or urosepsis) and/or prescription of systemic antibiotics recommended for UTI in QEHB's 2018 prescribing guidelines (see supporting information).
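Subgroup evaluation of discrimination can be sketched as follows. This is a simplified illustration with hypothetical function names: it computes the AUC as the proportion of positive-negative pairs in which the positive sample receives the higher score (ties counting one half), separately within each subgroup. It does not reproduce the resampling procedure used to compare subgroups (S3 Text), and the pairwise loop is quadratic, so it is meant for exposition rather than large datasets.

```python
from collections import defaultdict


def auc(scores, labels):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(scores, labels, groups):
    """Compute the AUC separately within each subgroup (e.g. sex or age band)."""
    buckets = defaultdict(lambda: ([], []))
    for s, y, g in zip(scores, labels, groups):
        buckets[g][0].append(s)
        buckets[g][1].append(y)
    return {g: auc(s, y) for g, (s, y) in buckets.items()}
```

For example, passing a list of "women"/"men" labels as `groups` returns one AUC per sex, computed only from pairs of samples within that group.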
All analyses were performed in R (v3.6.2) and RStudio (v1.2.5033) on Windows 10. A prospective protocol for this analysis was published in Rockenschaub et al. [13]. All results were reported following the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement (S1 Checklist) [18].

Patient characteristics
Half (51.9%) of included visits were from patients ≥65 years and two-thirds (66.0%) were from women (Table 1). 23.8% and 17.9% of patients had CCIs of 1-2 and ≥3, respectively. A history of renal (21.6%) or urological (28.5%) disease was common. Many included patients also had a hospital visit (47.8%) and/or urine sample (48.9%) recorded in the previous year. Over a third (39.4%) of included visits had a recorded ED diagnosis of UTI, with another 5.1% showing a record of urinary symptoms.
Bacterial growth was more commonly found among older patients, women, those of white ethnicity, and patients who previously had a positive urine culture (Table 1). It was also more commonly found among those with a recorded ED diagnosis of UTI (lower UTI, pyelonephritis, urosepsis) but not among those with only symptoms of UTI and was strongly associated with urine flow cytometry results and some blood tests, most notably CRP and platelet counts.

Predictive modelling
The best performing model to predict bacteriuria was an XGB model including all predictors, which achieved an AUC of 0.813 (95% CI 0.792-0.834; Table 2 and Fig 1) during external validation. Results from internal validation were similar, but performance was slightly worse using multiple imputation (S2 Table and S3 Table). The primary importance of urine flow cytometry results (which make up most predictors in the reduced set) for discriminative power was also observed in univariate analyses (S4 Table). The final XGB model tended to underestimate the risk of bacterial growth, which was (over-)corrected after re-calibration (Fig 2).

Sensitivity analyses
Coinciding with changes in laboratory procedures, estimated performance of our XGB model was reduced around 2015 (AUC 0.766, 95% CI 0.740-0.793; Fig 3). Reduced performance was also seen in patients aged ≥65 years (AUC 0.783, 95% CI 0.752-0.815) and in men (AUC 0.758, 95% CI 0.717-0.798), with bootstrapped p-values for a difference in performance of p = 0.004 compared to those aged <65 years and p<0.001 compared to women. There was no significant difference in performance for patients with an ED diagnosis of lower UTI, pyelonephritis, urosepsis, or UTI symptoms (AUC 0.797, 95% CI 0.765-0.828, p = 0.210; Table 3), and no evidence that performance varied by ethnicity (AUC 0.831, 95% CI 0.780-0.873, p = 0.153). The model showed some miscalibration in subgroups, primarily in the elderly (Fig 4). Estimated performance differed strongly depending on how the microbiological culture finding of mixed growth was classified. Mixed growth was found in 23.5% of all samples and is often considered indicative of a contaminated or unreliable sample [7,15]. When mixed culture growth was considered positive growth during training and testing, estimated external model performance increased to AUC 0.864 (95% CI 0.847-0.880), which further increased to AUC 0.892 (95% CI 0.875-0.909) if samples with mixed growth were excluded from the analysis altogether. Bootstrapping showed a clear difference in performance in both cases (p<0.001). Importantly, samples with mixed growth were frequently assigned high probabilities of bacteriuria, irrespective of how mixed growth was classified in the model (S3 Fig).
When compared to retrospective proxies of clinicians' judgement (ED diagnosis of UTI and/or prescription of systemic antibiotics recommended for UTI), our model achieved both higher sensitivity and specificity (Table 4). At a specificity of 63.7% (which would be achieved by a model that predicts bacteriuria whenever there was a recorded ED diagnosis of UTI), our model obtained considerably higher sensitivity (83.0%, 95% CI 80.3-85.0 versus 48.2%, 95% CI 44.3-52.1). Conversely, at a sensitivity of 59.9% (achieved by using recorded ED diagnosis of UTI and/or antibiotic prescribing to infer clinical judgement), our model achieved notably higher specificity (85.5%, 95% CI 83.1-87.7 versus 51.6%, 95% CI 48.3-55.0).

Discussion
In this retrospective EHR study, our best-performing model was able to predict bacterial growth in ED urine samples with an AUC of 0.813 in an ethnically diverse patient population and outperformed retrospective proxies of clinical judgement. However, performance differed over time and depending on the patient population in which it was used, with reduced performance in patients aged ≥65 years and men. Given the differences in UTI incidence, prevalence of asymptomatic bacteriuria, and risk of infectious complications in these important target populations, this has implications for the model's potential use to predict UTI and thus guide antibiotic prescribing in clinical practice and may suggest the need for separate models or thresholds for decision making.
Our model primarily relied on urine flow cytometry parameters when making its predictions. A reduced model based on age, sex, history of positive urine culture, and urine flow cytometry performed almost as well as a model using all predictors. Some flow cytometry results (bacterial count and WBC) were already used at QEHB's laboratory during the study period in a simple decision rule to screen for samples with extremely low probability of bacterial growth, which were then excluded from culture. The good predictive power of our model even in pre-screened urines suggests that the value of flow cytometry to support early diagnosis of bacteriuria and UTI may currently be underused in clinical practice. For example, had the model been used, it would have correctly identified 95% of samples which later showed bacterial growth while ruling out bacterial growth early in 36.6% of ultimately culture-negative samples. Of those samples that were flagged as likely negative, 90.3% were correctly classified and did not exhibit bacterial growth during culture. This highlights the potential of data-driven models to aid the diagnosis of UTI and to reduce laboratory cost if clinical parameters are used in addition to cytometry to select which urines are cultured.
While our model thus achieved good performance, this was lower than previously reported results from both the US (AUC 0.904) [7] and Switzerland (AUC 0.930) [8]. Reasons for this discrepancy are not immediately obvious, but important differences exist with regards to the data used for training and evaluation. Whereas only 2.9% of ED patients in our study had a urine culture requested, 25.6% of ED patients in the US-based study by Taylor et al. had a culture requested [7]. Although propensity to culture might genuinely be higher in the US, nation-wide estimates suggest much lower rates of 8.1% [19] and another US single centre study reported rates as low as 2.3% [20]. The US study may therefore have been subject to selection bias or, if urine cultures were indeed requested for one in four patients attending the ED, was not representative of other hospitals in the UK and US. Patient denominators are not available for the Swiss study by Müller et al. [8]. However, Müller et al. treated mixed growth as positive growth, which was also associated with higher performance in our study. Furthermore, samples that were a priori dismissed by our laboratory due to low bacteria or urinary WBC counts were cultured in Switzerland. These samples were unlikely to grow bacteria (S3 Fig), thus representing "easy wins". As a result, the algorithm developed by the Swiss authors would be expected to perform worse when transferred to our patient population and clinical and laboratory practice. This emphasises the importance of understanding variation in laboratory processes, which can have a major impact on the implementation of ML models in clinical practice. It is reassuring, though, that both models, like ours, predominantly relied on urinalysis parameters in their predictions, which agrees with findings from non-ED populations [3,9].

Strengths and limitations
To the best of our knowledge, this is the first model predicting bacterial growth in ED urines from a UK patient population. A major strength of this analysis is the use of a large sample of high-quality EHR data from a major teaching hospital. QEHB's long history of electronic record keeping [21] allowed us to use records collected over multiple years, perform extensive sensitivity analyses, and assess likely future model performance.
However, the data used in this analysis were recorded as part of routine care rather than specifically for research. Our data contained missing values, which needed to be addressed by imputation. Some key variables relevant to the diagnosis of UTI were completely absent from the EHR data for the duration of our study, including urine dipstick results and prior antibiotic prescribing outside of hospital. Dipsticks are commonly used to support the diagnosis of UTI [3,22] and prior antibiotic use may have prevented the growth of microorganisms during culture [6], potentially limiting the model's power to predict bacterial growth [23]. Furthermore, a substantial proportion of urine samples included in this analysis were submitted for culture in the absence of any recorded suspicion of UTI or were not cultured despite suspicion of UTI. While this likely reflects real-world clinical practice [4], clinical guidelines suggest that bacteriuria should only guide treatment in the presence of clear symptoms [24]. Reducing unnecessary investigations in patients who are very unlikely to have UTI could bring cost savings and support antimicrobial stewardship. Further research is required to understand the reasons why these patients are being investigated for suspected UTI. Finally, definitions of clinical judgement in this study were inferred indirectly from retrospective data and do not reflect the full complexity of real-world clinical decision making, potentially underestimating clinicians' performance in predicting bacteriuria.

Clinical, policy and research implications
Our results suggest a potential need for separate prediction models and decision thresholds in key populations such as the elderly or men. Variations in performance around a change in laboratory procedures in our observation period further demonstrate the difficulties of developing a single model that retains performance across time and hospitals. Instead of a one-size-fits-all approach, it may be necessary to (re-)train and validate models using local data from the target population [25]. Fortunately, the importance of key variables such as urine flow cytometry remained stable across studies [7,8,22] and clinical settings [9], and there might be an opportunity to improve diagnosis of UTI simply by feeding back these raw results to clinicians in real-time.
Our results also highlight the prevalence of mixed growth in ED settings, with one quarter of cultured urine samples showing mixed growth. While generally regarded as sample contamination [7,15], some authors have argued that strict microbiological protocols might miss important bacteriuria [26,27]. Either way, mixed growth has important implications for models that aim to predict (predominant) growth and is difficult to predict with currently available diagnostics [8].
We suggest that our model may be embedded within the laboratory workflow. Assuming all relevant clinical data have been recorded at the time of urine sample submission and are readily available within the electronic patient record, the laboratory can use these data and our model to rate the results of flow cytometry and provide rapid feedback to ED clinicians. The results should be reported back to clinicians in a way that helps them decide on the likelihood of UTI and choice of antibiotic treatment. On-going education of clinical personnel and audit of the process will be necessary for a successful implementation.

Conclusion
The ML models used in this study were able to predict bacterial growth in ED urine samples with good predictive accuracy, but expected performance varied with patient characteristics. Effective deployment of predictive models to guide antibiotic prescribing decisions for UTI is likely to require tailored approaches for patient subgroups with a high prevalence of asymptomatic bacteriuria (patients aged ≥65 years) or high risk of complications (men).

Supporting information
S1 Checklist. Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) statement. (PDF)