Random forest differentiation of Escherichia coli in elderly sepsis using biomarkers and infectious sites

This study addresses the challenge of accurately diagnosing sepsis subtypes in elderly patients, particularly distinguishing between Escherichia coli (E. coli) and non-E. coli infections. Utilizing machine learning, we conducted a retrospective analysis of 119 elderly sepsis patients, employing a random forest model to evaluate clinical biomarkers and infection sites. The model demonstrated high diagnostic accuracy, with an overall accuracy of 87.5%, and impressive precision and recall rates of 93.3% and 87.5%, respectively. It identified infection sites, platelet distribution width, reduced platelet count, and procalcitonin levels as key predictors. The model achieved an F1 Score of 90.3% and an area under the receiver operating characteristic curve of 88.0%, effectively differentiating between sepsis subtypes. Similarly, logistic regression and least absolute shrinkage and selection operator analysis underscored the significance of infectious sites. This methodology shows promise for enhancing elderly sepsis diagnosis and contributing to the advancement of precision medicine in the field of infectious diseases.

information, comorbidities, hematological parameters, and details of the infection site. Our investigation combined rigorous statistical analysis, machine learning techniques, and feature importance analysis to identify the key biomarker and infection-site predictors in elderly sepsis patients with and without E. coli infection.

Correlation analysis among various clinical biomarkers
To further simplify the predictive model, this study analyzed pairwise Spearman's correlation coefficients (r) among 18 statistically significant biomarkers, excluding sex, through heatmap visualization. This approach helped identify and exclude highly correlated variables to streamline the model. The heatmap analysis, illustrated in Fig. 1, revealed the interrelationships between clinical biomarkers in elderly sepsis patients. A notably strong positive correlation (r = 0.99) between neutrophil counts and WBC indicated a close association between their levels. Conversely, a significant negative correlation was found between CRP levels and the ALB-CRP ratio (r = −0.61), suggesting an inverse relationship in the context of sepsis progression. The correlations for the rest of the variables were relatively weak (absolute value of r < 0.6), emphasizing their distinct contributions to the disease process. These insights highlighted the importance of certain biomarkers in enhancing the diagnosis and management of sepsis among the elderly.
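The correlation screen described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the 0.6 cut-off applied as a filter, and the toy values are assumptions introduced here, and Spearman's r is computed as Pearson's correlation on average ranks.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's r: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def highly_correlated_pairs(columns, threshold=0.6):
    """List (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for a, b in combinations(columns, 2):
        r = spearman(columns[a], columns[b])
        if abs(r) >= threshold:
            flagged.append((a, b, round(r, 2)))
    return flagged


# Made-up illustrative values, NOT study data.
toy = {
    "WBC": [4.1, 9.8, 15.2, 12.0, 7.3, 20.5],
    "Neutrophil": [2.5, 7.9, 13.1, 10.2, 5.0, 18.4],
    "PDW": [12.0, 16.5, 11.2, 15.8, 13.9, 12.7],
}
flagged = highly_correlated_pairs(toy)
```

In this toy input only the WBC and neutrophil columns are flagged, which mirrors the kind of redundancy the heatmap is meant to expose; one variable of each flagged pair would then be a candidate for exclusion.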

Logistic regression analysis
Tables 2 and 3 display the results of the univariate and multivariate logistic regression analyses. In the univariate analysis, smoking (P = 0.019), abdominal infection site (P = 0.023), and urinary infection site (P < 0.001) were significant factors (Table 2). In the multivariate analysis, however, smoking was no longer statistically significant (P = 0.157). Urinary infection site was a significant factor in both the univariate and multivariate analyses (P < 0.001), with an odds ratio (OR) of 14.380 (95% confidence interval [CI]: 3.552, 58.207) in the multivariate model, indicating a strong association with the type of infection.
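As a worked illustration of how an odds ratio and its interval relate to a logistic regression coefficient, the sketch below converts between the two. It assumes a Wald-type interval (OR = exp(B), limits = exp(B ± 1.96 × S.E.)), which is a common default but is not stated in the text; the function names are introduced here for illustration only.

```python
import math


def odds_ratio_with_ci(coef, std_err, z=1.96):
    """Convert a logistic coefficient and its standard error to (OR, lower, upper)."""
    return (
        math.exp(coef),
        math.exp(coef - z * std_err),
        math.exp(coef + z * std_err),
    )


def coef_and_se_from_ci(lower, upper, z=1.96):
    """Recover the coefficient and standard error implied by a Wald interval."""
    coef = (math.log(lower) + math.log(upper)) / 2
    std_err = (math.log(upper) - math.log(lower)) / (2 * z)
    return coef, std_err


# Interval reported for the urinary infection site in the multivariate model.
coef, std_err = coef_and_se_from_ci(3.552, 58.207)
odds_ratio, lower, upper = odds_ratio_with_ci(coef, std_err)
```

Under the Wald assumption, the midpoint of the reported interval on the log scale gives an odds ratio that agrees with the reported 14.380 to within rounding, and the wide interval reflects a comparatively large standard error on the coefficient.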

LASSO analysis
The LASSO analysis identified urinary infection site and age as the features with a positive influence, with coefficients of 0.2730 and 0.0031, respectively. Conversely, pulmonary infection site, HGB, and PDW showed a negative influence, with coefficient magnitudes of 0.0964, 0.0296, and 0.005, respectively (Fig. 3).

Random forest model analysis
Our analysis using the random forest model, as depicted in Fig. 4, identified critical predictors for distinguishing E. coli from non-E. coli infections among elderly sepsis patients. The site of infection emerged as the most influential feature, with a prominence score of 0.1655, followed by PDW, reduced platelet count, and PCT levels. Additionally, patient age and lymphocyte counts were significant but to a lesser degree.
Assessing model performance, we found notable precision and recall rates in the classification of E. coli versus non-E. coli infections. The model achieved a precision of 0.78 and a recall of 0.88 for E. coli, which translated to an F1-score of 0.82 across 8 instances (Table 4). For non-E. coli infections, the precision improved to 0.93, with a recall of 0.88, resulting in an F1-score of 0.90 based on 16 instances (Table 4). Our comprehensive evaluation, reflected in the confusion matrix (Fig. 5), validates the model's predictive strength, achieving an overall accuracy of 0.88 and macro-average and weighted-average F-scores of 0.86 and 0.88, respectively, across 24 samples. The model's robustness is further supported by Table 4, which summarizes the performance metrics.
When analyzing the random forest model's ability to predict E. coli infections specifically, Table 5 shows an accuracy of 87.5%, with a precision of 93.3% and recall of 87.5%. The F1 Score stood at 90.3%, and the model demonstrated high sensitivity and specificity, both at 87.5%. The positive and negative predictive values were 93.3% and 77.8%, respectively. The receiver operating characteristic (ROC) curve, shown in Fig. 6, with an area under the curve (ROC AUC) of 88.0%, underscores the model's diagnostic efficacy. The random forest model, with its high accuracy and precision, holds significant promise for complex biological classification tasks and could be a valuable tool in the clinical management of sepsis among the elderly.
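The reported performance figures can be re-derived from the four counts given in the Fig. 5 caption (14 correctly predicted positives, 7 correctly predicted negatives, 1 negative predicted as positive, 2 positives predicted as negative). The sketch below is a minimal check under that reading of the caption; the function and variable names are introduced here for illustration, and the "positive" class is simply the 16-instance class in the test set.

```python
def binary_metrics(tp, fn, fp, tn):
    """Standard metrics for one class treated as the positive class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    return {
        "accuracy": (tp + tn) / (tp + fn + fp + tn),
        "precision": precision,  # positive predictive value
        "recall": recall,
        "specificity": tn / (tn + fp),
        "npv": tn / (tn + fn),  # negative predictive value
        "f1": 2 * precision * recall / (precision + recall),
        "support": tp + fn,
    }


# Counts as stated in the Fig. 5 caption.
positive = binary_metrics(tp=14, fn=2, fp=1, tn=7)
# The same matrix viewed from the other class: roles of the cells are swapped.
negative = binary_metrics(tp=7, fn=1, fp=2, tn=14)

total = positive["support"] + negative["support"]
macro_f1 = (positive["f1"] + negative["f1"]) / 2
weighted_f1 = (
    positive["f1"] * positive["support"] + negative["f1"] * negative["support"]
) / total
```

With these counts, the 16-instance class yields the 93.3% precision, 87.5% recall and specificity, 77.8% negative predictive value, and 90.3% F1 score listed for Table 5, while the 8-instance class yields the 0.78 precision and 0.82 F1-score listed for Table 4.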
Table 1. Baseline characteristics of participants in the study. Continuous variables were presented as median (Q1, Q3) for skewed data or mean ± SD for normally distributed data. Categorical variables were presented as n (%). Q1, Q3, first and third quartiles, respectively; n (%), number of participants and percentage; HGB, hemoglobin; MCV, mean corpuscular volume; RDW, red cell distribution width; WBC, white blood cell count; PDW, platelet distribution width; MPV, mean platelet volume; CRP, C-reactive protein; PCT, procalcitonin; TG, triglycerides; ALB, albumin; ALB_CRP, albumin-CRP ratio; SD, standard deviation; Abdominal, infections located in the abdominal area; Pulmonary, infections located in the lungs; Urinary, infections located in the urinary tract; Other, infections located in areas not specified above.
(5.0%), and other bacteria (5.9%).

Discussion
This study offers significant insights into the characteristics, biomarkers, and microbial distributions in elderly sepsis patients and evaluates predictive models for infection type differentiation. Our findings highlight the critical role of comprehensive clinical and microbiological profiling in sepsis management, especially for the elderly, who face an increased risk from various comorbidities 14. Our analysis revealed a high prevalence of lifestyle risks and comorbidities, such as hypertension and diabetes, in our elderly cohort, further emphasizing their impact on sepsis risk. A. Komori et al. 15 further support this by demonstrating how biomarkers such as CRP and PCT can effectively predict bacteremia in sepsis ICU patients. Their study advocates for integrating these clinical factors into predictive models to enhance sepsis outcome predictions 15. These findings align with previous research, which indicated that lifestyle factors and pre-existing health conditions significantly influence sepsis outcomes in the elderly 16.
Our results underscore the critical role of both the site of infection and specific biomarkers, including hemoglobin, PDW, reduced platelet count, and PCT, in the determination of infection types. This emphasizes the imperative for comprehensive clinical evaluations to ensure precise diagnostics. Hemoglobin levels, as a reflection of the oxygen-carrying capacity of the blood, are crucial in the assessment of sepsis severity. Low hemoglobin concentrations may indicate impaired oxygen delivery, which can exacerbate sepsis outcomes 17. Reduced platelet count is often associated with increased severity of sepsis, as it may indicate disseminated intravascular coagulation or bone marrow suppression 18. A low platelet count can serve as a warning sign of complications, making it a critical marker in sepsis evaluation 19. Reduced platelet count, along with other biomarkers, provides valuable insights into the patient's immune response and infection status 20. The significance of PDW and PCT, alongside RDW and HCT, as traditional biomarkers in the diagnosis of sepsis, is reaffirmed. Our findings resonate with the research conducted by K. Song et al. 21, which identified RDW and HCT as significant predictors of in-hospital mortality among adult patients with E. coli-induced sepsis. This parallel underscores the importance of prompt and effective clinical assessment in improving sepsis patient outcomes. The random forest model's success in distinguishing E. coli from non-E. coli infections underscores machine learning's potential to enhance diagnostic accuracy. This parallels the findings of Jeng et al., who used similar techniques to predict recurrent urinary tract infections caused by E. coli 22. The identification of key features such as urinary and pulmonary infection sites, PDW, reduced platelet count, and PCT as crucial predictors further supports the combination of clinical and laboratory data in constructing predictive models.
Previous research by J. Shi et al. 9 highlighted the pivotal role of PCT as a particularly effective biomarker for discerning sepsis patients. Additionally, findings from a study by M. Su et al. 23 demonstrated that among 17 statistically significant biomarkers, PCT exhibited the highest AUC for diagnosing urosepsis. The integration of these models has the potential to substantially enhance the speed and accuracy of sepsis diagnosis, facilitating more timely and precise interventions.
Understanding the site of infection is essential for formulating clinical strategies for managing elderly sepsis 24. Our study revealed that urinary tract infections were the most common infectious site among elderly sepsis patients (32%), followed by pulmonary infections (30%), abdominal infections (20%), and other sites (18%). This finding aligns with prior research conducted by J. Doua et al. 25, which also identified the urinary tract as the primary source of infection (62.9%), followed by intraabdominal infections (20.4%), other infections (14.2%), and respiratory tract infections (2.5%). Urinary and pulmonary infections are critical in the context of sepsis, particularly in distinguishing between E. coli and non-E. coli infections. E. coli, a common Gram-negative bacterium in the gastrointestinal tract, frequently causes urinary tract infections 2,26. Our research uncovers a multifaceted microbial environment in sepsis, predominantly characterized by Gram-negative bacteria, especially E. coli, along with other non-E. coli bacteria. A previous study 9 showed that E. coli (40.0%) was the predominant bacterial finding in COVID-19 sepsis patients. This complexity underscores the urgent need for a broad-spectrum empirical diagnostic approach, which is particularly crucial for managing sepsis in vulnerable groups, such as the elderly. Similarly, Klebsiella pneumoniae is a leading cause of pulmonary infections, such as pneumonia 27,28. In sepsis, the infection site is crucial, as it acts as an entry point for pathogens and triggers systemic inflammatory responses 29. This can lead to severe sepsis or septic shock, particularly when E. coli is involved, given its virulence factors and ability to evade host immune responses 30.

Limitations and future directions
Our study has limitations, including its sample size and single-center design, which may affect the generalizability of the results. Future research should focus on multicenter studies with larger, more diverse populations to enhance the robustness and applicability of the findings. Further exploration of the mechanisms behind identified associations and the integration of genomic and proteomic data into the machine learning model could provide deeper insights into the pathophysiology of sepsis in elderly patients 31.

Study design and participants
This retrospective study was conducted at the Department of Clinical Laboratory, Fuding Hospital, Fujian University of Traditional Chinese Medicine. Medical records of 119 elderly patients (aged ≥ 60 years) diagnosed with sepsis from January to December 2022 were reviewed. Patients were divided into two groups: E. coli infections (case group, n = 57) and non-E. coli infections (control group, n = 62). Inclusion criteria specified sepsis patients over 60 years with solitary bacterial growth in blood cultures. Exclusion criteria encompassed subjects with significant heart or liver function abnormalities, a history of tumors or coagulation dysfunction, pregnancy or breastfeeding, and recent trauma or surgery. The study protocol was approved by the Medical Ethics Committee of Fuding Hospital, Fujian University of Traditional Chinese Medicine, with the ethical approval number Fuding Hospital 2,022,325. All methods were performed in accordance with the relevant guidelines and regulations. Due to its retrospective nature, the study was exempted from requiring written informed consent by the Medical Ethics Committee of Fuding Hospital, Fujian University of Traditional Chinese Medicine.

Bacterial identification and detection of biomarkers
Peripheral venous blood samples were collected from patients prior to antibiotic therapy initiation using sterile techniques to minimize contamination. Samples were immediately inoculated into Bactec culture vials to facilitate aerobic and anaerobic bacterial growth. The vials were then placed in a Bactec incubator (BD Diagnostics, Franklin Lakes, NJ, USA) and monitored for bacterial growth. Only bacterial isolates meeting predefined pathogenicity criteria were analyzed further, ensuring the findings' relevance to clinical sepsis.
For the purpose of this study, isolated pathogens were classified into two primary categories: E. coli and non-E. coli bacteria. Within these categories, bacteria were further organized into distinct phylogenetic groups to facilitate a detailed analysis of microbial diversity in sepsis. These groups included E. coli, Klebsiella pneumoniae, Staphylococcus spp., Streptococcus spp., and Enterococcus spp. To accommodate the identification of less common pathogens, a novel sixth category was established. This category included pathogens that did not fit into the aforementioned groups but were identified as clinically significant based on specific pathogenic criteria.
Further classification within the Staphylococcus and Enterococcus genera was conducted to provide insight into the specific species contributing to sepsis in the elderly population. The Staphylococcus spp. group comprised Staphylococcus aureus, Staphylococcus epidermidis, Staphylococcus saprophyticus, and Staphylococcus haemolyticus, while the Enterococcus spp. group included Enterococcus faecalis, Enterococcus faecium, Enterococcus gallinarum, and Enterococcus avium. This detailed categorization was essential for understanding the microbiological landscape of sepsis in the study population.
Clinical data, meticulously extracted from electronic medical records, included demographic details, lifestyle behaviors (such as smoking and drinking habits), comorbidities (including hypertension, cardiovascular diseases, and diabetes), and a comprehensive set of laboratory measurements. These measurements included HGB, MCV, RDW, WBC, neutrophil, lymphocyte, monocyte, platelet counts, PDW, MPV, CRP, PCT, cholesterol, triglycerides, uric acid, albumin, and the albumin-CRP (ALB-CRP) ratio. The infection site for each patient was carefully recorded, encompassing pulmonary, abdominal, urinary, and other locations. All biomarkers were evaluated within the initial 24 h after admission.
Serum PCT levels were accurately measured using the Cobas e411/E601 systems (Roche Diagnostics, Mannheim, Germany), renowned for their precision in diagnostic assays. CRP levels were determined with the Dimension Vista 1500 Intelligent Lab system (Siemens Healthcare GmbH, Erlangen, Germany), adhering strictly to the manufacturer's guidelines to ensure accuracy. The Beckman Coulter AU 5800, a state-of-the-art fully automated clinical chemistry analyzer, was employed for the quantification of cholesterol, triglycerides, uric acid, albumin, and the ALB-CRP ratio, facilitating a comprehensive lipid and protein profile assessment.
Furthermore, the Sysmex XN-9000 hematology analyzer (Sysmex Corporation, Kobe, Japan), a cutting-edge instrument, was utilized for conducting a complete blood cell count, including measurements of HGB, MCV, RDW, WBC, neutrophil, lymphocyte, monocyte, platelet counts, PDW, and MPV. The ALB-CRP ratio, a novel marker of inflammation and nutritional status, was calculated by dividing the albumin value by the CRP level, offering additional insights into the patient's health status and the systemic response to infection.

Statistical analysis
Statistical analysis was conducted using the Statistical Package for Social Sciences (SPSS) Version 22.0 (IBM Corp., Armonk, NY, USA), GraphPad Prism 8.0 (GraphPad software, San Diego California USA, www.Graph-Pad.com), the R package "CBCgrps" 13, and Python 3.7. This combination of tools supported both the statistical analysis and the application of machine learning, helping to ensure the reliability and reproducibility of our results. The Shapiro-Wilk test was used to assess the distribution of each variable in the E. coli and non-E. coli groups; variables that were not normally distributed were summarized as medians with interquartile ranges (IQRs). For the analysis of categorical data or proportions across the two groups, either the Chi-square test or Fisher's exact test was utilized, depending on the data's suitability. Baseline characteristics of study participants were summarized using descriptive statistics. Continuous variables were presented as either median (Q1, Q3) for those not following a normal distribution or as mean ± standard deviation (SD) for data with a normal distribution. For categorical variables, frequencies and percentages were reported. The comparison of continuous variables between groups was conducted using either the Mann-Whitney U test or the Student's t test, based on the distribution characteristics of the data.
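For readers unfamiliar with the nonparametric comparison named above, the following is a minimal sketch of the Mann-Whitney U statistic computed by direct pair counting, with a two-sided p-value from the large-sample normal approximation. It is an illustration only: it omits tie and continuity corrections that statistical packages normally apply, the function names are introduced here, and the numbers are made up rather than taken from the study.

```python
import math


def mann_whitney_u(x, y):
    """U statistic for sample x: pairs where x exceeds y, with ties counted as half."""
    return sum(
        1.0 if a > b else 0.5 if a == b else 0.0
        for a in x
        for b in y
    )


def two_sided_p_normal_approx(u, n1, n2):
    """Two-sided p-value from the normal approximation (no tie or continuity correction)."""
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))


# Made-up skewed values for two groups, NOT study data.
group_a = [0.8, 2.4, 5.1, 12.3, 30.0]
group_b = [0.3, 0.5, 1.1, 2.0, 3.9]

u_a = mann_whitney_u(group_a, group_b)
u_b = mann_whitney_u(group_b, group_a)
p_value = two_sided_p_normal_approx(u_a, len(group_a), len(group_b))
```

The two U statistics always sum to the number of cross-group pairs, which is a quick sanity check on any implementation. With samples this small an exact test would normally be preferred over the normal approximation.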

Logistic regression analysis
Univariate and multivariate logistic regression analyses were carried out to identify factors associated with the bacterial infection type, specifically distinguishing between E. coli and non-E. coli infections. Odds ratios (ORs) with 95% confidence intervals (CIs) were computed to quantify the strength and direction of associations. Variables associated with the outcome in the univariate logistic regression analysis (P value < 0.20) 32 were subsequently entered into the multivariate logistic regression model using the backward stepwise elimination method to adjust for potential confounders and to identify independent predictors of E. coli infection subtype. Variables with a P value higher than 0.05 were omitted from the multivariate logistic regression model. This approach was intended to identify statistically significant and clinically relevant determinants that might affect the risk of particular bacterial infections among elderly sepsis patients.
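The two-stage selection described above (univariate screening at P < 0.20, then backward elimination until every remaining variable has P ≤ 0.05) can be sketched as follows. This is a schematic of the control flow only, not the SPSS procedure the authors ran: `fit_p_values` is a hypothetical caller-supplied function that would refit the multivariate model, and the variable names and p-values are placeholders.

```python
def screen_univariate(p_values, threshold=0.20):
    """Keep variables whose univariate p-value is below the screening threshold."""
    return [name for name, p in p_values.items() if p < threshold]


def backward_eliminate(candidates, fit_p_values, keep_at_or_below=0.05):
    """Drop the least significant variable, refit, and repeat until all pass.

    fit_p_values(selected) is a hypothetical callable that refits the
    multivariate model on `selected` and returns {name: p_value}.
    """
    selected = list(candidates)
    while selected:
        p_values = fit_p_values(selected)
        worst = max(selected, key=lambda name: p_values[name])
        if p_values[worst] <= keep_at_or_below:
            break  # every remaining variable meets the retention criterion
        selected.remove(worst)
    return selected


# Placeholder p-values for illustration, NOT study results.
toy_univariate = {"var_a": 0.01, "var_b": 0.15, "var_c": 0.45}
toy_multivariate = {"var_a": 0.002, "var_b": 0.16}


def stub_fit(selected):
    """Stand-in for a real refit; a real model's p-values change as variables drop."""
    return {name: toy_multivariate[name] for name in selected}


screened = screen_univariate(toy_univariate)
final = backward_eliminate(screened, stub_fit)
```

Here `var_c` fails the screening step and `var_b` is removed during elimination, analogous to a variable that is significant in the univariate analysis but not in the adjusted model.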

Figure 3. Important features identified from the LASSO analysis. The coefficients represent the impact of each feature on the prediction of the response variable. The feature 'Site_Urinary' shows a strong positive impact, while 'Site_Pulmonary' has a significant negative impact.

Figure 5. Confusion matrix for the random forest model showing the classification performance. The model correctly predicts 14 positive and 7 negative cases, while misclassifying 1 negative as positive and 2 positives as negative.

Figure 6. Receiver operating characteristic (ROC) curve for the random forest model. The curve has an area under the curve (AUC) of 0.88, demonstrating the model's ability to distinguish between Escherichia coli and non-Escherichia coli infections.

Table 2. Univariate logistic regression analysis results for clinical features. B, coefficient; S.E., standard error; Ref, reference; HGB, hemoglobin; MCV, mean corpuscular volume; RDW, red cell distribution width; WBC, white blood cell count; PDW, platelet distribution width; MPV, mean platelet volume; CRP, C-reactive protein; PCT, procalcitonin; TG, triglycerides; ALB, albumin; ALB_CRP, albumin-CRP ratio; Abdominal, infections located in the abdominal area; Pulmonary, infections located in the lungs; Urinary, infections located in the urinary tract; Other, infections located in areas not specified above.

Table 3. Results of multivariate logistic regression analysis using the backward stepwise elimination method. HGB, hemoglobin; Ref, reference; B, coefficient; S.E., standard error; OR, odds ratio; CI, confidence interval.

Table 4. Performance metrics of the random forest classification model. Macro-average, average across all classes, giving each class equal weight; Weighted-average, average across all classes, weighted by support.

Table 5. Performance metrics of the random forest model for predicting Escherichia coli infections. ROC AUC, receiver operating characteristic area under curve.