Is it possible to determine antibiotic resistance of E. coli by analyzing laboratory data with machine learning?

Objectives: Microbial antibiotic resistance remains a serious public health problem worldwide. Conventional culture-based techniques are time-consuming; therefore, new approaches are needed for detecting bacterial resistance. The aim of this study was to assess antibiotic resistance of Escherichia coli by analyzing biochemical parameters with machine learning systems, without using an antibiogram. Material and methods: Machine learning systems, namely K-Nearest Neighbors, Artificial Neural Networks (ANN), Support Vector Machine and Decision Tree Learning, were used to investigate whether E. coli is susceptible or resistant to antibiotics. The study was based on the clinical records of 103 patients previously diagnosed with E. coli infection, including complete blood count (CBC) and complete urinalysis (UA) results and C-reactive protein (CRP) values. Results: The accuracy rates of antibiotic resistance/susceptibility detected by ANN were as follows: Amikacin (96.0%), Ampicillin (77%), Ceftazidime (62%), Cefixime (63%), Cefotaxime (68%), Colistin (95%), Ciprofloxacin (76%), Cefepime (70%), Ertapenem (96%), Nitrofurantoin (90%), Phosphomycin (98%), Gentamicin (84%), Levofloxacin (98%


Introduction
Antibiotic-resistant bacteria are a serious public health problem worldwide because they can reduce the likelihood of successful treatment or even make treatment impossible. Infections caused by these bacteria often lead to higher rates of hospitalization and additional therapies as well as increased diagnostic and treatment costs. In the United States, almost 2.8 million people are diagnosed with antibiotic-resistant organisms each year and more than 35,000 of them are reported to die [1].
The growing incidence of antibiotic resistance has increased the variety of infections as well as the cost of additional treatments [2]. These processes are primarily driven by excessive and unnecessary use of antibiotics and easy access to these drugs [3].
Antibiotic resistance differs between Gram negative and Gram positive bacteria [4,5]. Complex mechanisms of antibiotic resistance include (i) natural (intrinsic) resistance (caused by non-target antibiotics), (ii) acquired resistance (caused by mutations, plasmids, or transposons), (iii) cross-resistance (i.e., resistance of a drug-resistant organism to drugs with similar effects), and (iv) multidrug resistance (i.e., resistance caused by multiple imported genes or enzymatic inactivation and structural changes) [6].
Urinary tract infections (UTI) are the most common bacterial infections. Gram negative bacteria are the leading cause of UTI in all age groups and in both sexes, with Escherichia coli (E. coli) being the most common UTI pathogen (65-75%) [7]. Assessment of antibiotic resistance by conventional methods can be a time-consuming process that comprises the following steps: (1) collection of urine specimens and pre-analytical processing of specimens in the laboratory (0-1 h), (2) addition of the specimen to the medium (Eosin Methylene-blue [EMB] Lactose Sucrose Agar for E. coli), (3) incubation of the medium (approximately 24 h), (4) identification of the bacteria following culture growth (approximately 4-6 h for Gram negative bacteria), and (5) performing an antibiogram to determine the antibiotic resistance of the identified bacteria (approximately 16 h for E. coli).
In total, these five steps take approximately two days. For this reason, computer-aided machine learning algorithms are needed to shorten this period and to support decision-making. Moreover, shortening the analytical process is highly important for taking prompt measures in both the diagnosis and treatment of patients. The present study was designed to assess antibiotic resistance of E. coli using only urinalysis (UA) and complete blood count (CBC) parameters and the C-reactive protein (CRP) value with four distinct machine learning systems, namely K-Nearest Neighbors (KNN), Artificial Neural Networks (ANN), Support Vector Machine (SVM), and Decision Tree Learning (DTL), without using an antibiogram.

Materials and methods
The study was based on the clinical records of 103 patients with E. coli infection aged 1-93 years (70 female and 33 male) who presented to Elazig Fethi Sekin City Hospital Central Laboratory between 2019 and 2020. Clinical records including complete blood count (CBC) and complete urinalysis (UA) results and CRP values, obtained on the same day as the antibiogram request, were analyzed by machine learning systems, namely K-Nearest Neighbors (KNN), Artificial Neural Networks (ANN), Support Vector Machine (SVM), and Decision Tree Learning (DTL), without using the antibiogram, in order to investigate whether E. coli was susceptible or resistant to antibiotics. Clinical information on each patient was extracted from the laboratory information system (LIS). Patients with resistant and/or recurrent urinary tract infections for whom biochemistry laboratory tests were requested at the same time as an antibiogram were included in the study. Feature selection requires experimenting with many different possibilities and drawing on the intuition of the domain expert. In our study, we selected features that are known to be associated with urinary tract infection and that consist of routine, easily accessible biochemistry laboratory parameters measured daily. Patients with incompatible clinical findings, those with suspected contamination, those with no growth in culture, those with bacterial growth other than E. coli, and cases in which the antibiogram was not requested together with the selected attributes (CRP, UA and CBC) were excluded from the study.
CBC parameters were measured using a Beckman Coulter DxH 800 hematology analyzer, urinalysis was performed using a Beckman Coulter IQ-2000 Elite analyzer, and CRP was measured using a Beckman Coulter Image-800 Immunochemistry System. Urine samples were cultured manually: 5% defibrinated sheep blood agar and eosin-methylene blue (EMB) agar plates were inoculated, and growth of a single microorganism above 10^5 cfu/mL was considered significant. Plates were incubated for 24 h, and bacterial identification was then performed using conventional methods with the MicroScan WalkAway 96 Plus ID/AST system (Beckman Coulter Inc., USA). In vitro antimicrobial susceptibility of the isolates was assessed using the MicroScan WalkAway 96 Plus ID/AST system with the broth microdilution method, and results were interpreted according to the EUCAST (European Committee on Antimicrobial Susceptibility Testing) criteria [8]. Identification and antimicrobial susceptibility testing were performed with the B1017-165: Rapid Negative ID4 and B1016-195: MIC EN52 combination panels. For ESBL (extended-spectrum beta-lactamase) detection, a bacterial suspension adjusted to McFarland 0.5 was inoculated onto Mueller Hinton agar (RTA, Turkey), and ceftazidime (Oxoid, UK) and ceftazidime/clavulanate (Oxoid, UK) discs were then placed. After 24 h of incubation, a difference of ≥5 mm in the zone diameter was interpreted in favor of ESBL production. Additionally, the susceptibility of these isolates to 19 antibiotics was tested according to EUCAST.
Figure 1 illustrates the machine learning algorithm used for the assessment of antibiotic resistance of E. coli, which is the most common pathogen causing UTI. For each antibiotic, three groups of parameters, namely (I) CBC (white blood cell count [WBC], neutrophil, lymphocyte, monocyte, eosinophil, basophil), (II) UA (density, pH, nitrite, erythrocyte, leukocyte), and (III) CRP, were used as inputs for the classifiers. This gave a total of 12 parameters.
To facilitate data processing, raw data were normalized between −1 and +1. All 12 parameters were used as an input vector for ANN, SVM, KNN, and DTL, and the performances of these classifiers were compared for each parameter.
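As a minimal sketch of this scaling step, the following Python function maps one parameter column linearly onto [−1, +1]. The study does not state the exact normalization formula, so min-max scaling is an assumption here, and the sample values are invented for illustration only.

```python
def scale_to_unit_range(column):
    """Linearly map a list of numbers onto [-1, +1] (min-max scaling)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant column carries no information; map it to 0.
        return [0.0 for _ in column]
    return [2.0 * (v - lo) / (hi - lo) - 1.0 for v in column]


# Hypothetical WBC values, not taken from the study's data.
wbc = [4.0, 7.0, 10.0, 16.0]
print(scale_to_unit_range(wbc))  # [-1.0, -0.5, 0.0, 1.0]
```

Each of the 12 parameters would be scaled separately in this way, so that parameters measured on very different scales contribute comparably to distance-based classifiers such as KNN.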

Data analysis
The models used in the study were tested in Matlab R2018b (The MathWorks, Inc. Cambridge, United Kingdom) on a computer with an i7-9750H CPU (2.6 GHz), 16 GB RAM and a GeForce GTX 1050 GPU. Laboratory data analysis (laboratory characteristics of patients) was

Classification
Artificial Neural Networks (ANN) are a machine learning technology that evolved from the idea of simulating the functioning of the human brain [9]. In ANN, backpropagation is the most widely used algorithm for updating a neural network. The generalized delta rule is a mathematically derived formula that determines how the network weights are updated. In this technique, a portion of the difference (error) between the target and output values is propagated back to each unit during a training step in order to update the weights according to the error, and this procedure is iterated a certain number of times to minimize the error [9].
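The weight update described above can be sketched for the simplest case, a single sigmoid unit trained with squared error. This is an illustrative simplification, not the network used in the study: the function name, the inputs, and the learning rate of 0.5 are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def delta_rule_step(weights, bias, inputs, target, learning_rate):
    """One delta-rule update for a single sigmoid unit (squared error)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    output = sigmoid(z)
    # Error signal: (target - output) scaled by the sigmoid derivative.
    delta = (target - output) * output * (1.0 - output)
    new_weights = [w + learning_rate * delta * x for w, x in zip(weights, inputs)]
    new_bias = bias + learning_rate * delta
    return new_weights, new_bias, output


# Repeating the step moves the output toward the target value of 1.0.
weights, bias = [0.0, 0.0], 0.0
for _ in range(200):
    weights, bias, output = delta_rule_step(weights, bias, [1.0, -1.0], 1.0, 0.5)
print(round(output, 3))
```

In a multilayer network the same error signal is passed backward layer by layer, which is what the term backpropagation refers to.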
The ANN model used in the present study consisted of 12 inputs (i.e., the 12 parameters shown in Table 1) and 1 output (i.e., susceptibility or resistance to the given antibiotic). The error and learning rates of the training were set to 0.01 and 0.005, respectively.
Support vector machine (SVM) is another machine learning algorithm used for classification problems, based on the structural risk minimization principle [10]. SVM tries to find the best hyperplane (also called the decision boundary) to separate two classes. Equation (1) presents the decision boundary used for SVM. In this equation, α_i represents the Lagrange multipliers, x_i is the support vector, and b represents the bias term [11,12]. Where linear separation is not possible, kernel functions such as the Radial Basis Function (RBF) can be used. In these equations, σ and d represent the kernel function parameters.
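For reference, the SVM decision function and two common kernels are usually written as below. This is the textbook formulation, chosen to be consistent with the symbols named above (α_i, x_i, b, σ, d). The class labels y_i, the number of support vectors N, and the polynomial kernel are assumptions, since only the RBF kernel is named explicitly in the text.

```latex
% Decision function: sign of a weighted sum of kernel evaluations plus bias
f(x) = \operatorname{sign}\left( \sum_{i=1}^{N} \alpha_i \, y_i \, K(x_i, x) + b \right)

% Radial Basis Function (RBF) kernel with width parameter \sigma
K_{\mathrm{RBF}}(x_i, x) = \exp\left( -\frac{\lVert x - x_i \rVert^{2}}{2\sigma^{2}} \right)

% Polynomial kernel of degree d (assumed second kernel)
K_{\mathrm{poly}}(x_i, x) = \left( x_i^{\top} x + 1 \right)^{d}
```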
The k-nearest neighbor (KNN) classifier is a commonly used machine learning algorithm that classifies a new data point according to its closeness to the k nearest training examples in the feature space [13].
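A minimal sketch of this idea in Python follows. The Euclidean distance, k = 3, the two-dimensional points, and the labels are assumptions for illustration; the study does not state which k or distance metric it used.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Hypothetical normalized feature vectors (two features only, for brevity).
points = [(-0.8, -0.5), (-0.6, -0.7), (0.7, 0.9), (0.5, 0.6), (0.9, 0.4)]
labels = ["susceptible", "susceptible", "resistant", "resistant", "resistant"]
print(knn_predict(points, labels, (0.6, 0.5)))    # resistant
print(knn_predict(points, labels, (-0.7, -0.6)))  # susceptible
```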
In the present study, decision tree learning (DTL), which is a tree-based learning algorithm, was also used to improve the classification performance [14].
In addition, k-fold cross validation was employed to minimize the distribution-related errors encountered during the training and testing of the proposed model [15]. In the study, k was chosen as 10 in accordance with the size of the dataset. Figure 2 illustrates the implementation of k-fold cross validation.
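The partitioning behind 10-fold cross validation can be sketched as follows for a dataset of 103 samples. The shuffling, the fixed seed, and the function name are assumptions made for illustration; the study's own Matlab partitioning is not described in detail.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


folds = list(k_fold_indices(103, k=10))
print([len(test) for _, test in folds])
# [11, 11, 11, 10, 10, 10, 10, 10, 10, 10]
```

In each of the 10 rounds the classifier is trained on the training indices and evaluated on the held-out fold, and the performance measures are then averaged over the rounds.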

Performance evaluation
The classification performance of the model was determined based on multiple criteria including Sensitivity (SN), Specificity (SP), Precision (PREC), Negative Predictive Value (NPV), False Positive Rate (FPR), False Discovery Rate (FDR), False Negative Rate (FNR), Accuracy (ACC), and F-Measure. Supplementary Material 1 presents the definitions of the parameters used for the Confusion Matrix and Supplementary Material 2 presents the formulas used for the calculation of the performance parameters [16].
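These measures all derive from the four confusion-matrix counts. The sketch below uses their standard definitions, which are assumed to match the supplementary formulas; the counts in the example are hypothetical and only chosen to sum to 103.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard performance measures from confusion-matrix counts."""
    def ratio(num, den):
        # Return None when a rate is undefined (empty denominator).
        return num / den if den else None

    return {
        "SN": ratio(tp, tp + fn),      # sensitivity (true positive rate)
        "SP": ratio(tn, tn + fp),      # specificity (true negative rate)
        "PREC": ratio(tp, tp + fp),    # precision
        "NPV": ratio(tn, tn + fn),     # negative predictive value
        "FPR": ratio(fp, fp + tn),     # false positive rate
        "FDR": ratio(fp, fp + tp),     # false discovery rate
        "FNR": ratio(fn, fn + tp),     # false negative rate
        "ACC": ratio(tp + tn, tp + fp + tn + fn),
        "F1": ratio(2 * tp, 2 * tp + fp + fn),  # harmonic mean of PREC and SN
    }


# Hypothetical counts: tp=40, fp=10, tn=45, fn=8.
for name, value in classification_metrics(40, 10, 45, 8).items():
    print(f"{name}: {value:.3f}")
```

Which class counts as "positive" (here assumed to be resistant isolates) has to be fixed before the counts are tallied, since SN and SP swap roles otherwise.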

Results
Of the patients, 70 were female (68%) and 33 were male (32%); 81 (78%) were outpatients and 23 were inpatients. ESBL was detected in 31 patients (30%). The parameters used for the analysis and their descriptions, reference ranges and the laboratory characteristics of the patients are given in Table 1.
Resistance of the E. coli isolates was 69.9% to Ampicillin and 51.4% to Cefixime, and was lowest to Imipenem and Levofloxacin (<1%). The resistance rates for each antibiotic are given in Table 2. All four classifiers (ANN, SVM, KNN, and DTL) were used to determine whether the E. coli isolates were resistant or susceptible to the 19 antibiotics tested. Classification results were then compared with antibiogram results. The results indicated that the performance of the classifiers varied across the antibiotics. Table 3 presents the performance results for each classifier based on the parameters used in the analysis. The performance of the proposed method in predicting antibiotic resistance was assessed using the CBC parameters, UA parameters, and CRP individually and also using all parameters together (Figure 3).

Discussion
The inappropriate use of broad-spectrum antibiotics in the treatment of community-acquired infections will cause antibiotic resistance to increase rapidly, and the resulting difficulties will grow in the future. For this reason, much faster and less expensive methods are clearly needed to determine antibiotic resistance.
Ayyıldız and Arslan Tuncer: Antibiotic resistance of E. coli and biochemistry
In studies conducted in our region, Duman et al. found the highest resistance of E. coli to Ampicillin and the lowest resistance to Amikacin and Imipenem, while Denk et al. found the highest resistance to Ampicillin and the lowest to Phosphomycin and Nitrofurantoin [17,18]. In our study, the highest resistance was found against Ampicillin and the lowest against Imipenem and Levofloxacin; hence our dataset is compatible with the E. coli antibiotic resistance patterns in our region (Table 2).
The results indicated that the analysis of routine biochemical laboratory parameters by machine learning systems can predict antibiotic resistance in infections with E. coli, the most common cause of UTI. Numerous previous studies have also used machine learning systems to investigate antibiotic resistance, as shown in Table 4.
As seen in Table 4, most of the studies predicted antibiotic resistance based on the specific genotype of the pathogen [19][20][21][22][23][24]. In contrast, Yelin et al. performed this prediction based on the 10-year clinical history of the patients [25]. The method proposed in the present study differs from those reported in other studies in that it is based on biochemical laboratory parameters that are routinely measured in almost all patients with suspected UTI. Accordingly, this method appears to be a reasonable option, as it is highly cost-effective and requires a relatively small amount of data for the analysis.
In the present study, four classifiers (ANN, SVM, KNN, and DTL) were used to determine E. coli resistance to antibiotics. In machine learning systems such as ANN, the parameters used for the analysis are problem-bound; therefore, it is often not possible to determine in advance which parameter setting (e.g., number of multilayer perceptrons [MLPs], number of neurons in hidden layers, learning coefficient) will provide an optimal outcome. As a solution, the trial-and-error method is used to determine which classifier will provide the highest performance. For these reasons, there is no generally valid comparison of these classifiers. Nonetheless, it could be asserted that an algorithm may be better suited to a specific problem [26].
Accuracy (ACC) is calculated as the number of correctly classified instances divided by the total number of instances in the dataset. Notably, ACC alone is not sufficient for the evaluation of imbalanced datasets. In the literature, limited data is available regarding the antibiotic resistance of the input parameters used in our study [27,28]. Therefore, a balanced distribution could not be achieved between our resistant and non-resistant groups in terms of the number of cases. AUG, CXM, IMI and MRP are shown as NED in Table 3. Sensitivity (SN) is the number of correct positive predictions divided by the total number of positives. Specificity (SP) is the number of correct negative predictions divided by the total number of negatives. SN and SP should be interpreted together. A test with high SP helps in avoiding misunderstandings and unnecessary, preventable interventions (True negative), while a test with high SN is needed particularly in cases of ambiguous diagnosis and in early disease conditions (True positive). The F1 score was also used in the present study because it employs the harmonic mean instead of the arithmetic mean and therefore avoids overlooking extreme values.
For some of the antibiotics analyzed in the study, there was a remarkable difference between the SN and SP values of the classifiers, which indicated poor generalization and classification performance. This difference was mostly seen in imbalanced datasets (Table 3).
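The effect of class imbalance on these measures can be illustrated with a small, entirely hypothetical example: if only 5 of 103 isolates were resistant, a trivial classifier that always predicts "susceptible" would still reach a high accuracy while detecting no resistant isolate at all.

```python
# Hypothetical imbalanced case: 5 resistant ("R") and 98 susceptible ("S") isolates.
actual = ["R"] * 5 + ["S"] * 98
# A trivial classifier that always predicts "susceptible".
predicted = ["S"] * len(actual)

pairs = list(zip(actual, predicted))
tp = sum(a == "R" and p == "R" for a, p in pairs)  # resistant, predicted resistant
fn = sum(a == "R" and p == "S" for a, p in pairs)  # resistant, missed
tn = sum(a == "S" and p == "S" for a, p in pairs)  # susceptible, predicted susceptible
fp = sum(a == "S" and p == "R" for a, p in pairs)  # susceptible, flagged resistant

accuracy = (tp + tn) / len(actual)
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(f"ACC={accuracy:.3f} SN={sensitivity:.3f} SP={specificity:.3f}")
# ACC=0.951 SN=0.000 SP=1.000
```

The large gap between SN and SP in this toy case mirrors the pattern described above and shows why accuracy should be read alongside the other measures.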
For each classifier, 10-fold cross validation was performed to improve validity and generalization performance (Figure 2). Low data exchange rates obtained during cross validation further support the linearity assumption of the dataset. Given that the performance of a classifier depends on the problem to be analyzed, it is reasonable to consider that using appropriate input parameters is as important as using an appropriate classifier. In our study, the use of all parameters (CBC, UA parameters and CRP) provided an acceptable classification performance. The individual effect of each parameter group was also assessed, and the results indicated that using all parameters together gave the highest classification performance for most of the antibiotics analyzed in the study (Figure 3, Table 2).
Our results also indicated that the performance of the classifiers varied across the antibiotics. For this reason, it cannot be said that the same classifier is best suited to every antibiotic in the study. The study also found a relationship between the parameters analyzed (CBC and UA parameters and CRP) and resistance to the different antibiotics used for the treatment of E. coli infection, and indicated that these parameters could be used in the diagnosis of this infection.

Limitations
The main limitation of the study is that we cannot determine how the performance parameters of our model will change on different datasets. Therefore, different datasets are needed to confirm the general accuracy of the findings. We aim to use different datasets in future studies.

Conclusion
Antibiotic resistance is a potentially serious public health problem worldwide [29]. The present study provides a relatively cost-effective and practical method that contributes to algorithmic treatment modalities requiring no antibiogram. Moreover, the study showed that E. coli infection can be diagnosed by the analysis of CBC, UA parameters and CRP with machine learning systems and without the use of an antibiogram, which has not previously been documented in the literature. The present study also contributed to the variety of diagnostic and treatment modalities by combining biochemical and microbiological laboratory parameters, thereby providing a substantial solution for clinical problems.

Figure 1: Illustration of the algorithm used in the study.

Figure 3: Classification results based on the use of CBC, UA parameters, CRP, and all of these parameters.

Table 1: Parameters used for the analysis and their descriptions, reference ranges and the laboratory characteristics of patients.

Table 2: The resistance of E. coli isolated from patients to antibiotics.

Table 3: Performance results of the antibiotics.

Table 4: Studies investigating antibiotic resistance by machine learning systems.