Data-Driven Two-Stage Framework for Identification and Characterization of Different Antibiotic-Resistant Escherichia coli Isolates Based on Mass Spectrometry Data

ABSTRACT In clinical microbiology, matrix-assisted laser desorption ionization–time-of-flight mass spectrometry (MALDI-TOF MS) is frequently employed for rapid microbial identification. However, rapid identification of antimicrobial resistance (AMR) in Escherichia coli based on a large amount of MALDI-TOF MS data has not yet been reported. This may be because building a single prediction model that covers all E. coli isolates is challenging given the high diversity of the E. coli population. This study aimed to develop a MALDI-TOF MS-based, data-driven, two-stage framework for characterizing different AMRs in E. coli. Specifically, amoxicillin (AMC), ceftazidime (CAZ), ciprofloxacin (CIP), ceftriaxone (CRO), and cefuroxime (CXM) were used. In the first stage, we split the data into two groups based on informative peaks selected according to random forest feature importance. In the second stage, prediction models were constructed using four different machine learning algorithms: logistic regression, support vector machine, random forest, and extreme gradient boosting (XGBoost). The findings demonstrate that XGBoost outperformed the other three machine learning models. The values of the area under the receiver operating characteristic curve were 0.62, 0.72, 0.87, 0.72, and 0.72 for AMC, CAZ, CIP, CRO, and CXM, respectively. The results suggest that a data-driven, two-stage framework could improve accuracy by approximately 2.8%. In summary, we developed AMR prediction models for E. coli using a data-driven, two-stage framework, which is promising for assisting physicians in making decisions. Further, the analysis of informative peaks in future studies could potentially reveal new insights.

IMPORTANCE Based on a large amount of matrix-assisted laser desorption ionization–time-of-flight mass spectrometry (MALDI-TOF MS) clinical data, comprising 37,918 Escherichia coli isolates, a data-driven, two-stage framework was established to evaluate the antimicrobial resistance of E. coli. Five antibiotics, namely amoxicillin (AMC), ceftazidime (CAZ), ciprofloxacin (CIP), ceftriaxone (CRO), and cefuroxime (CXM), were considered for the two-stage model training, and the values of the area under the receiver operating characteristic curve (AUC) were 0.62 for AMC, 0.72 for CAZ, 0.87 for CIP, 0.72 for CRO, and 0.72 for CXM. Further investigations revealed that the informative peak m/z 9714 appeared together with important peaks at m/z 6809, m/z 7650, m/z 10534, and m/z 11783 for CIP and at m/z 6809, m/z 10475, and m/z 8447 for CAZ, CRO, and CXM. This framework has the potential to improve accuracy by approximately 2.8%, which makes it a promising basis for further research.
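As an illustration of the first stage, the grouping of isolates by an informative peak might be sketched as follows. This is a minimal sketch, not the study's implementation: it assumes that the two groups are defined by the presence or absence of the informative peak m/z 9714, that each spectrum is available as a list of peak m/z values, and that a peak matches within a hypothetical tolerance (`TOLERANCE_DA`). The function names and isolate identifiers are illustrative only.

```python
from typing import Dict, List, Tuple

INFORMATIVE_MZ = 9714.0  # informative peak reported in the text
TOLERANCE_DA = 5.0       # hypothetical matching window, not taken from the study


def has_peak(peaks: List[float],
             target: float = INFORMATIVE_MZ,
             tol: float = TOLERANCE_DA) -> bool:
    """Return True if any peak lies within `tol` of the target m/z."""
    return any(abs(mz - target) <= tol for mz in peaks)


def split_by_informative_peak(
    spectra: Dict[str, List[float]],
) -> Tuple[List[str], List[str]]:
    """Stage one: partition isolate IDs by presence of the informative peak."""
    with_peak: List[str] = []
    without_peak: List[str] = []
    for isolate_id, peaks in spectra.items():
        if has_peak(peaks):
            with_peak.append(isolate_id)
        else:
            without_peak.append(isolate_id)
    return with_peak, without_peak


# Toy peak lists (made-up values for illustration only).
spectra = {
    "isolate_a": [6809.2, 9713.6, 10534.1],
    "isolate_b": [6809.0, 7650.3],
    "isolate_c": [9716.9, 11783.4],
}
group_with, group_without = split_by_informative_peak(spectra)
print(group_with, group_without)
```

In the second stage, a separate classifier would then be trained on each of the two groups.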

Then, we chose the model built by XGBoost, which attained the highest area under the receiver operating characteristic curve (AUC), and tuned its hyperparameters to obtain a higher AUC in 5-fold cross-validation (CV).
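For readers unfamiliar with the metric, the AUC can be read as the probability that a randomly chosen resistant isolate receives a higher score than a randomly chosen susceptible one. The following minimal sketch computes it from that definition in plain Python; the function name `roc_auc` and the toy labels and scores are illustrative assumptions, not data from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = resistant, 0 = susceptible (made-up scores).
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.3, 0.6, 0.7, 0.8]
auc = roc_auc(labels, scores)
print(round(auc, 4))
```

In a 5-fold CV setting, this value would be computed on each held-out fold and averaged to compare models or hyperparameter settings.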
Logistic regression (LR) is a widely used statistical model that uses a logistic function to estimate a binary dependent variable. It estimates the parameters of a binary logistic model in which, in this case, resistant and not resistant are represented by 1 and 0. The prediction is based on the log-odds of the class labeled "1", which are calculated as a linear combination of multiple variables, also called features or predictors. The log-odds are then converted to a probability by the logistic function. The general formula can be expressed as follows:

$$F(y) = \frac{1}{1 + e^{-y}}, \quad \text{where } y = a_0 + a_1 x_1 + \dots + a_n x_n.$$

F(y) is the predicted value of the model, obtained by applying the logistic function to the log-odds y; y is the linear combination of all predictors x_1, ..., x_n; and a_0, ..., a_n are the regression coefficients, which are trained to minimize the difference between the model prediction and the ground truth. Note that the logistic regression used in our study was using [...]

Support vector machine (SVM) is a supervised machine learning model that can be used for classification and regression on linear or nonlinear data. SVM separates the data with a hyperplane, searching for the optimal linear hyperplane whose margin is as wide as possible. If the data are not linearly separable, SVM can also apply the kernel trick, which maps the isolates into a higher-dimensional space. The general SVM formula is as follows:

$$W^{T} x + b = 0,$$

where W is the normal vector of the boundary hyperplane, and the cost function of SVM is

$$\min_{W,\,b} \; \frac{1}{2}\lVert W \rVert^{2} + C \sum_{i} \max\bigl(0,\; 1 - t_i\,(W^{T} x_i + b)\bigr),$$

where t_i is the class label of isolate i, coded as +1 or -1, and C is the coefficient of the penalty term. We used the SVC function in the scikit-learn svm package in Python [9] in this study. Apart from the settings listed here, the SVM parameters were left at their defaults. The important parameters are kernel = 'rbf', which uses the RBF kernel function to map the data; C = 1, the coefficient of the penalty term; and probability = True, which makes the model output prediction probabilities.
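The two scoring rules described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the study's code: the logistic regression part follows the log-odds description in the text, while the SVM cost uses the standard soft-margin (hinge loss) form with labels coded as +1 and -1, which is assumed here rather than taken from the source. All function names, coefficients, and sample values are hypothetical.

```python
import math
from typing import Sequence


def logistic(y: float) -> float:
    """Logistic function, written to avoid overflow for large |y|."""
    if y >= 0:
        return 1.0 / (1.0 + math.exp(-y))
    e = math.exp(y)
    return e / (1.0 + e)


def lr_probability(features: Sequence[float],
                   coefficients: Sequence[float],
                   intercept: float) -> float:
    """Probability of class 1: logistic function of the linear log-odds."""
    log_odds = intercept + sum(a * x for a, x in zip(coefficients, features))
    return logistic(log_odds)


def soft_margin_cost(w: Sequence[float], b: float,
                     samples: Sequence[Sequence[float]],
                     targets: Sequence[int],
                     C: float = 1.0) -> float:
    """Assumed soft-margin SVM cost: 0.5 * ||w||^2 + C * sum of hinge losses."""
    hinge = 0.0
    for x, t in zip(samples, targets):
        decision = sum(wi * xi for wi, xi in zip(w, x)) + b
        hinge += max(0.0, 1.0 - t * decision)
    return 0.5 * sum(wi * wi for wi in w) + C * hinge


# Log-odds of 0.5 * 1.0 + (-0.25) * 2.0 = 0.0 gives a probability of 0.5.
p = lr_probability([1.0, 2.0], [0.5, -0.25], 0.0)

# One sample outside the margin (no penalty), one inside it (penalty 0.5).
cost = soft_margin_cost([1.0, 0.0], 0.0, [[2.0, 0.0], [-0.5, 0.0]], [1, -1])
print(p, cost)
```

A larger C penalizes margin violations more heavily, which is the role the text assigns to the penalty coefficient.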
Random forest (RF) is an ensemble learning method that fits multiple decision trees (DTs) for classification. RF combines the results of numerous DTs by majority vote, and each DT is trained on part of the data or part of the features. This helps prevent overfitting and usually yields higher accuracy than a single DT. A DT is a tree-structured classification method. Each internal node of the tree represents a condition on a feature that separates the data, and, following the flow of the tree, each leaf finally represents a class label. The basic idea for choosing which feature to split on is information gain, in other words, the entropy difference before and after splitting. The RF criterion used in this study is the Gini index, which measures the impurity of each partition. The Gini index formula is:

$$\mathrm{gini}(D) = 1 - \sum_{j=1}^{n} p_j^{2},$$

where D is the dataset, which contains isolates from n classes, and p_j is the probability of class j in D. Suppose the data are split on A into two subsets D_1 and D_2; the Gini index given the split on A is:

$$\mathrm{gini}_A(D) = \frac{|D_1|}{|D|}\,\mathrm{gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{gini}(D_2).$$

Also, the scoring metric for evaluation was set to roc_auc. Finally, we obtained the best hyperparameters for CIP resistance prediction. We then built the model on all training data with the best tuned parameters, learning_rate = 0.001 and n_estimators = 2000, and used it as the final model for evaluation on the independent test set.
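The Gini index used as the RF splitting criterion can be illustrated with a small worked example. The sketch below assumes a binary split into two subsets and uses made-up labels ("R" for resistant, "S" for susceptible); the function names are illustrative.

```python
from collections import Counter
from typing import Sequence


def gini(labels: Sequence[str]) -> float:
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_split(left: Sequence[str], right: Sequence[str]) -> float:
    """Size-weighted Gini impurity of a binary split."""
    total = len(left) + len(right)
    return (len(left) / total) * gini(left) + (len(right) / total) * gini(right)


# Toy node with 3 resistant and 5 susceptible isolates.
parent = ["R", "R", "R", "S", "S", "S", "S", "S"]
left = ["R", "R", "R", "S"]   # isolates satisfying a hypothetical peak condition
right = ["S", "S", "S", "S"]  # remaining isolates

parent_impurity = gini(parent)          # 1 - (3/8)^2 - (5/8)^2
split_impurity = gini_split(left, right)
print(parent_impurity, split_impurity)
```

A split is preferred when the weighted impurity after splitting is lower than the impurity of the parent node, as it is in this toy case.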

Supplementary Figures
Supplementary Figure S1. Processing of MS spectrum by flexAnalysis.
Supplementary Figure S2. Number of spectral peaks in the E. coli isolates. Note: the x axis shows the peak count of each isolate, and the y axis shows the number of isolates; orange bars represent CIP-resistant isolates, and green bars represent CIP-susceptible isolates.

Supplementary Table S1. Numbers of data and susceptible proportion for each antibiotic.