Clinically Applicable System for Rapidly Predicting Enterococcus faecium Susceptibility to Vancomycin

ABSTRACT Enterococcus faecium is a clinically important pathogen that can cause significant morbidity and death. In this study, we aimed to develop a machine learning (ML) algorithm-based rapid susceptibility method to distinguish vancomycin-resistant E. faecium (VREfm) and vancomycin-susceptible E. faecium (VSEfm) strains. A predictive model was developed and validated to distinguish VREfm and VSEfm strains by analyzing the matrix-assisted laser desorption ionization–time of flight (MALDI-TOF) mass spectrometry (MS) spectra of unique E. faecium isolates from different specimen types. The algorithm used 5,717 mass spectra, including 2,795 VREfm and 2,922 VSEfm mass spectra, and was externally validated with 2,280 mass spectra of isolates (1,222 VREfm and 1,058 VSEfm strains). A random forest-based algorithm demonstrated overall good classification performances for the isolates from the specimens, with mean accuracy, sensitivity, and specificity of 0.78, 0.79, and 0.77, respectively, with 10-fold cross-validation, timewise validation, and external validation. Furthermore, the algorithm provided rapid results, which would allow susceptibility prediction prior to the availability of phenotypic susceptibility results. In conclusion, an ML algorithm designed using mass spectra obtained from the routine workflow may be able to rapidly differentiate VREfm strains from VSEfm strains; however, susceptibility results must be confirmed by routine methods, given the demonstrated performance of the assay. IMPORTANCE A modified binning method was incorporated to cluster MS shifting ions into a set of representative peaks based on a large-scale MS data set of clinical VREfm and VSEfm isolates, including 2,795 VREfm and 2,922 VSEfm isolates. Predictions with the algorithm were significantly more accurate than empirical antibiotic use, the accuracy of which was 0.50, based on the local epidemiology. 
The algorithm improved the accuracy of antibiotic administration, compared to empirical antibiotic prescription. An ML algorithm designed using MALDI-TOF MS spectra obtained from the routine workflow accurately differentiated VREfm strains from VSEfm strains, especially in blood and sterile body fluid samples, and can be applied to facilitate the rapid and accurate clinical testing of pathogens.


Specimen processing, Enterococcus faecium identification, and vancomycin susceptibility test
Clinical specimens were collected continuously as part of the daily routine from all wards and sent to the clinical microbiology laboratory of Chang Gung Memorial Hospital (both the Linkou and Kaohsiung branches). The specimen types included blood, respiratory tract specimens (i.e., sputum, bronchial wash, and bronchoalveolar lavage), sterile cavity fluid (i.e., ascites, pleural effusion, pericardial effusion, cerebrospinal fluid, and synovial fluid), implant tips, urine, wound, and others. The distribution of specimens is summarized in Supplementary Table 1. Blood specimens were collected after aseptic preparation and cultured in trypticase soy broth (Becton Dickinson, MD, USA). Positive cultures were detected using an automated detection system (BD BACTEC™ FX; Becton Dickinson). Blood drawn from positive blood culture bottles was subcultured on blood plate (BP) agar (Becton Dickinson, MD, USA). Only sputum specimens of adequate quality were used. [1] The respiratory specimens were inoculated on BP agar (Becton Dickinson), eosin methylene blue (EMB) agar (Becton Dickinson), CNA agar (Becton Dickinson), and chocolate agar (Becton Dickinson). Specimens obtained from sterile cavity fluid were inoculated on BP, EMB, CNA, and chocolate agars and into thioglycollate broth (Becton Dickinson). When growth was noted in thioglycollate broth, subculture was performed on BP agar. The semiquantitative culture method described by Maki et al. was used for testing implant tips. [2] Urine specimens were inoculated on BP and EMB agars using a quantitative loop. For wound specimens obtained with a swab, the swab was rinsed in 1.2 mL of 0.9% saline, and the rinse was inoculated on BP, EMB, CNA, and chocolate agars; pus collected from wounds was dropped directly onto the agars and into thioglycollate broth. The agar plates and broth were incubated in a CO2 incubator at 37°C for 18-24 hours.
Single colonies grown on agar plates were picked for further analysis. Enterococcus faecium was identified based on colony morphology and matrix-assisted laser desorption ionization time-of-flight (MALDI-TOF) spectra (Bruker Daltonik GmbH, Bremen, Germany). The paper disc method was used to differentiate vancomycin-resistant Enterococcus from vancomycin-susceptible Enterococcus, and vancomycin susceptibility was interpreted according to the Clinical and Laboratory Standards Institute (CLSI) M100 guidelines of the corresponding years. The interpretive criteria for vancomycin against E. faecium did not change during 2013-2017. Only specimen types available in sufficient quantity, that is, blood, urinary tract, sterile body fluid, and wound, were selected for building the vancomycin-resistant E. faecium (VREfm) prediction model in this study.

Binning method for extracting predictor candidates
In a MALDI-TOF mass spectrometry (MS) spectrum, the peaks were extracted and regarded as predictors for the construction of predictive models. In an initial scan through all MS spectra, the same peptide was located at slightly different m/z positions across replicates. The range of peak drift has been reported as ±5 m/z. [4,7] The peak shifting/drifting problem is illustrated in Supplementary Figure 1. To deal with this problem across spectra, a binning method was adopted to group the large number of peaks into a smaller number of "bins". Supplementary Figure 2 shows a diagram of the binning method used in this study. Peaks located within the same bin are considered the same feature. We evaluated various bin sizes (0-12 Da) and adopted 10 Da as the bin size for the following experiments, given the results of the evaluation and previous studies. [4,7] Moreover, on the basis of the binning results, we aligned the peaks according to m/z 4429 to obtain more accurate m/z positions. The peptide at m/z 4429 has been reported to be fairly abundant in E. faecium. [8,9] Thus, we selected the peptide at m/z 4429 as the intrinsic internal control for adjusting the other peaks (Supplementary Figure 3). In this study, peaks close to m/z 4429 could be found across all E. faecium isolates. We used the binning method to classify these peaks into smaller groups. An example is illustrated:

Supplementary example (figure not shown):
Only m/z positions whose occurrence frequency exceeds 20% of all cases are included in the calculation.
When the bin size is 1, 5 features are obtained; when the bin size is 3, only 1 feature is obtained.
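
The binning and alignment steps can be sketched as follows. This is a minimal illustration, not the study's modified binning method: it assumes fixed-width bins anchored at m/z 0, a simple additive shift toward the m/z 4429 reference peak, and a 5 m/z search tolerance. The function names and the sample peak lists are hypothetical.

```python
REFERENCE_MZ = 4429.0  # internal control peak described in the text


def align_to_reference(peaks, reference=REFERENCE_MZ, tolerance=5.0):
    """Shift all m/z values so the peak nearest the reference lands on it.

    peaks is a list of (mz, intensity) tuples. If no peak lies within
    the tolerance (an assumed value), the spectrum is returned unchanged.
    """
    candidates = [mz for mz, _ in peaks if abs(mz - reference) <= tolerance]
    if not candidates:
        return list(peaks)
    nearest = min(candidates, key=lambda mz: abs(mz - reference))
    shift = reference - nearest
    return [(mz + shift, intensity) for mz, intensity in peaks]


def bin_peaks(peaks, bin_size=10.0):
    """Map each peak to a bin index, keeping the maximum intensity per bin."""
    features = {}
    for mz, intensity in peaks:
        index = int(mz // bin_size)
        features[index] = max(features.get(index, 0.0), intensity)
    return features


# Two hypothetical replicate spectra whose peaks drift by a few m/z.
spectrum_a = [(4427.0, 120.0), (6221.0, 80.0)]
spectrum_b = [(4431.0, 100.0), (6225.5, 95.0)]

binned_a = bin_peaks(align_to_reference(spectrum_a))
binned_b = bin_peaks(align_to_reference(spectrum_b))
print(sorted(binned_a), sorted(binned_b))
```

After alignment and binning, the drifting peaks of both replicates fall into the same bins and can be treated as the same features.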

Heat map
On the basis of the chi-square scores of the predictor candidates (i.e., peaks), we selected the 10 most critical predictive peaks (Supplementary Table 2) and plotted a heat map using hierarchical clustering. The intensities of these 10 peaks were log-transformed and then standardized to z-scores. Hierarchical clustering was conducted using the Euclidean distance metric and average linkage. The heat map was produced with the pheatmap package in R (version 3.3.3; R Foundation for Statistical Computing, http://www.r-project.org/).
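
The preprocessing that precedes clustering can be sketched in a few lines. This is an illustration only: it assumes the natural logarithm, per-peak standardization with the sample standard deviation, and strictly positive intensities, none of which is stated in the text. The intensity values are hypothetical, and the clustering and plotting done by pheatmap are not reproduced.

```python
import math
import statistics


def log_zscore(intensities):
    """Log-transform one peak's intensities, then standardize to z-scores."""
    logged = [math.log(value) for value in intensities]
    mean = statistics.mean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation (assumption)
    return [(value - mean) / sd for value in logged]


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Hypothetical intensities: rows are peaks, columns are isolates.
peak_matrix = [
    [100.0, 1000.0, 10000.0],
    [50.0, 55.0, 400.0],
]

scaled = [log_zscore(row) for row in peak_matrix]
isolates = list(zip(*scaled))  # one standardized profile per isolate
distance = euclidean(isolates[0], isolates[1])
print(scaled[0], distance)
```

The choice of logarithm base does not affect the z-scores, because changing the base rescales every logged value by the same constant factor.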

Random forest
Random forest (RF), an ensemble learning method, is widely used for classification. More specifically, the ensemble learning approach combines multiple learning models to obtain an improved classification model and thus better predictive performance. [10] Additionally, RF uses the bootstrap aggregation (bagging) technique to sample the training data. In other words, the bootstrap method samples the training tuples uniformly with replacement, which means that a tuple that has already been selected can be selected again for the same training set. In this study, we adopted the Weka toolkit [11] to construct RF classifiers based on various feature sets.
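
The bootstrap sampling and vote aggregation described above can be sketched as follows. This is a generic illustration of bagging, not Weka's implementation; the function names, the fixed random seed, and the toy data are assumptions made for the example.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items uniformly at random with replacement."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(predictions):
    """Combine the class labels predicted by individual trees."""
    return max(set(predictions), key=predictions.count)


rng = random.Random(0)  # fixed seed so the example is repeatable
training_rows = list(range(10))  # stand-ins for training tuples

sample = bootstrap_sample(training_rows, rng)
# Tuples never drawn for this sample are "out of bag" for the tree.
out_of_bag = sorted(set(training_rows) - set(sample))

print(sample, out_of_bag)
print(majority_vote(["VREfm", "VSEfm", "VREfm"]))
```

Because sampling is done with replacement, some tuples typically appear more than once in a sample while others are left out, which is what gives each tree in the ensemble a different view of the training data.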

K-Nearest Neighbors
The nearest neighbor approach is an instance-based classifier that uses a distance function to find, among all training data, the instances most similar to a given test instance. Given a test instance, the k most similar instances are regarded as its k-nearest neighbors (KNNs), and the class assignment is determined in accordance with the proportions of the classes among the KNNs. Considering the training data and test data as n-dimensional vectors in Euclidean space, the Euclidean distance function is usually applied to measure the distances between the test data and all training data. Given a test instance t, the Euclidean distance between t and a training instance x is defined as $d(t, x) = \sqrt{\sum_{i=1}^{n} (t_i - x_i)^2}$, where n is the size of the feature set. After the KNNs are determined, the class labels of these k training instances might be inconsistent. A distance-weighted voting method was therefore used for the class assignment of a test instance. The class assignment C(t) of a test instance t is determined by $C(t) = \arg\max_{v} \sum_{j=1}^{k} w_j \, \delta(v, c(x_j))$, where v is the class label, $w_j$ is the weighted value of the class label of $x_j$ in the KNNs, $c(x_j)$ is the class label of $x_j$, and $\delta(a, b)$ equals 1 if a = b and 0 otherwise. For a binary classification between VREfm and vancomycin-susceptible E. faecium (VSEfm) samples, the positive and negative training instances were represented as n-dimensional vectors with class labels +1 and −1, respectively. Test instances without class labels are classified as +1 or −1 based on the k nearest training samples. In the KNN classifier, various values of k were examined to find the value giving the best performance.
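
A distance-weighted KNN vote of this kind can be sketched as follows. The text does not state the weighting function, so the inverse squared distance used here is an assumption, as are the function names, the small constant that avoids division by zero, and the toy training data.

```python
import math
from collections import defaultdict


def euclidean(t, x):
    """Euclidean distance between a test vector t and a training vector x."""
    return math.sqrt(sum((ti - xi) ** 2 for ti, xi in zip(t, x)))


def knn_predict(train, t, k=3, eps=1e-12):
    """Classify t by a distance-weighted vote among its k nearest neighbors.

    train is a list of (vector, label) pairs, with label +1 for VREfm
    and -1 for VSEfm. Each neighbor votes with weight 1 / (d^2 + eps),
    which is an assumed weighting, not one taken from the study.
    """
    neighbors = sorted(train, key=lambda item: euclidean(t, item[0]))[:k]
    votes = defaultdict(float)
    for x, label in neighbors:
        votes[label] += 1.0 / (euclidean(t, x) ** 2 + eps)
    return max(votes, key=votes.get)


# Hypothetical two-feature training set.
train = [
    ((0.0, 0.0), -1),
    ((0.0, 1.0), -1),
    ((1.0, 0.0), -1),
    ((5.0, 5.0), +1),
    ((5.0, 6.0), +1),
]

print(knn_predict(train, (0.5, 0.5), k=3))
print(knn_predict(train, (5.0, 5.4), k=3))
```

In the second query only two of the three nearest neighbors are positive, and the weighting lets the two close positive instances outweigh the distant negative one.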

Support Vector Machine
This study involved a binary classification of VREfm and VSEfm spectra. The positive (VREfm) and negative (VSEfm) spectra were labeled as +1 and −1, respectively, for the 2 classes. The training dataset is $X = \{x_i, y_i\}$, where $y_i = +1$ if $x_i \in$ positive dataset and $y_i = -1$ if $x_i \in$ negative dataset. This study attempted to identify $w$ and $w_0$ such that $w^T x_i + w_0 \ge +1$ for $y_i = +1$ and $w^T x_i + w_0 \le -1$ for $y_i = -1$, which can be rewritten as $y_i (w^T x_i + w_0) \ge +1$. This formula could be used to estimate the optimal separating hyperplane that maximizes the margin between the 2 classes. [12] The distance of $x_i$ to the discriminating hyperplane is $|w^T x_i + w_0| / \|w\|$, and we would like this distance to be at least a specific value h: $y_i (w^T x_i + w_0) / \|w\| \ge h$ for all i. The support vector machine (SVM) is an advanced algorithm used to identify a hyperplane between 2 classes with a maximum margin in an n-dimensional vector space. [12] In an attempt to maximize h, however, an unlimited number of solutions could be obtained by scaling $w$. Hence, $h\|w\|$ was fixed at 1 and $\|w\|$ was minimized using the following formulation [13]: $\min \frac{1}{2}\|w\|^2$ subject to $y_i (w^T x_i + w_0) \ge +1$ for all i. In this work, SVM was adopted to determine a hyperplane for discriminating between VREfm and VSEfm samples with a maximal margin in a vector space containing n dimensions (the size of the feature set). The mass-to-charge ratio values of each spectrum were represented as a numeric vector in an n-dimensional vector space and served as the input for the SVM. A widely used public SVM resource, LIBSVM, [14] was downloaded and installed on our computing server for iterative training of multiple SVMs on various feature sets. With ML, if the best discriminant is nonlinear, instead of fitting a nonlinear model, we could map all n-dimensional vectors to a new vector space of higher dimension m, where m > n, by using nonlinear kernel functions. As demonstrated in previous studies, [15-19] the radial basis function (RBF) is typically chosen as the kernel function in SVM models.
The RBF is given as follows: $K(x_i, x) = \exp\left(-\|x_i - x\|^2 / (2 s^2)\right)$, where $x_i$ is the center and s is the radius, which should be provided by the programmer. With LIBSVM, gamma (γ) and cost (c) are 2 supporting parameters used to tune the radius of the kernel function and the softness of the hyperplane, respectively. To find suitable values of gamma (γ) and cost (c) in model learning, an optimization program, written in Python and provided with LIBSVM, was used.
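
The relation between the kernel radius and LIBSVM's gamma parameter can be sketched as follows. LIBSVM writes the RBF kernel as exp(-gamma * ||u - v||^2), so a radius s corresponds to gamma = 1 / (2 s^2). The function names, the sample vectors, and the exponential search grid are assumptions made for illustration; the grid is not the parameter range used in the study.

```python
import math


def squared_distance(x, center):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, center))


def rbf_kernel_radius(x, center, s):
    """RBF kernel written with a radius s: exp(-||x - c||^2 / (2 s^2))."""
    return math.exp(-squared_distance(x, center) / (2.0 * s ** 2))


def rbf_kernel_gamma(x, center, gamma):
    """RBF kernel in LIBSVM's form: exp(-gamma * ||x - c||^2)."""
    return math.exp(-gamma * squared_distance(x, center))


def gamma_from_radius(s):
    """Gamma value that makes the two kernel forms identical."""
    return 1.0 / (2.0 * s ** 2)


# Hypothetical search grid over powers of 2 for cost and gamma.
cost_grid = [2.0 ** e for e in range(-5, 16, 2)]
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]
candidates = [(c, g) for c in cost_grid for g in gamma_grid]

x, center, s = (1.0, 2.0), (2.0, 0.0), 1.5
print(rbf_kernel_radius(x, center, s))
print(rbf_kernel_gamma(x, center, gamma_from_radius(s)))
print(len(candidates))
```

A larger gamma corresponds to a smaller radius and a more local kernel, while a larger cost penalizes margin violations more heavily. Each (cost, gamma) pair in such a grid would normally be scored by cross-validation and the best-scoring pair kept.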