Fast and accurate bacterial species identification in biological samples using LC-MS/MS mass spectrometry and machine learning

The identification of microbial species in biological samples is essential to many applications in health, food safety and environment. MALDI-TOF MS technology has become a tool of choice for microbial identification, but it has several drawbacks: it requires a long bacterial culture step prior to analysis (24h), it has low specificity, and it is not quantitative. We have developed a new strategy for identifying bacterial species in biological samples using specific LC-MS/MS peptidic signatures. In the first, training step, deep proteome coverage of the bacteria of interest is obtained in Data Independent Acquisition (DIA) mode, followed by the use of machine learning to define the peptides most likely to distinguish each bacterial species from the others. In the second step, this peptidic signature is monitored in biological samples using targeted proteomics. This method, which allows bacterial identification from clinical specimens in less than 4h, has been applied to fifteen species representing 84% of all Urinary Tract Infections (UTI). More than 31000 peptides in 200 samples were quantified by DIA and analyzed by machine learning to determine an 82-peptide signature and build prediction models able to classify the fifteen bacterial species. This peptidic signature was validated for use in routine conditions using Parallel Reaction Monitoring on a capillary flow chromatography system coupled to a Thermo Scientific™ Q Exactive HF-X instrument. Linearity and reproducibility of the method were demonstrated, as well as its accuracy on donor specimens. Within 4h and without bacterial culture, our method was able to predict the predominant bacteria infecting a sample in 97% of cases, and in 100% of cases above the 1×10⁵ CFU/mL threshold commonly used by clinical laboratories.
This work demonstrates the efficiency of our method for the rapid and specific identification of the bacterial species causing UTI. The method could be extended in the future to other biological specimens and to bacteria having specific virulence or resistance factors.


INTRODUCTION
The identification of the bacterial species or strain present in a biological sample is essential in many fields of microbiology. Epidemiology, for instance, tracks the spread of microorganisms related to infectious diseases; food safety laboratories ensure the distribution of pathogen-free products to consumers; environmental bacteria have a strong impact on maintaining the equilibrium of ecosystems; and clinical laboratories require fast diagnostic methods to provide appropriate treatment to patients with a bacterial infection. However, standard methods for the identification of pathogens require a time-consuming bacterial culture followed by another long step of immunological or biochemical tests of varying duration and complexity (1)(2)(3).
During this period, which typically lasts 24 to 48h but can extend to weeks, patients receive broad-spectrum antimicrobial treatments which might not be optimal and are sometimes not even effective against the infection (4). Furthermore, inappropriate use of antimicrobial agents contributes to the selection of resistant bacteria in the whole population, thus making infections more challenging to combat (5,6). Therefore, there is a need for the development of fast and robust methods for bacterial identification, in order to improve therapy and guide rational use of antibiotics.
Genotyping methods, which are based on the sequencing of partial (16S small subunit ribosomal [rRNA] gene sequencing) or entire genomes (Whole Genome Sequencing) of the microorganisms contained in a sample, are promising since they do not require bacterial culture and can be applied to complex samples containing several species (7,8).
However, the cost and the time required to obtain an identification by sequencing methods preclude their use in routine laboratories. In addition, although 16S rRNA sequencing can provide a fairly rapid identification (typically 24 hours), the high conservation of 16S gene sequences across bacterial families and species often limits the precision of identification to the genus level (9,10). By contrast, Whole Genome Sequencing is able to provide efficient species and even strain typing, but the cost and the time required to get the results are strongly increased by the sequencing itself and by the data analysis. Moreover, this analysis requires expert scientific knowledge to provide a confident genome assembly as well as large computing resources (11,12).
In the past few years, Matrix-Assisted Laser Desorption Ionization-Time Of Flight Mass Spectrometry (MALDI-TOF MS) analysis of microbial proteins has made a breakthrough in routine labs for bacterial identification (13)(14)(15)(16). This fast, inexpensive, and automatable technology can replace the conventional phenotype-based methods, hence reducing the time required to get an identification from 2 or 4 days to less than 50 hours. For those reasons, two mass spectrometers, the Biotyper (Bruker) and the Vitek-MS (Shimadzu-BioMérieux), have been approved for clinical use by governmental health organizations of most countries, including the United States Food and Drug Administration (FDA) in 2013 (17). In the typical workflow, bacterial colonies isolated by culture are submitted to fast sample preparation (typically, a treatment with formic acid and ethanol) prior to acquisition of protein mass spectra. These spectra are used to interrogate a spectral database, which provides a confidence score for the bacterial identification, information that a physician can use to diagnose the infection.
Despite its numerous advantages, bacterial identification by MALDI-TOF MS has several drawbacks: i) it requires a lengthy culture step to isolate bacterial colonies, since the detection is based on a comparison with a spectral database acquired on pure colonies. For the same reason, it is not able to identify polymicrobial infections (i.e., when several species are present in the same sample) without analyzing several types of colonies visually selected on the culture plate; ii) because of the minimal sample preparation, the information contained in the spectra is restricted to the most abundant molecules, thus limiting the specificity of the method and its capability to identify certain species or subspecies; and iii) it is not quantitative, although quantification is potentially important for certain specimens where pathogens need to be distinguished from the normal microbiota, or when a certain level of infection needs to be reached to necessitate antibiotherapy.
To overcome the abovementioned issues, several studies have tried to improve MALDI-TOF bacterial identification (18). For instance, Clark and colleagues refined the specificity of the method to identify Escherichia coli pathotypes by examining specific peaks in the spectra (19). Other investigators have tried to improve the specificity using trypsin digestion, which gives access to a larger set of molecules and allows the generation of a Peptide Mass Fingerprint of the bacterial subspecies (20). Several studies have skipped the culture step to provide a faster identification, especially in the case of sepsis, where MALDI-TOF acquisition is performed directly from a positive blood culture sample (21,22). However, it has been shown that sample preparation methods, which are not homogeneous from lab to lab, can influence the rate of correct identifications of certain microorganisms (23).
Although these studies could improve the standard workflow, they are limited by the sensitivity and the specificity of MALDI-TOF mass spectrometers. Therefore, recent proteomics data (i.e. trypsin-digested proteins). These methods were able to reach 89 to 98.5% correct classification rates at the species level but these values have only been demonstrated after a step of bacterial growth (28,29).
Taking advantage of the sensitivity and specificity of nanoscale LC-MS/MS technology, and building on these previous studies, we developed a new pipeline using modern proteomics (Data Independent Acquisition, DIA, mode) and machine learning algorithms to identify biomarkers able to speciate a set of bacteria of interest. This strategy is based on two steps (Figure 1): i) a training step, which defines a peptidic signature for the bacteria of interest, and ii) an identification step, in which the signature is monitored by targeted proteomics to identify the bacteria in the infected samples.
Once the training step has been developed, the second step can be performed in routine laboratories on multiple samples and with any type of mass spectrometer working in PRM (Parallel Reaction Monitoring) or SRM (Selected Reaction Monitoring) modes.
As a proof of concept, this pipeline has been applied to the 15 bacterial species most frequently found in Urinary Tract Infections (UTI). Indeed, urine is the most common clinical specimen, with hundreds of samples analyzed each day in most clinical laboratories. Moreover, UTI is one of the most frequent types of infection in humans: it has been demonstrated that 50 to 60% of women in western countries will have at least one UTI in their lifetime (30 Figure 1). According to literature reports, these are the most frequently found species in UTI (30,31).
Our original method defines a peptidic signature which, when monitored by targeted proteomics, is able to detect which of the 15 bacterial species is present in the urine sample, in less than 4 hours, without any bacterial culture. We also demonstrated that the peptidic signature is transferable to other laboratories and to other mass spectrometers. In addition, we compared the efficiency of our method to the MALDI-TOF standard workflow.

Bacterial culture and counting
Bacterial strains were obtained from the Culture Collection of Centre de Recherche en

Sample preparation for spectral libraries
For the generation of bacterial spectral libraries, bacteria from 1mL of semi-log broth bacterial culture were pelleted by centrifugation at 10,000 x g for 15 minutes; the supernatant was discarded and the pellet was washed three times with 1mL of 50mM Tris and centrifuged under the same conditions. The final pellet was freeze-dried and stored at -20°C.
Pellets were then resuspended with 50mM of ammonium bicarbonate and 600 units of Each fraction was resuspended in 2% acetonitrile (ACN) / 0.05% trifluoroacetic acid (TFA) at 0.2 µg/µL and 1X iRT peptides (Biognosys) were added. An equivalent of 1µg of peptides was injected on LC-MS/MS system for each fraction of each bacterial species.

Preparation of urine samples
Urine specimens (10 mL), either from patients or from healthy urine artificially inoculated with bacteria, were treated the same way: human cells were first pelleted by low-speed centrifugation for 5 min at 1,000 x g, and the supernatant was then centrifuged at high speed for 15 min at 10,000 x g in order to collect bacteria. Bacterial pellets were then washed with 1mL of 50mM Tris and centrifuged again under the same conditions. A second cycle of wash and centrifugation was performed and the resulting pellet was freeze-dried.
Protocols for protein extraction, trypsin digestion and peptide purification are described above in the 'Sample preparation for spectral libraries' section and were modified as follows: for each sample, 50 units of mutanolysin were used and 250 ng of trypsin was added for the digestion, which was then stopped with 1µL of 100% FA; peptides were purified with StageTips (32) containing C18 reverse phase (3M Empore C18 Extraction Disks). Samples were resuspended in 10µL of 2% ACN, 0.05% TFA and 1X iRT peptides (Biognosys) were added. Half of the final volume was injected on the LC-MS/MS system.

LC-MS/MS acquisitions
Samples were analyzed by nanoLC/MS using an UltiMate™ 3000 NanoRSLC system (ThermoScientific, Dionex Softron GmbH, Germering, Germany) coupled to an Orbitrap at 1% FDR based on target/decoy search using Percolator software (33). Hoeffding Tree classifier (42). All classifiers were evaluated by several subsampling procedures to assess the robustness of the models, including 10-fold cross validation, holdout and bootstrapping. Since stepwise feature selection tends to remove all correlated features, we retrieved the discarded features that were correlated with the selected ones, according to Pearson and Spearman correlations, and that had a similar Information Gain score.
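The retrieval of correlated features described above can be illustrated with a minimal sketch. It is not the software used in this study: the function names, the thresholds (min_r, max_gain_diff) and the toy presence/absence data are hypothetical, only the Pearson correlation is computed (Spearman is omitted for brevity), and Information Gain is computed for binary features only.

```python
import math

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(feature, labels):
    """Information gain of a binary (0/1) feature with respect to the labels."""
    n = len(labels)
    gain = entropy(labels)
    for value in (0, 1):
        subset = [y for x, y in zip(feature, labels) if x == value]
        if subset:
            gain -= len(subset) / n * entropy(subset)
    return gain

def pearson(a, b):
    """Pearson correlation coefficient; returns 0.0 for a constant vector."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)

def retrieve_correlated(features, labels, selected, min_r=0.9, max_gain_diff=0.05):
    """Return discarded features that are strongly correlated with a selected
    feature and have a similar information gain (thresholds are assumptions)."""
    recovered = set()
    for name, values in features.items():
        if name in selected:
            continue
        for kept in selected:
            r = pearson(values, features[kept])
            diff = abs(information_gain(values, labels)
                       - information_gain(features[kept], labels))
            if r >= min_r and diff <= max_gain_diff:
                recovered.add(name)
                break
    return sorted(recovered)

# Hypothetical toy data: peptide presence (1) or absence (0) in six samples.
labels = ["A", "A", "B", "B", "C", "C"]
features = {
    "pepA": [1, 1, 0, 0, 0, 0],
    "pepA2": [1, 1, 0, 0, 0, 0],  # redundant with pepA
    "pepB": [0, 0, 1, 1, 0, 0],
    "pepX": [1, 0, 1, 0, 1, 0],   # uninformative
}
selected = ["pepA", "pepB"]       # pretend output of stepwise selection
recovered = retrieve_correlated(features, labels, selected)
print(recovered)
```

In this toy case, the redundant peptide pepA2 is recovered because it carries the same information as pepA, whereas the uninformative pepX is not.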

Peptides selection and signal extraction in DIA analyses
The feature subsets obtained from the three models were then merged, and manual curation was used to remove the peptides that were also observed in blank samples. The final curated signature was used to train a Bayesian Network prediction model. All new samples were analyzed using this predictive model.
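To make the prediction step more concrete, the following sketch trains a Bernoulli naive Bayes classifier on peptide presence/absence vectors. Naive Bayes is only the simplest form of Bayesian network and is used here as a stand-in: the structure of the network actually trained in this study is not described in this section. The four-peptide signature, the training samples and the smoothing parameter alpha are hypothetical.

```python
import math

def train_naive_bayes(samples, labels, alpha=1.0):
    """Estimate class priors and per-peptide presence probabilities.
    samples: list of 0/1 lists (one value per signature peptide).
    alpha: Laplace smoothing constant (an assumption of this sketch)."""
    classes = sorted(set(labels))
    n_features = len(samples[0])
    model = {"classes": classes, "log_prior": {}, "p_present": {}}
    for c in classes:
        rows = [s for s, y in zip(samples, labels) if y == c]
        model["log_prior"][c] = math.log(len(rows) / len(samples))
        model["p_present"][c] = [
            (sum(r[j] for r in rows) + alpha) / (len(rows) + 2 * alpha)
            for j in range(n_features)
        ]
    return model

def predict_proba(model, sample):
    """Posterior probability of each class for one presence/absence vector."""
    log_scores = {}
    for c in model["classes"]:
        score = model["log_prior"][c]
        for x, p in zip(sample, model["p_present"][c]):
            score += math.log(p if x else 1.0 - p)
        log_scores[c] = score
    top = max(log_scores.values())  # subtract the maximum for numerical stability
    total = sum(math.exp(v - top) for v in log_scores.values())
    return {c: math.exp(v - top) / total for c, v in log_scores.items()}

def predict(model, sample):
    """Most probable class for one presence/absence vector."""
    proba = predict_proba(model, sample)
    return max(proba, key=proba.get)

# Hypothetical training set: 4 signature peptides, 3 classes.
train_x = [
    [1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0],   # E. coli
    [0, 1, 1, 0], [0, 1, 1, 1], [0, 1, 1, 0],   # K. pneumoniae
    [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0],   # blank urine
]
train_y = ["E. coli"] * 3 + ["K. pneumoniae"] * 3 + ["blank"] * 3
model = train_naive_bayes(train_x, train_y)

print(predict(model, [1, 1, 0, 0]))
print(predict_proba(model, [0, 1, 1, 0]))
```

Note that the second peptide is shared by both species in this toy example: the prediction relies on the overall pattern rather than on any single species-specific peptide.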

Peptidic signature validation and bacterial identification prediction
After PRM analysis using the Orbitrap Fusion or the Q-Exactive HF-X instrument, the

Experimental Design and Statistical Rationale
In order to obtain a high-quality peptidic signature using machine learning algorithms, 9 high-level and 3 low-level inoculation replicates of each bacterial species were used. For the validation of the method in targeted proteomics, four different biological replicates of each bacterial inoculation in urine were monitored in two different analysis conditions. Finally, urine samples from 27 different patients were used to compare the method to conventional MALDI-TOF analysis. Prediction accuracies were reported.

RESULTS
Our workflow for bacterial identification is composed of two steps: i) a training step, which includes the acquisition of information on peptides expressed by the bacteria of interest using LC-MS/MS in Data Independent Acquisition (DIA) mode, followed by the generation of a short peptidic signature by machine learning models, and ii) an identification step, in which the signature is monitored in unknown samples by PRM to obtain a bacterial identification through a prediction algorithm (Figure 1).
For the training step, in order to detect minor bacterial peptide signals in the human protein background, we used DIA acquisition on an Orbitrap Fusion instrument operating at nanoflow rate, because of its high sensitivity and its ability to provide deep coverage of bacterial proteomes by acquisition of all peptides contained in the sample (43,44).
Indeed, in contrast to DDA, which uses a full scan MS for the detection of peptide species, the DIA mode, by systematic acquisition of small windows across the whole mass range, improves the dynamic range and, thus, the sensitivity of the analysis. However, the simultaneous fragmentation of all peptides within each window generates a complex spectrum which cannot be searched with conventional database search engines and needs to be deconvoluted to get peptide identifications.
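The principle of DIA window acquisition can be sketched as follows. The mass range (400 to 1000 m/z), the window width (20 m/z) and the precursor values are hypothetical and are not the acquisition parameters of this study; the sketch only shows how a fixed window scheme covers the mass range and why several peptides end up fragmented together.

```python
def dia_windows(mz_start, mz_end, width, overlap=0.0):
    """Tile the range [mz_start, mz_end] with isolation windows of fixed width."""
    if width <= 0 or not 0 <= overlap < width:
        raise ValueError("width must be positive and overlap smaller than width")
    windows = []
    step = width - overlap
    low = mz_start
    while low < mz_end:
        high = min(low + width, mz_end)
        windows.append((low, high))
        if high >= mz_end:
            break
        low += step
    return windows

def co_isolated(windows, precursors):
    """Group precursors (name -> m/z) by the windows that contain them.
    Precursors sharing a window are fragmented together, which is what
    produces the complex (chimeric) DIA spectra."""
    groups = {}
    for name, mz in precursors.items():
        for low, high in windows:
            if low <= mz < high:
                groups.setdefault((low, high), []).append(name)
    return groups

# Hypothetical scheme: 400-1000 m/z covered by 20 m/z windows.
windows = dia_windows(400, 1000, 20)
print(len(windows), windows[0], windows[-1])

# Hypothetical precursors: pepA and pepB fall into the same window.
precursors = {"pepA": 501.3, "pepB": 512.8, "pepC": 730.1}
groups = co_isolated(windows, precursors)
print(groups)
```

Because every window is fragmented in every cycle, no precursor is missed, but the fragment ions of pepA and pepB are recorded in the same spectrum and have to be separated afterwards.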

Acquisition of bacteria spectral libraries
One of the proposed approaches to extract information from the DIA complex spectra is to use spectral libraries previously acquired in DDA mode on the same type of sample and annotated with peptide/protein identifications through a protein database search (43). In our study, we have generated these spectral libraries from pure bacterial colonies in order to be as exhaustive as possible and cover a very wide range of bacterial tryptic peptides, and subsequently be able to extract this specific bacterial peptide information from the DIA complex spectra contaminated with human biological material.   Figure 2a).
This redundancy, combined with our reproducibility filters, showed that it is not possible to select from these data one or several specific peptides for each bacterial species that would specifically indicate the presence of that species in urine. Indeed, not enough specific peptides are available when working with this large number of bacteria (i.e. 15) (Supplementary Figure 2b and 2c).
Thus, we aimed to define a set of peptides that could be shared by several species but which, taken together, form a particular pattern for each bacterial species to be identified. Our strategy for obtaining this 'peptidic signature' was to combine deep proteome coverage with machine learning algorithms.
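The idea of a signature made of shared peptides can be illustrated with a small sketch. The species names and peptide identifiers below are hypothetical: each peptide is shared by two species, so no peptide is specific on its own, yet the presence/absence pattern is different for every species.

```python
from itertools import combinations

def species_specific(detected, species):
    """Peptides detected in the given species and in no other species."""
    others = set()
    for name, peptides in detected.items():
        if name != species:
            others |= peptides
    return detected[species] - others

def patterns_are_distinct(detected):
    """True if no two species share exactly the same set of peptides."""
    return all(detected[a] != detected[b] for a, b in combinations(detected, 2))

def min_pattern_distance(detected):
    """Smallest number of peptides that differ between any two species."""
    return min(len(detected[a] ^ detected[b]) for a, b in combinations(detected, 2))

# Hypothetical detection table: species -> peptides observed.
detected = {
    "species_1": {"pep1", "pep2"},
    "species_2": {"pep2", "pep3"},
    "species_3": {"pep1", "pep3"},
}

for name in detected:
    print(name, sorted(species_specific(detected, name)))
print(patterns_are_distinct(detected), min_pattern_distance(detected))
```

Here every species has an empty list of specific peptides, but any two species still differ by two peptides, which is enough for a pattern-based classifier to separate them.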

Data Independent Analysis of artificially inoculated urine replicates
In order to define a peptidic signature of 15 bacterial species in the human urine background, we generated 12 artificial sample replicates for each species of our selection by inoculating urine from healthy volunteers with bacterial culture. Two concentration levels were used approximately set at 1x10 6  are always a few peptides to distinguish them (4 and 7 peptides respectively in these two cases). For some very low concentration replicates, a few peptides found in high concentration replicates were not detected. This loss affected the ability of the algorithm to predict the bacteria in only 15% of the tested low concentration replicates. Conversely, some false positive peptide detections were also observed, probably due to peak picking errors by Skyline in DIA runs, but they did not interfere with the bacterial prediction, which attests to the robustness of the Bayesian Network model. As expected, most of the peptides composing the signature belong to relatively abundant proteins such as ribosomal proteins (e.g. 50S ribosomal protein L10, 30S ribosomal protein S5) or enzymes involved in amino acid metabolism (e.g. formate acetyltransferase) and glycolysis (e.g. GAPDH, pyruvate kinase) (46).

Validation of the signature by targeted proteomics
Since the machine learning algorithm identified a short list of peptides allowing the discrimination of the 15 bacteria of interest, this list can now be monitored by targeted proteomics, which is known to give better reproducibility of measurements and better sensitivity in peptide detection, and could thus improve the limit of detection of bacterial species in urine. The information on presence or absence of each of the 82 peptides of the signature is then given to the prediction model to obtain a probability of contamination. This step corresponds to the Identification step of our pipeline (Figure 1).  Table 3). The samples were processed as

Transfer of the signature on different instruments
To demonstrate that the signature initially designed on an Orbitrap Fusion instrument coupled to a nanoflow chromatography system is transferable to other instruments in other labs, we analyzed the same inoculated urines of healthy volunteers (four different bacteria, five concentrations) in PRM mode on a Q-Exactive HF-X instrument coupled to capillary chromatography. Indeed, chromatography at higher flow rate (1 µL/min) improves the robustness of peptide separation and detection. To reduce the turnaround time between sample collection and bacterial identification as much as possible, the chromatographic gradient was also shortened from 90 minutes on the Orbitrap Fusion to 30 minutes on the Q-Exactive HF-X. As for the Orbitrap Fusion data, the data collected from the Q-Exactive HF-X were analyzed using Skyline with the same validation criteria, and the resulting list of detected peptides and their intensities was used by the prediction algorithm (Figure 4a, Supplementary Figure 6 and Supplementary Table 4).
Thus, in 94% of the cases, the method allowed the correct prediction of the bacteria initially inoculated in the samples. Errors were found only for some of the two lowest concentration points and, in all cases except one, the sample was predicted as blank.
Again, when looking at the data above the clinical threshold of 1×10⁵ CFU/mL, the accuracy was significantly improved to reach 100%.  Table 3).
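The way such prediction accuracies are computed, overall and restricted to samples at or above the threshold commonly used by clinical laboratories, can be sketched as follows. The records below are invented for illustration and are not the data of this study; treating the threshold as inclusive (>= 1e5 CFU/mL) is also an assumption of this sketch.

```python
CLINICAL_THRESHOLD = 1e5  # CFU/mL

def accuracy(records, min_cfu_per_ml=None):
    """Fraction of samples whose predicted species matches the inoculated one.
    If min_cfu_per_ml is given, only samples at or above it are scored."""
    if min_cfu_per_ml is not None:
        records = [r for r in records if r["cfu_per_ml"] >= min_cfu_per_ml]
    if not records:
        raise ValueError("no records to score")
    correct = sum(1 for r in records if r["predicted"] == r["inoculated"])
    return correct / len(records)

# Invented example: two species at four concentrations; the two lowest
# concentration samples are wrongly predicted as blank.
toy_records = [
    {"cfu_per_ml": 1e3, "inoculated": "E. coli", "predicted": "blank"},
    {"cfu_per_ml": 1e4, "inoculated": "E. coli", "predicted": "E. coli"},
    {"cfu_per_ml": 1e5, "inoculated": "E. coli", "predicted": "E. coli"},
    {"cfu_per_ml": 1e6, "inoculated": "E. coli", "predicted": "E. coli"},
    {"cfu_per_ml": 1e3, "inoculated": "K. pneumoniae", "predicted": "blank"},
    {"cfu_per_ml": 1e4, "inoculated": "K. pneumoniae", "predicted": "K. pneumoniae"},
    {"cfu_per_ml": 1e5, "inoculated": "K. pneumoniae", "predicted": "K. pneumoniae"},
    {"cfu_per_ml": 1e6, "inoculated": "K. pneumoniae", "predicted": "K. pneumoniae"},
]

overall = accuracy(toy_records)
above_threshold = accuracy(toy_records, CLINICAL_THRESHOLD)
print(f"overall: {overall:.0%}, at or above threshold: {above_threshold:.0%}")
```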
Again, the good results in terms of linearity and reproducibility obtained on the Q-Exactive HF-X instrument suggest its potential use for quantification of bacteria in urine.
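As an illustration of how linearity and reproducibility can be assessed, the sketch below fits a straight line to log-transformed peptide intensities versus log-transformed bacterial concentrations and computes a coefficient of variation across replicates. The exact procedure used in this study is not given in this section, so the log-log fit is an assumption, and all numerical values are invented.

```python
import math
import statistics

def linear_fit(x, y):
    """Ordinary least-squares fit y = slope * x + intercept, with R squared."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (slope * a + intercept)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1.0 - ss_res / ss_tot

def cv_percent(values):
    """Coefficient of variation (sample standard deviation / mean), in percent."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

# Invented dilution series: five concentrations (CFU/mL) and peptide intensities.
concentrations = [1e3, 1e4, 1e5, 1e6, 1e7]
intensities = [2e4, 2e5, 2e6, 2e7, 2e8]

log_c = [math.log10(c) for c in concentrations]
log_i = [math.log10(i) for i in intensities]
slope, intercept, r2 = linear_fit(log_c, log_i)
print(f"slope={slope:.3f}, intercept={intercept:.3f}, R2={r2:.4f}")

# Invented replicate intensities for one peptide at one concentration.
replicate_cv = cv_percent([90.0, 100.0, 110.0])
print(f"CV={replicate_cv:.1f}%")
```

A slope close to 1 on the log-log scale indicates that the signal is proportional to the bacterial load, and a low coefficient of variation across replicates indicates good reproducibility.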

Validation on patient samples and comparison to MALDI-TOF MS
In order to validate our method in comparison to conventional MALDI-TOF analysis, Among the 15 bacterial species detectable by our method, many have a quite low frequency in UTI and were not found in the urines tested. In order to validate the detection of these species with our method in comparison with MALDI-TOF, we inoculated healthy urines with each of the 15 bacteria (Supplementary Table 3) and analyzed them with both pipelines. Our method found the correct inoculated bacterium in 100% of cases, while MALDI-TOF reported 2 errors (Figure 5b and Supplementary Tables 4 and 5). This lack of specificity in MALDI-TOF analysis might also explain some of the discrepancies observed on patient samples.

DISCUSSION
In this study, we developed a new strategy combining proteomics and machine learning for a fast, specific and accurate detection and identification of bacterial species present in urine without the need for time-consuming bacterial culture. We successfully applied our pipeline to the 15 bacterial species most commonly found in urine samples and obtained, in less than 4 hours, high rates of prediction accuracy, especially above the quantitative threshold commonly used by clinical laboratories to consider a urine sample infected and requiring antibiotherapy.
This proof of concept could pave the way to the development of new peptide signatures for the analysis of other types of clinical specimens (bronchoalveolar lavage (BAL), stool, hemoculture…) (47)(48)(49), but also for the detection of foodborne or waterborne pathogens (50,51), to reduce the turnaround time required to obtain a genus- and/or species-specific identification of microorganisms by classical or molecular microbiology methods or MALDI-TOF mass spectrometry. For some of these applications, without any culture, the sensitivity of the method might be too low to detect the bacteria, but a short-term culture in a liquid medium is expected to be enough to reach a detectable level (<1×10³ CFU/mL for certain peptides) without the isolation of colonies on a culture plate.
In all cases, the high specificity of the method, due to a fine selection of the signature peptides, is a great improvement over what can be obtained with other standard methods such as MALDI-TOF mass spectrometry or 16S rRNA sequencing. This would be particularly valuable for the epidemiological surveillance of specific pathogens, instead of relying on expensive and time-consuming whole genome sequencing (52,53).
Moreover, the linearity and reproducibility of our method were evaluated and the obtained results suggest that the method could be used for quantification of bacterial Finally, we anticipate that the constant improvement in sensitivity, mass accuracy and acquisition speed of mass spectrometers will contribute in the future to improve the limit and precision of specific bacterial strains detection, making even more relevant the use of LC-MS/MS methods in microbiology.