Bacterial species identification using MALDI-TOF mass spectrometry and machine learning techniques: A large-scale benchmarking study

Graphical abstract


Introduction
In the last decade, matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) has emerged as a novel identification method for a wide range of bacterial species. Compared to sequencing a bacterial genome, it is a cheaper and faster alternative for situations where a lot of strains need to be analyzed on a daily basis [1,2]. Today, MALDI-TOF MS is routinely used in clinical laboratories for the identification of pathogenic strains, but it is also increasingly applied to identify environmental and food-related microbiota [3][4][5][6][7][8][9][10]. MALDI-TOF MS generates mass spectra that quantify mostly ribosomal proteins and peptides present in a purified culture. As these proteins are highly specific for a bacterial species, mass spectra can be seen as species-specific fingerprints, allowing for an accurate identification of purified strains at the genus and species level [11,12].
The analysis of a sample typically starts with isolating a set of strains using different cultivation conditions, followed by a purification step, in which biomass is amplified to a sufficient level [13][14][15]. After cultivation, which is the most time-consuming part of the analysis, microbial cells or crude extracts thereof are deposited onto a metal target plate and subsequently overlaid or mixed with an appropriate organic matrix solution. As a last step, the unlabeled spectrum of a novel strain is identified using software tools, which typically consist of one or several machine learning algorithms as their most important building blocks. In the present work we focus on this last step of the identification process.
In a recent survey, Weis et al. [16] discussed 36 studies that deployed machine learning algorithms for bacterial species identification and antimicrobial susceptibility testing using MALDI-TOF MS. Remarkably, the vast majority of these studies apply off-the-shelf classification methods in combination with relatively small datasets, typically containing fewer than a thousand samples and a very small number of species, often limited to a single family or even a single genus. This can be attributed to the specific interest of individual research labs and to the substantial labour cost of collecting a large dataset. Yet the business model of MALDI-TOF MS manufacturers such as Bruker Daltonik GmbH & Co. KG (Germany, http://www.bruker.com/) and bioMérieux (France, http://www.biomerieux.com/) also plays an important role. In addition to mass spectrometers, those vendors sell software packages that have access to private databases with annotated spectra [17]. As soon as research labs bring a new dataset into the public domain, this dataset is acquired by industrial players and appended to their commercial databases. As a result, commercial software slowly becomes the gold standard, but the black-box nature of this software poses serious threats w.r.t. transparency, reliability and reproducibility.
The goal of the present paper is to provide an independent evaluation of machine learning methods for bacterial species identification with MALDI-TOF MS. To this end, we consider three different identification scenarios that are of practical relevance.
In a first scenario, we assume that at test time only novel biological replicates of strains seen during training are analyzed, and one therefore expects a good identification performance.
In a second and perhaps more realistic scenario, where novel strains of previously-encountered species also need to be identified at test time, one can expect a drop in performance compared to the first scenario. In a third scenario, for which specific methods are required, we investigate how well novel species can be detected at test time. For those three scenarios we make the following contributions:
1. For the first two scenarios, we present benchmarking results on an unprecedented scale, using several datasets that together contain almost 100,000 spectra from strains representing more than 1000 species. We experiment with classical machine learning algorithms as well as more recent classification methods, such as deep learning and hierarchical classification methods. These last two groups of methods only work well for large-scale datasets and have not yet been applied to MALDI-TOF MS data.
2. The hierarchical classification methods that we implemented utilize phylogenetic information in the form of a taxonomic hierarchy. By comparing the identification performance of such methods with that of methods that ignore phylogenetic information, we also investigate to what extent evolutionary relationships among species are preserved in MALDI-TOF MS data.
3. For the novel species scenario, the goal is to detect whether a MALDI-TOF MS sample represents a species that is not present in the training dataset. This scenario inherently translates to an out-of-distribution detection task (i.e., binary classification), so the off-the-shelf classification methods of the previous scenarios are not directly applicable. We present promising results with dropout-based neural networks, which have recently become very popular in areas like computer vision [18,19].
The structure of the present manuscript is straightforward. In Section 2, we give an overview of the datasets and experiments that are considered in this work. We also discuss in detail the preprocessing and machine learning methods that are used to address the above research questions. In Section 3, we present numerical results for the three different scenarios. In Section 4, we discuss the implications of these results for the community, and we make a connection with previous studies. In Section 5, a short conclusion is formulated.

Datasets
In order to address the above research questions, we will work with three datasets, referred to as the global dataset (GD), lyopreservation dataset (LYO) and additional test dataset (TEST). In what follows we provide information on how the three datasets were generated.
The global dataset is a historical database that contains samples for more than 2400 taxonomic reference strains, representing more than 1000 bacterial species, cultured and analyzed in the last decade by researchers in the Laboratory for Microbiology LM-UGent (Ghent University, Belgium). Strains are considered reference strains in bacterial taxonomy when they are chosen as type specimen during the formal description and naming of novel bacterial species, or when they have been included in comprehensive taxonomic studies [20]. The reference strains included in the global dataset were identified using state-of-the-art identification methods available at the time of isolation and subsequently stored in the BCCM/LMG Bacteria Collection. More precisely, for the recent deposits in the BCCM/LMG Bacteria Collection, strain information was acquired by means of whole genome sequencing. Yet, for the historically-deposited strains, other techniques were used for identification, such as DNA-DNA hybridization or housekeeping gene sequence analysis.
All strains were cultivated according to the provider's instructions to a third generation. Afterwards, bacterial cell extracts (1 µl) were spotted seven to eight times on a target plate (Bruker Daltonik GmbH & Co. KG, Germany) and air-dried at room temperature. Each sample spot was overlaid with 1 µl of matrix solution (10 mg/ml α-cyano-4-hydroxycinnamic acid in acetonitrile-water-trifluoroacetic acid [TFA] [50:47.5:2.5]). Each target plate comprised one spot of pure matrix solution, used as a negative control, and one spot of Bacterial Test Standard (Bruker Daltonik GmbH & Co. KG, Germany), used for calibration. The target plate was measured automatically four times on a Bruker Microflex LT/SH (Smart) system (Bruker Daltonik GmbH & Co. KG, Germany), thus obtaining a total of 28 to 32 technical replicate spectra for each strain. The spectra were obtained in linear, positive ion mode using FlexControl (v3.4) software according to the manufacturer's recommended settings (Bruker Daltonik GmbH & Co. KG, Germany). Each final spectrum resulted from the sum of the spectra generated at random positions, to a maximum of 240 shots per spectrum.
The lyopreservation dataset is a small dataset that was released in 2019 [21]. This dataset can be subdivided into two subsets, further denoted as LYO1 and LYO2. Both subsets contain the same strains, but MALDI-TOF mass spectra were generated before and after lyophilization and subcultivation of the strains. This makes this dataset ideally suited to evaluate the performance of MALDI-TOF MS for the scenario where novel biological replicates of previously-encountered strains need to be identified. All samples of the lyopreservation dataset were analyzed using the same protocol as the global dataset.
The additional dataset TEST contains MALDI-TOF mass spectra generated from isolates obtained in a recent microbial diversity project, in which the bacterial and yeast fractions of a food sample were studied. MALDI-TOF MS data acquisition was performed as described above for the global dataset GD, except that microbial extracts were spotted in duplicate and measured only once. For this dataset, ground-truth species labels were assigned based on the highest match obtained after comparing the generated mass spectra with those present in the BDAL (Bruker Daltonik GmbH & Co. KG, Germany) and LM-UGent in-house identification databases. Mass spectra that did not match with a Bruker log-score higher than 1.8 were left out of the final dataset. This dataset contains mass spectra generated from isolates of species that are not represented in the global dataset GD. Some isolates were classified into Pichiaceae, Saccharomycodaceae and Cellulomonadaceae, three families that are not represented in the global dataset GD. Others belonged to species that are not present in the global dataset, yet were classified into Acetobacteraceae, a family that occurs frequently in the global dataset GD.

Data preprocessing
In what follows, let us denote a dataset by D = {(S_i, y_i)}_{i=1,...,N}, consisting of N MALDI-TOF mass spectra, in which the i-th sample S_i has species label y_i. A MALDI-TOF mass spectrum of length L_i is formally represented as a set S_i = {(x_j, λ_j)}_{j=1,...,L_i}, with λ_j the measured intensity value for a given m/z ratio x_j. With this notation we emphasize that different spectra may vary in length.
Most of the machine learning methods described in the literature, and more precisely the ones considered in this work, require MALDI-TOF mass spectra with a fixed-length representation as input. As such, the first preprocessing step to be considered is the so-called binning step. This step consists of (i) dividing the m/z dimension into intervals (or bins) and (ii) aggregating the intensity values in each obtained interval, for each spectrum S_i. For aggregation, we choose to work with the maximum of the intensities, as this is frequently used in the literature [22]. The next steps in our preprocessing pipeline are the following transformations: (i) baseline correction, where an estimated baseline in each spectrum is removed using the asymmetric least squares (ALS) method, and (ii) total ion current (TIC) normalization, where each intensity λ_ij is divided by the total ion current, λ_ij → λ_ij / Σ_j λ_ij, such that the intensities of every spectrum S_i sum to one [23][24][25][26].
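As an illustration, the binning and TIC normalization steps can be sketched as follows (a minimal NumPy sketch; the function names, the 2,000-20,000 m/z range and the default bin size are our own choices for illustration, and the ALS baseline correction step is omitted for brevity):

```python
import numpy as np

def bin_spectrum(mz, intensity, bin_size=5.0, mz_min=2000.0, mz_max=20000.0):
    """Aggregate a variable-length spectrum into fixed-width bins,
    keeping the maximum intensity per bin (the aggregation used in the text)."""
    n_bins = int((mz_max - mz_min) / bin_size)
    binned = np.zeros(n_bins)
    idx = ((np.asarray(mz) - mz_min) / bin_size).astype(int)
    mask = (idx >= 0) & (idx < n_bins)
    np.maximum.at(binned, idx[mask], np.asarray(intensity)[mask])
    return binned

def tic_normalize(binned):
    """Divide by the total ion current so that the intensities sum to one."""
    total = binned.sum()
    return binned / total if total > 0 else binned
```

Applied in sequence, these two functions turn a variable-length list of (m/z, intensity) pairs into a fixed-length vector that the classifiers below can consume.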

Flat classification methods
In general, bacterial identification can be translated into a multi-class classification problem, where the goal is to assign the correct class, i.e. a taxonomic label such as strain, species, genus or family, to a given spectrum. More formally, let us assume that the given dataset D is drawn from a distribution P(S, c) on X × Y, with X the instance space consisting of spectra and Y = {c_1, ..., c_K} a class space consisting of K classes. Furthermore, we will assume probabilistic multi-class classifiers, which estimate conditional class probabilities P(· | S) over Y, with the properties 0 ≤ P(c | S) ≤ 1 for all c ∈ Y and Σ_{c ∈ Y} P(c | S) = 1. Bearing in mind the availability of taxonomic information, the corresponding classification problem w.r.t. Y can be solved using flat or hierarchical classification models. A flat classification model typically denotes a model that ignores hierarchical information. In contrast, a hierarchical classification model utilizes hierarchical information for classification purposes. In the experiments, we analyzed the following flat classifiers, which can be applied in a large-scale classification setting: random forests (RF), logistic regression (LR), a support vector classifier with linear kernel (LSVC), k-nearest neighbours (KNN) and a one-dimensional convolutional neural network (1DCNN), which consists of one-dimensional convolutional, batch normalization and max-pool layers. Details on how those methods were configured are given below.
In principle, one can generate conditional class probabilities with all the methods that are considered. However, for certain methods, such as random forests and k-nearest neighbours, those probabilities may not be well-calibrated. At test time, a species label ŷ is assigned to a novel sample S by returning the mode of the conditional class distribution: ŷ = argmax_{c ∈ Y} P(c | S). Since only the mode of the conditional class distribution is needed, the probabilities themselves do not need to be well-calibrated.
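A toy illustration of this decision rule (the species labels and probability values here are hypothetical, not taken from the datasets):

```python
import numpy as np

# Hypothetical conditional class probabilities P(c | S) for two test
# spectra over three species labels; each row sums to one.
classes = np.array(["L. garvieae", "L. lactis", "B. subtilis"])
proba = np.array([
    [0.70, 0.20, 0.10],
    [0.15, 0.25, 0.60],
])

# Identification returns the mode of each distribution; whether the
# probabilities are calibrated is irrelevant for the argmax.
predictions = classes[np.argmax(proba, axis=1)]
```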

Hierarchical classification methods
In order to implement hierarchical classification models, we obtained a phylogenetic tree from the NCBI Taxonomy Database [27,28]. We ran experiments with the local classifier per parent node (LCPN) approach [29]. This approach underlies many popular algorithms such as nested dichotomies [30][31][32], conditional probability estimation trees [33], probabilistic classifier trees [34] and hierarchical softmax [35], often used in neural networks as an output layer. In a basic implementation, a multi-class classifier is trained in each parent node (i.e., all nodes except the leaves) within the taxonomy. For example, for the taxonomy represented as a binary tree in Fig. 1, one would need to train seven separate base learners. Each base learner is trained to distinguish between its child nodes. When probabilistic classifiers are used as base learners, a hierarchical factorization of the conditional class distribution P(c | S) is learned, where the conditional class probability for a particular leaf node can be expressed via the chain rule of probability: P(c | S) = Π_{v ∈ Path(c)} P(v | Parent(v), S), where Path(c) is the set of nodes on the path connecting the leaf c and the root of the tree structure, Parent(v) gives the parent of node v, and for the root node r we have P(r | Parent(r), S) = 1.
In the experiments, we implemented the same base learners as used for flat classification, i.e., random forests (RF), logistic regression (LR), a support vector classifier with linear kernel (LSVC), k-nearest neighbours (KNN) and a one-dimensional convolutional neural network (1DCNN). At test time, a top-down approach is used to determine the most likely species label: starting from the root node, the child node with the highest probability is chosen, and this process is repeated until a leaf node is reached. Every leaf node corresponds to a species.
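A minimal sketch of the top-down inference step (the taxonomy and per-node probabilities below are invented for illustration; a real LCPN implementation would query the trained base learner in each node instead of a fixed table):

```python
# Each internal node maps its children to the conditional probabilities
# P(child | node, S) for one fixed spectrum S.
taxonomy = {
    "root":        {"Firmicutes": 0.9, "Proteobacteria": 0.1},
    "Firmicutes":  {"Lactococcus": 0.3, "Bacillus": 0.7},
    "Lactococcus": {"L. lactis": 0.6, "L. garvieae": 0.4},
    "Bacillus":    {"B. subtilis": 0.8, "B. cereus": 0.2},
}

def predict_top_down(node="root"):
    """Follow the most probable child until a leaf (species) is reached;
    the path probability is the product of the chain-rule factors."""
    prob = 1.0
    while node in taxonomy:
        children = taxonomy[node]
        node = max(children, key=children.get)
        prob *= children[node]
    return node, prob
```

Note that this greedy top-down traversal does not necessarily return the leaf with the globally highest product of probabilities, but it only requires evaluating one base learner per level.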

Novel species detection methods
As discussed above, we consider the scenario where new species, not yet seen during training, may arrive at test time. In machine learning jargon, the corresponding task is often referred to as out-of-distribution detection [36][37][38][39], open-set recognition [40][41][42] or anomaly detection [43]. Many novel methods for such tasks have been introduced recently, but none of them have been applied to bacterial species identification yet. In this work, we ran experiments with neural networks and the Monte Carlo dropout method [18]. In this method, dropout is applied both during training and testing of neural networks, and it can be seen as an approximation to exact Bayesian inference in Bayesian neural networks. In contrast to standard neural networks, where point estimates of the weights are learned by maximizing the likelihood of the data, Bayesian neural networks are trained by learning distributions over weights [44]. By using approximate Bayesian inference, the aim is to detect novel species by analyzing the total uncertainty in predictions. In exact Bayesian inference, given a new spectrum S*, one can measure the total uncertainty in a prediction by means of the Shannon entropy of the posterior predictive distribution [45]:

H[P(y | S*, D)] = - Σ_{c ∈ Y} P(c | S*, D) log P(c | S*, D).   (1)

The posterior predictive distribution can be written as follows:

P(y | S*, D) = ∫_W P(y | S*, w) P(w | D) dw,   (2)

with the posterior distribution over the weights given by Bayes' rule:

P(w | D) = P(D | w) P(w) / P(D),   (3)

and P(D | w) the likelihood of the data. Note that in the above notation, a probabilistic classifier is parametrized by a vector of weights w. Unfortunately, calculating Eq. 2 is intractable due to the integration over the weight space W and the calculation of the posterior distribution P(w | D), which in turn is also intractable due to the denominator in Eq. 3. However, one can approximate the posterior predictive distribution by a finite ensemble {w_1, ..., w_M}, obtained by learning different classifiers on M bootstrap samples or by using the Monte Carlo dropout technique in neural networks at test time [46,18]. The posterior predictive distribution is then approximated as follows:

P(y | S*, D) ≈ (1/M) Σ_{m=1}^{M} P(y | S*, w_m).   (4)

Subsequently, one can approximate the total uncertainty by plugging Eq. 4 into Eq. 1:

H[P(y | S*, D)] ≈ - Σ_{c ∈ Y} [(1/M) Σ_{m=1}^{M} P(c | S*, w_m)] log [(1/M) Σ_{m=1}^{M} P(c | S*, w_m)].   (5)

In the experiments, we use the total uncertainty of Eq. 5 to identify out-of-distribution samples, similarly to [46,47]. Ideally, in the dataset TEST, those samples should correspond to species that are not present in the dataset GD. For the Monte Carlo dropout method, we use the 1DCNN model with the number of Monte Carlo samples set to ten (M = 10). More information on implementation details is given below.
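The approximation in Eqs. 4-5 can be sketched as follows (a minimal NumPy sketch; `member_probs` would hold the M softmax outputs of the dropout-enabled forward passes, or of the bootstrap ensemble members):

```python
import numpy as np

def total_uncertainty(member_probs):
    """Shannon entropy of the averaged predictive distribution (Eq. 5).

    member_probs has shape (M, K): one class distribution per Monte
    Carlo forward pass or ensemble member."""
    mean = member_probs.mean(axis=0)            # plug-in estimate of Eq. 4
    mean = np.clip(mean, 1e-12, 1.0)            # guard against log(0)
    return float(-(mean * np.log(mean)).sum())  # entropy of Eq. 1
```

An ensemble whose members all agree on one class yields a low total uncertainty, while an ensemble whose members confidently disagree averages to a near-uniform distribution and yields an entropy close to log K, which is the behaviour exploited for out-of-distribution detection.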

Experimental setup and hyperparameter tuning
Different training, validation and test sets were used for the three identification scenarios that were outlined in the introduction; see Table 1 for an overview. To evaluate the novel strains identification scenario, the global dataset is split into three datasets, GD_train, GD_val and GD_test, in such a way that there is no overlap in terms of strains. More precisely, for each species in GD, we distribute the set of unique strains over the three datasets, such that each strain is represented in only one of the three datasets. Species for which only one strain has been collected are added to the training set GD_train, in order to differentiate this setting from the novel species scenario. GD_train is only used for training the various machine learning models, and the hyperparameters of these models are tuned on GD_val. GD_test is used to evaluate the identification performance in the novel strains scenario. For the scenario where novel biological replicates need to be identified, the LYO2 dataset is used as test set, and the LYO1 dataset is appended to GD for training. For the novel species detection scenario, the TEST dataset is used for testing and GD_train for training. Table 2 presents summary statistics of the datasets, and Fig. 2 shows the frequency distribution of the top-7 families for GD_test, LYO2 and TEST. One can observe a high imbalance in terms of species occurrence, and the three datasets clearly follow different distributions.
We use the novel strains experiment to test the different preprocessing steps: binning with bin size 5, 10 and 20, combined with either no transformation, baseline correction only (ALS), or baseline correction followed by total ion current normalization (ALS + TIC). This results in nine different preprocessing combinations. Subsequently, for each of those combinations we perform hyperparameter tuning by means of randomized grid search [48], where we use five randomly sampled parameter settings for each model, on the training (GD_train) and validation (GD_val) splits of the global dataset. The optimal preprocessing combination and hyperparameters, obtained in the novel strains experiments, are then used throughout the other experiments. More specifically, for RF we tune the number of trees ({50, 100, 150}), the number of features to consider in each split, the maximum depth of the trees ({30, 90, heuristic-based}), the minimum number of samples required to split an internal node ({2, 5, 10, 15, 20}) and the minimum number of samples required at a leaf node ({1, 2, 5, 10, 15}). For LR, we consider a linear classifier with L2-regularized cross-entropy loss and stochastic gradient descent optimization. Early stopping is applied after five iterations and the maximum number of iterations during optimization is set to 200 epochs. We tune the regularization strength ([0.001, ..., 1.0]) and learning rate ([0.001, ..., 1.0]). For LSVC, we again consider L2 regularization and tune the regularization strength ([0.001, ..., 1000]). For KNN, we tune the number of neighbours ([1, ..., 20]) and consider the Euclidean, Manhattan and Chebyshev distance metrics.
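The randomized search step can be sketched as follows (a minimal sketch; only the RF grid values listed above are included, and the helper names are our own):

```python
import random

# Grids for RF, taken from the values listed in the text.
rf_grid = {
    "n_estimators": [50, 100, 150],
    "min_samples_split": [2, 5, 10, 15, 20],
    "min_samples_leaf": [1, 2, 5, 10, 15],
}

def sample_settings(grid, n=5, seed=0):
    """Draw n random parameter settings (one value per hyperparameter),
    as done for each model and preprocessing combination."""
    rng = random.Random(seed)
    return [{k: rng.choice(v) for k, v in grid.items()} for _ in range(n)]
```

Each sampled setting is then trained on GD_train and scored on GD_val, and the best-scoring setting is retained.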
Finally, for 1DCNN we consider two blocks, each consisting of a one-dimensional convolutional layer, followed by batch normalization, ReLU activation and a max-pool layer. The first block contains a convolutional layer with eight output channels and a kernel size of ten with stride four. The second block contains a convolutional layer with 16 output channels and a kernel size of ten with stride three. For both blocks, we consider a padding and dilation of one and use kernels of size two for the max-pool layers. The fully-connected part consists of dropout, followed by one linear layer. The model is trained by optimizing the L2-regularized cross-entropy loss by means of stochastic gradient descent optimization. Early stopping is again applied after five iterations. For hyperparameter tuning, we consider the learning rate ([0.001, ..., 0]), weight decay ([0.001, ..., 0]) and dropout rate ([0, ..., 0.5]).
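Under assumptions about details the text leaves open (max-pool stride two with no padding, and a 3600-bin input, i.e. bin size 5 over an assumed 2,000-20,000 m/z range), the feature-map length entering the fully-connected part can be traced with the standard output-length formula for one-dimensional convolutions:

```python
def conv1d_out(n, kernel, stride, padding=1, dilation=1):
    """Standard output-length formula for 1-D convolution/pooling layers."""
    return (n + 2 * padding - dilation * (kernel - 1) - 1) // stride + 1

n = 3600                                           # assumed input length
n = conv1d_out(n, kernel=10, stride=4)             # block 1 convolution
n = conv1d_out(n, kernel=2, stride=2, padding=0)   # block 1 max-pool (assumed)
n = conv1d_out(n, kernel=10, stride=3)             # block 2 convolution
n = conv1d_out(n, kernel=2, stride=2, padding=0)   # block 2 max-pool (assumed)
features = 16 * n                                  # 16 channels into the linear layer
```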
For the novel strains and novel biological replicates experiments, accuracies at all phylogenetic levels are reported. In the case of flat classification, where no taxonomic information is incorporated during training, hierarchical classification is easily achieved at test time by taking the taxonomy into account. For example, a prediction delivered by a flat classifier trained on eight different species with the taxonomy shown in Fig. 1 indirectly determines the class label at the upper genus and family levels. The disadvantage of this approach is that it needs to discriminate among a large number of classes, without exploiting information about the parent-child class relationships present in the taxonomy.
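Deriving the higher ranks from a flat species-level prediction amounts to a lookup in the taxonomy, which can be sketched as follows (the lineage table below is a hypothetical fragment for illustration):

```python
# Species -> (genus, family) lookup derived from the taxonomy.
lineage = {
    "Lactococcus garvieae": ("Lactococcus", "Streptococcaceae"),
    "Bacillus subtilis": ("Bacillus", "Bacillaceae"),
}

def roll_up(species):
    """A flat species prediction indirectly fixes genus and family."""
    genus, family = lineage[species]
    return species, genus, family
```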
For the novel species experiment, we consider the area under the ROC curve (AUROC) and the area under the precision-recall curve (AUPR), similarly to [36,41,39]. All models have been implemented in Python using the Scikit-learn and PyTorch libraries [49,50].

Results
In Table 3, we report the performance of the different preprocessing pipelines and models in the novel strains scenario. For each model, we use the optimal set of hyperparameters obtained during randomized grid search on GD_train and GD_val, and report the accuracy obtained on GD_test. In most cases, decreasing the bin size leads to a higher performance, as less information is lost during the binning process. In general, most machine learning models benefit from both transformations, except for LR, where a significant drop in performance is observed when total ion current normalization is applied on top of baseline correction. A possible explanation for this result is unstable model training, due to the combination of heavily reduced intensity values after TIC and the type of optimizer used in LR. In Fig. 3, we show a specific example of a raw spectrum of Lactococcus garvieae, together with all preprocessing steps considered in this work: binning and transformation by means of baseline correction and total ion current normalization. In this particular case, binning reduces the dimensionality of the original spectrum by a factor of five, while retaining most of the information present in the raw spectrum. Next, baseline correction transforms the binned spectrum by removing any baseline or trend that is present in the raw spectrum. Finally, the spectrum is rescaled such that the sum of intensities is equal to one.

Table 2. Summary statistics for the different datasets (N - number of spectra, K_f - number of unique families, K_g - number of unique genera, K_sp - number of unique species, K_st - number of unique strains, ID - in-distribution, OOD - out-of-distribution).

Species identification
Next, in Table 4, we present the results for the novel strains and novel biological replicates experiments. For each model, we consider a flat and a hierarchical classifier and report the performance at three phylogenetic levels: family, genus and species. For all models, we use a bin size of five and both transformations, since this combination gave rise to the best performance in the previous experiment. Only for LR do we choose to omit the total ion current normalization, given the issue raised in the previous experiment. In the novel strains scenario, KNN outperforms the other models with an accuracy of 84.78%, which brings us to the conclusion that spectra that are similar with respect to some distance metric in the input space most likely belong to the same species. When looking at the confusion matrix at family level for flat KNN in Suppl. Fig. 1, there appears to be confusion for the families Aerococcaceae, Bacillaceae, Lactobacillaceae and Microbacteriaceae. On the other hand, in the novel biological replicates scenario, the best-performing model is LSVC, with an accuracy of 96.15%. Again, following Suppl. Fig. 2, we observe confusion for Lactobacillaceae for both flat and hierarchical classifiers. In Suppl. Fig. 3, we also show the confusion matrices at genus level for both flat and hierarchical LSVC. In both scenarios, there is no gain when considering deep learning models, besides the fact that training is faster in large-scale classification settings. The same conclusion can be drawn for the hierarchical classifiers, indicating that taxonomic information is not immediately represented in MALDI-TOF MS data.

Table 3. Hyperparameter tuning in the novel strains scenario with reported accuracies (no transformation, ALS - baseline correction only, ALS + TIC - baseline correction and total ion current normalization).

Novel species detection
Finally, in Table 5 we evaluate whether 1DCNN with different dropout rates and KNN are useful in the context of novel species detection, using the total uncertainty as defined in Eq. 5. For both KNN and 1DCNN, we use an ensemble size of M = 10. The other models are not considered due to the increased training complexity of this scenario. 1DCNN with a dropout rate of 0.8 clearly outperforms KNN, considering two different performance measures: the area under the ROC curve and the area under the precision-recall curve. We also present the corresponding ROC and precision-recall curves in Fig. 4.
More insights into these findings are given in Fig. 5, which visualizes for the dataset TEST the first two principal components after applying principal component analysis on the original feature space and penultimate layer of 1DCNN, in the first and second row, respectively. The penultimate layer represents a learned feature representation, and in deep learning it is a common procedure to plot this space using dimensionality reduction methods such as principal component analysis. From left to right, different color schemes are applied to highlight families, in-distribution and out-of-distribution groups and the total uncertainty score of Eq. 5. One can see that out-of-distribution samples, which correspond to species not observed during training, are clearly separated in the penultimate layer of the neural network. This clear separation might be explained by the fact that the out-of-distribution samples belong to three families that are not observed during training: Pichiaceae, Saccharomycodaceae and Cellulomonadaceae (bottom left). In contrast, no clear separation is visible in the principal components that are derived from the original input space (top center). For the original feature space, the cluster that corresponds to out-of-distribution samples is also not characterized by a higher total uncertainty (top right), which might explain the drop in performance for KNN. Given the above findings, we conclude that novel species can be accurately identified by total uncertainty, especially when considering 1DCNN as underlying model.

Size and scope of the analysis
Compared to existing papers on bacterial species identification with MALDI-TOF MS, the present paper presents benchmarking results of machine learning methods on an unprecedented scale, using a unique dataset that contains more than a thousand species, more than two thousand strains and almost 100,000 mass spectra. Weis et al. [16] discussed 36 studies in a recent survey, of which 27 conducted experiments with machine learning methods for bacterial species identification and 9 for antimicrobial susceptibility testing. The largest of those studies considered 727 isolates [51], but many studies used fewer than 50 isolates; see e.g. [52][53][54][55]. In addition, almost all of those studies analyzed species belonging to a single family, or in some cases even a single genus. This particular focus might be attributed to the specific interest of the research lab that collected the data, i.e., an interest in clinically-relevant species, and to the business model of vendors of MALDI-TOF MS machines. As soon as a new dataset is released, commercial players append it to their own license-based identification libraries. License-based identification libraries form a unique selling proposition for commercial players, but the black-box nature of the accompanying software poses serious threats w.r.t. transparency, reliability and reproducibility.

Data preprocessing
In this work, data preprocessing was not the main focus. We implemented the three preprocessing steps that are most commonly used in other studies: binning to obtain a fixed-length representation, and baseline correction and total ion current normalization to mitigate technical variability. As already stated, one can choose to work with equally or non-equally spaced intervals for binning, where intervals can be either overlapping or non-overlapping, but previous work has shown that all those options lead to comparable results [22]. The choice for non-overlapping, equally-spaced bins is therefore mainly motivated by common practice and preservation of interpretability. Some other preprocessing steps are bundled in the open-source MALDIquant package [56]. In our opinion, those steps are mainly useful in the context of unsupervised learning, such as data visualization and clustering, but they do not influence the performance of supervised classification methods. For computational reasons, the recently-introduced machine learning method of Weis et al. could not be applied [57]. Inspired by topological data analysis, these authors introduced the so-called PIKE kernel. Interestingly, this kernel avoids binning as a preprocessing step, and it can be used in combination with Gaussian processes and support vector machines. Unfortunately, computing the kernel matrix has O(N²L²) complexity, making the approach unsuitable given the size of the data that we analyzed.

Table 5. Results for the novel species scenario. Area under the ROC curve (AUROC) and area under the precision-recall curve (AUPR) are reported for out-of-distribution detection based on total uncertainty.

Figure caption: In all plots, the first two principal components are shown on the horizontal and vertical axis, respectively. Left color scheme: family information; center color scheme: in-distribution (ID) and out-of-distribution (OOD) information; right color scheme: estimated total uncertainty, where larger values correspond to higher total uncertainty. A better distinction between in-distribution and out-of-distribution samples is observed for the total uncertainty estimated by the 1DCNN. (For interpretation of the references to colour in this figure legend, the reader is referred to the web version of this article.)
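The three preprocessing steps can be sketched as follows. This is a minimal NumPy illustration in which the m/z range, the bin count and the rolling-minimum baseline estimate are illustrative assumptions, not the exact settings of this study (real pipelines often use e.g. SNIP or TopHat baseline estimation, as provided by MALDIquant):

```python
import numpy as np

def preprocess(mz, intensity, n_bins=6000, mz_range=(2000.0, 20000.0)):
    """Sketch of baseline correction, total ion current normalization
    and fixed-length binning for a single mass spectrum."""
    # Baseline correction: subtract a rolling-minimum estimate of the
    # baseline (a simple stand-in for SNIP/TopHat).
    window = max(len(intensity) // 100, 1)
    baseline = np.array([intensity[max(0, i - window):i + window + 1].min()
                         for i in range(len(intensity))])
    corrected = np.clip(intensity - baseline, 0.0, None)

    # Total ion current (TIC) normalization: scale intensities to sum to 1.
    tic = corrected.sum()
    if tic > 0:
        corrected = corrected / tic

    # Binning: sum intensities into equally-spaced, non-overlapping m/z bins
    # to obtain a fixed-length feature vector.
    edges = np.linspace(mz_range[0], mz_range[1], n_bins + 1)
    binned, _ = np.histogram(mz, bins=edges, weights=corrected)
    return binned

# Toy spectrum: one Gaussian peak on a slowly rising baseline.
mz = np.linspace(2000, 20000, 5000)
intensity = 0.01 * (mz / 2000) + 5.0 * np.exp(-((mz - 5000) ** 2) / 50.0)
vec = preprocess(mz, intensity)
```

The resulting vector has a fixed length (here 6000) regardless of how many m/z values the raw spectrum contains, which is what allows standard classifiers to be applied downstream.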

Machine learning models for species identification
For the identification of novel strains and novel biological replicates, we applied convolutional neural networks (CNNs) in this paper for the first time. The other machine learning methods that we analyzed have been considered before for bacterial species identification using MALDI-TOF MS data, albeit with smaller datasets; see e.g. [58][59][60][61][62][63]. CNNs have recently become very popular in many fields. For large datasets with complex feature spaces, they usually outperform classical machine learning methods, since they combine the data preprocessing and learning phase in a single algorithm, avoiding the need for domain-specific preprocessing techniques [64]. Surprisingly, our results indicate that CNNs are not able to outperform traditional methods such as KNNs and LSVMs, which yielded the best predictive performance. However, CNNs turned out to be more stable than other methods w.r.t. the application of specific preprocessing pipelines: without preprocessing, the performance of CNNs did not drop substantially. In addition, we observed that CNNs could be trained very efficiently, unlike some of the older methods, for which implementations are not always optimized to handle large datasets. A recent paper that analyzed one-dimensional CNNs for single-cell MALDI-TOF MS data came to a similar conclusion [22]. With this revolutionary technology, a spectrum is generated for every cell in a natural environment, e.g. a urine sample of a patient, while avoiding a time-consuming culturing phase. Datasets obtained in this way easily contain more than 100,000 spectra, so the need for computationally-efficient machine learning methods will probably increase in the near future.
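As a rough illustration of how a one-dimensional CNN processes a binned spectrum, the sketch below implements a single convolutional layer with ReLU activation followed by global max pooling in plain NumPy. The filter weights are random stand-ins for trained parameters, and the layer sizes are arbitrary assumptions, not the architecture used in this study:

```python
import numpy as np

def conv1d(x, kernels, stride=1):
    """Valid 1D convolution with ReLU: x has shape (L,), kernels has
    shape (F, W). Returns F feature maps of length (L - W)//stride + 1."""
    F, W = kernels.shape
    L_out = (len(x) - W) // stride + 1
    # Stack the sliding windows, then apply all filters at once.
    windows = np.stack([x[i * stride:i * stride + W] for i in range(L_out)])
    return np.maximum(kernels @ windows.T, 0.0)  # ReLU activation

# A binned spectrum (6000 bins) passes through 8 filters of width 15 that
# respond to local peak shapes; global max pooling then summarizes each
# feature map, and a final linear layer (omitted) would map the pooled
# features to species logits.
rng = np.random.default_rng(0)
spectrum = rng.random(6000)
kernels = rng.standard_normal((8, 15))
feature_maps = conv1d(spectrum, kernels, stride=3)
pooled = feature_maps.max(axis=1)  # global max pooling -> shape (8,)
```

Because the same small filters slide over the whole m/z axis, the number of parameters is independent of the input length, which is one reason such models scale well to large spectral datasets.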
As far as we know, this is also the first work that implements hierarchical classifiers for large-scale bacterial species identification using MALDI-TOF MS. Surprisingly, hierarchical classifiers did not outperform their flat counterparts at species level, indicating that taxonomic information is not directly represented in MALDI-TOF MS data. If spectra generated from closely related bacteria are also close w.r.t. some similarity or distance metric in the feature space, hierarchical classification is expected to be more accurate than flat classification [65]. This is in fact a necessary condition for improvement, as observed in fields where hierarchical classifiers have gained interest, such as text categorization [35], protein function prediction [66,65], functional annotation of genes [67,68], identification of plants based on images [69] and discrimination of bird songs [70]. In microbiology, hierarchical classifiers have been successfully applied to bacterial species identification using Fourier-transform infrared spectra [71,72] and fatty acid methyl ester profiles [73].
Hierarchical classifiers might be preferred over flat classifiers for reasons other than predictive performance at species level. A nice property of hierarchical classifiers is their ability to abstain at different levels in the hierarchy [29]. When there is too much uncertainty about the species name of a sample, one can decide to identify that sample only at genus or family level [74,75]. This is nicely illustrated in our experimental results, where the accuracy at family level was in general substantially higher for hierarchical classifiers. At family level, the low performance of some flat classifiers can be attributed to the detour that these models make, i.e., they first identify a sample at species level and convert the label to family level as a post-processing step. Such a detour with a one-versus-all decomposition at species level leads to an overparameterized model when species only need to be identified at family level. As a side note, hierarchical classifiers might also be preferred over flat classifiers for computational reasons: at test time, hierarchical classifiers have a logarithmic time complexity to identify samples at species level, whereas flat classifiers have a linear time complexity (w.r.t. the number of classes) [35,76].
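Top-down hierarchical classification with abstention can be sketched as follows. The taxonomy and the per-node posteriors below are toy stand-ins: a real system would attach a trained classifier to every internal node and call its predict_proba at each step.

```python
# Toy taxonomy: family -> genera -> species (hypothetical structure).
taxonomy = {
    "root": ["Enterobacteriaceae", "Bacillaceae"],
    "Enterobacteriaceae": ["Escherichia", "Salmonella"],
    "Bacillaceae": ["Bacillus"],
    "Escherichia": ["E. coli"],
    "Salmonella": ["S. enterica"],
    "Bacillus": ["B. subtilis", "B. cereus"],
}
# Stubbed per-node posteriors P(child | sample) for one fixed sample.
stub_proba = {
    "root": {"Enterobacteriaceae": 0.97, "Bacillaceae": 0.03},
    "Enterobacteriaceae": {"Escherichia": 0.91, "Salmonella": 0.09},
    "Bacillaceae": {"Bacillus": 1.00},
    "Escherichia": {"E. coli": 1.00},
    "Salmonella": {"S. enterica": 1.00},
    "Bacillus": {"B. subtilis": 0.55, "B. cereus": 0.45},
}

def identify(threshold=0.8):
    """Descend the taxonomy while the most probable child exceeds the
    threshold; otherwise abstain and report the current, coarser level."""
    node = "root"
    while node in taxonomy:
        probs = stub_proba[node]
        best = max(probs, key=probs.get)
        if probs[best] < threshold:
            break  # abstain below this level
        node = best
    return node

print(identify(threshold=0.8))   # 'E. coli'
```

With a stricter threshold of 0.95, the same sample is only identified at family level ('Enterobacteriaceae'), illustrating the abstention behaviour discussed above. Note also that each query only evaluates the classifiers along one root-to-leaf path, which is the source of the logarithmic test-time complexity.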

Evaluation protocols for species identification
In the present study, we analyzed three of the most common reasons why machine learning algorithms might not assign the correct species name to a MALDI-TOF mass spectrum, namely the presence of novel strains, novel biological replicates and/or novel species at test time. Remarkably, many existing studies do not make such a distinction [16]. When we compared the performance for the novel strains and novel biological replicates scenarios, we observed that the accuracy at species level increased by more than 10% if only novel biological replicates are analyzed at test time. Therefore, we believe that relatedness between spectra is indeed present at the more fine-grained strain level, since a classifier yields more accurate identifications for a particular strain when that strain also appears in the training dataset. However, it is very reasonable to test machine learning models in terms of generalization to unseen strains, as novel strains will often arrive at test time. Going one step further, one could also try to classify MALDI-TOF MS spectra at strain level, but then the predictive performance usually becomes too low to be useful in practice.
The most common testing protocol in the literature is k-fold cross-validation, where, repeatedly, one of the k folds is left out for testing, while the other folds are used for training. Obviously, such an approach is not able to quantify how well novel strains can be identified. Even for the identification of novel biological replicates, cross-validation is a questionable approach, especially when many technical replicates are present in a dataset. Only four existing studies avoid cross-validation by evaluating models on an independent test dataset that is not used during training [77,78,53,54]. In all four cases, data of one or more labs were considered for training and data of other labs for testing. This allows for comparisons across locations, as MALDI-TOF mass spectra measured at different locations suffer from batch effects, which are likely to stem from differences in laboratory routine or system settings [79]. However, an independent test dataset can in principle contain a mix of novel technical replicates, novel biological replicates, novel strains, novel species and other deviations from the training data that might impact identification. So, if one intends to investigate the impact of one particular type of deviation, it is important that the test dataset is constructed in such a way that this specific source of deviation is isolated from the others. That is in fact the main reason why we considered three different test datasets in this paper. With the dataset GD test, we evaluated the performance for novel strains, with LYO2 we only analyzed novel biological replicates, and with the independent dataset TEST we evaluated the performance on a mix of samples, consisting of novel biological replicates, novel strains and novel species. We did not analyze the identification performance for technical replicates, because this is less useful in practice, and because previous work has studied the impact of technical variation extensively.
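Constructing such scenario-specific test sets amounts to group-aware splitting. The sketch below, with hypothetical record fields, holds out entire strains so that no test strain occurs in the training data (the novel strains scenario); grouping on biological replicate identifiers instead would yield the novel biological replicates scenario:

```python
import random

def split_by_group(records, group_key, test_fraction=0.2, seed=0):
    """Hold out entire groups (e.g. all spectra of a strain), so that no
    group in the test set appears in the training set."""
    groups = sorted({r[group_key] for r in records})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in records if r[group_key] not in test_groups]
    test = [r for r in records if r[group_key] in test_groups]
    return train, test

# Hypothetical records: each spectrum carries a species and a strain label.
records = [
    {"spectrum_id": i, "species": f"sp{i % 3}", "strain": f"st{i % 6}"}
    for i in range(60)
]
train, test = split_by_group(records, group_key="strain")
# No strain occurs on both sides of the split.
assert not ({r["strain"] for r in train} & {r["strain"] for r in test})
```

Plain record-level k-fold cross-validation, by contrast, scatters spectra of the same strain (and even the same replicate) over training and test folds, which is exactly why it cannot quantify generalization to novel strains.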

Novel species detection
For the third scenario, in which novel species had to be detected in the dataset TEST, specific machine learning methods are needed. Technically speaking, one is in this case not solving a classification problem but an out-of-distribution detection problem. In the machine learning literature this setting is also often referred to as open-set recognition [80,81], anomaly detection [43] or classification with reject option [76]. Inspired by applications like self-driving cars, where it is of utmost importance to abstain from making a prediction if the uncertainty is too high, out-of-distribution detection has become a very popular research topic [82,83,36,[84][85][86]. As a result, a wide range of novel methods has recently been proposed, including exact and approximate Bayesian methods [87,88,18], models based on Dirichlet priors [38], generative models [89,41,[90][91][92]39] and distance-based methods [93,41,94]. Very few of these methods have been applied to bacterial species identification. Two papers identify bacterial species based on genomics data, using Naive Bayes [74] and a specific generative model [39] for detecting out-of-distribution samples. Weis et al. [57] use Gaussian processes for out-of-distribution detection on MALDI-TOF MS data, using specific preprocessing techniques [56,95]. The commercial ClinProTool of Bruker Daltonik GmbH & Co. KG (Germany, http://www.bruker.com/) also detects out-of-distribution samples when no match in its database is found, i.e., when the similarity of the nearest neighbour drops below a certain predefined threshold.
In this work, we decided to experiment with Monte Carlo dropout, as this method is frequently used in several domains [19,46,83]. This type of approximate Bayesian inference seemed to work reasonably well in our setting, yet a more extensive study in which the different frameworks from the literature are analyzed might be desirable, and is left as potential future work. More specifically, we believe that it would be very useful to conduct experiments with methods that are able to differentiate between aleatoric and epistemic uncertainty [45]. Aleatoric uncertainty originates from inherent noise in the data, which cannot be mitigated by collecting more data, whereas epistemic uncertainty refers to uncertainty that can be reduced if more data were available. In fact, the total uncertainty of Eq. 5, which was considered as the criterion to detect out-of-distribution samples, can easily be decomposed into an aleatoric and an epistemic part. In the conducted experiments, we did not see an improvement when considering the decomposition; however, other methods might yield different conclusions.
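For predictive distributions obtained via Monte Carlo dropout, this decomposition can be computed directly from the stochastic forward passes: the total uncertainty is the entropy of the averaged prediction, the aleatoric part is the average entropy of the individual predictions, and the epistemic part is their difference (the mutual information between prediction and model parameters). A minimal NumPy sketch, with toy probabilities standing in for real dropout samples:

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    """Shannon entropy of (rows of) a probability vector."""
    return -np.sum(p * np.log(p + eps), axis=axis)

def uncertainty_decomposition(mc_probs):
    """mc_probs: array of shape (T, K) with the class probabilities of
    T stochastic forward passes (Monte Carlo dropout) over K classes.
    Returns (total, aleatoric, epistemic), with total = aleatoric + epistemic."""
    mean_p = mc_probs.mean(axis=0)
    total = entropy(mean_p)               # entropy of the mean prediction
    aleatoric = entropy(mc_probs).mean()  # mean entropy per forward pass
    epistemic = total - aleatoric         # mutual information
    return total, aleatoric, epistemic

# Two passes that are individually confident but disagree: low aleatoric,
# high epistemic uncertainty -- a typical out-of-distribution signature.
mc = np.array([[0.98, 0.01, 0.01],
               [0.01, 0.98, 0.01]])
total, aleatoric, epistemic = uncertainty_decomposition(mc)
```

Thresholding the total uncertainty, as done in this work, treats both sources alike; methods that separate the two parts could in principle flag out-of-distribution samples via the epistemic term only.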

Conclusion
In this work, we performed a large-scale benchmarking study of bacterial identification using MALDI-TOF mass spectrometry and machine learning methods. We implemented several traditional machine learning methods, as well as a few novel methods, such as one-dimensional convolutional neural networks, hierarchical classifiers and an out-of-distribution detection method. The size and the diversity of the data that we analyzed allowed us to compare three important identification scenarios that are generally not distinguished in the literature: identification of novel biological replicates (i.e., isolates that represent strains that are already present in the database), novel strains (i.e., isolates that represent novel strains belonging to species that are present in the database) and novel species (i.e., isolates that represent strains of species that are not present in the database). The results demonstrate that acceptable identification rates are obtained in all three scenarios, but the numbers are typically lower than those reported in studies with a more limited analysis. Using hierarchical classification methods, we also demonstrated that taxonomic information is in general not well preserved in MALDI-TOF mass spectrometry data. For the novel species scenario, we applied for the first time neural networks with Monte Carlo dropout, which have been shown to be successful for the detection of novel classes in other domains, such as computer vision. Especially for this last scenario, we still see many possibilities for benchmarking other recent methods in a separate study.

Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.