MALDI-TOF mass spectrometry for sub-typing of Streptococcus pneumoniae

Serotyping of Streptococcus pneumoniae is important for monitoring of vaccine impact. Unfortunately, conventional and molecular serotyping is expensive and technically demanding. This study aimed to determine the ability of matrix-assisted laser desorption-ionisation time-of-flight (MALDI-TOF) mass spectrometry to discriminate between pneumococcal serotypes and genotypes (defined by global pneumococcal sequence cluster, GPSC). In this study, MALDI-TOF mass spectra were generated for a diverse panel of whole genome sequenced pneumococcal isolates using the bioMerieux VITEK MS in clinical diagnostic (IVD) mode. Discriminatory mass peaks were identified and hierarchical clustering was performed to visually assess discriminatory ability. Random forest and classification and regression tree (CART) algorithms were used to formally determine how well serotypes and genotypes were identified by MALDI-TOF mass spectrum. One hundred and ninety-nine pneumococci, comprising 16 serotypes and non-typeable isolates from 46 GPSC, were analysed. In the primary experiment, hierarchical clustering revealed poor congruence between MALDI-TOF mass spectrum and serotype. The correct serotype was identified from MALDI-TOF mass spectrum in just 14.6% (random forest) or 35.4% (CART) of 130 isolates. Restricting the dataset to the nine dominant GPSC (61 isolates / 13 serotypes), discriminatory ability improved slightly: the correct serotype was identified in 21.3% (random forest) and 41.0% (CART). Finally, analysis of 69 isolates of three dominant serotype-genotype pairs (6B-GPSC1, 19F-GPSC23, 23F-GPSC624) resulted in the correct serotype identification in 81.1% (random forest) and 94.2% (CART) of isolates. This work suggests that MALDI-TOF is not a useful technique for determination of pneumococcal serotype. MALDI-TOF mass spectra appear more associated with isolate genotype, which may still have utility for future pneumococcal surveillance activities.


Background
Streptococcus pneumoniae, a globally important pathogenic bacterium [1], consists of at least 100 distinct capsular serotypes [2]. Serotype-based surveillance of pneumococcal populations remains important since the polysaccharide capsule is a major antigen and the basis of current pneumococcal conjugate vaccines (PCV). Temporal changes to the serotypes associated with colonisation and disease may necessitate alterations to vaccine composition [3]. Traditional capsular typing by the Quellung reaction is both expensive and time-consuming. Deduction of serotype is possible by molecular techniques, including polymerase chain reaction (PCR) [4], microarray [5], and whole genome sequencing (WGS) [6]. However, these techniques are often technically demanding and/or not affordable in many settings. Although still largely unaffordable in resource-limited settings, matrix-assisted laser desorption-ionisation time-of-flight (MALDI-TOF) mass spectrometry is increasingly used for bacterial identification in both clinical and research laboratories [7]. To improve pneumococcal identification, a combined bile solubility test-MALDI-TOF assay has been developed to separate S. pneumoniae from the closely related S. mitis group of organisms [8]. Although MALDI-TOF mass spectra are derived from peptides / proteins, several studies have assessed the potential of MALDI-TOF for identification of pneumococcal serotypes. Encouragingly, in two of these, MALDI-TOF mass spectra clustering identified common pneumococcal serotypes fairly well [9,10]. However, the most recently published study yielded considerably less optimistic results [11].
In view of the conflicting published data, we set out to determine whether MALDI-TOF mass spectra of pneumococci cluster consistently by serotype. A further aim was to explore whether any potential MALDI-TOF mass spectrum-serotype correlations were independent of underlying isolate genotype.

Pneumococcal serotypes and genotypes
A total of 199 S. pneumoniae isolates were included in the study. PCR serotype results were concordant with the phenotypic and WGS results in 198/199 isolates, to the level determined by PCR specificity (for some, this was to just the serogroup). One isolate was identified as serotype 6C by PCR but was non-typeable (NT) by both WGS and phenotypic methods. In the subsequent analyses, this isolate was referred to as NT.
To determine overall MALDI-TOF mass spectra clustering by serotype (objective 1), 130 of the isolates were examined. This selection included 16 serotypes plus non-typeable isolates (5-22 isolates per serotype). After further analysis of WGS data, a finalised genotype (sequence type [ST] / global pneumococcal sequence cluster [GPSC]) could be determined for all but three NT isolates: 46 GPSC were identified (Table 1).
From 785 matched peaks, 16 peaks were found to discriminate between serotypes (false discovery rate [FDR] q < 0.05; Table 2). Hierarchical clustering, on the basis of these peaks, identified four major clusters (Fig. 1). Most serotypes (13/16; 81.3%) appeared in > 1 cluster, with just serotypes 1, 23A, and 34 confined to a single cluster. Serotypes 1 and 34 were the least genotypically diverse, being represented by a single GPSC each. The MALDI-TOF mass spectra from the same 130 isolates were reanalysed following reorganisation of the dataset by genotype (GPSC) rather than serotype. This identified 27 discriminatory peaks and three major clusters. Only 3/46 (6.5%) GPSCs were spread across more than one cluster (Additional File 1). Classification of isolate mass spectrum data by random forest or classification and regression tree (CART) algorithms was sub-optimal. The random forest approach identified the correct serotype in just 14.6%, and GPSC in 17.7%, of isolates; CART correctly identified serotype in 35.4% and GPSC in 27.7% of isolates (Additional File 2). Restricting the dataset to the nine dominant GPSC, those comprising at least five isolates (total 61 isolates, 13 serotypes; 535 matched peaks), discriminatory ability improved to some degree (Fig. 2 [serotype-organised data; 16 discriminatory peaks] and Additional File 3 [genotype-organised data; 34 discriminatory peaks]). With serotype-organised data, the correct serotype was identified in 21.3% (random forest) and 41.0% (CART) of instances. Using genotype-organised data, the correct GPSC was identified in 45.9% (random forest) and 77.0% (CART) of instances (Additional File 4).
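The q < 0.05 threshold used to call discriminatory peaks can be illustrated with a minimal Python sketch of the Benjamini-Hochberg adjustment. This is not the MASS-Up implementation used in the study; the function name `benjamini_hochberg` and the p-values are hypothetical and serve only to show how per-peak p-values become q-values.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg q-values in the original order of p_values."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical per-peak p-values (illustrative only, not study data).
p = [0.001, 0.008, 0.039, 0.041, 0.20]
q = benjamini_hochberg(p)

# Peaks retained as "discriminatory" at the q < 0.05 threshold.
discriminatory = [i for i, q_i in enumerate(q) if q_i < 0.05]
```

In this toy input, the third and fourth peaks have raw p-values below 0.05 but adjusted q-values above it, so only the first two peaks would be retained.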

Discussion
This study failed to identify consistent clustering of MALDI-TOF mass spectrum by serotype in a collection of well-characterised pneumococcal isolates with diverse genotypes. Reducing the number of genotypes within the dataset analysed improved classification of serotype by MALDI-TOF mass spectrum. Inclusion of just three serotypes, and one genotype per serotype, resulted in correct serotype classification in 81.1% (random forest) and 94.2% (CART) of isolates. Overall, this suggests that MALDI-TOF mass spectrum clustering within S. pneumoniae is driven by underlying genotype.
Table footnote: The number in each cell summarises the proportion of isolates of the serotype with the corresponding mass peak. a: False discovery rate.
Our results contrast slightly with two previous studies, both of which found that, with careful isolate selection and optimisation of peak lists, MALDI-TOF could discriminate between several common serotypes. From a Japanese collection of 407 isolates from 10 major serotypes, a ClinProTools-developed classification algorithm correctly identified serotypes in 84.0% of isolates [9]. Although multiple genotypes were included for each serotype, a dominant ST could be identified in several of them. The authors concluded that further work to determine the interaction between pneumococcal genotype and MALDI-TOF mass spectrum would be helpful. Analysis of 416 Brazilian isolates from six serotypes identified 10 major clusters by visualising a neighbour joining tree based on Pearson's coefficient [10]. Whilst serotypes did cluster fairly well visually, it was notable that all serotypes were identified in > 1 cluster. Importantly, genotyping data were not available in this study.
The major strength of the study is that it included whole genome sequenced isolates where serotype had been verified by both molecular and phenotypic methods. The MALDI-TOF work was performed using the machine in clinical diagnostic / IVD mode, i.e. as the data would be generated in a routine clinical microbiology laboratory. The study utilised fully open-source and freely available analytic tools [12,13], in contrast to previous analyses of pneumococcal MALDI-TOF data [9,10]. These tools permitted more exploratory analyses and visualisation of the data, rather than being constrained to machine-learning algorithms packaged with the machine software (often Support Vector Machine [14]). However, there are several limitations to note. The sample size was small, resulting in a small number of isolates for some serotypes. It would have been helpful to have a larger number of isolates of each serotype-genotype pair. Multiple mass spectra per isolate could have been included to minimise minor variations in mass spectra. However, this added analytic complexity would not have been reflective of normal diagnostic workflow. To mitigate this limitation, the final dataset included only those isolates where the automated analysis of mass spectrum data had indicated acceptable identification at the species level. In this context, isolates with an initially unacceptable species identification likely represented technical errors in the laboratory (i.e. poor slide preparation). Thus, isolates not meeting this criterion were repeated to obtain acceptable MALDI-TOF mass spectra. It is also encouraging to note that several of the discriminatory masses in Table 2 (e.g. 5062.44 / 10,121.74) display similar peak proportions within a serotype, indirectly demonstrating spectrum reproducibility. Finally, as has been noted previously, it would have been optimal to have included isolates from more than one geographic location and to have performed external validation of the random forest / CART models. This latter point was highlighted as a major roadblock to progress in a recent systematic review [14]. Despite these limitations, our findings and conclusions are similar to those of Ercibengoa et al. [11]. This carefully conducted study of 60 isolates of four common pneumococcal serotypes failed to confirm the presence of either novel or previously established discriminatory MALDI-TOF peaks. That study team speculated that proteins associated with capsular polysaccharide synthesis are likely to be outside of the MALDI-TOF detection range. They also noted that genotype differences may render cross-site validation and use of external discriminatory peaks challenging.
Fig. 1 A cluster dendrogram of serotype-organised MALDI-TOF mass spectrum data for 16 serotypes + non-typeable (NT) isolates. The isolate selection includes 130 pneumococcal isolates from 46 global pneumococcal sequence clusters (GPSC). The inner metadata ring denotes GPSC and the outer ring serotype.
The pneumococcal capsule synthesis locus (cps) is remarkably complex, with considerable diversity within the key enzyme classes [15,16]. The relationship between cps locus gene content and immunologically determined serotype is not always straightforward. In an analysis of the cps loci of 88 serotypes with capsules synthesised by the Wzy-dependent pathway, eight major clusters and 21 sub-clusters were identified [17]. In the majority of cases, members of the same serogroup were co-located in the same cluster; however, there were several examples where this relationship broke down. Taken together, all of these findings should perhaps temper enthusiasm for further attempts at MALDI-TOF-based serotyping of S. pneumoniae. However, if genotypic differences in pneumococcal MALDI-TOF mass spectra are found to be consistent between sites and with sequence-based genotype data, then MALDI-TOF could still have a potential role in future pneumococcal surveillance. Indeed, there is precedent for this, as it has been shown already that variations in ribosomal protein mass peaks correlated with clonal complex in Neisseria meningitidis [18]. With the rapid proliferation of pneumococcal sequencing globally, the availability of ribosomal protein sequence data from RiboDB [19], and the increasing use of MALDI-TOF for primary identification of isolates that are submitted for such sequencing, this should be amenable to exploration at scale.
Fig. 2 A cluster dendrogram of serotype-organised MALDI-TOF mass spectrum data including only genotypes with ≥ 5 isolates. The isolate selection includes 61 isolates comprising 13 serotypes and nine global pneumococcal sequence clusters (GPSC). The inner metadata ring denotes GPSC and the outer ring serotype.

Conclusions
Identification of pneumococcal serotype by MALDI-TOF is not reliable. MALDI-TOF mass spectra appear more associated with underlying genotype. Further work is warranted to determine the robustness of pneumococcal genotype identification by MALDI-TOF.

Bacterial isolate selection
Pneumococcal isolates that had been characterised during pre- and post-PCV pneumococcal colonisation and disease studies of children attending for care at Angkor Hospital for Children in Cambodia were selected for further study [20,21]. Isolates were selected for inclusion using the following criteria: (a) submitted for sequencing as part of the on-going Global Pneumococcal Sequencing project [22], and had passed initial WGS quality control (QC) checks with availability of preliminary in silico MLST genotype; (b) WGS-derived and phenotypic serotype were congruent, including NT pneumococci as a "serotype"; (c) at least ten isolates per serotype.
To determine how well MALDI-TOF mass spectra clustered by serotype (objective 1), at least one isolate of each distinct ST identified within a serotype was included. If there were fewer than five different STs for a serotype, multiple isolates of the same serotype-ST were included to a total of five isolates. A total of 130 isolates were included in this work. To explore the stability of MALDI-TOF mass spectra within unique serotype-genotype pairs (objective 2), multiple isolates of the commonest ST, including single-locus variants if fewer than 20 isolates, were selected for serotypes 6B, 19F, and 23F, the dominant serotypes in the isolate collection. A total of 69 isolates were included in this work.
Fig. 3 A cluster dendrogram of serotype-organised MALDI-TOF mass spectrum data for the three dominant pneumococcal serotype-genotype pairs. The isolate selection comprises 69 pneumococcal isolates. The inner metadata ring denotes global pneumococcal sequence cluster (GPSC) and the outer ring serotype.
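The objective 1 selection rule (at least one isolate per distinct ST within a serotype, topped up to five isolates where fewer than five STs exist) can be sketched in Python as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function `select_isolates`, the tuple layout, and the isolate identifiers and STs are all hypothetical, and the order in which top-up isolates are chosen is an arbitrary choice made here.

```python
from collections import defaultdict


def select_isolates(isolates, minimum=5):
    """Select isolate ids from (isolate_id, serotype, st) tuples.

    Rule sketched: one isolate per distinct ST within each serotype, then
    top up with further isolates of that serotype until `minimum` is reached.
    """
    by_serotype = defaultdict(list)
    for iso in isolates:
        by_serotype[iso[1]].append(iso)

    selected = []
    for group in by_serotype.values():
        chosen, seen_st = [], set()
        # First pass: one representative per distinct ST.
        for iso in group:
            if iso[2] not in seen_st:
                seen_st.add(iso[2])
                chosen.append(iso)
        # Second pass: add repeat serotype-ST isolates up to the minimum.
        for iso in group:
            if len(chosen) >= minimum:
                break
            if iso not in chosen:
                chosen.append(iso)
        selected.extend(iso[0] for iso in chosen)
    return selected


# Hypothetical collection: one serotype with a single ST, one with six STs.
collection = [(f"a{i}", "X", "ST_1") for i in range(1, 7)]
collection += [(f"b{i}", "Y", f"ST_{i + 10}") for i in range(1, 7)]
collection.append(("b7", "Y", "ST_11"))  # repeat of an ST already represented
picked = select_isolates(collection)
```

For the single-ST serotype the sketch returns five isolates, and for the serotype with six distinct STs it returns six, one per ST.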

Re-confirmation of serotype by multiplex PCR
As a further confirmation of serotype, DNA was extracted from re-cultured isolates using the QIAamp DNA Mini Kit (Qiagen, Hilden, Germany) and a multiplex real-time PCR was performed to detect 40 pneumococcal serotypes, as previously described [23]. This PCR also includes a primer-probe set for the S. pneumoniae-specific lytA (autolysin) gene, to confirm species identification.

Characterisation by MALDI-TOF
The stored isolates were re-cultured overnight at 35-37°C in 5% CO2 on 5% sheep blood agar (Oxoid, Basingstoke, UK; prepared in-house). Single colonies were mixed with matrix (alpha-cyano-4-hydroxycinnamic acid, CHCA) on disposable metallised slides and analysed using the VITEK MS MALDI-TOF system (bioMerieux, Marcy l'Etoile, France), following manufacturer instructions. Particular care was taken with the preparation of colony-matrix spots, in order to optimise the generation of high-quality mass spectra. To be compatible with routine diagnostic laboratory workflow, a single mass spectrum was generated per isolate. The machine was run in diagnostic (IVD) mode, with automated measurement of proteins in the specific mass range of 2,000-20,000 Da. Escherichia coli ATCC 8739 was used for QC and calibration of each slide. Isolates were considered acceptable for analysis if the slide passed QC and the isolate was identified unambiguously as S. pneumoniae by the automated reporting system (bioMerieux Myla, Knowledge Base V3.2.0).

Pneumococcal genotype assignment
Isolates were selected on the basis of MLST genotype. However, during the conduct of the study, the GPSC system was proposed as the optimal method for clustering of S. pneumoniae using WGS data [24]. GPSC were determined automatically for all isolates submitted to the GPS project and, thus, in the following analyses, GPSC have been used instead of ST to describe isolate genotypes. For clarity, both GPSC and ST are included in Table 1.

Data analysis
MALDI-TOF mass spectra were exported as peak lists (.mzML files) from the VITEK MS instrument: a single peak list per isolate. These were converted to text (.csv) files using the R statistical software V3.6.3 [25] and package "MALDIquant" [12]. Peak lists were stored in both serotype-isolate and genotype-isolate folder structures. These labelled peak lists were imported into MASS-Up, an open source MALDI-TOF analysis program, using the "load peak" command [13]. Inter-sample peak matching, by serotype or genotype, was performed using default settings (method: "forward"; tolerance type: "ppm"; tolerance: 300 ppm; reference type: "AVG"). Discriminant peaks were identified using the biomarker discovery function. Discriminant peak lists (DPL) were generated by selecting peaks with a q-value of < 0.05 (Benjamini-Hochberg false discovery rate). Hierarchical clustering was performed on these discriminatory peaks using the "hclust" function (method = "average") in R, following generation of a distance matrix (Hamming distance, using the "hamming.distance" function of the "e1071" package [26]). Clusters were identified using the "mclust" and "NbClust" packages [27,28]. For visual assessment of the relationship between MALDI-TOF mass spectra and serotype or genotype, circular dendrograms were visualised and annotated using the "ggtree" package [29]. Formal testing of discriminatory ability was done via the classification analysis functionality in MASS-Up, using all peak data. The "Random Forest" and "Classification and Regression Tree (CART)" algorithms were run using default settings and 10-fold cross validation.
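The clustering step described above (Hamming distance between peak presence/absence profiles, followed by average-linkage agglomeration) can be sketched in plain Python. The study itself used R ("hclust" and "e1071"), so this is only an analogue of the technique and not the authors' code; the function names, the isolate labels, and the binary profiles are hypothetical.

```python
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length binary profiles differ."""
    return sum(x != y for x, y in zip(a, b))


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering with average linkage on Hamming distances.

    profiles: dict mapping isolate name -> list of 0/1 peak indicators.
    Returns a sorted list of clusters, each a sorted list of isolate names.
    """
    clusters = [[name] for name in profiles]

    def cluster_distance(c1, c2):
        # Average linkage: mean of all pairwise distances between members.
        pairs = [(a, b) for a in c1 for b in c2]
        total = sum(hamming(profiles[a], profiles[b]) for a, b in pairs)
        return total / len(pairs)

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


# Hypothetical presence (1) / absence (0) of six discriminatory peaks.
peak_profiles = {
    "iso_A": [1, 1, 0, 0, 1, 0],
    "iso_B": [1, 1, 0, 0, 0, 0],
    "iso_C": [0, 0, 1, 1, 0, 1],
    "iso_D": [0, 0, 1, 1, 1, 1],
}
two_clusters = average_linkage(peak_profiles, n_clusters=2)
```

In this toy input, iso_A and iso_B differ at one peak, as do iso_C and iso_D, while profiles from different pairs differ at five or six peaks, so the two pairs form the two clusters. Choosing the number of clusters is a separate step, which the study handled with the "mclust" and "NbClust" packages.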