Automated classification of group B Streptococcus into different clonal complexes using MALDI-TOF mass spectrometry

Objectives To evaluate the performance of Matrix-Assisted Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (MALDI-TOF MS) for automated classification of Group B Streptococcus (GBS) into five major clonal complexes (CCs) during routine GBS identification. Methods MALDI-TOF mass spectra of 167 GBS strains belonging to five major CCs (CC10, CC12, CC17, CC19, CC23) were grouped into a reference set (n = 67) and a validation set (n = 100) for the creation and evaluation of a GBS CCs subtyping main spectrum (MSP) and its modified version (MSP-M) using MALDI BioTyper and ClinProTools. The GBS CCs subtyping MSP-M was generated by resetting the discriminative peaks of the GBS CCs subtyping MSP according to the informative peaks from the optimal classification model of the five major CCs and the contribution of each peak to the model created by ClinProTools. Results The PPV for the GBS CCs subtyping MSP-M was greater than that for the subtyping MSP for CC10 (99.21% vs. 93.65%), similar for CC12 (79.55% vs. 81.06%), CC17 (93.55% vs. 94.09%), and CC19 (92.59% vs. 95.37%), and lower for CC23 (66.67% vs. 83.33%). Conclusion MALDI-TOF MS could be a promising tool for the automated categorization of GBS into five CCs by both the CCs subtyping MSP and MSP-M. The GBS CCs subtyping MSP-M is preferred for the accurate prediction of CCs with highly discriminative peaks.


Introduction
Streptococcus agalactiae, also known as Group B Streptococcus (GBS), is a gram-positive coccus that commonly colonizes the female lower genital tract and rectum. Since the 1980s, it has been recognized as a prominent bacterial pathogen causing neonatal invasive infections such as sepsis, meningitis, and pneumonia, resulting in acute illnesses, long-term impairments, and even neonatal death (Nanduri et al., 2019). GBS colonization during late pregnancy may pose a great threat to fetal health as the primary risk factor for preterm birth, spontaneous abortion, and neonatal GBS infections (Delara et al., 2023). GBS strains of different lineages differ greatly in pathogenicity and virulence, epidemiology, and antibiotic resistance phenotypes (Manning et al., 2009; Tazi et al., 2010; Teatero et al., 2017; Wang et al., 2018; Arias et al., 2019; Gori et al., 2020; Zheng et al., 2020; Le Gallou et al., 2023).
Nowadays, GBS strains are commonly typed for molecular epidemiological surveillance by genoserotyping, with serotyping through a latex agglutination (LA) assay using antibodies specific to the surface capsular polysaccharide (CPS) and PCR amplification of the capsular gene (Imperi et al., 2010). However, the high expense of genoserotyping makes it difficult to use for routine clinical analysis. Moreover, GBS strains belonging to the same capsular serotype, III or Ib, have shown significantly different phenotypic and genotypic characteristics (Arias et al., 2019; Wu et al., 2019; Zheng et al., 2020). Multi-locus sequence typing (MLST) has been widely used to examine the genetic lineages of GBS since 2003 (Jones et al., 2003). However, the high expense and the time-consuming, labor-intensive, and complex procedures of MLST also preclude its clinical application in routine epidemiological surveillance.
Matrix-Assisted Laser Desorption/Ionization Time-of-Flight Mass Spectrometry (MALDI-TOF MS) has been widely applied for the fast and accurate clinical identification of microbial species in recent years. MALDI-TOF MS can also discriminate below the species level (Wang et al., 2022; Liu et al., 2023; Oberg et al., 2023), as shown by the automated, fast, and accurate prediction of methicillin-resistant Staphylococcus aureus (MRSA) clonal complexes (CCs) (Camoez et al., 2016) and by GBS subspecies-level typing based on the mass variation of ribosomal subunit proteins (rsp profile) (Rothen et al., 2019). The previously reported GBS subtyping strategies based on MALDI-TOF MS all require manual assessment of the acquired spectra and highly trained personnel with professional tools such as ClinProTools software or the online GBS serotyper. These strategies include peak biomarkers of different GBS sequence types (STs) and serotypes (Lartigue et al., 2011; Lanotte et al., 2013; Lin et al., 2019), statistical models generated from MALDI-TOF mass spectra for the rapid classification of major GBS serotypes (Ia, Ib, III, V, VI) (Wang et al., 2019) and STs (ST10, ST12, ST17, ST19) (Huang et al., 2020), and the machine learning GBS CCs multi-classification model generated with the XGBoost algorithm from the antibiotic susceptibility, serotypes, and virulence genes of GBS strains (Liu et al., 2022). These requirements make them unsuitable for routine clinical application. An automated and fast method to detect different GBS CC lineages simultaneously with MALDI-TOF MS microbial species identification has not yet been evaluated. This study aimed to analyze the potential of MALDI-TOF MS to discriminate the major neonatal GBS lineages in China using an automated approach based on two Biotyper main spectra (MSPs).

Sample preparation and MALDI-TOF MS data acquisition
GBS strains were cultured on Columbia sheep blood agar at 37 °C for 16-18 h, and colonies were then collected for protein extraction using the ethanol-formic acid extraction method (Lartigue et al., 2009). MALDI-TOF MS was performed on a MALDI Microflex LT instrument (Bruker Daltonics, Bremen, Germany) under the control of FlexControl software. One microliter of protein extract from each strain was spotted on the MALDI target plate (MSP 96 target steel; Bruker Daltonics) and air-dried at room temperature. Each dried protein spot was covered with 1 µL of a saturated matrix solution of α-cyano-4-hydroxycinnamic acid (Bruker Daltonics) in 50% acetonitrile-2.5% trifluoroacetic acid (Fisher Scientific, U.K.) and air-dried for further MS analysis. For each strain, 240 laser shots from 40 separate sample spots were automatically collected at 60 Hz (random walk movement). Protein extracts of the reference and validation sets were spotted on a MALDI target plate in eight and two replicates, respectively, and each spot was measured three times to obtain 24 and 6 mass spectra per strain in the reference and validation sets, respectively. Using the main spectrum (MSP) identification standard method (mass range 2,000-20,000 Da) on the MALDI BioTyper (Version 3.1; Bruker Daltonics), mass spectra were acquired in linear positive mode at a laser frequency of 20 Hz. All raw spectra were aligned to the GBS MSP database using the MALDI Biotyper pattern-matching algorithm.
Because the classification of GBS CCs by the GBS CCs subtyping MSP relies mostly on informative peaks of low intensity, or on combinations of such peaks, mass spectra of GBS strains with a logarithmic score [log(S)] ≥ 2.3 during GBS species identification were preferred for enrollment into the CCs reference set to improve CCs subtyping. For validation, mass spectra with scores of 2.0-2.3 from GBS strains with multiple measurements could also be used for CCs prediction, especially for CC10 and CC17, which have specific informative peaks. For CC12, CC19, and CC23 strains, which lack good informative peaks, mass spectra with identification log scores ≥ 2.3 were preferred for more reliable prediction of the CCs. The detailed pipeline applied to the raw spectra is shown in Supplementary Document S1.
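The score-based enrollment preferences described above can be illustrated with a short sketch. It is not part of the published pipeline: the function names are hypothetical, and restricting the 2.0-2.3 range to CC10 and CC17 is a simplifying assumption drawn from the statement that those two CCs have specific informative peaks.

```python
# Illustrative sketch only; thresholds come from the text, names are hypothetical.

# CCs described in the text as having specific informative peaks.
PEAK_RICH_CCS = {"CC10", "CC17"}


def usable_for_reference(log_score: float) -> bool:
    """Reference-set spectra are preferred when log(S) >= 2.3."""
    return log_score >= 2.3


def usable_for_validation(log_score: float, predicted_cc: str) -> bool:
    """Validation spectra: >= 2.3 is always accepted; 2.0-2.3 is accepted
    here only for CC10/CC17 (an assumption simplifying the text)."""
    if log_score >= 2.3:
        return True
    return 2.0 <= log_score < 2.3 and predicted_cc in PEAK_RICH_CCS
```

In this sketch a CC23 prediction from a spectrum scoring 2.1 would be set aside, whereas a CC10 prediction with the same score would be kept.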

MALDI-TOF MS data analysis
Mass spectra from the reference set were grouped into five CC classes (CC10, CC12, CC17, CC19, CC23). The multiple spectra of each strain from 123 GBS isolates representative of the five major CCs (CC10, n = 20; CC12, n = 20; CC17, n = 49; CC19, n = 21; CC23, n = 13) were loaded into ClinProTools software (version 3.0; Bruker Daltonics) for model generation and peak analysis. To avoid errors in the statistical calculation caused by multiple measurements, spectra grouping and similarity selection were enabled before loading spectra. Spectra processing in ClinProTools included peak selection and calculation of the average peak list. Mass-to-charge ratio (m/z) values ranging from 2,000 to 10,000 were used. Null spectra exclusion was enabled. The other default settings remained unchanged. The m/z values from the average spectrum of the major CCs were extracted, and a peak was considered statistically discriminative when the p-value for the Anderson-Darling test was >0.05 and that for the t-/ANOVA or Wilcoxon/Kruskal-Wallis test was ≤0.05, or when the p-values for both the Anderson-Darling test and the Wilcoxon/Kruskal-Wallis test were ≤0.05 (Camoez et al., 2016). The average spectrum of each CC class was calculated in order to create pattern recognition models using the genetic algorithm (GA), supervised neural network (SNN), and Quick Classifier (QC) algorithms (Bruker Daltonics GmbH, 2011). Recalibration was carried out with a maximal peak shift of 1,000 parts per million and a match to calibrant peaks. Spectra that had not been calibrated were excluded. All peaks in the spectra were picked in model generation. For GA-KNN, the GA was used to select the peak combinations; the maximum number of best peaks was evaluated at 10, 20, and 30, and the maximum number of generations was set to 500 so that this limit would not be reached as the stop criterion halting the calculation. The numbers of k-nearest neighbors (k-NN) evaluated were 1, 3, 5, and 7 for each binary class separation. Random mode was chosen for cross-validation of the generated models. The recognition capability and cross-validation values were calculated to evaluate the performance of the models. The optimal discriminative peaks and separation weights provided by the best CCs classification model in this study, as well as by the GBS ST classification models in our previous report (Huang et al., 2020), were taken into consideration when modifying the GBS subtyping MSP into the MSP-M.
MSPs for each of the five major GBS CCs (CC10, CC12, CC17, CC19, CC23) were first created using the BioTyper MSP creation method with default parameters. They were then all selected to create the subtyping MSPs of the five GBS CCs using the BioTyper subtyping MSP creation method with default parameters. To generate the CCs MSPs-M, the specific weights for peaks in the subtyping MSPs were set to 0 or replaced by the weight values of the informative peaks calculated by the optimized algorithms.
To assess the performance of the CCs subtyping MSP and MSP-M, an external validation using 100 GBS isolates from the validation set was performed. Six different spectra for each strain in the validation set were loaded into the Biotyper software and categorized. The CCs subtyping MSP and MSP-M databases were searched for the best match with each mass spectrum. A cut-off log score value of 2.4 was recommended to determine the subtyping prediction of the mass spectra. External calibration of the spectra was performed routinely using the Bacterial Test Standard (Bruker Daltonics). A flow chart of the study design is displayed in Figure 1, and a more detailed description of both subtyping methods is given in Supplementary Document S1.
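The criterion used above to flag a peak as statistically discriminative can be written as a small decision rule. The following is a minimal sketch that follows the wording of the text literally; the function and argument names are hypothetical, and the p-values are assumed to have been computed beforehand (in the study they were reported by ClinProTools).

```python
# Illustrative sketch only; names are hypothetical, rule follows the text.

def is_discriminative(p_anderson_darling: float,
                      p_anova: float,
                      p_kruskal_wallis: float,
                      alpha: float = 0.05) -> bool:
    """Return True when a peak meets the stated significance criterion.

    - Anderson-Darling p > alpha (normality not rejected): the peak is kept
      if the t-/ANOVA or the Wilcoxon/Kruskal-Wallis p-value is <= alpha.
    - Anderson-Darling p <= alpha (not normally distributed): only the
      Wilcoxon/Kruskal-Wallis p-value is considered.
    """
    if p_anderson_darling > alpha:
        return p_anova <= alpha or p_kruskal_wallis <= alpha
    return p_kruskal_wallis <= alpha
```

For instance, a peak with a low Anderson-Darling p-value and a significant ANOVA result but a non-significant Kruskal-Wallis result would not be retained under this rule, because the parametric test is not trusted for non-normal intensities.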

Creation of GBS CCs classification models
After the mass spectra were loaded into ClinProTools, the automatically selected average mass spectra for each strain of the five GBS CCs were used to calculate and generate pattern recognition models with the GA, SNN, and QC algorithms. Their performances are shown in Table 1. The GA (20)-KNN1 model generated by the GA algorithm displayed the highest recognition capability (100%) and cross-validation rate (80.89%) (Supplementary Table S3). Therefore, it was selected to provide the optimal set of peaks for discriminating the five GBS CCs. The GA (20)-KNN1 model identified 20 informative peaks (Table 1). Ten of the twenty differential peaks showed low p-values (p < 0.05) for the Anderson-Darling test, indicating that the data were not normally distributed; hence, the p-value of the Wilcoxon/Kruskal-Wallis test was preferred over that of the t-/ANOVA test when considering them as informative peaks. Another eight peaks showed p-values greater than 0.05 for the Anderson-Darling test (normally distributed); hence, the t-/ANOVA test was favored over the Wilcoxon/Kruskal-Wallis test (Table 2). The statistical analysis revealed that the intensity differences of 91 peaks were statistically significant among the five CCs (Supplementary xls 1).
The most distinctive peaks for the five GBS CCs are shown in Figure 2. In accordance with our previous finding (Huang et al., 2020), a peak biomarker at 6,250 m/z was present while a peak at 6,892 m/z was absent in all CC10 strains. Two powerful differential peaks at 2,955 m/z and 5,912 m/z exhibited significantly higher intensity in CC17 strains. A peak biomarker of ST17 at 7,620 m/z proved to be unique to the CC17 class and could coexist with peptide ions at 7,635-7,644 m/z in some CC17 strains. The peak shift from 7,620 m/z to 7,635-7,644 m/z was confirmed in all non-CC17 strains, consistent with previous reports (Lartigue et al., 2011; Huang et al., 2020). In the paired two-dimensional peak distribution diagrams, the distribution of the two most divergent peaks supported their capacity to differentiate isolates belonging to the five major CCs (Figure 3). The peak pair at 6,250 and 6,892 m/z could clearly distinguish the mass spectra of CC10 strains, whereas the peaks at 2,955 and 5,912 m/z could discriminate strains from CC17 and CC23, and the peaks at 7,620 and 7,638 m/z could separate CC17 strains from the other four major CCs (Figure 3).

Automated classification of GBS spectra into 5 CCs by MALDI Biotyper subtyping MSPs and the modified library MSP-M
The 24 mass spectra of each strain from the five CCs in the reference set were selected to establish the CC-specific MSP signatures and the associated CCs subtyping MSPs. For the GBS CCs subtyping MSPs, the weights of the peaks distinguishing the five major CC subclones were automatically set to 0 or to the values listed in the supplemental CCs Btmsp files of the GBS five-CC strains. Meanwhile, following the previous study (Camoez et al., 2016), these peaks were also manually edited to generate the CCs subtyping MSPs-M based on the informative peaks in the independent GBS STs GA models (Huang et al., 2020) and the CCs models (Table 1 and Supplementary xls 1). The positive predictive values (PPVs) for the five CCs obtained with the CCs subtyping MSPs and MSPs-M are listed in Table 3. The PPV for the GBS CCs subtyping MSP-M was greater than that for the subtyping MSP for CC10 (99.21% and 93.65%, respectively), similar for CC12 (79.55% and 81.06%), CC17 (93.55% and 94.09%), and CC19 (92.59% and 95.37%), and lower for CC23 (66.67% and 83.33%). The peak data of the modified CCs library MSP-M also supported distinctive peaks among the five GBS CCs, including peaks at 6,891 (6,888-6,895) m/z, 6,250 m/z, 3,124 m/z, 5,912 m/z, and 7,620 m/z (Supplementary xls 1). Because CC23 strains lack statistically informative peaks, a small specific peak at 6,261 m/z identified in the GBS CCs subtyping MSPs for CC12 and CC23 should be kept during peak editing when generating the GBS CCs subtyping MSP-M, for better CCs prediction performance.

FIGURE 1
The flow chart of the study design.
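For readers who wish to reproduce per-CC PPVs from their own classification results, the calculation can be sketched as follows. This is a minimal illustration assuming the standard definition PPV = TP / (TP + FP), computed per predicted class; whether the study counted spectra or strains is not stated here, and the function name and the example labels are hypothetical.

```python
# Illustrative sketch only; uses the standard PPV definition TP / (TP + FP).
from collections import Counter


def ppv_per_class(true_labels, predicted_labels):
    """Return {class: PPV} for every class that was predicted at least once."""
    true_positives = Counter()
    predicted_counts = Counter()
    for truth, predicted in zip(true_labels, predicted_labels):
        predicted_counts[predicted] += 1
        if truth == predicted:
            true_positives[predicted] += 1
    return {cc: true_positives[cc] / n for cc, n in predicted_counts.items()}


# Hypothetical labels, not data from the study.
truth = ["CC10", "CC10", "CC17", "CC23"]
calls = ["CC10", "CC10", "CC17", "CC17"]
result = ppv_per_class(truth, calls)
```

A class that is never predicted has no defined PPV and is therefore absent from the returned dictionary.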
TABLE 1 The 20 discriminative peaks calculated by optimized GA-KNN models.

Peak no. | m/z | Start mass | End mass | Weight

Discussion
A previous study used MALDI-TOF MS to automatically classify four main MRSA CC lineages (CC5, CC8, CC22, and CC398) based on a Biotyper main spectra (MSP) database with modified CC-informative peaks (Camoez et al., 2016). MALDI-TOF MS has also been reported to discriminate four major neonatal GBS STs in China using ClinProTools (Huang et al., 2020). However, no automated MALDI-TOF MS-based statistical classification methodology has yet been reported for the fast subtyping of GBS strains. In this study, we evaluated the performance of two subtyping MSPs for automated classification of the five major GBS CCs (CC10, CC12, CC17, CC19, CC23) in China using MALDI Biotyper alone or in combination with ClinProTools software.
Both the GBS CCs subtyping MSP and its modified library MSP-M could be applied for the fast automated prediction of GBS CC lineages, with clinically acceptable PPVs higher than 90% for CC10, CC17, and CC19, and lower PPVs of roughly 80% for CC12 and CC23, which would be more clinically acceptable than the traditional non-automated classification strategies described previously for GBS subspecies discrimination (Huang et al., 2020). The generation of the CCs subtyping MSP is less complicated than that of its modified library MSP-M, because it does not require statistically distinguishable peaks and their weights derived from classification models built with enough strains for each group (n ≥ 20), whereas these peaks and weights are needed to set up the modified CCs library MSP-M. Therefore, the number of strains per CC in the reference set could be as few as five when generating the CCs MSP, which facilitates the recognition of certain CCs, such as CC23, for which not enough strains are available (n < 20). The insufficient number of neonatal GBS CC23 strains (5 < n < 20) ultimately resulted in a lack of statistically distinguishable peaks specific for ST23 (Huang et al., 2020) and, subsequently, in a poorer modification of the informative peak parameters for CC23 than for the other CCs in the CCs subtyping MSP-M, leading to poorer performance of the CCs subtyping MSP-M (PPV 66.67%) than of the MSP in CC23 prediction.
The CCs subtyping MSP is effective at identifying the major GBS CCs automatically, especially CC12, CC19, and CC23, which lack effective informative peaks. Similar to the previously modified CCs MSP for MRSA, which relies on a robust statistical analysis and the automated use of MALDI-TOF MS to discriminate the major MRSA clonal lineages (Camoez et al., 2016), the modified GBS CCs MSP-M library is a promising approach for GBS CCs prediction. It facilitates correct recognition, especially of the CC10 and CC17 subspecies with biomarker peaks such as m/z 6,250 (Huang et al., 2020) or m/z 7,620 (Lartigue et al., 2011), respectively, for mass spectra with log scores over 2.4. Therefore, although creating the modified CCs MSP-M is more time-consuming, its classification performance would be more stable than that of the unmodified CCs MSP for GBS CCs, especially CC10, since identification by the CCs MSP-M is less affected by the mass spectra score and the equipment status of the MALDI-TOF mass spectrometer (data not shown), allowing better clinical acceptability among different laboratories than the previous GBS subtyping strategies (Lartigue et al., 2011; Huang et al., 2020; Liu et al., 2022). However, owing to reduced peak intensity at 5,912 m/z or the absence of the unique peak at 7,620 m/z, CC17 strains may be mistakenly categorized as non-CC17 strains, such as CC23 or CC19, by both the CCs subtyping MSP and MSP-M. Moreover, the CCs subtyping MSP and MSP-M mistakenly identified isolates from other sporadic CCs with log scores below 2.4 and over 2.4, respectively (Supplementary Table S2). For example, a small number of isolates from CC1 (ST2, n = 2; ST156, n = 1) and ST4 (n = 1) were mistakenly assigned to CC10 with log score values below 2.4. This is probably due to the presence of the CC10 peak biomarker at m/z 6,250 (Huang et al., 2020) in those CCs. In addition, both methods mistakenly classified, with log scores over 2.4, STs that belong to the five major CCs but were not among the STs in the reference set (Supplementary Table S4). To overcome this limitation, as many representative ST strains belonging to the different CCs as possible should be included in the reference set during subspecies library creation to improve the performance of the GBS CCs subtyping MSP and MSP-M. With the CCs subtyping MSP, a mass spectrum log score greater than 2.4 was preferred for more accurate prediction of GBS CC subtypes. Moreover, multiple measurements of each GBS strain were preferred so that more mass spectra could be enrolled into the validation set; where possible, the CC could then be assigned according to the majority classification result for each GBS strain, which helps to decrease the incorrect prediction rate and improve the performance of GBS CCs prediction. To facilitate the application of the established GBS CCs subtyping MSP and MSP-M models, a preliminary reference procedure for the rough recognition of GBS CCs by both methods is displayed in Figure 4.
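The majority-based assignment suggested above, combined with the log score cut-off, can be sketched as follows. This is an illustration under stated assumptions rather than the procedure of Figure 4: the function name and input format are hypothetical, spectra scoring exactly 2.4 are treated as passing the cut-off, and returning no call on a tie is a design choice made for this sketch.

```python
# Illustrative sketch only; names, input format, and tie handling are assumptions.
from collections import Counter


def predict_cc(best_matches, cutoff=2.4):
    """Assign a CC to one strain from its replicate spectra.

    best_matches: list of (cc_label, log_score), one best match per spectrum.
    Spectra below the cut-off are ignored; the most frequent remaining label
    is returned, or None when nothing passes or the top labels are tied.
    """
    accepted = [cc for cc, score in best_matches if score >= cutoff]
    if not accepted:
        return None
    ranked = Counter(accepted).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous: repeat the measurement instead of guessing
    return ranked[0][0]


# Hypothetical replicate results for one strain (six spectra).
example = [("CC17", 2.50), ("CC17", 2.45), ("CC19", 2.60),
           ("CC17", 2.10), ("CC23", 2.20), ("CC17", 2.41)]
call = predict_cc(example)
```

Here three CC17 spectra and one CC19 spectrum pass the cut-off, so the strain is called CC17; the two low-scoring spectra do not contribute to the vote.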
In summary, we have compared the performance of two GBS CCs subtyping MSPs for automated GBS CCs categorization and provided suggestions for better optimization of peak parameters and creation of the GBS CCs subtyping MSP. This automated MALDI-TOF MS approach could be implemented in routine microbiology laboratories equipped with MALDI-TOF mass spectrometry, simultaneously with microbial identification, allowing fast identification of GBS CCs. Moreover, CC17 and CC19 were the predominant CCs in neonatal meningitis (Manning et al., 2009; Ji et al., 2019) and in colonized pregnant women (Teatero et al., 2017), and tended to be levofloxacin susceptible but tetracycline resistant, or levofloxacin resistant but tetracycline susceptible, respectively (Wang et al., 2018), whereas CC10 showed high radezolid MICs (Zheng et al., 2020) and fluoroquinolone (FQ) resistance (Arias et al., 2019). Therefore, the automated and fast subtyping of GBS CCs will permit timely selection of optimal antibiotics for perinatal and neonatal GBS infections, supporting better clinical prevention and control of GBS. The subtyping MSP could be a promising tool for the automated prediction of microbial subtypes in the future.

FIGURE 4
The preliminary procedure for recognizing GBS CCs from GBS mass spectra according to the established GBS CCs subtyping MSP and MSP-M models.

FIGURE 2
Averaged spectra plots showing the presence or absence of corresponding peak biomarkers for matrix-assisted laser desorption ionization time-of-flight mass spectrometry discrimination of the five main Group B Streptococcus clonal complexes in the optimal genetic algorithm model. CC10 (red), CC12 (pink), CC17 (green), CC19 (blue), CC23 (yellow). The x-axis shows the mass-to-charge ratio values (m/z) and the y-axis indicates the intensities of peaks expressed in arbitrary intensity units.

FIGURE 3
Two-dimensional peak distribution diagrams displaying the pair distribution for the best separating peaks in the optimal genetic algorithm model. The ellipses represent the standard deviation of the class average of the peak areas/intensities. CC10 (red), CC12 (green), CC17 (blue), CC19 (yellow), CC23 (pink). The peak numbers and m/z values are indicated on the x- and y-axes, respectively.

TABLE 2
Peak statistics for the 20 discriminative peaks calculated by optimized GA-KNN models in ClinProTools 3.0.
a Peak number: correlative numbering of the peak in the average spectra.

TABLE 3
Comparison of the validation performance of automated classification of GBS spectra into five major CCs by Biotyper subtyping MSPs and MSP-M.