Application of MALDI-TOF MS and machine learning for the detection of SARS-CoV-2 and non-SARS-CoV-2 respiratory infections

ABSTRACT Matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) could aid the diagnosis of acute respiratory infections (ARIs) owing to its affordability and high-throughput capacity. MALDI-TOF MS has been proposed for use on commonly available respiratory samples, without specialized sample preparation, making this technology especially attractive for implementation in low-resource regions. Here, we assessed the utility of MALDI-TOF MS in differentiating severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) vs non-COVID acute respiratory infections (NCARIs) in a clinical lab setting in Kazakhstan. Nasopharyngeal swabs were collected from inpatients and outpatients with respiratory symptoms and from asymptomatic controls (ACs) in 2020–2022. PCR was used to differentiate SARS-CoV-2+ and NCARI cases. MALDI-TOF MS spectra were obtained for a total of 252 samples (115 SARS-CoV-2+, 98 NCARIs, and 39 ACs) without specialized sample preparation. In our first sub-analysis, we followed a published protocol for peak preprocessing and machine learning (ML), trained on publicly available spectra from South American SARS-CoV-2+ and NCARI samples. In our second sub-analysis, we trained ML models on a peak intensity matrix representative of both South American (SA) and Kazakhstan (Kaz) samples. Applying the established MALDI-TOF MS pipeline “as is” resulted in a high detection rate for SARS-CoV-2+ samples (91.0%), but low accuracy for NCARIs (48.0%) and ACs (67.0%) by the top-performing random forest model. 
After re-training of the ML algorithms on the SA-Kaz peak intensity matrix, the accuracy of detection by the top-performing support vector machine with radial basis function kernel model was at 88.0%, 95.0%, and 78% for the Kazakhstan SARS-CoV-2+, NCARI, and AC subjects, respectively, with a SARS-CoV-2 vs rest receiver operating characteristic area under the curve of 0.983 [0.958, 0.987]; a high differentiation accuracy was maintained for the South American SARS-CoV-2 and NCARIs. MALDI-TOF MS/ML is a feasible approach for the differentiation of ARI without specialized sample preparation. The implementation of MALDI-TOF MS/ML in a real clinical lab setting will necessitate continuous optimization to keep up with the rapidly evolving landscape of ARI.

IMPORTANCE In this proof-of-concept study, the authors used matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF MS) and machine learning (ML) to identify and distinguish acute respiratory infections (ARI) caused by SARS-CoV-2 versus other pathogens in low-resource clinical settings, without the need for specialized sample preparation. The ML models were trained on a varied collection of MALDI-TOF MS spectra from studies conducted in Kazakhstan and South America. Initially, the MALDI-TOF MS/ML pipeline, trained exclusively on South American samples, exhibited diminished effectiveness in recognizing non-SARS-CoV-2 infections from Kazakhstan. Incorporation of spectral signatures from Kazakhstan substantially increased the accuracy of detection. These results underscore the potential of employing MALDI-TOF MS/ML in resource-constrained settings to augment current approaches for detecting and differentiating ARI.
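For readers less familiar with the metrics quoted in the abstract, the following is a minimal sketch of how a per-class detection accuracy and a "SARS-CoV-2 vs rest" ROC AUC could be computed from classifier outputs. It is not the authors' pipeline: the function names, the toy labels, and the toy scores are all hypothetical, and the AUC is obtained through the standard pairwise-ranking (Mann-Whitney) identity rather than any particular library.

```python
from collections import defaultdict


def per_class_accuracy(y_true, y_pred):
    """Fraction of samples of each true class that were assigned that class."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}


def one_vs_rest_auc(y_true, scores, positive):
    """ROC AUC for `positive` vs all other classes.

    Uses the identity AUC = P(score_pos > score_neg), counting ties as 0.5.
    """
    pos = [s for t, s in zip(y_true, scores) if t == positive]
    neg = [s for t, s in zip(y_true, scores) if t != positive]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data (NOT from the study): true class, predicted class,
# and a model score for the SARS-CoV-2+ class for eight samples.
y_true = ["SARS-CoV-2+", "SARS-CoV-2+", "SARS-CoV-2+",
          "NCARI", "NCARI", "AC", "AC", "AC"]
y_pred = ["SARS-CoV-2+", "SARS-CoV-2+", "NCARI",
          "NCARI", "NCARI", "AC", "SARS-CoV-2+", "AC"]
covid_scores = [0.95, 0.80, 0.40, 0.30, 0.10, 0.05, 0.60, 0.20]

acc = per_class_accuracy(y_true, y_pred)
auc = one_vs_rest_auc(y_true, covid_scores, positive="SARS-CoV-2+")
print(acc)
print(round(auc, 3))
```

The per-class figure here is the share of each true class that was correctly assigned, which is how the separate percentages for SARS-CoV-2+, NCARI, and AC samples are read in this sketch; the confidence interval reported alongside the AUC would need an additional resampling step that is not shown.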


Revision Guidelines
To submit your modified manuscript, log into the submission site at https://spectrum.msubmit.net/cgi-bin/main.plex. Go to Author Tasks and click the appropriate manuscript title to begin. The information you entered when you first submitted the paper will be displayed; update this as necessary. Note the following requirements:
• Upload point-by-point responses to the issues raised by the reviewers in a file named "Response to Reviewers," NOT IN YOUR COVER LETTER
• Upload a compare copy of the manuscript (without figures) as a "Marked-Up Manuscript" file
• Upload a clean .DOC/.DOCX version of the revised manuscript and remove the previous version
• Each figure must be uploaded as a separate, editable, high-resolution file (TIFF or EPS preferred), and any multipanel figures must be assembled into one file
• Any supplemental material intended for posting by ASM should be uploaded separate from the main manuscript; you can combine all supplemental material into one file (preferred) or split it into a maximum of 10 files, with all associated legends included
For complete guidelines on revision requirements, see our Submission and Review Process webpage. Submission of a paper that does not conform to guidelines may delay acceptance of your manuscript.
Data availability: ASM policy requires that data be available to the public upon online posting of the article, so please verify all links to sequence records, if present, and make sure that each number retrieves the full record of the data. If a new accession number is not linked or a link is broken, provide Spectrum production staff with the correct URL for the record. If the accession numbers for new data are not publicly accessible before the expected online posting of the article, publication may be delayed; please contact production staff (Spectrum@asmusa.org) immediately with the expected release date.
Publication Fees: For information on publication fees and which article types are subject to charges, visit our website. If your manuscript is accepted for publication and any fees apply, you will be contacted separately about payment during the production process; please follow the instructions in that e-mail. Arrangements for payment must be made before your article is published.

ASM Membership:
Corresponding authors may join or renew ASM membership to obtain discounts on publication fees. Need to upgrade your membership level? Please contact Customer Service at Service@asmusa.org.
The ASM Journals program strives for constant improvement in our submission and publication process. Please tell us how we can improve your experience by taking this quick Author Survey.
Thank you for submitting your paper to Spectrum.

Sincerely, Heba Mostafa Editor Microbiology Spectrum
Reviewer #1 (Comments for the Author): 'Application of MALDI-MS and Machine Learning to Detection of SARS-CoV-2 and non-SARS-CoV-2 Respiratory Infections' by Yegorov, et al. is a study about the practical use of MALDI and machine learning to differentiate between respiratory infections caused by SARS-CoV-2 versus those that are not, with the potential to use this technology in a clinical setting.
Discussion points
• You point out that an advantage of this technique could be for limited resource labs in the early stages of a pandemic (Line 233), but how would that be possible without the large pool of samples to train the ML models on, and also with a standard of care test to show true positives?
• Line 236, 254 - you acknowledge the importance of taking geographical location of the sample into consideration, so how would this technology work in the clinical lab? Would there be other options for testing if someone was from or had traveled out of the region when infected? How often would the ML model need to be recalibrated to ensure it wouldn't miss mutations/new strains? Would it be part of a testing algorithm?
Line 73. Remove 'among'
Line 83. How long were samples frozen for before testing? Was there only one freeze/thaw?
Table 1. Discrepancies in how percentages are written - some with a '.' and some with a ',' - I would change the three percentages that have a ',' (SARS-CoV-2+ column) to having a '.' for consistency.
Reviewer #2 (Comments for the Author): The authors present an interesting extension of prior MALDI-TOF-MS and Machine Learning (ML) methods for the identification of COVID-19 and other non-COVID respiratory viruses in Kazakhstan. This work specifically highlights the importance of training ML algorithms on geographically diverse datasets. Overall, the findings of the study are well supported. In the attached review are a few recommendations for expanded discussion, updates to figures/tables to improve clarity, and other minor changes.
The authors present an interesting extension of prior MALDI-TOF-MS and Machine Learning (ML) methods for the identification of COVID-19 and other non-COVID respiratory viruses in Kazakhstan. This work specifically highlights the importance of training ML algorithms on geographically diverse datasets. Overall, the findings of the study are well supported. Below are a few recommendations for expanded discussion, updates to figures/tables to improve clarity, and other minor changes.

Main points
• In the discussion, the authors may consider commenting on the potential impact of COVID-19 variants on the performance of an algorithm trained on specimens from 2020, 2021, and 2022. In other words, are new variants a concern for the long-term performance of MALDI-TOF-MS based identifications?
• Can the authors comment separately on the impact of (1) increased training set size and (2) the geographic diversity of combined datasets on the overall performance of the retrained algorithm? In other words, can any of the improved performance of the retrained algorithm be attributed to having a larger training set size (from the South American and Kazakhstan dataset)?
• The authors state that MALDI-TOF MS machine learning may be utilized in early stages of endemics/pandemics. Can the authors comment on the importance of the availability of characterized datasets for training such algorithms?
Recommend changes to tables/figures:
• Line 95. Missing space between "a" and "18-20 kV"
• Line 155: Italicize "et al"

We are grateful to both reviewers for their comments and feedback on our manuscript. Please kindly find our point-by-point responses to the reviewers' remarks below.

Reviewer #1 (Comments for the Author):
• You point out that an advantage of this technique could be for limited resource labs in the early stages of a pandemic (Line 233), but how would that be possible without the large pool of samples to train the ML models on, and also with a standard of care test to show true positives?
Authors' response: The reviewer raises a very valid point. Our MALDI-TOF MS approach does not target specific viral structures but relies primarily on mass spectrometric signatures associated with ARI-induced perturbations in the nasal mucosa. This feature of the method makes it potentially valuable at the early stages of a pandemic when other tests (e.g., pathogen-specific molecular assays) may yet be unavailable. However, as the reviewer rightly points out, further implementation of our approach would require access to sufficiently large training datasets, which is an important limitation to keep in mind. We have now highlighted these points in the Discussion (pp 9-10, lines 223-231 and p11, lines 258-260).
• Line 236, 254 - you acknowledge the importance of taking geographical location of the sample into consideration, so how would this technology work in the clinical lab? Would there be other options for testing if someone was from or had traveled out of the region when infected? How often would the ML model need to be recalibrated to ensure it wouldn't miss mutations/new strains? Would it be part of a testing algorithm?
Authors' response: Thank you for raising this question. We agree with the reviewer that the ML models would need to be recalibrated frequently to account for any changes to the ARI landscape. Again, as mentioned in our response above, the nature of our approach (which does not target specific pathogens but assesses molecular changes in the nasal mucosal environment) would, in theory, make it relatively robust to changes in pathogen strains. We have now incorporated these points into the Discussion (pp 9-10, lines 227-231, and the Limitations section).

Line 73. Remove 'among'
Authors' response: Thank you, done!

Line 83. How long were samples frozen for before testing? Was there only one freeze/thaw?
Authors' response: Thank you for this question. Samples were collected and stored at -80°C over the course of the study (2020-2022), and subsequently processed in batches once all samples had been collected. There was only one freeze/thaw cycle prior to MALDI-TOF MS. We have highlighted this in the Methods (p4, lines 87-88).

Table 1. Discrepancies in how percentages are written - some with a '.' and some with a ',' - I would change the three percentages that have a ',' (SARS-CoV-2+ column) to having a '.' for consistency.
Authors' response: Thank you for noting this discrepancy; it has been corrected.

Reviewer #2 (Comments for the Author):
Authors' response: We are thankful for the reviewer's positive view of our work! Below, we provide our detailed responses to each of the reviewer's comments.

Main points
In the discussion, the authors may consider commenting on the potential impact of COVID-19 variants on the performance of an algorithm trained on specimens from 2020, 2021, and 2022. In other words, are new variants a concern for the long-term performance of MALDI-TOF-MS based identifications?
Authors' response: Thank you for this question! We have now added this point to the discussion (p10, lines 227-231). As also mentioned in our response to Reviewer #1, our MALDI-TOF MS approach does not target specific viral structures but relies primarily on mass spectrometric signatures associated with ARI-induced perturbations in the nasal mucosa. Therefore, we believe that it would be more tolerant to changes in the circulating viral strains compared to viral structure-specific methods.
Can the authors comment separately on the impact of (1) increased training set size and (2) the geographic diversity of combined datasets on the overall performance of the retrained algorithm? In other words, can any of the improved performance of the retrained algorithm be attributed to having a larger training set size (from the South American and Kazakhstan dataset)?
Figure 1. Ensure that appropriate BioRender license is obtained prior to publication and cited appropriately.
Authors' response: Thank you, we have now included the BioRender citation in the Acknowledgements.
Caption: Please define "NPS" in the figure caption.
Figure 2 and 3. Please increase font size so that it is legible.
Authors' response: Thank you, done!

Your manuscript has been accepted, and I am forwarding it to the ASM production staff for publication. Your paper will first be checked to make sure all elements meet the technical requirements. ASM staff will contact you if anything needs to be revised before copyediting and production can begin. Otherwise, you will be notified when your proofs are ready to be viewed.
Data Availability: ASM policy requires that data be available to the public upon online posting of the article, so please verify all links to sequence records, if present, and make sure that each number retrieves the full record of the data. If a new accession number is not linked or a link is broken, provide production staff with the correct URL for the record. If the accession numbers for new data are not publicly accessible before the expected online posting of the article, publication may be delayed; please contact ASM production staff immediately with the expected release date.
Publication Fees: For information on publication fees and which article types have charges, please visit our website. We have partnered with Copyright Clearance Center (CCC) to collect author charges. If fees apply to your paper, you will receive a message from no-reply@copyright.com with further instructions. For questions related to paying charges through RightsLink, please contact CCC at ASM_Support@copyright.com or toll free at +1-877-622-5543. CCC makes every attempt to respond to all emails within 24 hours.
PubMed Central: ASM deposits all Spectrum articles in PubMed Central and international PubMed Central-like repositories immediately after publication. Thus, your article is automatically in compliance with the NIH access mandate. If your work was supported by a funding agency that has public access requirements like those of the NIH (e.g., the Wellcome Trust), you may post your article in a similar public access site, but we ask that you specify that the release date be no earlier than the date of publication on the Spectrum website.

Embargo Policy:
A press release may be issued as soon as the manuscript is posted on the Spectrum Latest Articles webpage. The corresponding author will receive an email with the subject line "ASM Journals Author Services Notification" when the article is available online.

Sincerely, Heba Mostafa Editor Microbiology Spectrum
Reviewer #2 (Comments for the Author): The reviewers have sufficiently addressed all comments.
Thank you, this point has now been clarified in the Methods (p 3
Thank you, done!
Line 116. Cite R FactoMineR and factoextra
Authors' response: Thank you, done!
-23R1 (Application of MALDI-MS and Machine Learning to Detection of SARS-CoV-2 and non-SARS-CoV-2 Respiratory Infections.)
Dear Dr. Irina Kadyrova:
• Figure 1.
o Ensure that appropriate BioRender license is obtained prior to publication and cited appropriately.
o Caption: Please define "NPS" in the figure caption.
• Figure 2 and 3. Please increase font size so that it is legible.
• Table 1. Please define which values are included in parentheses in the table.
• In the main text, the authors may consider including a table that compares the performance of the original model with the re-trained model.

In the main text, the authors may consider including a table that compares the performance of the original model with the re-trained model.
Authors' response: Thank you for this suggestion. We have now included Table 2 in the manuscript.