Machine learning-based classification reveals distinct clusters of non-coding genomic allelic variations associated with Erm-mediated antibiotic resistance

ABSTRACT The erythromycin resistance RNA methyltransferase (erm) confers cross-resistance to all therapeutically important macrolides, lincosamides, and streptogramins (MLS phenotype). The expression of erm is often induced by macrolide-mediated ribosome stalling in the upstream co-transcribed leader sequence, which triggers a conformational switch of the intergenic RNA hairpins that allows translational initiation of erm. We investigated the evolutionary emergence of the upstream erm regulatory elements and the impact of allelic variation on erm expression and the MLS phenotype. Through systematic profiling of the upstream regulatory sequences across all known erm operons, we observed that specific erm subfamilies, such as ermB and ermC, have independently evolved distinct configurations of small upstream ORFs and palindromic repeats. A population-wide genomic analysis of the upstream ermB regions revealed substantial non-random allelic variation at numerous positions. Using machine learning-based classification coupled with RNA structure modeling, we found that many alleles cooperatively influence the stability of alternative RNA hairpin structures formed by the palindromic repeats, which, in turn, affects the inducibility of ermB expression and MLS phenotypes. Subsequent experimental validation of 11 randomly selected variants demonstrated 91% accuracy in predicting MLS phenotypes. Furthermore, we uncovered a mixed distribution of MLS-sensitive and MLS-resistant ermB loci within the evolutionary tree, indicating repeated and independent evolution of MLS resistance. Taken together, this study not only elucidates the evolutionary processes driving the emergence and development of MLS resistance but also highlights the potential of using non-coding genomic allele data to predict antibiotic resistance phenotypes.
IMPORTANCE Antibiotic resistance (AR) poses a global health threat as the efficacy of available antibiotics has rapidly eroded due to the widespread transmission of AR genes. Using Erm-dependent MLS resistance as a model, this study highlights the significance of non-coding genomic allelic variations. Through a comprehensive analysis of upstream regulatory elements within the erm family, we elucidated the evolutionary emergence and development of AR mechanisms. Leveraging population-wide machine learning (ML)-based genomic analysis, we transformed substantial non-random allelic variations into discernible clusters of elements, enabling precise prediction of MLS phenotypes from non-coding regions. These findings offer deeper insight into AR evolution and demonstrate the potential of harnessing non-coding genomic allele data for accurately predicting AR phenotypes.

• Upload point-by-point responses to the issues raised by the reviewers in a file named "Response to Reviewers," NOT in your cover letter.
• Upload a compare copy of the manuscript (without figures) as a "Marked-Up Manuscript" file.
• Upload a clean .DOC/.DOCX version of the revised manuscript and remove the previous version.
• Each figure must be uploaded as a separate, editable, high-resolution file (TIFF or EPS preferred), and any multipanel figures must be assembled into one file.
• Any supplemental material intended for posting by ASM should be uploaded with their legends separate from the main manuscript. You can combine all supplemental material into one file (preferred) or split it into a maximum of 10 files with all associated legends included.
For complete guidelines on revision requirements, see our Submission and Review Process webpage. Submission of a paper that does not conform to guidelines may delay acceptance of your manuscript.
Data availability: ASM policy requires that data be available to the public upon online posting of the article, so please verify all links to sequence records, if present, and make sure that each number retrieves the full record of the data. If a new accession number is not linked or a link is broken, provide mSystems production staff with the correct URL for the record. If the accession numbers for new data are not publicly accessible before the expected online posting of the article, publication may be delayed; please contact production staff (mSystems@asmusa.org) immediately with the expected release date.
Publication Fees: For information on publication fees and which article types are subject to charges, visit our website. If your manuscript is accepted for publication and any fees apply, you will be contacted separately about payment during the production process; please follow the instructions in that e-mail. Arrangements for payment must be made before your article is published.
ASM Membership: Corresponding authors may join or renew ASM membership to obtain discounts on publication fees. Need to upgrade your membership level? Please contact Customer Service at Service@asmusa.org.
The ASM Journals program strives for constant improvement in our submission and publication process. Please tell us how we can improve your experience by taking this quick Author Survey.
Thank you for submitting your paper to mSystems.

Sincerely,
Youjun Feng
Editor
mSystems

Reviewer #1 (Comments for the Author): The manuscript entitled "Machine learning-based classification reveals distinct clusters of non-coding genomic allelic variations associated with Erm-mediated antibiotic resistance" investigates the regulatory mechanism of the AR gene erm, which encodes an RNA methyltransferase and confers resistance to antibiotics of the macrolide, lincosamide, and streptogramin groups (MLS phenotype). The authors found that ermB and other erm genes have independently developed distinct configurations of uORFs and palindromic repeats during evolution. They also discovered that many alleles cooperatively influence the stability of alternative hairpin structures, thereby directly affecting ermB expression inducibility and MLS phenotypes. Overall, I find the motivation of this study well justified. The work is interesting. However, the manuscript can be further improved in the following aspects.
Major issues:
1. Risk levels of ARGs should be identified and discussed.
2. To increase the transparency of the analyses, the authors should provide example source code and the metadata of all samples. The full analysis pipeline and all scripts should be deposited in GitHub.
3. When an abbreviation first appears, please also give the full name.
4. More details about the software, including specific parameters, should be provided in the Methods section to enable readers to repeat the analysis.
5. It appears that the data have not been released as of now.
6. The Abstract is too long, and its sentences need to be rewritten and refined.

Reviewer #2 (Comments for the Author): In the face of surging antibiotic resistance, most current studies focus solely on the function of the coding regions of antibiotic resistance genes. In this investigation, Tan and coworkers elucidate the evolutionary emergence of the regulatory elements of ARGs using machine learning, with the erm gene as a proof of concept. The study is interesting and informative, and it sheds light on the evolutionary trajectory that drives the emergence and development of MLS-induced resistance. The manuscript is well organized. However, some issues need to be addressed before it can be considered for publication.
1. Exact experimental details are missing from the current abstract. The authors are encouraged to present more insights from the present study rather than background information.
2. Line 186: please describe in detail the one-hot encoding method used to transform the alignment data.
3. Please state the version of each software package or program used throughout the manuscript.
4. As we know, ORF prediction by software can introduce bias, such as high false-positive rates. Since uORFs are generally short, the false-positive rate of the prediction is expected to be high. Have the authors performed corresponding experiments to provide evidence of the uORFs, such as ribosome profiling or proteome/peptidome analysis?
5. What is the rationale for deleting the last 12 nucleotides of the ermB upstream sequences for RNA secondary structure prediction?
6. It would be better to use translatome technology, rather than protein quantification, to evaluate the translation of ermB mRNA.
7. The authors are encouraged to include a discussion of the rationale for, and benefits of, using machine learning.
8. The English of the manuscript needs to be carefully polished.
Response: As per your suggestion, we now include the specific version information for the programs/software that were used in this study.
4. As we know, ORF prediction by software can introduce bias, such as high false-positive rates. Since uORFs are generally short, the false-positive rate of the prediction is expected to be high. Have the authors performed corresponding experiments to provide evidence of the uORFs, such as ribosome profiling or proteome/peptidome analysis?
Response: We agree with the reviewer that short ORF prediction is error-prone and requires additional validation. One of the senior authors previously used ribosome profiling (Ribo-seq) to map the unannotated and misannotated ORFs in Staphylococcus aureus (the organism used in this study), revealing many small uORFs and experimentally confirming the production of the corresponding small proteins (PMID: 25313041, PMID: 27001516, PMID: 38252661). Applying genome-wide Ribo-seq to validate uORF prediction in this study is unnecessary for the following reasons: (i) this study focuses solely on the uORFs of erm genes; (ii) the two-gene erm operon, consisting of the uORF and its downstream erm, has been well studied; and (iii) the translation of erm uORFs and the uORF protein products have been detected by mass spectrometry (PMID: 24239289), cryo-EM structures of uORF-stalled ribosomes (PMID: 27380950, PMID: 25306253, PMID: 24961372), toeprinting (PMID: 24662426, PMID: 24961372, PMID: 21292164), cell-free translation (PMID: 18439898, PMID: 27645242), and translational fusion of various reporter genes (PMID: 26727240, PMID: 24239289, PMID: 20676057, PMID: 34262551, PMID: 27645242). Note that this reference list is not exhaustive.
5. What is the rationale for deleting the last 12 nucleotides of the ermB upstream sequences for RNA secondary structure prediction?
Response: Thank you for your comments. The most important feature of ermB gene regulation is the use of alternative RNA hairpin structures. However, the current RNA structure prediction program cannot predict such alternation between the two RNA structures; it provides only one energy-optimized model. In other words, when we use the full-length ermB upstream sequence, the program predicts the formation of hairpin-2, not hairpin-1, in most cases. Thus, to predict the formation of the hairpin-1 structure, we remove the last 12 nucleotides to prevent formation of the hairpin-2 structure. This allows us to compute the energy of both RNA hairpin structures, from which we can infer the regulation of ermB expression. The Methods section on this has been revised accordingly: "Specifically, we used full length ermB upstream sequences (1-211 bp) to predict the RNA secondary structure, resulting in all sequences only forming hairpin-2 structure but not hairpin-1. Thus, we next used the truncated version, which deleted the last 12 nucleotides in ermB upstream sequences (1-199 bp), to prevent the formation of the hairpin-2 and allow prediction for the hairpin-1 structure."
6. It would be better to use translatome technology, rather than protein quantification, to evaluate the translation of ermB mRNA.
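As an illustration of the truncation strategy described in the response to comment 5, the following is a minimal sketch, not the authors' pipeline. It prepares the full-length input (positions 1-211, expected to fold into hairpin-2) and the 3'-truncated input (positions 1-199, which permits hairpin-1) and passes each to a folding routine. The function names `hairpin_inputs` and `hairpin_energies` are hypothetical, and `fold` is an assumed placeholder for whichever structure prediction program is used; it is taken to return a minimum free energy for a given sequence.

```python
from typing import Callable, Dict

# Values taken from the response: the ermB upstream region spans
# positions 1-211, and the last 12 nucleotides are removed to block hairpin-2.
FULL_LENGTH = 211
TRUNCATE_BY = 12


def hairpin_inputs(upstream: str, truncate_by: int = TRUNCATE_BY) -> Dict[str, str]:
    """Return the two sequences to fold (hypothetical helper).

    'hairpin2_input' is the full-length sequence; 'hairpin1_input' is the
    same sequence with the last `truncate_by` nucleotides removed.
    """
    seq = upstream.strip().upper().replace("T", "U")  # DNA -> RNA alphabet
    if len(seq) <= truncate_by:
        raise ValueError("sequence is too short to truncate")
    return {
        "hairpin2_input": seq,                 # e.g. positions 1-211
        "hairpin1_input": seq[:-truncate_by],  # e.g. positions 1-199
    }


def hairpin_energies(upstream: str, fold: Callable[[str], float]) -> Dict[str, float]:
    """Fold both inputs with a caller-supplied `fold` function.

    `fold` is an assumption: any callable mapping an RNA sequence to a
    minimum free energy, wrapping the prediction program of choice.
    """
    inputs = hairpin_inputs(upstream)
    return {
        "hairpin2_mfe": fold(inputs["hairpin2_input"]),
        "hairpin1_mfe": fold(inputs["hairpin1_input"]),
    }
```

Comparing the two energies per variant gives one possible stability readout from which inducibility might be inferred; how the energies are combined or thresholded is not specified here and would follow the manuscript's Methods.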