Mobile genetic elements define the non-random structure of the Salmonella enterica serovar Typhi pangenome

ABSTRACT Bacterial relatedness measured using select chromosomal loci forms the basis of public health genomic surveillance. While approximating vertical evolution through this approach has proven exceptionally valuable for understanding pathogen dynamics, it excludes a fundamental dimension of bacterial evolution: horizontal gene transfer. Incorporating the accessory genome is the logical remediation and has recently shown promise in expanding epidemiological resolution for enteric pathogens. Employing k-mer-based Jaccard index analysis, and a novel genome length distance metric, we computed pangenome (i.e., core and accessory) relatedness for the globally important pathogen Salmonella enterica serotype Typhi (Typhi), and graphically express both vertical (homology-by-descent) and horizontal (homology-by-admixture) evolutionary relationships in a reticulate network of over 2,200 U.S. Typhi genomes. This analysis revealed non-random structure in the Typhi pangenome that is driven predominantly by the gain and loss of mobile genetic elements, confirming and expanding upon known epidemiological patterns, revealing novel plasmid dynamics, and identifying avenues for further genomic epidemiological exploration. With an eye to public health application, this work adds important biological context to the rapidly improving ways of analyzing bacterial genetic data and demonstrates the value of the accessory genome to infer pathogen epidemiology and evolution.
IMPORTANCE Given bacterial evolution occurs in both vertical and horizontal dimensions, inclusion of both core and accessory genetic material (i.e., the pangenome) is a logical step toward a more thorough understanding of pathogen dynamics. With an eye to public, and indeed, global health relevance, we couple contemporary tools for genomic analysis with decades of research on mobile genetic elements to demonstrate the value of the pangenome, known and unknown, annotated, and hypothetical, for stratification of Salmonella enterica serovar Typhi (Typhi) populations. We confirm and expand upon what is known about Typhi epidemiology, plasmids, and antimicrobial resistance dynamics, and offer new avenues of exploration to further deduce Typhi ecology and evolution, and ultimately to reduce the incidence of human disease.

• Upload point-by-point responses to the issues raised by the reviewers in a file named "Response to Reviewers," NOT in your cover letter.
• Upload a compare copy of the manuscript (without figures) as a "Marked-Up Manuscript" file.
• Upload a clean .DOC/.DOCX version of the revised manuscript and remove the previous version.
• Each figure must be uploaded as a separate, editable, high-resolution file (TIFF or EPS preferred), and any multipanel figures must be assembled into one file.
• Any supplemental material intended for posting by ASM should be uploaded with their legends separate from the main manuscript. You can combine all supplemental material into one file (preferred) or split it into a maximum of 10 files with all associated legends included.
For complete guidelines on revision requirements, see our Submission and Review Process webpage. Submission of a paper that does not conform to guidelines may delay acceptance of your manuscript.
Data availability: ASM policy requires that data be available to the public upon online posting of the article, so please verify all links to sequence records, if present, and make sure that each number retrieves the full record of the data. If a new accession number is not linked or a link is broken, provide mSystems production staff with the correct URL for the record. If the accession numbers for new data are not publicly accessible before the expected online posting of the article, publication may be delayed; please contact production staff (mSystems@asmusa.org) immediately with the expected release date.
Publication Fees: For information on publication fees and which article types are subject to charges, visit our website. If your manuscript is accepted for publication and any fees apply, you will be contacted separately about payment during the production process; please follow the instructions in that e-mail. Arrangements for payment must be made before your article is published.
ASM Membership: Corresponding authors may join or renew ASM membership to obtain discounts on publication fees. Need to upgrade your membership level? Please contact Customer Service at Service@asmusa.org.
The ASM Journals program strives for constant improvement in our submission and publication process. Please tell us how we can improve your experience by taking this quick Author Survey.
Thank you for submitting your paper to mSystems.

Sincerely, Sima Tokajian Editor mSystems
Reviewer #1 (Comments for the Author): In this study, Arancha Peñil-Celis and collaborators use network analysis to study Salmonella Typhi pangenome structure. The authors combined Jaccard Index with a novel metric, genome length distance (GLD), which incorporates differences in genome size into the network analysis. Combining both tools, they are able to dissect pangenome structure at an unprecedented level of resolution. Specifically, the authors illustrated the main role that horizontal gene transfer of mobile genetic elements plays in Salmonella Typhi evolution and stratification. Next, they compare their method with preexisting typing schemes and illustrate the advantages associated with it in terms of studying Salmonella Typhi epidemiology and evolution. In general, I think that this work is relevant and timely, representing an important contribution to the field. I reviewed a previous version of this manuscript for a different journal, and the few issues I raised have already been addressed by the authors in this new version of the manuscript.

Alvaro San Millan
Reviewer #2 (Comments for the Author): In this manuscript (mSystems00365-24), Penil-Celis et al. used Jaccard Index analysis and network visualization coupled with mobile genetic elements characterization to study core and accessory genome structure and diversity within Salmonella enterica serovar Typhi. The study used a large collection of epidemiologically relevant Salmonella enterica serovar Typhi genomes and provided a generalized overview of the pangenome structure of Salmonella enterica serovar Typhi. The findings reported in the manuscript were interesting and likely important for understanding aspects of the evolutionary and ecological dynamics of this host-adapted human pathogen. The study was comprehensive in its analysis and produced robust and unbiased results. Notably, many of the new insights reported in the manuscript would have been undetectable by current, routine methods. Limitations of the approach were adequately addressed in the Discussion section of the manuscript, which was well written, easy to read, and within the scope of mSystems.

Point-for-point responses
We thank the Editor and Reviewers for their comments on our manuscript.
Singletons or clusters containing fewer than five members are genomes that are uncommon in Typhi and contain either a large number of SNPs or unique elements not found in the majority of Typhi genomes or a mix of these factors. The presence of singletons may be influenced by 1) incorrect assignment to the Salmonella serovar, 2) sample bias, 3) the thresholds applied, 4) sequencing errors, and 5) genetic diversity. We have added a sentence regarding this in Materials and Methods, section Network visualization and community detection (The presence of singletons and small communities may be influenced by incorrect assignment to the Salmonella serovar, sample bias, the thresholds applied, sequencing errors, and the intrinsic genetic diversity of the samples).

Answers to the Editor
1) There are different tools for in silico serotyping of Salmonella spp., each with varying performance compared to routine laboratory serotyping results. Incorrectly assigning genomes to the serovar compromises the integrity of the database and increases the number of singletons. In this work, Typhi isolates selected for the US dataset were typed using SeqSero2, which is the in silico method employed in the US enteric pathogen surveillance system (PulseNet). We have added a sentence in Materials and Methods, section Whole genome sequencing to include this information (Typhi serotype for all study genomes was confirmed using SeqSero2 v0.1 (49) and genomes were further genotyped using the updated GenoTyphi scheme). The 38 singletons from our US dataset (listed in a table below) were re-evaluated with the Salmonella In Silico Typing Resource (SISTR). All of them were confirmed to be serotype Typhi.
2) Selection of genomes skewed toward those with higher variability can bias the sampling, resulting in many singletons, especially if strict thresholds are applied to the network.On the other hand, analyzing genomes from a single geographic location with low genetic diversity could result in singletons if a few highly variable imported cases are included.
3) The number of singletons increases with stricter thresholds; the higher the thresholds, the sparser the network becomes. Applying successive thresholds (e.g., Jaccard Index and Genome Length Distance) also highlights the differences between genomes, thereby increasing the probability of obtaining singletons. In this work, we selected a JI threshold at which less than 2% of the genomes were not assigned to a community. Additionally, this threshold was within a JI range where a plateau in the number of communities with five or more members was observed. Adjusting the JI and GLD thresholds could potentially allow more genomes to be clustered together, depending on the research questions and the level of granularity required.
4) Sequencing errors can directly translate into artificial SNPs, which significantly contribute to the Jaccard distance. Low depth of sequencing coverage can also significantly affect the accuracy of indel detection. All genomes included in our US dataset were sequenced using the Illumina sequencing technology and the same quality check and assembly pipelines that are used in the US enteric pathogen surveillance system (PulseNet) were applied.
5) Genetic diversity is the underlying biological reason for the existence of different Jaccard Index groups. At thresholds used in this work (JI = 0.983 and GLD = 0.05), between-group differences can be explained by indels larger than 50 kb in size, or by >=2,050 SNPs across the entire genome, or a mix of both.
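To illustrate the two-threshold rule described in points 3) and 5), the following minimal Python sketch computes a k-mer Jaccard index for two sequences and then decides whether a pair of genomes would stay linked in the network. It is not the pipeline used in the study: the function names, the default k = 21, and the toy sequences are assumptions made for this example, reverse-complement (canonical) k-mers are ignored, and GLD is passed in as a precomputed value because its formula is not restated in this response. Only the cutoffs (JI >= 0.983 and GLD <= 0.05) are taken from the text above.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(seq_a: str, seq_b: str, k: int = 21) -> float:
    """Jaccard index = shared k-mers / total distinct k-mers."""
    a, b = kmers(seq_a, k), kmers(seq_b, k)
    union = a | b
    if not union:  # both sequences shorter than k: nothing to compare
        return 0.0
    return len(a & b) / len(union)


def keep_edge(ji: float, gld: float,
              ji_min: float = 0.983, gld_max: float = 0.05) -> bool:
    """An edge survives only if it passes both thresholds in succession."""
    return ji >= ji_min and gld <= gld_max


if __name__ == "__main__":
    # Toy example: one substitution at the last base changes one 2-mer.
    print(kmer_jaccard("ACGT", "ACGA", k=2))  # 2 shared / 4 distinct = 0.5
    # Hypothetical pairwise values, not taken from the study.
    print(keep_edge(ji=0.990, gld=0.01))      # passes both thresholds
    print(keep_edge(ji=0.990, gld=0.08))      # fails on GLD only
```

A genome whose edges all fail `keep_edge` ends up with no neighbours, which is how a singleton arises in this simplified picture.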
We further analyzed the singletons from our US dataset by calculating the JI and GLD values to the closest genome assigned to a JI group. The results are recorded in the table shown below. In most cases (35 out of 38), the JI value linking a singleton to a JI group is below the JI threshold (0.983). Of these 35, 29 exhibit a GLD value > 0.05 to the closest assigned genome, indicating that they either contain or lack accessory genome elements. The remaining 6 singletons (out of 35) meet the GLD inclusion criterion, and thus their differences compared to an assigned genome are mostly due to SNPs. On the other hand, three genomes out of 38 singletons (2018AM-3329, PNUSAS120833, and PNUSAS070834) meet the JI but not the GLD threshold, indicating that their differences are mainly due to indels. For example, the closest group to singleton GCF_003719555 is JI-D. Both JI-D genomes and this singleton contain a PTU-HI1A plasmid, but GCF_003719555 additionally contains a PTU-E18 plasmid that is absent in JI-D members. Similarly, genome GCF_003718135 contains a PTU-I1 plasmid that is absent in its closest JI group, JI-A. Presence or absence of MGEs thus defines the singularity in these cases.
These isolates were chosen for indel verification, as indicated in Materials and Methods. Isolates PNUSAS195139 and PNUSAS224101 were chosen for long-read sequencing based on our observation from Illumina genomes that some isolates lacking the SGI11 element in Fig S13c (six genomes) clustered phylogenetically with those harboring SGI11 inserted at the yidA gene. We suspected these genomes lacking SGI11 had the yidA gene disrupted, indicating potential excision of SGI11 from these isolates. Given the absence of prior documentation on this phenomenon, we considered it important to validate the loss of SGI11. Consequently, we opted to sequence them using long-read technology to verify this observation. According to the phylogenetic tree in Fig S13c, six genomes were highly similar in their core genome (they differ only in a few SNPs) and we selected two of them as representatives for sequencing. In the revised version (Results, section US Typhi pangenome structure aligns with and expands on known AMR and epidemiological patterns) we have added a sentence to highlight this finding: "Indeed, we detected a likely event of SGI11 excision from the yidA gene in six JI-B isolates (Fig S13b). In these cases, long-read sequencing of two of these genomes (PNUSAS224101, SAMN21040479; PNUSAS195139, SAMN18332688) confirmed that the yidA gene is disrupted by IS1, suggesting that it could be either a precursor to the SGI11 acquisition, or most likely a derivative of SGI11 excision, both probably through IS1-mediated recombination."
We chose to sequence PNUSAS198714 to confirm the integration of the blaCTX-M-15 gene into the SGI11 island in some genomes. Although we suspected this from the Illumina sequences, confirmation was hindered because this region was fragmented across several contigs. Long-read sequencing enabled us to conclusively demonstrate this integration (refer to main Figure 3). We explain this in the manuscript ("Insertion sites were confirmed either by direct analysis of the blaCTX-M-15-containing contigs (insertion sites I-III), or with additional long-read sequencing (insertion site IV) (PNUSAS198714, SAMN18813804)"). In the revised version, we have also added a clarification in Materials and Methods, section Whole genome sequencing: "Long-read sequencing was performed in this study on select isolates (PNUSAS224101, SAMN21040479; PNUSAS195139, SAMN18332688; PNUSAS198714, SAMN18813804; see Table S1) for indel verification (the first two isolates to confirm the absence of SGI11 and the yidA gene disruption, and the third one to detect the integration of blaCTX-M-15 in SGI11) as previously described."
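Returning to the singleton breakdown given earlier in this response (29 singletons failing both criteria, 6 failing only the JI criterion, and 3 failing only the GLD criterion), the interpretation rule can be written as a small decision function. This is a sketch for illustration only: the function name, the returned labels, and the example JI/GLD values are assumptions and are not taken from the singleton table; the two cutoffs are the ones stated in this response.

```python
JI_MIN = 0.983   # JI threshold used in this work
GLD_MAX = 0.05   # GLD threshold used in this work


def explain_singleton(ji: float, gld: float) -> str:
    """Interpret a singleton from its JI and GLD to the closest assigned genome."""
    ji_ok = ji >= JI_MIN
    gld_ok = gld <= GLD_MAX
    if ji_ok and gld_ok:
        return "meets both thresholds"
    if ji_ok:
        # Meets JI but not GLD: differences mainly due to indels.
        return "mainly indels"
    if gld_ok:
        # Below the JI threshold but similar in length: mostly SNPs.
        return "mostly SNPs"
    # Fails both: gain or loss of accessory genome elements.
    return "accessory elements gained or lost"


if __name__ == "__main__":
    # Hypothetical values chosen to hit each branch.
    for ji, gld in [(0.975, 0.09), (0.975, 0.02), (0.990, 0.07)]:
        print(ji, gld, "->", explain_singleton(ji, gld))
    # The three reported categories account for every singleton.
    assert 29 + 6 + 3 == 38
```

The first branch should not occur for a true singleton compared against its closest assigned genome; it is included only so that the function is total.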

The authors should consider testing the workflow using genomes generated by different sequencing platforms and demonstrate biases, if any.
We only have three genomes sequenced using both Illumina and Nanopore technologies, which implies we cannot reconstruct the network using long-read sequences for comparison with the Illumina-based network. We calculated the Jaccard Index between the same genomes sequenced using Illumina and Nanopore to determine if any significant differences attributable to the sequencing method could affect our classification method. The results consistently showed a very high JI value (>0.998) in all cases (0.998947 for PNUSAS195139, 0.999815 for PNUSAS198714, and 0.99956 for PNUSAS224101). This indicates that JI differences attributable to the sequencing method do not affect our classification into JI groups (with a JI threshold >= 0.983), nor the subclassification into JI-subgroups (with JI values of 0.995 for JI-A subgroups, 0.986 for JI-B subgroups, and 0.997 for JI-C subgroups).
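The comparison above reduces to checking that each Illumina-versus-Nanopore JI value exceeds the strictest threshold in use. The short Python sketch below makes that check explicit using only the values quoted in this response; the variable names are assumptions made for the example.

```python
# JI between Illumina and Nanopore assemblies of the same isolate (from the text).
platform_ji = {
    "PNUSAS195139": 0.998947,
    "PNUSAS198714": 0.999815,
    "PNUSAS224101": 0.99956,
}

# JI thresholds for groups and subgroups (from the text).
thresholds = {
    "JI groups": 0.983,
    "JI-A subgroups": 0.995,
    "JI-B subgroups": 0.986,
    "JI-C subgroups": 0.997,
}

strictest = max(thresholds.values())

# An isolate is unaffected by platform if its cross-platform JI
# clears even the strictest threshold.
unaffected = {iso: ji >= strictest for iso, ji in platform_ji.items()}

if __name__ == "__main__":
    print("strictest threshold:", strictest)
    for iso, ok in unaffected.items():
        print(iso, platform_ji[iso], "clears all thresholds:", ok)
```

With only three paired genomes this is a consistency check rather than a platform benchmark.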

The authors should also demonstrate whether the quality and completeness of the reference genomes used would influence the results obtained.
We sought complete Typhi genomes from the NCBI RefSeq database and included them in our analysis as references because of their higher quality. Working with closed genomes allows contigs to be assigned accurately to chromosomal or extrachromosomal elements, which is relevant for evaluating the indel differences in each JI-group. However, we found NCBI RefSeq references for only 10 out of 17 JI-groups. For the remaining JI-groups, we selected the genomes with the highest connectivity degree within each group (ensuring they are good representatives) and reconstructed them using PLACNETw to sort contigs into chromosome or plasmids and thus evaluate the presence of mobile genetic elements.

To overcome the biases associated with k-mer-based methods, including sequencing errors, repetitive sequences, and complex rearrangements, authors should consider using Spectral Jaccard Similarity (SJS).
We thank the Editor for bringing this methodology to our attention, as we were not previously aware of it. The authors (Baharav et al. (2024)) applied SJS to improve and speed up de novo genome assembly from long-read sequencing datasets. SJS is a min-hash-based approach that filters the set of read pairs to select those that share a reasonable number of k-mers and are thus likely to have a significant overlap. These candidate pairs are then used to obtain detailed alignment maps. When used as a metric to filter out pairs of reads that are unlikely to have a large alignment, SJS outperformed Jaccard similarity (JS) on standard classification performance metrics. That is, SJS was significantly more correlated with alignment size in the analysis of 40 PacBio long-read bacterial datasets that exhibited an overlap of at least 30%. SJS was thus proposed as an alternative to the standard k-mer Jaccard similarity for estimating the overlap size between pairs of noisy, third-generation sequencing reads. However, the work does not provide data on the superiority of the SJS method for comparing two complete (fully assembled) genomes, nor does it discuss its possible application beyond tackling the problem of pairwise read alignment. The code provided (https://github.com/TavorB/spectral_jaccard_similarity) was published specifically to verify the results presented in the paper and is not prepared or optimized to compare user-provided genome sequences.
6. In the discussion, the authors should provide more details on the limitations and biases introduced by the design and methodology used in the study and, in future directions, a more detailed description of the practical utility of the approach used and the outcome.
We have added a paragraph into the Discussion that highlights some of the limitations of the study and includes future directions that may help to resolve these limitations.
The potential biases introduced by utilizing Typhi genomes from a single country were addressed by analyzing multiple datasets from different geographic locations and time ranges. A remaining limitation of this analysis is the lack of very recent genomes (2022-2024). Since Typhi populations can rapidly evolve, new JI-groups may emerge in a relatively short time period. Additionally, many previously unknown MGEs were detected in this analysis that may prove epidemiologically relevant, but in-depth genetic characterization of every MGE was outside the scope of this analysis. Finally, the JI method used here is not immediately implementable within the US enteric surveillance system, PulseNet, due to existing computational infrastructure. However, recent efforts to modernize PulseNet's genomic surveillance (https://www.aphl.org/aboutAPHL/publications/Documents/PulseNet-2.0-White-Paper.pdf) may offer an opportunity for incorporation of JI-based methods, offering pangenomic analysis closer to "real-time", and simplifying the detection of unknown MGEs that can be explored with targeted genetic analysis. The ultimate public health goal is to provide a practical approach for enhanced genetic discrimination that improves surveillance and outbreak detection of otherwise indistinguishable enteric pathogens.

Your manuscript has been accepted, and I am forwarding it to the ASM production staff for publication. Your paper will first be checked to make sure all elements meet the technical requirements. ASM staff will contact you if anything needs to be revised before copyediting and production can begin. Otherwise, you will be notified when your proofs are ready to be viewed.
Data Availability: ASM policy requires that data be available to the public upon online posting of the article, so please verify all links to sequence records, if present, and make sure that each number retrieves the full record of the data. If a new accession number is not linked or a link is broken, provide production staff with the correct URL for the record. If the accession numbers for new data are not publicly accessible before the expected online posting of the article, publication may be delayed; please contact ASM production staff immediately with the expected release date.
Publication Fees: For information on publication fees and which article types have charges, please visit our website. We have partnered with Copyright Clearance Center (CCC) to collect author charges. If fees apply to your paper, you will receive a message from no-reply@copyright.com with further instructions. For questions related to paying charges through RightsLink, please contact CCC at ASM_Support@copyright.com or toll free at +1-877-622-5543. CCC makes every attempt to respond to all emails within 24 hours.
PubMed Central: ASM deposits all mSystems articles in PubMed Central and international PubMed Central-like repositories immediately after publication. Thus, your article is automatically in compliance with the NIH access mandate. If your work was supported by a funding agency that has public access requirements like those of the NIH (e.g., the Wellcome Trust), you may post your article in a similar public access site, but we ask that you specify that the release date be no earlier than the date of publication on the mSystems website.

Embargo Policy:
A press release may be issued as soon as the manuscript is posted on the mSystems Latest Articles webpage. The corresponding author will receive an email with the subject line "ASM Journals Author Services Notification" when the article is available online.
Cover Image Submissions: If you would like to submit a potential Cover Image, please email a file and a short legend to msystems@asmusa.org. Please note that we can only consider images that (i) the authors created or own and (ii) have not been previously published. By submitting, you agree that the image can be used under the same terms as the published article. Image File requirements: TIF/EPS, 7.5 inches wide by 8.25 inches tall (at least 2,250 pixels wide by 2,475 pixels tall), minimum 300 dpi resolution (600 dpi preferred), RGB, and no figure elements, e.g., arrows or panel labels. The legend should be a short description of the image, 1-2 sentences recommended. Please download and use this interactive template in Adobe to ensure that your proposed cover image meets our size requirements (https://journals.asm.org/pb-assets/pdf-text-excel-files/ASM-Interactive-Sizing-Cover-Template-1715689791.pdf).
Author Video: For mSystems research articles, you are welcome to submit a short author video for your recently accepted paper. Videos are normally 1 minute long and are a great opportunity for junior authors to get greater exposure. Importantly, this video will not hold up the publication of your paper and you can submit it at any time.

Details of the video are:
• Minimum resolution of 1280 x 720
• .mov or .mp4 video format
• Provide video in the highest quality possible but do not exceed 1080p
• Provide a still/profile picture that is 640 (w) x 720 (h) max
• Provide the script that was used
We recognize that the video files can become quite large, so to avoid quality loss ASM suggests sending the video file via https://www.wetransfer.com/. When you have a final version of the video and the still ready to share, please send it to mSystems staff at mSystems@asmusa.org.

1. Singletons or JI-clusters with less than five members. Please elaborate on the factors contributing to singletons and small clusters tackling it, for example, from the perspective of sample bias, threshold used, sequencing errors, and genetic diversity. What would the authors recommend to minimize the number of singletons and small clusters? What other parameters could the authors recommend introducing to achieve more comprehensive clustering?