Uncharacterized and lineage-specific accessory genes within the Proteus mirabilis pan-genome landscape

ABSTRACT Proteus mirabilis is a Gram-negative bacterium recognized for its unique swarming motility and urease activity. A previous proteomic report on four strains hypothesized that, unlike other Gram-negative bacteria, P. mirabilis may not exhibit significant intraspecies variation in gene content. However, there has not been a comprehensive analysis of large numbers of P. mirabilis genomes from various sources to support or refute this hypothesis. We performed comparative genomic analysis on 2,060 Proteus genomes. We sequenced the genomes of 893 isolates recovered from clinical specimens from three large US academic medical centers and combined them with 1,006 genomes from NCBI Assembly and 161 genomes assembled from Illumina reads in the public domain. We used average nucleotide identity (ANI) to delineate species and subspecies, core genome phylogenetic analysis to identify clusters of highly related P. mirabilis genomes, and pan-genome annotation to identify genes of interest not present in the model P. mirabilis strain HI4320. Within our cohort, Proteus is composed of 10 named species and 5 uncharacterized genomospecies. P. mirabilis can be subdivided into three subspecies; subspecies 1 represented 96.7% (1,822/1,883) of all genomes. The P. mirabilis pan-genome includes 15,399 genes outside of HI4320, and 34.3% (5,282/15,399) of these genes have no putative assigned function. Subspecies 1 is composed of several highly related clonal groups. Prophages and gene clusters encoding putatively extracellular-facing proteins are associated with clonal groups. Uncharacterized genes not present in the model strain P. mirabilis HI4320 but with homology to known virulence-associated operons can be identified within the pan-genome.

IMPORTANCE Gram-negative bacteria use a variety of extracellular-facing factors to interact with eukaryotic hosts. Due to intraspecies genetic variability, these factors may not be present in the model strain for a given organism, potentially giving an incomplete understanding of host-microbial interactions. In contrast to previous reports on P. mirabilis, but similar to other Gram-negative bacteria, P. mirabilis has a mosaic genome with a linkage between phylogenetic position and accessory genome content. P. mirabilis encodes a variety of genes that may impact host-microbe dynamics beyond what is represented in the model strain HI4320. The diverse, whole-genome characterized strain bank from this work can be used in conjunction with reverse genetic and infection models to better understand the impact of accessory genome content on bacterial physiology and pathogenesis of infection.

Thank you for submitting your paper to mSystems.

Reviewer comments:
Reviewer #1 (Comments for the Author): Potter et al. present their findings on the pan-genome characteristics of the opportunistic pathogen P. mirabilis. Using newly sequenced and previously sequenced genomes, the study described three subspecies and the variation in accessory genome content within the species. Below are my comments.

Major comments:
1. Results, lines 271-272: The rationale for choosing 4,013 SNPs = 0.1% of the median P. mirabilis genome length in NCBI in delineating clusters should be clarified here. There are several robust and widely used methods to infer bacterial population structure and clustering, e.g., BAPS, Mandrake, PopPUNK, so the 0.1% median definition used in this study is unclear.
2. The Discussion is weak, with more than half of the Discussion merely reiterating what was already mentioned in the Results section. The broader implications of the findings should be described.

Minor comments:
1. Intro, Lines 90-92: Reference 14 is cited here, but this reference only mentions urine cultures. A proper reference for P. mirabilis in wound and soft tissue infections should be cited here.
2. Methods, lines 141-142: The p value used as threshold for the chi-sq analysis should be stated here.
3. Methods, Line 172: Citation for RStudio is missing.
4. Methods, Line 174: I suggest removing "3 subspecies" in this sentence as they have not been distinguished yet at this point of the text. Alternatively, you can describe here how the three subspecies were delineated.
5. Throughout the text: spell out numbers below ten, e.g., three instead of 3 (line 174), five instead of 5 (line 248), three instead of 3 (line 295).
6. Methods, lines 155, 163: What quality metrics were used to assess the genomes? Supplementary table S1 should include information about the number of contigs, N50, number of annotated genes, genome size for each genome included in the study. Table S1 is missing the headings for each column.
7. Methods, line 169: Citation for Cytoscape is missing.
8. Methods, lines 179-180 "To gain specific insight into the population structure of subspecies 1, we used RAxML to identify identical isolates": Phylogenetic tree reconstruction using RAxML needs to include description of parameters used (e.g., bootstrapping, rooting, nucleotide substitution model, rate heterogeneity). Moreover, one does not use a phylogenetic tree to determine the population structure and identify identical isolates; hence it is unclear what this sentence means. It is unclear what the rationale is for removing the duplicate genomes.
9. Methods, line 195: Citation for Scoary is missing.
10. Results, line 226: What does "approximate" tree mean here?
11. Results and Methods: It is unclear why EggNOG was used for annotation for subspecies 1 and Prokka for the other Proteus genomes.
12. The figure legends associated with each figure are very confusing. For figures, legends should be placed below the figure. For example, figure 2 legend is found below and within the same page as figure 1.
13. Figure 4: Colored branches in the tree should be defined. What "within and outside" 10 largest clusters mean is not clear.
14. Results, lines 304-305: What the genes are that were mentioned as to be likely involved in O-antigen modification should be mentioned/listed here. Better yet, a supplementary table listing the gene names and functions of those identified in red dots or circled in Figure 5 should be included.

Reviewer #2 (Comments for the Author):
The authors of this study obtained a substantial number of Proteus isolates, which they sequenced and analyzed to gain significant insights into the population genetics of this taxon. The resulting collection of isolates, including both experimentally ready-to-use isolates and genomes, will prove invaluable in comprehending the functional implications of intra-species diversity. Such knowledge is crucial in elucidating the molecular mechanisms underlying host-bacteria interactions. The clarity of the analysis and writing in this paper is pretty good.

1. The article utilized over 1000 genomes from a publicly accessible database. The authors ought to clarify the supplementary worth of the data generated in this study in relation to existing datasets. Specifically, does the new data feature a greater number of human donors or additional timepoints?
2. It would also be helpful to summarize how many donors are included in both the new data and the existing public data.
3. Figure 1 would benefit from improvement to enable readers to discern its details more clearly.
4. In my opinion, Figure 2A appears to display either 2 or 4 subspecies, depending on where to set the cutoff. While I acknowledge that non-supervised clustering can be subjective, the authors should furnish additional justifications for their decision to conclude that there are 3 subspecies.
5. It might be worth considering creating a new plot that utilizes SNP distance and gene presence-absence distance as the two axes in Figure 4. This would enable the authors to reinforce the conclusions derived from Figure 4 more directly.
6. In the abstract, the authors use a single paper demonstrating that intra-species diversity was once believed to be minor to motivate readers. However, it is unclear whether this notion is widely acknowledged as the prevailing understanding within the field.
7. The collection of isolates in one lab (?) is a tremendous resource for future experimental research. The authors may wish to expand upon this point in their discussion, particularly in comparison to public datasets.
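
The comparison suggested in comment 5 could be sketched roughly as follows. This is a minimal illustration, not the authors' analysis: it assumes a hypothetical mapping from genome name to its set of accessory genes and a hypothetical dictionary of pairwise core-genome SNP distances, and it uses Jaccard distance as one possible measure of gene presence-absence distance. The function names are placeholders.

```python
from itertools import combinations


def jaccard_distance(genes_a, genes_b):
    """1 minus the shared fraction of the two accessory gene sets."""
    union = genes_a | genes_b
    if not union:
        return 0.0  # two empty gene sets are treated as identical
    return 1 - len(genes_a & genes_b) / len(union)


def paired_distances(accessory_genes, snp_distance):
    """Return (genome_a, genome_b, snp_distance, gene_distance) per genome pair.

    accessory_genes: dict of genome name -> set of accessory gene names (hypothetical input)
    snp_distance:    dict of (genome_a, genome_b) -> SNP count (hypothetical input)
    Pairs without a recorded SNP distance are skipped.
    """
    points = []
    for a, b in combinations(sorted(accessory_genes), 2):
        snp = snp_distance.get((a, b), snp_distance.get((b, a)))
        if snp is None:
            continue
        gene_dist = jaccard_distance(accessory_genes[a], accessory_genes[b])
        points.append((a, b, snp, gene_dist))
    return points
```

Each returned tuple gives one point for a scatter plot with SNP distance on one axis and gene presence-absence distance on the other.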

Reviewer #1 (Comments for the Author):
Potter et al. present their findings on the pan-genome characteristics of the opportunistic pathogen P. mirabilis. Using newly sequenced and previously sequenced genomes, the study described three subspecies and the variation in accessory genome content within the species. Below are my comments.

We thank Reviewer #1 for devoting their time and scientific insight to our manuscript and agree with their summary statement.

Major comments:
1. Results, lines 271-272: The rationale for choosing 4,013 SNPs = 0.1% of the median P. mirabilis genome length in NCBI in delineating clusters should be clarified here. There are several robust and widely used methods to infer bacterial population structure and clustering, e.g., BAPS, Mandrake, PopPUNK, so the 0.1% median definition used in this study is unclear.

We agree with the reviewer that this important detail within our manuscript should have additional explanation. We did not initially use a bacterial population structure program because we explicitly did not want to cluster all genomes. As this was the first foray into P. mirabilis phylogeny, we wanted to allow for the existence of singleton and pair genomes, which is not a feature of population structure software. We took the reviewer's suggestion and applied FastBAPS to our core-genome alignment of de-duplicated subspecies 1 genomes. We chose FastBAPS since it was created by the author of panaroo and is benchmarked for utility on large (>1,000) genome datasets. FastBAPS identified 31 clusters, far fewer than our count of 75 clusters (containing ≥4 genomes); however, there was strong concordance for what we identified as the top 10 largest clusters. We identified 720 genomes within the top 10 largest clusters, while FastBAPS identified 730 genomes in the corresponding BAPS groups. All 720 of the top 10 largest cluster genomes are within the 730 identified by FastBAPS. We investigated the 10 discrepant genomes. Four genomes were identified in BAPS group 20 but not in SNP cluster 1, three genomes were identified in BAPS group 23 but not SNP cluster 4, one genome was identified in BAPS group 14 but not SNP cluster 5, one genome was identified in BAPS group 3 but not SNP cluster 8, and one genome was identified in BAPS group 17 but not SNP cluster 9. The discrepant genomes are the white rectangles within the respective SNP clusters, indicating they are closely related to other genomes in the cluster but with ≥4,013 SNP distance. Given the high concordance between FastBAPS groupings and SNP clusters, we have confidence in our original downstream interpretation of accessory genome content distribution. We have included a visualization of this concordance as Figure S4 and all of the results of FastBAPS groups and SNP clusters as Table S3. Lines 286-290.

2. The Discussion is weak, with more than half of the Discussion merely reiterating what was already mentioned in the Results section. The broader implications of the findings should be described.

We appreciate the reviewer's insight into our Discussion section and have addressed this by removing most of the reiterations of our own results, and by expanding on the broader context of our work within the field of Gram-negative pan-genomics and on specific genes of interest identified in the pan-genome but absent from P. mirabilis HI4320. Lines 384-396, 398-404, 418-424.

Minor comments:
1. Intro, Lines 90-92: Reference 14 is cited here, but this reference only mentions urine cultures. A proper reference for P. mirabilis in wound and soft tissue infections should be cited here.

We have added a reference to a study of the microbiology of urine and wound cultures, which found that P. mirabilis is a major component of both (PMID: 31462413). Lines 81-84.
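
The threshold-based grouping and the concordance check described in the response to major comment 1 can be illustrated with a short sketch. It rests on assumptions that are not confirmed by the text: genomes are linked whenever their pairwise SNP distance is below 4,013 (single-linkage grouping, which leaves unlinked genomes as singletons), and concordance is checked by listing genomes that a BAPS group contains but the matching SNP cluster does not. The function names and the toy input format are hypothetical.

```python
from collections import defaultdict

SNP_THRESHOLD = 4013  # from the text: 0.1% of the median P. mirabilis genome length


def snp_clusters(genomes, snp_distance, threshold=SNP_THRESHOLD):
    """Group genomes by single linkage on SNP distance (assumed linkage rule).

    snp_distance: dict of (genome_a, genome_b) -> SNP count; pairs that are
    not listed are treated as unlinked. Singletons are returned as clusters
    of size one rather than being forced into a larger group.
    """
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent pointers to the representative, halving the path.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for (a, b), dist in snp_distance.items():
        if dist < threshold:
            parent[find(a)] = find(b)

    groups = defaultdict(set)
    for g in genomes:
        groups[find(g)].add(g)
    return list(groups.values())


def discrepant_genomes(snp_cluster, baps_group):
    """Genomes placed in the BAPS group but absent from the SNP cluster."""
    return sorted(set(baps_group) - set(snp_cluster))
```

With this layout, a SNP cluster is fully contained in its BAPS group when `set(snp_cluster) <= set(baps_group)`, and `discrepant_genomes` returns the extra members, analogous to the 10 genomes discussed above.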

2. Methods, lines 141-142: The p value used as threshold for the chi-sq analysis should be stated here.

We have clarified that a threshold p of 0.05 was used for significance testing. Lines 133-135.
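
To make the role of the 0.05 threshold concrete, here is a minimal sketch of a chi-square test on a 2x2 table. The table layout and function name are hypothetical, the statistic is the Pearson chi-square without continuity correction, and nothing here is claimed to match the exact test configuration used in the manuscript.

```python
import math

ALPHA = 0.05  # significance threshold stated in the response


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns (statistic, p_value). Assumes every row and column total is nonzero.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    # Shortcut formula for the 2x2 case.
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # Upper-tail probability of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


def is_significant(a, b, c, d, alpha=ALPHA):
    """True when the p value falls below the chosen threshold."""
    return chi_square_2x2(a, b, c, d)[1] < alpha
```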
3. Methods, Line 172: Citation for RStudio is missing.

We thank the reviewer for noticing that we did not properly credit RStudio and have added a citation to remedy this. Line 166.
4. Methods, Line 174: I suggest removing "3 subspecies" in this sentence as they have not been distinguished yet at this point of the text. Alternatively, you can describe here how the three subspecies were delineated.

We thank the reviewer for their helpful comment and took both approaches. We have removed "3 subspecies" while keeping the meaning of the sentence the same, and we have added more information on methods in the concluding portion of the previous paragraph. Lines 166-169.

6. Methods, lines 155, 163: What quality metrics were used to assess the genomes? Supplementary table S1 should include information about the number of contigs, N50, number of annotated genes, genome size for each genome included in the study. Table S1 is missing the headings for each column.

We have added a description of our quality cutoff (<500 contigs) and included Table S2 as a supplemental file. Line 149.
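
The per-genome metrics requested in comment 6 can be computed from contig lengths alone. The following is a minimal sketch under stated assumptions: the only quality rule applied is the <500 contig cutoff quoted in the response, and the function and field names are hypothetical rather than taken from the authors' pipeline.

```python
MAX_CONTIGS = 500  # cutoff stated in the response: fewer than 500 contigs


def n50(contig_lengths):
    """Largest length L such that contigs of length >= L cover at least half the assembly."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    raise ValueError("assembly has no contigs")


def assembly_summary(contig_lengths):
    """Summarize one assembly with the metrics named in the reviewer's comment."""
    return {
        "contigs": len(contig_lengths),
        "genome_size": sum(contig_lengths),
        "n50": n50(contig_lengths),
        "passes_qc": len(contig_lengths) < MAX_CONTIGS,
    }
```

The number of annotated genes is not derivable from contig lengths and would have to come from the annotation output.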

7. Methods, line 169: Citation for Cytoscape is missing.

We have added the appropriate reference for Cytoscape (PMID: 14597658). Line 163.

8. Methods, lines 179-180 "To gain specific insight into the population structure of subspecies 1, we used RAxML to identify identical isolates": Phylogenetic tree reconstruction using RAxML needs to include description of parameters used (e.g., bootstrapping, rooting, nucleotide substitution model, rate heterogeneity). Moreover, one does not use a phylogenetic tree to determine the population structure and identify identical isolates; hence it is unclear what this sentence means. It is unclear what the rationale is for removing the duplicate genomes.

We thank the reviewer for pointing out this confusion. We have clarified that we used RAxML only to identify duplicate genomes from the initial core-genome alignment of subspecies 1 genomes (n=1,883). We have added our rationale for this in the text and below. We have removed the phrase "To gain specific insight into the population structure of subspecies…" Lines 176-180.

9. Methods, line 195: Citation for Scoary is missing.

We have added the reference for Scoary in the methods section. Line 203.

10. Results, line 226: What does "approximate" tree mean here?

The use of "approximate" comes from the FastTree 2 documentation, which frequently adds the qualifier when describing the algorithm in comparison to traditional maximum likelihood methods (http://www.microbesonline.org/fasttree/).

11. Results and Methods: It is unclear why EggNOG was used for annotation for subspecies 1 and Prokka for the other Proteus genomes.

We thank the reviewer for pointing out this ambiguity. We clarified in the methods that prokka was used for gene calling and initial annotation in all of the P. mirabilis genomes, but that EggNOG was used only for subspecies 1. In our experience we cannot completely trust prokka annotation, so we use EggNOG to achieve more comprehensive identification of genes called by prokka. Lines 149, 194-195.

12. The figure legends associated with each figure are very confusing. For figures, legends should be placed below the figure. For example, figure 2 legend is found below and within the same page as figure 1.

We agree with the reviewer and hope that following mSystems directions for uploading the figures as separate files has alleviated that annoyance.

13. Figure 4: Colored branches in the tree should be defined. What "within and outside" 10 largest clusters mean is not clear.

We thank the reviewer for noticing these detracting aspects of Figure 4A. We have changed all branch colors to black and have edited the referenced boxes to say "Genome present in top 10 largest SNP clusters" and "Genome absent from top 10 largest SNP clusters", which will add clarity for the reader. Figure 4A.

14. Results, lines 304-305: What the genes are that were mentioned as to be likely involved in O-antigen modification should be mentioned/listed here. Better yet, a supplementary table listing the gene names and functions of those identified in red dots or circled in Figure 5 should be included.

We agree with the reviewer that this would be a useful piece of information for genomic researchers interested in Proteus or accessory genome functional capacities. We have added a supplementary table that lists all of the gene names (from panaroo) and their annotation (from EggNOG), as well as scoary metrics for association, and an additional text file containing the representative ORF clustered by panaroo for each gene call (Table S4, Document S2). Additionally, we have mentioned two of the genes in the results section and expanded the Discussion to include thoughts on the relationship to previous studies on O-antigen variability. Lines 325-329, 388-396.
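
As an illustration of what a gene-to-cluster association metric can look like, the sketch below runs a two-sided Fisher's exact test on a 2x2 table of gene presence/absence against membership in a clonal group. It is an assumption-laden example, not a reproduction of Scoary: the table layout, the choice of a two-sided test, and the function name are all hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p value for the table [[a, b], [c, d]].

    Hypothetical layout: a = gene present, in cluster;  b = gene present, outside;
                         c = gene absent, in cluster;   d = gene absent, outside.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    observed = prob(a)
    # Sum every table that is no more likely than the observed one;
    # the small tolerance guards against floating-point ties.
    p_value = sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, p_value)
```

When many genes are tested, the resulting p values would normally be corrected for multiple testing before interpretation.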

Reviewer #2 (Comments for the Author):
The authors of this study obtained a substantial number of Proteus isolates, which they sequenced and analyzed to gain significant insights into the population genetics of this taxon. The resulting collection of isolates, including both experimentally ready-to-use isolates and genomes, will prove invaluable in comprehending the functional implications of intra-species diversity. Such knowledge is crucial in elucidating the molecular mechanisms underlying host-bacteria interactions. The clarity of the analysis and writing in this paper is pretty good.

5. Throughout the text: spell out numbers below ten, e.g., three instead of 3 (line 174), five instead of 5 (line 248), three instead of 3 (line 295).

We have changed numerical values below 10 into their word forms. Throughout text.

6. Methods, lines 155, 163: What quality metrics were used to assess the genomes? Supplementary table