Species-Wide Phylogenomics of the Staphylococcus aureus Agr Operon Revealed Convergent Evolution of Frameshift Mutations

ABSTRACT Staphylococcus aureus is a prominent nosocomial pathogen that causes several life-threatening diseases, such as pneumonia and bacteremia. S. aureus modulates the expression of its arsenal of virulence factors through sensing and integrating responses to environmental signals. The agr (accessory gene regulator) quorum sensing (QS) system is a major regulator of virulence phenotypes in S. aureus. There are four agr specificity groups each with a different autoinducer peptide sequence encoded by the agrD gene. Although agr is critical for the expression of many toxins, paradoxically, S. aureus strains often have nonfunctional agr activity due to loss-of-function mutations in the four-gene agr operon. To understand patterns in agr variability across S. aureus, we undertook a species-wide genomic investigation. We developed a software tool (AgrVATE; https://github.com/VishnuRaghuram94/AgrVATE) for typing and detecting frameshift mutations in the agr operon. In an analysis of over 40,000 S. aureus genomes, we showed a close association between agr type and S. aureus clonal complex. We also found a strong linkage between agrBDC alleles (encoding the peptidase, autoinducing peptide itself, and peptide sensor, respectively) but not agrA (encoding the response regulator). More than 5% of the genomes were found to have frameshift mutations in the agr operon. While 52% of these frameshifts occurred only once in the entire species, we observed cases where the recurring mutations evolved convergently across different clonal lineages with no evidence of long-term phylogenetic transmission, suggesting that strains with agr frameshifts were evolutionarily short-lived. Overall, genomic analysis of agr operon suggests evolution through multiple processes with functional consequences that are not fully understood.

IMPORTANCE Staphylococcus aureus is a globally pervasive pathogen that produces a plethora of toxic molecules that can harm host immune cells. Production of these toxins is mainly controlled by an active agr quorum-sensing system, which senses and responds to bacterial cell density. However, there are many reports of S. aureus strains with genetic changes leading to impaired agr activity that are often found during chronic bloodstream infections and may be associated with increased disease severity. We developed an open-source software called AgrVATE to type agr systems and identify mutations. We used AgrVATE for a species-wide genomic survey of S. aureus, finding that more than 5% of strains in the public database had nonfunctional agr systems. We also provided new insights into the evolution of these genetic mutations in the agr system. Overall, this study contributes to our understanding of a common but relatively understudied means of virulence regulation in S. aureus.

2. Please consider replacing "Staphopia" with "Staphopia database" throughout the manuscript.
3. Figure 2 has poor resolution, please improve it. Additionally, 2C needs to be bigger than what it is now.
4. I did not see much utility of Figure [...]

This manuscript reports a major effort to characterize an enormous collection of S. aureus genomes for truncation mutations in the agr locus. Such mutations are known to emerge frequently during colonization with S. aureus. As such, further understanding their emergence and distribution is an important question. While I admire the authors for choosing this interesting direction, more work is needed to reach the conclusions listed in the abstract.
Major
-A major claim in the abstract is that "Phylogenetic patterns suggested that strains with agr frameshifts were evolutionary dead ends." Similarly, the discussion mentions that the same mutations were "independently acquired." This is an important claim that impacts how other data presented here are interpreted. Either support or contrast with the result would be interesting if characterized more convincingly: the extent to which agr null mutants persist and spread between individuals is currently unknown, as is the likelihood of the same mutation appearing independently. However, this result is poorly supported by the presented data and appears to be in fact contradictory to the data provided. The outlier point in 4B suggests a mutation that occurred early on the phylogeny that is now in many isolates. Moreover, the distribution of specific mutations across particular agr types as shown in Figure 3B also suggests identity-by-descent. Lastly, it is unclear how the dereplication done for the analysis in 4B impacts the results; it appears to me that this step would overestimate 'dead ends' and underestimate the spread of agr null alleles. An analysis that counts the number of independent occurrences on the tree from the full data set is required. If there is a limit to how many sequences can be imported, complete smaller subsets of the phylogeny should be used.
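One way to count independent occurrences on a tree, as requested above, is Fitch parsimony. The following is a minimal sketch, assuming a rooted binary tree written as nested tuples and a binary tip state (1 = frameshift present, 0 = absent). The tree, the isolate names, and the function name `fitch_min_changes` are hypothetical and are not taken from the manuscript or from AgrVATE. The count is the minimum number of state changes and does not distinguish gains from losses.

```python
from typing import Dict, Set, Tuple, Union

# A tree is either a leaf name (str) or a pair of subtrees (rooted, binary).
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_min_changes(tree: Tree, states: Dict[str, int]) -> int:
    """Return the minimum number of state changes needed to explain the
    tip states on a rooted binary tree (Fitch parsimony, bottom-up pass)."""

    def visit(node: Tree) -> Tuple[Set[int], int]:
        if isinstance(node, str):
            # Leaf: the candidate state set is the observed state.
            return {states[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        shared = left_set & right_set
        if shared:
            # Children can agree on a state: no extra change at this node.
            return shared, left_cost + right_cost
        # Children cannot agree: one additional change is required.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


# Hypothetical toy tree with six isolates.
toy_tree: Tree = ((("a", "b"), "c"), (("d", "e"), "f"))
# Same mutation in two sister isolates: consistent with one origin.
clustered = {"a": 1, "b": 1, "c": 0, "d": 0, "e": 0, "f": 0}
# Same mutation in two distant isolates: requires two changes.
scattered = {"a": 1, "b": 0, "c": 0, "d": 1, "e": 0, "f": 0}

print(fitch_min_changes(toy_tree, clustered))
print(fitch_min_changes(toy_tree, scattered))
```

If every isolate carrying a mutation acquired it independently, the count returned here would equal the number of isolates carrying it. The recursive traversal is only suitable for small trees; a tree of tens of thousands of tips would need an iterative traversal.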
-Relatedly, the authors should consider reporting the "minimum number of mutations on tree" rather than the consistency index. This would be easier to interpret, as the claim that all mutations are dead ends would produce an expected y=x line on a plot of # of isolates vs # of mutations. This can be calculated from the reported consistency index (https://github.com/JosephCrispell/homoplasyFinder/wiki/Calculating-consistency-index)
-Are multiple frameshift mutations in agr and/or indel mutations in RNAiii found in the same genomes? Or is there a signal for overdispersion (i.e. more cases of 1 mutation per pathway per genome than expected if you reshuffle these mutations)?
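The conversion from consistency index to minimum number of changes mentioned above can be sketched as follows. This assumes the conventional definition of the consistency index for a single character: the smallest number of changes the character could possibly require (number of states minus one) divided by the minimum number of changes it requires on the given tree. The function name `min_changes_from_ci` is hypothetical, and whether a given tool reports exactly this quantity should be checked against its documentation.

```python
def min_changes_from_ci(consistency_index: float, n_states: int = 2) -> int:
    """Invert CI = (n_states - 1) / min_changes_on_tree.

    n_states defaults to 2, i.e. a presence/absence character such as
    "frameshift present" versus "frameshift absent".
    """
    if not 0 < consistency_index <= 1:
        raise ValueError("consistency index must be in the interval (0, 1]")
    if n_states < 2:
        raise ValueError("a variable character needs at least two states")
    # Round to absorb floating-point error in the reported index.
    return round((n_states - 1) / consistency_index)


# A binary character with CI = 0.25 needs four changes on the tree.
print(min_changes_from_ci(0.25))
```

Plotting this value against the number of isolates carrying each mutation gives the comparison to the y=x line described above.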
-Line 292: I would have expected a comparison of # unique frameshift mutations per bp or per gene across these genes, rather than analysis of gene length distributions. Are all genes in this operon equally likely to acquire mutations, conditioned on their gene length?
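The length-normalized comparison asked for above amounts to dividing the number of unique frameshift mutations in each gene by that gene's length. A minimal sketch follows; the gene names, counts, and lengths are hypothetical placeholders chosen for illustration and are not values from the manuscript.

```python
from typing import Dict


def frameshifts_per_kb(
    unique_counts: Dict[str, int], gene_lengths_bp: Dict[str, int]
) -> Dict[str, float]:
    """Return unique frameshift mutations per kilobase for each gene."""
    rates = {}
    for gene, count in unique_counts.items():
        length = gene_lengths_bp[gene]
        if length <= 0:
            raise ValueError(f"gene length must be positive: {gene}")
        rates[gene] = 1000 * count / length
    return rates


# Hypothetical placeholder values, not data from the study.
counts = {"long_gene": 30, "short_gene": 10}
lengths = {"long_gene": 1500, "short_gene": 250}

print(frameshifts_per_kb(counts, lengths))
```

In this made-up example the longer gene has more unique frameshifts in absolute terms, but the shorter gene has the higher rate per kilobase, which is the distinction the comment is drawing.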
Minor: Figure 4B: I think the x-axis should be labeled "number of isolates". If this is accurate, a clearer label should be used. If not, what is meant by an 'occurrence' should be more clearly explained.
The manuscript should make it easier to align data from different panels by indicating the agr types of various clonal complexes.
Line 291: I'm not sure I understand this: "Commonly occurring variant gene lengths (>5000 genomes) were still considered canonical and filtered out." Perhaps a typo?
-Line 188: This statement needs a P-value.
-Line 210: ANI or other metrics of diversity that account for gene length would be helpful here.
Line 215: Could this be the identity by descent, given how small this gene is? Perhaps the private alleles in CC15 and CC5 are both derived from this shared allele?
Line 222: How different are these agrBDC alleles? 1 mutation away? This is a similar concern as the above.
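The question of how many mutations separate two alleles can be answered by counting mismatched positions between aligned sequences. The sketch below assumes the two alleles are already aligned to the same length and skips any column where either sequence has a gap. The function name `snp_distance` and the example sequences are hypothetical.

```python
def snp_distance(allele_a: str, allele_b: str) -> int:
    """Count mismatched positions between two aligned allele sequences.

    Columns where either sequence has a gap character ("-") are skipped,
    so indels are not counted as SNPs.
    """
    if len(allele_a) != len(allele_b):
        raise ValueError("alleles must be aligned to the same length")
    return sum(
        1
        for x, y in zip(allele_a.upper(), allele_b.upper())
        if x != y and x != "-" and y != "-"
    )


# Hypothetical six-base fragments differing at one position.
print(snp_distance("ATGAAA", "ATGACA"))
```

A distance of 1 between two alleles found in different clonal complexes would be more consistent with shared ancestry than a distance in the tens.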
How is the tree in Figure 2C rooted? The variation root-to-tip distance looks suspicious.
-Line 363 is worded too strongly. There is clearly some phylogenetic signal, as indicated by the clonal complex level analysis.
Minor text:
-Line 157: An exact number would be appreciated here rather than forcing the reader to the SI.
-Line 178: The comparison between patients and isolates is misleading. An isolate and a population should not be expected to behave similarly in terms of genetic diversity.
-Line 237: This is misleading, the wording suggests a monophyletic trait. This trait is not monophyletic as shown in 2C.

Preparing Revision Guidelines
To submit your modified manuscript, log onto the eJP submission site at https://spectrum.msubmit.net/cgi-bin/main.plex. Go to Author Tasks and click the appropriate manuscript title to begin the revision process. The information that you entered when you first submitted the paper will be displayed. Please update the information as necessary. Here are a few examples of required updates that authors must address:
• Point-by-point responses to the issues raised by the reviewers in a file named "Response to Reviewers," NOT IN YOUR COVER LETTER.
• Upload a compare copy of the manuscript (without figures) as a "Marked-Up Manuscript" file.
• Each figure must be uploaded as a separate file, and any multipanel figures must be assembled into one file.
For complete guidelines on revision requirements, please see the journal Submission and Review Process requirements at https://journals.asm.org/journal/Spectrum/submission-review-process. Submissions of a paper that does not conform to Microbiology Spectrum guidelines will delay acceptance of your manuscript.
Please return the manuscript within 60 days; if you cannot complete the modification within this time period, please contact me. If you do not wish to modify the manuscript and prefer to submit it to another journal, please notify me of your decision immediately so that the manuscript may be formally withdrawn from consideration by Microbiology Spectrum.
If your manuscript is accepted for publication, you will be contacted separately about payment when the proofs are issued; please follow the instructions in that e-mail. Arrangements for payment must be made before your article is published. For a complete list of Publication Fees, including supplemental material costs, please visit our website.
Corresponding authors may join or renew ASM membership to obtain discounts on publication fees. Need to upgrade your membership level? Please contact Customer Service at Service@asmusa.org.
Thank you for submitting your paper to Microbiology Spectrum.

Done. (Lines 174, 175, 201, 205, 223, 238, 292, 294, 296, 323, 455, 469, 492, 516, 529, 533, 991, 995, 1017, 1026, 1066 [...]

Major
1. A major claim in the abstract is that "Phylogenetic patterns suggested that strains with agr frameshifts were evolutionary dead ends." Similarly, the discussion mentions that the same mutations were "independently acquired." This is an important claim that impacts how other data presented here are interpreted. Either support or contrast with the result would be interesting if characterized more convincingly: the extent to which agr null mutants persist and spread between individuals is currently unknown, as is the likelihood of the same mutation appearing independently. However, this result is poorly supported by the presented data and appears to be in fact contradictory to the data provided.
In response to R2's thoughtful points we have made revisions that further back our conclusions that the agr frameshift null mutations were acquired independently in multiple lineages. We think that the cause of the disagreement may be one of degree. We do not believe that agr null strains are completely unable to transmit (and have cited literature in the first version that gives evidence for this). Instead, we believe that transmission is suppressed, which leads to agr nulls being removed within a few transmission generations.
a. The outlier point in 4B suggests a mutation that occurred early on the phylogeny that is now in many isolates.
This particular mutation is the agrA mutation already characterized by Novick [ref 59] as not "true" phenotypically agr null and was mentioned in our original text (line 264).
Therefore, we expect this mutation to be maintained vertically. We [...]
[...] Figure 3B also suggests identity-by-descent.
As shown in Fig [...]
d. An analysis that counts the number of independent occurrences on the tree from the full data set is required. If there is a limit to how many sequences can be imported, complete smaller subsets of the phylogeny should be used.
To offset the redundancy in S. aureus genomes as well as to reduce the oversampling of specific clonal complexes in NCBI, we opted to construct phylogenies from dereplicated datasets. This approach reduces the ascertainment bias in our data which would otherwise be present if we constructed a phylogenetic tree from > 40,000 genomes.
3. Are multiple frameshift mutations in agr and/or indel mutations in RNAiii found in the same genomes? Or is there a signal for overdispersion (i.e. more cases of 1 mutation per pathway per genome than expected if you reshuffle these mutations)?
We have included the exact number of agr operons with frameshift mutations (Line 257 -260). We only observed a maximum of 2 frameshifts per operon which were also extremely rare. We included a supplemental figure (Fig S4) [...]
[...] are not comparing the diversity between agr genes, we did not account for gene length. Based on the cluster sizes, longer genes like agrC and agrA appear to have increased diversity compared to shorter genes like agrB and agrD.
6. Line 215: Could this be the identity by descent, given how small this gene is? Perhaps the private alleles in CC15 and CC5 are both derived from this shared allele?
While that is a possibility, we believe the more parsimonious explanation is that it is the result of a recent recombination event rather than a vestigial ancestral allele and numerous implied deletion events.
7. Line 222: How different are these agrBDC alleles? 1 mutation away? This is a similar concern as the above.
The average within-agr group SNP distance is 15 and between-agr group SNP [...]
[...] a population should not be expected to behave similarly in terms of genetic diversity.
We removed the statement that makes the comparison between patients and isolates - line 180.
12. Line 237: This is misleading, the wording suggests a monophyletic trait. This trait is not monophyletic as shown in 2C.
We clarified the statement to say it is limited to one clade (line 240 -241). We removed the word "monophyletic" from line 393.

Thank you for submitting the revised version of your manuscript to Microbiology Spectrum. Now we have received the comments from both reviewers for your submitted manuscript. The first reviewer has no comments, however, the second reviewer still has additional comments. Altogether, I am suggesting a minor revision for this version. Please ensure that the added comments are well explained throughout the text and supported by clearly described methods.

RESPONSE TO REVIEWERS
Thank you for submitting your manuscript to Microbiology Spectrum. When submitting the revised version of your paper, please provide (1) point-by-point responses to the issues raised by the reviewers as file type "Response to Reviewers," not in your cover letter, and (2) a PDF file that indicates the changes from the original submission (by highlighting or underlining the changes) as file type "Marked Up Manuscript -For Review Only". Please use this link to submit your revised manuscript -we strongly recommend that you submit your paper within the next 60 days or reach out to me. Detailed instructions on submitting your revised paper are below.

Thank you for the privilege of reviewing your work. Below you will find instructions from the Microbiology Spectrum editorial office and comments generated during the review.
The ASM Journals program strives for constant improvement in our submission and publication process. Please tell us how we can improve your experience by taking this quick Author Survey.

While the authors make compelling descriptions of their findings in the response-to-reviewer (R2R) document, the abstract remains problematically muddled and overclaiming, which can be fixed with just changes to the text. For the two claims I mention below, care should be taken throughout to avoid overclaiming in other places of the text-but I've highlighted the abstract in detail to explain the problem. I also have a few remaining minor concerns.
1) The abstract states "More than five percent of genomes were found to have frameshift mutations in the agr operon. Though most mutations occur only once in the entire species, we observed a small number of recurring mutations evolving convergently across different clonal lineages. Phylogenetic patterns suggested that strains with agr frameshifts were evolutionary dead ends." Together, these statements make the reader think that the authors will show evidence that: (A) agr mutants cannot transmit to another person; and (B) any incidence of finding 2 genomes (after genome-wide dereplication) from the same agr type emerges from convergence, rather than identity by descent or recombination.
In the R2R document, the authors claim they are not intending to claim A. However, the use of the phrase 'dead end' in the abstract does have strong connotations for many readers. A quantitative or relative description of how widespread these mutations are should be used instead.
In the R2R document, the authors claim they are not intending to claim B. They rightfully note exceptions in which identity-by-descent (IBD) is likely and mention that the point is a matter of degree. However, the abstract explicitly leaves out the notion of IBD and a naïve reader would think the authors think this third possibility is to be ignored.
Both issues could easily be fixed with a simple rewriting. For example, the authors could write "More than five percent of genomes were found to have frameshift mutations in the agr operon. XX of these frameshifts were due to unique mutations found only once, suggesting that S. aureus with agr frameshifts are less fit in the long term. Despite this, we observe cases where the same mutation emerges in distinct clonal lineages, reinforcing the adaptive nature of these mutations in the short term"
2) Figure 4A is more surprising now that 'occurrence' has been better defined (though this should be more explicitly defined in the figure itself, e.g. '# of CC w same mutation'). The word occurrence, if I am understanding correctly, refers to the number of CCs in which a mutation is found. The convergence here is remarkable and suggests recombination might be an explanation. If so, it would be useful to color these by agr gene (agrA may recombine more across CC complexes). Alternatively, if 'occurrence' means 'unique agr operon' (a phrase used earlier in the text), this is problematic as an operon can become unique after the acquisition of an agr frameshift mutation due to other mutations later occurring on that background.
3) It is not completely clear to me why the authors have discounted the possibility of recombination of a nonfunctional agr allele as a cause for its occurrence in multiple CCs.
4) A supplemental table of every genome analyzed, its CC, agrA type, agr group, and any frameshifts noted would be a useful tool for the field.


Species-wide phylogenomics of the Staphylococcus aureus agr operon reveals convergent evolution of frameshift mutations
In this revision, we made minor changes to the abstract and figure 4A. We also included a new supplemental dataset S1 as requested by reviewer #2.
Our responses to the reviewers are in red. The line numbers we have specified correspond to the revised version.

Reviewer #1 (Comments for the Author):
I am satisfied with the revisions.

Reviewer #2 (Comments for the Author):
While the authors make compelling descriptions of their findings in the response-to-reviewer (R2R) document, the abstract remains problematically muddled and overclaiming, which can be fixed with just changes to the text. For the two claims I mention below, care should be taken throughout to avoid overclaiming in other places of the text-but I've highlighted the abstract in detail to explain the problem. I also have a few remaining minor concerns.
1) The abstract states "More than five percent of genomes were found to have frameshift mutations in the agr operon. Though most mutations occur only once in the entire species, we observed a small number of recurring mutations evolving convergently across different clonal lineages. Phylogenetic patterns suggested that strains with agr frameshifts were evolutionary dead ends." Together, these statements make the reader think that the authors will show evidence that: (A) agr mutants cannot transmit to another person; and (B) any incidence of finding 2 genomes (after genome-wide dereplication) from the same agr type emerges from convergence, rather than identity by descent or recombination.
a. In the R2R document, the authors claim they are not intending to claim A. However, the use of the phrase 'dead end' in the abstract does have strong connotations for many readers. A quantitative or relative description of how widespread these mutations are should be used instead.
b. In the R2R document, the authors claim they are not intending to claim B. They rightfully note exceptions in which identity-by-descent (IBD) is likely and mention that the point is a matter of degree. However, the abstract explicitly leaves out the notion of IBD and a naïve reader would think the authors think this third possibility is to be ignored.
Both issues could easily be fixed with a simple rewriting. For example, the authors could write "More than five percent of genomes were found to have frameshift mutations in the agr operon. XX of these frameshifts were due to unique mutations found only once, suggesting that S. aureus with agr frameshifts are less fit in the long term. Despite this, we observe cases where the same mutation emerges in distinct clonal lineages, reinforcing the adaptive nature of these mutations in the short term"

We rewrote line 37 to state that we did not see evidence of long-term transmission and therefore the frameshifts are short-lived. We removed the phrase "dead end". We also included the percentage of uniquely occurring frameshift mutations (52%) in the abstract and in the main text (line 255).
"While 52% of these frameshifts occur only once in the entire species, we observed cases where the recurring mutations evolve convergently across different clonal lineages with no evidence of long-term phylogenetic transmission, suggesting that strains with agr frameshifts were evolutionarily short lived"
We also specified that there is no evidence of long-term phylogenetic transmission in line 373.
2) Figure 4A is more surprising now that 'occurrence' has been better defined (though this should be more explicitly defined in the figure itself, e.g. '# of CC w same mutation'). The word occurrence, if I am understanding correctly, refers to the number of CCs in which a mutation is found. The convergence here is remarkable and suggests recombination might be an explanation. If so, it would be useful to color these by agr gene (agrA may recombine more across CC complexes). Alternatively, if 'occurrence' means 'unique agr operon' (a phrase used earlier in the text), this is problematic as an operon can become unique after the acquisition of an agr frameshift mutation due to other mutations later occurring on that background.

As stated in line 322 to 329, the number of occurrences of frameshift mutations refers to the number of frameshifts we observed in the dereplicated set of CC22, CC30, CC5 and CC8 genomes. This is also stated in the figure legend (line 1034-1035). Therefore, these are not necessarily "unique" agr operons having frameshifts. This is the same dataset that is also used for Fig 4B. We have changed the title of Fig 4A and 4B, as well as the y-axis label of 4A.
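The two readings of 'occurrence' discussed above (number of genomes carrying a mutation versus number of clonal complexes in which it is found) can be tabulated side by side, along with the fraction of mutations seen only once. The following is a minimal sketch; the record layout, mutation labels, and function names are hypothetical and do not reflect the layout of the authors' dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def summarize_mutations(
    records: Iterable[Tuple[str, str, str]]
) -> Dict[str, Tuple[int, int]]:
    """Map each mutation to (number of genomes, number of clonal complexes).

    Each record is (genome_id, clonal_complex, mutation_label).
    """
    genomes = defaultdict(set)
    complexes = defaultdict(set)
    for genome_id, clonal_complex, mutation in records:
        genomes[mutation].add(genome_id)
        complexes[mutation].add(clonal_complex)
    return {m: (len(genomes[m]), len(complexes[m])) for m in genomes}


def fraction_singletons(summary: Dict[str, Tuple[int, int]]) -> float:
    """Fraction of distinct mutations observed in exactly one genome."""
    if not summary:
        raise ValueError("no mutations to summarize")
    singletons = sum(1 for n_genomes, _ in summary.values() if n_genomes == 1)
    return singletons / len(summary)


# Hypothetical records, not data from the study.
toy_records = [
    ("g1", "CC5", "mutA"),
    ("g2", "CC8", "mutA"),
    ("g3", "CC8", "mutA"),
    ("g4", "CC30", "mutB"),
]
toy_summary = summarize_mutations(toy_records)
print(toy_summary)
print(fraction_singletons(toy_summary))
```

Reporting both counts for each mutation makes it explicit which quantity a figure axis is showing.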
3) It is not completely clear to me why the authors have discounted the possibility of recombination of a nonfunctional agr allele as a cause for its occurrence in multiple CCs.
As stated in line 209 -212, and 218 -220, we did not observe instances of shared agr gene alleles between different strains apart from the noted exceptions. If nonfunctional alleles were recombining, we would have observed multiple shared alleles.
We also observed identical mutations across different agr groups (Fig 3A, 3B); however, we do not observe multiple agr groups within a CC (except CC45) as would be expected if these mutations were being transmitted by recombination (Fig 1C).
4) A supplemental table of every genome analyzed, its CC, agrA type, agr group, and any frameshifts noted would be a useful tool for the field.

We included supplemental dataset S1 (DatasetS1.xlsx) showing 40,890 genomes analysed, with their NCBI and biosample accessions, ST, CC, agrA type, and agr group. We also included the number of mutations, their position(s), the type of mutation (insertion/deletion/snp), the effect (frameshift/loss of start/introduction of stop) and the corresponding nucleotide and amino acid change that is caused by the mutation. We also mentioned this in the "Data availability" section (line 604 -606).

Misc. changes:
Fixed typo in Fig 1A (changed "P1 and P2" to "P2 and P3")

January 3, 2022 2nd Revision - Editorial Decision

January 3, 2022
Dr. Timothy D Read
Emory University School of Medicine
Atlanta
Re: Spectrum01334-21R2 (Species-wide phylogenomics of the Staphylococcus aureus agr operon reveals convergent evolution of frameshift mutations)

Dear Dr. Timothy D Read,
Thank you for submitting the revised version of your manuscript to Microbiology Spectrum. We have received the comments from both reviewers, and both of them agree that this is an exciting and novel study and should be accepted in its current form. Therefore, it is a pleasure to accept the current version of your manuscript entitled "Species-wide phylogenomics of the Staphylococcus aureus agr operon reveals convergent evolution of frameshift mutations" for publication in Microbiology Spectrum. With this email, I am forwarding it to the ASM Journals Department for publication. You will be notified when your proofs are ready to be viewed.
The ASM Journals program strives for constant improvement in our submission and publication process. Please tell us how we can improve your experience by taking this quick Author Survey.
As an open-access publication, Spectrum receives no financial support from paid subscriptions and depends on authors' prompt payment of publication fees as soon as their articles are accepted. You will be contacted separately about payment when the proofs are issued; please follow the instructions in that e-mail. Arrangements for payment must be made before your article is published. For a complete list of Publication Fees, including supplemental material costs, please visit our website.