Evolutionary dynamics of the accessory genomes of Staphylococcus aureus

ABSTRACT Staphylococcus aureus is a ubiquitous commensal and opportunistic bacterial pathogen that can cause a wide gamut of infections, which are exacerbated by the presence of multidrug-resistant and methicillin-resistant S. aureus. S. aureus is genetically heterogeneous and consists of numerous distinct lineages. Using 558 complete genomes of S. aureus, we aim to determine how the accessory genome content among phylogenetic lineages of S. aureus is structured and has evolved. Bayesian hierarchical clustering identified 10 sequence clusters, of which seven contained major sequence types (ST 1, 5, 8, 30, 59, 239, and 398). The seven sequence clusters differed in their accessory gene content, including genes associated with antimicrobial resistance and virulence. Focusing on the two largest clusters, BAPS8 and BAPS10, and each consisting mostly of ST5 and ST8, respectively, we found that the structure and connected components in the co-occurrence networks of accessory genomes varied between them. These differences are explained, in part, by the variation in the rates at which the two sequence clusters gained and lost accessory genes, with the highest rate of gene accumulation occurring recently in their evolutionary histories. We also identified a divergent group within BAPS10 that has experienced high gene gain and loss early in its history. Together, our results show highly variable and dynamic accessory genomes in S. aureus that are structured by the history of the specific lineages that carry them. IMPORTANCE Staphylococcus aureus is an opportunistic, multi-host pathogen that can cause a variety of benign and life-threatening infections. Our results revealed considerable differences in the structure and evolution of the accessory genomes of major lineages within S. aureus. Such genomic variation within a species can have important implications on disease epidemiology, pathogenesis of infection, and interactions with the vertebrate host. Our findings provide important insights into the underlying genetic basis for the success of S. aureus as a highly adaptable and resistant pathogen, which will inform current efforts to control and treat staphylococcal diseases.

carried out in 2019, S. aureus was the leading bacterial cause of death in 135 countries (1,105,000 deaths, range: 816,000-1,470,000) and ranked second highest in terms of the years of life lost (34.3 million, range 25.5-45.3)(6).It is also the leading cause of fatal bloodstream infections with 299,000 deaths (range: 166,000-485,000) (6).
S. aureus can be differentiated into lineages according to sequence-based molecular methods.Sequence types (ST) and the clonal complexes (CC), formed from closely related STs, vary in their phenotypic and epidemiological characteristics, including the propensity to cause outbreaks and epidemics (7).The prevalence of individual lineages may vary, depending on geography, identity of vertebrate hosts, and disease manifes tations (8)(9)(10)(11).As of October 2023, there are 8,700 recognized STs in the S. aureus database of pubMLST (12).Of these, >5,700 STs are classified into 10 major CCs, whereas the rest remain poorly studied or unknown.Despite such tremendous diversity, only a few MRSA clones have instigated pandemic proportions of diseases (7).The most common nosocomial MRSA lineages are ST5 (CC5), ST8 and ST239 (both from CC8), ST22 (CC22), ST36 (CC30), and ST545 (CC45) (13).Accessory genes, that is, genes carried by a subset of a group of genomes and are not essential to every strain of a species (14,15), provide a clue to the success of these lineages.Accessory genes encode antibiotic resistance, virulence, defense, or immune evasion mechanisms and are often carried by mobile genetic elements, such as plasmids, transposons, chromosomal cassettes, prophages, and pathogenicity islands (16,17).Studies of the pan-genome, that is, the totality of genes in a set of strains (18), of S. aureus reveal significant variability in the accessory genomes within the species, but the processes that gave rise to lineagespecific pan-genomes remain unclear.Such fine-scale resolution in genomic variability within a species was recently termed comparative pan-genomics (19).In that study, the authors revealed the non-uniformity in ST-specific pan-genomes in Escherichia coli, which may reflect the species' ecological flexibility (19).The evolutionary (e.g., selection versus neutral evolution), genetic (e.g., restriction-modification systems), and ecological drivers that shape the accessory gene content of specific microbial species remain a lively area of investigation (20)(21)(22)(23)(24)(25)(26)(27).
Here, we aim to determine how the accessory genomes among S. aureus lineages are structured and have evolved.Analysis of 558 complete genomes of S. aureus showed considerable heterogeneity in gene content, gene co-occurrence, and rates of gene gain and loss in the accessory genomes between lineages.Variable accessory genomes within each lineage have important implications on bacterial physiology, pathogene sis of infection, and interactions with the vertebrate host.Our findings are critical in understanding the genetic basis for the success of S. aureus as a highly adaptable and resistant pathogen.

Phylogenetic relationships of S. aureus lineages
We obtained a total of 558 S. aureus complete genomes from the National Center for Biotechnology Information (NCBI).The strains were collected from 1884 to 2022 from 45 countries in six continents (Fig. 1a; Fig. S1; Table S1).Most genomes came from Asia (n = 174), Europe (n = 166), and North America (n = 108) (Fig. 1b).Most of the strains were derived from human sources (n = 453), whereas the rest came from food, environment, and water sources (Fig. 1a).Genome sizes ranged from 2.66 Mb -3.13 Mb.The number of predicted genes ranged from 2,371 to 2,937 per genome.The pan-genome contained 2,104 core genes (present in 553-558 genomes), 53 soft core genes (present in 531-552 genomes), 826 shell genes (present in 84-530 genomes), and 5,020 cloud genes (present in 1-83 genomes) for a total of 8,003 genes (Table S2).The combined core and soft core made up 26.95% of the pan-genome, and the shell and cloud genes together comprised 73.05% of the pan-genome.A total of 1,557 genes were unique to a single genome.

Antimicrobial resistance (AMR) and virulence genes in the accessory genomes
We next sought to compare the accessory gene content of the sequence clusters with ≥20 genomes (BAPS 2, 3, 4, 5, 7, 8, and 10).In terms of the size of the accessory genome, sequence cluster BAPS10 carried the highest number of accessory genes per genome (mean = 413, range = 245-494) (Fig. 2a).Genomes from sequence cluster BAPS3 carried the lowest number of accessory genes per genome (mean = 260, range = 181-334).Comparison of every pair of sequence clusters showed significant differences in the number of accessory genes per genome (all comparisons with P value < 0.05; Welch's t-test) (Fig. 2a).

Co-occurrence of accessory genes
Accessory genes may associate (or co-occur) with other genes more often than expected by chance (30).These co-occurring genes may reflect a shared evolutionary history, co-evolution of traits, correlated selection across environments, or coordinated patterns of gene gain or loss (30).We obtained statistically significant co-occurrence of acces sory genes in the two largest sequence clusters BAPS8 and BAPS10 (Fig. 3) using Coinfinder (30).We also ran Coinfinder in the other eight sequence clusters; however, no significant gene associations were detected because they consisted of much fewer genomes, ranging from 6 to 55 genomes in each cluster.The number of gene associa tions detected by Coinfinder is known to decrease substantially with smaller numbers of genome inputs, with no significant associations identified when an input of around 50 genomes or fewer is used (30).For sequence cluster BAPS3 (n = 55 genomes), we postulated that the one divergent genome found at the base of the cluster (Accession no.GCF_003609855.1) may have contributed to the lack of significant gene associations in our analysis.We therefore ran Coinfinder on BAPS3 without this genome; however, we still did not obtain any significant gene associations.
AMR genes that are common to both co-occurrence networks of BAPS8 and BAPS10 include blaZRI (beta-lactam resistance), bcrABR (bacitracin resistance), qacA (resistance to quaternary ammonium compounds), satA (resistance to nucleoside antibiotics [strepto thricin]), aphA (aminoglycoside resistance), and msrA (resistance to streptogramin and macrolide).In both co-occurrence networks, there were 386 and 257 gene families that have hypothetical functions in sequence clusters BAPS8 and BAPS10, respectively.In terms of the structure of the co-occurrence networks, we found that the two sequence clusters BAPS8 and BAPS10 differed in terms of the number of unique subnetworks (Fig. 3c), number of co-occurring gene pairs (Fig. 3d), number of genes that form the co-occurrence network (Fig. 3e), and the number of AMR genes in the co-occurrence network (Fig. 3f).

Rates of gene gain and gene loss
Accessory genes can be acquired or lost throughout the evolutionary history of a lineage.We therefore sought to determine the rates of gene gain and gene loss in sequence clusters along the branches of their respective phylogenies.Using the core SNP phylogenies, gene presence/absence matrices, and a generalized linear model implemented in Panstripe (31), we estimated the rates of gene gain and loss in the five major sequence clusters (BAPS2, BAPS3, BAPS5, BAPS8, and BAPS10).For BAPS3, we also calculated rates with and without the divergent genome (Accession no.GCF_003609855.1).
We found higher rates of gene gain and loss in sequence cluster BAPS8 compared with BAPS10 (4925.19 and 1195.96,respectively, P value = 2.78e-10) measured as a function of the association between the branch lengths of the phylogeny and the number of gene gain and loss events (Fig. 4a).However, the pan-genomes of the two sequence clusters did not differ in the rates of gene gain and loss at the terminal branches of the phylogeny (13.42 and 14.74, respectively, P value = 6.81e-1;Fig. 4b).An elevated rate in the cumulative number of gene gain and loss events versus the cumulative branch length starting from the root node of their corresponding phyloge nies can be detected for both sequence clusters (Fig. 4c).For both sequence clusters, we also found that the number of gene gain and loss events congregated in more recent branches of their respective phylogenies, suggesting frequent and rapid gene turnover at recent times (Fig. 4d).The enrichment for gene gain and loss events at the tips of a tree is often driven by the presence of highly mobile genetic elements and/or annotation errors (31).Within sequence cluster BAPS10, we identified a divergent group consisting of four genomes that has experienced high number of gene gain and loss early in the history of this sequence cluster, was not observed in sequence cluster BAPS8.
We also carried out comparisons with other smaller clusters; however, the results should be considered with caution because of the considerably smaller number of genomes in them compared with BAPS8 and BAPS10.We found significant difference in the rates of gene gain and loss between sequence clusters BAPS2 and BAPS8 in their core genomes (1430.97 and 4925.19,respectively; P value = 1.09e-06) (Fig. 4a).Sequence clusters BAPS3 and BAPS8 (excluding the divergent genome in BAPS3) had significant differences in the rates of gene gain and loss at the terminal branches of their phylogenies (54.34 and 13.41, respectively; P value = 0.0362) (Fig. 4b).All other cluster comparisons were either not significant or failed in fitting the generalized linear model in Panstripe due to the low number of genomes in the smaller clusters (Table S6).The total number of genes gained or lost varied along each branch of the phylogenies of BAPS2, BAPS3, and BAPS5 (Fig. S2).

DISCUSSION
The bacterial accessory genome consists of genes present in some but not all of the strains being considered (14,15,22).Because of its highly variable nature, how it has evolved and is structured (i.e., why some strains have certain genes and some do not) continues to be explored in microbial genomics.Using 558 complete genomes of S. aureus, we investigated the structure and evolution of the accessory genomes of the main lineages within the species.Our main findings are as follows: (i) the gene content and co-occurring components of the accessory genomes, including genes associated with AMR and virulence factors, varied between the major lineages.(ii) The two lineages differed in the rate at which they gained and lost accessory genes, with the highest rate of gene accumulation occurring late in each lineage's evolutionary history.Together, our results indicate highly variable and dynamic accessory genomes in S. aureus that are structured by the history of the genetic lineages that carry them.
Considerable genomic variation within a species can have important implications on disease epidemiology, pathogenesis of infection, and strain interactions with the vertebrate host in many pathogens.An excellent example is a study whereby accessory gene content variation was used to track the transmission of extensive and multidrugresistant Acinetobacter baumannii across hospital wards (32).Using accessory genomes in disease epidemiology proved to be effective because gene turnover (i.e., gain and loss) occurs faster than the accumulation of mutations in A. baumannii over very short microevolutionary scales (33).In Streptococcus dysgalactiae, the distinction between its two subspecies and their ability to adapt to either human or bovine host lie in a specialized accessory genome characterized by the presence of specific surface-pro tein virulence determinants and carbohydrate metabolic functions (34).In S. aureus from different vertebrate hosts, accessory gene pools consisting of host-specific genes promote adaptive evolution after host-switching events (10).Gain and loss of multiple accessory genes also promoted cattle specialization in Campylobacter jejuni ST61, which evolved from the host generalist lineage ST21 (35).Our results further expand on these previous studies and highlight the importance of lineage-specific accessory gene pools.Our results also show that the use of a single model or reference strain of a species, and even of an ST or any lineage, is not sufficient to capture the genomic diversity below the species level.
The accessory gene pool within a lineage of a species is highly heterogeneous not only in terms of gene functions but also the patterns of preservation and timing of mobility.In a previous study of E. coli ST131, the expansion of the accessory genomes of two clades C1 and C2 appears to have occurred more recently (31), which was similar to what we observed for the two S. aureus lineages in our study.Regardless of the timing of gene acquisition and loss, different mechanisms may act to preserve accessory genes: a more rapid mobilization occurring over short time scales and relatively slower transfer occurring over longer periods of time (16,36).The types of mobile genetic elements and recombination processes will influence how slowly or rapidly accessory genomes are altered across a species or lineage.Hence, grouping all accessory genes as a single entity within the pan-genome may not fully capture the incredible variation in bacterial genomes.This is particularly evident for the divergent group within sequence cluster BAPS10 that experienced an enrichment of acquired and lost genes early in the evolutionary history of this lineage.The temporal analysis of the pan-genome is a powerful approach to studying how the pan-genome has evolved, which is not readily apparent in phylogenetic tree reconstruction and gene accumulation curves that are often used in microbial pan-genome investigations.
Our findings are consistent with results from a recent study that compared pangenomes as highly dynamic ecosystems where genes recruited into individual genomes interact with other genes that they find themselves with either positively or negatively depending on the genetic background of the recipient genome (37).We further add that even if a gene has been acquired and lost multiple times throughout the history of a specific lineage, the recruited gene will likely find itself in a slightly different genetic neighborhood every time it is acquired due to the inconstancy in the presence of other genes within a genome of the lineage.Nonetheless, it remains challenging to distinguish repeated gain and loss of the same genes within a lineage.Although both genetic drift and selection shape the persistence and mobility of genes that comprise the pan-genome (37)(38)(39), the contributions of these processes vary over the time that a lineage has been evolving, and therefore, their impacts on pan-genome content over time will also fluctuate.Understanding the history and variability in the pan-genome within a species may therefore shed light on how ancestral lineages adapted to their environments and, in turn, how their environments shaped their genomes.As an example, the history of S. aureus is characterized by numerous instances of switching between humans and different animal hosts, and hence will certainly alter its pangenome content (10).However, to what extent the genome content changes as a lineage switches from host species A to host species B compared with when it switches from host species A to host species C remains to be elucidated.
We acknowledge the limitations of our study.First, our data set is heavily favored toward genomes from certain geographical regions and from human clinical sources.Additional metadata (e.g., disease versus carriage and therapeutic history) were also not available.Hence, we were not able to make inferences about the ecological and geographical impacts to accessory genome structure.Second, our smaller data set (558 genomes) compared with other studies that number thousands of genomes may also limit the magnitude of accessory gene pool that we can identify.S. aureus is known to have an open pan-genome (40)(41)(42), that is, the pan-genome has no upper limit and new genes are discovered with the addition of new strains (18,43).It is therefore likely that the co-occurrence networks and history of gene gain and loss will be affected with the addition of genomes and novel genes.Nonetheless, because we confined our analyses to complete genomes, we were able to demonstrate a more complete picture of the heterogeneity in structure and evolutionary history between the lineage-specific accessory genomes, which may be obscured when using draft and incomplete genomes.Our results have been largely confined to the two largest sequence clusters (BAPS8 and BAPS10), and hence, we can only make confident inferences about the evolutionary trajectories of their pan-genomes.Gene associations and historical rates of gene gain/ loss remain difficult to infer in sequence clusters that consist of much fewer genomes.The range of diversity of missing relatives and missing branches on the phylogeny will inevitably lead to some genes remaining invisible in any pan-genome analysis and consequently accurately inferring the historical rates of gene gain and loss.Sequencing more genomes from less common lineages is expected to improve the predictive power of the approaches we used.
In summary, we show that the accessory genome of S. aureus is lineage-specific and is structured by the co-occurrence of certain genes and the histories of gene gain and loss.Our findings provide important fundamental insights into the underlying genetic basis for the highly adaptable, virulent, and multidrug resistance features of S. aureus, as well as the global success of certain S. aureus clones.This knowledge will inform current efforts to control and treat staphylococcal diseases, including approaches to target specific S. aureus clones or lineages.

Genome collection and annotation
We retrieved a total of 888 complete genome sequences of S. aureus from the NCBI RefSeq Database in September 2022.Genome quality was assessed using the programs QUAST (44) and CheckM (45) (Table S1).Genomes with <90% completeness and >5% contamination were excluded from downstream analysis, as recommended by CheckM (45).All genomes were compared against each other using the program FastANI v. 1.32 (46) to confirm species identification.We used the ≥95% average nucleotide identity (ANI) threshold to define a species (46).Genomes that did not meet the 95% criteria were also excluded.After filtering based on these criteria, we used a total of 558 genomes for all downstream analyses (Table S1).Our strict criteria for the inclusion of S. aureus genomes are critical to reduce the impact of missing genes in incomplete, draft, and poorly sequenced genomes.The genomes were annotated using Prokka v.1.14.6 (47).

Pan-genome analysis and phylogenetic tree reconstruction
We used Panaroo v.1.2.7 (48) to determine the core and accessory gene content of the entire population.We defined core genes as those present in ≥95% of the genomes (i.e., at least 530 genomes), whereas the accessory genes were genes present in ≥1% and <95% of the data set (i.e., less than 530 genomes).Panaroo was run in strict mode with the -remove-invalid-genes option to ignore annotations that do not conform to the expected gene structure (i.e., with a premature stop codon or gene of an invalid length).Sequence alignment was carried out using MAFFT v.7.487 (49).Sequence alignments of 2,104 core genes and 53 soft core genes were concatenated, and SNPs were identified using SNP-site v.2.5.1 (50).A total of 235,221 SNPs were identified.The aligned core SNPs were used to build a maximum likelihood phylogenetic tree using RAxML v.8.2.12 (51) with a generalized time-reversible (GTR) (52) model of nucleotide substitution and Gamma distribution of rate heterogeneity.We used the Bayesian hierarchical clustering algorithm fastBAPS v.1.0.6 (fast Bayesian Analysis of Population Structure) to partition the genomes into sequence clusters consisting of genetically similar individuals (28).We also built core SNP phylogenetic trees for individual sequence clusters using the same methods described above.

Identification of coincident resistance and accessory genes
To determine if AMR genes associate or dissociate with each other and with other accessory genes more often than would be expected by chance, we used the pro gram Coinfinder v.1.0.7 (30).Coinfinder detects genes that associate or dissociate with other genes using a Bonferroni-corrected Binomial exact test statistic of the expected and observed rates of gene-gene association.Network diagrams of these coincident gene-to-gene relationship within each sequence cluster were visualized with Gephi v.0.9.2 (https://github.com/gephi/gephi).Genes associated with resistance and virulence were identified from the network clusters using output from ABRicate and gene functions listed in UniProt (57).

FIG 1
FIG 1 Phylogenetic relationship and geographical distribution of the 558 S. aureus used in this study.(a) Midpoint-rooted maximum likelihood tree showing the phylogenetic relationships of 558 S. aureus genomes.The tree was built using 235,221 SNPs of aligned sequences of 2,104 core genes and 53 soft core genes.Tree scale represents the number of nucleotide substitutions per site.Colored lines extending out of the branches in the tree represent the 10 sequence clusters inferred by the program fastBAPS.The four outer rings represent the geographical source (continent), ecological source, SCCmec type, and sequence type (ST).For visual clarity, only the seven major STs are shown.(b) World map showing the countries and continents from where the 558 S. aureus was derived.The pie charts show the distribution of the seven major STs per continent.The number above each pie chart shows the number of genomes per continent.

FIG 2 5 FIG 3
FIG 2 Comparison of the accessory genomes among the seven BAPS sequence clusters.Only sequence clusters with ≥20 genomes were compared.(a) Number of accessory genes per genome.(b) Number of acquired antimicrobial resistance genes per genome.(c) Number of antimicrobial classes that have at least one gene conferring resistance to an antimicrobial class present in a genome.(d) Number of virulence genes per genome.Box represents the interquartile range, horizontal line in the middle of the box represents the median, and the lower and upper whiskers represent the lowest data point without the outliers and the highest data point without outliers, respectively.For all panels, Welch's t-test was used for all comparisons of every pair of sequence clusters.

FIG 4
FIG 4 Rates of gene gain and gene loss between sequence clusters.(a) Association between the branch lengths in the core genome phylogeny and rates of gene gain and loss in sequence clusters BAPS2, BAPS3, BAPS5, BAPS8, and BAPS10.(b) Association between genes on the tips of the phylogeny and rates of gene gain and loss.For panels a and b, colored dots represent the mean and error bars represent the 95% bootstrap confidence intervals of the parameter estimates.(c) Cumulative gene gain and loss as a function of the cumulative core branch distance in the phylogeny.Internal and terminal branches are represented by circles and triangles, respectively.For panels a-c, calculation in BAPS3 was done without the divergent genome (Accession no.GCF_003609855.1) and is represented by an asterisk.(d) Inferred ancestral states showing the gene gain and loss events on phylogeny of sequence clusters BAPS8 (left) and BAPS10 (right).The branch colors show the total number of genes gained and lost along the internal and terminal branches of the trees.Phylogenies for sequence clusters BAPS2, BAPS3 (with and without the divergent genome), and BAPS5 are shown in Fig. S2.