Population Genomics of emm4 Group A Streptococcus Reveals Progressive Replacement with a Hypervirulent Clone in North America

ABSTRACT Clonal replacement is a major driver for changes in bacterial disease epidemiology. Recently, it has been proposed that episodic emergence of novel, hypervirulent clones of group A Streptococcus (GAS) results from acquisition of a 36-kb DNA region leading to increased expression of the cytotoxins Nga (NADase) and SLO (streptolysin O). We previously described a gene fusion event involving the gene encoding the GAS M protein (emm) and an adjacent M-like protein (enn) in the emm4 GAS population, a GAS emm type that lacks the hyaluronic acid capsule. Using whole-genome sequencing of a temporally and geographically diverse set of 1,126 isolates, we discovered that the North American emm4 GAS population has undergone clonal replacement with emergent GAS strains completely replacing historical isolates by 2017. Emergent emm4 GAS strains contained a handful of small genetic variations, including the emm-enn gene fusion, and showed a marked in vitro growth defect compared to historical strains. In contrast to other previously described GAS clonal replacement events, emergent emm4 GAS strains were not defined by acquisition of exogenous DNA and had no significant increase in transcript levels of nga and slo toxin genes via RNA sequencing and quantitative real-time PCR analysis relative to historic strains. Despite the in vitro growth differences, emergent emm4 GAS strains were hypervirulent in mice and ex vivo growth in human blood compared to historical strains. Thus, these data detail the emergence and dissemination of a hypervirulent acapsular GAS clone defined by small, endogenous genetic variation, thereby defining a novel model for GAS strain replacement. IMPORTANCE Severe invasive infections caused by group A Streptococcus (GAS) result in substantial morbidity and mortality in children and adults worldwide. Previously, GAS clonal strain replacement has been attributed to acquisition of exogenous DNA leading to novel virulence gene acquisition or increased virulence gene expression. Our study of type emm4 GAS identified emergence of a hypervirulent GAS clade defined by variation in endogenous DNA content and lacking augmented toxin gene expression relative to replaced strains. These findings expand our understanding of the molecular mechanisms underlying bacterial clonal emergence.

B acterial clonal emergence is a hallmark of human epidemics and has contributed to changes in disease caused by Staphylococcus aureus (1,2), Streptococcus pneumoniae (3), and Neisseria meningitidis (4) to name a few. Streptococcus pyogenes (group A Streptococcus [GAS])-a leading cause of invasive bacterial disease in humans-has long served as a model organism for investigating severe bacterial infections and bacterial epidemic behavior (5)(6)(7). The primary typing scheme for GAS is based on sequencing of the hypervariable 59 end of the emm gene which encodes the cell surface M protein, with GAS strains of the same emm type generally being more closely related to each other than to strains of different emm types (8). GAS epidemiology is characterized by epidemic events resulting from strain emergence and eventual replacement of prior circulating strains. To date, emergence of GAS clones has primarily been attributed to acquisition of exogenous DNA (6,9). For example, the replacement of historical emm1 strains with the currently circulating pandemic emm1 GAS in the 1980s resulted from recombination of a 36-kb region that contained the secreted NADase (encoded by nga) and streptolysin O (SLO) (slo) toxins, two well-known, cotranscribed virulence factors (10)(11)(12)(13). This recombination event resulted in a high-activity nga or slo promoter which has been linked to increased frequency and severity of infections and rapid global spread (6,14). A similar recombination event has since been described in emm89 GAS strains contributing to the emergence of a new and more virulent clade (7). In addition to increased Nga/SLO expression, contemporary emm1 GAS acquired via horizontal gene transfer (HGT) a novel prophage-associated pyrogenic exotoxin (SpeA), further contributing to virulence potential and spread (6)a mechanism also attributed to the enhanced invasiveness of emm3 GAS strains following integration of a bacteriophage carrying a novel phospholipase (Sla) (15).
The GAS hyaluronic acid capsule has historically been considered critical to virulence by promoting resistance to host immune cell phagocytosis (16)(17)(18)(19)(20). However, several emm types that lack capsule have recently been identified (21,22). There has been a significant rise in the number of infections caused by acapsular strains (9,22), and they now collectively account for nearly one-third of invasive GAS infections (23,24). Interestingly, lack of capsule production through gene loss or mutation has been linked to high-activity nga and slo genotypes (25). The combination was first described in emm89 GAS strains (7,26) but has since been recognized to occur in additional GAS genotypes, including emm28 and emm87 (25). Thus, the current paradigm is that acapsular GAS bacteria compensate for the lack of capsule by elevating Nga/Slo production.
We recently reported the occurrence of a chimeric emm gene in circulating acapsular emm4 GAS strains isolated in the United States and England (27). The new emm4 strains harboring the chimeric emm gene completely replaced the circulating strains by 2017 in the Houston, Texas, metropolitan area (27). To better understand the timeline and the genetic changes driving the emergence of chimeric emm4 strains, we analyzed 1,126 temporally and geographically diverse emm4 strains from the United States, Canada, and United Kingdom. Our analysis reveals that a new, genetically distinct clade of emm4 GAS strains with enhanced virulence has emerged and constitutes nearly the entire contemporary emm4 population. In striking contrast to previous GAS clonal replacement events, we report here that the newly circulating emm4 strains are defined by small genetic changes of existing DNA rather than acquisition of exogenous DNA and do not have augmented Nga/SLO production.

RESULTS
Analysis of the emm4 GAS population structure reveals distinct clades with nearly complete temporal replacement. A total of 861 strains were sequenced for the current study, and we analyzed the genome sequences from an additional 265 (n = 1,126 total sequences) emm4 GAS disease isolates (invasive and noninvasive) originating from multiple geographic locations in the United States, Canada, and United Kingdom, spanning 25 years (Table 1; see also Table S1 in the supplemental material). Importantly, .70% of the isolates were derived from national-level surveillance (Centers for Disease Control and Prevention, Active Bacterial Core Surveillance program) spanning nearly 20 years (Table 1). Mean sequencing depth of coverage exceeded 120-fold for all strains analyzed. Phylogenetic relationships were inferred based on single nucleotide polymorphisms (SNPs) relative to the core genome (excluding mobile genetic elements [MGE]) of the reference emm4 GAS strain Duke (GenBank accession number CP031770) (28). Based on Bayesian population clustering, the emm4 GAS population consisted of three distinct subpopulations (subclades [SC]) (29) (Fig. 1A Fig. 1A) that were a single locus variant (ST38) of the major emm4 sequence type (ST39).
Maximum likelihood phylogeny suggested that contemporary strains are overrepresented in SC3 compared to SC1 (note the predominance of red for SC3 strains in Fig. 1A). Thus, we determined the overall temporal distribution of the emm4 GAS population stratified by Bayesian clustering (29). Consistent with clonal replacement, we observed a progressive decrease in SC1 strains and a concomitant increase in SC3 over time such that by 2017, 94.7% (89/94) of emm4 strains sampled were SC3 (Fig. 1B). Interestingly, SC2 strains appear sporadically and do not persist over time. Thus, we chose to focus our analyses on the differences between SC1 and SC3 subclades. To better determine the timing of SC3 emergence, we next performed Bayesian reconstruction of ancestral states using BactDating (30). The overall substitution rate (3.5 substitutions per core genome per year; 1.84 Â 10 26 substitutions per site year 21 ) was similar to that reported for emm1 and emm12 GAS using BEAST (31). The root date estimate (time to most recent common ancestor) for the entire emm4 population was 1,921 (range, 1,886 to 1,936), and the date of emergence for SC2 strains was 1,979 (range, 1,975 to 1,983) compared to 1,992 (range, 1,991 to 1,993) for SC3 strains (see Fig. S1 in the supplemental material). For comparison, estimated date of SC1 emergence (excluding ST38 strains) was 1,955 (range, 1,946 to 1,964). Overall, these data indicate a strong temporal signal within the emm4 population examined. Thus, henceforth, we refer to the SC1 strains as "historical" and to SC3 strains as "emergent." Major gene content differences between clades are defined by prophage gene loss but do not involve MGE-related exotoxins or DNases. Recently, it has been shown that contemporary emm4 lineages are defined by prophage-specific gene loss resulting in cryptic prophages (32). Importantly, prophage degradation in the emm4 lineages did not alter exotoxin or DNase gene content. Thus, we first compared the prophage-associated exotoxin and DNase gene content in our emm4 population to determine differences between subclades. Among the major subclades, we observed no differences in exotoxin gene (smeZ, speC, and ssa) content ( Fig. 2 and Table S2). Similarly, the prophage-associated, DNase-encoding genes spd1 and spd3 were nearly universally present in all three subclades ( Fig. 2 and Table S2). However, the emm4 subpopulations differed in the presence of the prophage-associated DNase gene sdn occurring in 54% (333/614) of SC3 strains but was not a subclade-defining feature ( Fig. 2 and Fig. S2).
We next performed a pangenome analysis to further define gene content differences within the emm4 population. The core genome (shared by .99% of strains) was defined by 1,516 (31.7%) of the total 4,788 genes identified in the emm4 pangenome. In comparison, there are 1,934 genes in the Duke reference genome. The majority (2,879/4,788 [60.1%]) of genes were found in ,15% of strains. Using a recently described, novel machine-learning method (PANINI) (33), we discovered that gene content differences contributing to emm4 GAS population structure correlated almost perfectly with subclades defined by Bayesian population clustering from SNPs (Fig. S2)an observation potentially consistent with previously described prophage degradation (32). In addition, genes conferring resistance to macrolides [e.g., mef(A) and erm(B)] also contributed to strain subclustering (Fig. S2). Interestingly, a subpopulation of SC3 strains contained erm(T) residing on a previously described ;5-kb plasmid (34) (Fig. 2). However, strains were more broadly distributed among the SC3 population when assessing gene content (Fig. S2) compared to other resistance elements (data not shown). Further, subpopulation structure within SC3 strains was strongly associated with the presence of the DNase sdn (Fig. S2). Taken together, these data demonstrate that acquisition of an exogenous virulence factor does not distinguish emergent from historical emm4 strains. Emergent clade emm4 GAS strains are defined by small genetic changes and not large-scale recombination. Previous studies of GAS epidemic clone emergence established that large recombination events, specifically involving the known GAS virulence factors Nga and SLO, are critical for epidemic clone emergence (6, 7). Examination of phylogenetic relationships before and after correction for recombination (ClonalFrameML and gubbins) (35,36) revealed only modest changes in the emm4 GAS population structure ( Fig. S3A to C). To further facilitate identification of recombination, we sequenced to closure the genomes of eight (four SC1 and four SC3) emm4 Branch tip color and shape indicate year and subclade designation as in Fig. 1 and the associated legend. Subclades (SC1 to 3) are indicated by shaded boxes. DNase, exotoxin, and antimicrobial resistance gene content is indicated by vertical bars to the right of the ML phylogram and defined in the legend headings. Asterisks indicate nodes representing strains with completed genomes (see Table S3 in the supplemental material). GAS strains representing major clades/subclades and MGE content (asterisks in Fig. 2 and Table S3). Given the sporadic temporal appearance and lack of persistence in SC2 strains, we focused our analyses on differences between SC1 and SC3. Comparison of the completed genomes revealed no large-scale (e.g., .1-kb) recombination, and nga and slo, including the associated promoter, were identical between completed genomes and within the studied emm4 population (data not shown). Specifically, all strains possessed an identical "high-activity" nga or slo promoter (25). For comparison, in a study of emm89 GAS clonal emergence, HGT and recombination accounted for the majority of core chromosomal differences between clades and ranged in size from a few hundred bases to .70 kb (26). Thus, although previous studies identified HGT as a hallmark of GAS clade emergence, these data show that HGT is not a major contributor to clade differences in the emm4 GAS population.
We next determined differences between emergent and historical strains with respect to small genetic changes (e.g., SNPs, insertions/deletions). In addition to the previously described emm-enn gene fusion (27), emergent strains were defined by a small number of differences in core chromosomal loci compared to historical strains. A total of 36 polymorphisms (core chromosomal) occurred in predicted genes and separated historical and emergent emm4 populations (Table S4). Importantly, several of the identified polymorphisms occurred in genes previously shown to alter virulence ( Table 2). For example, all or nearly all emergent (SC3) strains contain inactivating mutations (historical strains with intact reading frames) in ralp3 (98% [602/614]), which encodes a RofA family transcriptional regulator previously identified as important to GAS virulence (37), and silA (100% [614/614]), which encodes a response regulator of the Streptococcus invasion locus (sil) (38) (Table 2). Similarly, .86% of emergent strains contain a mutation predicted to result in a single amino acid change (T104I) in the regulator of proteinase RopB, an activator of the actively secreted virulence factor Streptococcus cysteine protease SpeB (39). We used casein hydrolysis (milk plate assays) to assess the functional impact of this polymorphism and found that the emergent strains produced a slightly larger clearance area relative to the historical strains consistent with increased SpeB production ( Fig. S4A and B). Finally, 90% (377/417) of historical strains encode valine (V) at position 27 (a variably conserved region of the predicted DNA-binding domain [40]) in the multigene activator (Mga) protein (41). Conversely, only 22% (138/614) of emergent emm4 strains possess the same valine, and these strains form a distinct subpopulation within SC3 (data not shown). The remaining emergent strains (476/614) have an Mga protein with alanine (A) at position 27 and thus have a V27A change relative to the historical strains.
We also identified a small number of locus variation between the clades representing ;12 kb (,1%) of the core chromosome (data not shown). Most of these variations were in genes encoding surface proteins with known highly repetitive sequences and frequent strain-to-strain allelic variation such as the streptococcal collagen-like proteins (sclA and sclB) (42,43), serum opacity factor (sof) (44), and fibronectin-binding protein (fbpA) (23). However, in contrast to single nucleotide changes and the emmenn fusion, the observed gene variations were not clade defining and occurred in both emergent and historical strains. emm4 clades exhibit significantly different gene variation in key transcriptional regulators. To gain insight into whether particular GAS loci were under selective pressure, we determined genetic variation at each of the 1,629 core genetic loci (relative to the reference strain Duke and excluding MGE) to define highly polymorphic genes. After exclusion of variation likely to have arisen due to repetitive elements or small recombination events, a total of 24 genes were identified as significantly highly polymorphic ( Fig. 3 and Table S5). Interestingly, nearly half (10/24 [42%]) of the highly variant genes encoded known or predicted proteins of two-component systems (TCS) or stand-alone transcriptional regulators. Similar to previous investigations of GAS populations (5-7, 45), we identified the genes encoding the CovRS TCS as significantly highly polymorphic (Fig. 3). Mutations altering the activity of the CovRS TCS are wellknown to aid in the transition from colonization to invasion (28,(46)(47)(48)(49) and occur at relatively high frequencies in multiple GAS emm types (5,6,45,50). In nearly all cases of regulator gene variation, observed mutations resulted in predicted modified proteins, with .25% of the mutations introducing a premature stop codon (131/505 [25.9%]) and nearly 70% leading to at least one amino acid change (350/505 [69.3%]) (Table S5).
To gain additional insights into mechanisms underlying the observed clonal emergence, we sought to determine whether the frequencies of gene variation differed between the historical and emergent clades. A total of 32 genes were identified in the emm4 population with significantly higher variation levels compared to the rest of the genome of which 8 were also significantly highly polymorphic in both emergent and historical clade strains when analyzed separately (Table S5). Interestingly, excluding clade-defining mutations in mga, ralp3, and ropB (Table 2), we discovered that the proportion of strains with regulator-specific mutations differed significantly between clades (Table 3 and Table S5). Specifically, gene variation rates for covS (histidine kinase of CovRS TCS), rofA, and ralp3 were significantly greater in historical clade strains  Chi-square statistic based on total number of variations in the core genome (n = 10,594).
f P value based on chi-square test for the total emm4 population following correction for multiple comparisons (Bonferroni).
g Number of strains in the emm4 population (total emm4) with at least one of the defined gene mutations.
h Number of strains in emergent or historical clades (excluding ST38; n = 16) with at least one of the defined gene mutations. Italic values indicate that the indicated gene was not significantly highly polymorphic in the respective population (Table S5).
i P value (Fisher's exact) comparing proportion of strains with at least one mutation in emergent versus historical clades following correction for multiple comparisons (Bonferroni).
j Excludes the clade-defining mutation T104I found in emergent/clade 2 strains.
k Excludes the clade-defining mutation A27V found in historical/clade 1 strains.
l Excludes the clade-defining nonsense mutation found in emergent/clade 2 strains and a synonymous SNP (n = 289 strains) in historical/clade 1 strains.
m Excludes a synonymous SNP (n = 492) occurring in both clades.
compared to the emergent clade strains (Table 3). On the other hand, genetic variation in mga was observed to be significantly greater in emergent compared to historical strains ( Table 3). The significantly different gene variation levels between the two emm4 clades suggest different regulatory programs contributing to host-pathogen interaction. Emergent emm4 GAS strains exhibit a distinct transcriptome profile with decreased virulence gene expression. The observed differences in regulator gene polymorphism frequency between the two clades, including regulators known to respond to metabolite availability, led us to hypothesize that emergent clade strains may differ from historical strains in in vitro growth characteristics and global gene regulation. For all subsequent clade comparisons, we identified four representative strains each from the SC1 (historical) and SC3 (emergent) clades (Table S3). All strains were wild type at key, known regulatory loci (e.g., TCS and stand-alone regulators) with the exception of clade-specific changes ( Table 2 and Table S4). Thus, observed differences are likely to be representative of each clade. We first compared in vitro strain growth in nutrient-rich medium (THY) and glucose-limiting conditions (C medium). Overall, growth among emergent clade strains showed greater strain-to-strain variability compared to historical strains (Fig. 4A). Surprisingly, under both conditions, emergent clade strains had a marked growth defect beginning at approximately mid-exponential (ME) phase ( Fig. 4A and Fig. S4C). The observed differences in growth were more pronounced under glucose-limiting conditions (C medium) ( Fig. 4A and Fig. S4D), suggesting that emergent strains are sensitive to nutrient limitation. The lower optical densities (ODs) of the emergent strains correlated with lower CFU, indicating that the lower ODs were not due to an artifact (data not shown).
Next, we determined the in vitro transcriptomes of emergent and historical clade strains using RNA sequencing (RNA-seq) from cells grown in nutrient-rich medium and harvested at mid-exponential phase. These conditions were chosen given that growth differences were less pronounced in nutrient-rich media (Fig. 4A), and many GAS virulence factors are known to be maximally expressed during mid-exponential phase in vitro (26,51). For RNA-seq analysis, the complete genome sequence was resolved for each of the eight strains and annotated using the prokaryotic genome annotation pipeline (PGAP) at NCBI (52). Via pangenome analysis, we identified 1,619 genes that were shared between the eight emm4 genomes, and these were used to determine differential gene expression (DGE) between clades. Principal-component analysis demonstrated well-defined, clade-specific clustering of emm4 strain transcriptomes (Fig. 4B). DGE was defined as an absolute fold difference of $1.5 and Bonferroni-corrected P value of ,0.05. Excluding MGE, we found 62 genes upregulated and 141 genes downregulated in emergent strains relative to historical strains (Table S6). Highly upregulated genes in emergent clade strains involved the methionine oxidative response (ccdA2), arginine deiminase pathway (arcA), and genes encoding putative peptidoglycan-modifying enzymes (05805) (Fig. 4C). In striking contrast to previous GAS epidemic strain emergence investigations (6,7,53), we discovered that genes encoding the virulence factors nga, slo, prtS (interleukin 8 [IL-8] protease), and sda3 (DNase) were among the most highly downregulated genes in emergent strains relative to historical strains (Fig. 4C). The observed differences in nga and slo transcript levels cannot be explained by sequence variation, as emergent and historical strains have identical nga-slo operons and promoter regions. Given that the T4 pilus was recently shown to contribute to colonization and immune evasion of emm4 GAS (54), we also examined pilus gene transcription levels and again observed significantly decreased expression in emergent strains compared to historical clade strains (Table S6).
It is possible that the differences in growth between emergent and historical strains contributed to the observed nga and slo DGE. Given the pivotal role of Nga and SLO in previous GAS epidemic clonal emergence (6,7,53), we sought to confirm the lack of increased transcript levels in emergent clade strains. We isolated RNA from GAS strains grown in nutrient-rich medium at mid-exponential phase and assayed transcript levels of slo using quantitative reverse transcription-PCR (qRT-PCR). We found significantly  ). For emm4 GAS, transcript levels reflect four historical strains (in blue) and four emergent strains (in red). RNA was extracted from ME cultures as described in Methods. TaqMan qRT-PCR data for both panels D and F are means 6 standard deviations of two biological replicates, with two technical replicates, done on two separate days. pos., positive; neg., negative. reduced slo transcript levels in emergent strains compared to historical strains (Fig. 4D). Next, we sought to correlate Nga/SLO protein expression with transcript levels determined by qRT-PCR. Consistent with the decreased nga and slo transcript levels in the emergent strains, we observed less Nga in the supernatant derived from emergent strains (Fig. 4E). Inasmuch as acapsular strains (e.g., emm4) have been associated with high Nga/SLO toxin expression (25), we compared nga and slo transcript levels from historical and emergent emm4 strains to strains representative of other GAS emm types with the high-level nga or slo promoter variant (25). We observed significantly lower slo transcript levels in emm4 GAS compared to emm1, emm3, emm28, and emm89 strains and similar levels compared to the emm87 strain (Fig. 5F). Thus, in total, the above data suggest distinct regulatory programs governing emergent and historical emm4 strains and that emm4 clonal replacement did not result from augmented Nga/SLO toxin production.
Emergent strains are more virulent compared to historical emm4 GAS strains. Emergence of novel GAS strains resulting in clonal replacement has previously been associated with enhanced virulence compared to historical strains (5-7). Given the nearly complete replacement observed in our temporal analysis and despite the seemingly paradoxical in vitro growth difference, we hypothesized that emergent emm4 strains possess increased virulence compared to historical strains in models of GAS disease. The same historical and emergent strains used in transcriptomic analyses were used for virulence assays (Table S3). In a model of GAS bacteremia, groups of mice were inoculated with one of five doses of GAS (range, 5 Â 10 6 to 6 Â 10 8 ) in order to determine the dose at which 50% of the mice become moribund (LD 50 ). Mice exhibited a dose response to GAS infection with either emergent or historical clade strains (Fig. S4E to S4G). With the exception of the lowest dose, mice infected with emergent clade GAS strains exhibited significantly decreased survival compared to historical emm4 GAS strains. For example, the median survival time when infected with 2 Â 10 8 CFU (Fig. 5A) was 23.5 h in emergent strains compared to 181.5 h for historical strains. Next, we compared the ability of emergent and historical clade strains to survive in human blood ex vivo using the Lancefield bactericidal assay (55). In agreement with the results from the mouse challenge, emergent clade emm4 strains demonstrated significantly increased growth in human blood compared to the historical strains (Fig. 5B), although there was substantial strain-to-strain variability (Fig. S4H). The increased virulence observed in emergent emm4 GAS strains that have nearly replaced all previously circulating strains herein suggests that mechanisms that do not involve augmentation of Nga/Slo toxin expression can lead to GAS epidemic emergence. infected with four historical (blue) or four emergent (red) emm4 GAS strains (2 Â 10 8 CFU). The P value of ,0.0001 was determined by the log rank test. Survival curves following infection with other dose ranges are shown in Fig. S4 in the supplemental material. (B) Survival of emergent (red) or historical (blue) strains following exposure to human blood ex vivo. Error bars represent standard deviations of four emergent and four historical emm4 GAS strains assayed in quadruplicate using three independent donors. The two asterisks indicate P , 0.05 (Mann-Whitney U test). Individual strain comparisons are shown in Fig. S4.

DISCUSSION
Whole-genome sequencing (WGS) of large cohorts of clinical isolates facilitated by the dramatic decrease in sequencing costs over the past 15 years have allowed for clear detection of clonal emergence and spread of numerous major human bacterial pathogens such as USA300 methicillin-resistant Staphylococcus aureus (1, 2) and lineage 11 Neisseria meningitidis (4). Similarly, WGS of distinct GAS emm types has identified clonal emergence and replacement events resulting in tens of thousands of lifethreatening human infections (5)(6)(7)45). We previously detected an unrecognized emm-enn fusion event in contemporary emm4 GAS isolates (27). Herein, we used WGS to analyze .1,000 emm4 GAS strains of diverse temporal and geographic origins to ascertain the extent and broader molecular context of the emm-enn gene chimera. Our results show that there has been nearly complete strain replacement among emm4 GAS in North America over the past 20 years. Importantly, this strain replacement is not defined by acquisition of exogenous DNA which has previously been the molecular basis attributed to GAS clonal replacement (6,9).
A critical strength of our study relative to other investigations of bacterial clonal emergence was that the replacement event occurred during a period of active surveillance (CDC active surveillance for GAS began in 1997) which facilitated collection of large numbers of both historical and emergent isolates. In turn, this meant that we could use large-scale phylogenomic analyses to accurately select representative strains from both the historical and emergent populations on which to perform comparative studies rather than having to rely on a small number of convenience samples for the historical isolates. Inasmuch as a single nucleotide polymorphism can markedly change GAS phenotype (9,56), having robust sample collections for both the historical and emergent clades markedly increases the confidence that our findings are generally applicable to the populations being studied. Additionally, the large number of historical and emergent isolates facilitated intra-emm type comparison of relative rates of gene polymorphisms. Intriguingly, the hypovirulent historical isolates had higher rates of covS mutation, which we have recently shown increases emm4 virulence (57). Conversely, the emergent strains had higher rates of mga variation, which we postulate may reflect the in vitro growth issues we identified in the emergent strains given the critical role of Mga in GAS metabolism (41).
Nearly 10 years ago, emm4 and emm22, were the first identified acapsular emm types, a finding which was explained by the absence of the capsule-encoding hasABC operon (21). This discovery was surprising given that the GAS hyaluronic acid capsule has been considered critical to GAS pathogenesis for nearly a century (58). However, in the decade since the initial recognition of acapsular emm4 and emm22 GAS, we and others have described numerous other emm types either lacking or having abrogating mutations in the hasABC operon (22,25). Indeed, acapsular GAS bacteria account for .30% of invasive GAS isolates as determined by active CDC surveillance (24). To date, the majority of acapsular GAS investigation has focused on emm89, as strains lacking the hasABC operon have recently emerged from and replaced their encapsulated precursors to cause epidemics in both the United States and Europe (7,53). The "new" emm89 strains have several regions of large-scale genetic recombination relative to replaced strains (26). The GAS emm types that are the source of these recombined areas have not been fully defined, but it is clear that the emergent emm89 clone contains numerous pieces of exogenous DNA that likely have contributed to its global spread. In contrast, we found emergent emm4 GAS are not distinguished from the replaced strains by acquisition of exogenous DNA despite our use of multiple, complementary methodologies. Thus, the emergent emm4 GAS strains are not defined by the presence of a new virulence factor, which also stands in contrast to various successful GAS clones such as contemporary emm3 strains that contain an actively secreted phospholipase (sla) (15). Some 50% of new emm4 strains contain the DNase Sdn which was very rarely identified in historical strains (Fig. 2), but half of the new strains lack Sdn, meaning that Sdn acquisition cannot explain emergence of the new strains. Our findings of small changes in endogenous DNA content rather than acquisition of exogenous DNA defining an emerging GAS clade echo those recently described for an emm1 clone in England causing scarlet fever (59). Thus, together with the recent emm1 data, our emm4 clonal emergence findings suggest that increased focus needs to be paid to understanding mechanisms by which alteration of endogenous DNA content impacts GAS pathophysiology.
The GAS recombination event that has been most studied thus far involves the nga or slo operon and results in augmented transcript levels of the gene encoding the streptolysin O toxin through acquisition of a "high-activity" promoter variant (9,25). Clonal replacement and emergence of both emm1 (6) and emm89 (7, 53) GAS have been at least partially ascribed to this particular event. Indeed, all of the genes differentially transcribed between currently circulating and replaced emm1 strains were located in the nga or slo recombination area (6). Additionally, acquisition of a high-activity nga or slo promoter by a variety of emm types through recombination has been specifically associated with acapsular GAS strains (25). In stark contrast, we found no difference between the historical and emergent emm4 GAS strains in terms of nga/slo region genetic composition. Additionally, we identified that both historical and emergent emm4 GAS strains produce significantly lower slo transcript levels relative to strains with an identical high-activity nga or slo promoter. We also found that similarly low nga and slo transcript levels were present in an acapsular emm87 strain with a high-activity nga or slo promoter. Therefore, our data challenge the notion that the nga or slo region is the sole driver of acapsular GAS strain emergence, and they conclusively show that acapsular strains can cause serious infections without elevation of slo transcript levels.
Our analyses have identified several marked phenotypic and genotypic differences between the historical and emergent strains, which complicates the ability to definitively associate a single trait with emm4 GAS clonal emergence. Surprisingly, the emergent strains had a clear growth defect relative to the historical strains. As noticeable as this defect is under in vitro conditions, the epidemiologic data clearly show that it does not affect the overall fitness of the organism in the human host, as demonstrated by the hundreds of invasive GAS infections caused by the emergent emm4 clade. Similarly, our RNA-seq data show markedly lower transcript levels of numerous key virulence factor-encoding genes such as prtS and sda3 in the emergent strains. In contrast to the RNA-seq data, however, the emergent strains were more virulent in an intraperitoneal mouse challenge model and an ex vivo human blood model, findings which are consistent with emm1 (6) and emm89 (7, 53) studies in which the expanding clones were hypervirulent relative to the replaced strains. However, increased virulence in emergent emm4 strains was not associated with alterations in the nga or slo region as has been clearly demonstrated for emm1 and emm89 epidemic strains. We identified multiple small-scale genetic events separating the emergent and historical emm4 strains, such as the emm-enn gene fusion and early stop codons in genes encoding two regulatory proteins previously identified as impacting GAS pathophysiology. Additionally, the emergent strains contained a single amino acid change in RopB and produced slightly more of the broad-spectrum, actively secreted protease SpeB relative to historic stains. None of the genetic changes that we identified in the emerging emm4 strains were shared with the 27 SNPs that defined the new emm1 English strains discussed above (59). Interestingly, our extensive phylogenetic analysis did not reveal a stepwise acquisition of the genetic differences separating the emergent and historical strains suggesting that a previously unrecognized clone containing all of the changes may have been introduced into the population. Alternatively, the "intermediate" strains may not have caused invasive infection such that they were not available for our analysis. Which of these factors or combination of factors accounts for observed phenotypic differences is an active area of further investigation in our laboratories.
Another key difference between our investigation and other studies of GAS strain emergence relates to total disease burden. For both emm1 and emm89 GAS, the emerging clones caused markedly more total infections relative to the precursor strains (6,7,53), findings which in turn spurred subsequent investigation. In contrast, our study of emm4 GAS was driven by the WGS identification of the emm-enn gene fusion in a single strain (27). Indeed, total emm4 invasive disease rates have not markedly changed in the United States over the past 20 years, possibly suggesting a defined ecologic niche for emm4 organisms. Thus, our findings show that even when total GAS disease due to a particular emm type is relatively stable, there may be sizable variation in clonal distribution. To this end, the public availability of WGS from invasive GAS strains provided by the CDC is likely to assist with identification and characterization of heretofore unrecognized shifts in GAS epidemiology.
In conclusion, via WGS of .1,000 strains collected over a 201 year period of active surveillance, we have discovered nearly complete replacement of previously circulating emm4 GAS strains by a new clone in high-income settings. In contrast to prior GAS clonal replacement investigations, emergent emm4 GAS strains are defined by small changes in endogenous DNA, rather than acquisition of exogenous DNA, and have low slo transcript levels. Yet, emergent emm4 strains still have marked phenotypic variation relative to the replaced strains, indicating that much remains to be learned regarding the factors driving proliferation of GAS clones within the human population.

MATERIALS AND METHODS
GAS strains used in this study. GAS strains are listed in Table S1 in the supplemental material. A total of 583 strains were obtained from national-level surveillance performed by the CDC Active Bacterial Core Surveillance (ABCs) beginning in 1997 (60,61). Likewise, from a total of 812 strains, a representative subset of strains (n = 203) were selected from surveillance derived from the Toronto Invasive Bacterial Diseases Network and Alberta Provincial Laboratory for Public Health. All strains were grown using standard laboratory media as previously described (62). Genome sequence information from the remaining strains was obtained from the NCBI Sequence Read Archive (see "Data availability" below).
Genomic DNA extraction, short-and long-read sequencing, and bioinformatic methods. For all short-read, Illumina-based sequencing, high-quality genomic DNA was extracted from GAS cells grown overnight at 37°C and 5% CO 2 on SBA using the Qiagen DNeasy kit modified for Gram-positive bacteria as previously described (63). Genomic DNA used in Oxford Nanopore long-read sequencing was extracted from cells using a cetyl trimethylammonium bromide (CTAB) protocol adapted for GAS to enrich for longer DNA fragments. The quantity and quality of DNA were measured using a nanodrop spectrophotometer (ThermoFisher) or Take3 microvolume plate reader (BioTek). DNA libraries for Illumina sequencing were generated using the Illumina Nextera Flex kit per the manufacturer's instructions. Library quality was ensured using an Agilent TapeStation and sequenced using either an Illumina MiSeq or HiSeq4000 instrument. Long-read sequencing libraries were generated using the Oxford Nanopore Technologies Rapid Barcoding kit and GridION instrument per the manufacturer's instructions.
The complete genome sequence was resolved for a subset (n = 8) of emm4 GAS strains (Table S3). Strains were chosen to represent major subclades and included strains wild type at all known two-component systems and stand-alone regulators except in cases where the clade or subclade was defined by such a mutation. Complete genome assembly was performed using both Illumina short-read and Oxford Nanopore GridION long-read sequences. Average depth of coverage for completed genomes exceeded 300-fold and was at least 50-fold for both short-and long-read sequences. Complete hybrid (short-and long-read) assemblies were obtained using Unicycler (v0.4.6) (67). Final circular chromosomes were annotated using PGAP at NCBI (52). A pangenome analysis was performed on strains used in RNAsequencing experiments (n = 8, Table S3) using Roary (v3.12.0) (68) to define shared gene content.
Identification of MGE was performed using SPAdes-corrected short reads and SRST2 (v0.2.0) (69). MGE genotype was inferred based on the presence/absence of site-specific integrases or secreted virulence factors using a published database (45). Virulence factor content (e.g., exotoxin and secreted DNase) was assessed for consistency by blastn (70) comparison of de novo assemblies to a custom database containing all known exotoxin and secreted DNase alleles (71). In addition, completed genomes (n = 14) were manually inspected for the presence/absence of MGE, associated virulence genes, and sites of integration.
In vitro growth analysis of selected emm4 GAS strains. Growth curves for the four historical and four emergent clade strains were performed in THY or C medium as previously described (62). Briefly, overnight cultures grown at 37°C and 5% CO 2 were used to inoculate (1/50 volume) prewarmed medium and distributed to a 96-well microtiter plate. Growth was monitored in a BioTek Synergy H1 plate reader, and readings were obtained every 30 min following shaking to redistribute cells in individual wells. Individual strains were assayed in technical quadruplicate and in biological triplicate on two independent days. Milk plate assays were performed in technical quadruplicate as previously described (39).
RNA extraction, sequencing, and analysis. High-quality RNA was extracted from GAS cells grown to mid-exponential phase (optical density at 600 nm [OD 600 ] of 0.5) in THY at 37°C and 5% CO 2 using the Qiagen RNeasy minikit as previously described (73). The quantity and quality of RNA were assessed using nanodrop (ThermoFisher) and by Agilent TapeStation. RNA-seq libraries were prepared using the Illumina ScriptSeq Complete Bacteria (v3) kit per the manufacturer's instructions. Libraries were sequenced (150-bp PE) using an Illumina HiSeq4000 instrument. RNA-seq analysis was performed using CLC Genomics Workbench (v20). Raw reads were processed and mapped to the reference genome ABC25 (CP049696). Analysis was restricted to core genome content (n = 1,619 open reading frames/genes) based on the pangenome analysis for the completed genomes. Significant differential gene expression was defined as an absolute fold difference of .1.5 and Bonferroni-corrected P value of ,0.05.
Targeted gene transcript analysis. For TaqMan real-time qRT-PCR, strains were grown in duplicate on two separate occasions to mid-exponential phase in THY as described above and processed as described previously (74). Primers and probes used targeted slo (forward, GACCTTTAAAGAGTTGCAACGAAAA; reverse, GACC ATAAGCTACGTTACTCACAAAGA; probe, TGTCAGCAATGAAGCCCCGCC) with tufA (forward, CAATCGTCACTAT GCGCACAT, reverse, GAGCGGCACCAGTGATCAT; probe, CTCCAGGACACGCGGACTACGTTAAAAA) as an internal control.
SLO immunoblot assay. GAS strains were grown to mid-exponential phase in 45 ml of THY as described above. Cultures were centrifuged to separate cells, supernatants were precipitated with acetone and resuspended in 10 mM Tris and 150 mM NaCl. Protein concentration was estimated by the Bradford method, and equal amounts of total protein were separated by sodium dodecyl sulfate-polyacrylamide gel electrophoresis (SDS-PAGE) and transferred. HPr and Nga were detected using a custommade antibody (75) and a commercially available (Abcam) polyclonal antibody, respectively, and the Odyssey imaging system as described previously (51).
Murine model of GAS bacteremia. All mouse studies were performed under a protocol approved by the Institutional Animal Care and Use Committee at the MD Anderson Cancer Center. Four historic (ABC3, ABC25, ABC76, and ABC221) and four emergent (ABC199, ABC208, TSPY637, and TSPY767) strains were used. For the LD 50 studies, seven female BALB/c mice per dose per strain (Harlan-Sprague-Dawley) were injected intraperitoneally with a range (5 Â 10 6 to 6 Â 10 8 ) of GAS CFU as described previously (57). Mice were monitored for near-mortality over 7 days, and survival was compared using Kaplan-Meier analysis. Differences in survival were calculated using a Mantel-Cox (log rank) analysis with a P value of ,0.05 considered statistically significant.
Lancefield assays. Bactericidal assays were performed as previously described (56) under a protocol approved by the Committee for the Protection of Human Subjects at the University of Texas (UT) Health/McGovern Medical School. Blood samples from a minimum of three healthy, nonimmune, adult donors were used for each strain and performed in quadruplicate. Multiplication factors were calculated by dividing the number of CFU per milliliter after 3 h of incubation by the initial inoculum.
Statistical analyses. All statistical analyses were performed using GraphPad/Prism (v8). Statistical comparisons of continuous variables used Student's t test (normally distributed) or Mann-Whitney U test (nonnormally distributed). Survival curves for mouse infection studies were compared using log rank test. P values less than 0.05 were considered significant for all tests.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.

ACKNOWLEDGMENTS
We thank Bernard Beall, Sopio Chochua, and Chris Van Beneden for assistance with emm4 GAS strains derived from the CDC ABCs and J. Chase McNeil and Loren Sommer for assistance with GAS strain collection in the Texas Medical Center in Houston.
This work was supported by NIH R21 AI142126 and R21 AI159059 to A.R.F. and S.A.S. and T32 AI141349 to L.A.V.