Analysis of the Genomics and Mouse Virulence of an Emergent Clone of Streptococcus dysgalactiae Subspecies equisimilis

Our studies addressed a critical knowledge gap in understanding the genomics and virulence of the bacterial pathogen Streptococcus dysgalactiae subsp. equisimilis. S. dysgalactiae subsp. equisimilis strains are responsible for a recent increase in severe human infections in some countries. We determined that certain S. dysgalactiae subsp. equisimilis strains are genetically descended from a common ancestor and that these strains can cause severe infections in a mouse model of necrotizing myositis. ABSTRACT Streptococcus dysgalactiae subsp. equisimilis is a bacterial pathogen that is increasingly recognized as a cause of severe human infections. Much less is known about the genomics and infection pathogenesis of S. dysgalactiae subsp. equisimilis strains compared to the closely related bacterium Streptococcus pyogenes. To address these knowledge deficits, we sequenced to closure the genomes of seven S. dysgalactiae subsp. equisimilis human isolates, including six that were emm type stG62647. Recently, for unknown reasons, strains of this emm type have emerged and caused an increasing number of severe human infections in several countries. The genomes of these seven strains vary between 2.15 and 2.21 Mbp. The core chromosomes of these six S. dysgalactiae subsp. equisimilis stG62647 strains are closely related, differing on average by only 495 single-nucleotide polymorphisms, consistent with a recent descent from a common progenitor. The largest source of genetic diversity among these seven isolates is differences in putative mobile genetic elements, both chromosomal and extrachromosomal. Consistent with the epidemiological observations of increased frequency and severity of infections, both stG62647 strains studied were significantly more virulent than a strain of emm type stC74a in a mouse model of necrotizing myositis, as assessed by bacterial CFU burden, lesion size, and survival curves. Taken together, our genomic and pathogenesis data show the strains of emm type stG62647 we studied are closely genetically related and have enhanced virulence in a mouse model of severe invasive disease. Our findings underscore the need for expanded study of the genomics and molecular pathogenesis of S. dysgalactiae subsp. equisimilis strains causing human infections. IMPORTANCE Our studies addressed a critical knowledge gap in understanding the genomics and virulence of the bacterial pathogen Streptococcus dysgalactiae subsp. equisimilis. S. dysgalactiae subsp. equisimilis strains are responsible for a recent increase in severe human infections in some countries. We determined that certain S. dysgalactiae subsp. equisimilis strains are genetically descended from a common ancestor and that these strains can cause severe infections in a mouse model of necrotizing myositis. Our findings highlight the need for expanded studies on the genomics and pathogenic mechanisms of this understudied subspecies of the Streptococcus family.

and Oxford Nanopore long-read sequencing. The seven isolates included six group C strains of emm type stG62647 (MGCS35823, MGCS35922, MGCS35957, MGCS36044, MGCS36083, and MGCS36089), which were selected because strains with these characteristics have recently emerged in multiple countries as a cause of increased numbers of severe infections (9,36,43). One group G stC74a strain (MGGS36055) was selected as a comparator, because strains with these characteristics have been reported to be an abundant cause of human infections in multiple epidemiological studies (7,11,15,34,(44)(45)(46). The depth of short-read sequencing coverage averaged 587-fold (range, 204-to 1,539fold) and long-read coverage averaged 94-fold (range, 21-to 239-fold), for a combined average of 681-fold (range, 337-to 1,560-fold) (see Table S1 in the supplemental material). The assembled and closed genomes had an average size of 2,196,360 bp (range, 2,151,694 to 2,233,554) with an average 2,130 protein-encoding genes (range, 2,075 to 2,209) and an average G1C content of 39.4% (range, 39.3% to 39.6%) ( Table 1 and Table  S2). Although some closed S. dysgalactiae subsp. equisimilis genomes differed in their overall architecture due to large regions of sequence inversion (for example, NCTC7136 versus ATCC 12394), these seven genomes were all colinear and had the same chromosomal architecture as reference strain ATCC 12394. These genome size and composition characteristics are consistent with those of the 21 publicly available closed S. dysgalactiae subsp. equisimilis genomes (Table S3). Consistent with the close species genetic relatedness, 1,280 of the 2,075 (61.7%) coding DNA sequences (CDSs) for the genome of strain MGCS36044 had a reciprocal Blastp best hit with a CDS of S. pyogenes emm1 strain MGAS2221 genome (GenBank accession number GCA_012572265.1).
Regions of difference and putative mobile genetic elements. The seven genomes differed in size primarily due to discrete regions of difference (RODs) in gene content attributable to integration of putative mobile genetic elements (MGEs), with gene content characteristic of integrative conjugative elements (ICEs) (47) and phages (48) (Fig. 1). These RODs are the major source of genetic diversity and gene content difference among the seven genomes, constituting on average 267.4 kbp or 12.2% of each genome sequence. In comparing gene content among the seven genomes using ST20/stG62647 strain MGCS36044 as the reference, there were nine RODs (comprising 215.4 kbp) that had gene content that was largely present among all six of the ST20/stG62647 genomes but was largely absent from the ST17/stC74a strain MGGS36055 genome ( Fig. 1A and Table S4). Conversely, using ST17/stC74a strain MGGS36055 as the reference, there were 11 RODs (comprising 270.2 kbp) that had gene content that was present in the stC74a strain MGGS36055 genome but was largely absent from the six stG62647 genomes (Fig. 1B and  Table S5). Cumulatively, these 20 RODs constituted 485.6 kbp of genetic content differing between strains MGCS36044 and MGGS36055. These RODs ranged in size from 5.2 to 70.7 kbp, and each encoded 4 to 70 inferred genes. Nearly all of these RODs encoded a site-specific integrase or recombinase gene, most of which were terminally located. Moreover, the G1C content of most of these RODs was on average 3% to 7% lower than the overall average of the genome. These characteristics-variable presence among the strains, differing from the core genome in nucleotide composition, and encoding an integrase gene-were all consistent with these RODs being integrative mobile genetic elements that had been acquired exogenously. Although not systematically evaluated, much of the gene content of (Continued on next page) the RODs shared sequence similarity with MGEs (phages and ICEs) of other streptococci, in particular S. pyogenes and S. dysgalactiae subsp. dysgalactiae. The larger ($30 kbp) RODs had mobilization and structural gene content characteristic of ICEs and phages, whereas many of the smaller RODs encoded an integrase-or recombinase-like gene but lacked other gene content commonly found in ICEs (e.g., genes encoding conjugal transfer proteins) and phages (e.g., genes encoding coat and tail structural proteins), making the nature of their putative mobilization cryptic. These smaller RODs may represent integrative and mobilizable elements (IMEs) (49) or satellite prophages (48). Some of the smaller RODs, which lacked an integrase gene, were flanked terminally by insertion sequences which were prevalent in the S. dysgalactiae subsp. equisimilis genomes (average, 95) and may have represented composite transposons. An extrachromosomal ICE in stG62647 strain MGCS35922. In addition to the 20 RODs that largely differed in all six stG62647 strains from stC74a strain MGGS36055 and vice versa, there were several additional RODs with gene content largely present in only one stG62647 strain of the cohort (Table S6). Among these strain-specific RODs, we identified a 47-kb circular element that Unicycler assembled in its entirety extrachromosomally from the stG62647 strain MGGS35922 closed genome. Because of the concern that this might represent an assembly error, the strain MGGS35922 extrachromosomal (35922ec) circularized element was checked for consistency with the shortand long-read sequencing data by read mapping. No locus was found where there was discontinuity in either the paired mapping of short reads or disrupted mapping of long reads to indicate that the element was inappropriately assembled as an extrachromosomal closed circularized sequence. Moreover, 35922ec had gene content encoding conjugal transfer and mobilization proteins and a serine-type site-specific integrase or recombinase flanked by an attachment site sequence homologous to (i.e., targeting integration to) the chromosomally encoded 23S RNA methyltransferase gene rlmD, all characteristics consistent with it being an ICE (50) (Fig. 2). Additionally, 35922ec encoded a RepA-like replication protein and a chromosome segregation ATPase protein, findings consistent with the capacity for episomal replication (47). Compared to the seven closed genomes, the 35922ec ICE had sequence similarity with elements integrated at rmlD and corresponding to ROD.7, as defined relative to the strain MGCS36044 genome ( Fig. 1A) and strain MGGS36055 genomes (Fig. 1B). 35922ec had partial gene content that was present in the genomes of all seven strains of the cohort, including the MGGS35922 genome (Fig. 2). As an additional test of the genome assembly validity, PCR was used to check the structure at the putative extrachromosomal element 35922ec putative attachment site and chromosomal attachment and integration sites flanking 35922_ROD.7 (Fig. S1). All PCR amplicons were consistent with the genome architecture produced by the Unicycler hybrid sequencing assembly. 35922ec and 35922_ROD.7, although having some ICE-like gene content in common, differed in sequence and overall length by 13 kbp. We hypothesized that the 35922ec ICE did not integrate into the strain MGGS35922 chromosome because the rlmD integration site was already occupied by 35922_ROD.7 ICE (Fig. 2). Experimental studies are required to test this hypothesis.
Additional sources of genetic content diversity. In addition to the RODs with characteristics of MGEs, there were several additional regions of lower variation in genetic Genomic landmarks are indicated as labeled in the ring adjacent to the strain MGCS36044 genome. Of note are nine regions of difference constituting 215.4 kbp, which contain gene content that is largely present in the six stG62647 genomes but is largely absent from the stC74a comparator strain MGGS36055 genome (see Table S3 in the supplemental material). (B) Illustrated is the 2,204,518-bp genome of GGS, ST17, stC74a strain MGGS36055 and the genetic content it shares with the six other strains of the cohort. Of note, there are 11 regions of difference constituting 270.7 kbp which contain gene content present in the stC74a MGGS36055 genome but largely absent in the six stG62647 genomes (Table S4). For both strain MGCS36044 and strain MGGS36055, many of the RODs encode an integrase and have a G1C content that is 3% to 7% lower than that of the genome average (approximately 39.4%), consistent with these regions potentially being integrative mobile genetic elements of exogenous origin. content that differed between the six stG62647 strains and stC74a strain MGGS36055. These regions included gene content encoding pilus, M protein, carbohydrate group antigen, and the extended streptococcal invasion locus (SIL) (each indicated as a landmark in Fig. 1). These were not regions of complete gene content difference. That is, they did not simply differ by strain-to-strain gene presence or absence, but rather had partial gene content that was conserved and partial content that varied. The regions had in common that each encoded one or more inferred secreted or surface-attached molecules. In principle, these proteins and sugar polymers interact with the host immune system, and genetic variation at these loci may reflect diversifying evolutionary selection for polymorphisms that contribute to evasion of the host immune response. Oppegaard et al. (9) reported that among the 18 ST20/stG62647 isolates collected in western Norway that they studied, the SIL quorum-sensing regulon (51)(52)(53) was disrupted by insertion of IS1548 into the silB gene encoding the two-component system sensor histidine kinase SilB. The investigators postulated that disruption of the silB gene contributed to ST20/stG62647 isolates causing severe infections. All six of the stG62647 isolates collected in French Brittany that we studied also had the same transposon insertion (IS1548) disrupting silB, whereas stC74a strain MGGS36055 lacked all six SIL genes (silA, -B, -C, -CR, -D, and -E) (Fig. 3). Among the 21 closed S. dysgalactiae subsp. equisimilis genomes in GenBank, the SIL genes are absent in 5 but present in 16 (Table S3). However, in 5 of the 16 genomes with the SIL genes present, one or more of the SIL genes were disrupted, including silB in the strain TPCH-A19. Clearly the SIL region, not to mention the additional bacteriocin-like gene content flanking it and constituting the extended SIL, was fairly variable among S. dysgalactiae subsp. equisimilis genomes. The function of SIL has not been experimentally characterized in S. dysgalactiae subsp. equisimilis, and thus the genes it regulates and its role in S. dysgalactiae subsp. equisimilis pathogenesis are currently unknown.
Phylogenetic analysis of the seven genomes. Phylogenetic analysis of the seven genomes based on core-genome single-nucleotide polymorphisms (SNPs) found that the six stG64647 strains we studied were closely genetically related, differing from each other on average by only 495 core chromosomal SNPs (Fig. 4). These six isolates (collected in Brittany, France between 2015 and 2018) were of the same multilocus sequence type clonal complex, CC20 (CC20 is composed of ST20 and its single-locus variants), as were 18 of the 19 stG62647 strains recently reported by Oppegaard et al. (9) as an emergent cause of severe infections in western Norway. Phylogenetically, these two sets of isolates clustered together (Fig. 4), consistent with them being clonally derived from an evolutionarily recent common progenitor. In contrast, the stC74a comparator strain MGGS36055 along with reference strain RE378 and Norway isolate T666 were multilocus sequence type CC17 and differed on average from the CC20 isolates by 12,050 core SNPs (Fig. 4).
Lancefield carbohydrate group identification. S. dysgalactiae subsp. equisimilis strains can be of streptococcal carbohydrate group antigen types G, C, or A. Bioinformatic analysis of the region of the genome encoding the Lancefield carbohydrate group antigen synthesis genes of the six clonally related stG62647 strains indicated that these organisms are group C and that the stC74a strain MGGS36055 is group G. This bioinformatic determination was confirmed experimentally with a commonly used immunologic assay, as described in Materials and Methods (Table 1). Shown in a gradient of gray, indicating percent sequence identity, are strain-to-strain aligned regions. The SIL is composed of six genes: silA, -B, -C, -CR, -D, and -E (highlighted in yellow), which are variably present among S. dysgalactiae subsp. equisimilis genomes. In multiple streptococcal species, the SIL genes are flanked by additional variably present genes with similarity to genes involved in competence quorum sensing and bacteriocin production. Together, the SIL and flanking competence/bacteriocin-like genes constitute the extended SIL. The extended SIL among the six ST20/stG62647 genomes is 16.5 kb, encodes 21 genes which, strain to strain, share greater than 99.4% nucleotide identity. SilB is disrupted by insertion of IS1548 in all six stG62647 genomes. In contrast, the corresponding locus of the ST17/stC74a strain MGGS36055 genome is only 2.9 kbp and lacks all six SIL genes. Sequence similarity between stC674a strain MGGS36055 and the six stG62647 strains is limited primarily to the four bacteriocin genes at the 39 end of the extended SIL.
MICs for beta-lactam antibiotics. Decreased susceptibility of beta-hemolytic streptococci to beta-lactam antibiotics has been reported with increased frequency in recent years (3,16,39,(54)(55)(56)(57)(58). Concerningly, Fuursted et al. (16) described four clonally related S. dysgalactiae subsp. equisimilis strains resistant to penicillin in vitro and reported that these strains had multiple amino acid changes in PBP2X, the primary target of penicillin in streptococci. Thus, we evaluated MICs in our seven S. dysgalactiae subsp. equisimilis strains for five commonly used beta-lactam antibiotics, including The strain-to-strain mean genetic distance (MGD, i.e., core SNP difference) between the 21 closed reference genomes is 20,428 (range,137 to 42,147), the MGD between the 17 Norway emergent stG62647 cluster isolates is 291 (range, 42 to 776), the MGD between the 6 French Brittany stG62647 closed genomes is 495 (range, 237 to 847), the MGD between all 23 isolates of the emergent stG62647 genetic cluster is 379 (range, 42 to 1,080), and the MGD between the 6 French Brittany stG62647/ST20 isolates and stC74a/ST17 strain MGGS36055 is 11,022 (range, 10,976 to 11,097).
amoxicillin, cefotaxime, cefoxitin, meropenem, and penicillin G. None of the seven S. dysgalactiae subsp. equisimilis strains had MICs deviating from the susceptible range for the five beta-lactam antibiotics tested ( Table 2). Consistent with our in vitro susceptibility data, the PBP2X gene in the seven isolates studied did not have amino acid changes expected to result in substantial increases in MICs to these antibiotics. That is, there were no amino acid changes in the PBP2X transpeptidase domain and catalytic motifs that have been associated with reduced beta-lactam susceptibility in streptococci (data not shown). stG62647 strains are significantly more virulent than a genetically divergent strain of emm type stC74a. Epidemiological surveillance reports and outbreak investigations from diverse locations suggest that emm type stG62647 strains may cause more virulent human infections than S. dysgalactiae subsp. equisimilis strains of other emm types, but this has not been experimentally shown. To test the hypothesis that emm type stG62647 strains of the ST20 genetic lineage are more virulent than other S. dysgalactiae subsp. equisimilis genetic lineages that cause abundant human infections, we compared the virulence of ST20/stG62647 strains MGCS36044 and MGCS36089 relative to that of ST17/stC74a strain MGGS36055 using a well-established mouse model of necrotizing myositis (38)(39)(40)(41)59). These three selected strains were of genetic lineages that cause abundant infections and were also temporally, geographically, and disease manifestation matched. The strains were recovered between March and June 2018 from cases of invasive osteitis infection that occurred in Brittany, France. Of note, although stG62647 strains MGCS36044 and MGCS36089 were virtually identical in gene content, they differed by 755 core chromosomal SNPs, which was greater than the average of 379 SNPs that separated the 23 strains of the emergent stG62647 cluster (Fig. 4), consistent with MGCS36044 and MGCS36089 having an evolutionarily less recent common ancestor. Compared to stC74a strain MGGS36055, the two stG62647 strains, MGCS36044 and MGCS36089, led to significantly higher bacterial burdens in infected limbs and caused significantly more mortality (Fig. 5A to C). The stG62647 strains also caused significantly larger lesions with more tissue destruction (Fig. 2D). Together, these data demonstrated that the two stG62647 strains studied were significantly more virulent than the comparator stC74a strain studied and supported the contention (9) that strains of the emergent ST20/stG62647 clonal lineage cause infections with a more aggressive clinical course.
The data presented here show that the six Brittany, France strains of type ST20/ stG62647 we studied were clonally derived. These are the only isolates of ST20/ stG62647 for which genomes have been sequenced to closure. Analogous to important findings presented by Oppegaard et al. (9) for strains from western Norway, members of this clone also have IS1548 inserted in the silB open reading frame. We also demonstrated that members of this clone have enhanced virulence in a commonly used mouse model of serious invasive disease, thereby adding new information about this emergent clone. However, our study was not designed to address the important topics of the molecular genetic events underlying emergence of the ST20/st62647 clone or the molecular mechanisms bearing on enhanced virulence of members of this clone. Our findings underscore the need for substantially expanded genomics and molecular

MATERIALS AND METHODS
Strains. The seven strains analyzed were part of a comprehensive prospective study of S. dysgalactiae subsp. equisimilis human infections conducted from 2010 through 2018 in Brittany, France (33,60). The study was conducted similarly to an investigation of S. pyogenes infections in Brittany, France (60) and will be described in detail elsewhere. Briefly, S. dysgalactiae subsp. equisimilis isolates were identified by standard diagnostic procedures used in the clinical microbiology laboratory, University Hospital of Rennes, France. Culture diagnosis was also conducted at the University Hospital, Rennes, France and analyzed by matrix-assisted laser desorption ionization-time of flight mass spectrometry (Bruker Daltonics GmbH, Germany). Isolates were subcultured at 37°C with 5% CO 2 on Columbia blood agar plates containing 5% sheep blood (Bio-Rad, France) and stored at 280°C.
Whole-genome sequencing. The genomes of the S. dysgalactiae subsp. equisimilis strains were sequenced with methods described previously for S. pyogenes (17,38,54,61,62). Briefly, strains were grown at 37°C with 5% CO 2 on tryptic soy agar with 5% sheep blood (Becton, Dickinson, Franklin Lakes, NJ) or in Todd-Hewitt broth with 2% (wt/vol) yeast extract (THY; Difco Laboratories, Franklin Lakes, NJ). Chromosomal DNA for Illumina short-read sequencing was isolated with the RNAdvance viral kit (Beckman Coulter, Brea, CA) and a BioMek i7 instrument (Beckman Coulter). Libraries were prepared with the NexteraXT kit (Illumina, San Diego, CA) and sequenced with a NovaSeq instrument (Illumina) using a 2 Â 250-bp protocol. Chromosomal DNA for long-read sequencing was isolated with a DNeasy blood and tissue kit (Qiagen, Germantown, MD). Libraries were prepared with either a native barcoding kit or rapid barcoding kit (Oxford Nanopore Technologies, United Kingdom) and sequenced with a GridION instrument using version R10.4 or R9.4.1 flow cells (Oxford Nanopore Technologies), respectively.
Genome assembly, closure, and annotation. Illumina paired-end short reads were base call error corrected with Karect (63), and Oxford Nanopore long reads were corrected with the FM index long read corrector (64). Hybrid assemblies of the preprocessed short-and long-read sequencing were generated with Unicycler (65). Unitigs from genomic assemblies that did not close were ordered against a closely related reference genome (or closed assembly) with progressiveMauve (66). Gaps between the ordered unitigs (mostly rRNA operons) were manually spanned or closed by splicing in corresponding sequences from one or more closely related closed genomes with Sequencher (Gene Codes, Ann Arbor, MI). These closed draft genome assemblies were checked for consistency with the Illumina PE short-read sequencing data by read mapping with SMALT (67) coupled with polymorphism detection with FreeBayes (68) and for structural inconsistencies with NucBreak (69). Consistency between the closed draft assemblies and the Oxford Nanopore long-read sequencing data was checked by read mapping with MiniMap2 (70). Correspondence of the mapped short and long reads with the draft closed genomes was visually inspected with Tablet (71). This reference-assisted and -directed genome closure process was iterated until discrepancies between the sequence data and the draft genomes were resolved. The resultant finished genomes were annotated de novo with RAST (72) and via transfer using PROKKA with S. dysgalactiae subsp. equisimilis strain ATCC 12394 as the reference (73). Annotations were merged and manually curated with Artemis (74). Multilocus sequence types of the isolates were determined by BLAST relative to the PubMLST database (https://pubmlst.org/), emm types were determined relative to the CDC emm type database (https://www2.cdc.gov/vaccines/biotech/strepblast.asp), and insertion sequence types were determined relative to the ISfinder database (https://isfinder.biotoul.fr/). Comparative genome atlases were generated with GView (75), and sequence alignment comparisons were generated with EasyFig (76).
Mouse model of necrotizing myositis and CFU determination. The mouse model of necrotizing myositis used in these experiments has been previously described and used extensively (38)(39)(40)(41)59). Animals were infected intramuscularly in the right lower hindlimb with 1 Â 10 8 CFU of the indicated strain and monitored daily for near mortality using standard criteria (77). Survival was graphed as a Kaplan-Meier curve, and statistical differences were determined with a log-rank test (GraphPad Prism V9.4.1; San Diego, CA). CFU from infected muscle were determined by culturing limb homogenates as described elsewhere (38)(39)(40)(41)59). Briefly, each infected limb was amputated, weighed, homogenized (Omni International, Kennesaw, GA) in 1 mL sterile phosphate-buffered saline, serially diluted, plated, and cultured overnight. CFU data were graphed as means 6 standard errors of the means (SEM), and statistical differences were determined with a Kruskal-Wallis test. For histopathology examination, amputated limbs were photographed with a Solo8 Hovercam (Pathway Innovations, Las Vegas, NV) and processed with standard pathology laboratory methods.
(i) Ethics statement. Mouse experiments were approved by the Institutional Animal Care and Use Committee of Houston Methodist Research Institute (protocol IS00006169).
Identification of Lancefield group carbohydrate. The Lancefield carbohydrate group antigen of the seven strains studied was determined with a latex agglutination method (BBL Streptocard enzyme latex test; Becton, Dickinson, Franklin Lakes, NJ) on overnight growth harvested from blood agar plates.
Determination of MICs for antimicrobial agents. MICs for beta-lactam antibiotics, including amoxicillin, cefotaxime, cefoxitin, meropenem, and penicillin G, were determined by the MIC test strip method (Liofilchem, Waltham, MA) according to the manufacturer's instructions. Strains were grown overnight on tryptic soy agar with 5% sheep blood (Becton, Dickinson), and fresh colonies were collected with the BBL prompt inoculation system (Becton, Dickinson) and plated onto Mueller-Hinton agar (Remel Microbiology Products, Lenexa, KS) with the antibiotic test strips. MICs (in milligrams per liter) were read after 24 h of incubation independently by two technologists.
Data availability. The annotated closed genome assemblies and short-and long-read sequencing data for the seven S. dysgalactiae subsp. equisimilis strains studied have been submitted to the National Center for Biotechnology Information (NCBI) under Bioproject accession number PRJNA925803.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only. SUPPLEMENTAL FILE 1, PDF file, 0.1 MB. SUPPLEMENTAL FILE 2, XLSX file, 0.03 MB.