Major genetic discontinuity and novel toxigenic species in Clostridioides difficile taxonomy

Clostridioides difficile infection (CDI) remains an urgent global One Health threat. The genetic heterogeneity seen across C. difficile underscores its wide ecological versatility and has driven the significant changes in CDI epidemiology seen in the last 20 years. We analysed an international collection of over 12,000 C. difficile genomes spanning the eight currently defined phylogenetic clades. Through whole-genome average nucleotide identity, and pangenomic and Bayesian analyses, we identified major taxonomic incoherence with clear species boundaries for each of the recently described cryptic clades CI–III. The emergence of these three novel genomospecies predates clades C1–5 by millions of years, rewriting the global population structure of C. difficile specifically and taxonomy of the Peptostreptococcaceae in general. These genomospecies all show unique and highly divergent toxin gene architecture, advancing our understanding of the evolution of C. difficile and close relatives. Beyond the taxonomic ramifications, this work may impact the diagnosis of CDI.


Results
An updated global population structure based on sequence typing of 12,000 genomes We obtained and determined the ST and clade for a collection of 12,621 C. difficile genomes (taxid ID 1496, Illumina data) existing in the NCBI Sequence Read Archive (SRA) as of 1 January 2020. A total of 272 STs were identified spanning the eight currently described clades, indicating that the SRA contains genomes for almost 40% of known C. difficile STs worldwide (n = 659, PubMLST, January 2020). C1 STs dominated the database in both prevalence and diversity ( Figure 1) with 149 C1 STs comprising 57.2% of genomes, followed by C2 (35 STs, 22.9%), C5 (18 STs, 10.2%), C4 (34 STs, 7.5%), C3 (7 STs, 2.0%), and the cryptic clades C-I, C-II, and C-III (collectively 17 STs, 0.2%). The five most prevalent STs represented were ST1 (20.9% of genomes), ST11 (9.8%), ST2 (9.5%), ST37 (6.5%), and ST8 (5.2%), all prominent lineages associated with CDI worldwide (Knight et al., 2015). Figure 2 shows an updated global C. difficile population structure based on the 659 STs; 27 novel STs were found (an increase of 4%) and some corrections to assignments within C1 and C2 were made, including assigning ST122 (Knetsch et al., 2012) to C1. Based on PubMLST data and bootstraps values of 1.0 in all monophyletic nodes of the cryptic clades (Figure 2), we could confidently assign 25, 9, and 10 STs to cryptic clades I, II, and III, respectively. There remained 26 STs spread across the phylogeny that did not fit within a specific clade (defined as outliers). The full MLST data and tree file for Figure 2 are available as Supplementary files 1a-d and 2 at http://doi.org/10. 6084/m9.figshare.12471461. Representative genomes of each ST present in the SRA were chosen based on metadata, read depth, and assembly quality. This resulted in a final dataset of 260 STs (C1, n = 149; C2, n = 35; C3, n = 7; C4, n = 34; C5, n = 18; C-I, n = 12; C-II, n = 3; C-III, n = 2) used for all subsequent bioinformatics analyses. The list of representative genomes is available in Supplementary file 1b.

Whole-genome ANI analysis reveals clear species boundaries
Whole-genome ANI analyses were used to investigate genetic discontinuity across the C. difficile species (Figure 3 and Supplementary file 1f). Whole-genome ANI values were determined for the final set of 260 STs using three independent ANI algorithms (FastANI, ANIm, and ANIb; see Materials and methods). All 225 STs belonging to clades C1-4 clustered within an ANI range of 97.1-99.8% (median FastANI values of 99.2, 98.7, 97.9%, and 97.8%, respectively; Figure 3A-C).
These ANI values are above the 96% species demarcation threshold used by the NCBI (Ciufo et al., 2018) and indicate that strains from these clades belong to the same species. ANI values for all 18 STs belonging to C5 clustered on the borderline of the species demarcation threshold (FastANI range 95.9-96.2%, median 96.1%). ANI values for all three cryptic clades fell well below the species threshold; C-I (FastANI range 90.9-91.1%, median 91.0%), C-II (FastANI range 93.6-93.9%, median 93.7%), and C-III (FastANI range 89.1-89.1%, median 89.1%). All results were corroborated across the three independent ANI algorithms ( Figure 3A-C). C. difficile strain ATCC 9689 (ST3, C1) was defined by Lawson et al. as the type strain for the species (Lawson et al., 2016) and used as a reference in all the above analyses. To better understand the diversity among the divergent clades themselves, FastANI analyses were repeated using STs 11, 181, 200, and 369 as reference archetypes of clades C5, C-I, C-II, and C-III, respectively. This approach confirmed that C5 and the three cryptic clades were as distinct from each other as they were collectively from C1-4 ( Figure 3D-G).
Taxonomic placement of cryptic clades predates C. difficile emergence by millions of years Previous studies using BEAST have estimated the common ancestor of C1-5 existed between 1 to 85 or 12 to 14 million years ago (mya) (He et al., 2010;Kumar et al., 2019). Here, we used an alternative Bayesian approach, BactDating, to estimate the age of all eight C. difficile clades currently described. The last common ancestor for C. difficile clades C1-5 was estimated to have existed between 1.11 and 6.71 mya. In contrast, all three cryptic clades were estimated to have emerged millions of years prior to the common ancestor of C1-5 ( Figure 4). Independent analysis with BEAST, using a smaller core gene dataset (see Materials and methods), provided temporal estimates of clade emergence that were of the same order of magnitude and, importantly, supported the same branching order for all clades ( Figure 4).
Similarly, the nearest ANI matches to species within the RefSeq dataset were several C. difficile strains (range C-I: 90.9-91.1%; C-II: 93.4-93.6%; and C-III: 89.2-89.4%) and Paeniclostridium sordellii (77.7-77.9%). A low ANI (range 70-75%) was observed between the cryptic clade genomes and 20 members of the Clostridium including Clostridium tetani, Clostridium botulinum, Clostridium perfringens, and Clostridium butyricum, the type strain of the Clostridium genus senso stricto. An updated ANI-based taxonomy for the Peptostreptococcaceae is shown in Figure 5A. The phylogeny places C-I, C-II, and C-III between C. mangenotii and C. difficile C1-5, suggesting that they should be  (260 STs) using FastANI, ANIm, and ANIb algorithms, respectively. Panels (D-G) show ANI plots for ST11 (C5), ST181 (C-I), ST200 (C-II), and ST369 (C-III) vs. all clades (260 STs), respectively. National Center for Biotechnology Information species demarcation of 96% indicated by red dashed line (Ciufo et al., 2018). assigned to the Clostridioides genus, distinct from both C. mangenotii and C. difficile. Comparative analysis of ANI and 16S rRNA values for the eight C. difficile clades and C. mangenotii shows significant incongruence between the data generated by the two approaches ( Figure 5B). The range of 16S rRNA % similarity between C. difficile C1-4, cryptic clades I-III, and C. mangenotii was narrower (range 94.5-100) compared to the range of ANI values (range 77.8-98.7). Curiously, C. mangenotii and C. difficile shared 94.5-94.7% similarity in 16S rRNA sequence identity, yet only 77.8-78.2% ANI, indicating that they should not even be considered within the same genus, as proposed by Lawson et al., 2016. We also extended our approach to five other medically important clostridia available on the NCBI database; C. botulinum (n = 783), C. perfringens (n = 358), Clostridium sporogenes (n = 100), C. tetani (n = 32), and P. sordellii (formerly Clostridium sordellii, n = 46). We found that three out of the five species (C. perfringens, C. sporogenes, and C. botulinum) showed evidence of taxonomic discontinuity similar to that observed for C. difficile (e.g., a proportion of strains with pairwise ANI Figure 4. Bayesian analysis of species and clade divergence. BactDating and BEAST estimates of the age of major C. difficile clades. Node dating ranges for both Bayesian approaches are transposed onto an maximum-likelihood phylogeny built from concatenated multi-locus sequence type (MLST) alleles of a dozen sequence types (STs) from each clade. Archetypal STs in each evolutionary clade are indicated. The tree is midpoint rooted, and bootstrap values are shown (all bootstrapping values of the cryptic clade branches are 100%). Scale bar indicates the number of substitutions per site. BactDating estimates the median time of the most recent common ancestor of C1-5 at 3.89 million years ago (mya) (95% credible interval [CI], 1.11-6.71 mya). Of the cryptic clades, C-II shared the most recent common ancestor with C1-5 (13.05 mya, 95% CI 3.72-22.44 mya), followed by C-I (22.02 mya, 95% CI 6.28-37.83 mya) and C-III (47.61 mya, 95% CI 13.58-81.73 mya). Comparative temporal estimates from BEAST show the same order of magnitude and support the same branching order (clades C1-5 [12.01 mya, 95% CI 6.80-33. below the 96% demarcation threshold). This was most notable for C. sporogenes and C. botulinum, where there were many sequenced strains with a pairwise ANI below 90% (8% and 31% of genomes, respectively, Supplementary file 1i).
Evolutionary and ecological insights from the C. difficile species pangenome Next, we sought to quantify the C. difficile species pangenome and identify genetic loci that are significantly associated with the taxonomically divergent clades. With Panaroo, the C. difficile species pangenome comprised 17,470 genes, encompassing an accessory genome of 15,238 genes and a core genome of 2232 genes, just 12.8% of the total gene repertoire ( Figure 6). The size of the pangenome reduced by 2082 genes with the exclusion of clades CI-III, and a further 519 genes with the exclusion of C5. Compared to Panaroo, Roary overestimated the size of the pangenome (32,802 genes, 87.7% overestimation), resulting in markedly different estimates of the percentage core genome, 3.9% and 12.8%, respectively (c 2 = 1395.3, df = 1, p<0.00001). The overestimation of pangenome was less pronounced when the identity threshold was decreased to 90% (42.0% overestimation) and the paralogs were merged (28.7% overestimation). Panaroo can account for errors introduced during assembly and annotation, thus polishing the 260 Prokka-annotated genomes with Panaroo resulting in a significant reduction in gene content per genome (median 2.48%; 92 genes, range 1.24-12.40%; 82-107 genes, p<0.00001). The C. difficile species pangenome was determined to be open ; Figure 6). Pangenome-Wide Association Study (Pan-GWAS) analysis with Scoary revealed 142 genes with significant clade specificity. Based on KEGG orthology, these genes were classified into four functional categories: environmental information processing, genetic information processing, metabolism, and signalling and cellular processes. We identified several uniquely present, absent, or organised gene clusters associated with ethanolamine catabolism (C-III), heavy metal uptake (C-III), polyamine biosynthesis (C-III), fructosamine utilisation (C-I, C-III), zinc transport (C-II, C5), and folate metabolism (C-I, C5). A summary of the composition and function of these major lineage-specific Figure 6. Clostridioides difficile species pangenome. (A) Pan and core genome estimates for all 260 sequence types (STs), clades C1-4 (n = 242 STs) and clades C1-5 (n = 225 STs). (B) The difference in % core genome and pangenome sizes with Panaroo and Roary algorithms. * indicates c 2 p<0.00001 and ** indicates c 2 p=0.0008. (C) The proportion of retained genes per genome after polishing Prokka-annotated genomes with Panaroo. (D) The total number of genes in the pan (grey) and core (black) genomes is plotted as a function of the number of genomes sequentially added (n = 260). Following the definition of Tettelin et al., 2005., the C. difficile species pangenome showed characteristics of an 'open' pangenome. First, the pangenome increased in size exponentially with sampling of new genomes. At n = 260, the pangenome exceeded more than double the average number of genes found in a single C. difficile genome (~3700) and the curve was yet to reach a plateau or exponentially decay, indicating more sequenced strains are needed to capture the complete species gene repertoire. Second, the number of new 'strain-specific' genes did not converge to zero upon sequencing of additional strains, at n = 260, an average of 27 new genes were contributed to the gene pool. Finally, according to Heap's law, a values of 1 are representative of open pangenome. Rarefaction analysis of our pangenome curve using a power-law regression model based on Heap's law  showed the pangenome was predicted to be open (Bpan [ » a  = 0.47], curve fit, r 2 = 0.999). (E) Presenceabsence variation (PAV) matrix for 260 C. difficile genomes is shown alongside a maximum-likelihood phylogeny built from a recombination-adjusted alignment of core genes from Panaroo (2232 genes, 2,606,142 sites). Unique to C-III and is in addition to the highly conserved eut cluster found in all lineages. Has a unique composition and includes six additional genes that are not present in the traditional CD630 eut operon or any other non-C-III strains.
An alternative process for the breakdown of ethanolamine and its utilisation as a source of reduced nitrogen and carbon. Peptide/nickel transport system ATP-binding protein ddpD Oligopeptide transport system ATP-binding protein oppF Class I SAM-dependent methyltransferase -Heterodisulfide reductase subunit D (EC:1.8.98.1) hdrD Unique to C-III and is in addition to the highly conserved spermidine uptake cluster found in all other lineages.
Alternative spermidine uptake processes that may play a role in stress response to nutrient limitation. The additional cluster has homologs in Romboutsia, Paraclostridium, and Paeniclostridium spp.

CDP-L-myo-inositol myo-inositolphosphotransferase dipps
Spermidine/putrescine transport system substrate-binding protein ABC.SP.S Spermidine/putrescine transport system permease protein ABC.SP.P1 Spermidine/putrescine transport system permease protein ABC.SP.P Spermidine/putrescine transport system ATP-binding protein potA Sigma-54-dependent transcriptional regulator gfrR Present in all lineages except C-I. Cluster found in a different genomic position in C-III.
Mannose-type PTS system essential for utilisation of fructosamines such as fructoselysine and glucoselysine, abundant components of rotting fruit and vegetable matter.
Fructoselysine/glucoselysine PTS system EIIB component gfrB In E. coli, AbgAB proteins enable uptake and cleavage of the folate catabolite p-aminobenzoyl-glutamate, allowing the bacterium to survive on exogenous sources of folic acid.

Aminobenzoyl-glutamate utilisation protein B abgB
MarR family transcriptional regulator -gene clusters is given in Table 2, and a comparative analysis of their respective genetic architecture can be found in Supplementary file 1l.

Discussion
Through phylogenomic analysis of the largest and most diverse collection of C. difficile genomes to date, we identified major incoherence in C. difficile taxonomy, provide the first WGS-based phylogeny for the Peptostreptococcaceae, and provide new insight into intra-species diversity and evolution of pathogenicity in this major One Health pathogen.
Our analysis found high nucleotide identity (ANI >97%) between C. difficile clades C1-4, indicating that strains from these four clades (comprising 560 known STs) belong to the same species. On the other hand, ANI between C5 and C1-4 is on the borderline of the accepted species threshold (95.9-96.2%). This degree of speciation likely reflects the unique ecology of C5 -a lineage comprising 33 known STs, which is well established in non-human animal reservoirs worldwide and associated with CDI in the community setting . Conversely, we identified major taxonomic incoherence among the three cryptic clades and C1-5, evident by ANI values (compared to ST3, C1) far below the species threshold (~91%, C-I;~94%, C-II; and~89%, C-III). Similar ANI value  Figure 7 continued on next page differences were seen between the cryptic clades themselves, indicating that they are as divergent from each other as they are individually from C1-5. This extraordinary level of discontinuity is substantiated by our core genome and Bayesian analyses. Our study estimated the most recent common ancestor of C. difficile clades C1-4 and C1-5 existed between 0.46 to 2.77 mya and between 1.11 to 6.71 mya, respectively, whereas the common ancestors of clades C-I, C-II, and C-III were estimated to have existed at least 1.5-75 million years before the common ancestor of C1-5. For context, divergence dates for other notable pathogens range from 10 million years (Ma) (Campylobacter coli and C. jejuni) (Sheppard and Maiden, 2015), 47 Ma (Burkholderia pseudomallei and B. thailandensis) (Yu et al., 2006), and 120 Ma (Escherichia coli and Salmonella enterica) (Ochman et al., 1999). Corresponding whole-genome ANI values for these species are 86, 94, and 82%, respectively (Supplementary file 1j).
Although BEAST provided wider confidence intervals (and therefore less certainty compared to BactDating), it estimates the time of divergence for all clades within the same order of magnitude and, importantly, provides robust support for the same branching order of clades with clade C-III the most ancestral of lineages, followed by the emergence of C-I, C-II, and C5. After this point, there appears to have been rapid population expansion into the four closely related clades described today, which include many of the most prevalent strains causing healthcare-associated CDI worldwide (Knight et al., 2015). We acknowledge that the dating of ancient taxa is often imprecise and that using a strict clock model for such a diverse set of taxa leads to considerable uncertainty in divergence estimates. However, we tried to mitigate this as much as possible by using two independent tools and evaluated multiple molecular clock estimates (covering almost an order of magnitude), ultimately using the same fixed clock model as Kumar et al., 2019 (2.5 Â 10 À9 -10.5 Â 10 À8 ). The branching order of the clades is robust, supported by comprehensive and independent comparative genomic and phylogenomic analyses. Notwithstanding this finding, if variations in the molecular clock happen over time and across lineages, which is likely the case for such a genetically diverse spore-forming pathogen, then the true age ranges for C. difficile clade emergence are likely far greater (and therefore less certain) than we report here.
Comparative ANI analysis of the cryptic clades with >5000 reference genomes across 21 phyla failed to provide a better match than C. difficile (89-94% ANI). Similarly, our revised ANI-based taxonomy of the Peptostreptococcaceae placed clades C-I, C-II, and C-III between C. difficile and C. mangenotii. Our analyses of the Clostridioides spp. highlights the major discordance between WGS data and 16S rRNA data, which has historically been used to classify bacterial species. In 2016, Lawson et al., 2016 used 16S rRNA data to categorise C. difficile and C. mangenotii as the sole members of the Clostridioides. These species have 94.7% similarity in 16S rRNA sequence identity, yet our findings indicate that C. mangenotii and C. difficile share 77% ANI and should not be considered within the same genus. The rate of 16S rRNA divergence in bacteria is estimated to be 1-2% per 50 Ma (Ochman et al., 1999). Contradicting our ANI and core genome data, 16S rRNA sequences were highly conserved across all eight clades. This indicates that in C. difficile 16S rRNA gene similarity correlates poorly with measures of genomic, phenotypic, and ecological diversity, as reported in other taxa such as Streptomyces, Bacillus, and Enterobacteriaceae (Janda and Abbott, 2007;Chevrette et al., 2019). Another interesting observation is that C5 and the three cryptic clades had a high proportion (>90%) of MLST alleles that were absent in other clades (Supplementary file 1e), suggesting minimal exchange of essential housekeeping genes between these clades. Whether this reflects divergence or convergence of two species, as seen in Campylobacter (Sheppard et al., 2008), is unknown. Taken together, these data strongly support the reclassification of C. difficile clades C-I, C-II, and C-III as novel independent Clostridioides genomospecies. There have been similar genome-based reclassifications in Bacillus (Liu et al., 2018), Fusobacterium (Kook et al., 2017), and Burkholderia (Loveridge et al., 2017). Also, a recent Consensus Statement (Murray et al., 2020) argues that the genomics and big data era necessitate easing of nomenclature rules to accommodate genome-based assignment of species status to nonculturable bacteria and those without 'type material', as is the case with these Clostridioides genomospecies.
We also found that the significant taxonomic incoherence observed in C. difficile was also evident in other medically important clostridia, supporting calls for taxonomic revisions (Lawson et al., 2016;Oren and Rupnik, 2018). The entire published collections of C. perfringens, C. sporogenes, and C. botulinum all contained sequenced strains with pairwise ANI below the 96% demarcation threshold, with 8% of C. sporogenes and 31% of C. botulinum sequenced strains below 90% ANI. These findings highlight a significant problem with the current classification of the clostridia and further demonstrate that high-resolution approaches such as whole-genome ANI can be a powerful tool for the re-classification of these bacteria (Lawson et al., 2016;Oren and Rupnik, 2018;Murray et al., 2020).
The NCBI SRA was dominated by C1 and C2 strains, both in number and diversity. This apparent bias reflects the research community's efforts to sequence the most prominent strains causing CDI in regions with the highest burden, for example, ST1 from humans in Europe and North America. As such, there is a paucity of sequenced strains from diverse environmental sources, animal reservoirs, or regions associated with atypical phenotypes. Cultivation bias -a historical tendency to culture, preserve, and ultimately sequence isolates that are concordant with expected phenotypic criteriacomes at the expense of 'outliers' or intermediate phenotypes. Members of the cryptic clades fit this criterion. They were first identified in 2012 but have been overlooked due to atypical toxin architecture, which may compromise diagnostic assays (discussed below). Our updated MLST phylogeny shows as many as 55 STs across the three cryptic clades (C-I, n = 25; C-II, n = 9; C-III, n = 21) (Figure 2). There remains a further dozen 'outliers' that could either fit within these new taxa or be the first typed representative of additional genomospecies. The growing popularity of metagenomic sequencing of animal and environmental microbiomes will certainly identify further diversity within these taxa, including nonculturable strains (Stewart et al., 2018;Lu et al., 2015).
By analysing 260 STs across eight clades, we provide the most comprehensive pangenome analysis of C. difficile to date. Importantly, we also show that the choice of algorithm significantly affects pangenome estimation. The C. difficile pangenome was determined to be open (i.e. an unlimited gene repertoire) and vast in scale (over 17,000 genes), much larger than previous estimates (~10,000 genes), which mainly considered individual clonal lineages Knight et al., 2016). Conversely, comprising just 12.8% of its genetic repertoire (2232 genes), the core genome of C. difficile is remarkably small, consistent with earlier WGS and microarray-based studies describing ultralow genome conservation in C. difficile (Knight et al., 2015;Scaria et al., 2010). Considering only C1-5, the pangenome reduced in size by 12% (2082 genes); another 519 genes were lost when considering only C1-4. These findings are consistent with our taxonomic data, suggesting that the cryptic clades, and to a lesser extent C5, contribute a significant proportion of evolutionarily divergent and unique loci to the gene pool. A large open pangenome and small core genome are synonymous with a sympatric lifestyle, characterised by cohabitation with, and extensive gene transfer between, diverse communities of prokarya and archaea . Indeed, C. difficile shows a highly mosaic genome comprising many phages, plasmids, and integrative and conjugative elements (Knight et al., 2015), and has adapted to survival in multiple niches including the mammalian gastrointestinal tract, water, soil and compost, and invertebrates .
Through a robust Pan-GWAS approach, we identified loci that are enriched or unique in the genomospecies. C-I strains were associated with the presence of transporter AbgB and absence of a mannose-type phosphotransferase (PTS) system. In E. coli, AbgAB proteins allow it to survive on exogenous sources of folate (Carter et al., 2007). In many enteric species, the mannose-type PTS system is essential for catabolism of fructosamines such as glucoselysine and fructoselysine, abundant components of rotting fruit and vegetable matter (Miller et al., 2015). C-II strains contained Zn transporter loci znuA and yeiR, in addition to Zn transporter ZupT, which is highly conserved across all eight C. difficile clades. S. enterica and E. coli harbour both znuA/yeiR and ZupT loci, enabling survival in Zn-depleted environments (Sabri et al., 2009). C-III strains were associated with major gene clusters encoding systems for ethanolamine catabolism, heavy metal transport, and spermidine uptake. The C-III eut gene cluster encoded six additional kinases, transporters, and transcription regulators absent from the highly conserved eut operon found in other clades. Ethanolamine is a valuable source of carbon and/or nitrogen for many bacteria, and eut gene mutations (in C1/C2) impact toxin production in vivo (Nawrocki et al., 2018). The C-III metal transport gene cluster encoded a chelator of heavy metal ions and a multi-component transport system with specificity for iron, nickel, and glutathione. The conserved spermidine operon found in all C. difficile clades is thought to play an important role in various stress responses including during iron limitation (Berges et al., 2018). The additional, divergent spermidine transporters found in C-III were similar to regions in closely related genera Romboutsia and Paeniclostridium (data not shown). Together, these data provide preliminary insights into the biology and ecology of the genomospecies. Most differential loci identified were responsible for extra or alternate metabolic processes, some not previously reported in C. difficile. It is therefore tempting to speculate that the evolution of alternate biosynthesis pathways in these species reflects distinct ancestries and metabolic responses to evolving within markedly different ecological niches.
This work demonstrates the presence of toxin genes on PaLoc and CdtLoc structures in all three genomospecies, confirming their clinical relevance. Monotoxin PaLocs were characterised by the presence of tcdR, tcdB, and tcdE, the absence of tcdA and tcdC, and flanking by transposases and recombinases which mediate LGT (Ramirez-Vargas and Rodriguez, 2020;Ramírez-Vargas et al., 2018;Monot et al., 2015). These findings support the notion that the classical bi-toxin PaLoc common to clades C1-5 was derived by multiple independent acquisitions and stable fusion of monotoxin PaLocs from ancestral clostridia (Monot et al., 2015). Moreover, the presence of syntenic PaLoc and CdtLoc (in ST369, C-I), the latter featuring two copies of cdtA and cdtR, and a recombinase (xerC), further supports this PaLoc fusion hypothesis (Monot et al., 2015).
Bacteriophage holin and endolysin enzymes coordinate host cell lysis, phage release, and toxin secretion (Fortier, 2018). Monotoxin PaLocs comprising phage-derived holin (tcdE) and endolysin (cwlH) genes were first described in C-I strains (Monot et al., 2015). We have expanded this previous knowledge by demonstrating that syntenic tcdE and cwlH are present within monotoxin PaLocs across all three genomospecies. Moreover, since some strains contained cwlH but lacked toxin genes, this gene seems to be implicated in toxin acquisition. These data, along with the detection of a complete and functional (Riedel et al., 2017) CdtLoc contained within FSemix9P1 in ST343 (C-III), further substantiate the role of phages in the evolution of toxin loci in C. difficile and related clostridia (Fortier, 2018).
The CdtR and TcdR sequences of the new genomospecies are unique, and further work is needed to determine if these regulators display different mechanisms or efficiencies of toxin expression (Chandrasekaran and Lacy, 2017). The presence of dual copies of CdtR in ST369 (C-I) is intriguing as analogous duplications in PaLoc regulators have not been documented. One of these CdtR had a mutation at a key phosphorylation site (Asp61!Asn61) and possibly shows either reduced wild-type activity or non-functionality, as seen in ST11 (Bilverstone et al., 2019). This might explain the presence of a second CdtR copy.
TcdB alone can induce host innate immune and inflammatory responses leading to intestinal and systemic organ damage (Carter et al., 2015). Our phylogenetic analysis shows that TcdB sequences from the three genomospecies are related to TcdB in C2 members, specifically ST1 and ST41, both virulent lineages associated with international CDI outbreaks (He et al., 2013;Eyre et al., 2015), and causing classical or variant (C. sordellii-like) cytopathic effects, respectively (Lanis et al., 2010). It would be relevant to explore whether the divergent PaLoc and CdtLoc regions confer differences in biological activity, as these may present challenges for the development of effective broad-spectrum diagnostic assays, and vaccines. We have previously demonstrated that common laboratory diagnostic assays may be challenged by changes in the PaLoc of C-I strains (Ramírez-Vargas et al., 2018). The same might be true for monoclonal antibody-based treatments for CDI such as bezlotoxumab, known to have distinct neutralising activities against different TcdB subtypes (Shen et al., 2020).
Our findings highlight major incongruence in C. difficile taxonomy, identify differential patterns of diversity among major clades, and advance understanding of the evolution of the PaLoc and CdtLoc. While our analysis is limited solely to the genomic differences between C. difficile clades, our data provide a robust genetic foundation for future studies to focus on the phenotypic, ecological, and epidemiological features of these interesting groups of strains, including defining the biological consequences of clade-specific genes and pathogenic differences in vitro and in vivo. Our findings reinforce that the epidemiology of this important One Health pathogen is not fully understood. Enhanced surveillance of CDI and WGS of new and emerging strains to better inform the design of

Estimates of clade and species divergence
BactDating v1.0.1 (Didelot et al., 2018) was applied to the recombination-corrected phylogeny produced by Gubbins (471,708 core-genome sites) with Markov chain Monte Carlo (MCMC) chains of 10 7 iterations sampled every 10 4 iterations with a 50% burn-in. A strict clock model was used with a rate of 2.5 Â 10 À9 to 1.5 Â 10 À8 substitutions per site per year, as previously defined by He et al., 2013 andKumar et al., 2019. The effective sample sizes (ESS) were >200 for all estimated parameters, and traces were inspected manually to ensure convergence. To provide an independent estimate from BactDating, BEAST v1.10.4 (Drummond and Rambaut, 2007) was run on a recombination-filtered gap-free alignment of 10,466 sites with MCMC chains of 5 Â 10 8 iterations, with a 9 Â 10 À7 burn-in, which were sampled every 10 4 iterations. The strict clock model described above was used in combination with the discrete GTR gamma model of heterogeneity among sites and skyline population model. MCMC convergence was verified with Tracer v1.7.1, and ESS for all estimated parameters were >150. For ease of comparison, clade dating from both approaches was transposed onto a single MLST phylogeny. Tree files are available asSupplementary file 3 and 4 at http://doi.org/10.6084/m9.figshare.12471461.
Scoary (Brynildsrud et al., 2016) v1.6.16 was used to identify genetic loci that were statistically associated with each clade via a pan-GWAS. The Panaroo-derived pangenome (n = 17,470) was used as input for Scoary with the evolutionary clade of each genome depicted as a discrete binary trait. Scoary was run with 1000 permutation replicates, and genes were reported as significantly associated with a trait if they attained p-values (empirical, naïve, and Benjamini-Hochberg-corrected) of 0.05, a sensitivity and specificity of >99% and 97.5%, respectively, and were not annotated as 'hypothetical proteins'. All significantly associated genes were reannotated using Prokka and BlastP, and functional classification (KEGG orthology) was performed using the Koala suite of web-based annotation tools (Kanehisa et al., 2016).

Comparative analysis of toxin gene architecture
The 260 ST genome dataset was screened for the presence of tcdA, tcdB, cdtA, and cdtB using the Virulence Factors Database (VFDB) compiled within ABRicate v1.0 (Seemann, 2020). Results were corroborated by screening raw reads against the VFDB using SRST2 v0.1.8 (Inouye et al., 2014). Both approaches employed minimum coverage and identity thresholds of 90 and 75%, respectively. Comparative analysis of PaLoc and CdtLoc architecture was performed by mapping of reads with Bowtie2 v.2.4.1 to cognate regions in reference strain R20291 (ST1, FN545816). All PaLoc and CdtLoc loci investigated showed sufficient coverage for accurate annotation and structural inference. Genome comparisons were visualised using ACT and figures prepared with Easyfig (Ramírez-Vargas et al., 2018). MUSCLE-aligned TcdB sequences were visualised in Geneious v2020.1.2 and used to create trees in iToL v4.

Statistical analyses
All statistical analyses were performed using SPSS v26.0 (IBM, NY). For pangenome analyses, a chisquared test with Yate's correction was used to compare the proportion of core genes and a onetailed Mann-Whitney U test was used to demonstrate the reduction of gene content per genome, with a p-value 0.05 considered statistically significant.   (Table 2). [6] Comparative genomic analysis of virulence gene architecture (Figure 7). We retrieved the entire collection of C. difficile genomes (taxid ID 1496) held at the NCBI SRA [https://www.ncbi.nlm.nih.gov/sra/]. The raw dataset (as of 1st January 2020) comprised 12,621 genomes. These genomes comprise hundreds, maybe thousands of publications. The individual accession numbers for all genomes analysed in this study are provided in the Supplementary Data at http://doi.org/10.6084/m9.figshare.12471461.

Author ORCIDs
The following dataset was generated: