Global genomic population structure of Clostridioides difficile

Clostridioides difficile is the primary infectious cause of antibiotic-associated diarrhea. Local transmissions and international outbreaks of this pathogen have been previously elucidated by bacterial whole-genome sequencing, but comparative genomic analyses at the global scale were hampered by the lack of specific bioinformatic tools. Here we introduce EnteroBase, a publicly accessible database (http://enterobase.warwick.ac.uk) that automatically retrieves and assembles C. difficile short-reads from the public domain, and calls alleles for core-genome multilocus sequence typing (cgMLST). We demonstrate that the identification of highly related genomes is 89% consistent between cgMLST and single-nucleotide polymorphisms. EnteroBase currently contains 13,515 quality-controlled genomes which have been assigned to hierarchical sets of single-linkage clusters by cgMLST distances. Hierarchical clustering can be used to identify populations of C. difficile at all epidemiological levels, from recent transmission chains through to pandemic and endemic strains, and is largely compatible with prior ribotyping. Hierarchical clustering thus enables comparisons to earlier surveillance data and will facilitate communication among researchers, clinicians and public-health officials who are combatting disease caused by C. difficile.


Introduction
The anaerobic gut bacterium Clostridioides difficile (formerly Clostridium difficile) 1 is 70 the primary cause of antibiotic-associated diarrhea in Europe and North America 2 .
Molecular genotyping of C. difficile isolates has demonstrated international dissemination of diverse strains through healthcare systems [3][4][5] , the community 6 , and livestock production facilities 7,8 . Previously, genotyping was commonly performed by PCR ribotyping or DNA macrorestriction. More recent publications have documented 75 that genome-wide single-nucleotide polymorphisms (SNPs) from whole-genome sequences provide improved discrimination, and such analyses have enabled dramatic progress in our understanding of the emergence and spread of pandemic strains [9][10][11][12] and the epidemiology of local transmission 13,14 . Eyre and colleagues have argued that transmission of C. difficile isolates within a hospital environment can be 80 recognised with high probability as chains of genomes which differ by up to two SNPs whereas genomes which differ by at least 10 genomic SNPs represent unrelated bacteria 13,15 . However, SNP analyses require sophisticated bioinformatic tools and are difficult to standardize 16,17 . A convenient alternative to SNP-based genotyping is offered by the commercial software SeqSphere, which implements a 85 core genome multilocus sequence typing scheme (cgMLST) for the analysis of genomic diversity in C. difficile 18 and other organisms. Indeed, cgMLST 18 confirmed the prior conclusion from genomic SNP analyses 19 that a common clone of C. difficile had been isolated over two successive years at a hospital in China. However, we are not aware of a quantitative comparison of the sensitivity of both methods. 90 cgMLST of genomic sequences of a variety of bacterial pathogens can also be performed with EnteroBase (http://enterobase.warwick.ac.uk/), which has been developed over the last few years with the goal of facilitating genomic analyses by microbiologists 20 . EnteroBase automatically retrieves Illumina short-read sequences 95 from public short-read archives. It uses a consistent assembly pipeline to automatically assemble these short-reads into draft genomes consisting of multiple contigs, and presents the assembled genomes together with their metadata for public access 21 . It also performs the same service on sequencing data uploaded by its Here we show that EnteroBase cgMLST and SNP analyses attain comparable levels of resolution. We also summarise the genomic diversity that accumulated during recurring infections within single patients as well as transmission chains within individual hospitals and between neighbouring hospitals in Germany, and show that it can be detected by HierCC. We also demonstrate that HierCC can be used to 120 identify bacterial populations at various epidemiological levels ranging from recent transmission chains through to pandemic and endemic spread, and relate these HierCC clusters to genotypes that were previously identified by PCR ribotyping and 7-gene MLST. These observations indicate that cgMLST and HierCC within EnteroBase can provide a common language for communications and interactions by 125 the global community who is combatting disease caused by C. difficile.

Implementation of MLST schemes in EnteroBase. cgMLST in EnteroBase
consists of a defined subset of genes within a whole-genome MLST scheme that represents all single copy orthologs within the pan-genome of a representative set of bacterial isolates. To this end, we assembled the draft genomes of 5,232 isolates of C. difficile from public short-read archives, and assigned them to ribosomal sequence 135 types (rSTs) according to rMLST, which indexes diversity at 53 loci encoding ribosomal protein subunits on the basis of existing exemplar alleles at PubMLST 23 .
We then created a reference set of 442 genomes consisting of one genome of C. mangenotii 1 , 18 complete genomes from GenBank, 81 closed genomes from our work, and the draft genome with the smallest number of contigs from each of the 343 140 rSTs (https://tinyurl.com/Cdiff-ref). The Clostridioides pan-genome was calculated as described 21 , and used to define a wgMLST scheme consisting of 13,763 genetic loci (http://enterobase.warwick.ac.uk/species/clostridium/download_data). EnteroBase uses the wgMLST scheme to call loci and alleles from each assembly, and extracts the allelic assignments for the subsets corresponding to cgMLST, rMLST and 7-gene 145 MLST from those allelic calls. The cgMLST subset consists of 2,556 core genes which were present in ≥ 98% of the reference set, intact in ≥ 94% and were not excessively divergent (Figure 1).

Suitability of cgMLST for epidemiological analyses of transmission chains. 150
Multiple bacterial strains from local outbreaks of C. difficile disease in hospitals in Oxfordshire and China differ by very few SNPs 13,15,19 . The published data indicates a 95% probability that the genomes of pairs of strains from a transmission chain differ by two non-recombinant SNPs, or less 13 . Bletz et al. 18 have concluded that sets of strains from transmission chains can also be identified because they differ by ≤ 6 155 cgMLST allelic differences in pairwise comparisons. To test this conclusion, we performed a quantitative comparison between the numbers of cgMLST allelic differences and the numbers of non-recombinant SNPs in isolates from multiple epidemiological chains ( Figure 2). We calculated the intra-patient pairwise genetic distances by both measures between 176 isolates from four patients with recurring 160 CDI (C. difficile infection). We also analysed the intra-outbreak distances with 63 isolates from four transmission chains in multiple hospitals. We also re-examined the pairwise distances from the comprehensive sample of 1,158 isolates collected over several years in four hospitals in Oxfordshire, UK 13 . All three analyses yielded a strong linear relationship (R 2 , 0.71-0.93) between the numbers of different cgMLST 165 alleles and the numbers of non-recombinant SNPs ( Figure 2). The slope of the regression lines was close to 1.0, indicating a 1:1 increase in cgMLST allelic differences with numbers of SNPs. The same data were also investigated with cgMLST calculated with the commercial program SeqSphere 18 , and yielded similar results except that the slope was lower due to lower discriminatory power of the 170 SeqSphere cgMLST scheme (lower panels in Figure 2).
Eyre et al. 13 concluded that it is possible to recognize genomes that reflect direct transmission between two hospital patients because they tend to differ by two SNPs or less. Our analylsis of the Oxfordshire dataset indicated that they would also have 175 been recognized by cgMLST in EnteroBase because genomes which differed by two cgMLST alleles usually also differed by ≤ 2 SNPs according to a binary logistic regression model (p = 89%; 95% confidence interval, 88-89%) ( Figure 3). We also compared the genetic distances between 242 genomes from Oxfordshire which had been isolated during the initial six month trial and 916 genomes during the 180 subsequent three years (April 2008 to March 2011) 13 . Based on cgMLST, 35% of the latter genomes (318/916) also differed at ≤ 2 core-genome alleles from any earlier genome in that study, and 34% (316/916) differed by ≤ 2 SNPs. The genomes identified by both methods were also largely concordant (89% matching entries).
Thus, our results confirm that cgMLST can be used for epidemiological investigations 185 with similar results to SNP analysis. Local and regional spread. SNP analyses are computer intensive and are only feasible with limited numbers of genomes 24 . cgMLST can be used to analyse up to 100,000 genomes with GrapeTree, but once again analyses involving more than 190 10,000 genomes are very computer intensive 22 . EnteroBase has therefore implemented single-linkage hierarchical clustering (HierCC) of cgMLST data at multiple levels of relationship after excluding missing data in pairwise comparisons 21 (hierarchical clusters HC0: clusters of indistinguishable core-genome Sequence Types (cgSTs); HC2: clusters with pairwise distances of up to two cgMLST alleles; 195 HC5; HC10; HC20; HC50; … HC2500). Here we address the nature of the genetic relationships that are associated with multiple levels of HierCC clustering within 13,515 publicly available C. difficile genomes, and examine which levels of pairwise allelic distances correspond to epidemic outbreaks and to endemic populations, respectively. 200 As described above, multiple isolates from individual patients fell into patient-specific HC2 clusters, even when sampled from multiple episodes of recurrent CDI that had occurred 80 to 153 days apart (Suppl . Table S1) (patients D, F, and G in Figure 4). In these cases, relapsing disease likely reflected continued colonization after initially 205 successful therapy. However, multiple HC2 clusters were also identified. Some isolates from patient F differed by 12-21 cgMLST allelic differences from the bulk population ( Figure 4). Such strain diversity likely reflects co-infections with multiple related strains. In patient E, the two CDI episodes differed by >2,000 allelic differences (Figure 4), indicating an independent reinfection with an unrelated strain. 210 Hence, discrimination between relapse and reinfection based on cgMLST appears to be straightforward, although relapse might be simulated on occasion by reinfection with identical strains because the environment around CDI patients tends to be contaminated with C. difficile spores 25 . We note that time intervals between the relapse episodes investigated here had been between 16 and 22 weeks, beyond the CDI relapses 26 . Additional similar observations would lend support to extending this definition to a longer timeframe 27 .
Although isolates from a single HC2 cluster are likely to represent an outbreak, it is 220 readily conceivable that multiple HC2 clusters might be isolated from a single epidemiological outbreak due to accumulation of genetic diversity over time or incomplete sampling which resulted in the absence of missing links. Indeed, the majority of isolates from densely sampled, local outbreaks were related at the HC2 level, but some outbreaks investigated here consisted of more than one HC2 cluster 225 ( Figure 5). For example, nine isolates from a recently reported outbreak of ribotype 018 in Germany 28 formed four related HC2 clusters, and outbreaks with ribotypes 027 and 106 in a hospital in Spain 14 were each affiliated with two or three HC2 clusters ( Figure 5). Incomplete sampling of outbreaks may be common because asymptomatic patients may constitute an important reservoir for transmission but are 230 only rarely examined for colonization with C. difficile 29-31 .
We identified 23 HC2 clusters (encompassing 133 genome sequences) in a dataset of 309 C. difficile genome sequences collected from CDI patients in six neighboring hospitals in Germany. These HC2 clusters were associated with individual hospitals 235 (Χ 2 , p=8.6x10 -5 ; Shannon entropy, p=4.2x10 -5 ) and even with single wards in these hospitals (Χ 2 , p=0.01; Shannon entropy, p=6.2x10 -3 ). We investigated whether these HC2 clusters reflected the local spread of C. difficile within institutions by retrospective analyses of patient location data. Sixty six cases (50%) revealed ward contacts with another patient with the same HC2 cluster (median time interval 240 between ward occupancy: 63 days; range, 0 to 521 days). These results are consistent with the direct transmission for C. difficile isolates of the same HC2 cluster ( Figure 6). In those cases where the shared ward contacts were separated in time, transmission may have occurred indirectly or from a common reservoir, possibly through environmental spore contamination or through asymptomatically colonized 245 patients 14,29,30 . We also detected 15 HC2 clusters that included isolates from two or more hospitals in the region. Subsequent analyses of patient location data confirmed that some of these HC2 clusters were directly associated with patient transferrals between the hospitals ( Figure 6). Hence, hierarchical clustering of C. difficile genome sequences in conjunction with retrospective analysis of patient movements revealed 250 multiple likely nosocomial transmission events, none of which had been detected previously based on routine surveillance.

Pandemic strains and endemic populations.
Pandemic spread of C. difficile over up to 25 years has been inferred previously on the basis of molecular epidemiology 255 with other, lower resolution techniques 32 . The majority of isolates in EnteroBase that are known to representing such pandemic strains were related at the level of HC10, including epidemic ribotype 017 (cluster HC10|17) 11 , each of two previously reported fluoroquinolone-resistant lineages of ribotype 027 (HC10|4, HC10|9) 9 , or livestockassociated ribotype 078/126 (HC10|1) 33 (Suppl. Figure 1). 260 Endemic populations have also been described by ribotyping and phylogenetic analyses, some of which have formed a source for the emergence of epidemic or pandemic strains 2,9 . Many such endemic populations may be represented by HC150 clusters. Clustering at HC150 was well supported statistically (Suppl. Figure 2), and 265 the frequency distribution of pairwise genomic distances indicated that multiple database entries clustered at <150 cgMLST allelic differences (Suppl. Figure 3). A cgMLST-based phylogenetic tree of 13,515 C. difficile genomes showed 201 well separated HC150 clusters, each encompassing a set of closely related isolates, plus 209 singletons (Figure 7). Because these HC150 clusters are based on cgMLST 270 genetic distances, we will refer to them as 'cgST complexes', abbreviated as CCs.
Genomes from each of the major CCs have been collected over many years in multiple countries, indicating their long-term persistence over wide geographic ranges (Table 1).

275
We compared HC150 clustering with PCR ribotyping for 2,263 genomes spanning 84 PCR ribotypes for which PCR ribotyping data were available in EnteroBase. These included 905 genomes which we ribotyped ourselves (Suppl . Table S2) as well as several hundred genomes for which we had manually retrieved ribotype information from published data. The correlation between HC150 clustering and ribotyping was 280 high (adjusted Rand coefficient, 0.92; 95% confidence interval, 0.90-0.93). However, our analysis also revealed that PCR ribotypes did not always correspond to phylogenetically coherent groupings. PCR ribotypes 002, 015 and 018 were each distributed across multiple phylogenetic branches ( Figure 8). Furthermore, multiple ribotypes were assigned to genomes with indistinguishable cgMLST alleles. These 285 included ribotypes 106 and 500, ribotype 014 and 020, and ribotypes 066, 126, 413 and 078, as well as other examples in Figure 8. In contrast, HC150 clusters corresponded to clear-cut phylogenetic groupings within a phylogenetic tree of core genes (Figure 8-HC150).

290
Rarefaction analysis indicated that the currently available genome sequences represent about two-thirds of extant CC diversity (Suppl. Figure 4). It is not clear why so many populations of C. difficile co-exist, but at least some of this diversity may be due to the occupation of distinct ecological niches, as exemplified by differential propensities for colonizing non-human host species (Table 1)  Higher population levels. HierCC can identify clusters at higher taxonomic levels 21 . 300 In C. difficile, HC950 clusters seem to correspond to deep evolutionary branches (Suppl. Figure 5) and HC2000 clusters were congruent with the major clades reported previously 36 , except that cluster HC2000|2 encompassed clade 1 plus clade 2 (Suppl. Figure 6). Finally, HC2500 may correspond to the subspecies level, because it distinguished between C. difficile and distantly related "cryptic clades" 305 (Suppl. Figure 7).

310
In this study, we demonstrate that the application of cgMLST to investigations of local C. difficile epidemiology yields results that are quantitatively equivalent to those from SNP analyses. This is a major advance because SNP analyses require specific bioinformatic skills and infrastructure, are time consuming, and not easily standardized 16 . Even though a cgMLST scheme for C. difficile had been published 315 recently, its ability to identify closely related isolates and the inferred genomic distances had not been compared to SNP analyses in any quantitative way 18 . In EnteroBase, cgMLST is based on de-novo assembly of sequencing reads, which is 1 0 also known to introduce errors associated with the assembler algorithms. However, EnteroBase uses Pilon 37 to polish the assembled scaffolds and evaluate the 320 reliability of consensus bases of the scaffolds, thereby achieving comparable accuracy to mapping-based SNP analyses. When applied to a large dataset of C.
difficile genomes from hospital patients in the Oxfordshire region (UK), cgMLST and SNP analysis were largely consistent (89% match) at discriminating between isolates that were sufficiently closely related to have arisen during transmissions chains from 325 others that were epidemiologically unrelated. Similarly, cgMLST identified numerous HierCC will also enable comparisons to previously published data because we have 365 provided a correspondence table between HC150 clusters and PCR ribotypes. A full understanding of the population structure of C. difficile and its relationship to epidemiological patterns will require additional study because many of the clusters described here have not yet been studied or described. However, this task can be addressed by the global community due to the free public access to such an 370 unprecedented amount of genomic data from this important pathogen.

375
Sampling. 309 C. difficile isolates were collected at a diagnostic laboratory providing clinical microbiology services to several hospitals in central Germany. To assemble a representative sample, we included the first 20 isolates from each of six hospitals from each of three consecutive calendar years (Suppl.   While that original study had encompassed a slightly larger number of sequences, we restricted our analysis to the data (95%) that had passed quality control as implemented in EnteroBase 20 . We used the SNP data from Eyre's report 13 .

425
The hierarchical single-linkage clustering of cgMLST sequence types was carried out as described 21 for all levels of allelic distances between 0 and 2,556. We searched for stable levels of differentiation by HierCC according to the Silhouette index 45  We further evaluated the "stability" of hierarchical clustering using two other criteria.
The Shannon index is a measure of diversity in a given population. The Shannon 435 index drops from nearly 1 in HC0, because most cgSTs are assigned to a unique HC0 cluster, to 0 in HC2500, which assigns all sequence types to one cluster. The gradient of the Shannon index between the two extremes reflects the frequencies of coalescence of multiple clusters at a lower HC level. Thus, the plateaus in the curve correspond to stable hierarchical levels, where the Shannon index does not change 440 dramatically with HC level. We also evaluated the stability of hierarchical clustering by pairwise comparison of the results from different levels based on the normalized mutual Information score 46 (Fig. S3).
To estimate concordance between cgMLST-based hierarchical clustering and PCR 445 ribotyping, we calculated the adjusted Rand coefficient 47             Suppl. Figure 4. Rarefaction analysis to estimate numbers of extant HC clusters.