Homologous Recombination in Clostridioides difficile Mediates Diversification of Cell Surface Features and Transport Systems

Infections with C. difficile result in up to half a million illnesses and tens of thousands of deaths annually in the United States. The severity of C. difficile illness is dependent on both host and bacterial factors.

difficile, and their secretion results in the destruction of intestinal epithelial cells and recruitment of inflammatory mediators and neutrophils (2)(3)(4)(5). However, other factors, such as adhesins, cell surface capsule proteins, hydrolytic enzymes, and flagella, have also been suggested to play an important role in C. difficile colonization, pathogenesis, and virulence (2,6).
Five major phylogenetic clades of C. difficile have been described (7), and upward of 100 PCR-ribotypes circulate and cause disease in U.S. hospitals (8). Some studies have suggested that epidemic ribotypes 027 (clade 2) and 078 (clade 5) cause more severe disease than other C. difficile strains (9,10), while others fail to find an association between ribotype and disease severity (11). Thus, more research is needed to explore pathogen factors important in disease manifestation and heterogeneity in severity.
Bacteria can acquire genetic diversity through vertically inherited mutations, recombination via transformation of free DNA, transduction by phage, and conjugation of plasmids (12). Recombination can either be homologous, in which a genomic region is exchanged for an allelic variant, or nonhomologous, in which accessory genes can be transferred from one organism to another (12). Studying bacterial recombination is important in genomics studies since it can both weaken phylogenetic signals and reveal important selective pressures (13). Genomic hot spots for recombination have been hypothesized to be due to both mechanistic and evolutionary reasons. For example, lower rates of recombination have been observed at the terminus compared to the origin of replication, and conserved core genes often flank highly recombinant genes in Escherichia coli (14). However, recombination events that are disadvantageous to the organism will be lost via negative selection, and the events that remain to be observed are likely to be advantageous or neutral in effect (13).
Homologous recombination has been shown to result in high rates of genetic diversity in adhesion, invasion, and colonization factors, as well as cell membrane/surface structures in various pathogenic bacteria (13,(15)(16)(17). These factors each play a role in the pathogenesis and/or host interactions of these pathogens. Elevated levels of genetic diversity in these features could lead to adaptions in virulence, transmission, or evasion of the host immune system. In C. difficile, the S-layer cassette that encodes surface proteins important for its antigenicity has been shown to undergo frequent recombination (13,18). This study utilizes more than 400 C. difficile whole-genome sequences to unveil the contribution of homologous recombination within and between clades of this pathogen, and to elucidate specific functions associated with higher rates of realized recombination events.

RESULTS AND DISCUSSION
Strain selection. A total of 412 strains were selected that cluster into five distinct clades (Fig. 1a), as described previously (7). Of the 412 strains, 306 isolates are in clade 1, 76 in clade 2, 15 in clade 3, 10 in clade 2, and 5 in clade 5. Clade 5 is the most distantly related from the rest of the phylogeny, with only 97% sequence identity to reference strain 630 (calculated as number of identical nucleotides over the length of the aligned sequence). Since the majority of isolates are in clades 1 and 2 and since excluding clades 3, 4, and 5 increased the size of the core genome (from 1.66 to 1.75 Mbp) and allowed more genes to be studied, gene and pathway analyses were performed on just the 382 isolates in clades 1 and 2. Moreover, besides ribotype 078, which is in clade 5 and is relatively distantly related to clades 1 and 2, the majority of C. difficile isolates that cause human disease in the United States are in clade 1 (includes ribotype 014/020) and clade 2 (includes ribotype 027) (8), so the results of these analyses can be applicable to much of pathogenic C. difficile.
Homologous recombination in five C. difficile clades. There is a great deal of heterogeneity in the contribution of recombination to the evolution across the C. difficile phylogeny, as measured by the proportion of single-nucleotide variants (SNVs) inside compared to outside predicted recombination events (r/m) (Fig. 1b). Most branches have more single nucleotide substitution events than recombination events (see Fig. S1 in the supplemental material); however, each recombination event can lead to the transfer of multiple SNVs. Throughout the phylogeny, 72% of SNVs fall within predicted recombination events, even though there are about 15 times as many inferred nucleotide substitution events as there are recombination events. In addition to variance on branches of the tree, there is also as a significant amount of heterogeneity in r/m values between the clades (Kruskal-Wallis x 2 df=4 P , 0.001; Fig. 1b), suggesting a potential difference in evolutionary mechanisms between the clades. Clades 2 and 4 have the highest median r/m values, while clades 1 and 5 have the lowest. It is important to note the variance in sample sizes between the clades, which may be biasing these interclade comparisons.
The median length of recombination events in all C. difficile clades is ;6.7 kb, and the maximum length of a predicted recombination event was .100 kb in clade 1. Overall, most recombination events are ,10 kb, with an exponential decay in events with increasing length (Fig. 1c). This distribution is observed in all C. difficile clades, though there are more recombination events observed overall in clades 1 and 2 than in clades 3, 4, and 5, likely because the genome sample size is greatest in clades 1 and 2.
Areas of enriched homologous recombination. Twelve percent (486/3,980) of C. difficile genes analyzed were significantly enriched for homologous recombination (P , 0.05) according to a permutation test (see Table S2 in the supplemental material), and 40 of these genes were enriched for recombination with a fold change of $3 ( Table 1). The majority of these highly enriched ($3-fold) genes are involved in flagellar activity. Overall, the set of 486 significant genes were significantly overrepresented in four KEGG pathways, and one COG term involved in flagella, sugar transport and metabolism, and the cell envelope and outer membrane ( Table 2). The Slayer cassette, which has previously been shown to be highly recombinant in C. difficile (13,18), also appears in a high-recombination zone (Fig. 2), with many of the genes involved in S-layer structure significantly enriched for recombination (see Table S2 and Fig. S2 in the supplemental material). There is no KEGG pathway specific to S-layer proteins, but the COG ontology "cell envelope biogenesis, outer membrane," whose genes are enriched for homologous recombination (Table 2), does include some S-layer proteins.
The genomic region from approximately 2.6 to 0.7 Mb appears to be enriched for recombination compared to the other half of the genome (Fig. 2). This could suggest the mechanistic feasibility of recombination in this area and/or grouping of genes where genetic diversity is more evolutionarily advantageous. Low recombination at the terminus of the genome is consistent with findings in Escherichia coli (14).
Flagella. There have been reports of large-scale structural variations of flagella genes in C. difficile, but the contribution of homologous recombination is not as well described. It is clear that unique flagellar variant patterns are present in distantly related isolates (Fig. 3). In particular, within each clade there appears to be a set of distinct flagellar genotypes distributed sporadically throughout the phylogeny, suggesting that large regions of the flagella operons may have been swapped in homologous recombination events. Further, there appears to be a distinct recombination pattern in the two flagellar regulatory units present in the clades 1 and 2 core genome, with there being more F3 genotypes than F1 genotypes present in these isolates. The F3 and F1 operons are responsible for early-stage and late-stage flagellar genes, respectively (19). F1 genes, such as fliC and fliD, code for the flagellin filament and protein cap (19) that may be involved in protein-protein interactions and recombinant FliC proteins have been shown to be involved in C. difficile immunogenicity (20). F3 genes are associated with other structural components of the flagella, such as the basal body, hook, and motor proteins (19), and the F3 gene fliA (sigD) has been suggested to be important not only in regulating late-stage flagellar genes but also in the production of C. difficile toxins (21).
Genomic structural diversity of C. difficile flagella has been described before. For instance, C. difficile ribotype 078 (clade 5) strains lack the F3 regulon completely, rendering that ribotype immotile (19). Furthermore, many strains of C. difficile also have an F2 operon between F1 and F3, although these operational units are so diverse that they do not show up in the core genome for clades 1 and 2, and thus homologous recombination in the F2 operon was not analyzed in this study. Supporting the existence of selective pressure for diversification of F2 are previous reports of nonhomologous recombination in this region. C. difficile 630, for example, has four genes in its F2 region, including genes involved in glycosylation, while ribotype 027 isolate CD196 has six seemingly different genes in this area (19,22).
These results suggest that homologous recombination, in addition to nonhomologous recombination, may be an important mechanism of flagellar diversity in C. difficile. It has been reported that flagella may have a role in motility, colonization, adherence, virulence, and immunogenicity in mucosal pathogens, including C. difficile (23), and in vivo studies show that mutants of late-stage flagellar genes fliC and fliD in C. difficile 630 exhibited greater virulence than did the wild type (24). The enrichment in Flagellar operon RNA polymerase s 28 factor 11 33 3.0 ,0.001 a PR events, the predicted number of overlapping recombination events with each gene, calculated as the mean number of recombination events in each of 10,000 permutations randomly placing the identified recombination events. b OR events, the observed number of overlapping recombination events with each gene, as identified by Gubbins. c Ratio (fold change) of observed recombination events compared to predicted recombination events. d Probability (P) that the number of observed recombination events is observed by chance based on 10,000 permutations randomly placing the identified recombination events throughout the clades 1 and 2 core genome. e Gene in F3 regulon. f Gene in S layer. g Gene in F2 regulon. h Gene in F1 regulon. i PTS gene. j Membrane protein gene. flagellar diversity in closely related strains observed here suggests an evolutionary advantage, such as antigenic diversification or a unique role in virulence or pathogenicity. Of note, five genes involved in type IV pilus biosynthesis were also shown to be significantly enriched for homologous recombination (see Table S2), which suggests a FIG 2 Distribution of recombination events throughout the C. difficile genome. From innermost circle outward: positions in the core genome of clades 1 and 2 (purple) used as input for recombination detection; histogram of recombination events overlapping with each position in the C. difficile genome (dark gray represents more recombination than average, light gray represents less recombination than average); flagellar genes (turquoise); S-layer cassette genes (teal), phosphotransferase system (PTS) genes (dark red); genes involved in amino sugar and nucleotide sugar metabolism (salmon); genes involved in fructose and mannose metabolism (orange), and genes annotated as membrane proteins (navy). Areas highlighted in light blue represent genes that have more recombination than expected with a P value of ,0.05 as determined by a permutation test. Numbers around the circular genome plot mark the nucleotide position in the reference genome C. difficile 630. further role of homologous recombination in the diversification of cell surface structures that act as colonization factors in C. difficile (25). PTS and sugar metabolism. The phosphotransferase system (PTS) is responsible for the active import of certain sugars into bacteria (26). It involves many transmembrane proteins that interact with the extracellular environment, and could be important in host interactions and immunogenicity. The combination of enriched recombination in both PTS and the metabolism of certain sugars (fructose, mannose, and pentose; Table 2) also suggests that recombination may play a role in the ability of C. difficile strains to occupy unique niches via diverse metabolic capabilities. Interestingly, a pleiotropic effect of fliC on genes involved in a number of mechanisms, including PTS and carbon metabolism has been observed (6). Variation in the interaction of flagella, PTS, and metabolism proteins itself may be being selected for via homologous recombination.
Cell membrane and the S layer. Genes that code for components on the outer layers of bacteria that interact with the outside world and host environment have previously been shown to undergo a relatively large amount of homologous recombination (13). That appears to be the case here as well, since flagella are environmentexposed appendages, and the PTS system involves a number of transmembrane proteins that regulate what sugars can come into the cell. The COG ontology that includes genes involved in cell envelope biogenesis and the outer membrane, including S-layer proteins, is significantly enriched for recombination (Table 2). These data are in line with previous studies that show high recombination in the S-layer surface proteins of C. difficile (18). The majority of genes involved in the S layer (27) were significantly enriched for homologous recombination in this study (Table 1; see also Table  S2 and Fig. S2 in the supplemental material).
Conclusions. Homologous recombination is an important contributor to the evolution of C. difficile, accounting for an estimated 72% of SNVs. Furthermore, there are genomic regions and functional pathways especially enriched for homologous recombination. Genes involved in the production of proteins that interact with the host and outside environment (i.e., flagella, cell surface proteins, and PTS transporters) and genes important in the metabolism and import of certain sugars are more likely to be highly recombinant. The genes that have been shown here to undergo high rates of homologous recombination have also been documented elsewhere to be associated with toxin production, virulence, and pathogenicity (2, 6, 19-21, 23, 28). These data suggest that homologous recombination may play a significant role in intraclade variation in virulence. There are a number of hypothetical proteins seen here with recombination much higher than expected (Table 1; see also Table S2). These genes should be further investigated in order to determine their potential role in C. difficile pathogenesis and/or virulence, since there appears to be an evolutionary advantage to sustained heterogeneity in these proteins throughout clinically important strains of the bacteria.
Previous work that explored the impact of homologous recombination on C. difficile highlighted the enrichment of recombination in the S layer (13) but failed to find significance in other areas, such as the flagella and PTS. Limitation to only clades 1 and 2 in our analyses allowed for an increased core genome that includes many important genes in these systems that would be missed if only working with the entire C. difficile core genome.
Another implication of this research is the importance of examining the contribution of homologous recombination and allele variants when designing vaccine and drug targets. Both flagellar proteins and S-layer proteins have been suggested as potential vaccine targets for C. difficile (29); however, since the genomic evidence reveals multiple allelic types of these structures, multivalent vaccines may need to be considered.
The analysis of homologous recombination can be conducted on publicly available genomes, and reveal a great deal about pathogen evolution and selective pressures. Here, it has suggested a potentially important role of genetic diversity in flagella, the S layer, PTS, and metabolic pathways to the survival and pathogenesis of C. difficile. Further in vitro and in vivo assays should be conducted to confirm these roles.

MATERIALS AND METHODS
Strain selection. C. difficile isolates with whole-genome sequence data were selected from a set of 917 genomes previously analyzed in our lab and 4,443 previously published genomes (30) to maximize representation of clinically important strains. To reduce redundancy of nearly identical genomes, isolates that were differentiated from another isolate in the sample by less than six coding mutations were excluded. All genomes used in this analysis were sequenced previously and are publicly available (see Table S1).
Genome alignments. Whole-genome alignments were created for each clade, as well as the combination of clades 1 and 2 and the entire set of genomes. Whole-genome alignments were produced as previously described (31). Briefly, each assembly was first concatenated into a pseudochromosome by using mauve contig mover to order and orient large contigs relative to a reference (32). The finished chromosome of the C. difficile 630 strain (included in clade 1) was used as a reference genome where possible, and the most complete whole-genome sequence in each alignment (determined by minimum number of contigs) was chosen as the reference genome otherwise. Each pseudochromosome was aligned to the reference genome using nucmer, and variants were identified relative to the reference genome using the show-snps command (33). Single-nucleotide variants (SNVs) were filtered to remove those SNVs that were likely to be due to alignment or sequencing errors. SNVs were filtered out if (i) they resided in genes annotated as phage, transposase, or integrase; (ii) they resided within 20 bp of the start or end of a contig; (iii) they resided in tandem repeats of total length .20 bp, as determined by the exact-tandem program associated with MUMmer; or (iv) they resided in large inexact repeats as determined by nucmer. Core genomes were calculated based on positions in the reference genome that were aligned in every genome using nucmer.
Recombination identification. Gubbins v1.3.4 (Genealogies Unbiased By recom-Binations In Nucleotide Sequences [34]) was used to identify areas likely introduced under homologous recombination in C. difficile. Gubbins was run on core genome alignments of each C. difficile clade, as well as clades 1 and 2 collectively, and on all 412 representative C. difficile genomes used in this study. Genomic regions identified by Gubbins as having densities of SNVs statistically different from background SNV densities in the core genome were inferred to be homologous recombination events, although it is possible that other molecular mechanisms could also have played a role in some of these regions. Maximum-likelihood recombination-filtered phylogenetic trees for the entire set of genomes, as well as just clades 1 and 2, were produced using RAxML v8.2.9 (35) assuming a general time reversible model of nucleotide variations (-m GTRCAT). The bootstrap convergence test and the autoMRE convergence criteria (-N autoMRE) were used to determine the number of bootstrap replicates. The best-scoring maximum-likelihood tree was identified with rapid bootstrap analysis (-f a).
Identification of genes and pathways with enriched recombination. A permutation test was implemented and run in R v3.4.0 (36) to randomly place each observed recombination event in the C. difficile clades 1 and 2 core genome. Only clades 1 and 2 were used in this analysis since that is where the majority of recombination events were found and, since these two clades are closely related, it allowed for an expanded core genome compared to the entire set. Furthermore, epidemic ribotypes 014 and 027 fall within clades 1 and 2, respectively. The permutation test was run 10,000 times. Information about all deposited C. difficile genes in the reference genome C. difficile 630 were downloaded from the National Center for Biotechnology Information (NCBI) gene database. A recombination event was considered overlapping with a gene if they overlapped in at least one nucleotide position. One-sided P values were calculated for each gene as the number of permutations that predicted a number greater than or equal to the number of recombination events observed plus 1, divided by 10,001. Genes with a P value of ,0.05 were input as a gene list to the DAVID v6.8 (david.ncifcrf.gov) functional annotation tool to identify the KEGG pathways and COG ontologies associated with increased homologous recombination in C. difficile.
Data visualization. An unrooted phylogenetic tree (Fig. 1a) of all 412 isolates was plotted in R v3.4.0 with the ape package v4.1. Box plots and histograms (Fig. 1b and c;  Data availability. All of the genomes discussed here, as well as their accession numbers, are listed in Table S1 in the supplemental material.

SUPPLEMENTAL MATERIAL
Supplemental material is available online only.

ACKNOWLEDGMENTS
This study was supported by the National Institutes of Health, National Institute of Allergy and Infectious Diseases (grant U01AI124255).
The study was conceptualized by H.D.S. and E.S.S. Data analysis, original draft preparation, and visualizations were performed by H.D.S. All authors performed editing and review, and E.S.S. supervised the project.
We have no conflicts of interest to report.