Predicting Phenotype and Emerging Strains among Chlamydia trachomatis Infections

Single nucleotide polymorphisms can be used for epidemiologic and evolutionary studies worldwide.

Chlamydia trachomatis is a global cause of blinding trachoma and sexually transmitted infections (STIs). We used comparative genomics of the family Chlamydiaceae to select conserved housekeeping genes for C. trachomatis multilocus sequencing, characterizing 19 reference and 68 clinical isolates from 6 continental/subcontinental regions. There were 44 sequence types (STs). Identical STs for STI isolates were recovered from different regions, whereas STs for trachoma isolates were restricted by continent. Twenty-nine of 52 alleles had nonuniform distributions of frequencies across regions (p<0.001). Phylogenetic analysis showed 3 disease clusters: invasive lymphogranuloma venereum strains, globally prevalent noninvasive STI strains (ompA genotypes D/Da, E, and F), and nonprevalent STI strains with a trachoma subcluster. Recombinant strains were observed among STI clusters. Single nucleotide polymorphisms (SNPs) were predictive of disease specificity. Multilocus and SNP typing can now be used to detect diverse and emerging C. trachomatis strains for epidemiologic and evolutionary studies of trachoma and STI populations worldwide.

Chlamydia trachomatis is spread by close social contact or sexual activity. Worldwide, C. trachomatis is the leading preventable cause of blindness and bacterial sexually transmitted infections (STIs). Various typing techniques have been developed to better understand the epidemiology and pathogenesis of chlamydial diseases. Early typing schemes used monoclonal and polyclonal antibodies directed against the major outer membrane protein (MOMP) (1), and differentiated the organism into serovars and seroclasses: B class (comprising serovars B, Ba, D, Da, E, L2, L2a), C class (A, C, H, I, Ia, J, Ja, K, L1, L3), and intermediate class (F, G). Sequencing of ompA, which encodes MOMP, has refined typing, detecting numerous trachoma (2,3) and sexually transmitted (4,5) serovar subtypes.
Seroclasses, however, do not correlate with disease phenotypes. For example, A, B, Ba, and C are responsible for trachoma, whereas lymphogranuloma venereum (LGV) strains, L1-3, are associated with invasive diseases, such as suppurative lymphadenitis and hemorrhagic proctitis (6). Other typing techniques, such as restriction fragment length polymorphism (7), random amplification of polymorphic DNA, pulsed-field gel electrophoresis (PFGE) (8), and amplified fragment length polymorphism (9), also correlate poorly with disease phenotype, and none have been standardized across laboratories.
Multilocus sequence typing (MLST) has been used to characterize strains and lineages of numerous pathogens associated with human diseases that cause serious illness and death, including Neisseria meningitidis, Staphylococcus aureus, Vibrio cholerae, and Haemophilus influenzae. MLST uses 500-700 bp sequences of internal regions of 6-8 housekeeping genes, excluding genes suspected to be under immune selection (where there is positive selection for sequence diversity) and ribosomal RNA genes (which are multicopy and too conserved) (10). Advantages of MLST include its precision, allowing simple interlaboratory comparisons, good discrimination between strains, and buffering against the distorting effect of recombination on genetic relatedness. MLST data are also amenable to various population genetic analyses (11,12). Databases for >30 species are curated at www.mlst.net and pubmlst.org. In parallel with our study, 2 multilocus schemes have recently been developed for C. trachomatis. The first violated the above premise by including ompA, which is under immune selection (13). The second included only laboratory-adapted and 5 clinical E strains from the Netherlands (14).
In this work, unlike the other C. trachomatis MLST schemes, we used complete genomic comparisons of 7 strains from 4 species within the family Chlamydiaceae to identify conserved candidate housekeeping genes across the genomes. This approach ensures that the chosen loci are stable over the course of evolution, and allows for future application of a unified MLST scheme for other Chlamydiaceae spp. We typed a diverse worldwide collection of reference and clinical isolates from trachoma and STI populations, correlating genetic variation with geography and disease phenotype. We found disease-specific single nucleotide polymorphisms (SNPs) and a diversity of new strains, including strains that were recombinant for ompA relative to housekeeping loci, following up on our recent discovery of this phenomenon at multiple loci in Chlamydiaceae genomes (15-17).
DNA (GenBank accession nos. FJ45414-FJ746022) using ABI3700 instruments were aligned by using MegAlign (DNASTAR). Each unique sequence for a locus was designated as a unique allele using Sequence Output (www.mlst.net). Each allelic profile (made up of the string of integers corresponding to allele numbers at the 7 loci) was assigned as a different strain or clone and given an ST as a clone descriptor. All STs have been deposited in the C. trachomatis site at www.mlst.net.
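The allele and ST bookkeeping described above can be illustrated with a minimal Python sketch. It assumes that alleles and STs are numbered in order of first appearance, which is a simplification for illustration only; in practice the numbers are assigned through the curated database. The function names and the toy sequences are hypothetical and are not data from this study.

```python
from typing import Dict, List, Tuple


def assign_alleles(sequences: List[str]) -> List[int]:
    """Give each distinct sequence at one locus an integer allele number.

    Assumption: numbering follows order of first appearance.
    """
    allele_ids: Dict[str, int] = {}
    calls: List[int] = []
    for seq in sequences:
        key = seq.upper()
        if key not in allele_ids:
            allele_ids[key] = len(allele_ids) + 1
        calls.append(allele_ids[key])
    return calls


def allelic_profiles(
    seqs_by_locus: Dict[str, List[str]], loci: List[str]
) -> List[Tuple[int, ...]]:
    """Build one tuple of allele numbers per isolate, in the given locus order."""
    per_locus = [assign_alleles(seqs_by_locus[locus]) for locus in loci]
    return list(zip(*per_locus))


def assign_sts(profiles: List[Tuple[int, ...]]) -> List[int]:
    """Give each distinct allelic profile an integer ST (first-appearance order)."""
    st_ids: Dict[Tuple[int, ...], int] = {}
    sts: List[int] = []
    for profile in profiles:
        if profile not in st_ids:
            st_ids[profile] = len(st_ids) + 1
        sts.append(st_ids[profile])
    return sts


if __name__ == "__main__":
    # Hypothetical toy fragments for three isolates at two loci.
    toy = {
        "glyA": ["ACGTAC", "ACGTAT", "ACGTAC"],
        "yhbG": ["GGCATT", "GGCATT", "GGCATT"],
    }
    profiles = allelic_profiles(toy, ["glyA", "yhbG"])
    print(profiles, assign_sts(profiles))
```

Isolates 1 and 3 in the toy input share the same sequence at both loci and therefore receive the same profile and the same ST, whereas isolate 2 differs at one locus and receives a new ST.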
Allelic profiles and concatenated sequences were used to determine the relatedness of isolates. Average pairwise diversity between isolates was calculated from the 3,714-bp concatenated sequence of the 7 loci for each isolate joined in-frame using MEGA4 (27). Synonymous (dS) and nonsynonymous (dN) substitutions were determined using MEGA4 for each locus. Allele frequencies per locus and geographic region were calculated using SAS software 9.2 (SAS Institute, Inc., Cary, NC, USA) with the PROC FREQ tool supplying the frequency count. We calculated a classification index (11) on the basis of allele and ST frequency between populations of different geographic regions to determine the probability of association of an allele with a particular continental/subcontinental region. Statistical significance was determined by 10,000 resamplings of allele and ST frequency per region.
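The two calculations in this paragraph, average pairwise diversity and a resampling test for nonuniform allele distribution across regions, can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure used in the study: diversity is computed as the mean uncorrected proportion of differing sites, and a chi-square statistic with shuffled region labels stands in for the classification index of reference 11, whose exact formula is not given here. All function names and inputs are hypothetical.

```python
import random
from collections import Counter
from itertools import combinations
from typing import Hashable, List, Sequence


def p_distance(a: str, b: str) -> float:
    """Proportion of sites that differ between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mean_pairwise_diversity(seqs: List[str]) -> float:
    """Average p-distance over all pairs of concatenated sequences."""
    pairs = list(combinations(seqs, 2))
    return sum(p_distance(a, b) for a, b in pairs) / len(pairs)


def chi_square_stat(alleles: Sequence[Hashable], regions: Sequence[Hashable]) -> float:
    """Chi-square statistic for the allele-by-region contingency table.

    Assumption: used here as a stand-in for the classification index.
    """
    n = len(alleles)
    allele_totals = Counter(alleles)
    region_totals = Counter(regions)
    observed = Counter(zip(alleles, regions))
    stat = 0.0
    for allele, n_allele in allele_totals.items():
        for region, n_region in region_totals.items():
            expected = n_allele * n_region / n
            stat += (observed[(allele, region)] - expected) ** 2 / expected
    return stat


def permutation_p_value(
    alleles: Sequence[Hashable],
    regions: Sequence[Hashable],
    n_resamples: int = 10_000,
    seed: int = 0,
) -> float:
    """Estimate how often shuffled region labels match or exceed the observed statistic."""
    rng = random.Random(seed)
    observed = chi_square_stat(alleles, regions)
    shuffled = list(regions)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(shuffled)
        if chi_square_stat(alleles, shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_resamples + 1)
```

With 10,000 resamplings, the smallest p value this estimator can return is about 0.0001, so the number of resamplings sets the resolution of the test.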

Strain Clustering and Single Nucleotide Polymorphism Analyses
eBURST (www.mlst.net) was used to identify clusters of related and singleton STs that were not closely related to any other ST (12) and to predict patterns of evolutionary descent. MEGA4 (27) was used to construct a tree from concatenated sequences by using minimum evolution, neighbor joining, or unweighted pair group method with arithmetic mean, with various substitution models including Kimura 2-parameter, Jukes Cantor, and p-distance; 1,000 bootstrap replicates were used to test support for each node in the tree. The short evolutionary distances (<≈0.01) imply that back-substitutions were rare, and as expected, all methods gave similar results (data not shown). SplitsTree (www.splitstree.org) was used for evolutionary tree construction by decomposition analyses using the distance matrix produced from pairwise comparisons of concatenated sequences to determine interconnected networks (28).
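The statement that all substitution models gave similar results at short distances can be checked numerically. The following Python sketch uses the standard textbook formulas for the Jukes-Cantor and Kimura 2-parameter corrections; it is not MEGA4 code, and it assumes ungapped sequences containing only A, C, G, and T. The function names are hypothetical.

```python
import math
from typing import Tuple

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_fractions(a: str, b: str) -> Tuple[float, float]:
    """Return (P, Q): the fractions of sites that differ by a transition
    (purine<->purine or pyrimidine<->pyrimidine) and by a transversion."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    transitions = transversions = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    return transitions / len(a), transversions / len(a)


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance from the uncorrected proportion of differences p."""
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(P: float, Q: float) -> float:
    """Kimura 2-parameter distance from transition (P) and transversion (Q) fractions."""
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))


if __name__ == "__main__":
    # Hypothetical split of a 1% difference into transitions and transversions.
    P, Q = 0.006, 0.004
    print(P + Q, jukes_cantor(P + Q), kimura_2p(P, Q))
```

At an uncorrected distance of 0.01, both corrections change the estimate by less than 1%, which is consistent with the observation that trees built under the different models agreed.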
A matrix of all SNPs by ST was produced in Excel. SAS was used to identify which SNPs were associated with an ST using PROC FREQ. Statistical significance was determined by using a classification index as above for the probability of association of a SNP with a particular ST. Levene's test (29) was used to determine whether there was equal variance across the 87 isolates. A p value of <0.05 was considered significant.

ompA genotypes correlated poorly with relatedness between strains by MLST data (online Technical Appendix). Isolates of the same ST had up to 4 different ompA genotypes. For example, ST19 included ompA genotypes D, H, I, and J. For each ompA genotype, 38%-100% belonged to different STs. Different STs with the same ompA genotype were closely related by MLST (e.g., isolates with C and F ompA genotypes); others were not. Isolates of D, E, and Ja ompA genotypes differed at as many as 5 MLST loci.
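The discordance between ompA genotype and ST amounts to a two-way tabulation, which can be sketched in Python as follows. The function names are hypothetical, and the records in the usage example are toy data; only the composition of ST19 (genotypes D, H, I, and J) follows the text, and the other entries are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Record = Tuple[str, str]  # (sequence type, ompA genotype) for one isolate


def genotypes_per_st(records: Iterable[Record]) -> Dict[str, Set[str]]:
    """Collect the set of ompA genotypes observed within each ST."""
    by_st: Dict[str, Set[str]] = defaultdict(set)
    for st, genotype in records:
        by_st[st].add(genotype)
    return dict(by_st)


def sts_per_genotype(records: Iterable[Record]) -> Dict[str, Set[str]]:
    """Collect the set of STs observed for each ompA genotype."""
    by_genotype: Dict[str, Set[str]] = defaultdict(set)
    for st, genotype in records:
        by_genotype[genotype].add(st)
    return dict(by_genotype)


if __name__ == "__main__":
    toy_records = [
        ("ST19", "D"), ("ST19", "H"), ("ST19", "I"), ("ST19", "J"),
        ("ST_x", "D"), ("ST_y", "E"),  # hypothetical STs, not from the study
    ]
    print(genotypes_per_st(toy_records))
    print(sts_per_genotype(toy_records))
```

An ST that maps to several genotypes, or a genotype that maps to several STs, flags isolates for which ompA and the housekeeping loci tell different stories.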

Allele Characteristics and Localization by Geography
Allele characteristics are shown in Table 2. The number of alleles at each locus varied from 4 to 11. The average pairwise distance and dS and dN are provided. We determined allele frequencies on the basis of continental/subcontinental regions (Table 3). The majority of alleles were observed multiple times. Seventeen were found only once, and 28 were unique to a specific region (Table 3). The range was from 1 allele at the lysS locus for South America to 9 in North America. The highest frequency of a unique allele was 84.6% (leuS allele 7) for Asia, which also had the highest proportion of unique alleles, 6/17 (35.29%). There was a significant nonuniform distribution of alleles at each locus by classification index.
Relationships between the isolates were further explored by constructing a minimum-evolution tree using MEGA4 (27). These data showed 3 disease clusters (Figure 3). Cluster I comprised noninvasive STIs (eBURST CC-B) and a trachoma subcluster (eBURST CC-A). Cluster II comprised only invasive LGV strains. Cluster III included noninvasive prevalent D/Da, E, and F STIs (eBURST CC-C). E58t strain (ST39; cluster III) was isolated from the conjunctiva of a trachoma patient, most likely representing autoinoculation from the urogenital tract, because all other isolates of this ST were from STIs.
Nine isolates did not localize on the MLST tree with strains of the same ompA genotype (Figure 3). Ja41nl and Ja47nl, which were expected to cluster with other J and Ja isolates in cluster I if the genome sequences were similar, were identical by MLST to reference strain F and clinical isolates F8p, F9p, E19e, and E5s in cluster III. Similarly, D83s, which was expected to cluster with other D and Da isolates in cluster III, had the same ST as H40nl, H18s, I22p, and J44nl in cluster I; D2s was identical to Ia and Ia57e in cluster I. Additionally, G16p did not cluster with the other G isolates in cluster I. In analyzing locations of incongruence between clinical D and E isolates in cluster I, compared with those in cluster III, the loci that differed were glyA, yhbG, and pykF, in which allele assignments were identical, in general, to G, H, I, Ia, J, Ja, and K strains (online Technical Appendix) in cluster I. These were the same loci that differed for Ja41nl and Ja47nl in cluster III, compared with other J/Ja isolates in cluster I. Ja26s differed at glyA, mdhC, and yhbG, whereas G16p differed at yhbG, lysS, and leuS. Furthermore, the ompA tree (Figure 4) was incongruent with the MLST tree. We interpret all 9 isolates to be recombinants.
SplitsTree decomposition evaluated alternative evolutionary pathways that might indicate recombination between MLST loci (Figure 5). There was considerable network structure, providing evidence of alternative pathways between strains, which may indicate that recombination has influenced the evolution of housekeeping genes for the C. trachomatis strains.

SNPs Associated with Disease Phenotypes
We identified 61 polymorphic sites among the 7 loci. Multiple SNPs were significantly associated with each of the 3 clusters and disease groups (Table 4). For example, 15 SNPs in yhbG and leuS were 100% specific for all LGV strains in cluster II. Any 1 of these SNPs could be used to identify these strains. SNPs 4, 29, 31, 33, and 34 (together or any 1 alone) were specific for the cluster III STIs. For

Discussion
Accumulating evidence for recombination among Chlamydiaceae in general, and C. trachomatis in particular, has motivated a typing system that provides buffering from the distorting effects of genetic reshuffling that plague systems based on a single locus. We therefore developed an MLST scheme derived from comparative genomics of species within the family Chlamydiaceae to select conserved chromosomally dispersed housekeeping genes. Our scheme showed considerable variability in allelic profiles associated with geographic regions, as well as diverse and recombinant strains. We also identified SNPs that correlated with the 3 C. trachomatis disease groups: invasive LGV diseases, noninvasive urogenital diseases, and trachoma.
Comparative genomics of Chlamydia and Chlamydophila spp. identified 14 conserved housekeeping genes that could be used to extend MLST schemes for these and potentially other Chlamydiaceae spp. Surprisingly, each gene was located in a different position within the respective genome, indicating a lack of synteny among chromosomes (20) (Figure 1), except for the 2 C. trachomatis and 3 C. pneumoniae strains, which share >99% nucleotide sequence identity within species. This finding suggests that future schemes should select loci to ensure reasonable coverage of the chromosome.
Although there was relatively little sequence diversity in the housekeeping genes, the number of STs (0.51 ST/isolate) was similar to that of other bacterial pathogens. The previous C. trachomatis MLST scheme had 0.60 ST/isolate (14). None of the loci were identical to ours. In a recent study of the bacterium Burkholderia pseudomallei in Australia, there were 0.65 ST/isolate (11,12) with relatively little diversity and few alleles per locus. However, high levels of recombination are believed to shuffle alleles to generate large numbers of different allelic profiles (STs). The extent to which recombination among alleles generates novel STs in C. trachomatis is unclear. Although the number of STs per isolate varies, the majority of MLST schemes have been successful for strain discrimination, epidemiologic studies, and evaluation of organism evolution (10). MLST, however, may not be sufficiently discriminatory for some epidemiologic investigations, even with increased loci numbers. This may be the case for LGV strains, although our scheme resolved 2 L2b strains from all other LGV strains. We found that a number of STs for STI isolates were shared across continents. This finding was particularly evident for those from Amsterdam, Ecuador, Lisbon, and San Francisco, which would be expected given increasing opportunities for global travel and international sexual encounters. Notably, L2b isolates (ST33) from proctitis cases differed at 2 loci from other LGV isolates (ST1) and were restricted to Amsterdam. Although some L2b strains from Amsterdam and San Francisco have historically been similar (30), ST differentiation most likely reflects the emergence of these strains among men who have sex with men.
Not surprisingly, STs for trachoma isolates were restricted to the geographic region of origin where populations travel only locally.
Allele frequencies were assigned on the basis of continental/subcontinental regions (Table 3). Most alleles were observed multiple times, and more than half were region specific. Despite the opportunity for worldwide spread, some strains may be stable within the respective geographic populations. This stability was particularly evident in Africa and Asia, where the frequency of unique alleles was the highest, although this finding also reflects the fact that most isolates were from trachoma populations. As expected, we found, in general, a statistically significant nonuniform distribution of alleles.
Analyses using eBURST and trees constructed in MEGA4 resolved isolates into clonal complexes or clusters. Both methods identified distinctive groupings of strains by disease phenotypes. STIs caused by less common strains formed an eBURST group (CC-B) but were within Cluster I on the tree together with the trachoma Subcluster I, which was a separate eBURST group (CC-A). A similar clustering pattern to our tree was found by Pannekoek et al. (14) by using 16 reference and 5 clinical E strains, but they did not distinguish trachoma reference strain B/TW-5 from the LGV group. Our Cluster II included only LGV strains. Cluster III contained the noninvasive globally prevalent D/Da, E, and F strains (eBURST CC-C). This cluster represents efficiently transmitted strains with adaptive fitness in the genital tract. A number of isolates representing different ompA genotypes shared the same ST, whereas many isolates of the same ompA genotype had different STs (online Technical Appendix). Furthermore, 9 isolates were found outside the expected cluster, suggesting that recombinational replacement at the ompA locus occurs relatively frequently.
Accumulating evidence supports frequent recombination among Chlamydiaceae. Initial evidence came from observations of recombination within ompA (4,31) followed by phylogenetic analyses (32), and bioinformatic and statistical analyses for multiple species of the family Chlamydiaceae and C. trachomatis strains (15). Recently, we showed intergenic recombination involving ompA and pmpC, pmpE-I, and frequent recombination throughout the genome with significant hotspots for recombination for recent clinical isolates (16,17). Pannekoek et al. noted incongruence between ompA and fumC sequences (14). Most recombination in our study involved yhbG, glyA, and pykF (online Appendix Table) with incongruence, compared with ompA. Based on C. trachomatis recombinants that have been created in vitro, the estimated size of transferred DNA ranged from 123 kb to 790 kb (33). Although additional recombination sites may exist in regions that were not sequenced, any gene in our study could be involved in lateral gene exchange with a range of 1,191 bp for a single gene (e.g., ompA), 27 kb (yhbG to ompA) to at least 248 kb (glyA to yhbG), which is consistent with DeMars and Weinfurter (33) and our previous findings (16,17).
Analysis of the 61 SNPs among the 7 loci showed a statistically significant association of specific polymorphisms with each disease cluster (Table 4). A total of 15 SNPs singly or together identified the LGV cluster. Similarly, 5 SNPs identified the prevalent Cluster III D/Da, E, and F strains. Three clinical D and E strains did not contain these SNPs and each appeared to be a recombinant with other STI strains. Only 1 SNP (in pykF) identified all trachoma strains in Subcluster I. Reference trachoma strains A and Ba did not contain this SNP, suggesting that they may not represent circulating strains among present-day populations.
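The notion of a SNP that identifies a disease cluster singly can be made concrete with a short Python sketch. It treats a site as diagnostic when every isolate in the target group carries the same base and no isolate outside the group carries that base. This is an illustrative definition of 100% specificity and is not the classification index used for significance testing. The function name, group labels, and sequences are hypothetical.

```python
from typing import Dict, List, Tuple


def diagnostic_snps(
    seq_by_isolate: Dict[str, str],
    group_by_isolate: Dict[str, str],
    target_group: str,
) -> List[Tuple[int, str]]:
    """Return (1-based position, base) pairs that are fixed in the target
    group and absent from every isolate outside it."""
    inside = [s for iso, s in seq_by_isolate.items()
              if group_by_isolate[iso] == target_group]
    outside = [s for iso, s in seq_by_isolate.items()
               if group_by_isolate[iso] != target_group]
    if not inside:
        raise ValueError("no isolates in the target group")
    hits: List[Tuple[int, str]] = []
    for pos in range(len(inside[0])):
        bases_inside = {s[pos] for s in inside}
        if len(bases_inside) != 1:
            continue  # site is polymorphic within the target group
        base = next(iter(bases_inside))
        if all(s[pos] != base for s in outside):
            hits.append((pos + 1, base))
    return hits


if __name__ == "__main__":
    # Hypothetical aligned, concatenated toy sequences.
    seqs = {"iso1": "ACGTA", "iso2": "ACGTA", "iso3": "ATGTC", "iso4": "ATGTA"}
    groups = {"iso1": "LGV", "iso2": "LGV", "iso3": "STI", "iso4": "STI"}
    print(diagnostic_snps(seqs, groups, "LGV"))
```

In the toy alignment only position 2 qualifies: position 5 is fixed within the target group but is also found in one isolate outside it, which is the situation described for recombinant strains that lack or share cluster-specific SNPs.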
Other studies have associated SNPs or indels in pmp and porB genes with specific disease-causing C. trachomatis clades (16,34,35). However, SNPs were not individually analyzed for specific disease associations and the target genes encode surface-exposed proteins likely to be under selection for epitope variation to avoid immune system surveillance. A frame-shift mutation in 1 of the tryptophan synthase genes, trpA, was associated with trachoma strains when compared with all others, although some B and C strains lack the entire gene (35). Large deletions in the cytotoxin loci have also been identified that differentiate the 3 disease groups, yet strain B is missing these loci (36). The latter study relied on reference strains, which may limit the use of these deletions for identifying disease-specific groups because clinical isolates may vary in deletion size or location. Additionally, tryptophan synthase genes and cytotoxin loci are located within the 50-kb plasticity zone of the chromosome, a region known for genetic reshuffling (20). The current study differs from those previously mentioned in that it used housekeeping genes that are not under immune selection or in the plasticity zone. Therefore, the SNPs we identified are probably neutral and can be used as reliable markers for disease association. Furthermore, SNPs were based on reference and clinical isolates, with multiple isolates of the same strains, from 6 geographic regions, representing a broad diversity of this species.
Given the high rates of infection among STI (37) and trachoma populations (38,39), the ability to distinguish LGV and noninvasive urogenital and trachoma strains, including mixed infections, would aid epidemiologists, clinicians, and public healthcare workers worldwide in determining appropriate therapeutic or intervention strategies (40). Our multilocus and SNP typing can now be used to standardize the way an organism is typed; isolates from diverse geographic regions worldwide can be identified and compared; and diverse and emerging C. trachomatis strains can be detected for epidemiologic and evolutionary studies among trachoma and STI populations worldwide.

Dr Dean is director of Children's Global Health Initiative and a faculty member at the University of California at San Francisco and the University of California at Berkeley Joint Graduate Group in Bioengineering. She is also a senior scientist in the Center for Immunobiology and Vaccine Development at Children's Hospital Oakland Research Institute. Her research interests focus on chlamydial pathogenesis and comparative genomics and molecular epidemiology of chlamydial ocular and sexually transmitted diseases.