Population-based analysis of ocular Chlamydia trachomatis in trachoma-endemic West African communities identifies genomic markers of disease severity

Background Chlamydia trachomatis (Ct) is the most common infectious cause of blindness and bacterial sexually transmitted infection worldwide. Ct strain-specific differences in clinical trachoma suggest that genetic polymorphisms in Ct may contribute to the observed variability in severity of clinical disease. Methods Using Ct whole genome sequences obtained directly from conjunctival swabs, we studied Ct genomic diversity and associations between Ct genetic polymorphisms with ocular localization and disease severity in a treatment-naïve trachoma-endemic population in Guinea-Bissau, West Africa. Results All Ct sequences fall within the T2 ocular clade phylogenetically. This is consistent with the presence of the characteristic deletion in trpA resulting in a truncated non-functional protein and the ocular tyrosine repeat regions present in tarP associated with ocular tissue localization. We have identified 21 Ct non-synonymous single nucleotide polymorphisms (SNPs) associated with ocular localization, including SNPs within pmpD (odds ratio, OR = 4.07, p* = 0.001) and tarP (OR = 0.34, p* = 0.009). Eight synonymous SNPs associated with disease severity were found in yjfH (rlmB) (OR = 0.13, p* = 0.037), CTA0273 (OR = 0.12, p* = 0.027), trmD (OR = 0.12, p* = 0.032), CTA0744 (OR = 0.12, p* = 0.041), glgA (OR = 0.10, p* = 0.026), alaS (OR = 0.10, p* = 0.032), pmpE (OR = 0.08, p* = 0.001) and the intergenic region CTA0744–CTA0745 (OR = 0.13, p* = 0.043). Conclusions This study demonstrates the extent of genomic diversity within a naturally circulating population of ocular Ct and is the first to describe novel genomic associations with disease severity. These findings direct investigation of host-pathogen interactions that may be important in ocular Ct pathogenesis and disease transmission. Electronic supplementary material The online version of this article (10.1186/s13073-018-0521-x) contains supplementary material, which is available to authorized users.


INTRODUCTION
The obligate intracellular bacterium Chlamydia trachomatis (Ct) is the leading infectious cause of blindness (trachoma) and the most common sexually transmitted bacterial infection 1,2 .
Ct strains are differentiated into biovars based on pathobiological characteristics and serovars based on serological reactivity for the major outer membrane protein (MOMP) encoded by ompA 3  suggesting that minor genetic changes influence pathogen-host and tissue-specific infection characteristics [5][6][7] . All published African ocular Ct genomes are situated on the ocular branch within the T2 clade of non-LGV urogenital isolates 4 . Currently there are only 31 published ocular Ct genome sequences 4,9-12 . In particular there appears to be limited genomic diversity between published Ct genomes from Gambian and Tanzanian populations 8 .
The pathogenesis of chlamydial infection begins with epithelial inflammation and may progress to chronic immuno-fibrogenic processes leading to blindness and infertility, though many Ct infections do not result in sequelae 13,14 . Strain-specific differences related to clinical presentation have been investigated in trachoma 8,15,16 .
These studies examined a small number of ocular Ct isolates from the major trachoma serotypes and found a small subset of genes in addition to ompA that were associated with differences in in vitro growth rate, burst size, plaque morphology, interferon-γ sensitivity and most importantly, intensity of infection and clinical disease severity in non-human primates (NHPs), suggesting that genetic polymorphisms in Ct may contribute to the observed variability in severity of trachoma in endemic communities 8 .
The obligate intracellular development of Ct has presented significant technical barriers to basic research into chlamydial biology. Only recently has genetic manipulation of the chlamydial plasmid been possible, allowing in vitro transformation and modification studies, though this remains technically challenging, necessitating alternative approaches 17,18 .
However there are currently no published population-based studies of Ct using WGS with corresponding detailed clinical data, making it difficult to relate genetic changes to functional relevance and virulence factors in vivo.
There is an increasing pool of Ct genomic data, largely from archived samples following cell culture and more recently directly from clinical samples 31 . WGS data obtained directly from clinical samples can be preferable to using WGS data obtained from cell cultured Ct, since repeated passage of Ct results in mutations that are not observed in vivo [32][33][34] .
Ct bacterial load is associated with disease severity, particularly conjunctival inflammation, in active (infective) trachoma 35 . Conjunctival inflammation has previously been shown to be a marker of severe disease and plays an important role in the pathogenesis of scarring trachoma [36][37][38] . In this study we used principal component analysis (PCA) to reduce the dimensions of clinical grade of inflammation (defined using the P score from the FPC trachoma grading system 39 ) and Ct bacterial load to a single metric to define an in vivo conjunctival phenotype in active (infective) trachoma. PCA is a recognised dimension reduction technique used to combine multiple correlated traits into their uncorrelated principal components (PC) 40-42 , allowing us to examine the relationship between Ct genotype and disease severity.
These data currently represent the largest collection of ocular Ct sequences from a single population from the trachoma-endemic region of the Bijagós Archipelago of Guinea Bissau, and provide a unique opportunity to gain insight into ocular Ct pathogenesis in humans.

RESULTS
Conjunctival swabs collected during a cross-sectional population-based trachoma survey on the Bijagós Archipelago yielded 220 ocular Ct infections detected by Ct plasmid-based droplet digital PCR (ddPCR). Of the 220 Ct infections detected, 184 were quantifiable using Ct genome-based ddPCR.
We obtained WGS data from 126 using cell culture (n=8) or direct sequencing from swabs with SureSelect XT target enrichment (n=118), representing the largest cross-sectional collection of ocular Ct WGS. Eighty-one of these sequences were subsequently included in the phylogenetic and diversity analyses and 71 were retained in the final genome-wide association (tissue localization (derived from the anatomical site of sample collection) and disease severity) analyses. The quality filtering process is illustrated in Figure 1 and detailed in Methods. Briefly, we used standard GATK SNP-calling algorithms where >10x mean depth of coverage is defined as a threshold value and performs well in variant calling, is highly sensitive and has a false positive rate of <0.05% 43,44 .
A total of 1034 unique SNP sites were identified within the 126 Bijagós Ct genomes relative to the reference strain Ct A/HAR-13. Following application of further threshold criteria based on minor allele frequency (MAF) and genome-wide missing data thresholds, we retained only high quality genomic data in the final association analyses (129 SNPs from 71 sequences). There were no significant differences between the 71 retained and the 55 excluded sequences with respect to demographic characteristics, bacterial load, disease severity scores or geographical location ( Table 1). Clinical and demographic details of the survey participants in whom we did not identify Ct infection have been published previously 45 . Of the ten SNPs initially identified within the Ct plasmid sequences, none fulfilled the quality filtering criteria and were not retained for the genome-wide association analyses.

Ocular C. trachomatis Phylogeny and Diversity
For the phylogeny and diversity analyses 81 Bijagós Ct sequences were included on the basis of quality filtering criteria described in detail in Figure 1. SNPbased phylogenetic trees constructed using all 1034 SNPs for sequences above 10x coverage (n=81), with 54 published Ct reference genomes, are shown in Figure 2.
The Bijagós sequences are situated within the T2 ocular monophyletic lineage with all other ocular Ct sequences 46 except those described by Andersson et al. 10 .
However, our population-based collection of ocular Ct sequences has much greater diversity at whole genome resolution than previously demonstrated in African trachoma isolates 4,8 . We used a pairwise diversity (π) metric to compare two populations of ocular Ct from regions with similar trachoma endemicity and studies with similar design, sample size and available epidemiological metadata. These data show much greater genomic diversity in the Bijagós ocular Ct sequences (π=0.07167) compared to the Tanzanian (Rombo) ocular Ct sequences (π =0.00047).
By ompA genotyping, 73 of the Bijagós sequences are genotype A and eight are genotype B, supporting their classical ocular nature (Supplementary Information S1). The high resolution of WGS data obtained directly from clinical samples captures diversity that may be useful in strain classification, particularly as we found some evidence of clustering at village level, although the very small number of sequences per village means that it is not possible to provide accurate estimates of clustering in this study ( Figure 3).

Homoplasic SNPs and regions affected by recombination are shown in
Supplementary Information S2 (a). Removal of these regions of recombination identified using the pairwise homoplasy index had no effect on phylogenetic relationships. Additionally, a site-wise log likelihood plot demonstrated that there was no clear genomic region where there was significant lack of confidence in the tree construction due to recombination ( Supplementary Information S2 (b)). Whether regions containing recombination were included or excluded, tree topology remained essentially identical, indicating that branching order is not affected by the removal of these regions.

Genome-wide analysis of C. trachomatis localization
Candidate genes thought to be involved in or indicative of ocular localization or preference were examined to further characterize this population of ocular Ct.
Polymorphisms and truncations in the tryptophan operon have previously been implicated in the inability of ocular Ct to infect and survive in the genital tract 5 . All sequences contained mutations in trpA resulting in truncation. The majority (80/81) were truncated at the previously characterised deletion at position 533 5 .
The variable domain structure of the translocated actin-recruiting phosphoprotein (tarP) has also been implicated in tropism 47  Permutation-based re-sampling methods, commonly used in GWAS analyses, were used to account for multiple comparisons [48][49][50][51] . 1007 SNPs were tested in 157 Ct sequences ( Figure 1) for association with ocular localization (defined by anatomical site of sample collection), comparing eight ocular, 17 urogenital and 13 LGV strains ( Figure 4(a)). One hundred and five SNPs were significantly associated with ocular localization (p*<0.05) of which 21 were non-synonymous (details in Table 2(a) and Supplementary Information S5). These were within a number of genes known to be polymorphic, genes previously identified as tropism-associated (CTA0156, CTA0498/tarP and CTA0743/pbpB) and virulence factors (CTA0498/tarP and CTA0884/pmpD). No predicted protein localization was over-represented in the ocular localization-related SNPs (p=0.6174), however early and very-late expressed genes were over-represented (p=0.0197).

Markers of disease severity in ocular C. trachomatis infection
Using permutation-based resampling methods, eight SNPs were found to be significantly associated with disease severity (Figure 4(b)). Seven of these are in

DISCUSSION
This is the first collection of clinical ocular Ct WGS from a single trachomaendemic population to be characterized, enabling us to describe the population diversity of naturally occurring Ct in a treatment-naïve population. We used detailed clinical grading combined with microbial quantitation to perform a genome-wide association study (GWAS) and investigated associations between Ct polymorphisms with ocular localization and disease severity in trachoma. Such diversity is likely to be representative of recombination present in Ct 57 .
Genome-wide recombination was common and widespread within these sequences.
Extensive recombination has been noted in previous studies, and is thought to be a source of diversification with possible interstrain recombination 4 The eight disease severity-associated SNPs are within less well-characterized genes, although there is some evidence of correlation of expression in the Ct developmental cycle (with a peak between 18 and 24 HPI). Apart from pmpE, there is a paucity of published data showing polymorphism in these genes. This suggests that these SNPs may be important in ocular Ct pathogenesis, rather than in longer-term chlamydial evolution. Three of these genes are putative Ct virulence factors, with functions in nutrient acquisition (glgA 24,28,72 ), host-cell adhesion (pmpE 73 ) and response to IFNγ-induced stress (trmD 69 ). Homologues of alaS 74,75 and CTA0273 76,77 are known virulence factors in related Gram-negative bacteria, suggesting that these genes are potentially important in Ct pathogenesis.
Transcriptome analysis of chlamydial growth in vitro has shown that there is highly upregulated gene expression of trmD (encoding a tRNA methyl-transferase) associated with growth in the presence of interferon gamma, thought to be important in the maintenance of chlamydial infection 69 . yjfH (renamed rlmB) is phylogenetically related to the TrmD family and encodes the protein RlmB, which is important for the synthesis and assembly of the components of the ribosome 78 . In Escherichia coli, Haemophilus influenzae and Mycoplasma genitalium, RlmB catalyses the methylation of guanosine 2251 in 23S rRNA, which is of importance in peptidyl tRNA recognition but is not essential for bacterial growth 78,79 . alaS encodes a tRNA ligase of the class II aminoacyl-tRNA synthetase family involved in cytoplasmic protein biosynthesis. It is not known to have virulence associations in chlamydial infection, but has been described as a component of a virulence operon in Haemophilus ducreyi 74 and H. influenzae 75 . The CDS CTA0273 encodes a predicted inner membrane protein translocase component of the autotransporter YidC, an inner membrane insertase important in virulence in E. coli 76 and Streptococcus mutans 77 . This is the first study suggesting that these loci may be important in disease severity and host-pathogen interactions in chlamydial infection. A summary of available literature for these key ocular localization and disease severity-associated SNPs is tabulated in Supplementary Information S8. We cannot speculate further on the effect polymorphisms on expression. It is possible that the synonymous disease severityassociated SNPs are linkage-markers for disease-causing alleles that were not included in the GWAS. For both analyses, further mechanistic studies are required to establish causality, validity and to fully understand the nature of the associations presented in this study.
Though we were intrinsically limited to those cases where infection was detectable and from which we were able to obtain Ct WGS data, our population-based treatment naïve sample attempts to provide a representative picture of what is observed in ocular Ct infection. We acknowledge that there may be Ct genotypes that are cleared by the immune system such that we do not capture them a cross-sectional study. We are limited to the small sample size in this study, but attempt to address the issues of statistical power and multiple testing by using a bi-dimensional conjunctival phenotype and permutation-based multivariable regression analysis. To date the majority of published microbial GWAS have sample sizes under 500 80 , including several key studies examining virulence 59 and drug-resistance 60 in Staphylococcus aureus with sample sizes of 75 and 90 respectively.
The potential of bacterial GWAS has only recently been realized, and despite the limitations with sample size, its use to study Ct in this way is particularly important, since in vitro models are intrinsically difficult to develop and it has not been possible to study urogenital Ct in the same way due to the lack of a clearly defined in vivo disease phenotype. The genomic markers identified in this study provide important direction for validation through in vitro functional studies and a unique opportunity to understand host-pathogen interactions likely to be important in Ct pathogenesis in humans. The greater than expected diversity within this population of naturally circulating ocular Ct and the clustering at village-level demonstrates the potential utility of WGS in epidemiological and clinical studies. This will enable us to understand transmission in both ocular and urogenital Ct infection and will have significant public health implications in preventing and eliminating chlamydial disease in humans.

Survey, Clinical Examination and Sample collection
Survey, clinical examination and sample collection methods have been described previously 45,81 . Briefly, we conducted a cross-sectional population-based survey in trachoma-endemic communities on the Bijagós Archipelago of Guinea Bissau. The upper tarsal conjunctivae of each consenting participant were examined, digital photographs were taken, a clinical trachoma grade was assigned and two sequential conjunctival swabs were obtained from the left upper tarsal conjunctiva of each individual using a standardized method 45 . DNA was extracted and Ct omcB (genomic) copies/swab quantified from the second conjunctival swab using droplet digital PCR (ddPCR) 81,82 .
We used the modified FPC (Follicles, Papillary hypertrophy, Conjunctival scarring) grading system for trachoma 39 . The modified FPC system allows detailed scoring of the conjunctiva for the presence of follicles (F score), papillary hypertrophy (conjunctival inflammation) (P score) and conjunctival scarring (C score), receiving a grade of 0-3 for each parameter. A single validated grader conducted the examinations and these were verified by an expert grader (masked to the field grades and ddPCR results) using the digital photographs. Grader concordance was measured using Cohen's Kappa, where a Kappa > 0.9 was used as the threshold to indicate good agreement.
Conjunctival inflammation (P score) is known to have a strong association with Ct bacterial load in this and other populations 35,[83][84][85][86] . For this study we used principal component analysis (PCA) to combine the presence of inflammation (defined by the P score using the FPC trachoma grading system 39 ) with Ct bacterial load (defined by tertile cut-offs illustrated in Supplementary Information S9) 87 . The conjunctival disease phenotype is a dimension reduction of these two variables defining what we observed in the conjunctiva at the time of sampling ( Figure 5).
Dimension reduction using PCA to define complex disease phenotypes in GWAS is well-recognised, as it allows multiple traits to be included to capture a more complex phenotype and accounts for correlation between traits. This approach therefore may reveal novel loci or pathways that would not be evident in single-trait GWAS, where the full extent of genetic variation cannot be captured 40 .   Variants were selected as the intersection dataset between those obtained using both SNP callers and SNPs were further quality-filtered. SNP alleles were called using an alternative coverage-based approach where a missing call was assigned to a site if the total coverage was less than 20x depth or where one of the four nucleotides accounted for at least 80% total coverage 95 . There was a clear relationship between the mean depth of coverage and proportion of missing calls, based on which we retained sequences with greater than 10x mean depth of coverage over the whole genome (81 sequences retained).

Pre-sequencing target enrichment
Heterozygous calls were removed and SNPs with a minor allele frequency of less than 25% were removed. Samples with greater than 25% genome-wide missing data and 30% missing data per SNP were excluded from the analysis (n=10, 71 sequences retained). The quality assessment and filtering process is shown in Figure 1.
Detail of WGS data is contained in Supplementary Information S10.

Phylogenetic Reconstruction
Samples were mapped to the ocular reference strain Ct A/HAR-13 and SNPs were called as described above. Phylogenies were computed using RAxML version 7.8.2 96 from a variable sites alignment using a GTR+gamma model and are midpoint rooted. Recombination is known to occur in Ct 4,6 and can be problematic in constructing phylogeny. We applied three compatibility-based recombination detection methods to detect regions of recombination using PhiPack 97 : the pairwise homoplasy index (Phi), the maximum Chi2 and the neighbour similarity score (NSS) across the genome alignment. We also examined the confidence in the phylogenetic tree by computing RAxML site-based likelihood scores 96 . Phylogenetic trees were examined adjusting for recombination using the methods described above.
Additionally, sequence data for the tryptophan operon (CTA0182 and CTA0184-CTA0186), tarP (CTA0498), nine polymorphic membrane proteins (CTA0447-CTA0449, CTA0884, CTA0949-CTA0952 and CTA0954) and ompA (CTA0742) were extracted from the 81 ocular Ct sequences from Guinea Bissau retained after quality control filtering described above, 48 ocular sequences originating from a study conducted in Kahe village, Rombo District, Tanzania 53 and 38 publicly available reference sequences. Phylogenies were constructed as described above.
Polymorphisms, insertions and deletions (INDELs) and truncations for the tryptophan operon were manually determined from aligned sequences using SeaView 98 . Tyrosine repeat regions and actin-binding domains in tarP were found using RADAR 99 and Pfam 100 respectively.

Pairwise diversity
A comparison was made between the two population-based Ct sequence data sets from the Bijagós (Guinea Bissau) and Rombo (Tanzania) sequences whereby short read data from the 81 Bijagós sequences and 48 Rombo sequences were mapped against Ct A/HAR-13 using SAMtools. Within population pairwise nucleotide diversity was calculated using the formula: where n is the number of sequences, x is the frequency of sequences i and j and πij is the number of nucleotide differences per site between sequences i and j 101 . Frequency of sequences was considered uniform within the populations and sites with missing calls were excluded on a per sequence basis.

Genome-Wide Association Analyses
To investigate the association between Ct polymorphisms with ocular localization and clinical disease severity, we used permutation-based logistic regression methods, which are powerful and well-recognised tools in GWAS, allowing for adjustment for population structure, age and gender in the model and accounting for multiple testing [48][49][50][51] .
We used permutation analyses of 100,024 phenotypic re-samplings, where the distribution of the p-value was approximated by simulating data sets through randomisation under the null hypothesis of no association between phenotype and genotype. Genome-wide significance was determined as p*≤0.05, where p* was defined as the fraction of re-sampled (simulated) data that returned p-values that were less than or equal to the p-values observed in the data 87 . All analyses were conducted using the R statistical package v3.0.2 (The R Foundation for Statistical Computing, http://www.r-project.org) using MASS, GLM and lsr. All R script used for these analyses is contained within Supplementary Information S11 and is released as a CC-BY open resource (CC-BY-SA 3.0).

Ocular localization
Tissue localization is defined as the localization (or presence) of a detectable

Clinical Disease Severity
A permutation-based ordinal logistic regression model was used to test the association between the disease severity score (using the in vivo conjunctival phenotype defined previously) and polymorphic sites. The final analysis includes 129 SNPs from 71 sequences derived as described in Figure 1. For each SNP the standard error for the t-statistic was estimated from the model and used to calculate the odds ratios (OR) and 95% confidence intervals. Individuals' age and gender were included as a covariate to the regression analysis.
We investigated the effect of population structure on the results of the GWAS analysis using PCA 105 . The first three principal components (PC) captured the majority of structural variation but including these in the model had no effect and therefore these were not included in the final model.
We corrected for genomic inflation if the occurrence of a polymorphism in the population was over 90% or there was a minor allele frequency of 3%.

DATA AVAILABILITY
All sequence data is available from the European Bioinformatics Institute (EBI) short read archive. See Supplementary Information S12 for details and accession numbers.

COMPETING FINANCIAL INTEREST
All authors declare no competing financial interests. Sequences (n=55) were excluded from the association analysis if there was a) <10x coverage b)* >25% missing reads genome-wide and c) >25% missing (N) calls at the single nucleotide polymorphism (SNP) locus. Coverage and missing data were correlated and resulted in exclusion of the same samples irrespective of criteria chosen. 71 sequences were retained in the final disease severity analysis. **Ocular C. trachomatis load = omcB (C. trachomatis genome) copies per conjunctival swab measured using droplet digital PCR. *** P score = Conjunctival inflammation score (0-3) using the modified FPC (Follicles, Papillary Hypertrophy, Conjunctival Scarring) grading system for trachoma 39 . (a) Ocular localization-associated non-synonymous SNPs (p-value < 0.1). Position of the SNPs and name of the impacted are from the Ct A/HAR13 (GenBank Accession Number NC_007429) genome. 'Allele Percentage' is the percentage of each group where the given allele was present. 'CDS/NCR' identifies whether the SNP was in a coding or non-coding region. 'P*' indicates p-values from 100,024 simulations indicating genome wide significance at p*<0.05. 'MAF' is the minor allele frequency. 'N Calls at Locus' is the proportion of isolates which had no base called. 'AA' is the amino acid coded for. (b) Disease severity-associated SNPs (p-value < 0.1). Disease severity is defined by a composite in vivo conjunctival phenotype derived using principal component analysis using ocular C. trachomatis load and conjunctival inflammatory (P) score (using the modified FPC (Follicles, Papillary Hypertrophy, Conjunctival Scarring) trachoma grading system 39 ). 'Reference Allele' indicates the reference allele on Ct A/HAR-13 (GenBank Accession Number NC_007429). 'CDS/NCR' identifies whether the SNP was in a coding or non-coding region. 'P*'=permuted p-value after 100,024 simulations indicating genome wide significance at p*<0.05. 'T' is the t statistic; SE(T) is the Standard Error of the t statistic. OR is the adjusted Odds Ratio (derived from the t statistic). 95% C.I.=95% confidence interval of the OR. 'MAF' is the minor allele frequency. 'N Calls at Locus' is the proportion of isolates which had no base called.

Figure 1. Whole Genome Sequence (WGS) quality filtering processes and threshold criteria for inclusion in analyses
Ct DNA detected using droplet digital PCR 82 . WGS data were obtained using SureSelect target enrichment 31 (or chlamydial cell culture) and Illumina paired end sequencing. FastQC 90 was used to assess basic WGS quality. SNP alleles were called against reference strain Ct A/HAR-13 using an alternative coverage-based approach where a missing call was assigned to a site if the total coverage was less than 20x depth or where one of the four nucleotides accounted for at least 80% total coverage 95 . There was a clear relationship between the mean depth of coverage and genome-wide proportion of missing calls, therefore only sequences with greater than 10x mean depth of coverage over the whole genome were retained using the GATK Best Practice threshold 93,94 . Heterozygous calls were removed and SNPs with a minor allele frequency (MAF) of less than 25% were removed. Samples with greater than 25% genome-wide missing data and 30% missing data per SNP were excluded from the analysis. WGS sequence quality is shown in detail in Supplementary Information S12. *n=157 including the 71 Bijagós sequences in addition to 48 Rombo sequences and 38 reference sequences.    A composite in vivo phenotype was derived using principal component analysis (PCA) for dimension reduction of two phenotypic traits: a disease severity score (using the P score value) and C. trachomatis load (where C. trachomatis load was log transformed and cut-offs determined from the resulting density plot (See Supplementary Information S9)). Each circle represents an individual infection (represented on the x axis (Index), n=81). Circle size reflects C. trachomatis load and circle colour reflects inflammatory P score (P0-P3) defined using the modified FPC (Follicles, Papillary Hypertrophy, Conjunctival Scarring) grading system for trachoma 39