Targeted genomic analysis reveals widespread autoimmune disease association with regulatory variants in the TNF superfamily cytokine signalling network

Background Tumour necrosis factor (TNF) superfamily cytokines and their receptors regulate diverse immune system functions through a common set of signalling pathways. Genetic variants in and expression of individual TNF superfamily cytokines, receptors and signalling proteins have been associated with autoimmune and inflammatory diseases, but their interconnected biology has been largely unexplored. Methods We took a hypothesis-driven approach using available genome-wide datasets to identify genetic variants regulating gene expression in the TNF superfamily cytokine signalling network and the association of these variants with autoimmune and autoinflammatory disease. Using paired gene expression and genetic data, we identified genetic variants associated with gene expression, expression quantitative trait loci (eQTLs), in four peripheral blood cell subsets. We then examined whether eQTLs were dependent on gene expression level or the presence of active enhancer chromatin marks. Using these eQTLs as genetic markers of the TNF superfamily signalling network, we performed targeted gene set association analysis in eight autoimmune and autoinflammatory disease genome-wide association studies. Results Comparison of TNF superfamily network gene expression and regulatory variants across four leucocyte subsets revealed patterns that differed between cell types. eQTLs for genes in this network were not dependent on absolute gene expression levels and were not enriched for chromatin marks of active enhancers. By examining autoimmune disease risk variants among our eQTLs, we found that risk alleles can be associated with either increased or decreased expression of co-stimulatory TNF superfamily cytokines, receptors or downstream signalling molecules. Gene set disease association analysis revealed that eQTLs for genes in the TNF superfamily pathway were associated with six of the eight autoimmune and autoinflammatory diseases examined, demonstrating associations beyond single genome-wide significant hits. Conclusions This systematic analysis of the influence of regulatory genetic variants in the TNF superfamily network reveals widespread and diverse roles for these cytokines in susceptibility to a number of immune-mediated diseases. Electronic supplementary material The online version of this article (doi:10.1186/s13073-016-0329-5) contains supplementary material, which is available to authorized users.


NHGRI GWAS Catalog search for TNFSF-related genes and intersection with eQTLs
The NHGRI GWAS Catalog [1] was filtered at p < 5 x 10 -8 (genome-wide significance) for variants in or near (as defined by the catalogue) any autosomal member of the TNFSF, TNFRSF or their downstream signalling molecules (Additional file 2) associated with any autoimmune or autoinflammatory disease (search terms derived from [2] are in Additional file 1). To test enrichment of TNFSF-related gene associations in autoimmunity and autoinflammation versus other diseases, genes only in the "Mapped Genes" category of the catalogue were considered to avoid reporting bias of immune-related genes. Results were separated by disease trait and genomic regions were counted only once per disease trait to avoid over-counting associations found in multiple studies. Fisher's exact test was then used to test enrichment of autoimmune and autoinflammatory disease-associated loci for TNFSFrelated genes (overlap of 29 from 711 autoimmune and autoinflammatory disease associations and 58 associations near autosomal TNFSF-related genes among a total of 4890 catalogue associations).
To examine overlap with TNFSF-related cis-eQTLs, the SNP most strongly associated with gene expression was extracted for each TNFSF-related cis-eQTL (FDR < 0.1) in each cell type. The NHGRI GWAS Catalog filtered as in the previous paragraph was intersected with these eQTL SNPs (linkage disequilibrium, LD, r 2 ≥ 0.8). LD and phasing between eQTLs and GWAS SNPs were calculated from the 1000 Genomes Phase 1 EUR population vcf files [3].

Sorting peripheral blood subsets from individuals
Peripheral blood was separated over a Histopaque 1077 gradient. Neutrophils were isolated from the granulocyte pellet using CD16 Microbeads (Miltenyi Biotec). Peripheral blood mononuclear cells were split in two fractions. From one fraction, in sequential rounds of positive selection, monocytes followed by CD4 + T cells were isolated using CD14 and CD4 Microbeads, respectively (Miltenyi Biotec). From the second fraction, B cells followed by CD8 + T cells were isolated using CD19 and CD8 Microbeads, respectively (Miltenyi Biotec).
RNA was extracted using the AllPrep DNA/RNA Mini Kit (Qiagen).

Gene expression microarray data processing
Samples exhibiting sex discordance or global dimness were excluded before gene expression microarray data processing. For cell type comparison in healthy controls, gene expression datasets of four cell subsets from five healthy individuals were run in the same microarray batch. For eQTL analysis with healthy controls and IBD patients, expression datasets for each cell subset were analysed separately as there was confounding of cell type and microarray batch. The microarray dataset contained more individuals than genotypes were available for. To increase normalisation robustness, microarray processing for the eQTL study was performed using all available samples of the same diagnoses and cell types as were used in the eQTL study. Probesets were annotated with the pd.hugene.1.1.st.v1 Bioconductor annotation package [7], and samples were processed using the RMA function of the oligo Bioconductor package [8] in R. Quality control was performed by first correcting for microarray batch with the ComBat function of the sva Bioconductor package [9] using diagnosis (Control, CD, or UC), gender, and age as covariates, and then running the arrayQualityMetrics Bioconductor package [10]. Samples with two or more outlying characteristics were excluded from the datasets.
For the healthy control gene expression analysis, we extracted TNFSF-related gene-level probesets from RMA-preprocessed expression data (Additional file 3). Where there was more than one probeset per gene, we selected the probeset with maximal transcript coverage. Where there was equal transcript coverage, a probeset was chosen at random.
Heat maps were generated using the heatmap.2 function of the gplots R package [11].
For eQTL analysis, the PEER R package [12] was used instead of ComBat to remove latent as well as known technical and biological confounders. Expression data was adjusted with PEER, specifying batch, gender, diagnosis, and age as known potential confounders, and accounting for 30 hidden factors. Microarray probesets were then filtered as described in the previous paragraph.

Genotype data processing for eQTL analysis
Within each of the two genotyping batches, samples were removed based on the following criteria: differing from pre-sequencing Sequenom typing (concordance < 0.9), duplication (concordance > 0.98 with another sample), inferred sex ambiguous or conflicting with provided sex, call rate < 0.95, or mean normalised magnitude of intensity < 0.9. All subsequent data processing was performed in PLINK [5,6] except ethnicity principal component analysis (PCA). Within each batch, SNPs and samples were filtered for genotyping rate above 95%, minor allele frequency above 1%, and Hardy-Weinberg equilibrium p-value greater than 10 -8 . The two batches were then merged and the same filters applied. Heterozygosity was calculated and samples with inbreeding f-statistics outside three standard deviations from the mean f-statistic were removed. Identity-by-state was calculated, and duplicate samples were removed, keeping the duplicate with the fewest missing genotypes. To verify the homogeneity of sample ethnicities, sample genotypes were compared to hapmap3 genotypes [13,14]. Hapmap3 founder genotypes were filtered for SNPs common to our dataset on the same strand, combined with our data, and thinned to 5% of the original SNP coverage. Ethnicity PCA was performed in R, using the snpStats Bioconductor package [15]. Ethnicity principal components 1 and 2 were plotted and our dataset visually examined for outliers. One outlier was removed. Only autosomal SNPs were retained for eQTL analysis.

Variable selection for multiple cis SNPs contributing to cis-eQTLs
For each gene with more than one significant cis-eQTL SNP in a given cell type, all SNPs associated with expression of that gene (FDR < 10%) were included as predictor variables in a linear model. In an exhaustive model search, all possible variable subsets were evaluated for Bayesian information criterion (BIC) using the regsubsets function from the leaps R package [16]. The model with minimum BIC was chosen.

Nanostring nCounter measurements and data processing
RNA from CD4 + T cells, CD14 + monocytes, and CD16 + neutrophils from healthy controls and IBD patients was previously measured by the nCounter Analysis System (Nanostring Technologies, Seattle, Washington, USA) [17]. Samples were from the same cohort as those used for eQTL mapping, with 11/14 CD4 + T cell samples, 12/14 CD14 + monocyte samples, and 8/12 CD16 + neutrophil samples shared between the Nanostring and eQTL datasets. Hybridisations were carried out with 100 nanograms of RNA for 17 hours before scanning. The custom nCounter probe set included all TNFSF and TNFRSF member genes (but not downstream signalling molecule genes), among other test and control genes.
Samples were first normalised for hybridisation efficiency by the geometric mean of positive control probes. No normalisation factors fell outside of the acceptable range of 0.3-3.
Normalised counts from each sample were then further normalised to CNOT1 expression, which was previously found to be a reasonable control gene in the cell types examined [17].

Processing and analysis of H3K27ac ChIP-seq data
For each cis-eQTL (FDR < 0.1), the strongest cis-eQTL SNP was extracted and hg19 coordinates mapped [18]. H3K27ac ChIP-seq or input DNA sequencing reads overlapping these loci were counted using the bedtools [19] intersect function and normalised per million reads. To examine enrichment of H3K27ac ChIP-seq reads across these cis-eQTLs compared with other SNPs, all SNPs tested in the eQTL analysis were intersected with the ChIP-seq .bed files. A random distribution of mean counts per million was created in each cell type as follows. The same number of TNFSF-related genes as had eQTLs in that cell type were randomly selected, and a SNP from the eQTL SNP genotyping chip from the cis region of each of these genes was randomly chosen to compute a mean H3K27ac counts per million overlap. This process was repeated 10,000 times. Because our eQTL SNPs were not fine-mapped, we also expanded this comparison to include all SNPs in LD r 2 ≥ 0.8 with the eQTL SNPs or randomly selected SNPs from the cis region. LD-tagged SNPs were identified using the 1000 Genomes Phase 1 EUR population vcf files [3]. For each eQTL or random SNP, acetylation was counted at the maximally acetylated tagged SNP, and these ChIP-seq counts per million were then averaged for the true data or for each randomly selected dataset. To compare eQTL strength versus acetylation enrichment, we considered the strongest eQTL SNP for each gene in each cell type, regardless of significance level.
The expression association statistics (1 degree of freedom chi-squared scores) for these SNPs were then compared with H3K27ac counts per million at the same loci by Spearman correlation. For visualisation, H3K27ac ChIP-seq data was converted to bedGraph format using bedtools2 [19] genomecov function, and tracks were viewed in the UCSC genome browser, hg19 genome build [20].

Processing of genetic data from previous GWAS
Genetic data processing was carried out in PLINK [5,6] and R using the snpStats Bioconductor package [15]. All (if any) SNPs and samples flagged for removal by the original studies were removed. If cases and controls were genotyped separately, and if this information was provided in the available data, the following filtering steps were performed in each batch separately before combination: genotyping rate at the SNP and individual level above 95%, minor allele frequency above 1%, and Hardy-Weinberg equilibrium p-value greater than 10 -8 . After combination of cases and controls, the same steps were performed in the whole dataset. For SNPs with minor allele frequency less than 5%, SNP genotype missingness was required to be less than 1%. For GWAS datasets where SNPs were labelled with a manufacturer's IDs, we annotated SNPs with rsIDs. Where more than one non-rsID corresponded to a single rsID, the ID with fewer missing calls was selected. SNP     TNFSF14   TNF   TNFSF18   TNFSF11   FASLG   TNFSF9   TNFSF15   TNFSF4   CD70   4 8 12   Value   Color Key   TNFRSF10A   TNFRSF10D   CD40   TNFRSF8   TNFRSF13B   TNFRSF13C   TNFRSF17   TNFRSF11B   EDAR   TNFRSF18   TNFRSF4   TNFRSF19   TNFRSF21   TNFRSF11A   NGFR   TNFRSF25   CD27   TNFRSF1A   TNFRSF10B   FAS   TNFRSF14   TNFRSF12A   TNFRSF9   TNFRSF10C Figure 2 for separated TNFSF ligands, TNFRSF receptors, and signalling molecules.