Integrative genomics analyses unveil downstream biological effectors of disease-specific polymorphisms buried in intergenic regions

Functionally altered biological mechanisms arising from disease-associated polymorphisms, remain difficult to characterise when those variants are intergenic, or, fall between genes. We sought to identify shared downstream mechanisms by which inter- and intragenic single-nucleotide polymorphisms (SNPs) contribute to a specific physiopathology. Using computational modelling of 2 million pairs of disease-associated SNPs drawn from genome-wide association studies (GWAS), integrated with expression Quantitative Trait Loci (eQTL) and Gene Ontology functional annotations, we predicted 3,870 inter–intra and inter–intra SNP pairs with convergent biological mechanisms (FDR<0.05). These prioritised SNP pairs with overlapping messenger RNA targets or similar functional annotations were more likely to be associated with the same disease than unrelated pathologies (OR>12). We additionally confirmed synergistic and antagonistic genetic interactions for a subset of prioritised SNP pairs in independent studies of Alzheimer’s disease (entropy P=0.046), bladder cancer (entropy P=0.039), and rheumatoid arthritis (PheWAS case–control P<10−4). Using ENCODE data sets, we further statistically validated that the biological mechanisms shared within prioritised SNP pairs are frequently governed by matching transcription factor binding sites and long-range chromatin interactions. These results provide a ‘roadmap’ of disease mechanisms emerging from GWAS and further identify candidate therapeutic targets among downstream effectors of intergenic SNPs.


II-Supplementary Figures
Supplementary Figure 1

III-Supplementary Tables
Supplementary where gene g 1 was annotated to a set of GO terms, T(g 1 ), and |T(g 1 )|, is the cardinality of the set T(g 1 ).
The GENE_ITS provides a score that ranges from 0 to 1, where GENE_ITS of 0 corresponds to two genes with no similar GO annotations and GENE_ITS of 1 corresponds to two genes with the same GO annotations. See Pesquita at al. (Pesquita et al. 2009) and our previous paper (Regan et al. 2012) for more details about this widely-used semantic measure.

A3. Novel Semantic biological similarity of two SNPs using eQTL associations (SNP_ITS):
We developed a new method to calculate the biological similarity of two SNPs using their SNP-mRNA eQTL associations (or SNP_ITS for short) (Equation 4). The similarity between SNP 1 (s 1 ) and SNP 2 (s 2 ) is measured using the semantic similarity (Equation 3) between the set of mRNAs associated to SNP 1 (G(s 2 )) and those associated to SNP 2 (G(s 2 )). Precisely, for a specific mRNA (g i ) associated to SNP 1, we calculate its similarity score (GENE_ITS(g i , g j )) to every mRNA (g j ) associated to SNP 2 ( g j ∈ G(s 2 ) ), and retain the maximum value among them. This is repeated for each mRNA associated to SNP 1. Then the converse is calculated for each mRNA associated to SNP 2. Then, we calculated the average of all the maximum similarity scores to obtain the similarity of the two SNPs, noted as SNP_ITS(s 1 , s 2 ). The SNP_ITS of two SNPs was formalized as follows: where SNP s 1 was associated to a set of mRNAs G(s 1 ), and |G(s 1 )| is the cardinality of the set G(s 1 ), similarly for s 2 . The SNP_ITS provides a score that ranges from 0 to 1, where SNP_ITS of 0 corresponds to two SNPs with neither similar mRNA annotations nor similar GO annotations. SNP_ITS of 1 corresponds to two SNPs with the same mRNA annotations or the same number of mRNAs each having the same GO annotations.
B) Permutation methods. eQTL associations were selected by cutoffs of p-values between 10 -4 and 10 -6 (p≤10 -4 , p≤5x10 -5 , p≤10 -5 , and p≤10 -6 ) and SNP node degree (ND) threshold between 1 and 5 (ND≥1, ND≥3, and ND≥5). Each eQTL dataset was regarded as a bipartite network consisting of SNP and mRNAs. A conservative permutation resampling that keeps the node degree of each specific SNP and specific mRNA constant (as observed) was conducted for each dataset (100,000 permutations). By this means, the permutation resampling matched the probability of a SNP or mRNA connected in the bipartite (from 1/#SNPs to 1 for SNPs and from 1/#mRNAs to 1 for mRNAs). Three types of empirical permutation were conducted: mRNA overlap, GO biological process, and molecular functions. The one for mRNA overlap was directly permutated from the eQTL dataset with respect to a p-value cutoff and a node degree threshold. For GO biological process and molecular functions, the genes without respective GO annotations were filtered out before permutation, and the subsequent calculation of biological semantic similarity and significance were based on the filtered ones, by which potential biases from incomplete annotation could be controlled. The same set of permutations was used for calculation of GO biological process similarity of SNP pairs and specific overlapping GO biological process terms in the SNP pairs. The same rule was applied to calculations for GO molecular function. The permutations were conducted in supercomputers (specifically beagle) or clusters using MPI parallel computing.
C) Q-Q plot analysis. Quantile-quantile (Q-Q) plot was employed to show the distribution difference of SNP pair measures between the pairs of SNPs associated with the same complex diseases and those with different diseases. Two types of measures were conducted: FDR of mRNA overlap and FDR of mRNA similarity (GO-BP). To show the relative trend for SNP pairs derived from different eQTL strengths, multiple Q-Q plot curves were assembled in one panel, such as different p-value cutoffs of eQTL associations (p≤10 -4 , p≤10 -5 , and p≤10 -6 ), and different SNP node degree (ND) thresholds (ND≥1, ND≥3, and ND≥5). The Q-Q plot curves derived by multiple eQTL cutoffs were generated by the "qqplot" function in R software and the output of the function were extracted and plotted with customized shapes and colors to distinguish these curves. Since the p-values of SNP pairs were derived from empirical permutations (e.g. 100,000 times), they were truncated at the minimal observable p-values (10 -5 ) and manually set as 90% of that (e.g. 9*10 -6 ), with corresponding FDR truncated accordingly. The axes of the figure, each for one group of SNP pairs, were shown in negative log 10 scale (-log 10 FDR) to better visualize the pattern. In addition, Mann-Whitney U test ("wilcox.test" function in R) was performed to evaluate the overall mean between the two groups of SNP pairs of each curve, using FDR values directly rather than negative log values. The p-value result of each U-test was shown along with its correspondingly Q-Q plot curve.

D) Disease Network: assessment of the GO term overlaps significance within a SNP pair.
This method focuses on assessing the significance of an overlap of GO terms within a SNP pair (SNP-GO-SNP triplet). It allows explaining which functional relationships are more likely to relate two SNPs of prioritized SNP pair. The method requires four steps. First, all Lead SNPs in the studied SNP pairs were related to mRNA by eQTL associations from the SCAN database. Second, each of these associated mRNAs was related to each of their corresponding GO terms using the Gene Ontology Annotations (molecular functions (GO-MF) biological processes (GO-BP)). We utilized the fully-derived GO annotation table, where each GO terms is directly associated to mRNAs and all mRNA associations to the ancestral GO terms. GO terms are organized in a hierarchical structure (directed acyclic graph). Third, GO terms were assigned to Lead SNPs via their respective mRNAs. For each SNP of the Lead SNP pair, a list of associated GO terms is calculated. Finally, from these associations, we can straightforwardly identify overlapping GO terms between two Lead SNPs of the prioritized Lead SNP pair. Then, we impute the statistical significance (p-value) of those overlapping GO terms using empirical permutation resamplings (100,000 times) of SNP-mRNA associations based on eQTLs (eQTL p-value cutoff ≤10 -4 for the reported results). Under these permutations, every mRNA is associated to the same number of SNPs and each SNP to the same number of mRNAs (constant node degree, constant number of total mRNA-SNP associations; see above section). However, the association of SNP-GO terms differs from the observed set through resampling. False Discovery Rate (FDR) was used to adjust for multiplicity. These calculations were performed separately for GO-BPs and GO-MFs to derive prioritized Lead SNP pairs and then combined. Each significant SNP-GO-SNP triplet (FDR<0.05) was considered as prioritized and therefore, as a putative shared mechanism for the prioritized SNP pair.  2,358 Lead SNPs associated with 467 diseases in NHGRI GWAS catalog and associated with 6301 mRNA expression levels in B Lymphoblastic cells by eQTL studies were considered, in pairwise, to find common biological mechanisms. B. Hypothesis: Surveyed Lead SNP pairs were then dichotomized into same disease and distinct disease Lead SNP pairs based on NHGRI GWAS catalog and into inter-inter, inter-intra and intra-intra pairs based on dbSNP genomic annotations. C. Mechanism analytics: These Lead SNP pairs (all three types) were prioritized by shared biological mechanisms via mRNA overlap derived from eQTL associations and similarity of biological process and/or molecular function based on gene ontology (GO) annotations of SNP-associated mRNAs. The prioritization was controlled by empirical resampling of SNP-mRNA associations and adjusted for multiple comparisons, in this case with an FDR<5% cutoff. D. Validation of prioritization: Prioritized SNP pairs were assessed for common regulatory properties derived from ENCODE data and were further validated in disease case studies. Note that we focused on SNP pairs with at least one intergenic SNP. Prioritized intergenic-intragenic Lead SNP pairs were found significantly enriched in the same disease than across distinct diseases (odds ratio in y-axis) under various p-value cuto s for eQTL associations (SNP-mRNA) (P-value in x-axis). A one-tailed Fisher's Exact Test (FET) was used to measure the significance of the enrichment. The odds ratios range from 2.3 to 5.8 (3.6x10 -5 ≤ p ≤0.04 for OR ≥ 3.5), 1.4 to 5.2 (4x10 -4 ≤ p ≤0.04 for OR ≥ 3.5), and 1.6 to 3.7 (best p=8.1x10 -7 ) for intergenic-intragenic SNP pairs prioritized by mRNA overlap   Those prioritized at FDR<5% are highlighted. This network translates the mechanistic map at a single disease level to reflect relationships between different biological scales and across Lead SNPs: from prioritized Lead SNP pairs and their eQTL-associated mRNAs to their associated disease-mechanisms (GO-MF and GO-BP). The network was visualized using Cytoscape (Methods and Supplemental Methods). The pairwise matrix (bottom) indicates each surveyed SNP pairs among those that were found prioritized at FDR<5% by mRNA overlap (purple square), by molecular function similarity (green square), and by biological similarity (orange square). The non-prioritized Lead SNP pairs are indicated by a grey square. The similarity between GO BP terms that share many genes and are hierarchically related is indicated by dotted lines. Computation of similarity is conducted by information theoretic distance (Methods).

ChIA-PET
14 A Prioritized SNP-pairs by mRNA overlap Prioritized inter-inter and inter-intra SNP pairs were found significantly enriched in the same disease than across distinct diseases (odds ratio in y-axis) under various p-value cutoffs for the eQTL association (SNP-mRNA) dataset (P-value in x-axis). A one-tailed Fisher's Exact Test (FET) was used to measure the significance of the enrichment. The odds ratios range from 1.8 to 9.7 (1.3x10 -5 ≤ p ≤0.02), 1.3 to 9.2 (1.1x10 -4 ≤ p ≤0.02 for OR ≥1.4), and 1.4 to 4.5 (best p=1.4x10 -5 ) for inter-inter and inter-intra SNP pairs prioritized by mRNA overlap (A), molecular function similarity (B), and biological process similarity (C), respectively. This demonstrated that enrichment of common biological mechanisms among Lead SNPs associated with the same disease (Figure 3) was an intrinsic property of the SNPs rather than the result of the linkage disequilibrium chosen. Supplementary Figure 9. Enrichment of shared biological mechanisms among inter-inter and inter-intra SNP pairs associated with the same disease using liver-derived eQTL associations. Two biological mechanisms were imputed for each SNP pair with LD r 2 <0.8: mRNA overlap (A) and biological processes (B). The two mechanisms were derived from liver eQTL data, only p≤10 -5 associations were available (truncated part indicated in grey regions), and various cutoffs were shown in x-axis by a log scale. SNP pairs were prioritized and controlled empirically (100,000 permutation resampling; FDR<0.05). Prioritized inter-inter and inter-intra SNP pairs were found significantly enriched in the same disease than across distinct diseases (odds ratio in y-axis) under various p-value cutoffs for the eQTL association (SNP-mRNA) dataset (P-value in x-axis). A one-tailed Fisher's Exact Test (FET) was used to measure the significance of the enrichment. The odds ratios range from 1.4 to 3.8 (6.1x10 -3 ≤ p ≤0.05 for OR≥1.9) and 2.4 to 5.9 (1.5x10 -12 ≤ p ≤0.04 for OR≥2.7) for inter-inter and inter-intra SNP pairs prioritized by mRNA overlap (A) and biological process similarity (B), respectively. These results demonstrated that the enrichment of common biological mechanisms among SNPs associated to the same disease (Figure 3) could be extended to other tissues or cell lines derived eQTL association dataset beyond LCL.

plots indicate that inter-inter and inter-intra SNP pairs associated with the same disease showed significantly different distributions of their shared mechanisms compared to those associated across distinct diseases.
In all four panels, the left skewed Q-Q plots indicate that, in each quartile, the "same disease" distribution contains more significant pvalues than the "across distinct disease" distribution. (A) and (B). We generated Quantile-Quantile plots (Q-Q plots) to investigate two distributions of the significance of mRNA overlap between Lead SNP pairs: SNP pairs of the same disease vs. those of distinct diseases. (C) and (D). We generated the Q-Q plots to examine the distributions of Gene Ontology Biological Process similarity (GO-BP). In (A) and (C), we calculated the Q-Q plots according to three different p-value cutoffs of eQTL associations (Methods, Supplementary Methods).
In (B) and (D), we calculated the Q-Q plots according to three different SNP node degree cutoffs (Methods, Supplementary Methods). Also, we calculated the Mann-Whitney U tests for each of the three curves in each panel (one-sided; shown alongside and color-coded according to each Q-Q plot curve). The significance of overlapping mRNAs and mRNA biological similarity of a Lead SNP pair was calculated empirically from 100,000 conservative permutations of the LCL eQTL associations and was adjusted by false discovery rate (overlap FDR or ITS FDR, shown in log scale in axis) (Methods). Of note, the horizontal plateau of p-values are attributable to data being truncated at p=10 -5 (related to the number of permutations). SNP pairs with linkage disequilibrium r 2 ≥0.01 were excluded from this study.
Other cutoffs for eQTL data led to similar results (data not shown).
18 Enrichment (odds ratio) Baseline (odds ratio=1, NS)  Figure 3 for complete details. Prioritized inter-inter and inter-intra Lead SNP pairs were significantly enriched for biomodule similarity for increasing levels of eQTL association (SNP-mRNA) stringency. Enrichment was calculated "within disease" versus across "distinct diseases" using a one-tailed Fisher's Exact Test (FET). Comparing with that in Figure 3, higher odds ratio were obtained. The odds ratios range from 3.0 to 25.4 (1.1x10 -12 ≤ p ≤1.6x10 -4 ), 2.4 to 23.7 (1.7x10 -11 ≤ p ≤3.2x10 -4 ), and 2.8 to 13.9 (1.8x10 -15 ≤ p ≤5.9x10 -3 ) for inter-inter and inter-intra SNP pairs prioritized by mRNA overlap (A), molecular function similarity (B), and biological process similarity (C), respectively. This demonstrated that enrichment of common biological mechanisms among Lead SNPs associated with the same disease (Figure 3) was an intrinsic property of the SNPs rather than the choice of a specific human reference genome annotation and gene definition. Figure 12. GWAS input covariates contributing to the interpretation of study results. The number of prioritized Lead SNP pairs within a disease is modestly correlated with the total number of SNPs associated by GWAS to a disease (A). Similarly, the proportion (percent) of prioritized Lead SNPs associated by GWAS to a disease is also slightly correlated with the total number of SNPs associated by GWAS to a disease -though this may be more complex (bimodal distribution) as the plot shows a smaller subset of anticorrelated patterns (B). Diseases overrepresented in GWAS studies are also overrepresented in our results (C   Figure 5 using a more stringent LD cutoff of r 2 <0.01. ENCODE data were used to assess the propensity of prioritized inter-inter and inter-intra Lead SNP pairs to localize in regulatory regions with (A) the same TF(s) via ChIP-seq, (B) two distinct interacting TFs (ChIP-seq and protein-protein interactions, PPI), and (C) long-range chromatin interaction properties (ChIA-PET). Enrichment of inter-inter and inter-intra Lead SNP pairs (odds ratios with 95% confidence intervals, y-axis) in regions sharing common regulatory properties were evaluated between (i) prioritized and non-prioritized Lead SNP pairs (Panel I), (ii) prioritized Lead SNP pairs in the same disease and across-diseases (Panel II). Greater ORs are observed in disease-specific SNP pairs (Panel II compared to Panel I); ORs range from 1.2 to 7129.2 (2x10 -16 ≤ p ≤0.02) in Panel I and 1.4 to 14140.2 (1.6x10 -31 ≤ p ≤0.05; one exception) in Panel II. The odds ratios are comparable to those yielded by inter-inter and inter-intra SNP pairs of LD r 2 <0.08. Candidate inter-inter and inter-intra SNPs considered for the enrichments were associated with mRNAs by eQTL with p ≤10 -4 (mRNA overlap; grey bars). Stringent prioritizations using empirical computations were performed on mRNA overlap (mauve bars), biological process similarity (green bars), molecular function similarity (orange bars), and in combination (merged methods; yellow bars). Enrichments of SNP pairs were performed using Fisher's exact test among all pairwise combinations of NHGRI disease-associated SNPs. Potential causal SNPs represented by the Lead SNPs in the pairs were included in this regulatory function study and were taken from RegulomeDB (Materials and Methods).