Predicting gene targets from integrative analyses of summary data from GWAS and eQTL studies for 28 human complex traits

Genome-wide association studies (GWAS) have identified hundreds of genetic variants associated with complex traits and diseases. However, elucidating the causal genes underlying GWAS hits remains challenging. We applied the summary data-based Mendelian randomization (SMR) method to 28 GWAS summary datasets to identify genes whose expression levels were associated with traits and diseases due to pleiotropy or causality (the expression level of a gene and the trait are affected by the same causal variant at a locus). We identified 71 genes, of which 17 are novel associations (no GWAS hit within 1 Mb distance of the genes). We integrated all the results in an online database (http://www.cnsgenomics/shiny/SMRdb/), providing important resources to prioritize genes for further follow-up, for example in functional studies.


Background
Genome-wide association studies (GWAS) have identified thousands of genetic loci associated with various complex traits, disorders, and diseases [1,2]. The GWAS paradigm exploits the linkage disequilibrium (LD) correlation structure of the genome, which means that the majority of the variation in the genome can be captured in a cost-effective way by genotyping only a few hundred thousand variants, followed by imputation of non-genotyped variants using a densely genotyped reference panel [3]. However, the LD structure also means that identified associations frequently point to genomic regions that harbor many genes, and it is extremely difficult to prioritize among these genes to identify the most functionally relevant genes using GWAS data alone. Laboratory-based follow-up of the associated regions is costly and prohibitive given the number of putatively causal variants in a typical genome-wide significant locus. GWAS of gene expression levels has allowed identification of expression quantitative trait loci (eQTL) [4][5][6]. Several recent methods [7][8][9][10][11] have used analytical approaches to integrate eQTL and complex trait associations as strategies to prioritize genes for further studies. In this study, we apply the recently developed summary data-based Mendelian randomization (SMR) method to 28 complex traits (including diseases), which have GWAS summary statistics available in the public domain, to obtain a list of genes to prioritize for further follow-up such as functional studies, and develop a database to query all the data and results. We use the SMR method because: it implements a transcriptome-wide association analysis in a formal statistical framework using summary data so that the statistical power is increased by using the latest GWAS and eQTL data of very large sample size; it provides a test to distinguish pleiotropy (or causality) from linkage (see below for more details) [11]; and it is implemented in a user-friendly software tool [12,13].

Construction and content
Details of the SMR method can be found in the Zhu et al. paper [11]. In brief, SMR applies the principles of Mendelian randomization (MR) to jointly analyze GWAS and eQTL summary statistics in order to test for association between gene expression and trait due to a shared causal variant at a locus. Mendelian randomization is an instrumental variable analysis approach that uses genetic variant(s) as instrumental variable(s) (Z) to test whether an exposure (X) has a causal effect on an outcome (Y) [14,15]. Equivalently, it is an analysis to test whether the effect of Z on Y is mediated by X (a model of Z -> X -> Y). The instrumental variable estimate of the effect of X on Y (b XY ) can be expressed as b XY = b ZY /b ZX , where b ZY is the effect size of Z on Y and b ZX is the effect size of Z on X [16]. This approach is usually used to test for the causative effect of a modifiable risk factor on health outcomes but the same principle can be used to test whether the effect size of a SNP (Z) on a trait (Y) identified from GWAS is mediated by the expression level of a gene (X). The SMR test [11] is a two-sample MR approach [17,18]. It allows us to estimate and test b XY using summary data from independent studies [11]. For the purpose of testing for the association between gene expression and trait, it uses the estimate of SNP effect on the trait (b ZY ) from GWAS summary data and the estimate of SNP effect on gene expression (b ZX ) from summary data of an independent eQTL study. In this case, trait is the outcome (Y), gene expression is the exposure (X), and the top cis-eQTL that is strongly associated with gene expression is used as the instrument (Z) (we used cis-eQTL with P eQTL <5e-8 in this study). Here we use "association" rather than "causal association" because previous results [11] suggest that there are at least three models consistent with a significant association from the SMR test using only a single genetic variant. These models are causality (Z -> X -> Y), pleiotropy (Z -> X and Z -> Y), and linkage (Z 1 -> X, Z 2 -> Y, and Z 1 and Z 2 are in LD). We provide details below of a test to distinguish pleiotropy (or causality) from linkage that is of less biological interest. The purpose of this study is to identify genes whose expression levels are associated with complex traits due to a shared causal variant. We therefore do not further distinguish between causality and pleiotropy (which is also impossible to achieve using only the cis-eQTLs).
As mentioned above, significant SMR results could also reflect linkage (i.e. the top associated cis-eQTL being in LD with two distinct causal variants, one affecting gene expression and the other affecting trait variation), which may be of less interest, at least in the first round of gene prioritization. To exclude SMR results that may reflect linkage, Zhu et al. [11] proposed the heterogeneity in dependent instruments (HEIDI) test, which considers the pattern of associations using all the SNPs that are significantly associated with gene expression (eQTLs) in the cis-region. The null hypothesis is that there is a single causal variant affecting trait and gene expression (pleiotropy or causality), which is of biological interest and should be prioritized for follow-up studies. The alternative hypothesis is that gene expression and trait are affected by two distinct causal variants, which is of less biological interest. Under the null hypothesis that there is a single causal variant, b XY estimated at any of the cis-SNPs that are associated with gene expression (e.g. SNPs with P eQTL <1.6 × 10 −3 , equivalent to χ 2 > 10) is expected to be equal to that estimated at the top associated cis-eQTL (see Equation  7 of Zhu et al. [11] for more details). Therefore, it is equivalent to test whether there is heterogeneity in b XY estimated at the significant cis-eQTLs (null hypothesis: no heterogeneity, causality or pleiotropy model; alternative hypothesis: heterogeneity, linkage model). Note that the HEIDI test takes into account non-independence of cis-eQTLs due to LD (using individual-level data from a reference sample to estimate LD between the cis-SNPs). Probes that show evidence of heterogeneity (e.g. P HEIDI <0.05) are rejected.
The previous SMR study analyzed three traits (body mass index (BMI), height, and waist-to-hip ratio adjusted by BMI) and two diseases (rheumatoid arthritis and schizophrenia) and identified 21 novel genes (genes that passed the SMR and HEIDI tests and that are located >1 Mb from the nearest GWAS hit) [11]. In this study, the SMR analysis is extended to an additional 28 complex traits and diseases (Table 1) which have summary data available in the public domain from largescale GWAS. The results from the SMR analyses are made available in an online query database (http:// www.cnsgenomics.com/shiny/SMRdb/) [13], which is implemented in R Shiny.

Utility and discussion
After quality control (QC) steps [11], associations between 5967 probes and 757,479 SNPs from the blood gene expression study by Westra et al. [5] were used in the analysis. The Westra eQTL summary data are available in the public domain and on the SMR website [12]. It should be noted that all the probes included in the analysis have at least a cis-eQTL at P eQTL <5 × 10 -8 . For each probe, the top associated cis-eQTL was used as the instrument for the SMR test. The SMR test was performed for each of the 5967 probes on 28 traits and disorders/diseases (Additional file 1: Table S1). The genome-wide significance level for the SMR test, corrected for multiple testing, is defined as 0.05/5967 = 8.4 × 10 -6 . For probes with P SMR <8.4 × 10 -6 , we conducted the HEIDI test and retained for further investigation only those probes with little evidence of heterogeneity P HEIDI ≥0.05. All the analyses were performed using the SMR software tool [11,12]. We particularly emphasized results that are considered to be novel, i.e. no previously identified SNP, reported as genome-wide significant in the primary GWAS paper, within a 1 Mb window of the probes. We identified 247 gene-trait associations (271 probes) with P SMR <8.4 × 10 -6 (Additional file 1: Table S2). After application of the HEIDI test (P HEIDI ≥0.05), this was reduced to 71 gene-trait associations (77 probes) (Additional file 1: Table S3). Of these, 17 gene-trait associations were considered novel (Table 2 and Additional file 1: Table S4).
There were 15 genes associated with more than one trait or disease (Additional file 1: Table S5). Where a gene was associated across more than one trait, there was a strong correlation between the traits, with only two cross trait associations being between disparate traits or diseases. Crohn's disease (CD) and ulcerative colitis (UC) are chronic gastrointestinal disorders that represent as intestinal inflammation; collectively they are known as inflammatory bowel disease (IBD). GWAS to date have identified 200 loci associated with IBD [19], 71 with CD [20], and 47 with UC [21], as well as evidence for transancestry shared genetic risk for IBD [19]. The SMR analyses predicted ten gene targets for a combination of IBD, CD, and UC (Additional file 1: Table S6), of which four were novel gene associations (in total there were two novel gene associations for CD and three each for IBD and UC). The other traits that shared gene associations were the lipids, i.e. high-density lipoprotein (HDL), lowdensity lipoprotein (LDL), and total cholesterol (TC) (Additional file 1: Table S7).
The results from this analysis can be queried and viewed in the online application [13]. Results from the initial Zhu et al. study are also included in this database. We intend that as more GWAS summary data becomes available, SMR analysis will be conducted using the summary data and the results database will be updated accordingly. This application enables users to query the database by trait, gene, or both and apply thresholds  P eQTL p value of the top associated cis-eQTL of the probe, P GWAS GWAS p value of the top cis-eQTL, P SMR p value for gene-trait association from the SMR test, P HEIDI p value from HEIDI test to indicate whether the gene-trait association is due to a single shared genetic variant (the smaller P HEIDI the more likely that there are more than one genetic variant) based on the p value from the SMR method and the HEIDI test. In addition, Manhattan plots are given based on the p value from the SMR analysis and regional association plots are provided for those probe-trait associations that pass both the SMR and HEIDI tests.
Conclusion SMR, as indicated by the results, provides a means of using summary statistics from GWAS and eQTL data to prioritize likely functionally relevant genes within previously identified regions of association and in some cases identify novel gene associations.