Prioritizing putative influential genes in early life cardiovascular disease susceptibility by applying tissue-specific Mendelian randomization

Background The extent to which changes in gene expression can influence cardiovascular disease risk across different tissue types has not yet been systematically explored. We have developed an analytical framework that integrates tissue-specific gene expression, Mendelian randomization and multiple-trait colocalization to develop functional mechanistic insight into the causal pathway from genetic variant to complex trait. Methods We undertook a transcriptome-wide association study in a population of young individuals to uncover genetic variants associated with both nearby gene expression and cardiovascular traits. Two-sample Mendelian randomization was then applied using large-scale datasets to investigate whether changes in gene expression within certain tissue types may influence cardiovascular trait variation. We subsequently performed Bayesian multiple-trait colocalization to further interrogate findings and also gain insight into whether DNA methylation, as well as gene expression, may play a role in disease susceptibility. Results Eight genetic loci were associated with changes in gene expression and early life measures of cardiovascular function. Our Mendelian randomization analysis provided evidence of tissue-specific effects at multiple loci, of which the effects at the ADCY3 and FADS1 loci for body mass index and cholesterol respectively were particularly insightful. Multiple trait colocalization uncovered evidence which suggested that changes in DNA methylation at the promoter region upstream of FADS1/TMEM258 may also play a role in cardiovascular trait variation along with gene expression. Furthermore, colocalization analyses were able to uncover evidence of tissue-specificity, most prominantly between SORT1 expression in liver tissue and cholesterol levels. Conclusions Disease susceptibility can be influenced by differential changes in tissue-specific gene expression and DNA methylation. Our analytical framework should prove valuable in elucidating mechanisms in disease, as well as helping prioritize putative causal genes at associated loci where multiple nearby genes may be co-regulated. Future studies which continue to uncover quantitative trait loci for molecular traits across various tissue and cell typse will further improve our capability to understand and prevent disease.


Abstract
Background: The extent to which changes in gene expression can influence cardiovascular disease risk across different tissue types has not yet been systematically explored. We have developed an analytical framework that integrates tissue-specific gene expression, Mendelian randomization and multiple-trait colocalization to develop functional mechanistic insight into the causal pathway from genetic variant to complex trait.

Methods:
We undertook a transcriptome-wide association study in a population of young individuals to uncover genetic variants associated with both nearby gene expression and cardiovascular traits. Two-sample Mendelian randomization was then applied using large-scale datasets to investigate whether changes in gene expression within certain tissue types may influence cardiovascular trait variation. We subsequently performed Bayesian multiple-trait colocalization to further interrogate findings and also gain insight into whether DNA methylation, as well as gene expression, may play a role in disease susceptibility.
Results: Eight genetic loci were associated with changes in gene expression and early life measures of cardiovascular function. Our Mendelian randomization analysis provided evidence of tissue-specific effects at multiple loci, of which the effects at the ADCY3 and FADS1 loci for body mass index and cholesterol respectively were particularly insightful.
Multiple trait colocalization uncovered evidence which suggested that changes in DNA methylation at the promoter region upstream of FADS1/TMEM258 may also play a role in cardiovascular trait variation along with gene expression. Furthermore, colocalization Introduction 1 Despite recent efforts in research and development, cardiovascular disease still poses 2 one of the greatest threats to public health throughout the world, accounting for more 3 deaths than any other cause [1]. Since their development, genome-wide association 4 studies (GWAS) have identified thousands of different genetic loci associated with 5 complex disease traits [2]. An example of their successful application within 6 cardiovascular research is the identification of numerous genetic variants associated with 7 low density lipoprotein (LDL) cholesterol levels [3], which is a causal mediator along the 8 coronary heart disease progression pathway [4,5]. However, the functional and clinical 9 relevance for the vast majority of GWAS results are still unknown, emphasizing the 10 importance of developing our understanding of the causal pathway from single nucleotide 11 polymorphism (SNP) to disease. changes in gene regulation [7]. Recent efforts have incorporated messenger ribonucleic 16 acid (mRNA) expression data into analyses to determine whether SNPs identified by 17 GWAS influence levels of gene expression (i.e. whether they are expression quantitative 18 trait loci [eQTL]) as well as complex traits [8]. Novel methods have integrated eQTL data 19 with summary association statistics from GWAS [9] to identify genes whose nearby (cis) 20 regulated expression is associated with traits of interest (widely defined as variants within 21 1 megabase (Mb) on either side of a genes transcription start site [TSS]) [10]. These types 22 of studies have been referred to as transcriptome-wide association studies (TWAS). 23 24 A recent paper has highlighted some limitations that may be encountered by studies 25 integrating transcriptome data to infer causality [11], such as intra-tissue variability and 26 co-regulation amongst proximal genes, making it challenging to disentangle putative 27 causal genes for association signals. This exemplifies the importance of developing 28 methods that investigate tissue-specificity and co-regulation of association signals 29 detected by TWAS. Therefore, there needs to be further research into the most 30 appropriate manner to harness eQTL data (across multiple tissue and cell types) in order 31 to improve the biological interpretation of GWAS findings. 32 33 We have developed a systematic framework which can be used to evaluate five potential 34 scenarios that can help explain findings from TWAS ( Figure 1). Firstly, we identify putative 35 causal genes responsible for observed association signals, by evaluating the association 36 between lead SNPs and proximal gene expression using eQTL data from the 37 Framingham Heart Study (n=5,257) [8]. We then investigate the relationship between 38 gene expression and complex traits at loci of interest by applying the principles of 39 Mendelian randomization (MR); a method which uses genetic variants associated with an 40 exposure as instrumental variables to infer causality among correlated traits [12,13]. A 41 recent development in this paradigm is two-sample MR, by which effect estimates on 42 exposures and outcomes are derived from two independent datasets, allowing 43 researchers to exploit findings from large GWAS consortia [14]. Applying this approach 44 can therefore be used to help infer whether changes in gene expression (our exposure) 45 may influence a complex trait identified by GWAS (our outcome). Furthermore, as tissue-46 inference in this field, as it suggests there may be underlying mechanisms which are 70 consistent with causality (i.e. DNA methylation acting as a transcriptional repressor). 71 However, a major challenge in this paradigm is the lack of accessible tissue-specific DNA 72 methylation/mQTL data akin to GTEx for gene expression. Previous studies have 73 investigated the potential mediatory role of DNA methylation between genetic variant and 74 gene expression using eQTL and mQTL data derived from blood which may act as a 75 proxy for other tissue types [18][19][20]. Moreover, other studies have demonstrated a 76 surprisingly high rate of replication between mQTL derived from blood and more relevant 77 tissue types for a complex trait of interest [21]. We have therefore undertaken moloc 78 analyses using eQTL derived from both blood and cardiovascular-specific tissue types. 79 Finally, it is also important to note that, along with other approaches which apply causal 80 methods to molecular data, we are currently unable to robustly differentiate mediation 81 from horizontal pleiotropy (scenario 5) [12,22]. However, within this framework we will be 82 able to accommodate additional eQTL as instrumental variables derived from future larger 83 studies in order to address this. 84

85
In this study, we demonstrate the value of our framework by applying it to data from the 86 Avon Longitudinal Study of Parents and Children (ALSPAC) using early life measures of 87 cardiovascular function as outcomes. Evaluating putative causal mechanisms apparent 88 early in the life course can be extremely valuable for disease prevention and healthcare, 89 particularly given that cardiovascular disease such as atherosclerosis has been shown to 90 develop in childhood [23]. Therefore, we used ~19,000 cis-eQTL's observed in adults at 91 risk of cardiac events from the Framingham Heart Study [8] for our TWAS to ascertain 92 whether they influence these cardiovascular traits in young individuals (age  10). We 93 have further evaluated results using our framework by harnessing summary statistics 94 from large-scale GWAS to demonstrate the value of our approach and validate findings 95 in independent samples. 96 97

The Avon Longitudinal Study of Parents and Children (ALSPAC)
99 Detailed information about the methods and procedures of ALSPAC is available 100 elsewhere [24][25][26]. In brief, ALSPAC is a prospective birth cohort study which was 101 devised to investigate the environmental and genetic factors of health and development. 102 In total, 14,541 pregnant women with an expected delivery date of April 1991 and 103 December 1992, residing in the former region of Avon, UK were eligible to take part. 104 Participants attended regular clinics where detailed information and bio-samples were 105 obtained. The study website contains details of all the data that is available through a fully 106 searchable data dictionary [27]. All procedures were ethically approved by the ALSPAC 107 ethics and Law Committee and the Local Research Ethics Committees. Written informed 108 consent was obtained from all participants. 109 110

Genetic data 111
All children were genotyped using the Illumina HumanHap550 quad genome-wide SNP 112 genotyping platform. Samples were removed if individuals were related or of non-113 European genetic ancestry. Imputation was performed using Impute V2.2.2 against a 114 reference panel from 1000 genomes [28] phase 1 version 3 [29]. After imputation, we 115 filtered out variants and kept those with an imputation quality score  0. 8  The methods and procedures to acquire data for the 14 phenotypes analyzed in this study 120 are as follows. All measurements were obtained at the ALSPAC clinic. Height and weight 121 were measured at age 7 (mean age: 7.5, range: 7.1-8. interleukin 6 (IL-6) have been described previously [31]. 131

132
The Framingham Heart Study 133 We identified over 19,000 pruned lead cis-eQTLs from Joehanes et al [8] who provide in-134 depth details of the Framingham Heart study and their analysis plan in their paper. Trans-135 eQTLs were not considered for our analysis to reduce the likelihood of horizontal 136 pleiotropy influencing our findings and also to reduce the burden of multiple testing [32]. 137 This eQTL data was chosen for the initial analysis in ALSPAC due to the larger sample 138 size of transcriptome data from the Framingham Heart Study (n=5,257) using whole blood 139 in comparison to GTEx sample sizes for other tissue types. This allowed us to maximise 140 statistical power to detect association signals which we were then subsequently able to 141 evaluate in detail using data from other tissue types. Sample sizes vary between tissues, thus affecting statistical power to identify eQTL. In 148 depth information on the materials and methods for GTEx is available in the latest 149 publication [15]. In short, RNA sequencing samples were sequenced to a median depth 150 of 78 million reads. This is suggested to be a credible depth to quantify accurately genes 151 that may have low expression levels [33]. DNA was genotyped at 2.2 million sites and 152 imputed to 12.5 million sites. We used GTEx eQTL data in all downstream analysis 153 following the discovery analysis in ALSPAC (i.e. Mendelian randomization and multiple-154 traits. We applied a Bonferroni correction to account for multiple testing which equated to 163 0.05/the total number of tests undertaken. Using a script derived from the qqman R 164 package [36], results were plotted using a Manhattan plot. We undertook fine mapping 165 across the region 1Mb either side of each lead SNP identified from our TWAS using 166 FINEMAP [37] software. We used the default setting which outputs a maximum of 5 167 putative causal variants. 168 169

Tissue-specific Mendelian randomization analysis 170
To investigate potential causal genes at association signals detected in our TWAS, we 171 applied the principles of MR using the wald method [38] (Additional File 1: Figure S1) to 172 assess whether changes in tissue-specific gene expression (eQTLs as instrumental 173 variables) may be responsible for effects on associated traits. Furthermore, it can help 174 discern whether multiple proximal genes at a region are contributing to trait variation or 175 whether they are likely just co-regulated with causal genes in accessible tissue types such 176 as whole blood, i.e. scenario 3. Firstly, for each lead eQTL from the TWAS we used 177 tissue-specific data from GTEx to discern whether they were cis-eQTL for genes in tissue 178 types which may play a role in the pathology of cardiovascular disease (P < 1 x 10 -4 ). If 179 this was not possible then we used eQTL for all genes within a 1MB distance of the lead 180 eQTL. The tissue types evaluated were; adiposesubcutaneous, adiposevisceral 181 (omentum), liver, pancreas, arterycoronary, arteryaorta, heartatrial appendage 182 and heartleft ventricle. The mean donor age for all tissues included in this analysis 183 resided in the range of 50-55 years. In addition to this, we ran an additional analysis for 184 the association with BMI but investigating effects in the following brain tissues: pituitary, 185 anterior cingulate cortex (BA24) and frontal cortex (BA9). 186

187
For this analysis, we used data from large-scale GWAS; A full list of these with details 188 can be found within additional file 2 (Table S2) [39][40][41]. We then undertook a validation 189 analysis using our ALSPAC data. As cardiovascular trait data is therefore obtained at an 190 earlier stage in the life course compared to the tissue-specific expression data, any 191 associations detected in the validation analysis suggest genetic liability to cardiovascular 192 risk via changes in gene expression. These analyses were undertaken using the  Base platform [42]. The only trait we were unable to assess in our analysis was 194 interleukin-6, due to the lack of GWAS summary statistics for this trait. Nonetheless, we 195 still performed MR for the IL-6 data we possessed in ALSPAC. We applied a multiple 196 testing threshold to the MR results to define significance (p<0.05/54). We plotted the 197 results from the validation analysis using volcano plots from the ggplot2 package in R 198 [43]. We also applied the Stieger directionality test [44] to discern whether our exposure 199 (i.e. gene expression) was influencing our outcome (i.e. our complex trait) as opposed to 200 the opposite direction of effect. 201 202

Multiple-trait colocalization (moloc) 203
Blood samples were obtained from 1,018 ALSPAC mothers as part of the accessible 204 resource for integrated epigenomics studies (ARIES) [45] from the 'Focus on Mothers 1' 205 time point (mean age = 47.5). Epigenome-wide DNA methylation was derived from these 206 samples using the Illumina HumanMethylation450 (450K) BeadChip array. From this 207 data, we obtained effect estimates for all genetic variants within a 1MB distance of lead 208 eQTL from the TWAS and proximal CpG sites (again defined as < 1MB). We then used 209 the moloc [16] method to investigate 2 questions: 210 1) Is the same underlying genetic variant influencing changes in both proximal gene 211 expression and cardiovascular trait (i.e. investigating scenario 4) 212 2) Does the genetic variant responsible for these changes also appear to influence 213 proximal DNA methylation levels, suggesting that changes in this molecular trait may also 214 play a role along the causal pathway to disease. 215 As such, at each locus we applied moloc using genetic effects on 2 different molecular 216 phenotypes (gene expression and DNA methylation (referred to as eQTL and mQTL 217 respectively) along with the associated cardiovascular trait from our GWAS summary 218 statistics. Since we included three traits (i.e. gene expression, DNA methylation and 219 cardiovascular trait), moloc computed 15 possible configurations of how the traits are 220 shared: detailed information on how these are calculated can be found in the original 221 moloc paper [16]. For each independent trait-associated locus, we extracted effect 222 estimates for all variants within 1MB distance of the lead TWAS hit, for all molecular 223 phenotypes and relevant cardiovascular GWAS traits. We subsequently applied moloc in 224 a gene-centric manner, by mapping CpG sites to genes based on the 1MB regions either 225 side of our TWAS hit. Moloc was subsequently applied to all gene-CpG combinations 226 within each region of interest. We ran this analysis twice, once using expression data from 227 whole blood and again using expression data from a tissue type which was associated 228 with the corresponding trait in the tissue-specific MR analysis (Additional file 2: Table S3). 229 Only regions with at least 50 SNPs (MAF >= 5%) in common between all three datasets 230 (i.e. gene expression, DNA methylation and cardiovascular trait) were assessed by moloc 231 based on recommendations by the authors. We computed summed PPAs for all 232 scenarios where GWAS trait and gene expression colocalized. When summed PPAs 233 were >= 80%, we reported findings as evidence that genetic variation was influencing 234 cardiovascular traits via changes in gene expression. Furthermore, when summed PPAs 235 relating to DNA methylation were >=80%, there was evidence that DNA methylation may 236 also reside on the causal pathway to complex trait variation via changes in gene 237 expression. In all analyses we used prior probabilities of 1e-04, 1e-06, 1e-07 and 1e-08 238 as recommended by the developers of moloc based on their simulations [16].  Of the 3 cis-and potentially causal genes for this signal, only ADCY3 provided strong 267 evidence of being the putative causal gene in two types of adipose tissue (adipose 268 subcutaneous (P = 6.8 x 10 -40 ) and adipose visceral (P = 3.1 x 10 -48 )) ( Figure 3a). This 269 suggests that changes in ADCY3 expression in adipose tissue could influence BMI levels. 270 In contrast, there was a lack of evidence that changes in NCOA1 expression in the 271 analyzed tissue types influence BMI. We were unable to undertake MR of CENPO 272 expression in this analysis as were unable to harmonise effect estimates between 273 exposure and outcome. As an additional analysis, we repeated the MR on BMI using artery aorta (P = 5.8 x 10 -10 ), heartatrial appendage (P = 6.3 x 10 -5 ) and pancreas (P = 288 6.3 x 10 -5 )). The most parsimonious explanation may be that multiple genes at this locus 289 influence cholesterol levels, however further analyses are required to robustly 290 differentiate between scenarios 2 and 3 here (Figure 1). 291 292 At other loci evaluated (Additional File 1: Figure's S3-S9), LPL showed evidence of 293 association with triglycerides in a single tissue (adipose subcutaneous (P = 9.6 x10 -168 )) 294 implying that this effect may be more tissue-specific compared to those observed at other 295 loci in this study (Additional file 1: Figure's S8 & S9, Additional file 2: Tables S14 & S15). 296 On chromosome 1, there was strong evidence that gene expression in liver influences 297 total cholesterol (Additional file 1: Figure S6) and LDL (Additional file 1: Figure S7) (p < 298 3.22x10 -120 ). However, this was observed for all three genes in the region (SORT1, 299 CELSR2 and PSRC1). In these analyses alone, we were unable to determine whether a 300 particular gene is driving this observed effect, with the other proximal genes being co-301 regulated, or whether there are multiple causal genes for these traits (i.e. scenario 2). 302 However, evidence from the literature implicates SORT1 as the most likely causal gene 303 for this association signal [11,46]. Our MR results from ALSPAC provided evidence 304 between ABO expression and IL-6 in 4 different tissues (Additional file 2: Table S12). 305 Although, caution is required when interpreting this signal based on previous evidence 306 across a diverse range of traits [47]. Finally, to test the direction of effect at each locus 307 (i.e. are changes in gene expression causing changes in trait or vice versa), we ran a 308 causal direction test [44]. In all scenarios, the test provided evidence that gene expression 309 influences traits at these loci rather than the opposite direction of effect (Additional file 2: 310 Tables S5-S15). 311 312 Ascertaining whether DNA methylation resides on the causal pathway to 313 disease 314 We identified evidence of colocalization (PPA ≥ 0.8) for 7 unique genes across 5 loci 315 across various tissue types (Additional file 2: Tables S16-S20). Building upon results from 316 the tissue-specific MR analysis, we found strong evidence that ADCY3 is the functional 317 gene for the BMI associated signal on chromosome 2 (maximum PPA of 0.99 between 318 gene expression and BMI). We identified evidence of colocalization between BMI and 319 ADCY3 expression In both whole blood and subcutaneous adipose tissue. There was 320 also evidence that distributions between DNA methylation at cg04553793 (at the 321 promoter region of ADCY3) colocalized with BMI and ADCY3 expression in whole blood 322 (PPA = 0.88). However, the lead mQTL for this observed effect (rs13401333) was not 323 correlated with the lead eQTL and GWAS hit (rs6745073, r 2 =0.02), which suggests that 324 in-depth analysis with multiple tissue types is necessary to confirm whether DNA 325 methylation influences disease suscepbility at this locus. 326

327
There was also evidence that changes in DNA methylation at a CpG site in the promoter 328 region for FADS1 (cg19610905) colocalized with total cholesterol variation. There was 329 evidence of colocalization for all 3 traits using gene expression for TMEM258 (PPA=0.85) 330 (Figure4a), where the lead GWAS variant (rs174568) and mQTL were in perfect LD 331 (rs1535, r 2= 1). This effect was only observed in whole blood. Evidence of colocalization 332 between all three traits using FADS1 expression narrowly missed the cut-off (PPA=0.77). 333 Finally, we found limited evidence that changes in DNA methylation at this CpG site 334 colocalized with FADS2 expression, although as with the previously evaluated locus, this 335 was not surprising given that cg19610905 is located downstream of FADS2. Gene 336 expression of TMEM258 in whole blood was negatively associated with DNA 337 methylation at cg19610905. The directionality test suggested that DNA methylation 338 influences TMEM258 expression at this locus rather than the opposite direction of effect 339 (P<1 x 10 -16 ). 340

341
We did not identify evidence in the colocalization analysis suggesting that DNA 342 methylation plays a role in trait variation at the SORT1 region. However, there was 343 evidence of tissue specificity in liver tissue which supports evidence identified in our MR 344 analysis. The first plot in Figure 4b illustrates how effects on SORT1 gene expression and 345 total cholesterol at this region colocalizes in liver tissue. In contrast, the neighbouring plot 346 depicts the same analysis but in whole blood, whereby no evidence of colocalization was 347 detected. Furthermore, we see the same tissue-specific colocalization for the effect on 348 ApoB in the same region (Additional file 2: Table S16). The CELSR2 gene showed similar 349 evidence for tissue specificity in liver, whereas PSRC1 expression colocalized with 350 GWAS traits in both whole blood and liver. The ADCY3 locus has been reported to be associated with BMI in young individuals in 365 previous studies [48,49]. Our MR analyses identified evidence that changes in ADCY3 366 expression in adipose tissues may influence BMI, whereas weaker evidence was 367 observed based on the expression of other proximal genes (NCOA1). Specifically, we 368 found that the magnitude of the effect for ADCY3 expression was observed most strongly 369 in adipose tissue, aligning with other research [50,51]. Furthermore, recent work has 370 uncovered a variant in ADCY3 associated with an increase in obesity levels [52]. In 371 contrast, moloc showed a lack of evidence of colocalization for NCOA1 expression. 372 Moreover, although the CENPO gene was evaluated as part of our original association 373 analysis, there were no eQTL for this gene for any of the tissues we analyzed. From this, 374 we believe that ADCY3 is likely the functional gene impacting BMI at this locus, although 375 only with in-depth follow up analyses can this be determined with confidence. Our 376 additional analysis indicated no tissue-specific effects using eQTL effect estimates 377 derived from brain tissue, which suggests that the influence of ADCY3 expression on BMI 378 levels may be confined to adipose tissue. However, extended analyses using molecular 379 data derived from brain tissue is necessary to confirm this, particularly given that previous 380 work has linked gene expression in brain tissue with obesity-related traits [50,53]. 381 We also identified evidence of colocalization for gene expression, DNA methylation and 383 complex trait variation at the cholestrol associated region on chromosome 11. This was 384 observed for TMEM258 expression in whole blood, although FADS1 narrowly missed the 385 0.8 cut-off (PPA = 0.77). This was based on DNA methylation levels at a CpG site located 386 in the promoter region of FADS1 (cg19610905). This effect was observed using data from 387 whole blood (which is the only tissue we had accessible DNA methylation for in this study), 388 which is potentially acting as a proxy for the true causal/relevant tissue type for this effect 389 [18]. However, there was no indication that methylation played a role in the expression of 390 FADS2. TMEM258 has been proposed as a regulatory site for cholesterol in 'abdominal 391 fat' previously [54]. Interestingly, our MR analyses identified a single hit for this gene in 392 adipose tissue, suggesting that TMEM258 expression is highly tissue-specific. FADS1 393 has previously been associated with cholesterol levels in young individuals [55]. 394 Additionally, genetic variation at this region is associated with DNA methylation levels at 395 cg19610905 based on cord blood in ARIES, which suggests that these methylation 396 changes may influence the expression of FADS1/TMEM258 from a very early age. 397 Overall at this region, our results suggest that scenario 2 is a likely explanation for the 398 association signal, where it is biologically plausible that multiple causal genes influence 399 complex trait variation. Specifically, our analyses suggest that TMEM258 and FADS1 are 400 potential causal genes, however, further work is needed to elucidate whether FADS2 is 401 directly influencing cardiovascular traits or is simply co-regulated with the nearby 402 functional loci. 403

404
The LPL locus was not subject to co-regulation/uncertainty over the likely causal gene 405 and is therefore likely attributed to scenario 1. LPL has been previously reported to 406 influence lipid and triglyceride levels [56][57][58] and there is also evidence from gene 407 knockout experiments [59]. The tissue-specificity of LPL has also previously been 408 explored, although not by recent studies [60]. 2SMR analyses provided robust evidence 409 of highly specific gene expression in adipose tissue, corroborating previous research 410 [60,61]. 411

412
For other regions evaluated in our study, there was evidence that multiple genes may 413 potentially influence traits. The SORT1 locus has been previously studied in detail with 414 regards to its effect on cholesterol levels [46,62]. Our MR analyses provided additional 415 evidence of an effect using expression derived from liver tissue for SORT1, CELSR2 and 416 PSRC1, as well as in pancreas tissue for SORT1 and CELSR2 only. Our subsequent 417 moloc analysis identified evidence of colocalization for SORT1 and CELSR2 expression 418 with cholesterol only in liver tissue, suggesting that PSRC1 could be less tissue-specific 419 than the other 2 genes in this region. Previous research supports these observations with 420 regards to the effects of SORT1 and CELSR2 in liver [11,63], as well as the lack of tissue-421 specificity for the PSRC1 locus [64]. There was limited evidence that DNA methylation 422 was affecting gene expression at this region, although future work with methylation data 423 derived from liver tissue is warranted. 424 This study has demonstrated the value of our systematic framework in terms of 425 distinguishing between scenarios 1, 2, 3 and 4. However, an important limiting factor, as 426 with any study applying single-instrument MR, is the inability to separate mediation from 427 horizontal pleiotropy (i.e. scenario 5). Given that trans-eQTLs likely regulate genes 428 through a non-allele-specific mechanism [65], we selected only eQTLs that were 429 influencing proximal genes. As more eQTL are uncovered across the genome by future 430 studies, across a wide range of tissue and cell types, our framework should become 431 increasingly powerful to evaluate all 5 outlined scenarios. 432

433
In terms of limitations in this study, we recognise the varying sample sizes between 434 tissues in GTEx will determine the relative power to detect eQTL (Additional file 1: Figure  435 S2). Increased sample sizes in GTEx [66] and similar endeavours will help address this 436 limitation. Furthermore, the DNA methylation data we incorporated within our framework 437 from the accessible resource for ARIES [45] project was only obtained in whole blood. 438 However, in general, investigating the potential mediatory role of DNA methylation in 439 whole blood is a limitation, as this assumes that whole blood is acting as a proxy for 440 another, more relevant tissue type [67] . Furthermore, recent work has suggested that 441 promoter DNA methylation may not be sufficient on its own to influence transcriptional 442 changes [68]. Future work will need to incorporate DNA methylation data from various 443 tissues as and when these data become available so we can better understand the role 444 of this epigenetic process on transcriptional activity. For this purpose, a resource 445 concerning tissue-specific DNA methylation would be extremely valuable. 446 447 Another constraint of relatively modest sample sizes in GTEx is that we did not detect 448 evidence of co-localization at some loci despite investigating the functionally relevant 449 gene. For example, we can be reasonably certain that circulating ApoA1 levels are 450 influenced by the expression of APOA1. The complexity of gene regulation is often under-451 estimated due to factors such as feedback loops, hidden confounders in expression data 452 and regulatory activity not always being detected in relevant tissues [69]. However, we 453 are beginning to better understand regulation across tissues [64], which should provide 454 us with further opportunities to detect cross-tissue regulatory activity and develop our 455 biological understanding of disease.

Additional files
Additional file 1 -Supplementary figures: Figure S1. MR effect estimates are based on the Wald ratio test, where ̂ | is the coefficient of the genetic variant in the regression of the exposure (e.g. gene expression) and ̂ | is the coefficient of the genetic variant in the regression of the outcome (e.g. cardiovascular trait). Figure S2. Scatter plot illustrating how eGene discovery increases as sample size increases (R 2 = 0.84). Figure adapted from the Genotype Tissue Expression Project (Aguet et al 2017). Figure S3. Volcano plot from our tissue-specific Mendelian randomization analysis for the Apolipoprotein A1 associated region (rs2727784). Outcome data from Kettunen et al (2016). Figure S4. Volcano plot from our tissue-specific Mendelian randomization analysis for the Apolipoprotein B associated region (rs646776). Outcome data from Kettunen et al (2016). Figure S5. Volcano plot from our tissue-specific Mendelian randomization analysis for the Apolipoprotein B associated region (rs10419998). Outcome data from Kettunen et al (2016). Figure S6. Volcano plot from our tissue-specific Mendelian randomization analysis for the cholesterol associated region (rs646776). Outcome data from Willer CJ et al (2016). Figure S7. Volcano plot from our tissue-specific Mendelian randomization analysis for the low density lipoprotein associated region (rs646776). Outcome data from Willer CJ et al (2016). Figure S8. Volcano plot from our tissue-specific Mendelian randomization analysis for the triglyceride associated region (rs80026582). Outcome data from Willer CJ et al (2016). Figure S9. Volcano plot from our tissuespecific Mendelian randomization analysis for the very low density lipoprotein associated region (rs80026582). Outcome data from Kettunen et al (2016). Table S1. Tissues used for tissue-specific Mendelian randomization. Table S2. Results of fine mapping analysis. Table S3. Tissues used for moloc analysis. Table S4. Details on the GWAS datasets used. Table S5. Tissue-specific Mendelian Randomization results for the Apoliporotein A1 associated region on chromosome 11 (rs2727784). Table S6. Tissue-specific Mendelian randomization results for the Apolipoprotein B associated region on chromosome 1 (rs646776). Table S7. Tissue-specific Mendelian randomization results for the Apolipoprotein B associated region on chromosome 19 (rs10419998). Table S8. Tissue-specific Mendelian randomization results for the body mass index associated region chromosome 2 (rs11693654). Table S9. Tissue-specific Mendelian randomization results for the cholesterol associated region on chromosome 1 (rs646776). Table S10. Tissue-specific Mendelian randomization results for the cholesterol associated region on chromosome 11 (rs174538). Table S11. Tissue-specific Mendelian randomization results for the interleukin-6 associated region on chromosome 1 (rs12129500). Table S12. Tissue-specific Mendelian randomization results for the interleukin-6 associated region on chromosome 9 (rs600038). Table S13. Tissue-specific Mendelian randomization results for the low density lipoprotein associated region on chromosome 1 (rs646776). Table  S14. Tissue-specific Mendelian randomization results for the triglyceride associated region on chromosome 8 (rs80026582). Table S15. Tissue-specific Mendelian randomization results for the very low density lipoprotein associated region on chromosome 8 (rs80026582). Table S16. Moloc results for the apolipoprotein B associated region on chromosome 1. Table S17. Moloc results for the cholesterol associated region on chromosome 1. Table S18. Moloc results for the body mass index associated region on chromosome 2. Table S19. Moloc results for the cholesterol associated region on chromosome 11. Table S20. Moloc results for the low density lipoprotein associated region on chromosome 1.  1) The genetic variant influences the trait, mediated by the expression of a single gene at a locus. 2) The genetic variant influences the trait via multiple genes which are co-regulated with one another.

Additional file 2 -Supplementary tables:
3) The genetic variant influences the trait via a single gene which is co-regulated with other non-causal genes. 4) The genetic variant that influences the trait is in linkage disequilibrium with another variant which is responsible for changes in gene expression that does not affect the trait. 5) The genetic variant influences both gene expression and the trait outcome by two independent biological pathways (horizontal pleiotropy).