MT-HESS: an efficient Bayesian approach for simultaneous association detection in OMICS datasets, with application to eQTL mapping in multiple tissues

Motivation: Analysing the joint association between a large set of responses and predictors is a fundamental statistical task in integrative genomics, exemplified by numerous expression Quantitative Trait Loci (eQTL) studies. Of particular interest are the so-called ‘hotspots’, important genetic variants that regulate the expression of many genes. Recently, attention has focussed on whether eQTLs are common to several tissues, cell-types or, more generally, conditions or whether they are specific to a particular condition. Results: We have implemented MT-HESS, a Bayesian hierarchical model that analyses the association between a large set of predictors, e.g. SNPs, and many responses, e.g. gene expression, in multiple tissues, cells or conditions. Our Bayesian sparse regression algorithm goes beyond ‘one-at-a-time’ association tests between SNPs and responses and uses a fully multivariate model search across all linear combinations of SNPs, coupled with a model of the correlation between condition/tissue-specific responses. In addition, we use a hierarchical structure to leverage shared information across different genes, thus improving the detection of hotspots. We show the increase of power resulting from our new approach in an extensive simulation study. Our analysis of two case studies highlights new hotspots that would remain undetected by standard approaches and shows how greater prediction power can be achieved when several tissues are jointly considered. Availability and implementation: C++ source code and documentation including compilation instructions are available under GNU licence at http://www.mrc-bsu.cam.ac.uk/software/. Contact: sylvia.richardson@mrc-bsu.cam.ac.uk or lb664@cam.ac.uk Supplementary information: Supplementary data are available at Bioinformatics online.


S.1.1 Simulation set-up
The matrix Γ of 0's and 1's encodes the pattern of non-zero associations between responses and predictors. Because our method is aimed at detecting a signal that is present in different tissues, we use the same Γ in all tissues.
The association pattern used in the main simulation study can be visualized in Figure 1  The pattern used to distinguish between cis-and trans-associations contains 6 hotspots, at SNP positions {100, 200, 300, 900, 1000, 1100}, 5 isolated cis SNPs at positions {400, 500, 600, 700, 800} and 5 cis-associations which predict responses also predicted by other SNPs at positions {150, 250, 850, 950, 1050}. It can be visualized in Figure 1  The most general pattern of noise across the tissues in our simulation study is as follows: shared ik ∼ N(0, σ 2 shared ) This is visualised in Figure S.4, along with the coefficient of determination R 2 plot for a case with unbalanced total noise standard deviation across tissues.

S.1.2 Simulation study results: residual correlations between tissues
In Figure S.1 we show Receiver Operating Characteristic (ROC) curves of MT-HESS and MANOVA for three simulation set-ups with equal total noise standard deviation σ total = 0.1, 1 ≤ ≤ 3, but with increasing correlation between tissues (0, 0.16 and 0.5). The correlation between tissues is proportional to the "shared" noise variance and also depends on the noise variances of the two tissues considered. Given two tissues and , the correlation between them is given by In the balanced case, when the tissue-specific noise standard deviation is constant across tissues, the equation above simplifies with constant level of correlation between tissues. In this case, without loss of generality, we drop the subscripts , . We also report ROC curves for individual tissues. We call these analyses ST-HESS, single tissue HESS, to distinguish it from MT-HESS, multi-tissue HESS. Although ST-HESS ROC curves are not properly comparable to those from MT-HESS ones, as they result from different subsets of the data, they are included to give some insight into the way in which the multiple tissues methods combine information across the tissues. Since we keep the total noise standard deviation constant in all three cases, each single-tissue ROC curves remain the same as they are not influenced by the level of correlation. Finally we have also added the ROC curves intersecting the results from the single-tissue version of HESS run on each separate tissue. We call these analyses iST-HESS, intersection of single tissue HESS, to distinguish it from single and multiple tissues analysis above. iST-HESS ROC curves do not depend on the level of correlation between tissues and therefore remain the same for all the cases simulated. As expected iST-HESS analysis is more powerful at detecting associations than any single-tissue analysis. In contrast we see that MT-HESS gains less power over the single tissue analysis as the correlation between tissues increases. MANOVA, which also uses an estimate of the covariance across tissues is also affected by the correlation level. Plots have different levels of correlation between tissues due to different levels of tissue-specific noise standard deviation and "shared" noise standard deviation:

S.1.3 Simulation study results: unbalanced noise across tissues
We have also investigated a range of simulation cases with different signal-to-noise ratios across tissues with the same level of the signal strength µ = 0.15 as before. Here we show cases with independent residuals between tissues, in order to separate out the effects of imbalanced noise standard deviation in the three tissues and correlation between tissues. Figure S.2 shows the ROC curves for four simulation cases, with varying values of signal-to-noise ratio across the three tissues. The individual ST-HESS curves for the different tissues give an indication of the amount of imbalance of signal-to-noise ratio across the three tissues. The three methods which combine results across tissues (MT-HESS, MANOVA and iST-HESS) in general lie in between the separate tissue curves, as they all perform an averaging across tissues. We see clearly the advantage of MT-HESS over both MANOVA and the intersection of single-tissue results.

S.1.4 Simulation study results: unbalanced noise across tissues and residual correlation between tissues
Here we show a more complicated simulation case in which the noise is unbalanced across tissues and the residual correlation varies between tissues. We use the same pattern of associations between responses and predictors as before with the mean level of the signal strength µ = 0.29. For approximately half of the responses where the association is simulated, the residual correlation between tissues is high, with moderately imbalance of the signal-to-noise ratio (ratio 2.2 between strongest and weakest tissues), and the other half of responses have low correlation, but extremely high signal-to-noise ratio imbalance (ratio 15 between strongest and weakest tissues). The two residual correlation matrices are   Figure S.4 values of the coefficient of determination R 2 for each response k and each tissue . In particular, we can see that signal is much stronger in tissue 1 than in tissues 2 and 3, and pinpoint which responses have a much weaker signal.  Here we provide here detailed results for one replicate. Figure S.6 shows the heatmap of the matrices of associations. For MT-HESS (ST-HESS) the measure of association is the marginal posterior probability of inclusion that response k is associated to SNP j, whereas for the frequentist methods the measure is given by the p-value. The major observation is that MT-HESS (ST-HESS) provides a much better separation between noise and signal, whereas p-values corresponding to "one-at-a-time" frequentist methods exhibit a much more noisy pattern. In particular MANOVA (t-test) only detects the strongest signal in the third hotspot, while MT-HESS (ST-HESS) detects 5 out of 6 hotspots (see Figure S.7).
Regarding the information synthesis across multiple tissue, we see that MT-HESS (Figure S.6) tends to be conservative as it seeks agreement between the 3 tissues. This is why some associations reported in tissue 1 by ST-HESS are not present in the multiple tissue results. On the other hand, a few isolated false positives present in tissue 1 are not reported in the multiple tissue analysis.
On tissue 1, t-tests declare only one hotspot, whereas ST-HESS identifies 5 out of 6 hotspots. Also, ST-HESS manages to correctly find the relevant responses, and exclude the other ones (although there a few false positives for the fourth hotspot). On tissue 2, ST-HESS still manages to identify 5 out of 6 hotspots, although it detects fewer associations on hotspots 1 and 4. On tissue 3, both t-tests and ST-HESS fail to actually detect any hotspot, the reason being that this tissue has a very weak signal. Looking at the multiple-tissue analyses, we see that MANOVA reports the same hotspots and associations as tissue 1, that is MANOVA is fundamentally using only the information in tissue 1 to declare associations. On the other hand, MT-HESS combines the information of all tissues and dampens some of the associations found in tissue 1, mostly on hotspots 1 and 4. On the other hand, a few additional true associations are declared in hotspots 3 and 6.
This illustrates that MT-HESS is fairly conservative; in other words, borrowing information across tissues leads to reliable association detection, at the cost of possibly excluding signal arising solely in specific tissues. However when compared to multivariate MANOVA it is able to better separate signal from noise with better sensitivity and specificity. This set-up is similar to the case shown in the main paper in Figure 2 (a), where noise is balanced across tissues σ total = 0.1, 1 ≤ ≤ 3 and σ shared = 0. We selected an arbitrary replicate for the case µ = 0.29. Figure S.9 visualises the matrices of associations. As describe in Section S.1.4 both MT-HESS (ST-HESS) provides a much better separation between noise and signal. Additionally, despite the fact that µ = 0.29 corresponds to a case with strong signal, MANOVA (and t-test) only seem to identify a cluster of association around SNP 400 and response 120, whereas MT-HESS (and ST-HESS) detect all the simulated clusters of associations. This analysis is confirmed when we look at the hotspots declaration on Figure S.10. As described in the main text, to declare a hotspot, we first declare the "significant associations" by setting a threshold on the association matrices shown in Figure S.9. Because it is simulated data, we can calculate the exact FDR for each replicate, and we can set the threshold for each method (on MPPIs or p-values) to reach the same FDR rate of 5%. Once the significance threshold is set, for each SNP we count the number of significantly associated responses. Figure S.10 shows a striking difference between the "one-at-a-time" approach and MT-HESS and ST-HESS, and confirms what was observed on the heatmaps. MANOVA and t-tests only identify one hotspot out of 6 (around SNP 400), while both ST-HESS and MT-HESS identify all the hotspots, with almost all the correct associations. S.1.5.2 Detailed results for "weak association signal" We display in this section the heatmaps of associations and the hotspots detection in the case of a weaker signal µ = 0.09. Figures S.11 and S.12 show that no signal is detected by ST-HESS on any of the tissues, and similarly for the t-test. In contrast, MT-HESS manages to detect some associations corresponding to the two strongest hotspots (around SNPs 100 and 500).

S.2 Post-processing for model probabilities
When calculating the most probable models for each response k, rather than using the empirical frequencies directly, we use the marginal likelihoods for the models visited during the MCMC chain, obtaining an estimate of the quantity .
We use a Rao-Blackwellised estimator: , where t = 1, · · · , T are the MCMC iterations and θ (t) stands for all parameters in the model other than γ k . The denominator is approximated by replacing the space of all possible model vectors γ by the set of unique models visited during the MCMC chain {γ [1] , γ [2] , · · · }, assuming that models not visited have low probability. Thus our estimator is where the index m ranges over the set of unique models visited. The quantity S kmt can be simplified due to conditional independence in the model structure: , {ω 1 , · · · , ω q } (t) , {ρ 1 , · · · , ρ p } (t) ).

S.3 Calibration of FDR estimate
For our simulations we are able to calculate the true number of false positive and negative calls and thus check our estimates of the FDR. For each simulated data set, we find the marginal posterior probability of inclusion (MPPI) threshold corresponding to a given value of bF DR, and calculate the true FDR for that list of declared positives. Table S.1 shows the MPPI thresholds, the number of declared positives and the true FDR values for the simulations shown in Figure 2 in the main text. Values are averaged over multiple replications of each simulation case. Only replicates for which some positive associations are declared at that level are used. We give results for bF DR of 5% and 10% as these are common values used for FDR thresholding. The first two cases depicted in Fig. 2 (a) and (b) are high signal-to-noise ratio cases. Here we see that the bF DR overestimates the true FDR, and so the decision rule is conservative. For Figure 2 (c) and (d), which have lower signal-to-noise ratio, the mean FDR is higher than the bF DR estimate but the median is lower. For these cases where very few positives are declared, the number of possible true FDR values is very small, and it will be rare that the true FDR will match the bF DR estimate. However the median values show that in these higher signal to noise cases we are still conservative.

S.4 Sensitivity analysis
In this section we present a sensitivity analysis for the parameters c ρ j and d ρ j of the prior ρ j ∼ Gamma(c ρ j , d ρ j ), j = 1, . . . , p, that accounts for the propensity of predictor j to be a hotspot. Figure S.13 shows the shape of the gamma distribution for three different parameterization c ρ j = d ρ j = 1.2, c ρ j = d ρ j = 2 and c ρ j = d ρ j = 5. From the figure is clear that the larger the value of the parameters, the smaller the variance c j /d 2 j with the prior concentrating around its mean c j /d j . Note that with c ρ j = d ρ j = 1.2 we avoid placing substantial prior mass in 0 or shrinking large values of ρ j towards 1. Figure S.14 shows the results for different parameterizations of the prior ρ j ∼ Gamma(c ρ j , d ρ j ) obtained in a replication of the simulation experiment illustrated in Figure 2 (a) (independent residuals, σ shared = 0 and balanced case with σ total = 0.1, 1 ≤ ≤ 3) and whose pattern of association is shown in Figure 1 (a). At bFDR ≤ 0.05, the parameterizations c ρ j = d ρ j = 1.2 and c ρ j = d ρ j = 2 produce nearly the same results, although we can already see that with c ρ j = d ρ j = 2 in the two hotspots simulated at chromosomes 14 and 15 the number of probe sets significantly associated starts to decrease. This is very clear with c ρ j = d ρ j = 5. Finally, note that with the parameterization c ρ j = d ρ j = 1.2, there are less false positives than in the other two cases as illustrated by the number of "small picks" close to the x-axis. Figure S.15 shows sensitivity analysis for IDIN network example presented in Section 4.1. We confirm results obtained in the sensitivity analysis above with similar behavior with c ρ j = d ρ j = 1.2 and c ρ j = d ρ j = 2 parameterization. Both prior specifications detect three hotspots with the former sightly favoring more associations. Strikingly, with c ρ j = d ρ j = 5 almost no hotspots are detected showing the importance of our propensity prior and a parameterization that a priori allows large variance without placing the mode in 0.  Figure 2 (a) and whose pattern of association is shown in Figure 1 (a) for different parameterizations of the prior ρ j ∼ Gamma(c ρj , d ρj ). The x-axis depicts chromosome number and position. The y-axis shows the number of probe sets significantly associated with each SNP (bFDR ≤ 0.05). Vertical dotted lines show the number of probe sets (30, 3, 10, 30, 10, 10) associated with 6 hotspots marked with red dots in the x-axis.

S.5 Rat data results
Affymetrix RAE 230 2.0 microarrays were used for three tissues (left ventricle, aorta and liver). Gene expression summaries were derived using robust multichip average (RMA) algorithm (Irizarry et al., 2003). To select informative probe sets targeting robustly expressed genes, for each of the three tissues we computed the mean expression over all normalized samples using the normalmixEM function from R package mixtools (Benaglia et al., 2009). This function fits a mixture of two Gaussian distributions to the log-transformed mean expression (expressed and non/low-expressed genes). Starting values for the EM algorithm were chosen assuming 50% of expressed genes, and the initial mean values of the underlying Gaussian distributions were set equal to the first and third quartile of the gene expression distribution. The distribution with the lowest mean expression was considered as the non/low-expressed genes distribution, and a detection threshold was set, by taking the 95% percentile of the non/low-expressed genes distribution. Microarray probe sets above the detection threshold in less than two samples were considered as targeting non-expressed genes and filtered out from the analysis. Subsequently, an annotation-based filtering was performed to remove all probe sets that were non-specific or targeted non-Ensembl genes (Genome assembly: Rnor_5.0). The intersecting set of probe sets was obtained for the three tissues considered (left ventricle, aorta and liver), and the IDIN network (Heinig et al., 2010) was retrieved (final number of probe sets = 146).
The intersecting probe sets of the IDIN network was mapped with HESS to the rat genome by using a non-redundant genome-wide SNP panel of 1,384 SNPs (Roider et al., 2009). In this panel of rat SNPs, missing genotypes were imputed by using FastPhase (Scheet and Stephens, 2006). We carried out three single-tissue HESS runs (mapping the IDIN network to left ventricle, aorta and liver tissues separately), and a HESS multiple-tissue run by combining the IDIN network expression of these three tissues.
The PASTAA algorithm was used to identify overrepresented transcription factors binding sites motifs in the promoter of the corresponding human orthologs of the rat genes that were associated with the hotspot located in rat chromosome 10 with 5% Bayes FDR (Roider et al., 2009). The rat genes associated to the regulatory hotspot located in rat chromosome 10 (52 protein coding genes) were mapped to human by considering only those genes that a one-one ortholog relationship in Ensembl 59 archive (which resulted in 41 human genes). PASTAA online tool was run with default parameterization and considering a promoter range of ±400 base pairs around the gene transcription start site. P -values were corrected for multiple testing by using the R package p.adjust and the adjustment method "Benjamini & Hochberg" (Benjamini, 1995). Corrected p-values are referred in the text as adjusted p-value. A transcription factor was considered to be significantly overrepresented if adjusted p-value ≤ 0.05. Bayes FDR was computing as explained in Section 3.4. S.6 Human data results S.6.1 Human data methods

S.6.1.1 Subjects
Adult patients with active inflammatory bowel disease (Crohn's disease or ulcerative colitis) were recruited from a specialist inflammatory bowel disease clinic at Addenbrooke's hospital in Cambridge, prior to commencing treatment as described previously (Lee et al., 2011). Diagnosis was made using standard endoscopic, histopathological, and radiological criteria. Patients receiving immunomodulators or corticosteroids were excluded due to potential effects on gene expression. Ethical approval was for this work was obtained from the Cambridgeshire Regional Ethics Committee (REC08/H0306/21). All participants provided written informed consent.
S.6.1.2 Immune cell subsets 100 ml of whole blood was taken, and cell separations were performed as previously described (Lyons et al., 2007).

S.6.1.3 Nucleic acid extraction
RNA was extracted from cell lysates using RNEasy Mini Kits (Qiagen) or Qiagen Allprep kits according to the manufacturer's instructions. RNA quality was assessed using an Agilent Bioanalyser 2100 (Agilent Technologies) and quantified by spectrophotometry using a NanoDrop ND-1000 spectrophotometer (Thermo Scientific). Genomic DNA was extracted using Qiagen Allprep kits according to the manufacturer's instructions and quantified as above.
S.6.1.4 Genomic DNA genotyping and genotype data QC Genotyping was performed using the Illumina Human OmniExpress12v1.0 BeadChip at the Wellcome Trust Sanger Institute according to the manufacturer's protocol. After genotype calling with Illumina GenCall software, SNP data was processed in the following sequence using PLINK (Purcell et al., 2007). Samples with a sex mismatch, abnormal heterozygosity, proportion of missing SNPs >0.05, and duplicated or related samples (identified using Identity by State) were removed. SNPs with missing calls >5%, extreme deviation from Hardy-Weinberg Equilibrium (p-value <1x10 −8 ), MAF <0.05 or which were monomorphic were removed. Finally, principal components analysis of the post-QC genotype calls combined with calls from HapMap3 founder individuals, using the R/Bioconductor snpStats package, confirmed all samples are of European ancestry. Highly correlated SNPs were removed using the --indep-pairwise option in PLINK so that no pair of SNPs in a 50 SNP window shifted at 5 SNP intervals had r 2 >0.8. S.6.1.5 Gene expression measurement, pre-processing and QC 200 ng RNA was processed for hybridization onto Affymetrix Human Gene ST 1.1 microarrays, according to the manufacturers instructions, prior to scanning. Microarrays were processed with the automated GeneTitan instrument which uses a 96 well plate. Chip probe intensity files were read into R version 3.02 (R Development Core Team, 2008) using the Bioconductor oligo package (Carvalho and Irizarry, 2010). Raw intensity values were pre-processed with the Robust Multiarray Average (RMA) algorithm which performs background correction, quantile normalization, log 2 transformation and summarization to gene level via a median polish.
The arrayQualityMetrics package was run on the normalised expression matrices before and after batch correction with Combat (see below) and poor quality samples removed.

S.6.1.6 Correction for batch effects
Batch correction was performed using the ComBat function from the sva (Surrogate Variable Analysis) Bioconductor package (Leek et al., 2012;Johnson et al., 2007).
S.6.1.7 Filtering the expression data Filtering of the expression data was performed prior to running HESS. Probesets on sex chromosomes and those which did not map to an Entrez gene ID were removed (Bioconductor genefilter package (Gentleman et al., 2014)). The remaining probesets were then filtered by variance. We retained any probeset whose variance across samples was in the upper 10% in any one of the cell types.

S.6.1.8 Chromosome selection
We had no a priori knowledge of which chromosomes were most likely to contain hotspots. We selected chromosome 5 because a SNP (rs10044354) tagging an inflammatory bowel disease susceptibility locus has previously been shown to be a cis eQTL for ERAP2 in both monocytes and B cells (Fairfax et al., 2012).
S.6.2 Additional discussion S.6.2.1 MT-and ST-HESS find the cis eQTL for ERAP2 at an IBD susceptibility locus We found one SNP, rs2549782, significantly associated with ERAP2 expression, and this SNP was not associated with any other genes. The MPPI for the rs2549782-ERAP2 association was 1 in the joint analysis and in all the separate cell type analyses. rs2549782 is in LD with rs10044354, the eSNP reported in (Fairfax et al., 2012) (r 2 = 0.72 in 1000 Genomes CEU population). Unsurprisingly, HESS can find eQTLs associated with only one gene (typically cis effects which tend to be strong), as well as hotspots which are acting in trans. However, cis eQTLs can be detected by standard regression-based approaches, and the real advantage of HESS is the detection of trans-eQTL clusters (i.e. hotspots), and, in the case of MT-HESS, multi-tissue hotspots.
S.6.2.2 Hotspot associated with the second highest number of genes (using a 5% bFDR threshold) The second "top" hotspot in CD4 T cells, rs17256961, was associated with 29 genes, and lies 35,469 base pairs from the top hotspot. These two SNPs are in LD with r 2 = 0.58 in 1000 Genomes CEU population. There was however no overlap in the genes associated with these SNPs. The visual representation of the MPPI matrix of SNP-gene associations around this locus in Supplementary  rs2937764  rs1502645  rs13354541  rs7709151  rs700239  rs351651  rs2249654  rs723852  rs17484586  rs11745891  rs17256961  rs11956209  rs7733888  rs9326958  rs11953759  rs1423209  rs7712083  rs17164094  rs6866231  rs2069118  rs17116707  rs4547967  rs10064789  rs13187359  rs10060856  rs2362834  rs10514203  rs2974089  rs7709951  rs331096  rs258822  rs17708574  rs3734050  rs17058343  rs347994  rs16901859 rs10512747 rs7718215 • CD4 T cells CD8 T cells Monocytes Joint Figure S.18: SNPs with 5 or more associated genes (bFDR ≤ 0.05) in any one of the single tissues or in the multitissue analysis. We observe a few SNPs associated with more genes in the joint analysis compared to each of the single cell types analysed separately. For example, rs7718215 is associated with 6 genes in the joint analysis versus 1,4,0 for CD4 T cells, CD8 T cells and monocytes respectively in the single tissue analyses. . 3) Manhattan plot showing the number of genes associated with each SNP (using a 5% bFDR threshold) in the displayed region (single tissue HESS-CD4 T cell analysis). The top two hotspots in CD4 T cells both lie in the region shown (black vertical lines). 4) Transcripts (including alternatively spliced isoforms) from Ensembl. The lincRNA ENST00000507733, indicated by the arrow and annotated in red, lies on the anti-sense strand. 5) RefSeq genes. 6-8) Histone marks from ENCODE. H3K4me1-histone monomethylation, which is associated with enhancer function, and with active transcription or with genes that are poised for activation. H3K4me3-histone trimethylation, which is linked to promoter activity. H3K27ac-histone acetylation, a marker of active enhancers. Numbers (outermost) indicate chromosomes. Gene symbols (inner circle) indicate genes whose expression is significantly associated (Bayes FDR ≤0.05) with rs11745891. Interconnecting lines link the genomic location of the hotspot SNP and these genes (red lines for genes on the same chromosome as the SNP, blue lines for genes on different chromosomes).