Exploiting expression patterns across multiple tissues to map expression quantitative trait loci

Background In order to better understand complex diseases, it is important to understand how genetic variation in the regulatory regions affects gene expression. Genetic variants found in these regulatory regions have been shown to activate transcription in a tissue-specific manner. Therefore, it is important to map the aforementioned expression quantitative trait loci (eQTL) using a statistically disciplined approach that jointly models all the tissues and makes use of all the information available to maximize the power of eQTL mapping. In this context, we are proposing a score test-based approach where we model tissue-specificity as a random effect and investigate an overall shift in the gene expression combined with tissue-specific effects due to genetic variants. Results Our approach has 1) a distinct computational edge, and 2) comparable performance in terms of statistical power over other currently existing joint modeling approaches such as MetaTissue eQTL and eQTL-BMA. Using simulations, we show that our method increases the power to detect eQTLs when compared to a tissue-by-tissue approach and can exceed the performance, in terms of computational speed, of MetaTissue eQTL and eQTL-BMA. We apply our method to two publicly available expression datasets from normal human brains, one comprised of four brain regions from 150 neuropathologically normal samples and another comprised of ten brain regions from 134 neuropathologically normal samples, and show that by using our method and jointly analyzing multiple brain regions, we identify eQTLs within more genes when compared to three often used existing methods. Conclusions Since we employ a score test-based approach, there is no need for parameter estimation under the alternative hypothesis. As a result, model parameters only have to be estimated once per genome, significantly decreasing computation time. Our method also accommodates the analysis of next- generation sequencing data. As an example, by modeling gene transcripts in an analogous fashion to tissues in our current formulation one would be able to test for both a variant overall effect across all isoforms of a gene as well as transcript-specific effects. We implement our approach within the R package JAGUAR, which is now available at the Comprehensive R Archive Network repository. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1123-5) contains supplementary material, which is available to authorized users.


Our model
For a given gene-SNP pair, we begin with a linear mixed effects model that models expression patterns across tissues as a function of genotype, i.e., where Y is a nt-dimensional vector of expression levels in t tissues and n individuals, α is a vector of tissue-specific intercepts, G is a nt-dimensional vector of genotypes, β is a fixed effect of genotype across tissue, u ∼ N 0, τ ZZ T is a vector of subject-specific random effect, v ∼ N 0, γXX T is a vector of tissue-specific random effects, and ξ ∼ N (0, I nt ). The matrices J, Z and X are design matrices with X being a function of genotype. J is nt × t dimensional matrix denoting the design matrix for the tissue-specific intercepts. Z is nt×nt design matrix for the subject-specific intercepts. X is a nt×t design matrix of stacked genotypes. The parameters of interest are β and γ; α, τ and are nuisance parameters.
We test the null hypothesis that H 0 : β = γ = 0, i.e. the variant does not affect gene expression across any of the tissues.

Derivation
From equation 1, the log-likelihood function of Y conditioned on the genotype is - where θ represents the vector of all the variance components involved in Σ and c is a constant.

Score test
Let the parameters of interest be ψ = (β, γ) T and the nuisance parameters be η = (α, τ, ) T . The following is constructed under the null (H 0 ) Some algebra will result in the following - and

Missing response data
i the observed part and Y m i the missing part. Also, let R i,j = 1 if Y i,j is observed and R i,j = 0 otherwise. Assume that all the explanatory variables are completely observed. θ and ψ describe the measurement and missingness, respectively.
Assuming that the data are missing at random (MAR), If the parameter space of (θ , ψ ) is the product of the parameter spaces of θ and ψ (separability condition), then the inference is based on the observable data only (ignorability) [3,2].
) and x 1 constitute gene expression data available for samples with all the tissues/groups while x 2 constitutes gene expression data for samples with depleted tissues/groups. The multivariate gaussian theorem states that the marginal distribution of x 1 and x 2 are also normal with mean vector µ i and covariance matrix Σ ii (i = 1, 2), respectively. The conditional distribution of x i given x j is also normal with mean vector such that The joint density of x is given by After some algebra, we can show that the marginal distribution of x 1 can be written as ...and the conditional distribution of x 2 given x 1 is given by In this way, we can show that the observed data likelihood has the exact same model form as the full data likelihood.
3 Variance-covariance of U 2 β and U γ where Σ =ˆ I +τ ZZ T ,τ andˆ are the maximum likelihood estimates of the individual-specific and tissue-specific random effects. Using the theory of quadratic forms [1], estimated variance of U γ under the null is given by Similarly, from section 2.1, From the theory of quadratic forms [1], estimated variance of U 2 β under the null is U 2 β and U γ share the same . Again, from the theory of the quadratic forms, the covariance between U 2 β and U γ is

Optimal weights for minimum variance linear combination
Let a = (a β , a γ ) T , U ψ = U 2 β , U γ , and V ψ = V ar (U ψ ). We want to find the minimum variance linear combination a T V ψ , subject to the constraint that a β + a γ = 1 or a T 1 = 1. Specifically, we wish to minimize a T V ψ a.
Using Lagrangian multipliers to perform constrained optimization, we see that From the above equations, we have the following system of equations- This gives our optimal weights - and

MetaTissue method
MetaTissue (MT) method was proposed by Sul et al that jointly models all tissues by utilizing a meta-analysis by extending it to a mixed-model framework. A mixed model is used to account for the correlation of expression between tissues, and perform meta-analysis to combine results from multiple tissues. The following model description is from the original paper.
Consider the following mixed-model - where u ∼ N 0, σ 2 µ D and e ∼ (0, σ 2 e I) and X j is the matrix denoting SNP j for T tissues. The variances are estimated using EMMA andβs are jointly estimated using the following equation Given theβ = β 1 , . . . ,β T , information from multiple tissues is combined by applying metaanalysis toβ.

Fixed-effects model
A statistic of FE and its distribution under the null hypothesis are - where B 1 . . . B T and V 1 , . . . , V T are the estimates of effect-size and the standard error of B i in T tissues. Let µ be the unknown true effect size and so the null hypothesis of FE is µ = 0 or in other words the effect size in all tissues is zero. A p-value of S F E is obtained from the standard normal distribution.
Under the null hypothesis,β var(β) will approximately follow t-distribution with k degrees of freedom.

Random-effects model
The general assumption behind the random-effects model is that the effect size of a variant is different among datasets and follows a probability distribution with mean µ and variance τ 2 .
The H 0 is the same as that of the fixed-effects model -H 0 : µ = 0. The statistic for the random effects model is defined as - whereμ andτ 2 are estimated mean and variance of the effect size, and the maximum likelihood estimates of the two parameters that are iteratively calculated using Hardy and Thompson approach or some other iterative approach. The statistic follows a half and half mixture of χ 2 0 and χ 2 1 under the null.
where y si denotes gene expression vector in tissue s, for i th individual, µ s is the mean expression level of the gene in tissue s, β s is the effect of the gene on the genotype in tissue s and g i is the genotype of individual i coded as 0,1 or 2 copies of the reference allele. Statistical inference is made on γ, a binary variable (called configuration) whose status indicates the presence or absence of an eQTL. The length of γ depends on the number of tissues. Null hypothesis is indicated by γ = {0, . . . , 0} and any other combination is considered an alternative hypothesis. The statistical inference on γ is done using Bayes Factors such that - In order to account for many possible alternatives, the overall strength of evidence against at the candidate SNP is obtained by "Bayesian Model Averaging" (BMA), which involves averaging over the possible alternative configurations, weighting each by its prior probability, η γ = P (γ|H 0 = F ALSE): indicates a hierarchical model where the hyperparameters are estimated from the data (datadriven approach). BF BM Alite is a more computationally scalable version of the above flavors as it averages the test statistic over S + 1 configurations. In general, eQTL-BMA method does not scale well with increasing number of tissues because the number of terms in the sum of above equation is 2 S − 1.
In the presence of a strong evidence against the H 0 , posterior probability on each configuration indicating that the SNP is an eQTL in tissue s is computed by -P (γ = T RU E|data, H 0 = f alse) = η γ BF γ γ=0 η γ BF γ A frequentist interpretation to Bayes Factors computed by eQTL-BMA is given by performing adaptive permutations at the gene-level at a given FDR.

A note on statistical software
Our simulations were run to compare the statistical power (and type I error rate) between our method, eQTL-BMA, MetaTissue and Tissue-by-Tissue methods.
eQTL-BMA software is available for download at https://github.com/timflutre/eqtlbma. In order to expedite the analysis, we ran BF BM Alite version of the software, 1,000 adaptive permutations (using trick 1) to obtain the gene-level p value. These p values were then extracted from output.log˙jointPermPvals.txt.gz file for further analysis. We used eQTL-BMA software version 1.2 to perform all the analyses. In case of the real data analyses, we increased the number of permutations to the author recommended 10,000.