Sparse regression models for unraveling group and individual associations in eQTL mapping

As a promising tool for dissecting the genetic basis of common diseases, expression quantitative trait loci (eQTL) study has attracted increasing research interest. Traditional eQTL methods focus on testing the associations between individual single-nucleotide polymorphisms (SNPs) and gene expression traits. A major drawback of this approach is that it cannot model the joint effect of a set of SNPs on a set of genes, which may correspond to biological pathways. To alleviate this limitation, in this paper, we propose geQTL, a sparse regression method that can detect both group-wise and individual associations between SNPs and expression traits. geQTL can also correct the effects of potential confounders. Our method employs computationally efficient technique, thus it is able to fulfill large scale studies. Moreover, our method can automatically infer the proper number of group-wise associations. We perform extensive experiments on both simulated datasets and yeast datasets to demonstrate the effectiveness and efficiency of the proposed method. The results show that geQTL can effectively detect both individual and group-wise signals and outperforms the state-of-the-arts by a large margin. This paper well illustrates that decoupling individual and group-wise associations for association mapping is able to improve eQTL mapping accuracy, and inferring individual and group-wise associations.


Background
Expression quantitative trait loci (eQTL) mapping aims at identifying single nucleotide polymorphisms (SNPs) that influence the expression level of genes. It has been widely applied to analyze the genetic basis of gene expression and molecular mechanisms underlying complex traits [1,2]. In a typical eQTL study, the association between each expression trait and each SNP is assessed separately [3][4][5]. This approach does not consider the interactions among SNPs and among genes. However, multiple SNPs may interact with each other and jointly influence the phenotypes [6]. This assumption will inevitably miss complex cases where multiple genetic variants jointly affect the co-expressions of multiple genes. It has been observed *Correspondence: chengw02@gmail.com 1 Department of Computer Science, UNC at Chapel Hill, 201 S Columbia St., NC 27599 Chapel Hill, USA Full list of author information is available at the end of the article in biological experiments that the joint effect of multiple SNPs to a phenotype may be non-additive [6], and genes from the same biological pathway are usually coregulated [7] by the same genetic basis. The biological process contains both individual effects and joint effects between SNPs and genes [8]. A straightforward approach to detect associations between sets of SNPs and a gene expression level can be done using the standard gene set enrichment analysis [9]. Wu et al. [10] further proposed the variance component models for SNP set testing. Braun et al. employed aggregation-based approaches to cluster SNPs [11]. In [12], Listgarten et al. further considered the potential confounding factors.
However, there are two limitations for these approaches. First, these methods typically only consider SNPs from pre-defined pathways or gene ontology categories, which are far from being complete. Second, these methods can only detect the mapping of SNP set and a single gene expression level. To better elucidate the genetic basis of gene expression, it is a crucial challenge to understand how multiple modestly-associated SNPs interact to influence the a group of genes [6]. In this paper, we refer to this kind of eQTL mapping to find associations between group of SNPs and group of gene expression levels as the groupwise eQTL mapping. An example is shown in Fig. 1. Note that an ideal model should allow overlaps between SNP sets and between gene sets, that is, a SNP or gene may participate in multiple individual and group-wise associations [6]. In literature, group-wise eQTL mapping has attracted increasing research interest recently. For example, Xu et al. [13] proposed a two-graph-guided multi-task Lasso approach to infer group-wise eQTL mapping. However, it required the grouping information of both SNPs and genes available as prior knowledge, which may not be practical for many applications. Besides, it is not able to correct the effects of confounding factors.
In this paper, we propose a novel method, geQTL, to automatically detect individual and group-wise associations in eQTL studies. It uses a two-layer feature selection strategy and adopts efficient optimization techniques, which make it suitable for large scale studies. Moreover, geQTL can automatically infer the optimal number of group-wise associations. We perform extensive experiments on both simulated datasets and yeast datasets to demonstrate the effectiveness and efficiency of the proposed method.

Preliminaries
Important notations used in this paper are listed in Table 1. In this paper, for each sample, the data of SNPs and genes are denoted by column vectors. Let x = [ x 1 , x 2 , . . . , x K ] T denote the K SNPs. Here, x i ∈ {0, 1, 2}  The traditional linear regression model for association mapping between x and z is where z is a linear function of x with coefficient matrix W, μ is an N × 1 translation factor vector. And is the additive noise of Gaussian distribution with zero-mean and variance γ I, where γ is a scalar. That is, ∼ N (0, γ I).
In association studies, sparsity is a reasonable assumption because only a small fraction of genetic variants are expected to be associated with a set of gene expression traits. This can be modeled as a feature selection problem. For example, the standard Lasso [5] can be used in association mapping, which applies 1 penalty on W for sparsity.
If both X and Z are standardized, the objective function of Lasso is formulated as where || · || F denotes the Frobenius norm, || · || 1 is the 1norm. η is the empirical parameter for the 1 penalty. W is the parameter (also called weight) matrix parameterizing the space of linear functions mapping from X to Z.
Confounding factors, such as unobserved covariates, experimental artifacts and unknown environmental perturbations, may mask real signals and lead to spurious findings. LORS [14] uses a low-rank matrix L ∈ R N×H to account for the variations caused by hidden factors. The objective function of LORS is where ||·|| * is the nuclear norm [14]. ρ is the regularization parameter to control the rank of L. L is a low-rank matrix assuming that there are only a small number of hidden factors influencing the gene expression levels.
When we fix {W, we can optimize {L} by using singular value decomposition (SVD) according to the following lemma.

Lemma 1. ([15]) Suppose that matrix O has rank r. The solution to the optimization problem
Thus, for fixed W, the formula for updating L is Both Lasso and LORS do not consider the existence of group-wise associations. Below, we will introduce the proposed model to infer both group-wise and individual associations for eQTL mapping.

geQTL
In geQTL, individual associations between SNPs and genes are modeled by following the Lasso-based strategy. Group-wise associations are inferred using a two-layer feature selection method. Since multiple SNPs may have joint effect on a group of genes, and such effect may be accomplished through complex biological processes, we introduce latent variables to bridge sets of SNPs and sets of genes. Specifically, we assume that there exist latent factors regulating the gene expression level, which serve as bridges between the SNPs and the genes. The latent variables are denoted by y =[ y 1 , y 2 , . . . , y M ] T . Here, M (M min(K, N)) is the total number of latent variables representing group-wise associations. The relationship between x and y can be represented as A ∈ R M×K denotes the matrix of coefficients between x and y. σ 2 1 I M denotes the variances of the additive noise. I M is an identity matrix. Here we drop the intercept terms because the input data X and Z are normalized to zero mean and unit variance as preprocessing.
Similarly, the relationship between y and z can be represented as where 2 ∼ N 0, σ 2 2 I N . B ∈ R N×M denotes the matrix of coefficients between y and z, C ∈ R N×K denotes the matrix of coefficients between x and z to encode the individual associations.
Note that Eq. (7) decouples the associations between SNPs and genes into two parts: one for individual associations represented as Cx, and another for group-wise associations represented as By. Next, we infer the groupwise associations by a two-layer feature selection strategy. We first remove the individual associations and denotẽ ThusZ contains only group-wise effects. Next let Thus Y represents a low-rank transformation of the original SNP matrix. Each row of Y represents a group of SNPs. From Eq. (7), we have the following multiple-inputmultiple-output (MIMO) linear system where E is a Gaussian white-noise term. In Eq. (9) and (10), A and B should be sparse since a single gene is often influenced by a small number of SNPs and vice versa [12]. Therefore, the overall objective function is where α, β, γ , ρ are the regularization parameters, and the loss function is Here, we choose different penalties for A, B, C because the sparsities of different matrices are typically of different scales.

Optimization
The optimization for L can be achieved by following a similar approach as in [14]. To optimize A, B, C, many tools can be used to optimize the 1 penalized objective function, e.g., the Orthant-Wise Limited-memory Quasi-Newton (OWL-QN) algorithm [16]. Due to space limitation, we omit the details. In the next, we devise optimization techniques that can dramatically improve the computational efficiency of geQTL.

Boosting the computational efficiency
Given a large number of SNPs and gene expression traits, scalability of the algorithm is a crucial issue. We propose two improved models, geQTL + and geQTL-ridge, which optimize the search for significant individual associations, which is the main computational bottleneck of the algorithm.

geQTL +
In a typical eQTL study, we usually have M min(K, N). Thus, the bottleneck of the algorithm is to optimize C. Our strategy is to confine the space of C. The intuition is that we only permit a small fraction of elements in C to be nonzero. It has been shown that if Z and X are standardized with zero mean and unit sum of squares, then r = abs(ZX T ) is equal to the gene-SNP correlations (r gs = |cor(z g , x s )|) [17]. Since for many test statistics, e.g., t, F, R 2 , and LR, for the simple linear regression problem can be expressed as functions of the sample correlation r gs , e.g., R 2 = r 2 , and t = r √ n−2 1−r 2 , we can find a threshold according to the required p-value, such that test statistics exceeding the threshold are significant at the required significance level. The test statistics for every gene-SNP pair in r are compared with the threshold, and only those elements whose r are greater than the threshold are optimized. We denote R ∈ R N×K as the indicator matrix indicating which elements in C can be nonzero (i.e., r gs > threshold).

geQTL-ridge
When N and K are extremely large, optimizing C may still be time-consuming, since it may take many iterations to converge with the 1 constraint. Next, we introduce geQTL-ridge, which further improves the time efficiency with slight decrease in accuracy. The key idea is to use ridge regression for individual associations so that we can get a closed form solution for C. The objective function is shown in the following. where and P i is defined as in formula (19).
The proof of the Theorem 1 is in the following section.

Proof of Theorem 1
Proof. Recall that any ridge regression problem min a ||b − aQ|| 2 2 + ||a || 2 2 , where a is a row vector and Q has linearly independent rows, has the following solution Note that Taking into account that (c i ) j can be nonzero only if (R) i,j is 1, we introduce P i , where P i has K rows and l i = K j=1 (R) i,j columns. And Then is solved by Therefore, min C loss(A, B, C, L) + γ ||C|| 2 2 , is solved by C = c T 1 , . . . , c T N T , which leads to the update formula given in Eq. (14).

Determining the number of hidden variables
In Eq. (12), we use BA + C to formulate the overall associations between SNPs and expression traits. Two group-wise associations will not share the same group of SNPs (or genes), since otherwise these two group-wise associations can be combined into one. Therefore, every group-wise association should be unique and irreplaceable. Hence, following two conditions should be satisfied • A has linearly independent rows. Since M K, this condition is equivalent to that A has full rank; • B has linearly independent columns. Since M N, this condition is equivalent to that B has full rank.
When these two conditions are met, we have The last equality holds because both A and B have full rank.
We have the following observation.
We compute BA by minimizing Eq. (12), which gives Combine (22), (23), and (24), we find M = the number of nonzero singular values of Due to the existence of noise, we should allow small singular values to be considered as zero. Therefore, we can draw a plot with singular values of (Z − L − CX)X T XX T −1 in descending order and set M to be k, if the first k singular values are large and significantly greater than the (k + 1)-th singular value.
Based on the discussion above, in order to find optimal M, we can first use Lasso to infer the initial value of C. Then, using Eq. 25, we can infer the optimal M at this stage. After that, we can optimize new C, and calculate new optimal M. We can repeat this procedure until M became stable or reach maximal number of iterations.

Results
In this section, we perform extensive experimental study using both simulated and real eQTL datasets to evaluate the performance of our methods. For comparison, we select several state-of-the-art eQTL methods, including two-graph guided multi-task lasso (MTLasso2G) [13], FaST-LMM [12], SET-eQTL [18], LORS [14], Matrix eQTL [17] and Lasso [5]. Note that we did not compare with our previous work, GDL, in [19] because it needs to incorporate many prior knowledge, that is not relevant to this work. For all the methods, the tuning parameters are learned using cross validation. The discussion of setting proper number of group-wise associations M is included in the Additional file 1. The shrinkage of the coefficients is also presented in the Additional file 1.

Simulated data
We use a similar setup for simulation study to that in [14]. First, 100 SNPs are randomly selected from the yeast eQTL dataset [20]. This gives birth to the matrix X. 100 gene expression profiles are generated by Z j * = β j * X + is used to simulate the Gaussian noise. To simulate the effects of confounding factors, we use j * , drawn from N (0, τ ). In this paper, we set τ = 0.1. is given by FF T . Here, F ∈ R H×J and F ij ∼ N (0, 1). J denotes hidden factor number. In this paper, we set J to 10.
In the left most of Fig. 2, we illustrate β. Here, we set the association strength to 1. Totally, there exist four groupwise associations with different scales. The diagonal line represents the individual signals in cis-regulation.
In Fig. 2, we report the associations inferred by geQTL. Recall that group-wise associations can be inferred from matrix A and B, and individual associations can be inferred from matrix C. It is obvious that geQTL can detect both group-wise and individual signals.
We use SNR =

Var(βX)
Var( +E) to denote the signal-tonoise ratio [14] in the eQTL datasets. Here, we fix J = 10, τ = 0.1. The SNR's are controlled by using different φ's. Using 50 simulated datasets with different SNR's, we compare the proposed methods with the selected methods. Because FaST-LMM requires the input of genomic locations information (e.g., chromosome, base pair, etc), we will compare it on the real data set. The results are averaged over 50 different simulated datasets. BA + C is used to represent the association matrix in our method. Figure 3 shows the ROC curve of TPR-FPR (true positive rate -false positive rate) for performance comparison. Typically, we care more about the TPR when the FPR is small because it is important to evaluate the performance of model when controlling the maximum tolerated FPR. Thus, in Fig. 3 Fig. 4. It can be seen that geQTL and geQTL + outperform all alternative methods by a large margin since they considers both individual and group-wise associations. We also observe that geQTL-ridge is not as good as geQTL and geQTL + . This is because geQTL-ridge does not provide a sparse solution for individual associations. MTLasso2G is comparable to LORS. LORS can correct the effects of the confounders, however, it is not able to detect group-wise mappings. We also observe that by decoupling individual and group-wise associations, the proposed models (geQTL, geQTL + , and geQTL-ridge) are more robust to noise than other methods.

Yeast eQTL data
We also validated geQTL using the bench mark datasetyeast (Saccharomyces cerevisiae) eQTL dataset. The dataset contains 112 yeast segregants generated from a cross of two inbred strains [20]. Originally, It contains 6229 gene epxressions and 2956 SNPs. SNPs with > 10 % missing values in the remaining SNPs are imputed using the function fill.geno in R/qtl [21]. The neighboring SNPs with the same genotype profiles are combined, resulting in 1027 genotype profiles. Remove gene expression traits with missing values, we get 4474 expression profiles.

cis-and trans-analysis
We follow the standard cis-enrichment analysis that is used in [22,23] for evaluation. Moreover, we use the trans-enrichment with a similar strategy [24]. Genes regulated by transcription factors (obtained from http://www. yeastract.com/download.php) are treated as trans-acting signals.
In Table 2, we report the pairwise comparison using cis-and trans-enrichment analysis. We do not list geQTL separately from geQTL+, since geQTL+ is a faster version of geQTL. In this table, the methods are sorted (from top to bottom in the left column and from left to right in the top row) in decreasing order of performance. A p-value shows how significant a method on the left column outperforms a method in the top row in terms of cis and trans enrichments. We observe that geQTL +  has significantly better cis-enrichment scores than the other models. For trans-enrichment, geQTL + is the best, and MTLasso2G comes in second, outperforming FaST-LMM, SET-eQTL, LORS, Matrix eQTL and Lasso. LORS outperforms Matrix eQTL and Lasso for both cis-and trans-enrichment. This is because LORS considers confounding factors while Matrix eQTL and Lasso does not. In total, these methods each detected about 6000 associations according to non-zero W values. We estimate FDR using 50 permutations as proposed in [14]. With FDR ≤ 0.01, geQTL + obtains about 4500 significant associations.
The plots of all identified significant associations for different methods are given in Fig. 5. Obviously, we can see that C + B × A and C of geQTL + report weaker trans-regulatory bands while stronger cis-regulatory signals than other competitors.

Gene ontology enrichment analysis on detected group-wise associations for yeast
We further evaluate the quality of detected groups of genes by measuring the correlations between the detected groups of genes and the GO (Gene Ontology) categories  [25]. Specifically, the GO enrichment test is calculated by DAVID [26]. In this paper, gene sets reported by our algorithm with calibrated p-values less than 0.01 are considered as significantly enriched.
Since SET-eQTL is the only previous approach capable of detecting group-wise association mapping, we compare the groups of genes detected by geQTL and those by SET-eQTL. For SET-eQTL, 90 out of 150 gene sets are significantly enriched. By contrast, 28 out of 30 gene sets reported by geQTL are significantly enriched. This well illustrates that the effectiveness of geQTL to infer group-wise associations. The number of SNPs in each group reported by geQTL and their genomic locations are shown in Fig. 6. We can clearly observe that SNPs in the same group are often physically close to each other. This is reasonable because SNPs nearby usually jointly affect the expression level of a set of genes that achieves a specific cell function [8].

Reproducibility of eQTLs between studies
To further evaluate the identified associations, we investigate the consistency of calls between two independent studies [27]. We examine the reproducibility based on the following two criteria [14,28,29]: • Reproducibility of detected SNP-gene associations: Let L 1 and L 2 be the sets of SNP-gene associations detected in the two yeast datasets, respectively. We can rank the associations according to the weights (or q-values for FaST-LMM). Let L T 1 and L T 2 be the top T most significant associations from the two datasets.
The reproducibility is defined as • Reproducibility of trans regulatory hotspots: For each SNP, we count the number of associated genes from the detected SNP-gene associations. We use this number as the regulatory degree of each SNP. For FaST-LMM, SNP-Gene pairs with a q-value < 0.001 are considered significant associations. We also tried different cutoffs for FaST-LMM (from 0.01 to 0.001), the results are similar. SNPs with large regulatory degrees are often referred to as hotspots. We sort SNPs in descending order of their regulatory degrees. We denote the sorted SNPs lists as S 1 and S 2 for the two yeast datasets. Let S T 1 and S T 2 be the top T SNPs in the sorted SNP lists. The trans calling consistency of reported hotspots is denoted by In Fig. 7a, we show the consistency of the top 4500 associations between different studies. In Fig. 7b, we list the reproducibility of trans regulatory hotspots reported by different approaches. Overall, geQTL + yielded results with greater consistency all other methods. This well illustrates the superiority of geQTL + .

Conclusions
In literature, much efforts have been done on eQTL mapping. Traditional eQTL mapping approaches can not detect the group-wise associations between sets of SNPs and sets of genes. To achieve that, we propose a fast approach, geQTL, to detect both individual and group-wise associations for eQTL mapping. geQTL can also correct the effects of potential confounders. We also introduce efficient algorithms to scale 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28  up the computation so that the algorithms are able to tackle large scale data sets. Additionally, the proposed model provides an effective strategy to automatically infer the proper number of group-wise associations. We perform extensive experiments on both simulated datasets and yeast datasets to demonstrate the effectiveness and efficiency of the proposed method. Inferring individual and group-wise associations also helps us better explain the genetic basis of gene expression. Due to scalability issue, our model simply assume random errors between different genes are independent and have the same variance. That is the reason why current model only identified a small number of group-wise associations. Our future work will further incorporate the relationships between genes by integrating gene co-expression network or protein-protein-interaction network.