Pathways-Driven Sparse Regression Identifies Pathways and Genes Associated with High-Density Lipoprotein Cholesterol in Two Asian Cohorts

Standard approaches to data analysis in genome-wide association studies (GWAS) ignore any potential functional relationships between gene variants. In contrast, gene pathways analysis uses prior information on functional structure within the genome to identify pathways associated with a trait of interest. In a second step, important single nucleotide polymorphisms (SNPs) or genes may be identified within associated pathways. The pathways approach is motivated by the fact that genes do not act alone, but instead have effects that are likely to be mediated through their interaction in gene pathways. Where this is the case, pathways approaches may reveal aspects of a trait's genetic architecture that would otherwise be missed when considering SNPs in isolation. Most pathways methods begin by testing SNPs one at a time, and so fail to capitalise on the potential advantages inherent in a multi-SNP, joint modelling approach. Here, we describe a dual-level, sparse regression model for the simultaneous identification of pathways and genes associated with a quantitative trait. Our method takes account of various factors specific to the joint modelling of pathways with genome-wide data, including widespread correlation between genetic predictors, and the fact that variants may overlap multiple pathways. We use a resampling strategy that exploits finite sample variability to provide robust rankings for pathways and genes. We test our method through simulation, and use it to perform pathways-driven gene selection in a search for pathways and genes associated with variation in serum high-density lipoprotein cholesterol levels in two separate GWAS cohorts of Asian adults. By comparing results from both cohorts we identify a number of candidate pathways, including those associated with cardiomyopathy, and with T cell receptor and PPAR signalling.
Highlighted genes include those associated with the L-type calcium channel, adenylate cyclase, integrin, laminin, MAPK signalling and immune function.

An optimal solution for SNP coefficient $\beta_j$ is then derived from the subgradient equations

$$-\mathbf{x}_j^T\Big(\hat{\mathbf{r}}_l - \sum_{k \neq j}\mathbf{x}_k\hat\beta_k - \mathbf{x}_j\beta_j\Big) + (1-\alpha)\lambda w_l s_j + \alpha\lambda t_j = 0, \qquad j = l_1,\dots,l_{P_l}, \tag{S.2}$$

where $\hat\beta_k,\ k \neq j$, are the current estimates for the other SNP coefficients in group $l$, and $\hat{\mathbf{r}}_l = \mathbf{y} - \sum_{m \neq l}\mathbf{X}_m\hat{\boldsymbol\beta}_m$ is the group partial residual. Here $s_j$ and $t_j$ are the respective subgradients of $\|\boldsymbol\beta_l\|_2$ and $|\beta_j|$, with

$$s_j = \frac{\beta_j}{\|\boldsymbol\beta_l\|_2} \text{ if } \boldsymbol\beta_l \neq \mathbf{0},\ \|\mathbf{s}\|_2 \leq 1 \text{ otherwise}; \qquad t_j = \operatorname{sign}(\beta_j) \text{ if } \beta_j \neq 0,\ t_j \in [-1,1] \text{ otherwise}. \tag{S.3}$$

If $\boldsymbol\beta_l = \mathbf{0}$, that is group $l$ is not selected by the model, then from (S.2)

$$-\mathbf{x}_j^T\hat{\mathbf{r}}_l + (1-\alpha)\lambda w_l s_j + \alpha\lambda t_j = 0, \qquad j = l_1,\dots,l_{P_l}. \tag{S.4}$$

Substituting $\mathbf{a} = \mathbf{X}_l^T\hat{\mathbf{r}}_l$ gives

$$a_j = (1-\alpha)\lambda w_l s_j + \alpha\lambda t_j, \qquad j = l_1,\dots,l_{P_l},$$

so that

$$s_j^2 = \frac{1}{(1-\alpha)^2\lambda^2 w_l^2}\,(a_j - \alpha\lambda t_j)^2, \qquad j = l_1,\dots,l_{P_l},$$

and, since $\sum_j s_j^2 \leq 1$,

$$\sum_j (a_j - \alpha\lambda t_j)^2 \leq (1-\alpha)^2\lambda^2 w_l^2. \tag{S.5}$$

Also from (S.3), one further condition when $\boldsymbol\beta_l = \mathbf{0}$ is that $t_j \in [-1,1]$. The values $\hat t_j$ that minimise the left-hand side of (S.5) are therefore given by

$$\hat t_j = \begin{cases} a_j/(\alpha\lambda) & \text{if } |a_j| \leq \alpha\lambda \\ \operatorname{sign}(a_j) & \text{otherwise.} \end{cases} \tag{S.6}$$

Substituting for $a_j$, we can then write the values for $a_j - \alpha\lambda\hat t_j$ that minimise the left-hand side of (S.5) as $S(a_j, \alpha\lambda)$, where

$$S(z, \gamma) = \operatorname{sign}(z)\,(|z| - \gamma)_+$$

is the lasso soft-thresholding operator. Finally, we can now rewrite the condition for $\hat{\boldsymbol\beta}_l = \mathbf{0}$, (S.5), as

$$\|S(\mathbf{X}_l^T\hat{\mathbf{r}}_l, \alpha\lambda)\|_2 \leq (1-\alpha)\lambda w_l, \tag{S.7}$$

where the vector $S(\mathbf{X}_l^T\hat{\mathbf{r}}_l, \alpha\lambda) = [S(\mathbf{x}_{l_1}^T\hat{\mathbf{r}}_l, \alpha\lambda), \dots, S(\mathbf{x}_{l_{P_l}}^T\hat{\mathbf{r}}_l, \alpha\lambda)]$. Note that with $\alpha = 0$, this reduces to the group lasso group selection criterion. In the case that $\boldsymbol\beta_l \neq \mathbf{0}$, that is group $l$ is selected by the model, from (S.2) and (S.3) we see that

$$\hat\beta_j = 0 \iff |\mathbf{x}_j^T\hat{\mathbf{r}}_{l,j}| \leq \alpha\lambda. \tag{S.8}$$

For completeness, we rewrite the criterion for selecting pathway $l$ from (S.7) as

$$\|S(\mathbf{X}_l^T\hat{\mathbf{r}}_l, \alpha\lambda)\|_2 > (1-\alpha)\lambda w_l \tag{S.9}$$

and the criterion for selecting SNP $j$ in selected pathway $l$ from (S.8) as

$$|\mathbf{x}_j^T\hat{\mathbf{r}}_{l,j}| > \alpha\lambda, \tag{S.10}$$

where $\hat{\mathbf{r}}_{l,j} = \hat{\mathbf{r}}_l - \sum_{k \neq j}\mathbf{x}_k\hat\beta_k$ is the SNP partial residual, obtained by regressing out the current estimated effects of all other predictors in the model, except for predictor $j$.
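For illustration, the soft-thresholding operator and the pathway selection criterion (S.9) can be sketched in a few lines of Python. This is our own minimal sketch (function and variable names are ours, with `numpy` arrays assumed), not the paper's implementation:

```python
import numpy as np

def soft_threshold(z, gamma):
    """Lasso soft-thresholding operator S(z, gamma) = sign(z) * (|z| - gamma)_+."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def group_is_selected(X_l, r_l, alpha, lam, w_l):
    """Pathway selection criterion (S.9): group l enters the model iff
    ||S(X_l' r_l, alpha * lam)||_2 > (1 - alpha) * lam * w_l."""
    a = X_l.T @ r_l  # a = X_l' r_l, the group partial residual correlations
    return np.linalg.norm(soft_threshold(a, alpha * lam)) > (1.0 - alpha) * lam * w_l
```

With `alpha = 0` the soft-thresholding step leaves `a` unchanged, so the check reduces to the group lasso criterion `||X_l' r_l||_2 > lam * w_l`, as noted above.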
A number of methods for the estimation of $\boldsymbol\beta_l$ in the case that $\|\boldsymbol\beta_l\|_2 \neq 0$ have been proposed (Friedman, Hastie, and Tibshirani, 2010; Foygel and Drton, 2010; Liu and Ye, 2010; Simon et al., 2012). A complicating factor is the discontinuity in the first (and second) derivatives of $s_j$ at $\|\boldsymbol\beta_l\|_2 = 0$, that is where $\|\boldsymbol\beta_l\|_2$ first moves away from zero, and of $t_j$ at $\beta_j = 0$. As with GL, Friedman, Hastie, and Tibshirani (2010) describe a numerical method using coordinate descent, combining a golden section search over $\beta_j$ with parabolic interpolation. However, we find this too computationally intensive for the large datasets we wish to analyse. Simon et al. (2012) propose an accelerated block gradient descent method in which $\boldsymbol\beta_l$ is iteratively updated in a single step along the direction of steepest descent of the block objective function until convergence. We instead use a block, coordinate-wise gradient descent (BCGD) method that uses a Newton update, similar to that proposed by Zhou et al. (2010), which we describe below.
To update $\beta_j$ from its current estimate $\hat\beta_j$, we note from (S.2) and (S.3) that if $\hat\beta_j \neq 0$, the subgradient equation for predictor $j$ is given by

$$\partial_j = -\mathbf{x}_j^T(\hat{\mathbf{r}}_{l,j} - \mathbf{x}_j\hat\beta_j) + (1-\alpha)\lambda w_l \frac{\hat\beta_j}{\|\boldsymbol\beta_l\|_2} + \alpha\lambda\operatorname{sign}(\hat\beta_j). \tag{S.11}$$

We then descend along the gradient at $\hat\beta_j$ towards the minimum using Newton's method. The Newton update, $\beta_j^*$, is given by

$$\beta_j^* = \hat\beta_j - \partial_j / \partial_j', \tag{S.12}$$

where $\partial_j'$ is the derivative of (S.11) at $\hat\beta_j$. The update (S.12) is repeated until convergence.

We must also deal with the case where $\hat\beta_j = 0$. Here we adopt a slightly different strategy, since the subgradient $t_j$ of $|\beta_j|$ is not continuous at $\beta_j = 0$. We avoid this discontinuity by testing the 'directional derivatives', $\partial_j^+$ and $\partial_j^-$, respectively representing the partial derivatives at $\beta_j = 0$ in the direction of increasing and decreasing $\beta_j$. Recalling that we are dealing with the case $\|\boldsymbol\beta_l\|_2 \neq 0$, at $\beta_j = 0$ the group penalty term in (S.11) disappears. That is, once a group is selected by the model it becomes easier for each SNP coefficient to move away from zero. The two directional derivatives are then given by

$$\partial_j^+ = -\mathbf{x}_j^T\hat{\mathbf{r}}_{l,j} + \alpha\lambda, \qquad \partial_j^- = -\mathbf{x}_j^T\hat{\mathbf{r}}_{l,j} - \alpha\lambda. \tag{S.13}$$

Since the minimising function (S.1) is convex, there are three possible outcomes, and we substitute for $\partial_j$ in (S.12) accordingly:

$$\partial_j = \begin{cases} \partial_j^+ & \text{if } \partial_j^+ < 0 \\ \partial_j^- & \text{if } \partial_j^- > 0 \\ 0 & \text{otherwise.} \end{cases} \tag{S.14}$$

In the third case, $f(\boldsymbol\beta_l)$ is increasing either side of $\beta_j = 0$, so that $\hat\beta_j$ must remain at zero; in the first two cases we proceed with the standard Newton update (S.12). Finally, since the Newton update may occasionally overstep the minimum (where $\partial_j = 0$), a simple remedy proposed by Zhou et al. (2010) is to check that $f(\boldsymbol\beta_l)$ is decreasing at each iteration. If this is not the case, then the step size in (S.12) is halved. The complete algorithm for SGL estimation using BCGD is presented in Box 1.
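The coordinate update above can be sketched as follows. This is our own illustrative sketch, not the paper's code: names are ours, `r_lj` is the SNP partial residual of (S.10), and the step-halving safeguard described in the text is omitted for brevity:

```python
import numpy as np

def newton_update(j, X_l, beta_l, r_lj, alpha, lam, w_l, max_iter=50, tol=1e-12):
    """Coordinate-wise Newton update (S.12) for SNP j in a selected group,
    with the directional-derivative test (S.13)-(S.14) at beta_j = 0."""
    x_j = X_l[:, j]
    xr = x_j @ r_lj                    # x_j' r_{l,j}
    xx = x_j @ x_j
    gp = (1.0 - alpha) * lam * w_l     # group-penalty weight
    b = beta_l[j]
    for _ in range(max_iter):
        beta = beta_l.copy()
        beta[j] = b
        norm = np.linalg.norm(beta)    # ||beta_l||_2 at the current b
        if b == 0.0:
            d_plus = -xr + alpha * lam   # (S.13); group term vanishes at zero
            d_minus = -xr - alpha * lam
            if d_plus < 0:
                grad = d_plus            # minimum lies at beta_j > 0
            elif d_minus > 0:
                grad = d_minus           # minimum lies at beta_j < 0
            else:
                return 0.0               # (S.14): beta_j remains at zero
        else:
            # Partial derivative (S.11) at the current b
            grad = -(xr - xx * b) + gp * b / norm + alpha * lam * np.sign(b)
        hess = xx                        # derivative of (S.11) w.r.t. beta_j
        if gp > 0.0 and norm > 0.0:
            hess += gp * (1.0 / norm - b * b / norm**3)
        step = grad / hess
        if abs(step) < tol:
            break
        b -= step
    return b
```

With `alpha = 1` (pure lasso) and a standardised predictor, the update converges in a single Newton step to the soft-thresholded solution, which is a useful sanity check.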
One remaining practical issue is obtaining a value for $\lambda_{\max}$, the smallest value of $\lambda$ at which no groups are selected by the model. Noting that $\hat{\mathbf{r}}_l = \mathbf{y}$ when no groups are selected, from (S.7) we obtain $\lambda_l^{\min}$, the minimum value of $\lambda$ at which group $l$ is not selected, as the solution of

$$\|S(\mathbf{X}_l^T\mathbf{y}, \alpha\lambda)\|_2 = (1-\alpha)\lambda w_l. \tag{S.15}$$

We can solve this in its quadratic form by first setting an upper bound for $\lambda$ at the point $\lambda_l^*$, where the soft-thresholding function $S(\mathbf{X}_l^T\mathbf{y}, \alpha\lambda) = \mathbf{0}$, that is when no SNPs are selected by the model. We then obtain the solution by solving

$$\sum_j S(\mathbf{x}_j^T\mathbf{y}, \alpha\lambda)^2 - (1-\alpha)^2\lambda^2 w_l^2 = 0. \tag{S.16}$$

We do this using the 1-d root-finding function, brentq, in Python's scipy library. Finally, we obtain a value for $\lambda_{\max}$ as

$$\lambda_{\max} = \max_l \lambda_l^{\min}. \tag{S.17}$$
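The root-finding step can be sketched directly in Python using `scipy.optimize.brentq`, as the text describes. This is a minimal sketch under our own naming, assuming $\alpha > 0$ and column vectors as `numpy` arrays:

```python
import numpy as np
from scipy.optimize import brentq

def lambda_min_l(X_l, y, alpha, w_l):
    """Solve (S.15) in its quadratic form (S.16) for lambda_min_l, the
    smallest lambda at which group l is excluded from the model."""
    a = X_l.T @ y
    # Upper bound lambda*_l: the point at which S(X_l'y, alpha*lam) = 0,
    # i.e. every SNP in the group is thresholded out.
    lam_star = np.max(np.abs(a)) / alpha
    def g(lam):
        s = np.maximum(np.abs(a) - alpha * lam, 0.0)  # |S(a_j, alpha*lam)|
        return np.sum(s**2) - (1.0 - alpha)**2 * lam**2 * w_l**2
    # g is positive at 0 and negative at lam_star, so brentq brackets the root.
    return brentq(g, 0.0, lam_star)

# lambda_max (S.17) is then the largest lambda_min_l over all groups, e.g.:
# lam_max = max(lambda_min_l(X[:, grp], y, alpha, w[l])
#               for l, grp in enumerate(groups))
```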

S2 SGL with overlaps
We assume that $\mathbf{X}$ and $\boldsymbol\beta$ have been expanded to account for overlaps, but we drop the $*$ notation for clarity. We proceed as before by solving the block-separable optimisation (4) for each group or pathway in turn. However, for overlapping pathways, the assumption of pathway independence requires that each $\mathbf{X}_l$ $(l = 1, \dots, L)$ is regressed against the full phenotype vector $\mathbf{y}$ rather than the partial residual $\hat{\mathbf{r}}_l$. With this in mind, the revised subgradient equations for group $l$ (S.2) are given by

$$-\mathbf{x}_j^T\Big(\mathbf{y} - \sum_{k \neq j}\mathbf{x}_k\hat\beta_k - \mathbf{x}_j\beta_j\Big) + (1-\alpha)\lambda w_l s_j + \alpha\lambda t_j = 0, \qquad j = l_1,\dots,l_{P_l}.$$

The estimation for group $l$ then proceeds as described in the previous section, but with the partial residual $\hat{\mathbf{r}}_l$ replaced by $\mathbf{y}$, so that the group sparsity condition (S.7) for $\boldsymbol\beta_l = \mathbf{0}$ becomes

$$\|S(\mathbf{X}_l^T\mathbf{y}, \alpha\lambda)\|_2 \leq (1-\alpha)\lambda w_l.$$

As before, where group $l$ is selected by the model, the update for $\beta_j$, with current estimate $\hat\beta_j$, is derived from the partial derivative (S.11), which under the independence assumption is given by

$$\partial_j = -\mathbf{x}_j^T\Big(\mathbf{y} - \sum_{k \neq j}\mathbf{x}_k\hat\beta_k - \mathbf{x}_j\hat\beta_j\Big) + (1-\alpha)\lambda w_l\frac{\hat\beta_j}{\|\boldsymbol\beta_l\|_2} + \alpha\lambda\operatorname{sign}(\hat\beta_j)$$

for $j = l_1, \dots, l_{P_l}$. The Newton update (S.12) remains the same. When $\hat\beta_j = 0$, the revised directional derivatives (S.13) are given by

$$\partial_j^+ = -\mathbf{x}_j^T\Big(\mathbf{y} - \sum_{k \neq j}\mathbf{x}_k\hat\beta_k\Big) + \alpha\lambda, \qquad \partial_j^- = -\mathbf{x}_j^T\Big(\mathbf{y} - \sum_{k \neq j}\mathbf{x}_k\hat\beta_k\Big) - \alpha\lambda.$$

As before, the conditions for SNP sparsity within a selected group are determined by (S.14). The value of $\lambda_{\max}$, the smallest $\lambda$ value at which no group is selected by the model, is determined in the same way as before, since this procedure (described in (S.15), (S.16) and (S.17)) does not depend on $\hat{\mathbf{r}}_l$.
Importantly, since each group is regressed independently against the phenotype vector $\mathbf{y}$, there is no block coordinate descent stage in the estimation; the revised algorithm uses only coordinate gradient descent within each selected pathway. For this reason we use the acronym SGL-CGD for the revised algorithm, which is described in Box 2. Note that avoiding the block coordinate descent stage has the added benefit of making the new algorithm much faster than would otherwise be the case.
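The overlap expansion itself amounts to duplicating the column of any SNP that belongs to more than one pathway, so that the expanded groups are disjoint. A minimal sketch of this step, under our own assumed data layout (`pathways` as a list of lists of original column indices; the function name is ours):

```python
import numpy as np

def expand_overlaps(X, pathways):
    """Build the expanded design matrix X* by duplicating the column of any
    SNP appearing in more than one pathway, so the expanded groups are
    disjoint and contiguous.  Returns X* and the expanded group indices."""
    cols, groups, start = [], [], 0
    for grp in pathways:
        cols.extend(grp)  # shared SNP columns are duplicated here
        groups.append(list(range(start, start + len(grp))))
        start += len(grp)
    return X[:, cols], groups
```

Each expanded group can then be regressed against $\mathbf{y}$ independently, as described above.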

S3 Simulation study 1
A baseline phenotype, $\mathbf{y}$, is sampled from $N(10, 1)$. To generate SNP effects, we first select a single pathway, $\mathcal{G}_l$, at random. From this pathway we randomly select 5 SNPs to form the set $\mathcal{S} \subset \mathcal{G}_l$ of causal SNPs. At each MC simulation we generate a genetic effect and adjust $\mathbf{y}$ so that

$$y_i \rightarrow y_i + \delta\sum_{k\in\mathcal{S}}\zeta_k\, g_{ik},$$

where $g_{ik} \in \{0, 1, 2\}$ is the minor allele count for SNP $k$ in individual $i$. Here $\delta$ controls the overall additive genetic effect on phenotype $\mathbf{y}$ due to all causal SNPs in $\mathcal{S}$, and $\zeta_k$ determines the contribution from causal SNP $k$, with $\sum_{k\in\mathcal{S}}\zeta_k = 1$. In our simulations we maintain a constant overall genetic effect size across all affected phenotypes, so that $\gamma$ represents the proportionate increase in the mean value of $\mathbf{y}$ due to all genetic effects. We also set $\zeta_k = 1/5$ for $k \in \mathcal{S}$, so that the contribution from each causal SNP allele is equal. This enables us to determine $\delta$ for a given $\gamma$ as

$$\delta = \frac{5\gamma E(y)}{2\sum_{k\in\mathcal{S}} m_k}.$$

Note that for constant $\gamma$, the proportionate effect on the mean value of $\mathbf{y}$ due to SNP $k$ is MAF dependent, and is given by $2\delta\zeta_k m_k/E(y)$.
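The phenotype generation above can be sketched as follows. This is our own illustrative sketch (function name, argument layout, and the default seed are ours), assuming genotypes coded as minor-allele counts:

```python
import numpy as np

def simulate_phenotype(G, causal, maf, gamma, base_mean=10.0, rng=None):
    """Sample a baseline phenotype y ~ N(10, 1) and add the additive genetic
    effect of the causal SNPs.  G holds minor-allele counts (0/1/2), `maf`
    the minor-allele frequencies, and zeta_k = 1/|causal| gives each causal
    SNP an equal contribution, so that for |causal| = 5 the overall effect
    delta = 5 * gamma * E(y) / (2 * sum of causal MAFs)."""
    rng = np.random.default_rng(0) if rng is None else rng
    y = rng.normal(base_mean, 1.0, size=G.shape[0])
    zeta = 1.0 / len(causal)
    delta = gamma * base_mean / (2.0 * zeta * maf[causal].sum())
    return y + delta * zeta * G[:, causal].sum(axis=1)
```

Since the expected minor-allele count is $2m_k$, the expected total effect is $\delta \sum_k \zeta_k 2 m_k = \gamma E(y)$, i.e. a proportionate increase of $\gamma$ in the phenotype mean, as intended.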

S4 Weight tuning for bias reduction
For fixed $\alpha$, and with $\lambda$ tuned to select a single pathway, we need to establish which pathway enters the model first as $\lambda$ is reduced from its maximal value, $\lambda_{\max}$. From (S.17), at phenotype permutation $r$, the pathway $\hat C_r$ selected with permuted phenotype $\mathbf{y}_r$ is given by

$$\hat C_r = \arg\max_l \frac{\|S(\mathbf{X}_l^T\mathbf{y}_r, \alpha\lambda_l^{\min})\|_2}{(1-\alpha)\,w_l},$$

where each $\lambda_l^{\min}$ is computed with the permuted phenotype $\mathbf{y}_r$, using the procedure described at the end of Section S1. For $R$ permutations of the phenotype vector, $\mathbf{y}$, the empirical pathway selection frequency distribution is then given by

$$\hat\pi_l = \frac{1}{R}\sum_{r=1}^{R}\mathbb{1}\{\hat C_r = l\}, \qquad l = 1,\dots,L.$$
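A sketch of this permutation procedure in Python, reusing the soft-thresholding root-finding step from Section S1 (the helper `_lam_min`, the group layout, and the seed handling are our own assumptions):

```python
import numpy as np
from scipy.optimize import brentq

def _lam_min(X_l, y, alpha, w_l):
    """Smallest lambda at which group l is excluded, from (S.15)-(S.16)."""
    a = X_l.T @ y
    f = lambda lam: (np.sum(np.maximum(np.abs(a) - alpha * lam, 0.0)**2)
                     - (1.0 - alpha)**2 * lam**2 * w_l**2)
    return brentq(f, 0.0, np.max(np.abs(a)) / alpha)

def selection_frequencies(X, groups, y, alpha, w, R, rng=None):
    """Empirical pathway selection frequency distribution over R phenotype
    permutations: at each permutation, the selected pathway C_r is the one
    that would enter the model first, i.e. the argmax of lambda_min_l
    under the permuted phenotype y_r."""
    rng = np.random.default_rng(0) if rng is None else rng
    counts = np.zeros(len(groups))
    for _ in range(R):
        y_r = rng.permutation(y)
        lam = [_lam_min(X[:, g], y_r, alpha, w[l]) for l, g in enumerate(groups)]
        counts[int(np.argmax(lam))] += 1.0
    return counts / R
```

The returned frequencies estimate how often each pathway would be selected first under the null of no genotype-phenotype association, which is the quantity used for weight tuning.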

S5 Investigation of effect of pathway size on pathway selection
We extend the simulation framework described for Simulation Study 1 by allowing pathway size, measured by number of SNPs, to vary. Specifically, we follow the same scenario for generating genotypes, but generate 20 non-overlapping pathways varying in size from 10 to 200 SNPs, in increments of 10. Phenotypic effects are generated as previously described, after randomly selecting 5 causal SNPs from a single randomly selected pathway at each MC simulation. We use an 'intermediate' effect size, $\gamma = 0.05$ (see Figure 3 in the main text), to best assess any potential variation in pathway selection power, defined as the proportion of simulations where the causal pathway is correctly selected. We perform 10,000 MC simulations, with $\lambda$ tuned to ensure a single pathway is selected at each simulation, and with $\alpha = 0.85$. We use a uniform pathway weight vector, $\mathbf{w} = \mathbf{1}$. Pathway selection power is plotted against pathway size in Figure S1. There is no significant relationship between the two (linear regression: slope = 0.00013; $r^2$ = 0.12; $p$ = 0.13).

S6 Comparisons with HDLC SNP GWAS
Here we present some further details on our method for comparing SGL gene rankings with those obtained from separate HDLC SNP GWAS studies for the SP2 and SiMES cohorts. GWAS