Identification of gene pathways implicated in Alzheimer's disease using longitudinal imaging phenotypes with sparse regression

We present a new method for the detection of gene pathways associated with a multivariate quantitative trait, and use it to identify causal pathways associated with an imaging endophenotype characteristic of longitudinal structural change in the brains of patients with Alzheimer's disease (AD). Our method, known as pathways sparse reduced-rank regression (PsRRR), uses group lasso penalised regression to jointly model the effects of genome-wide single nucleotide polymorphisms (SNPs), grouped into functional pathways using prior knowledge of gene-gene interactions. Pathways are ranked in order of importance using a resampling strategy that exploits finite sample variability. Our application study uses whole genome scans and MR images from 464 subjects in the Alzheimer's Disease Neuroimaging Initiative (ADNI) database. 66,182 SNPs are mapped to 185 gene pathways from the KEGG pathways database. Voxel-wise imaging signatures characteristic of AD are obtained by analysing 3D patterns of structural change at 6, 12 and 24 months relative to baseline. High-ranking, AD endophenotype-associated pathways in our study include those describing chemokine, Jak-stat and insulin signalling pathways, and tight junction interactions. All of these have been previously implicated in AD biology. In a secondary analysis, we investigate SNPs and genes that may be driving pathway selection, and identify a number of previously validated AD genes including CR1, APOE and TOMM40.


Introduction
A growing list of genetic variants have now been associated with greater susceptibility to develop early and late-onset Alzheimer's disease (AD), with the AP OE 4 allele consistently identified as having the greatest effect (for an up to date list see www.alzgene.org).Recently, case-control susceptibility studies have been augmented by studies using neuroimaging phenotypes.The rationale here is that the use of heritable imaging signatures ('endophenotypes') of disease may increase the power to detect causal variants, since gene effects are expected to be more penetrant at this level (Meyer-Lindenberg and Weinberger, 2006).This 'imaging-genetic' approach has been used to identify genes associated with a range of imaging phenotypes including measures of hippocampal volume (Stein et al., 2012) and cortical thickness (Burggren et al., 2008).
AD is a moderate to highly heritable condition, yet as with many common heritable diseases, association studies have to date identified gene variants explaining only a relatively modest amount of known AD heritability (Braskie et al., 2011).One approach to uncovering this 'missing heritability' is motivated by the observation that in many cases disease states are likely to be driven by multiple genetic variants of small to moderate effect, mediated through their interaction in molecular networks or pathways, rather than by the effects of a few, highly penetrant mutations (Schadt, 2009).Where this assumption holds, the hope is that by considering the joint effects of multiple variants acting in concert, pathways genome-wide association studies (PGWAS) will reveal aspects of a disease's genetic architecture that would otherwise be missed when considering variants individually (Wang et al., 2010;Fridley and Biernacka, 2011).Another potential benefit of the PGWAS approach is that it can help to elucidate the mechanisms of disease by providing a biological interpretation of association results (Cantor et al., 2010).In the case of AD for example, an understanding of the underlying mechanisms by which gene mutations impact disease etiology may play an important role in the translation of basic AD biology into therapy and patient care (Sleegers et al., 2010).
In this paper, we present the first PGWAS method that is able to accommodate a multivariate quantitative phenotype, and apply our method to a pathways analysis of the ADNI cohort, comparing genome-wide single nucleotide polymorphism (SNP) data with voxel-wise tensor-based morphometry (TBM) maps describing longitudinal structural changes that are characteristic of AD.In this study we map SNPs to pathways from the KEGG pathways database, a curated collection of functional gene pathways representing current knowledge of molecular interaction and reaction networks (http://www.genome.jp/kegg/pathway.html).Our method is however able to accommodate alternative sources of information for the grouping of SNPs and genes, for example using gene ontology (GO) terms, or information from protein interaction networks (Wu et al., 2010;Jensen and Bork, 2010).
Many existing PGWAS methods, such as GenGen (Wang et al., 2009) and ALLIGATOR (Holmans et al., 2009) rely on univariate statistics of association, whereby each SNP in the study is first independently tested for association with a univariate quantitative or dichotomous (case-control) phenotype.SNPs are assigned to pathways by mapping them to adjacent genes within a specified distance, and individual SNP or gene statistics are then combined across each pathway to give a measure of pathway significance, corrected for multiple testing.Methods must also account for the potentially biasing effects of gene and pathway size and linkage disequilibrium (LD), and this is generally done through permutation.A potential disadvantage of these methods is that each SNP is considered separately at the first step, with no account taken of SNP-SNP dependencies.
In contrast, a multilocus or multivariate model that considers all SNPs simultaneously may characterise SNP effects more accurately by aiding the identification of weak signals while diminishing the importance of false ones (Hoggart et al., 2008).
In earlier work we developed a multivariate PGWAS method for identifying pathways associated with a single quantitative trait (Silver and Montana, 2012).We used a sparse regression model -the group lasso -with SNPs grouped into pathways.We demonstrated in simulation studies using real SNP and pathway data, that our method showed high sensitivity and specificity for the detection of important pathways, when compared with an alternative pathways method based on univariate SNP statistics.Our method showed the greatest relative gains in performance where marginal SNP effect sizes are small.Here we extend our previous model to accommodate the case of a multivariate neuroimaging phenotype.We do this by incorporating a group sparsity constraint on genotype coefficients in a multivariate sparse reduced-rank regression model, previously developed for the identification of single causal variants (Vounou et al., 2010).Our proposed 'Pathways Sparse Reduced-Rank Regression' (PsRRR) algorithm incorporates phenotypes and genotypes in a single model, and accounts for potential biasing factors such as dependencies between voxels and SNPs using an adaptive, weight-tuning procedure.
The article is presented as follows.We begin in section 2.1 with a description of the voxelwise TBM maps used in the study, and in section 2.2 we outline how we use these maps to generate an imaging signature characteristic of structural change in AD, that is able to discriminate between AD patients and controls.In section 2.3 we describe the genotype data used in the study, together with quality control procedures, and in section 2.4 we explain how this genotype data is mapped to gene pathways.The theoretical underpinnings of the PsRRR method are described in section 2.5.We explain our method for ranking AD-associated pathways, SNPs and genes using a resampling procedure in section 2.6, and discuss our strategies for addressing the significant computational challenge of fitting a regression-based model with such high dimensional datasets in section 2.7.Pathway, SNP and gene ranking results are presented in section 3, and we conclude with a discussion in section 4.

Materials and methods
Imaging and genotype data used in this study were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.ucla.edu).The ADNI was launched in 2003 by the National Institute on Aging (NIA), the National Institute of Biomedical Imaging and Bioengineering (NIBIB), the Food and Drug Administration (FDA), private pharmaceutical companies and non-profit organizations, as a 5-year public-private partnership.The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early AD.Determination of sensitive and specific markers of very early AD progression is intended to aid researchers and clinicians to develop new treatments and monitor their effectiveness, as well as lessen the time and cost of clinical trials.
Linear registration (9-parameter) was used to align the longitudinal scan series of each subject and then the mutually aligned time-series was registered to the International Consortium for Brain Mapping template (ICBM-53) (Mazziotta et al., 2001).Brain masks that excluded skull, other non-brain tissues, and the image background were generated automatically using a parameter-less robust brain extraction tool (ROBEX) (Iglesias et al., 2011).
Individual Jacobian maps were created to estimate 3D patterns of structural brain change over time by warping the skull-stripped, globally registered and scaled follow-up scan to match the corresponding screening scan.We used a non-linear, inverse consistent, elastic intensitybased registration algorithm (Leow et al., 2005), which optimizes a joint cost function based on mutual information (MI) and the elastic energy of the deformation.Color-coded maps of the Jacobian determinants were created to illustrate regions of ventricular/CSF expansion (i.e., with det J(r) > 1), or brain tissue loss (i.e., with det J(r) < 1) (Ashburner and Friston, 2003;Chung et al., 2001;Freeborough and Fox, 1998;Riddle et al., 2004;Thompson et al., 2000;Toga, 1999) over time.These longitudinal maps of tissue change were also spatially normalized across subjects by nonlinearly aligning all individual Jacobian maps to an average group template known as the minimal deformation target (MDT), for regional comparisons and group statistical analyses.
The study was conducted according to the Good Clinical Practice guidelines, the Declaration of Helsinki and U.S. 21 CFR Part 50-Protection of Human Subjects, and Part 56-Institutional Review Boards.Written informed consent was obtained from all participants before experimental procedures, including cognitive tests, were performed.

Voxel filtering
To maximise the power to detect causal pathways, we seek a phenotype which is highly representative of those structural changes in the brain that are characteristic of AD.One approach is to use prior knowledge on regions of interest (ROI) to extract a univariate quantitative measure as a disease signature (Potkin et al., 2009).We instead use a voxel-wise, data-driven approach to produce a multivariate disease signature.
We begin by selecting 464 individuals (99 AD, 211 MCI, 154 CN) with longitudinal maps at time points 6, 12 and 24 months, who have been genotyped by ADNI.Other time points are excluded because of missing observations.For each voxel and for each one of the three groups, we then fit a linear regression with an intercept term, where the dependent variable is the voxel value (change relative to baseline at screening), and the independent variable is time.The regression coefficient for the slope thus gives a summary measure of tissue change over time at each voxel.We next perform an analysis of variance (ANOVA) on each slope coefficient to identify which voxels are discriminative between AD and CN, with sex and age as covariates (MCI subjects are not used at this stage as we wish to obtain a clear imaging-derived signature for AD).We then select the most discriminative voxels whose ANOVA p-values exceed a level of 0.05, with a Bonferroni correction for multiple testing.The final set of phenotypes used in the study correspond to the voxel-wise slope coefficients for all 464 subjects (AD, MCI and CN) at the selected voxels, corrected for sex and age.

Genotype data
Genotypes for the 464 subjects in the study were obtained from the ADNI database.ADNI genotyping is performed using the Human610-Quad Bead-Chip, which includes 620,901 SNPs and copy number variations (see Saykin et al. (2010) for details).SNPs defining the APOE 4 variant are not included in the original genotyping chip, but have been genotyped separately by ADNI.These were added to the final genotype dataset.Subjects were unrelated, and all of European ancestry, and passed screening for evidence of population stratification using the procedure described in (Stein et al., 2010).We included only autosomal SNPs in the study, and additionally excluded SNPs with a genotyping rate < 95%, a Hardy-Weinberg equilibrium p-value < 5 × 10 −7 , and a minor allele frequency < 0.1.Finally, since our method does not allow for missing SNP minor allele counts, missing genotypes were imputed (see Vounou et al. (2011) for details).434,271 SNPs remained after all SNP filtering steps described above.

SNP to pathway mapping
Our SNP mapping procedure rests on the extraction of prior information from a pathways database that provides curated lists of genes, mapped to functional networks or pathways.Starting with a list of all genes that map to at least one pathway in the database, we assign SNPs to genes within a specified distance, upstream or downstream of the gene in question, and thence to pathways.This process is illustrated schematically in Fig. 1.For our AD pathways study, we are mapped to pathways using information on gene-gene interactions (top row), obtained from a gene pathways database.Many genes do not map to any known pathway (unfilled circles).Also, some genes may map to more than one pathway.(ii) Genes that map to a pathway are in turn mapped to genotyped SNPs within a specified distance.Many SNPs cannot be mapped to a pathway since they do not map to a mapped gene (unfilled squares).Note SNPs may map to more than one gene.Some SNPs (orange squares) may map to more than one pathway, either because they map to multiple genes belonging to different pathways, or because they map to a single gene that belongs to multiple pathways.
proceed as follows.A list of 21,004 human gene chromosomal locations, corresponding to human genome assembly GRCH36 was obtained using Ensembl's BIOMART API (www.biomart.org).
SNPs were then mapped to any gene within 10k base pairs, upstream or downstream of the gene in question.This resulted in 211,106 SNPs being mapped to 18,405 genes.While the majority of known genes did map to at least one SNP in our study, approximately half of the SNPs passing QC were not located within 10kbp of a known gene.For pathway mapping, we used the KEGG canonical pathway gene sets obtained from from the Molecular Signatures Database v3.0 (http://www.broadinstitute.org/gsea/msigdb/index.jsp), which contains 186 gene sets, which map to a total of 5,267 distinct genes, with many genes mapping to more than one pathway.
Note that only around 25% of all known genes map to a pathway in this dataset.We map all SNPs within 10kbp of one or more of the 5,267 pathway-mapped genes to the pathway(s) concerned.
Finally, we exclude the largest pathway, by number of mapped SNPs, ('Pathways in Cancer') that is highly redundant, in that it contains multiple other pathways as subsets.This results in 66,162 SNPs mapped to 4,425 genes and 185 pathways (see Fig. 2).The distribution of pathway sizes in terms of the number of SNPs that they map to is illustrated in Fig. 3 (left).Pathway sizes range from 57 to 5,111 SNPs (mean 949).The distribution of overlapping SNPs, that is the number of pathways to which each SNP is mapped, is illustrated in Fig. 3 (right).This ranges from 1 to 45 pathways (mean 2.65).
Note that following the above procedure, some genes previously implicated in AD studies do not map to any pathways, and thus are not included in the analysis.For example, in this study, 12 out of 30 genes highlighted in the review by Braskie et al. (2011) are mapped to pathways.Also note that since SNPs are mapped to all genes within a range of 10kbp, AD implicated SNPs may map to more than one gene, and its corresponding pathway(s).This is the case for example with a number of SNPs mapping to the APOE and TOMM40 genes.This information is summarised in Table 2.

Pathways sparse reduced-rank regeression
We consider the problem of identifying gene pathways associated with a multivariate quantitative trait (MQT) or phenotype, Y ∈ R Q .The observed values for phenotype q, measured for N unrelated individuals, are arranged in an (N × 1) response vector y q , and the Q phenotypes are arranged in an (N × Q) response matrix Y = (y 1 , . . ., y Q ).We assume minor allele counts for P SNPs are recorded for all individuals, and denote by x ij the minor allele count for SNP j on individual i.These are arranged in an (N ×P ) genotype design matrix X.We additionally assume Overlapping pathways per SNP SNPs map to multiple pathways either because they map to a gene that belongs to more than one pathway, or because they map to more than one gene belonging to more than one pathway.
Table 2: AD genes included in this study.12 out of 30 genes previously implicated with AD (Braskie et al., 2011) that are included in this study are listed in the left hand column.These are genes that (a) map to a KEGG pathway and (b) have a genotyped SNP within 10kbp.The right hand column shows neighbouring genes that map to one or more SNPs mapping to the respective AD implicated gene.
Implicated gene Mapped genes in study all phenotypes and genotypes are mean centred, and that SNP genotypes are standardised to unit variance, so that i x 2 ij = 1, for j = 1, . . ., P .If we denote by C = (C 1 , . . ., C Q ), a (P × Q) matrix of regression coefficients, then we can model the multivariate response as ( 1) where E is an (N × Q) matrix of error terms.A least squares estimate for C may be obtained by generalising the multiple least squares optimisation to include a multivariate response, that is by minimising the residual sum of squares Where N > P and the design matrix X is of full rank, the least squares estimates are given by Ĉ = (X X −1 )X Y.Note that the (P × 1) column vectors Ĉ1 , . . ., ĈQ of Ĉ are just the least squares estimates of the regression of each y q on X, that is where || • || 2 denotes the 2 (Euclidean) norm.
For high-dimensional datasets, such as those typically found in genomics, this model is unsuitable for a number of reasons.Firstly, P N , so that X X is singular and thus not invertible and the estimates Ĉq are not uniquely defined.Even where P < N , for example in a candidate gene study, LD or equivalently near multi-collinearity between predictors means that X X is nearly singular, resulting in inflated variance in SNP coefficient estimates.Furthermore, the estimation (3) is equivalent to performing Q independent regressions, and takes no account of the multivariate nature of Y. Ideally, we would like to exploit this in our estimation procedure to boost power (Breiman and Friedman, 1997;Vounou et al., 2010).
These limitations are addressed in reduced-rank regression (RRR), (Reinsel and Velu, 1998;Hastie et al., 2008) equivalent of ( 2), ( 6) where Γ is a given (q × q) positive definite matrix of weights.The choice of Γ reflects how we deal with correlation between the responses y 1 , . . ., y q in the least squares optimisation.Such correlations can be exploited by setting Γ to be the inverse of the estimated covariance of the responses.In the context of imaging genetics for example, where a voxel-wise multivariate response may be derived from structural MRI, spatial correlations between phenotypes are expected in part to reflect common genetic variation.However, the calculation of the inverse (Y Y) −1 is computationally very intensive, so in common with Vounou et al. (2010), we instead use the simplifying approximation Γ = I q , effectively assuming the responses to be uncorrelated.
We now turn to the case where all P SNPs may be mapped to L groups, G l ⊂ {1, . . ., P }, l = 1, . . ., L, for example by mapping SNPs to gene pathways (see section 2.4).We begin by assuming that pathways are disjoint or non-overlapping, that is G l ∩ G l = ∅ for any l = l .We denote the rank-1 vector of SNP regression coefficients by b = (b 1 , . . ., b P ).We additionally denote the matrix containing all SNPs mapped to pathway G l by X l = (X l1 , X l2 , . . ., X S l ), where In general, where P is large, we expect only a small proportion of SNPs to be 'causal', in the sense that they exhibit phenotypic effects.We further assume that causal SNPs will tend to be enriched within functional groups, or gene pathways.This latter assumption is illustrated schematically in Fig. 4, where causal SNPs (marked in grey) tend to accumulate within a small number of causal pathways, while the majority of pathways contain no causal SNPs.A model that generates such a sparsity pattern is said to be group-sparse, in that SNPs affecting Y are to be found in a set C ⊂ {1, . . ., L} of causal gene pathways (groups), with |C| L, where |C| denotes the cardinality of C. We seek a parsimonious model that is able to identify this set, C, of causal pathways, by imposing a group-sparsity constraint on the estimated SNP coefficient vector, b.
In sparse reduced-rank regression (sRRR) (Vounou et al., 2010(Vounou et al., , 2011)), sparse estimates for genotype and/or phenotype coefficient vectors are obtained by imposing a regularisation penalty on b and/or a respectively.Apart from the benefits of model parsimony, enforcing a sparsity constraint on b also allows us to deal with the P N case, and with multicollinearity between predictors.In our proposed 'Pathways Sparse Reduced-Rank Regression' (PsRRR) model, the required group sparsity pattern is obtained by imposing an additional group lasso penalty (Yuan and Lin, 2006) on ( 6).Group-sparse solutions to the rank-1 RRR model ( 5) are then obtained by For fixed a, this penalised least squares problem equates to a convex optimisation in b, and is thus amenable to solution using coordinate descent (Friedman et al., 2007).A global solution can then be obtained by iteratively estimating one coefficient vector (b or a), while holding the other fixed at its current value, until convergence (Chen and Chan, 2012).
Thus, for fixed b and λ, and with the additional constraint that bb = 1, we estimate â as Differentiating and setting to zero gives Similarly, for fixed a, and with the additional constraint that aa = 1, we have This is equivalent to a standard group lasso estimation problem with univariate response vector Ya .In earlier work we describe a method, 'Pathways Group Lasso with Adaptive Weights' (P-GLAW), for solving this problem, specifically tailored to the situation where predictor variables are SNPs grouped into pathways (Silver and Montana, 2012).Here, we briefly recap key points of this method, and incorporate a number of extensions designed to accommodate a MQT in the context of PsRRR with coordinate descent.
The minimising function ( 9) is convex, and can be solved using block coordinate descent (BCD) (Friedman et al., 2010), an extension of coordinate descent to convex estimation with grouped variables.BCD rests on obtaining successive estimates, b l , for each pathway in turn, while keeping current estimates for all other pathways, b k , k = l, constant, until a global minimum is obtained.
For pathway G l , l = 1, . . ., L, estimates for each SNP coefficient, b j , j = l 1 , . . ., l S l are obtained through coordinate descent within the group.The group lasso estimation algorithm using BCD is presented in Box 1.
Box 1 Ω(a, Y, X, λ): GL estimation algorithm using BCD The full PsRRR estimation algorithm is presented in Box 2.
Box 2 Rank-1 PsRRR estimation algorithm using coordinate descent 1. a ← 1/||1|| 2 2. repeat: 8. until b and a converge Note that we set the regularisation parameter, λ, to be a constant fraction (γ) of the maximal value, λ max , where no groups are selected by the model.
A key feature of our P-GLAW method is the need to accommodate the fact that pathways overlap, that is G l ∩ G l = ∅ for some l = l , since SNPs may map to multiple pathways.To enable the independent selection of pathways, we instead require that groups are disjoint (Jacob et al., 2009).This is achieved through an expansion of the design matrix, X, formed from the column-wise concatenation of the L sub-matrices of size (N × S l ), so that X = [X 1 , X 2 , . . ., X L ].
This expanded X has dimensions (N × P * ), with P * = l S l .A corresponding expansion of the Another issue that we address is the problem of pathway selection bias, by which we mean the tendency of the group lasso to favour the selection of specific pathways, under the null, where no SNPs influence the phenotype.Such biases can arise for example from variations in the number of SNPs or genes in pathways, and varying patterns of dependence (LD) between SNPs within pathways.Under the null, with the regularisation parameter λ tuned so that a single pathway is selected, pathway selection probabilities should follow a uniform distribution, namely with probability Π l = 1/L, for l = 1, . . ., L. However, where biasing factors are present, the empirical probability distribution, Π * will not be uniform.Our iterative weight tuning procedure works by applying successive adjustments to the pathway weight vector, ω = (ω 1 , . . ., ω L ), so as to reduce the difference, d l = Π * l (ω)−Π l , between the unbiased and empirical (biased) distributions for each pathway.We begin with an initial weight vector, ω (0) l = √ S l , which corrects for the biasing effect of group size in the group lasso model (Silver and Montana, 2012).At iteration τ , we compute the empirical pathway selection probability distribution Π * (ω (τ ) ) over multiple model fits with permuted phenotypes, and compute d l for each pathway.We then apply the following weight adjustment w where the paramater η controls the maximum amount by which each w l can be reduced in a single iteration, in the case that pathway G l is selected with zero frequency.The square in the weight adjustment factor ensures that large values of |d l | result in relatively large adjustments to w l .
Iterations continue until convergence, where L l=1 |d l | < .Even when relatively few SNPs or genes are associated with the phenotype, we can expect multiple pathways to harbour genetic effects since many SNPs and genes overlap multiple pathways.Where more than one pathway is selected by the model, we therefore expect that pathway selection probabilities will not be uniform, since the presence of overlapping SNPs means that pathways are not independent.Instead, selection probabilities will reflect the pattern of overlaps corresponding to the distribution of causal SNPs (or spurious associations under the null).This non-uniform distribution of selection probabilities is to be expected and is in fact desirable, since a signal corresponding to causal SNPs or genes should be captured in each and every pathway that contains them.We have shown in extensive simulation studies, that where more than one pathway is selected, the weight tuning process described above leads to substantial gains in both sensitivity and specificity when identifying causal pathways (Silver and Montana, 2012).
Estimates for b and a respectively represent the first (rank 1) latent factors that are expected to capture the strongest signal of association between gene pathways and the phenotype.In principle, it is possible to capture further latent factors of diminishing importance, by iteratively repeating the procedure described above, after regressing out the effects of previous factors (Vounou et al., 2010).With PsRRR, the estimation of further ranks is complicated by the need to recalibrate the group weights at each step, and by the typically large number of SNPs in selected pathways.For this reason we consider only the first latent factor in this study.

Pathway ranking
With most variable selection methods, a choice for the regularisation parameter, λ, must be made, since this determines the number of variables selected by the model.Common strategies include the use of cross validation to choose a λ value that minimises the prediction error between training and test datasets (Hastie et al., 2008).One drawback of this approach is that it focuses on optimising the size of the set, Ĉ, of selected pathways (more generally, selected variables) that minimises the cross validated prediction error.Since the variables in Ĉ will vary across each fold of the cross validation, this procedure is not in general a good means of establishing the importance of a unique set of variables (Vounou et al., 2011).Alternative approaches, based on data resampling or bootstrapping have been demonstrated to improve model consistency, in the sense that the 'true' variables are selected with a high probability (Bach, 2008;Meinshausen and Bühlmann, 2010).We adopt a resampling approach, in which we calculate pathway selection frequencies by repeatedly fitting the model over B subsamples of the data, at a fixed value for λ.Selection frequencies are expected to be relatively insensitive to the choice of λ, provided that it is small enough to ensure that the set C of causal pathways is selected with a high probability at each subsample (Meinshausen and Bühlmann, 2010).

SNP and gene ranking
The PsRRR model is designed to identify important pathways which may contain multiple genetic markers with varying effect sizes.However, it is still interesting to establish which SNPs and genes are most predictive of the response amongst those mapped to the set Ĉ(b) of selected pathways at subsample b.Note that these are not necessarily the SNPs and genes that are driving the selection of any particular pathway in the PsRRR model.
To rank SNPs and genes, we perform a second level of variable selection using sRRR with a lasso penalty (Vounou et al., 2011).We first form the reduced (N × Z (b) ) matrix X Ĉ(b) , with columns {X j : j ∈ l∈ Ĉ(b) G l } corresponding to all SNPs in pathways selected at subsample b.
Sparse estimates for the corresponding SNP coefficient vector, β, and rank-1 phenotype vector α then satisfy the equivalent of ( 8) with a lasso penalty, namely β, α = arg min We denote the set of SNPs selected at sample b by S (b) , and further denote the set of selected genes to which the SNPs in S (b) are mapped by φ (b) ⊂ Φ, where Φ = {1, . . ., G} is the set of gene indices corresponding to all G mapped genes.Using the same strategy as for pathway ranking, we obtain an expression for the selection probability of SNP j across B subsamples as where the indicator variable, b) , and 0 otherwise.A similar expression for the selection probability for gene g is g where the indicator variable, b) , and 0 otherwise.SNPs and genes are then ranked in order of their respective selection frequencies.

Computational Issues
All computer code is written in the open source Python programming language, using Numpy and SciPy modules which are optimised for efficient operation with large matrices.Execution of the PsRRR estimation algorithm nonetheless presents a considerable computational burden, both in terms of processor time and memory use.We therefore implement a number of strategies designed to increase computational efficiency (see Silver and Montana (2012) for details).We use a Taylor approximation of the group penalty that avoids the need for computationally intensive numerical search methods (Breheny and Huang, 2009;Friedman et al., 2010).In addition, we use an 'active set' strategy (Tibshirani et al., 2010;Roth and Fischer, 2008), that identifies a subset of pathways that are more likely to be selected by the model at a given λ.Model estimation then proceeds with this reduced set, followed by a final check to ensure that no other pathways should have been included in the active set in the first place.Depending on the choice of λ, this can lead to substantial gains in computational efficiency and a large reduction in memory requirements, resulting from the very much reduced size of X in Ω(a, Y, X, λ).
The need to fit a large number of PsRRR models over multiple subsamples of the data for pathway ranking presents another major computational bottleneck.However, the fact that each subsample is generated entirely independently presents an opportunity for performing multiple model fits in parallel.We implement such a strategy using a computer cluster, in which a single client node distributes subsamples across 8 server nodes.Parallel computations and client-server communication are implemented in Parallel Python (http://www.parallelpython.com/).The reduction in computation time due to parallelisation is considerable.For example, in the AD study described in this paper, total execution time (excluding weight tuning) with B = 1000 subsamples was 6 1 2 hours, whereas total execution time if each job were run separately would be approximately 10 1 2 days.

AD associated phenotypes
An imaging signature characteristic of AD was created using the procedure described in section 2.2.As described previously, we begin by computing a linear least-squares fit of the longitudinal structural change across 3 time points at each voxel.An illustration of average slope coefficients, and their variation between subjects, is shown in Fig. 5. Increased expansion of ventricular volumes is clear in all subjects, but this increase is most marked in AD patients, where ventricular volumes expand by an average 1.2% per year (white regions in left hand part of Fig. 5).AD patients also show the most variation in structural change over time.
A statistical image showing the corresponding ANOVA p-values, a measure of the extent to which each voxel is able to discriminate between ADs and CNs, is shown in the top row of Fig. 6.
From the Q * = 2, 153, 231 voxels in this image, we extract a final set of Q = 148, 023 voxels whose p-values exceed a Bonferroni-corrected threshold of 0.05/Q * .This final set of voxels that are most discriminative between ADs and CNs, are highlighted in yellow in the bottom row of Fig. 6.These Q voxels constitute the phenotype for each subject used in the study.We visualise the Euclidean distances between subjects using the selected voxels in a 3D multi-dimensional scaling plot in Fig. 7.
To verify the power of the set of selected voxels to discriminate between ADs and CNs, we used a linear classifier, with Gaussian, class-conditional densities and common diagonal covariance matrices.The performance of the linear classifier was assessed using 10-fold cross-validation, giving  figures for accuracy, sensitivity and specificity of 85.0%, 78.8% and 89.0%, respectively.While optimisation of classification performance was not our primary concern, these figures compare well with a range of high dimensional classification methods cited in the literature (Liu et al., 2012;Cuingnet et al., 2011)

Pathway, SNP and gene rankings
We use the PsRRR algorithm described in section 2.5 to identify KEGG pathways associated with the AD-discriminative longitudinal phenotypes described in the preceding section.Pathways are ranked in order of importance using the pointwise stability selection method described in section 2.6, with B = 1000 subsamples.We use λ = 0.8λ max , which results in the selection of an average of 7 pathways at each subsample (min 1, max 15, SD = 2.3).Pathway ranking results are presented in Table 3. SNPs and genes are ranked using sRRR with a lasso penalty on the SNP coefficient vector, as described in section 2.6.Lasso selection is performed on pathways selected at each subsample in the pathways analysis described above, so that once again B = 1000.The number of SNPs, Z (b) , included in the lasso model at subsample b varies according to the number and size (in terms of the number of mapped SNPs) of selected pathways.Z (b) ranges from a minimum of 132, to a maximum of 21,158 (mean = 9,835; SD = 3,729).As with pathway ranking, we use λ = 0.8λ max , which results in the selection of an average of 13.1 SNPs at each subsample (min 1, max 75, SD = 15.5).We first consider the pathway ranking results in Table 3.Under the null, where there is no association between phenotypes and genotypes, and with a single pathway selected by the model at each subsample, the expected pathway selection frequency distribution is uniform, with, π path l = 1/185 ≈ 0.005.As explained in section 2.5, where more than one pathway is selected at each subsample the selection frequency distribution will depend on the distribution of causal SNPs and genes, and will not be uniform.For this reason, while we report pathway selection frequencies, , the main focus is on pathway rankings.To aid interpretation of pathway rankings, for each pathway, we list those genes in the pathway that are ranked in the top 30 genes, selected by lasso selection (in Table 4).
In the final column of Table 3 we list genes in the top ranked pathways that have previously been linked to AD.Both the number of such genes affecting phenotypes in this study, and the extent to which these genes may drive pathway selection are unknown.It is nevertheless interesting to consider whether these genes are significantly enriched amongst high-ranking pathways.To do this we calculate an average ranking for each 'AD gene' by taking the average rank achieved by all pathways containing the gene in question.We then derive an AD gene enrichment score by summing average AD gene ranks across all AD genes.A lower score thus indicates pathways containing AD genes tend to be ranked high.We compare this empirically derived score with the distribution of scores obtained by permuting pathway rankings 100,000 times.The null distribution of this enrichment score (obtained by permutation), and the empirically observed value are compared in Fig. 8. Finally, we compute a p-value for the null hypothesis that the empirically observed enrichment score has arisen by chance, as the proportion of enrichment scores obtained through permutation that are lower than the observed value.This gives a value p = 0.0094.

Discussion
We describe a method for the identification of gene pathways associated with a multivariate quantitative trait (MQT).Here, we extend previous work modelling a univariate response, where we showed that a multivariate, group-sparse modelling approach can demonstrate increased power to detect causal pathways, when compared to conventional approaches that begin by modelling individual SNP-phenotype associations (Silver and Montana, 2012).We apply our method in an AD gene pathways study using imaging endophenotypes, but our method is not restricted to the case of biological pathways or imaging phenotypes, and can be applied to any data in which we seek to identify sparse groups of predictors affecting a multivariate response.
In any method modelling effects on an MQT, the use of a multivariate disease signature that is characteristic of the disease under investigation is important.This is especially so in the case of high-dimensional imaging phenotypes, where a poorly characterised imaging signature with low signal to noise ratio may show no advantage over a simple ROI average-based approach (Vounou et al., 2011).In this study we extract an AD imaging phenotype that is highly discriminative of subjects with the disease, compared to controls.
Of the pathways identified as being associated with these AD endophenotypes (see Table 3), functions associated with many of the top 10 ranked pathways have been linked to aspects of AD biology described in the literature, including chemokine signalling, Jak stat signalling, tight junction protein action, calcium and insulin signalling (Xia and Hyman, 1999;Kim et al., 2003;Huber et al., 2001;de la Monte and Wands, 2005;Steen et al., 2005;Ravetti et al., 2010).
In order to better elucidate which genes may be driving pathway selection, we performed a follow up analysis designed to identify SNPs and genes in selected pathways that are separately associated with the phenotype (see Table 4).Since these gene (and associated SNP) rankings are derived from lasso selection of all SNPs within selected pathways, irrespective of their 'group' structure within pathways, they are expected to capture larger, independent signals of association, and not necessarily all the salient signals within a particular pathway that may be driving pathway selection.In particular, the group lasso is designed to detect distributed signals that may not be highlighted using lasso selection.From this analysis, it is clear that the lipid kinase genes PIK3R3/PIK3CG, and the calcium-activated, phospholipid-dependent genes PRKCA/PRKCB are important in driving selection of many pathways in the top 30 ranks.All these genes have previously been linked in gene expression studies with β-amyloid plaque formation in the AD brain (Liang et al., 2008).Aside from the validated AD risk gene CR1 (see below), other top-ranked genes occurring more than once in the top 10 ranking pathways, include ADCY2 and GNAI1, both of which have also been associated with AD in gene expression studies (Ravetti et al., 2010;Taguchi et al., 2005).
Of particular note amongst top-ranking pathways is the citrate (tca) cycle, since this contains no genes identified in the separate gene-ranking analysis.This suggests that selection of this pathway might be driven by signals with distributed small effects, rather than signals with larger marginal effects.Interestingly, this pathway is ranked first if the analysis is run selecting only a single pathway at each subsample (data not shown).A number of studies have suggested links between aberrations in the tca cycle and AD (Atamna and Frey, 2007).
Turning to the SNP and gene rankings in Table 4, the top ranking gene is the complement component receptor gene CR1, and many of the top ranking SNPs also map to this gene.CR1 has been identified as one of the major AD risk genes (Lambert et al., 2009;Sleegers et al., 2010), and has been associated with changes in hippocampal volume in AD (Biffi et al., 2010).As well as the high-ranking PIK3R3/PIK3CG/PRKCA/PRKCB genes discussed above, the major AD risk and phenotype-related gene APOE, and risk allele APOE 4 are respectively ranked 10 and 12.In our study the APOE gene maps to a single pathway, the KEGG Alzheimer's disease pathway, and this pathway is selected in ≈ 14% of subsamples.Notably, in all subsamples in which the KEGG Alzheimer's disease pathway is selected, the APOE 4 allele is the sole selected SNP, confirming the known large marginal effect of this allele on AD phenotypes.The fact that the Alzheimer's disease pathway is not ranked higher may reflect the fact that selection of this pathway is driven by the presence of this single, strong APOE 4 signal, and as explained above, the model is designed to identify distributed signals across a pathway.The slightly higher ranking of the APOE 4 SNP, relative to the APOE gene, reflects the fact that this SNP also maps to the TOMM40 gene, which occurs in a number of other pathways selected by the model.The TOMM40 and APOE genes are in LD, and there is evidence of their interaction in AD (Rajagopalan et al., 2012).
Our model rests on a number of assumptions, and as a consequence will fail to detect a number of different association signals.For example, while our model implicitly accommodates the fact that SNPs and genes interact within functional pathways, we do not explicitly model interaction effects.Also, we make the simplifying assumption that voxel-wise measures of atrophy are uncorrelated.In reality, the phenotype will exhibit a complex correlation structure which will affect the association signal.Vounou et al. (2010) have demonstrated that even under this simplifying assumption, significant gains in power can be achieved by modelling a multivariate phenotype, compared to a mass univariate modelling approach.Finally, our model is founded on the assumption that causal SNPs tend to accumulate within functional pathways, and as such is not designed to identify significant marginal effects, as evidenced by its failure to rank the high-risk AP OE gene highly.For this last reason, any pathways analysis should be seen as being complementary to conventional GWAS approaches.
This study also demonstrates some of the limitations of pathways studies in general.Many genes previously implicated in AD do not map to known pathways in our study, so that these genes and their associated SNPs, many of which are well validated, are excluded.This relative sparsity of gene-pathway annotations reflects the fact that our understanding of how the majority of genes functionally interact is at an early stage.As a consequence, annotations from different pathways databases often vary (Soh et al., 2010), and in any case are undergoing rapid change.
Pathways studies also suffer from a lack of benchmark datasets to compare different methods, and inevitably fail to capture the true complexity of dynamic interactions between pathways, by taking a snapshot of a biological system at a given point in time (Khatri et al., 2012).

Figure 1 :
Figure1: Schematic illustration of the SNP to pathway mapping process.(i) Known genes (green circles) are mapped to pathways using information on gene-gene interactions (top row), obtained from a gene pathways database.Many genes do not map to any known pathway (unfilled circles).Also, some genes may map to more than one pathway.(ii) Genes that map to a pathway are in turn mapped to genotyped SNPs within a specified distance.Many SNPs cannot be mapped to a pathway since they do not map to a mapped gene (unfilled squares).Note SNPs may map to more than one gene.Some SNPs (orange squares) may map to more than one pathway, either because they map to multiple genes belonging to different pathways, or because they map to a single gene that belongs to multiple pathways.

Figure 3 :
Figure3: Left: Pathway sizes.Distribution of KEGG pathways, by the number of ADNI SNPs that they map to.Right: SNP overlaps.Distribution of ADNI SNPs, by the number of pathways that they map to.SNPs map to multiple pathways either because they map to a gene that belongs to more than one pathway, or because they map to more than one gene belonging to more than one pathway.
, by restricting the rank of the coefficient matrix C. Specifically we impose the constraint that C has rank r < min(P, Q), and rewrite C as C = BA, where A and B both have (full) rank r.The reduced rank form of (1) is then given by (4) Y = XBA + E where B and A are (P × r) and (r × Q) matrices of regression coefficients respectively relating to genotypes and phenotypes.This model has the interesting interpretation of exposing r hidden or latent factors, which capture the major part of the relationship between Y and X.If we denote by B (k) , the kth column of B, then we see that the products XB (k) , k = 1, . . ., r, represent r linear combinations of the P predictor variables.Similarly, the r row vectors, A (k) , k = 1, . . ., r, represent the transformation of each of these back to the dimensions of Y, so that they can predict the response.The linear combinations XB (k) and YA (k) thus represent a reduced set of r (latent) factors that capture the relationship between response and predictors, reduced in the sense that this set has dimensionality r < min(P, Q).As a first approximation, we consider the rank-1 RRR model which captures the first set of genotype and phenotype latent factors describing the association between X and Y.With r = 1, we rewrite (4) as (5) Y = Xba + E where b and a are (P × 1) and (1 × Q) coefficient vectors respectively relating to genotypes and phenotypes.Least squares estimates for b and â are then obtained by minimising the rank-1 is the column vector of observed SNP minor allele counts for SNP j, and S l is the number of SNPs in G l .Finally, we denote the corresponding vector of SNP coefficients by b l = (b l1 , b l2 , . . ., b S l ).

Figure 4 :
Figure 4: Group-sparse distribution of causal SNPs.The set S ⊂ {1, . . ., P } of causal SNPs influencing the phenotype are represented by boxes that are shaded grey.Causal SNPs are assumed to occur within a set C of causal pathways.Here C = {2, 3}.Note that the particular distribution of causal SNPs may vary for each individual, i = 1, . . ., N .The group sparsity assumption is that |C| L.
b l converges 13. until b converges As λ increases, fewer groups (or pathways) are selected by the model (Box 1, step 5), while for selected pathways with b l = 0, estimated SNP coefficients, b j , j = l 1 , . . ., S l , tend to shrink towards zero (Box 1, step 11).
We denote the set of selected pathways at subsample b by Ĉ(b) = {l : b (b) l = 0} b = 1, . . ., B where b (b) l is the estimated SNP coefficient vector for pathway l at subsample b.The selection probability for pathway l measured across all B subsamples is then l ∈ Ĉ(b) , and 0 otherwise.Pathways are ranked in order of their selection probabilities, π path l1 ≥, . . ., ≥ π path l L .

Figure 5 :Figure 6 :
Figure 5: Sample mean (left) and standard deviation (right) of slope coefficients for the 3 subject groups.Slope coefficients represent a linear approximation of change in brain volume over time.Scales represent 10× percentage change in voxel volume per year, so that for example a slope coefficient of 12 (white areas in left hand plot) is equivalent to an average yearly increase in voxel volume of 1.2%.

Figure 7 :
Figure 7: 3D multi-dimensional scaling plot illustrating the spread of imaging signatures across ADs and CNs.Imaging signatures correspond to selected voxels only.

Figure 8 :
Figure 8: Measure of extent to which genes previously linked to AD are enriched in highly-ranked pathways.The histogram shows the distribution of AD gene enrichment scores obtained when permuting pathway rankings 100,000 times.The vertical black line indicates the observed AD gene enrichment score using the true pathway rankings obtained in the study.From this we derive a p-value indicating the probability that the empirical AD gene enrichment score could arise by chance as p = 0.0094.AD-linked genes are those identified in Braskie et al. (2011).
clinical sites in Canada.Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org).The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer's Disease Cooperative Study at the University of California, San Diego.ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of California, Los Angeles.This research was also supported by NIH grants P30 AG010129, K01 AG030514, and the Dana Foundation.

Table 3 :
Top 30 pathways, ranked by pathway selection frequency.

Table 4 :
Top 30 SNPs and genes, respectively ranked by SNP and gene selection frequency, using lasso sRRR.Note the APOE gene is selected at a lower frequency than the APO 4 SNP, since in a significant minority of subsamples the allele is selected in a pathway where it is mapped to the TOMM40 gene only.