Improved Lasso for genomic selection

ANDRÉS LEGARRA; CHRISTÈLE ROBERT-GRANIÉ; PASCAL CROISEAU; FRANÇOIS GUILLAUME; SÉBASTIEN FRITZ

doi:10.1017/S0016672310000534

Published online by Cambridge University Press: 14 December 2010

ANDRÉS LEGARRA*: Affiliation:
INRA, UR 631 SAGA, F-31326 Castanet-Tolosan, France
CHRISTÈLE ROBERT-GRANIÉ: Affiliation:
INRA, UR 631 SAGA, F-31326 Castanet-Tolosan, France
PASCAL CROISEAU: Affiliation:
INRA, UMR1313 GABI, F-78352 Jouy en Josas, France
FRANÇOIS GUILLAUME: Affiliation:
Institut de l'Elevage, F-75595 Paris, France
SÉBASTIEN FRITZ: Affiliation:
UNCEIA, F-75595 Paris, France
*Corresponding author. INRA, UR 631 SAGA, BP52627, F-31326 Castanet Tolosan, France. Tel: +33561285182. Fax: +33561285353. e-mail: andres.legarra@toulouse.inra.fr


Summary

Empirical experience with genomic selection in dairy cattle suggests that the distribution of the effects of single nucleotide polymorphisms (SNPs) might be far from normality for some traits. An alternative, avoiding the use of arbitrary prior information, is the Bayesian Lasso (BL). Regular BL uses a common variance parameter for residual and SNP effects (BL1Var). We propose here a BL with different residual and SNP effect variances (BL2Var), equivalent to the original Lasso formulation. The λ parameter in Lasso is related to genetic variation in the population. We also suggest precomputing individual variances of SNP effects by BL2Var, to be later used in a linear mixed model (HetVar-GBLUP). Models were tested in a cross-validation design including 1756 Holstein and 678 Montbéliarde French bulls, with 1216 and 451 bulls used as training data; 51 325 and 49 625 polymorphic SNP were used. Milk production traits were tested. Other methods tested included linear mixed models using variances inferred from pedigree estimates or integrated out from the data. Estimates of genetic variation in the population were close to pedigree estimates in BL2Var but not in BL1Var. BL1Var shrank breeding values too little because of the common variance. BL2Var was the most accurate method for prediction and accommodated well major genes, in particular for fat percentage. BL1Var was the least accurate. HetVar-GBLUP was almost as accurate as BL2Var and allows for simple computations and extensions.

Type: Research Papers
Information: Genetics Research, Volume 93, Issue 1, February 2011, pp. 77-87

DOI: https://doi.org/10.1017/S0016672310000534
Copyright: Copyright © Cambridge University Press 2010

1. Introduction

Genome-wide strategies for genetic evaluation can be roughly divided into BLUP-like methods (postulating normal distribution of single nucleotide polymorphism (SNP) effects) and variable selection methods using more sophisticated distributions. The seminal paper of Meuwissen et al. (2001) already made this distinction, by creating BLUP and Bayes (A, B) methods. In the first group, marker effects are assigned normal distributions with zero mean and identical variance for all markers. This results in nice properties, like simplicity of computations and, in particular, an equivalent model using a ‘genomic’ relationship matrix (Van Raden, 2008). The latter can be meshed with additive relationship matrices and extended to the whole pedigree (Legarra et al., 2009). Further, under mild assumptions, equivalences exist between genetic variances in an additive relationship model and marker variances (Gianola et al., 2009).

However, at least for some traits, it has been shown that departures of SNP effects from normality exist. This results in (and can be seen by) higher accuracy of methods with more sophisticated a priori distributions of the marker effects, like BayesA or non-linear regression (Hayes et al., 2009; Van Raden et al., 2009b). These methods are sometimes called ‘Bayesian methods’ (Lund et al., 2009). This is inappropriate, because BLUP is also a Bayesian method, and also because they have frequentist counterparts (e.g. Usai et al., 2009). Thus, we will call them ‘variable selection methods’ because most of them assume values of most SNP effects to be zero or close to zero. Another property of variable selection methods, shown in simulations, is that these methods have better properties in the long run, that is, estimates of SNP effects are stable after several generations (Habier et al., 2007). In addition, small (and possibly much cheaper) subsets of markers chosen by variable selection methods have been shown to be of acceptable accuracy (Weigel et al., 2009). Thus, variable selection methods are being heavily used in simulations (Meuwissen et al., 2001; Calus et al., 2008; Kizilkaya et al., 2010) and in real data analysis (Hayes et al., 2009; Van Raden et al., 2009b).

Most variable-selection methods nevertheless require a priori distributions or tuning parameters. These include the number of SNPs a priori in the model and its variance (Meuwissen et al., 2001; Verbyla et al., 2009); or the ratio of variances of SNPs ‘in’ or ‘out’ (Calus et al., 2008; Verbyla et al., 2009); or the a priori variance of SNP effects (Kizilkaya et al., 2010). No clear clue, based on biological knowledge, exists about these a priori distributions. This complicates their practical application.

The Lasso (least absolute shrinkage and selection operator; Tibshirani, 1996) combines variable selection and shrinkage. Its Bayesian counterpart, the Bayesian Lasso (Park & Casella, 2008), provides a more natural interpretation in terms of a priori distributions. It is well known that, generally, conditional expectations are optimal for selection (Gianola & Fernando, 1986). These can be obtained through the Bayesian Lasso but not the regular Lasso. Also, in particular, Bayesian Lasso provides a fully parametric model with a simple Gibbs sampler implementation, as well as an EM algorithm for the estimation of the ‘sharpness’ parameter λ, needing little (or no) prior information. Thus, Bayesian Lasso is an attractive candidate for genomic selection because of its simplicity, computational ease and little (or no) need to postulate prior information. Further, the exponential distribution of the Lasso is thought to reflect reasonably well the nature of quantitative trait locus (QTL) effects (Goddard, 2008).

The (Bayesian or not) Lasso has been used in an animal breeding context (de los Campos et al., 2009; Usai et al., 2009; Weigel et al., 2009), although a broad comparison with related methods using several traits and a real, large data set has not yet been published. In addition, we find that the particular case of Park and Casella's Bayesian Lasso includes a common variance term for modelling both residual terms and effects in the model, instead of two different variances. We find that this parameterization is not optimal. The purpose of this paper is thus threefold. First, to propose and compare a different, more general, model for the Bayesian Lasso, which in fact is equivalent to Tibshirani's (1996) original Lasso. This model implies different variances for residual terms and for SNP effects. Second, an alternative linear model for genomic prediction will be presented and tested empirically; in this model individual SNP variances are inferred via the Bayesian Lasso first and then used in a BLUP-like estimator. Third, we compare the performance of these models with a more standard ‘genomic BLUP’ (GBLUP), either fixing the variance for the marker effects from pedigree estimates by applying a rough equivalence (Gianola et al., 2009), or inferring and integrating it out from the data via the Gibbs sampler.

2. Parameterization of the Bayesian Lasso

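The base of the Lasso is a typical linear model of the form:

$$\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{a} + \mathbf{e}, \qquad \mathbf{e} \mid \sigma^2 \sim \mathrm{MVN}(\mathbf{0}, \mathbf{I}\sigma^2)$$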

where b are fixed effects (e.g. an overall mean), a are, in this work, SNP effects and MVN stands for multivariate normal. The originality of the Lasso lies in how the effects a are modelled. In the Bayesian Lasso, the distribution of (a single) SNP effect a is modelled as

$$p(a \mid \lambda, \sigma) = \frac{\lambda}{2\sigma} \exp\left(-\frac{\lambda |a|}{\sigma}\right)$$

In the classical Lasso (Tibshirani, 1996) this distribution is actually p(a|σ², λ)=(λ/2)exp(−λ|a|); however, Tibshirani (1996) assumes that incidence matrix Z has been standardized, which is not assumed here or in the Bayesian Lasso.

Finally, in Bayesian Lasso the variance of a is Var(a)=2σ²/λ².
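
The variance expression can be checked numerically. The following is a minimal sketch, assuming the prior on a is a double-exponential (Laplace) density with scale σ/λ; the function name `laplace_prior_variance` and the example values of σ² and λ are hypothetical and only serve the illustration.

```python
import numpy as np

def laplace_prior_variance(sigma2: float, lam: float) -> float:
    """Var(a) = 2*sigma^2/lambda^2 for a Laplace prior with scale sigma/lambda."""
    return 2.0 * sigma2 / lam**2

# Illustrative values only (not taken from the paper)
sigma2, lam = 4.0, 5.0

rng = np.random.default_rng(1)
# numpy parameterizes the Laplace density by its scale b; its variance is 2*b**2
draws = rng.laplace(loc=0.0, scale=np.sqrt(sigma2) / lam, size=1_000_000)

print(laplace_prior_variance(sigma2, lam))  # analytical variance
print(draws.var())                          # Monte Carlo variance, close to the above
```

A larger λ therefore concentrates the prior around zero, while a larger σ² spreads it out, which is why tying σ² to the residual variance also ties the spread of SNP effects to it.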

Intriguingly, as shown in the expressions above, in Bayesian Lasso applications in genomic selection (e.g. de los Campos et al., 2009; Weigel et al., 2009) the variance σ² has been used at the same time to model the residual term as well as the distribution of the SNP effects. However, we do expect the distribution of SNP effects not to be related to unobservable, unaccounted (residual) effects that can, for example, vary from site to site for the same individuals. Assume, for instance, a crop trial design in which some varieties are tested. Each variety can be tested 1 or 100 times. If the phenotype to be analysed is the average yield of the variety, everything else being equal, the residual variance is expected to be divided by 100 in the second option (the residual standard deviation by 10), but the variation across SNP effects is unchanged. Another example is as follows. Assume that a set of dairy bulls is tested in two different locations, the second with less frequent milk recording. The second location will show higher residual variation for milk yield, whereas genetic variation in the bulls will be the same.

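The implementation of the Bayesian Lasso in Park & Casella (2008) does not take this into account. A more general implementation would split the sources of variation into purely residual variation (σ_e²) and variation due to SNPs (σ_a²), by rewriting the model as

$$\mathbf{y} \mid \mathbf{b}, \mathbf{a}, \sigma_e^2 \sim \mathrm{MVN}(\mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{a}, \mathbf{I}\sigma_e^2), \qquad p(a \mid \lambda, \sigma_a) = \frac{\lambda}{2\sigma_a} \exp\left(-\frac{\lambda |a|}{\sigma_a}\right)$$

However, this is clearly equivalent to

$$\mathbf{y} \mid \mathbf{b}, \mathbf{a}, \sigma_e^2 \sim \mathrm{MVN}(\mathbf{X}\mathbf{b} + \mathbf{Z}\mathbf{a}, \mathbf{I}\sigma_e^2), \qquad p(a \mid \lambda) = \frac{\lambda}{2} \exp\left(-\lambda |a|\right)$$

which is the form of Tibshirani's (1996) original Lasso, because only the ratio λ/σ_a is used and thus they cannot be estimated separately. Equivalently, the model could be written in terms of σ_a² by dropping λ.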

In the original Lasso, cross-validation is used for the estimation of λ (Usai et al., 2009). Park & Casella (2008) proposed a fully parametric implementation by computing a posterior distribution (using the Gibbs sampler) or an empirical Bayes estimation by marginal maximum likelihood by a Monte Carlo Expectation–Maximization (MCEM) algorithm. The latter avoids the problem of choosing a hyperprior for λ, pointed out by both Park & Casella (2008) and de los Campos et al. (2009).

The hierarchical formulation of Lasso shown above includes explicitly two sources of variation and is thus akin to classical models in quantitative genetics and genetic evaluation (Henderson, 1984; Falconer & Mackay, 1996) where variation is split into environmental and genetic variances. The shape of the distribution of SNP effects is determined by λ, which effectively determines the variance of SNP effects by using Var(a)=2/λ². Thus, λ plays the same role as the inverse of a standard deviation in normal models. This does not seem to have been recognized by previous scholars (de los Campos et al., 2009; Usai et al., 2009).

Applying the same logic as in Gianola et al. (2009), and in ideal conditions, it is possible to establish a rough equivalence between genetic variance in a population (σ_u²; usually estimated by an additive, relationship-based model) and the variance of SNP effects:

$$\sigma_u^2 = 2\sigma_a^2 \sum_i p_i (1 - p_i), \qquad \sigma_a^2 = \mathrm{Var}(a) = \frac{2}{\lambda^2}$$

where p_i is the allelic frequency at the ith marker.
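
As an illustration of this equivalence, the sketch below converts a value of λ into a SNP variance and then into a population genetic variance. It assumes the usual form of the relationship, σ_u² = 2σ_a² Σ p_i(1 − p_i) with σ_a² = 2/λ² for the two-variance Lasso; the function names and the toy allele frequencies are hypothetical.

```python
import numpy as np

def snp_variance_from_lambda(lam: float) -> float:
    """Variance of a single SNP effect in the two-variance Lasso: Var(a) = 2/lambda^2."""
    return 2.0 / lam**2

def genetic_variance(p, sigma2_a: float) -> float:
    """Rough population genetic variance: 2 * sigma2_a * sum_i p_i (1 - p_i)."""
    p = np.asarray(p, dtype=float)
    return 2.0 * sigma2_a * np.sum(p * (1.0 - p))

# Toy allele frequencies and lambda, for illustration only
p = np.array([0.5, 0.1, 0.3])
sigma2_a = snp_variance_from_lambda(2.0)   # 2 / 4 = 0.5
sigma2_u = genetic_variance(p, sigma2_a)   # 2 * 0.5 * (0.25 + 0.09 + 0.21) = 0.55
print(sigma2_a, sigma2_u)
```

Read in the other direction, a pedigree-based estimate of σ_u² implies a value of λ, which is what allows λ to be interpreted in terms of genetic variation.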

3. Estimation and cross-validation study

(i) Data

Two sets of bulls from French dairy cattle populations were analysed: Holstein (1756 bulls) and Montbéliarde (678 bulls). Bulls were genotyped with the Illumina Bovine SNP50 BeadChip. Markers were discarded based on low call rate, lack of positioning in the genome, or very high Mendelian inconsistency rate. No minor allele frequency threshold was imposed. Finally, 51 325 and 49 625 polymorphic SNPs were used in the Holstein and Montbéliarde breeds, respectively. A cross-validation approach was used where 1216 (Holstein) and 451 (Montbéliarde) bulls were taken as the training data and the rest as validation data. Bulls in the validation data set were, roughly, bulls being tested in 2004 and 2005, and younger than the training bulls. All parameter estimation in this work was carried out on the training population.

Data for training (y in the model) were daughter yield deviations (DYDs; Van Raden & Wiggans, 1991) as computed with data available in 2004; data for validation were DYDs from data available in 2009. Thus, the validation mimics well a real scenario. To account for different accuracies in the estimation of DYDs, these were weighted by their prediction error variances (in terms of number of equivalent daughters) as estimated from regular genetic evaluation. This will be explained in more detail later.

Traits analysed were milk, fat and protein yields (MY, FY and PY) and fat and protein percentages (FP and PP). Several models were used. The estimation was mostly made by Bayesian methods using Markov Chain Monte Carlo (MCMC) as well as, for certain cases, a marginal maximum likelihood by an MCEM algorithm, as suggested by Park & Casella (2008) to avoid the use of a hyperprior for λ. An example of marginal maximum likelihood in the genetics literature is the REML estimator of variance components (Patterson & Thompson, 1971). The models used to analyse the data sets are described next.

(ii) Bayesian Lasso with genetic and residual variances (Bayesian lasso with two variances; BL2Var)

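The model is as follows:

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}\mathbf{a} + \mathbf{e}, \qquad \mathbf{e} \mid \sigma_e^2 \sim \mathrm{MVN}(\mathbf{0}, \mathbf{F}\sigma_e^2), \qquad p(\mathbf{a} \mid \lambda) = \prod_i \frac{\lambda}{2} \exp\left(-\lambda |a_i|\right)$$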

where y contains twice the DYDs for each bull, μ is a general mean, Z is an incidence matrix of SNP effects a, e is a vector of residuals and F is a diagonal matrix that contains, in the diagonal, the inverse of the number of equivalent daughters for each DYD. The parameterization of SNP effects is as in Van Raden (2008): −2p_i, 1−2p_i and 2−2p_i for the genotypes 00, 01 and 11, where p_i is the allelic frequency of ‘1’. In this way, assuming Hardy–Weinberg equilibrium, SNP genotypic effects are substitution effects with average effect of 0 in the population (Falconer & Mackay, 1996), which is one of the conditions for the expression of the genetic variation in the population as σ_u² = 2σ_a² Σ_i p_i(1−p_i) (Gianola et al., 2009). This parameterization also results in slightly better predictive abilities compared to other ones such as −1, 0, 1 for the 00, 01 and 11 genotypes (data not shown).
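
The genotype coding can be sketched in a few lines. This is a minimal illustration, assuming genotypes are stored as counts (0, 1 or 2) of allele ‘1’ and that p_i is estimated from the same sample, which the text does not specify; the function name `center_genotypes` and the toy matrix are hypothetical.

```python
import numpy as np

def center_genotypes(M):
    """Code genotypes 00/01/11 as -2p_i, 1-2p_i, 2-2p_i.

    M is an (animals x SNPs) array of allele counts for allele '1'.
    Allele frequencies are estimated here from M itself (an assumption).
    """
    M = np.asarray(M, dtype=float)
    p = M.mean(axis=0) / 2.0   # frequency of allele '1' at each SNP
    Z = M - 2.0 * p            # 0 -> -2p, 1 -> 1-2p, 2 -> 2-2p
    return Z, p

# Toy genotypes: 4 animals, 2 SNPs
M = np.array([[0, 2],
              [1, 2],
              [2, 0],
              [1, 0]])
Z, p = center_genotypes(M)
print(p)               # allele frequencies
print(Z.mean(axis=0))  # each column of Z averages to zero
```

With frequencies estimated from the same animals, every column of Z sums to zero, so the general mean μ is not confounded with the SNP effects.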

The prior distribution for σ_e² was an inverted chi-square distribution with 4 degrees of freedom and expectation equal to the value used in regular genetic evaluation for σ_e². The prior for λ was deliberately vague, being uniform between 0 and 1 000 000.

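In practice, the model above was transformed into an equivalent model, yielding the same solutions, as follows:

$$\mathbf{y}^* = \mathbf{1}^*\mu + \mathbf{Z}^*\mathbf{a} + \mathbf{e}^*, \qquad \mathbf{y}^* = \mathbf{F}^{-1/2}\mathbf{y}, \quad \mathbf{1}^* = \mathbf{F}^{-1/2}\mathbf{1}, \quad \mathbf{Z}^* = \mathbf{F}^{-1/2}\mathbf{Z}$$

which amounts to multiplying each row of y, 1 and Z by the square root of the number of equivalent daughters, so that

$$\mathrm{Var}(\mathbf{e}^*) = \mathbf{I}\sigma_e^2$$

which simplifies the computations.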
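
The row weighting can be sketched as follows. This is a minimal illustration, assuming the transformed quantities are obtained by scaling each record by the square root of its number of equivalent daughters; the function name `weight_rows` and the toy data are hypothetical.

```python
import numpy as np

def weight_rows(y, Z, n_eq):
    """Scale each record by sqrt(number of equivalent daughters).

    Returns y*, 1* and Z* such that the residuals of the transformed
    model have homogeneous variance.
    """
    w = np.sqrt(np.asarray(n_eq, dtype=float))
    y_star = w * np.asarray(y, dtype=float)
    ones_star = w.copy()                           # the column of ones, scaled
    Z_star = np.asarray(Z, dtype=float) * w[:, None]
    return y_star, ones_star, Z_star

# Toy data: 3 bulls, 1 SNP, with 4, 9 and 16 equivalent daughters
y = np.array([1.0, 2.0, 4.0])
Z = np.array([[-1.0], [0.0], [1.0]])
n_eq = np.array([4.0, 9.0, 16.0])
y_star, ones_star, Z_star = weight_rows(y, Z, n_eq)

# Ordinary least squares on the transformed data ...
X_star = np.column_stack([ones_star, Z_star])
ols = np.linalg.lstsq(X_star, y_star, rcond=None)[0]
# ... matches weighted least squares on the original data
X = np.column_stack([np.ones(3), Z])
W = np.diag(n_eq)
wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(ols, wls)
```

After this step every sampler or solver can treat the records as if they had a common residual variance, so no weight matrix has to be carried through the computations.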

A Gibbs sampler was implemented as in Park & Casella (2008) or de los Campos et al. (2009), via the introduction of additional (augmented) variables τ_i², which can be seen as variance components for each SNP effect. The Gibbs sampler with residual update (Legarra & Misztal, 2008) was used to speed up sampling of location parameters μ and a. The full conditional posterior distributions are as follows (the symbol indicates the current state of variable b):

where , z_i* is the row of Z* corresponding to the ith effect and a_−i indicates all a variables except for a_i . Further,

where IG stands for inverted Gaussian,

bounded between 0 and 1 000 000, and where G is a gamma distribution with shape ‘s’ and scale ‘sc’ and ‘nsnp’ is the number of a effects. Finally,

where S_e² is the scale of the a priori distribution of the residual variance and ndata is the number of records in y. For the inverted Gaussian distribution, we used the algorithm of Michael et al. (1976) with a minor modification: extracting the largest root of the quadratic to avoid numerical cancellation.
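
A sampler of this kind can be sketched in Python. This is a minimal sketch under stated assumptions, not the authors' Fortran program: it assumes the hierarchy a_i | τ_i² ~ N(0, τ_i²) with τ_i² exponential with rate λ²/2, a uniform prior on λ and a scaled inverted chi-square prior on σ_e². The gamma shape nsnp + 1/2 for λ² is derived here from that uniform prior and may differ from the shape used in the paper. numpy's `wald` generator stands in for the algorithm of Michael et al., and the function name `bl2var_gibbs` and its arguments are hypothetical. Inputs are assumed to be already weighted (y*, 1*, Z*).

```python
import numpy as np

def bl2var_gibbs(y, ones, Z, n_iter=2000, burn_in=1000,
                 nu=4.0, S2e=1.0, lam_max=1e6, seed=0):
    """Gibbs sampler sketch for a Lasso with separate residual and SNP variances."""
    rng = np.random.default_rng(seed)
    n, m = Z.shape
    zz = np.einsum("ij,ij->j", Z, Z)      # z_i'z_i for every SNP
    oo = ones @ ones
    mu, a, tau2 = 0.0, np.zeros(m), np.ones(m)
    lam, s2e = 1.0, S2e
    resid = y - ones * mu - Z @ a         # current residuals, updated in place
    a_sum, lam_sum, s2e_sum, kept = np.zeros(m), 0.0, 0.0, 0
    for it in range(n_iter):
        # General mean: add its contribution back, sample, remove again
        resid += ones * mu
        mu = rng.normal((ones @ resid) / oo, np.sqrt(s2e / oo))
        resid -= ones * mu
        # SNP effects, one at a time, with residual update
        for i in range(m):
            z = Z[:, i]
            lhs = zz[i] + s2e / tau2[i]
            rhs = z @ resid + zz[i] * a[i]
            new = rng.normal(rhs / lhs, np.sqrt(s2e / lhs))
            resid -= z * (new - a[i])
            a[i] = new
        # 1/tau_i^2: inverse Gaussian with mean lam/|a_i| and shape lam^2
        inv_tau2 = rng.wald(lam / np.maximum(np.abs(a), 1e-12), lam**2)
        tau2 = 1.0 / np.maximum(inv_tau2, 1e-12)
        # lam^2: gamma, rejected until lam falls inside (0, lam_max)
        while True:
            lam2 = rng.gamma(m + 0.5, 2.0 / tau2.sum())
            if lam2 < lam_max**2:
                break
        lam = np.sqrt(lam2)
        # Residual variance: scaled inverted chi-square
        s2e = (resid @ resid + nu * S2e) / rng.chisquare(n + nu)
        if it >= burn_in:
            a_sum += a
            lam_sum += lam
            s2e_sum += s2e
            kept += 1
    return {"a": a_sum / kept, "lam": lam_sum / kept, "s2e": s2e_sum / kept}
```

Keeping the residual vector up to date means each SNP update costs one dot product and one vector update, rather than a full recomputation of y* − 1*μ − Z*a.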

For the MCEM estimation of λ (BL2Var-EM), the iterations proceed as above but sampling of λ is substituted by an updated estimate

$$\lambda^{(k)} = \sqrt{\frac{2\,n_{snp}}{\sum_i E\left(\tau_i^2\right)}}$$

where the expectations E(τ_i²) are obtained by Monte Carlo using the previous estimate of λ. In our case and after experimentation with one trait, the number of iterations to get E(τ_i²) was reduced to just one. This seems to be possible because the very large number (51 325) of variables included provides a reasonable estimate. At convergence, the last 100 samples of λ were averaged to obtain a Monte Carlo error-free estimate (as suggested by Park & Casella, 2008).
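
The update step can be written compactly. The sketch below assumes the update of Park & Casella's Monte Carlo EM, λ = sqrt(2·nsnp / Σ E(τ_i²)), with the expectations replaced by averages over whatever samples of τ_i² are supplied (a single sample in the shortcut described above); the function name `mcem_lambda_update` is hypothetical.

```python
import numpy as np

def mcem_lambda_update(tau2_samples) -> float:
    """One EM update of lambda from sampled tau_i^2.

    tau2_samples has shape (n_samples, nsnp); a single 1-D sample of
    length nsnp is also accepted.
    """
    tau2_samples = np.atleast_2d(np.asarray(tau2_samples, dtype=float))
    expected_tau2 = tau2_samples.mean(axis=0)   # Monte Carlo estimate of E(tau_i^2)
    nsnp = expected_tau2.size
    return float(np.sqrt(2.0 * nsnp / expected_tau2.sum()))

# If every E(tau_i^2) equals 2/lambda^2 with lambda = 2, the update returns 2
lam_new = mcem_lambda_update(np.full(4, 0.5))
print(lam_new)
```

Because the sum runs over tens of thousands of SNPs, even a single sample of each τ_i² gives a fairly stable denominator, which is consistent with the one-iteration shortcut reported in the text.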

(iii) Bayesian Lasso with one variance (BL1Var)

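The model by Park & Casella (2008) and de los Campos et al. (2009) postulates a single variance component linked to a priori variation in both residual and SNP effects, and thus:

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}\mathbf{a} + \mathbf{e}, \qquad \mathbf{e} \mid \sigma_e^2 \sim \mathrm{MVN}(\mathbf{0}, \mathbf{F}\sigma_e^2), \qquad p(\mathbf{a} \mid \lambda, \sigma_e) = \prod_i \frac{\lambda}{2\sigma_e} \exp\left(-\frac{\lambda |a_i|}{\sigma_e}\right)$$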

The conditional distributions are as above, with the following modifications:

and

where D is a diagonal matrix with τ_i²σ_e² in the (i,i) position. This conditional distribution shows well that SNP effects are in practice considered as pseudo-residuals in the one-variance Bayesian Lasso.

(iv) Bayesian mixed model with unknown genetic and residual variances (MCMC-GBLUP)

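This model is similar to the ‘BLUP’ model in Meuwissen et al. (2001), although the variance components are not fixed a priori. Instead, they are estimated in the model as in Legarra et al. (2008):

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}\mathbf{a} + \mathbf{e}, \qquad \mathbf{a} \mid \sigma_a^2 \sim \mathrm{MVN}(\mathbf{0}, \mathbf{I}\sigma_a^2), \qquad \mathbf{e} \mid \sigma_e^2 \sim \mathrm{MVN}(\mathbf{0}, \mathbf{F}\sigma_e^2)$$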

The prior distribution for σ_e² is as in the Bayesian Lasso with two variances; the prior distribution for σ_a² was a chi-squared distribution with 4 degrees of freedom and expectation equal to σ_u²/(2 Σ_i p_i(1−p_i)), σ_u² being the genetic variance component used in genetic evaluation. The Gibbs sampler for this model has been extensively described (e.g. Sorensen & Gianola, 2002).

(v) Bayesian mixed model with known genetic and residual variances (GBLUP)

This model is the same as the previous one, except that variance components were assumed to be known with certainty and inferred from values used in current genetic evaluation, as for the priors in MCMC-GBLUP. To estimate solutions for μ and a, Henderson's (1984) mixed model equations were used, which were solved by preconditioned conjugate gradients as described in Legarra & Misztal (2008).

(vi) Bayesian mixed model with heterogeneous genetic variances (Het-GBLUP)

This model assumes that components of overall genetic variation (λ and σ_e²) are known with certainty but allows for heterogeneous variances of SNP effects, which are τ_i² for the ith SNP. In order to accommodate heterogeneous variances in a linear estimator, these have to be previously known. Thus, we followed a three-step procedure. First, λ and σ_e² were estimated as in the Bayesian Lasso with two variances. Second, estimates of τ_i² were computed by a Gibbs sampler (as the one in the Bayesian Lasso with two variances) with λ and σ_e² fixed to their estimated values. Finally, a diagonal matrix D was formed to describe the heterogeneous variance, with the estimate of τ_i² in the (i,i) position. Thus, the model becomes

$$\mathbf{y} = \mathbf{1}\mu + \mathbf{Z}\mathbf{a} + \mathbf{e}, \qquad \mathbf{a} \sim \mathrm{MVN}(\mathbf{0}, \mathbf{D}), \qquad \mathbf{e} \mid \sigma_e^2 \sim \mathrm{MVN}(\mathbf{0}, \mathbf{F}\sigma_e^2)$$

which is solvable by Henderson's mixed model equations as above.
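
The final step can be sketched as a small linear solve. This is a minimal illustration, assuming already weighted data (so residuals have variance Iσ_e²) and a vector `d` holding the precomputed SNP variances on the diagonal of D; it uses a direct solver rather than the preconditioned conjugate gradients used in the paper, and the function name `solve_mme` is hypothetical.

```python
import numpy as np

def solve_mme(y, ones, Z, d, sigma2_e):
    """Solve Henderson's mixed model equations with heterogeneous SNP variances.

    [1'1   1'Z               ] [mu]   [1'y]
    [Z'1   Z'Z + s2e * D^{-1}] [a ] = [Z'y]
    """
    X = np.column_stack([ones, Z])
    lhs = X.T @ X
    idx = np.arange(1, X.shape[1])
    lhs[idx, idx] += sigma2_e / np.asarray(d, dtype=float)  # shrinkage per SNP
    sol = np.linalg.solve(lhs, X.T @ y)
    return sol[0], sol[1:]

# Toy example: 8 records, 5 SNPs, illustrative variances
rng = np.random.default_rng(3)
Z = rng.normal(size=(8, 5))
y = rng.normal(size=8)
ones = np.ones(8)
d = np.array([0.5, 1.0, 2.0, 0.1, 0.3])
mu_hat, a_hat = solve_mme(y, ones, Z, d, sigma2_e=0.7)
print(mu_hat, a_hat)
```

Setting every element of `d` to the same value σ_a² gives the homogeneous-variance GBLUP as a special case, which is why the heterogeneous model fits into existing mixed model software with little change.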

All models above were fit to the five traits for the Holstein breed; for the Montbéliarde, only the Bayesian Lasso with two variances and GBLUP were fit. The MCEM was run for 50 000 iterations, by which point it had converged; the MCMC chains were run for 50 000 iterations, the first 25 000 being discarded as burn-in, after which solutions for all unknowns were estimated by their posterior means. In-house programs were written in Fortran 95.

Different parameters were estimated. In addition to σ_e² and λ, a rough equivalent of the classical, pedigree-based genetic variance (σ_u²) was estimated for each model (except for GBLUP where it is supposed fixed). For the Bayesian Lasso with two variances, this is σ_u² = (2/λ²) · 2 Σ_i p_i(1−p_i). For the Bayesian Lasso with one variance, this is σ_u² = (2σ_e²/λ²) · 2 Σ_i p_i(1−p_i). For the MCMC-GBLUP, this is σ_u² = 2σ_a² Σ_i p_i(1−p_i). A pedigree-based estimate of σ_u² was obtained by REML for the Holstein breed using REMLF90 (Misztal et al., 2002).

(vii) Cross-validation

Predictions (genomic estimated breeding values (GEBVs)) for the validation data set were computed as û = Zâ for the different models, and compared with 2009 progeny test-based DYDs. Predictive ability was measured as the correlation between both. The correlation was weighted by the number of equivalent daughters in 2009 DYD data. The formula for the weighted Pearson product moment correlation coefficient is as follows (e.g. Peers, 1996):

$$r_w = \frac{\sum_i w_i (x_i - \bar{x}_w)(y_i - \bar{y}_w)}{\sqrt{\sum_i w_i (x_i - \bar{x}_w)^2 \, \sum_i w_i (y_i - \bar{y}_w)^2}}$$

where $\bar{x}_w = \sum_i w_i x_i / \sum_i w_i$ and $\bar{y}_w = \sum_i w_i y_i / \sum_i w_i$, and w are the weights for each data point. Cross-validation results for the BL2Var-EM approach were essentially the same as for BL2Var (correlation among EBVs was higher than 0·99) and are therefore not shown.
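
The weighted correlation is straightforward to compute. The sketch below assumes the standard weighted Pearson coefficient with weighted means, with x the GEBVs, y the 2009 DYDs and w the numbers of equivalent daughters; the function name `weighted_corr` and the toy numbers are hypothetical.

```python
import numpy as np

def weighted_corr(x, y, w) -> float:
    """Weighted Pearson product moment correlation between x and y."""
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, w))
    mx = np.sum(w * x) / np.sum(w)   # weighted mean of x
    my = np.sum(w * y) / np.sum(w)   # weighted mean of y
    cov = np.sum(w * (x - mx) * (y - my))
    return float(cov / np.sqrt(np.sum(w * (x - mx) ** 2) * np.sum(w * (y - my) ** 2)))

# Toy predictions, observed DYDs and numbers of equivalent daughters
gebv = np.array([0.2, -0.1, 0.5, 0.0, 0.3])
dyd = np.array([0.3, -0.2, 0.4, 0.1, 0.1])
n_daughters = np.array([50.0, 80.0, 20.0, 100.0, 60.0])
print(weighted_corr(gebv, dyd, n_daughters))
```

With equal weights the function reduces to the ordinary Pearson correlation, so the weighting only matters when bulls differ in the amount of progeny information behind their DYDs.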

4. Results

(i) Estimates of parameters

Tables 1 (for Holstein) and 2 (for Montbéliarde) show estimates of λ parameters. Estimates are generally accurate as shown by their standard errors. Estimates from BL1Var are very similar across traits, which does not occur for BL2Var. On the other hand, estimates by full Bayesian inference (BL2Var) or marginal maximum likelihood (BL2Var-EM) are virtually identical in both Holstein (Table 1) and Montbéliarde (Table 2). Also, estimates in Holstein and in Montbéliarde are quite similar for the same traits.

Table 1. Estimates of ‘sharpness’ parameter λ (±se) in Holstein