Computing strategies for multi-population genomic evaluation

Background Multiple breed evaluation using genomic prediction includes the use of data from multiple populations, or from parental breeds and crosses, and is expected to lead to better genomic predictions. Increased complexity comes from the need to fit non-additive effects such as dominance and/or genotype-by-environment interactions. In these models, marker effects (and breeding values) are modelled as correlated between breeds, which leads to multiple trait formulations that are based either on markers [single nucleotide polymorphism best linear unbiased prediction (SNP-BLUP)] or on individuals [genomic(G)BLUP)]. As an alternative, we propose the use of generalized least squares (GLS) followed by backsolving of marker effects using selection index (SI) theory. Results All investigated options have advantages and inconveniences. The SNP-BLUP yields marker effects directly, which are useful for indirect prediction and for planned matings, but is very large in number of equations and is structured in dense and sparse blocks that do not allow for simple solving. GBLUP uses a multiple trait formulation and is very general, but results in many equations that are not used, which increase memory needs, and is also structured in dense and sparse blocks. An alternative formulation of GBLUP is more compact but requires tailored programming. The alternative of solving by GLS + SI is the least consuming, both in number of operations and in memory, and it uses only single dense blocks. However, it requires dedicated programming. Computational complexity problems are exacerbated when more than additive effects are fitted, e.g. dominance effects or genotype x environment interactions. Conclusions As multi-breed predictions become more frequent and non-additive effects are more often included, standard equations for genomic prediction based on Henderson’s mixed model equations become less practical and may need to be replaced by more efficient (although less general) approaches such as the GLS + SI approach proposed here.


Background
For genomic prediction in livestock and crops, marker effects are often modelled as different but correlated effects across populations [1,2]. This results in a multiple trait setting, in which each environment or population is modelled as a different trait. As individuals are present in a single environment, there is no covariance between other random effects (such as residual or permanent environmental effects). This leads to particular structures of the incidence matrices that make general computational strategies less efficient. In this work, we discuss some of these strategies, show that general strategies such as standard individualbased multiple trait genomic best linear unbiased prediction (GBLUP) leads to high computational redundancy, and we highlight that the old method of generalized least squares (GLS) followed by selection index (SI) [3,4] is a competing strategy in terms of efficiency. The motivation for the comparison of these strategies was the need to obtain estimates of marker effects in a computationally efficient manner, for their use in planning assortative matings in a two-way breeding scheme [5], using data from two purebred populations and a crossbred population, for a total of ~ 50K animals and genotypes at ~ 50K single nucleotide polymorphisms (SNPs). Here, we are concerned with medium-sized data sets, but with very complex models, where genetic evaluation can be done by exact (non-iterative) methods, i.e. by numerical inversion. This is a popular strategy for genomic evaluation in crops and in some populations of monogastric livestock.

Methods
For our argumentation, and following the example of [5], we assume additive ( a ) and dominant ( d ) SNP effects that are correlated across the different populations. In principle, the model can be extended to higher-order effects such as epistasis. The dominance effects have an a priori mean of 0 because genomic inbreeding is included in the model as a covariate [6]. For the example, we consider three populations but any number of populations can be accommodated. For each population i (i = {1, 2, 3}) , we have a single trait vector of phenotypes, y i , with the following linear model: where β i contains the fixed effects specific to population i (e.g., a mean and an inbreeding depression covariate), a i and d i are additive and dominance SNP effects, respectively, specific to population i , with incidence matrices coded, e.g., as Z i = {−1, 0, 1} and W i = {0, 1, 0} . Other codings are possible, such as in terms of breeding values and dominance deviations to achieve orthogonality [7]. The covariance structure, written using the Kronecker product ⨂, is: g 21 0d g 22 0d g 23 0d g 31 0d g 32 0d g 33 0d     .
Based on this notation, several equivalent estimators are derived in the following. We assume that variance components are known and that all individuals have a single record. Note that, by construction, we have a highly parameterized model, which in the GBLUP case has (many) more unknowns than records. I.e., each individual has an unknown additive and a dominance effect for each population but only one record in only one of the populations.

SNP-BLUP
This model is parameterized in terms of SNP effects across the different populations, resulting in the following structure of the left-hand side (LHS) of the mixed model equations (MME) (for clarity, only the random effects part is shown): , several sparse blocks of the form Ig 12 0a , and several zero blocks. This makes the use of current "state of the art" sparse matrix software inefficient, because most gains due to sparsity are lost. As a result, it may be more efficient to work with full storage, i.e. dense matrices. However, the size of the equations is rather large, i.e. 6m , where m is the number of SNPs. For instance, an analysis with 50 K SNP panels would involve MME of size 300 K and the inversion of the LHS would consume O(6m) 3 operations. The size of these equations are, however, invariant to the number of records n.

GBLUP using a standard multiple trait formulation
There are two possible implementations of GBLUP. The first considers three additive values and three dominance values per individual (one for each population). For each random effect, the MME use a single matrix that is arbitrarily scaled across all populations (the scale is arbitrary because there is no meaningful scaling factor that yields coherent "relationships" across the three populations). For instance, a possible matrix for the additive effect is: Note that the cost of inversion of G * A is O n 3 . Note also that, in order to describe covariances between individuals, G * A must be multiplied by the SNP effect variance (e.g. σ 2 a1 ), rather than by the population variance, which is why we prefer not to refer to G * A as containing "relationships". A similar matrix G * D is defined for dominance effects. The two matrices G * A and G * D are used in a multiple trait formulation with missing records for each trait, using the Kronecker factorization. This results in the following structure of LHS, as for multiple trait analysis (for clarity, only a portion of the structure is shown, for additive effects for two of the three populations): A similar pattern results for the dominance effects. There are also corresponding cross-products of incidence matrices in the additive x dominance blocks, e.g. Iσ −2 e1 . Thus, the LHS are composed of a dense formulation where each row has two times three blocks of size n , leading to 6n equations (and inversion cost of O(6n) 3 ). To our knowledge, this is the formulation used by standard BLUP/REML software programs (blupf90, Wombat, ASREML) with elements stored in memory. Note that these MME require G A and G D to be full rank, which is often not the case (because of the presence of clones, or as a result of "centering" the Z and W . . . matrices, or because n > m ). Matrices G A and G D may, therefore, require some "blending".

GBLUP using a compact formulation
The second option for the formulation of GBLUP, which requires much less memory, directly uses the inverses of the covariance matrices G A and G D (again, assuming that they are, or have been made, invertible), rather than the Kronecker factorization of covariances, e.g.: and then it uses the following LHS structure: which is of size 2n , but is still considerably dense. None of the above alternatives (SNP-BLUP or either GBLUP) are very satisfying since they all involve large matrices of size > n or > m , and only SNP-BLUP directly leads to estimates of SNP effects, which are required to predict newly genotyped animals (without re-running the evaluation) or to arrange assortative matings.

GLS and selection index formulation
An alternative is to use GLS followed by SI theory, as shown by Henderson [3,4]. First, we estimate the fixed effects by GLS: after which SNP effects are estimated through covariances as: where C a = Cov y, a ′ and C d = Cov y, d ′ . Constructing V is actually easy, because it is just a sum of matrices. Provided V is invertible, and using a single inversion, we can solve for β first and then backsolve for a and d . In the multi-population case, this may be computationally simpler than the SNP-BLUP and GBLUP strategies because V is smaller (of size n ) than several of the G matrices or LHS matrices in the SNP-BLUP and GBLUP formulations. Moreover, V is invertible by construction, while G A and G D may not be full rank and may need some "blending". Several authors [8][9][10] have already pointed out that the GLS formulation is computationally more compact when MME are non-sparse and requires, in principle, fewer computations. Therefore, to estimate SNP effects for multi-population evaluation, we propose the following algorithm based on GLS and SI. Assume that there are n 1 , n 2 , and n 3 records for each population: 1. Read data, build X , Z and W. 2. Create empty V of the right size ( n = n 1 + n 2 + n 3 ). 3. Add residual variances to V. 4. Add contributions from populations 1, 2, 3 and a , d , e to V. 5. Invert V (with associated cost O(n 3 )).

Solve for fixed effects
7. Solve for random effects using covariances as described in the following. This can be done in steps ( Z 1 , then Z 2 , etc.), specifically: To compute a we need C ′ a = Cov a, y ′ the covariance of a with y , which is: The algorithm to compute a then is:

And finally
We then proceed similarly for d. Useful by-products of the inversion-based (as opposed to iterative) computation of BLUP are reliabilities random effect predictions, which are usually obtained from prediction error variances. Prediction error variances of estimates of SNP effects can also be used in genome-wide association studies to assess significance of SNP effects. For the particular case of GLS + SI, individual reliabilities can be obtained from the inverses V −1 and X ′ V −1 X −1 and from the covariances of y with the different random effects [4]. Efficient algorithms for REML based on the GLS formulation also exist [10,11].

Iterative methods for genetic evaluation
In contrast to the exact inversion methods described above, most genetic evaluation software used for livestock use iterative methods that do not invert matrices or even set up MME explicitly, e.g. [12]. Iterative methods have two inconveniences compared to exact inversion methods: (1) convergence may be slow and is a priori unpredictable, and (2) other information such as reliabilities from the inverses of the MME is lost. For the types of multi-population models considered here, convergence of iterative methods is not always good, because there are many more effects than records and, for a given number of records, the condition number of the MME worsens with each extra effect.
We are not aware of iterative methods that use the GLS + SI formulation. However, in principle, it is possible to solve X ′ V −1 X β = X ′ V −1 y without inversion or even storage of matrices, by first solving V = y (so = V −1 y) and VM = X (so M = V −1 X) , and then To estimate SNP effects using SI, it is useful to note that y c = − M β.

Discussion
Henderson's formulation of MME allowed the use of linear methods for genetic evaluation as opposed to, say, likelihoods in pedigrees [13]. The two key discoveries of the MME and of the fast (and sparse) construction of the inverse of the pedigree-based numerator relationship matrix led to computational efficiency, not only for estimation of breeding values, but also for estimation of variance components, in which most algorithms use predictions of random effects and elements from the inverse of the MME. However, sparsity of the MME is only partly retained when using dense marker genotypes, as these invariably lead to dense cross-products, either as incidence matrices ( Z ′ Z ) or as covariance matrices (ZZ ′ ). In addition, the latter (ZZ ′ ) need to be inverted for their inclusion in MME. Within the framework of the use of linear models for genetic evaluation, the use of any computing strategy, including GBLUP, SNP-BLUP, and GLS + SI, is now mostly a matter of convenience for the user (availability of software) and for the programmer (general and/or easy formulations are preferred to complex ones). Computationally, the efficiency of the approach depends on the number of records and on the number of markers. In our particular problem of multi-breed prediction, generally, SNP-BLUP is computationally easier when n > m (more records than markers), GLS + SI is easier when m > n (more markers than records), and GBLUP is easier when m > n and the model is notoriously complex to fit (e.g. random regressions on time or temperature, correlated animal effects, etc.). SNP-BLUP models are interesting because they yield estimates of marker effects. These allow so-called "indirect" predictions (e.g. for young genotyped animals with no own record) to be calculated as the sum of SNP effect estimates weighted by gene content at SNPs. Dominance (or higher order interaction) effect estimates allow matings that maximize performance to be optimized. GBLUP or GLS + SI also allow estimates of marker effects to be obtained using covariances as explained before.
Compared to SNP-BLUP and GBLUP, an advantage of GLS + SI is the ability to fit increasingly complex models without increasing the dimensions of the GLS. Examples are genotype-by-genotype and genotype-by-environment interactions [14,15], which are of interest, respectively, for planned matings and breeding for target environmental conditions.
The focus of this note was medium-sized populations of up to ~ 100K individuals and/or genotyped markers. In these populations, it is possible to fit rather complex models while maintaining the favorable features of exact methods, including shorter computing time, computed reliabilities, maximum likelihood algorithms, and even genome-wide associations studies. For very large data sets, iteration on data [12,[16][17][18][19] are good options, as they do not require cross-products to be explicitly computed (or only partially) or inversion of the MME.

Conclusions
We have shown that for multi-breed prediction, Henderson's MME (either in terms of marker effects or of individual animal effects-SNP-BLUP and GBLUP, respectively) do not necessarily lead to the most computationally efficient approach, although it is a very flexible one. If most individuals are genotyped, then other more parsimonious alternatives could be considered in addition to GBLUP or SNP-BLUP. The use of GLS combined with SI is one of these.