Deflated preconditioned conjugate gradient method for solving single-step BLUP models efficiently

Background The single-step single nucleotide polymorphism best linear unbiased prediction (ssSNPBLUP) method, such as single-step genomic BLUP (ssGBLUP), simultaneously analyses phenotypic, pedigree, and genomic information of genotyped and non-genotyped animals. In contrast to ssGBLUP, SNP effects are fitted explicitly as random effects in the ssSNPBLUP model. Similarly, principal components associated with the genomic information can be fitted explicitly as random effects in a single-step principal component BLUP (ssPCBLUP) model to remove noise in genomic information. Single-step genomic BLUP is solved efficiently by using the preconditioned conjugate gradient (PCG) method. Unfortunately, convergence issues have been reported when solving ssSNPBLUP by using PCG. Poor convergence may be linked with poor spectral condition numbers of the preconditioned coefficient matrices of ssSNPBLUP. These condition numbers, and thus convergence, could be improved through the deflated PCG (DPCG) method, which is a two-level PCG method for ill-conditioned linear systems. Therefore, the first aim of this study was to compare the properties of the preconditioned coefficient matrices of ssGBLUP and ssSNPBLUP, and to document convergence patterns that are obtained with the PCG method. The second aim was to implement and test the efficiency of a DPCG method for solving ssSNPBLUP and ssPCBLUP. Results For two dairy cattle datasets, the smallest eigenvalues obtained for ssSNPBLUP (ssPCBLUP) and ssGBLUP, both solved with the PCG method, were similar. However, the largest eigenvalues obtained for ssSNPBLUP and ssPCBLUP were larger than those for ssGBLUP, which resulted in larger condition numbers and in slow convergence for both systems solved by the PCG method. Different implementations of the DPCG method led to smaller condition numbers, and faster convergence for ssSNPBLUP and for ssPCBLUP, by deflating the largest unfavourable eigenvalues. Conclusions Poor convergence of ssSNPBLUP and ssPCBLUP when solved by the PCG method are related to larger eigenvalues and larger condition numbers in comparison to ssGBLUP. These convergence issues were solved with a DPCG method that annihilates the effect of the largest unfavourable eigenvalues of the preconditioned coefficient matrix of ssSNPBLUP and of ssPCBLUP on the convergence of the PCG method. It resulted in a convergence pattern, at least, similar to that of ssGBLUP. Electronic supplementary material The online version of this article (10.1186/s12711-018-0429-3) contains supplementary material, which is available to authorized users.


Background
In general, genomic data for livestock animals include several thousand single nucleotide polymorphisms (SNPs), which are used in genetic evaluations to obtain genomic estimated breeding values [1][2][3]. Currently, the method of choice that simultaneously combines phenotypic and pedigree information of genotyped and non-genotyped animals with genomic information of genotyped animals is the so-called single-step genomic best linear unbiased prediction (ssGBLUP) [3]. ssGBLUP includes genomic information by combining genomic and pedigree relationships into a combined genomicpedigree relationship matrix [3][4][5]. However, a major inconvenience of ssGBLUP is that the inverse of a dense genomic relationship matrix ( G ) is required, which can be computed up to approximately 100,000 genotyped animals on current computers [6]. Thus, some methods were proposed to approximate the inverse of G , such as the algorithm for proven and young animals (APY) [6], or to compute its inverse implicitly based on singular value decomposition (SVD) [7] or on the Woodbury decomposition [8]. Another approach to avoid the computation of the inverse of G , or even G itself, is to fit the SNP effects explicitly, or principal components obtained from a SVD of the genotype matrix, as random effects in the model. Several equivalent models were proposed in the literature that enable simultaneous modelling of genotyped and non-genotyped animals as in ssGBLUP [2,7,[9][10][11][12][13]. Equivalent models that directly estimate SNP effects as random effects [2,[9][10][11][12][13] will hereafter be referred to as single-step SNPBLUP (ssSNPBLUP). It has been suggested that the dimension of SNP-based models can be considerably reduced by applying random regression to principal components (PC) of the SNP genotypes, and that the remaining noise of genomic information can be ignored [14]. To our knowledge, a linear system of equations of single-step principal component BLUP (ssPCB-LUP) has never been solved with the PCG method for large datasets.
The ssGBLUP, ssSNPBLUP and ssPCBLUP models have linear systems of equations with sparse and symmetric positive (semi-)definite (SPSD) coefficient matrices. Thus, the preconditioned conjugate gradient (PCG) method is the primary choice as iterative solver for solving linear systems of ssGBLUP, ssSNPBLUP [11,[15][16][17], and of ssPCBLUP. The PCG method belongs to the family of conjugate gradient (CG) methods that are a realization of an orthogonal projection technique onto the Krylov subspace, which is generated by the initial residual and the system matrix (e.g., the preconditioned coefficient matrix) to which the CG method is applied [17]. The convergence rate of CG methods is bounded as a function of the spectral condition number of the system matrix, which is the ratio between the largest and smallest eigenvalues of the system matrix [17]. Preconditioning ensures faster convergence of the PCG method, compared to the CG method. Unfortunately, in contrast to ssGBLUP, convergence issues with the PCG method applied to ssSN-PBLUP have been reported [11,18], which we have also experienced in our initial analyses. Furthermore, we experienced similar convergence issues with ssPCBLUP in our initial analyses.
Taskinen et al. [11] suggested that convergence problems may be due to a poor spectral condition number of the system matrix of ssSNPBLUP. Thus, to achieve faster convergence, improvement of this spectral condition number is needed and can be obtained through methods that have been developed for ill-conditioned linear systems of equations. One such method is the deflated PCG method, which is a two-level PCG method for illconditioned linear systems [19][20][21]. The DPCG method has resulted in good performance in other contexts than genetic evaluations [22][23][24], and possesses interesting properties, such as its relatively easy implementation in current software based on a PCG method and its favourable properties for parallel computing [22]. To our knowledge, the DPCG method has never been applied in linear mixed models, whether for genetic evaluations or other purposes. Thus, the first aim of this study was to compare the properties of the system matrices of ssGBLUP of the ssSNPBLUP model that was proposed by Mantysaari and Stranden [13], and to relate this to observed convergence patterns obtained with the PCG method. Our second aim was to implement the DPCG method and test its feasibility for solving ssSNPBLUP in large genetic evaluation models, and its re-parametrization into a ssPCBLUP model.

Methods
The first part of this section describes the ssSNPBLUP model that was proposed by Mantysaari and Stranden [13] and its re-parametrization into a ssPCBLUP model. The second part describes the CG, PCG, and DPCG methods. The last part describes the datasets used for comparing the properties of the system matrices of the different models, and for testing the DPCG method.

A ssSNPBLUP model
In this study, we investigate the ssSNPBLUP model that was proposed by Mantysaari and Stranden [13] and is similar to the so-called hybrid model proposed by Fernando et al. [10]. This ssSNPBLUP model fits three types of additive genetic effects: SNP and residual polygenic effects for genotyped animals, and additive genetic effects for non-genotyped animals. Originally derived as a univariate ssSNPBLUP model, we (readily) extended this to a multivariate ssSNPBLUP model for t traits. In the following, I t is an identity matrix with size equal to the number of traits t , and the subscripts g and n refer to genotyped and non-genotyped animals, respectively. A standard multivariate mixed model for ssSNPBLUP can be written as: where y is the vector of records, b is the vector of fixed effects, u n is the vector of additive genetic effects for nongenotyped animals, a g is the vector of residual polygenic effects for genotyped animals, g is the vector of SNP effects, and e is the vector of residuals. The matrices X , W n , and W g are incidence matrices relating records to the corresponding effects. Without loss of generality, the matrix Z contains the SNP genotypes (coded as 0 for one homozygous genotype, 1 for the heterozygous genotype, or 2 for the alternate homozygous genotype) centered by their observed means, and M z = I t ⊗ Z. Additive genetic effects for the genotyped animals for t traits, u g , can be computed as u g = a g + M z g . We assume a multivariate normal distribution for the additive genetic effects u n , the residual polygenic effects a g , and the SNP effects g , with a mean equal to zero and covariance matrix , inverse of required for the mixed model equations associated with Eq. (1), can be derived from the inverse of the (co)variance matrix associated with the vector proposed by Liu et al. [12] as follows [13]: the inverse of the pedigree relationship matrix, A gg is the pedigree relationship matrix among genotyped animals, w is the proportion of variance (due to additive genetic effects) considered as residual polygenic effects, and m = 2 p i (1 − p i ) with p i being the allele frequency of the i th SNP, and such that var g = 1−w m G 0 ⊗ I. Knowing that A −1 gg = A gg − A gn (A nn ) −1 A ng , and after some algebra to avoid the computation of A −1 gg to form −1 , we obtain: where Q = A gn (A nn ) −1 A ng .
The linear system of mixed model equations of ssSN-PBLUP is as follows: This ssSNBLUP model is equivalent to the following ssGBLUP model: where G = 1−w m ZZ ′ + wA gg is the genomic relation- ship matrix modified for considering the residual polygenic effects [3,5].

A ssPCBLUP model
Due to linkage disequilibrium between SNPs, a small number of PC of the centered genotype matrix Z likely explain most of the genomic variation, while the remaining PC associated with small eigenvalues may reflect noise in the genomic information [14,25]. Principal components of Z can be obtained by SVD: where U and V are unitarian matrices with the left and right singular vectors of Z , respectively; and S is a diagonal matrix with non-negative diagonal elements known as singular values (i.e., square roots of the eigenvalues of ZZ ′ and Z ′ Z ). The matrix US is known as the PC score matrix.
Removing the noise can be performed by fitting explicitly only the PC associated with the largest eigenvalues that explain most (e.g., 99%) of the genomic variation, instead of fitting SNP effects, into a ssPCBLUP model as follows [7,14,25]: where v = V ′ g ; and T = UŜ with Ŝ containing the largest singular values of S corresponding to the largest eigenvalues that explain, e.g. 99%, of the genomic variation.
The linear system of mixed model equations of ssP-CBLUP has the same form as the linear system of mixed Vandenplas et al. Genet Sel Evol (2018) 50:51 model equations of ssSNPBLUP (2), except that Z is replaced by T in M z , i.e. I t ⊗ Z by I t ⊗ T . It is also worth noting that the number of columns of T (which is the number of PC kept) is smaller than the number of columns of Z (which is the number of SNPs) due to rank reduction.

Iterative solvers
The linear systems of mixed model equations of ssGB-LUP, ssSNPBLUP, and ssPCBLUP have the form: where C is a SPSD coefficient matrix, x is the vector of solutions, and b is the right-hand-side.
Such linear systems of equations can be solved using direct methods [17]. A bottleneck of most of these methods is that they involve an explicit factorization of C . The resulting matrix factor is often dense and might require excessive amounts of memory and computation. Therefore, direct methods are usually too expensive and, in some cases, even impossible for large linear systems. Instead of direct methods, iterative methods, i.e. methods that use successive approximations to obtain more accurate solutions for a linear system at each iteration step, are more attractive. With iterative methods, both memory requirements and computing time can be reduced, especially if C is large and sparse. Within the class of iterative methods, the CG methods are the best choice, especially when C is SPSD [17].

Conjugate gradient method and effective spectral condition number
The purpose of CG methods is to construct a sequence, x j , that satisfies x j+1 ∈x 0 + K j (C, r 0 ) , where x 0 is a vector of starting solutions, r 0 = b − Cx 0 , and K j (C, r 0 ) is the Krylov subspace K j equal to K j (C, r 0 ) = span r 0 , Cr 0 , C 2 r 0 , . . . , C j−1 r 0 . After j + 1 iterations, the error is bounded by [17]: is the effective spectral condition number of the coefficient matrix C and is defined as κ(C) = max min with min ( max ) being the smallest (largest) non-zero eigenvalue of C [26]. The more C is well-conditioned, the smaller is κ(C) , the smaller is the error bound, which is expected to result in faster convergence of the CG method [17]. It is worth noting that the convergence of the CG method does not depend only on κ(C) , since κ(C) affects only the upper bound of the error (i.e., the worst convergence rate). Indeed, convergence also depends on the clustering of the eigenvalues of the system matrix, on the right-handside b , and on floating point rounding errors. These factors may lead to different convergence patterns for two different systems of equations with a similar κ(C).

Preconditioned conjugate gradient method
To improve the performance of the CG method, the linear system of equations, Cx = b , is transformed into an equivalent linear system of equations for which the resulting system matrix, i.e. the preconditioned coefficient matrix, has an effective spectral condition number smaller than κ(C) . This can be realized by preconditioning the linear system with a symmetric positive definite matrix M , called preconditioner. The resulting preconditioned linear system of equations can be written as follows [17]: The preconditioned linear system can be solved with the PCG method using the algorithm given in Table 1. Equation (4) for the error bound of the CG method also applies to the PCG method by replacing κ(C) with κ M −1 C . Thus, the preconditioner M must be chosen such that κ M −1 C ≤ κ(C) . A general rule is that the preconditioner M approximates C to obtain eigenvalues that cluster around 1. The preconditioner M must be also chosen such that inexpensive costs are required for its construction and for the multiplication of its inverse M −1 , with a vector, as this operation is performed at each Table 1 Algorithm for preconditioned conjugate gradient (PCG) and deflated PCG (DPCG) methods for solving x in the linear system Cx = b using a preconditioner M , and Z d is a deflation-subspace matrix For the PCG implementation: the equation in line 11 (r j+1 = r j − α j w j ) was replaced by the equation r j+1 = b − Ax j+1 at each 50 iterations [16] 1 Select an initial guess x 0 ; iteration of the PCG method (Table 1). For linear systems of equations resulting from mixed models, such as models (1) and (3), a preconditioner M equal to the diagonal elements of C , i.e. M = diag(C) , is widely used [11,15,16,18,27]. For multivariate analyses, M is usually defined as a block diagonal matrix [11,15,16,18,27].

Deflated PCG method
The deflated PCG method is a two-level PCG method, which iteratively solves ill-conditioned linear systems of equations, i.e. linear systems of equations with a large effective spectral condition number [19][20][21]. Large effective spectral condition numbers are obtained when very small, or very large, or both, non-zero eigenvalues are present in the set of eigenvalues, called spectrum, of M −1 C . These very small or very large eigenvalues of a spectrum are called hereafter unfavourable eigenvalues. Deflation is used to annihilate the effect of the most unfavourable eigenvalues of the spectrum of M −1 C on the convergence of the PCG method by setting these unfavourable eigenvalues to 0 [20]. The deflation is performed by introducing a second-level preconditioner, P , also called deflation matrix, into the preconditioned linear system of equations, M −1 Cx = M −1 b , as follows [19][20][21]: where x d is the vector of deflated solutions and is related to the vector of solutions x of the system of equations the matrix Z d is the deflation-subspace matrix of rank k that contains k columns, called deflation vectors; and the matrix E = Z ′ d CZ d is a symmetric positive definite matrix, called Galerkin or coarse matrix [28], which can be easily computed and inverted, or factored if it is too large.
Because Eq. (4) for the error bound of the CG method also applies to the DPCG method by replacing κ(C) with κ M −1 PC , choosing an adequate combination of M −1 P , i.e. choosing a deflation matrix P , and thus a deflationsubspace matrix Z d , in combination with M , should yield faster convergence. Ideally, matrix Z d should contain the eigenvectors corresponding to the unfavourable eigenvalues of M −1 C to achieve the fastest convergence [20][21][22]. However, obtaining and applying such eigenvectors is computationally intensive. Therefore, the deflation vectors of the deflation-subspace matrix Z d should approximate the same space as the span of the unfavourable eigenvectors such that κ M −1 PC ≤ κ M −1 C . The number ( k ) of deflation vectors should be chosen such that the deflation approach gives good results while the additional computational costs are limited as much as possible. Indeed, the size ( k ) of the Galerkin matrix should be limited so that it can be stored in memory, and the computational costs associated with the multiplication of P = I − CZ d E −1 Z ′ d with a vector should be also limited because this operation is performed at each iteration of the DPCG method. For example, if k = 1 , the computational costs are minimized since Z d is a vector and E −1 is a scalar. However, in this case, the DPCG method is expected to hardly improve the convergence pattern. Contrariwise, if k is equal to the number of equations of the linear system (i.e., k is large), then Z d and E −1 are square matrices with the same size as C . Furthermore, if Z d is defined as an identity matrix, the DPCG method is equivalent to a direct solver, since In this case, the additional computational costs are equal to the costs of inverting C , and the DPCG method will converge in one iteration. The algorithm for the DPCG method is in Table 1.

Definition of the deflation-subspace matrix for ssSNPBLUP and ssPCBLUP
The deflation vectors of the deflation-subspace matrix Z d can be defined following several techniques based on, e.g., approximating eigenvectors [29], recycling information of previous Krylov subspaces [21], or subdomain deflation vectors [22]. All these approaches have their own advantages and disadvantages [20,28]. For example, some advantages of the subdomain deflation approach are that Z d is sparse, and that additional computations for the DPCG method (in comparison to the PCG method) can be implemented efficiently [22]. Due to these advantages and based on preliminary results, the deflation vectors of the deflation-subspace matrix Z d were defined following the subdomain deflation approach in this study [22]. This approach divides the computational domain R n (with x ∈ R n ) into k non-overlapping subdomains, with each i-th ( i = 1, . . . , k ) subdomain corresponding to the i-th deflation vector. An entry of the deflation vector Z di is equal to 1 if the corresponding entry is included in the i-th subdomain; otherwise the entry of Z di is equal to 0. Therefore, each row of Z d contains only one non-zero element. The subdomain deflation approach gives good results if k is large enough [22].
Multiple divisions of the computational domain of ssSNPBLUP (ssPCBLUP) into k non-overlapping subdomains are possible to define the required deflationsubspace matrix Z d . The optimal division depends on the properties of the linear system of equations. For example, Vuik et al. [20] defined the subdomains based on the properties of the eigenvectors associated with the smallest eigenvalues of M −1 C for a class of layered problems with extreme contrasts in C . This approach could not be extended to ssSNPBLUP because we were not able to identify which model terms were associated with the most unfavourable eigenvalues of M −1 C (results not shown). However, based on the observation that fitting SNP effects explicitly led to an increase of the largest eigenvalues of M −1 C for ssSNPBLUP, in comparison to ssGBLUP (see sections Results and Discussion), we hypothesized that grouping SNP effects in subdomains could enable the DPCG method to annihilate the effects of the most unfavourable eigenvalues. Thus, we divided the ssSNPBLUP domain per trait and then within trait as follows: (1) all effects associated with the same trait and other than the SNP effects were included in a separate subdomain; and (2) each set of l randomly chosen SNP effects associated with the same trait were included in a separate subdomain. Following this division, the number of subdomains k , and therefore the rank of Z d and the size of the Galerkin matrix E , is equal to k = 1 + n snp l * t , where n snp is the number of SNPs, and n snp l is equal to the smallest integer greater than or equal to n snp l . It is also worth noting that the proposed division of the ssSNPBLUP domain with l = 1 SNP effect per subdomain leads to a system matrix M −1 PC with zero entries for all equations associated with the SNP effects (see Additional file 1). Because the behaviour of the PCG method applied to ssPCBLUP was similar to that of the PCG method applied to ssSNPBLUP (see the Results section), we divided the ssPCBLUP domain with the same approach as for the ssSNPBLUP domain, except that each set of PC effects included l consecutive PC effects, which were sorted following a descending order of their associated eigenvalues. In this study, four different deflationsubspace matrices for ssSNPBLUP (ssPCBLUP) were defined by means of sets of l = 1, 5, 50, and 200 SNP (PC) effects.

Termination criteria
Because the PCG and DPCG methods are iterative methods, termination criteria must be defined to determine when the methods have reached convergence. In this study, r j+1 b ≤ δ and r dj+1 b ≤ δ were used as termination criteria for the PCG and DPCG methods, respectively, with �·� being the 2-norm, and r dj+1 being the residual of the deflated system after j + 1 iterations. It has been shown that the residual of the PCG method is the same as the residual of the DPCG method [20]. Therefore, the two termination criteria are the same.

Data and models
Two datasets, a reduced dataset and a field dataset, were provided by CRV BV (The Netherlands). To achieve the first aim of this study, the reduced dataset was used to compare properties of the system matrices ( M −1 C or M −1 PC ) of ssSNPBLUP and of ssGBLUP, and to relate this to observed convergence patterns. To achieve the second aim of this study, the field dataset was used to test the feasibility of the DPCG method for solving the linear system of equations associated with ssSNPBLUP and ssP-CBLUP applied to large multi-trait datasets.
The reduced dataset and associated variance components were extracted from the Dutch single-step genomic evaluation from August 2017 for ovum pick-up (OPU) and embryo transfer of Holstein dairy cattle. After extraction of the OPU sessions, the data file included 61,592 OPU sessions from 4109 animals, and the pedigree included 37,021 animals. The genotypes of 6169 animals without phenotype were available. Bulls were genotyped using the Illumina 50 K SNP chip. Cows were genotyped using the Illumina 3 K chip and were imputed to 50 K density using a combination of the Phasebook software [30] and Beagle [31]. Because some currently used (own) libraries cannot handle sparse matrices with more than 2 31 -1 elements, and also to keep the system of equations at a reasonable size for subsequent analyses (e.g., for the computation of all eigenvalues), genotypes included 9994 segregating SNPs with a minor allele frequency higher than or equal to 0.01 and randomly sampled from (imputed) 50 K SNP genotypes. The univariate mixed model included random effects (additive genetic, permanent environmental, and residual), fixed co-variables (heterosis and recombination), and fixed crossclassified effects (herd-year, year-month, parity, age (in months), technician, assistant, interval, gestation, session, and protocol) [32].
The field dataset and associated variance components were from the 4-trait routine genetic evaluation of August 2017 for temperament and milking speed of dairy cattle for the Netherlands and the Flemish region in Belgium [33,34]. Performance in both countries were considered as different traits, with genetic correlations between Flemish and Dutch traits higher than 0.85. The data file included 3882,772 records with a single record per animal. The pedigree included 6130,519 animals. The genotypes of 15,205 animals without phenotype and of 75,758 animals with phenotype were available. Animals were genotyped in the same manner as described above. After removing non-segregating SNPs and SNPs with a minor allele frequency lower than 0.01, genotypes included 37,995 segregating SNPs. The four-trait mixed model included random effects (additive genetic and residual), fixed co-variables (heterosis and recombination), and fixed cross-classified (herd × year × season at classification, age at classification, lactation stage at classification, milk yield and month of calving). More details on this genetic evaluation can be found in [33,34].
For both datasets, genetic groups were removed from the pedigrees (for simplicity), the proportion w for residual polygenic effects was assumed to be equal to 0.05, and the centered genotype matrices Z for ssSNPBLUP and the matrices G −1 − A −1 gg for ssGBLUP were computed with the software calc_grm [35]. For the field dataset only, a matrix T that contained PC kept for ssPCBLUP was also computed using the software calc_grm. The number of PC kept was equal to the number of the largest eigenvalues that together explain 99% of the genomic variation.

Statistical analyses Computation of eigenvalues and effective spectral condition numbers
Properties of the system matrices and convergence patterns of ssSNPBLUP and ssPCBLUP were compared to those of ssGBLUP. For the reduced dataset, the spectrum of the preconditioned coefficient matrix, M −1 C , of both ssGBLUP and ssSNPBLUP, and of the preconditioned deflated coefficient matrix, M −1 PC , of ssSNPBLUP with different deflation-subspace matrices, were computed using Intel(R) Math Kernel Library (MKL) 11.3.2 subroutines. For the field dataset, due to the large number of equations (> 10 7 ), the smallest and largest positive eigenvalues of the different system matrices of ssSNPB-LUP, and ssPCBLUP were approximated using the Lanczos method based on information obtained from the (D)PCG method [36,37]. Performed with the (D)PCG method, the Lanczos method approximates the smallest and largest eigenvalues that influence the convergence. Because the null space of a system matrix never enters the iteration of the (D)PCG method, the corresponding zero eigenvalues do not influence the convergence and therefore are not approximated with the Lanczos method [20,22,38,39]. It is also worth noting that the precision of the approximations of the eigenvalues may vary between analyses, which can partly explain the fact that the number of iterations to reach convergence may not be completely related with the associated effective spectral condition number. The different κ M −1 C and κ M −1 PC were thereafter computed to compare the different systems of equations.

Solving ssSNPBLUP, ssPCBLUP, and ssGBLUP
Linear systems of equations of ssSNPBLUP, of ssPCBLUP, and of ssGBLUP, were solved by using the PCG method. Systems of ssSNPBLUP were also solved by using the DPCG method with sets of 1, 5, 50, and 200 randomly chosen SNP effects per subdomain for the reduced dataset, and with sets of 5, 50, and 200 randomly chosen SNP effects per subdomain for the field dataset. The set of 1 SNP effect per subdomain was not used for the field dataset due to a Galerkin matrix of size 155,980, which was considered as too large for inversion. Similarly, systems of ssPCPBLUP were solved by using the DPCG method with sets of 1, 5, 50, and 200 consecutive PC effects per subdomain for the field dataset. For both the PCG and DPCG methods, the iterative process was run for a maximum of 10,000 iterations, or until termination criteria reached δ = 10 −6 . In addition, for both the PCG and DPCG methods, the preconditioner M was equal to: where the subscripts f and r refer to the equations associated with fixed and random effects, respectively, and block − diag(C rr ) is a block-diagonal matrix with blocks corresponding to equations for different traits within a level (e.g., an animal). For the field dataset, diagonal elements of Q = A gn (A nn ) −1 A ng were approximated using a Monte Carlo method [40,41].
Linear systems of equations of ssSNPBLUP, of ssPCB-LUP, and of ssGBLUP were solved by using a Fortran 2003 program exploiting BLAS and sparse BLAS routines and the parallel direct sparse solver PARDISO, all from the multi-threaded Intel Math Kernel Library 11.3.2, and OpenMP parallel computing. For the reduced dataset, the coefficient matrix C was held in memory using a compressed sparse row format, and the multiplication of P by a vector v required by the DPCG method, was per- indicate the order of the matrix-vector operations. The matrices E −1 and Ŵ = CZ d were readily computed from the coefficient matrix C held in memory. These matrices were computed before starting the iterative process and held in memory. The number of OpenMP threads was limited to 3 for the reduced dataset.
For the field dataset, the coefficient matrix C was reconstructed using a matrix-free approach when required for its multiplication by a vector. The matrix E −1 was held in memory. The computation of E was not as straightforward as for the reduced dataset because the coefficient matrix C was not held in memory. Therefore, the matrix E was computed following a suboptimal 4-step approach. The first step consisted of sequentially pre-and post-multiplying the coefficient matrix C by the first t deflation vectors of the deflation-subspace matrix Z d , i.e., the deflation vectors corresponding to the subdomains that included all effects associated with a same trait and other than the SNP effects. The second step consisted of computing each vector Z 1−w G −1 0 computed explicitly beforehand. Furthermore, because the matrix Ŵ = CZ d was too large to be held in memory for the field dataset, the multiplication of P by a vector v required by the DPCG method, was performed as . Due to this latter implementation of the multiplication of P by v , each iteration of the DPCG method requires two matrix ( C)-vector products, instead of one matrix-vector product for the PCG method. The number of OpenMP threads was limited to 5 for the field dataset. All real vectors and matrices were stored using double precision real numbers, except for the preconditioner, which was stored using single precision real numbers. All computations were performed on a computer with 528 GB and running RedHat 7.4 (x86_64) with an Intel Xeon E5-2667 (3.20 GHz) central processing unit processor with 32 cores. Main random access memory (RAM) and time requirements are reported for the field dataset. All reported times are indicative, because they may have been influenced by other jobs running simultaneously on the computer.

Comparison of estimates of different single-step BLUP
Estimates for all fixed effects, additive genetic effects, and other possible random effects, of ssGBLUP solved with the PCG method, of ssSNPBLUP solved with the PCG and DPCG methods, and of ssPCBLUP solved with the PCG and DPCG methods, were (almost) the same after convergence was reached. For example, Pearson correlations of all estimates of ssGBLUP solved with the PCG Fig. 1 Eigenvalues of different preconditioned (deflated) coefficient matrices for the reduced dataset. Eigenvalues of the preconditioned coefficient matrices of ssGBLUP and of ssSNPBLUP, and of the preconditioned deflated coefficient matrix of ssSNPBLUP with one SNP effect per subdomain are depicted on a logarithm scale. All eigenvalues less than 10 −11 were set to 10 −11 . Eigenvalues are sorted in ascending order Vandenplas et al. Genet Sel Evol (2018) 50:51 method and the corresponding estimates of ssSNPBLUP solved with the DPCG method using 5 SNP effects per subdomain were higher than 0.999 for both the reduced and field datasets ( Table 2) and (see Additional file 2: Figure S1). Regression of estimates of ssGBLUP on estimates of ssSNPBLUP solved with the DPCG method using 5 SNP effects per subdomain led to regression coefficients close to 1 and intercepts close to 0 ( Table 2). Similar results were obtained for ssPCBLUP solved with the DPCG method using 1 PC effect per subdomain for the field dataset (Table 2) and (see Additional file 2: Figure  S2).

Reduced dataset
For the reduced dataset, the number of equations was equal to 41,949 for ssGBLUP and to 51,943 for ssSNPB-LUP. Figure 1 shows the spectrum of M −1 C of ssGBLUP and of ssSNPBLUP, and the spectrum of M −1 PC of ssS-NPBLUP with 1 SNP effect per subdomain. All eigenvalues less than 10 −11 were assumed to be non-zero due to, for example, rounding errors, and therefore they were set to zero for subsequent analyses. Similar patterns for the different spectra were observed. The smallest nonzero eigenvalues of the different M −1 C and M −1 PC that influenced convergence, were equal to 1.1 × 10 −4 , regardless of the model or the definition of subdomains. The largest eigenvalue of M −1 C was equal to 12 for ssGB-LUP, and 181 for ssSNPBLUP (Table 3). When deflation was applied, the largest eigenvalue of M −1 PC varied from 6 with 1 or 5 SNP effects per subdomain to 99 with 200 SNP effects per subdomain. Deflation of the largest eigenvalues of M −1 C of ssSNPBLUP can be also observed in Fig. 1. After deflation, the effective spectral condition number of ssSNPBLUP decreased from 1.7 × 10 6 to between 5.9 × 10 4 with 1 SNP effect per subdomain and 9.3 × 10 5 with 200 SNP effects per subdomain. Both the PCG and DPCG methods reached the termination criteria within 10,000 iterations, and converged to the same solutions for all linear systems of ssGBLUP and ssSNPBLUP. When the PCG method was used, the number of iterations to reach convergence was more than 5 times larger for ssSNPBLUP compared to ssGB-LUP (Table 3; Fig. 2). However, when the DPCG method with 1 SNP effect per subdomain was used, the number of iterations decreased by a factor 5, and was similar to the number of iterations needed for ssGBLUP. Five, 50 and 200 SNP effects per subdomain also led to a decrease of the number of iterations by a factor 4.3, 1.7 and 1.3, respectively (Table 3). Figure 2 depicts termination criteria by iteration for the PCG and DPCG methods. A flat pattern is observed for the PCG method applied to ssS-NPBLUP. The DPCG method allowed removing this flat pattern such that a pattern similar to that of ssGBLUP was observed.
Regarding the wall clock time per iteration, when the PCG method was applied, about 0.05 s and 0.46 s were needed for ssGBLUP and ssSNPBLUP, respectively. The wall clock time for the iterative process to reach convergence, i.e. excluding the time needed for I/O operations and computations of different matrices (e.g., Z , , was about 11 s for ssGBLUP, and about 688 s for ssSNPBLUP solved with the PCG method. When the DPCG method was applied, the time per iteration for ssSNPBLUP slightly increased due to additional computations involving the deflation matrix P . However, the total time for the iterative process decreased to a minimum value of 170 s with 5 SNP effects per subdomain (Table 3).

Field dataset
For the field dataset, the number of equations was larger than 25.8 × 10 6 for all systems of equations. In total, 13,803 largest eigenvalues of Z explained 99% iterations of the PCG method (i.e., the iterative process that never reached convergence) was equal to 52,683 s for ssSNPBLUP and to 30,198 s for ssPCBLUP.
Regarding RAM and time requirements for the field dataset during the solving process, the peak RAM was about 70 GB for ssGBLUP, about 34 GB for ssSNPBLUP, Fig. 4 Termination criteria for the field dataset for ssGBLUP and ssPCBLUP using the PCG method and for ssPCBLUP using the DPCG method. The number of principal components retained was equal to 13,803. Number of PC effects per subdomain is within brackets Table 5 Computational costs for different matrices and for the software used for the field dataset a Number of SNP (PC) effects per subdomain is within brackets b The size of the Galerkin matrix is equal to the rank of the deflation-subspace matrix c Wall clock time required for the computation of the Galerkin matrix following a naive implementation, and computation of its inverse d The dense matrix is the centered genotype matrix Z for ssSNPBLUP and the matrix with principal components T for ssPCBLUP e The software peak memory is defined as the peak resident set size (VmHWM) obtained from the Linux/proc virtual file system has favourable properties for parallel computing [22]. For example, the deflation-subspace matrix in our implementation is stored as a vector of size of the number of equations of the system, and each entry of this vector contains the identification number of the subdomain associated with the corresponding equation. Moreover, the Galerkin matrix E was held in memory as a dense matrix in our implementation, which is possible on current sharedmemory computers when the numbers of SNPs and traits are reasonable. Holding the Galerkin matrix in memory also allows efficient parallel computing using Intel MKL subroutines. Furthermore, while a suboptimal approach was used in this study to compute the Galerkin matrix E , we expect its computation to be feasible within a limited amount of time and memory by taking further advantages of the symmetry of the coefficient matrix C and of the properties of the subdomain deflation approach. Improvement of the current definition of the subdomains for ssSNPBLUP and ssPCBLUP could reduce further computational costs (i.e., time and memory requirements). The definition of subdomains used in this study was arbitrary, that is the number of SNP effects assigned to one subdomain was the same for each subdomain, and SNP effects assigned to one subdomain were randomly chosen. It would be worth investigating whether assignments of SNP effects to a subdomain based on properties of the SNP genotypes, such as linkage disequilibrium, could reduce the number of subdomains while maintaining, or decreasing, the obtained effective spectral condition numbers. Indeed, the current definition of subdomains could lead to too large Galerkin matrices for ssSNPBLUP with a large number of traits. Furthermore, for the field dataset, the current definition of subdomains did not allow to hold in memory the matrix CZ d for computation efficiency [22]. Instead, we had to perform the multiplication of C by a vector twice each DPCG iteration, leading to double wall clock times per DPCG iteration in comparison to the PCG method.
For large datasets, a matrix-free approach (that is our second implementation) allows to solve ssSNPBLUP and ssPCBLUP with a (D)PCG method on current computers and with limited amounts of RAM and of wall clock time. Indeed, large and dense matrices of the linear system of Eq. (2), such as H 13 = G −1 0 ⊗ A ng Z for ssSNPBLUP, are never computed explicitly. Instead, the matrix-free approach takes advantage of the fact that the (D)PCG method requires the multiplication of C by a vector. For example, the multiplication of H 13 by a vector d is performed in three parts, i.e.
] where the brackets [·] indicate the order of the matrixvector operations. Also, when using a matrix-free approach, one of the largest computational costs of ssS-NPBLUP (ssPCBLUP) solved with a (D)PCG method is, most likely, the multiplication of Z ( T ) by a vector d . Thus, it is expected that the main computational costs of ssSNPBLUP and of ssPCBLUP will increase linearly with increasing numbers of genotyped animals. Such a linear increase of the computational costs is also observed for ssGBLUP using APY [6] or ssGTBLUP based on a Woodbury decomposition of G [8].
While the solving process for ssPCBLUP seems to be more favourable than that for ssSNPBLUP in terms of memory and time requirements, the comparison between the two approaches should also consider additional computations, such as the SVD of the centered genotype matrix for ssPCBLUP. For the field dataset, the computational costs were quite substantial (i.e. > 15,000 s with 10 threads), and these will increase linearly with the number of genotyped animals and quadratically with the number of SNPs (assuming that the number of genotyped animals is larger than the number of SNPs). However, the time needed for SVD can be reduced by analysing different genome segments (e.g., chromosomes) in parallel [7]. For example, using an own Coarray Fortran program with 5 images (processes) using each 2 CPU, performing the SVD of 5 genome segments (of the same size) in parallel took 1276 s, and 16,662 PC were kept (instead of 13,803 PC kept with the SVD to the full genotype matrix). Performing the SVD on 5 genome segments instead of on the full genotype matrix, only marginally increased time and memory required for ssPCBLUP using the DPCG method with 1 PC per subdomain (results not shown). Therefore, further studies comparing computational costs for the whole process of ssPCBLUP, of ssSNPB-LUP, but also of ssGBLUP and related methods (ssGB-LUP using APY [6], and ssGTBLUP [8]) are needed. Such studies should consider costs of SVD, of computation of genomic relationship matrices, and of back-solving SNP effects from genomic estimated breeding values.
Because both PCG and DPCG methods are CG-based methods, the DPCG method can be easily implemented in current software based on the PCG method for other ssSNPBLUP (ssPCBLUP) models, or even for pedigreeand ssGBLUP models, for which convergence issues are observed. Modifications of existing PCG software would be mainly associated with the multiplication of P = I − CZ d E −1 Z ′ d by a vector, which can rely on existing code for the multiplication of C by a vector. Using the DPCG method with pedigree-BLUP or ssGBLUP could also improve their convergence patterns. For example, the number of iterations to solve the pedigree-BLUP of the field dataset decreased by about 30% (in comparison to the PCG method) after associating one subdomain with each of the 100 sires that had the largest progeny. While this approach could not be generalised to other available field datasets (results not shown), it seems worthwhile to investigate the DPCG method in the pedigree-and ssGB-LUP contexts for performing routine genetic evaluations with increasing datasets within a reasonable time.

Conclusions
We showed that convergence issues observed with ssS-NPBLUP and ssPCBLUP solved by the PCG method are related with larger eigenvalues and larger effective spectral condition numbers in comparison to ssGBLUP. These convergence issues of ssSNPBLUP and of ssPCB-LUP were solved with a DPCG method, which is a twolevel PCG method for ill-conditioned linear systems. As defined in this study, the DPCG method treats the largest unfavourable eigenvalues of the preconditioned coefficient matrix of ssSNPBLUP and of ssPCBLUP, and leads to a convergence pattern, which is at least similar to that of ssGBLUP.

Additional files
Additional file 1. Derivation of a preconditioned deflated coefficient matrix. Description: Here we derive the preconditioned deflated coefficient matrix when the computational domain is divided such that some effects are included alone in a separate subdomain.
Additional file 2: Figure S1. Comparison of the estimates of ssGBLUP solved with the PCG method and of ssSNPBLUP solved with the DPCG method using five SNP effects per subdomain. Estimates are for all fixed effects and random additive genetic effects for the field dataset. Figure  S2. Comparison of the estimates of ssGBLUP solved with the PCG method and of ssPCBLUP solved with the DPCG method using one PC effect per subdomain. Estimates are for all fixed effects and random additive genetic effects for the field dataset.
Authors' contributions JV conceived the experimental design, ran the tests, and wrote the programs and the first draft. HE prepared data. CV, MPLC, and HE provided valuable insights throughout the analysis and writing process. All authors read and approved the final manuscript.