An Efficient Method to Calculate Genomic Prediction Accuracy for New Individuals

Diagonal elements of the coefficient matrix are necessary to calculate the genomic prediction accuracy. Here an improved methodology is described, to update the inverse of the coefficient matrix (C) for new individuals with a genotype, with and without phenotypes. Computational performance is significantly improved by re-using parts of the coefficient matrix inverse calculations that do not change from one animal to another, in combination with updated calculations for those that do change. This method expedites calculation of accuracy for new individuals with genotypes, without re-doing the whole population, by using the previously calculated matrices.


INTRODUCTION
In the last decade, technological advances have significantly decreased genotyping costs, particularly for agricultural livestock and cropping species. This reduction in costs has enabled regular genomic best linear unbiased prediction (GBLUP) (VanRaden, 2008) analyses for the production of genomic breeding values. Currently, genomic information is routinely used in the Australian beef industry in producing estimated breeding values (EBVs). Low costs and high industry uptake has resulted in a rapid increase in the number of new genotypes and thus the size of the genomic population is growing larger. GBLUP requires inversion of the genomic relationship matrix (G) and the coefficient matrix (C), which is computationally demanding. More efficient methods such as APY (Algorithm for Proven and Young animals) (Misztal, 2016) and PICD (Partial Incomplete Cholesky Decomposition) (Hancock, 2017) can handle large numbers of genotyped animals by approximating the inverse of G only and not the coefficient matrix. However, these approaches do not address the need for diagonal elements of the coefficient matrix inverse (left hand side) required for calculating EBV accuracies. With the increasing speed at which new genotypes are provided, inversion of the coefficient matrix for accuracy calculation is increasingly computationally demanding and time consuming, requiring more efficient methods.
Here we propose a method to calculate the accuracy of new individuals, with and without phenotypes, by updating the coefficient matrix inverse (C −1 ) for new individuals only, without redoing the whole population. Using this method, we significantly reduce time and computational demand by updating the accuracy of new individuals and reducing redundancy in the reference population.

Theory
We consider a simple animal model without fixed effects. This model is where y, Z, u, and e are vector of observations, design matrix, vector of solutions, and vector of random residual effects, respectively. The solutions and residual variances are var(u) = Gσ 2 u and var(e) = Iσ 2 e . The mixed model equations (MME) for the above model are where C is the coefficient matrix and α = σ 2 e σ 2 u . Henderson derived a method by using the diagonal of C −1 and the diagonal of G to calculate the accuracy of each estimated breeding value (Henderson, 1975). Accordingly, the accuracy can be calculated with this formula 1 − α c ii g ii , where c ii is the diagonal element of C −1 for individual i, and g ii is the diagonal element of G for individual i.

Updating C −1 to Illustrate the Proposed Method
To calculate the accuracy of individuals with or without phenotypes, each individual can be added to C −1 separately. In this case, the partitioned matrix of MME (Equation 2) is where subscript p and q are core individuals forming the reference population and new individuals respectively. New individuals may or may not have phenotypes. As demonstrated in Equation (2), Z ′ Z becomes where (Z ′ Z) pp is a diagonal matrix with dimension equal to the number of core animals in the population, and the (Z ′ Z) qq in the lower diagonal represents new animals.
Since G −1 becomes based on Equations (2) and (3). Inverting C is computationally demanding, as both G and the entire C should be inverted in each analysis for all individuals. G −1 pp needs to be updated as the new individuals are added to G. This can be performed by the method explained in Meyer et al. (2013). However, since we want to know the accuracy, we must invert C as well as G. Equation (4) can be converted with the following inversion lemma which is equivalent to the Woodbury's formula (Henderson, 1963(Henderson, , 1975Henderson and Searle, 1981): where M −1 as (G + α(Z ′ Z) −1 ) −1 is used for simplification and is shown below in Equation (8). For the partitioned matrices in Equation (4), where ≈ is approximation sign. By using lemma (6) G −1 is not required and we only need to invert the middle matrix (M) in Equation (7). With this simplification M −1 can be updated for each new individual using Cholesky decomposition and multiplying the Cholesky factors, i.e., M −1 = L −T L −1 (Harville, 1997;Meyer et al., 2013).
Therefore, Equation (7) can be written as based on Equation (8 By multiplying back the Cholesky factors of M the solutions for L ′ pq and L qq L T qq are and (11) and C qq which is the inverse of C qq becomes where "-" is the right division sign (multiplying numerator by inverse of denominator) and Only G ′ pq , G pq , G qq , and (Z ′ Z) qq change with each new genotyped individual.
For animals without phenotypes (Z ′ Z) qq is a null matrix and the denominator in equation 14 becomes zero. However, we can assume the limit approach to α as (Z ′ Z) qq approach to zero. Thus, C qq is In summary, Equations (14) and (15) can be used to calculate the prediction accuracies of individuals with and without phenotype, respectively.

Updating the M −1 for New Individuals
Based on Equations (14) and (15) by regarding previous assumptions and Equation (8). M −1 is the largest matrix that was generated in the previous run and can be compressed and stored in binary format to avoid memory issues. The other matrices were small and can be built efficiently by using optimized Linear Algebra PACKage (LAPACK). The equations (14, 15, and 16) were implemented as an R function (Appendix A) to show the prototype and in C++ with Armadillo library (Sanderson and Curtin, 2016) to assess its performancesingle thread.

Simulated Data
Matrices with seven columns representing seven singlenucleotide polymorphisms for each individual and 1000, 2000, 3000, ... 24000, and 25000 rows were created and filled with 0 (AA), 1 (AB), and 2 (BB) randomly. The genomic relationship matrices (G) were built by using VanRaden (2008) method 1, with dimension individuals by individuals. Importantly, increasing the number to SNPs does not affect the computational time. These matrices were used to assess the performance of the proposed method to calculate accuracy.

Performance Evaluation
To evaluate performance, each set was run in three steps. In the first step, the elapsed time to build the coefficient matrix by using the classic approach (i.e., inverting G and C) was measured.
In the second step, the time to build (G pp + α(Z ′ Z) −1 pp ) −1initial matrices required to update c qq was measured. In the third step, the time to calculate c qq by using the initial matrices was measured.

RESULTS AND DISCUSSIONS
By calculating the accuracy of young individuals using Equations (14) and (15) computational times have been significantly reduced. Computational performance using this method is considerably faster, in comparison with existing methods, as shown in Figure 1, with only negligible differences in accuracy due to rounding errors (less than 8.88 × 10 −16 ). The proposed approach using Z ′ Z (a diagonal matrix) resulted in shorter time to build the matrices used to update c qq compared to when using the classic approach to calculate accuracies. This method can be extended in order to accommodate fixed effects and dense Z ′ Z when c qq is updated. Furthermore, the part of C for animals with phenotype (C pp ) must be updated as more individuals are phenotyped.
This method could be exploited within routine breeding value estimation for expidited accuracy calculations. Breeding value accuracy is based on an individual's relatedness to the core reference, such that high accuracy indicates high relatedness. This method to calculate accuracy will affect how the genotypes are used, based on how informative they are for the prediction, improving efficiency by reducing redundant information.
New individuals with phenotypes and low accuracy can be added to the core population, as it is likely these animals are lowly related. Their addition improves the diversity and informativity of the core reference population, and can further improve imputation accuracy of the missing genotypes, with added FIGURE 1 | The graph shows the elapsed time required to calculate c qq using different approaches. inv M is the elapsed time to calculate the M matrix which is the core element to calculate c qq for a new individual. Update Reference is the elapse time to update M matrix, the c qq with and without phenotype shows the elapsed time to calculate c qq when there is or there is not phenotype for the individual, respectively. Their performances were very similar, and as such the lines overlap.
diversity into the imputation haplotype library. Individuals with high accuracy are not required to be added to the core, with or without phenotype, as their accuracy indicates their relatives are already included in this reference population, making their addition redundant. New individuals without phenotypes and low accuracy, should have relatives genotyped to improve accuracy and/or should have their phenotypes recorded to improve the core population.
It is possible to exploit the accuracy calculation as a type of quality control filter for population data, such that individuals with an expected level of relatedness to the reference population, obtains a low accuracy, this may be indicative of genotyping/sampling error, mis-assigned breed, etc. The rapid accuracy calculation for those individuals without phenotype can provide important context for quickly developing a phenotyping strategy.

CONCLUSION
Updating the inverse of C for new individuals with and without phenotype, using the method here, is shown to reduce the computational effort significantly. With increasing numbers of genotyped animals in genetic evaluations, computational efficiency is essential for frequent and timely evaluations. This method provides an improved and efficient method to deliver accurate and fast evaluations when few new young individuals are genotyped but may or may not have phenotypes.

AUTHOR CONTRIBUTIONS
MF developed the method, structured the manuscript, and wrote the method and theory. NC wrote the introduction, result and discussion, and performed major revision. BT gave some comments to improve the final method and article.