Predicted Residual Error Sum of Squares of Mixed Models: An Application for Genomic Prediction

Genomic prediction is a statistical method to predict phenotypes of polygenic traits using high-throughput genomic data. Most diseases and behaviors in humans and animals are polygenic traits. The majority of agronomic traits in crops are also polygenic. Accurate prediction of these traits can help medical professionals diagnose acute diseases and breeders to increase food products, and therefore significantly contribute to human health and global food security. The best linear unbiased prediction (BLUP) is an important tool to analyze high-throughput genomic data for prediction. However, to judge the efficacy of the BLUP model with a particular set of predictors for a given trait, one has to provide an unbiased mechanism to evaluate the predictability. Cross-validation (CV) is an essential tool to achieve this goal, where a sample is partitioned into K parts of roughly equal size, one part is predicted using parameters estimated from the remaining K – 1 parts, and eventually every part is predicted using a sample excluding that part. Such a CV is called the K-fold CV. Unfortunately, CV presents a substantial increase in computational burden. We developed an alternative method, the HAT method, to replace CV. The new method corrects the estimated residual errors from the whole sample analysis using the leverage values of a hat matrix of the random effects to achieve the predicted residual errors. Properties of the HAT method were investigated using seven agronomic and 1000 metabolomic traits of an inbred rice population. Results showed that the HAT method is a very good approximation of the CV method. The method was also applied to 10 traits in 1495 hybrid rice with 1.6 million SNPs, and to human height of 6161 subjects with roughly 0.5 million SNPs of the Framingham heart study data. Predictabilities of the HAT and CV methods were all similar. The HAT method allows us to easily evaluate the predictabilities of genomic prediction for large numbers of traits in very large populations.


KEYWORDS
best linear unbiased prediction cross-validation generalized cross-validation genomic selection hybrid breeding mixed model Gen Pred Shared data resource Many diseases, anatomic structures, physiological characteristics, and behaviors in humans are polygenic traits. Most agronomic traits in agriculture, e.g., yield, are also polygenic. These complex traits require whole-genome study to understand the genetic mechanisms and to genetically improve the quality and quantity of agricultural products (de los Campos et al. 2009(de los Campos et al. , 2013a. Genomic prediction (selection) is a statistical method of whole-genome study (Meuwissen et al. 2001). It can lead to earlier detection of acute polygenic cancers (Vazquez et al. 2012). Genomic prediction is also an effective tool to select superior cultivars in plant breeding (Heffner et al. 2009). Genomic hybrid prediction will provide an opportunity to evaluate all potential hybrids and allow breeders to select superior hybrids that will have little chance to be discovered based on traditional hybrid breeding schemes (Xu et al. 2014). Genomic selection has been very successful in the dairy cattle industry (Goddard and Hayes 2007) and will soon become a routine procedure for breeding of a vast number of agricultural species. Among the commonly used methods for genomic prediction, BLUP (Henderson 1975) is one of a few suitable methods for handling highthroughput genomic data with millions of genetic variants (VanRaden 2008). Reproducing kernel Hilbert spaces (RKHS) regression (Gianola et al. 2006) is another method with such an ability, but RKHS has not been as well recognized as the BLUP method. Although variable selection approaches such as Bayes B (Meuwissen et al. 2001) and LASSO (Tibshirani 1996) are optimal for traits with a few detectable loci of large effects plus many undetectable modifying loci under low and intermediate marker density, BLUP is the most robust method and one of the most commonly used genomic selection methods (de los Campos et al. 2013b). More importantly, the computational speed does not depend on marker density because it takes a marker-inferred kinship matrix (covariance structure) as the input data, albeit computing kinship matrix taking additional time. To evaluate the predictability of the BLUP model, one has to resort to some other tools, such as validation or CV, where individuals predicted do not contribute to estimated parameters that are used to predict these individuals. If individuals predicted are not excluded from the training sample, serious bias will occur in prediction.
The predictability of a model is often represented by the squared correlation coefficient between the observed and predicted phenotypic values (Xu et al. 2014). This squared correlation is approximately equal to R 2 ¼ 1 2 PRESS=SS, where PRESS is the predicted residual error sum of squares and SS is the total sum of squares of the phenotypic values. Allen (1971Allen ( , 1974 proposed to use PRESS as a criterion to evaluate a regression model, in contrast to using the estimated residual error sum of squares (ERESS) as the criterion. To calculate PRESS, Allen (1971Allen ( , 1974 used an approach that is now called the leaveone-out cross-validation (LOOCV) or ordinary CV (Craven and Wahba 1979), in which an individual is predicted using parameters estimated from the sample that excludes this individual. When the sample size (n) is large, LOOCV presents a high computational cost because one will virtually have to analyze the data n times. The K-fold CV (Picard and Cook 1984) is an extension of LOOCV in which the sample is partitioned into K parts of roughly equal size. Individuals in a part are predicted simultaneously using all individuals in the remaining K 2 1 parts. Eventually, all parts are predicted once and used to estimate parameters K 2 1 times. When K is small, there are many different ways of partitioning the sample, leading to variation in the calculated predictability. This variation can be very large for small sample sizes. Therefore, people often repeat the K-fold CV a few times and use their average values to reduce the error due to random partitioning. If possible, LOOCV (also called the n-fold CV, a special case of K-fold CV when K = n and n is the sample size) is recommended because it eliminates all problems associated with this random partitioning variation. However, such a CV is not realistic for large samples under the mixed model methodology. Although a simple split CV (50% training and 50% test) should suffice with very large samples, still 50% of the sample is wasted. The LOOCV method may slightly overpredict the model compared with the K-fold CV when K is substantially smaller than n (Hastie et al. 2008). Cook (1977Cook ( , 1979 developed an explicit method to calculate PRESS by correcting the deflated residual error of an observation using the leverage value of the observation without repeated analyses of the partitioned samples. This method applies to least square regression under the fixed model framework, where the predicted y is a linear function of the observed y as shown below, is called the hat matrix. The predicted residual error for observation j is e j ¼ê j =ð1 2 h jj Þ whereê j ¼ y j 2 X jb is the so-called estimated residual error and h jj is the leverage value of the jth observation (the jth diagonal element of the hat matrix). It is the contribution of the prediction for an individual from itself and may be called the conflict of interest factor. The predicted residual error is the estimated residual error after correction for the deflation. The sum of squares of the predicted residual errors over all individuals is the PRESS, which is a well-known statistic in multiple regression analyses. To find an explicit expression of PRESS for a mixed model, we need to identify a random effect version of the hat matrix and use the leverage value of the jth observation to correct for the deflated residual error. The HAT method is a fast algorithm for the ordinary CV for a linear model (Allen 1971(Allen , 1974 because the regression analysis is only done once on the whole sample and then the estimated residual errors are modified afterward. Extension of the HAT method to mixed model has been made by Golab et al. (1979) in finding the optimal ridge factor in ridge regression (Hoerl and Kennard 1970a,b). It is well known that ridge regression can be formulated as a mixed model problem with the variance ratio replaced by a given ridge factor. Golab et al. (1979) proposed a generalized cross-validation (GCV) method to find the optimal ridge factor so that the generalized residual error variance is minimized. These authors showed that the GCV-calculated residual error sum of squares is a rotation-invariant version of Allen's PRESS. The residual error variance obtained from the GCV method is equivalent to calculating the residual error variance by dividing the ERESS by an "effective" degree of freedom. Properties of the GCV method have been extensively studied by Li (1987). Jansen et al. (1997) applied GCV to wavelet thresholding. When performing genomic prediction, we prefer to see the actual predicted residual errors (errors in prediction of future individuals) obtained from the ordinary CV because the residual errors obtained via GCV may not be intuitive to most of us. The important gain from the GCV method of ridge regression analysis to genomic selection is the HAT matrix of the random model when the genomic variance is given.
There is rich literature on smoothing spline analysis that also helped us to develop the fast HAT method for evaluation of mixed model predictability (Wahba 1975(Wahba , 1980(Wahba , 1990(Wahba , 1998Wahba and Wold 1975a,b;Craven and Wahba 1979;Wahba et al. 1995Wahba et al. , 2000Wahba and Luo 1997;Wang 1998a,b;Hastie et al. 2008). In smoothing spline curve fitting, a response variable is fitted to a predictor with an arbitrary functional relationship. The common approach is to fit the curve using B-spline or another type of nonparametric approach. Several spline bases (more than necessary) are constructed from the original predictor. These bases are considered as new predictors, which are then used to fit the response variable with linear relationship. The regression coefficients are then estimated using a penalized shrinkage method such as the ridge regression. The ridge parameter in smoothing splines is then called the smoothing parameter ðlÞ; which is often found so that the GCV residual error variance is minimized (Craven and Wahba 1979;Wahba 1980). Given the smoothing parameter, the predicted responses of all individuals are linear functions of all observed responses. Hastie et al. (2008) collectively called these linear functions the smoother matrix and denoted it by S l : This smoother matrix is the random effect version of the HAT matrix, where Q is a known diagonal matrix. A HAT matrix under the random model was also given by de los Campos et al. (2013b)  while H R has a rank of n (number of observations). So, the HAT matrix for a random model has been defined by the smoothing splines community. We may implement this HAT matrix in our BLUP prediction to evaluate the predictability of our models and avoid the lengthy CV analysis. The smoothing parameter (our variance ratio) should be given a reasonable value and the REML estimate from the whole sample is a natural choice. However, replacing the prechosen l by a data-driven estimate makes the HAT matrix a complicated function of the data. The question is, what is the difference between the HAT method (when l is estimated from the whole sample) and the actual CV (when l is estimated anew within each fold)? This becomes the main objective of this study. When revising this manuscript, a similar study was published in the same journal (G3: Genes| Genomes | Genetics) by Gianola and Schon (2016). They also recognized the approximation nature of the new method and stated that using the whole-sample-estimated l in place of the prechosen l will not affect the result too much, especially when the LOOCV is compared, because the training sample only differs from the whole sample by one observation. However, this is only a speculation (most likely true) and they did not explicitly investigate the difference. Since the new method represents a significant technical improvement in genomic selection, the community must be aware of the difference before widely adopting the new method to evaluate a genomic selection program. In this study, we explicitly answer this question by analyzing several agronomic traits and 1000 metabolomic traits from two rice populations. Further comparison was also made in genomic prediction of human height from the Framingham heart study data (Dawber et al. 1951(Dawber et al. , 1963).

Fixed model
The HAT method for calculating PRESS under the fixed model is given by Cook (1977Cook ( , 1979 for the LOOCV scenario but not for the leave n k out CV (the K-fold CV). We extended Cook's method to leave n k out for weighted least squares regression analysis. The predicted y is a linear function of the observed y as shown below, is the hat matrix. This H matrix is still idempotent. In a K-fold CV analysis, let n k be the number of observations in the kth fold for k ¼ 1; . . . ; K and P K k¼1 n k ¼ n: Define X k as an n k · p matrix of independent variables for individuals in the kth fold, where p is the number of independent variables. The "leverage" value for the kth fold is defined as an n k · n k matrix, where W k is the n k · n k subset of matrix W corresponding to the kth fold. This matrix must appear in the end, not in the beginning, of the above equation. Let be the estimated residual errors whereb is estimated from the whole sample. The predicted residual errors for the n k individuals in the kth fold is Therefore, the PRESS is defined as which is the weighted sum of squares of the predicted residual errors. Derivation of Equation 9 is given in Appendix A.

Mixed model
The linear mixed model for genomic prediction is written as where Xb represents the fixed effects, j is a vector of random (polygenic) effects with an assumed Nð0; As 2 j Þ distribution, and e Nð0; Is 2 Þ is a vector of residual errors. The expectation and variance of y are EðyÞ ¼ Xb and varðyÞ ¼ V ¼ As 2 j þ Is 2 ; respectively, where A is a marker-inferred kinship matrix (explained in detail below), s 2 j is the polygenic variance, and s 2 is the residual error variance. The parameters are u ¼ fb; s 2 j ; s 2 g and the variances are estimated using the restricted maximum likelihood method (Patterson and Thompson 1971) by maximizing the following likelihood function, The estimated genomic heritability (de los Campos et al. 2015) from the markers isĥ 2 ¼ŝ 2 j =ðŝ 2 j þŝ 2 Þ: The best linear unbiased estimates (BLUE) of the fixed effects areb ¼ ðX T V 21 XÞ 21 X T V 21 y and the BLUP of the polygenic effects areĵ ¼ŝ 2 j AV 21 ðy 2 XbÞ: The fitted phenotypic values areŷ ¼ Xb þĵ; which is a conditional prediction (not a marginal prediction). Corresponding to the predicted polygenic effectĵ ¼ŷ 2 Xb; we now define j ¼ y 2 Xb as the "observed" polygenic effect (it is indeed observed becauseb is used). The model goodness of fit (FIT) for the random effects is defined as the squared correlation between j andĵ:

Marker-inferred kinship matrix
The marker-inferred kinship matrix A is calculated from all markers of the genome using the following equation, where m is the total number of markers, a ¼ n 21 trð P m k¼1 Z k Z T k Þ is a normalization factor to make the diagonal elements of matrix A as close to unity as possible, and Z k is an n · 1 vector of genotype indicator variables for all individuals at marker k. For individual j, the numerical code for a genotype is where A 1 A 1 ; A 1 A 2 , and A 2 A 2 are the three genotypes of the marker. People often standardize the Z k vectors before using them to calculate the kinship matrix (see VanRaden 2008).

Cross validation
For a K-fold CV, we randomly partitioned the sample into K parts of roughly equal size. We then used K 2 1 parts to predict the remaining part. Let y ¼ Â y T k y T 2k Ã T be the vector of phenotypic values that are partitioned into y T k and y T 2k ; where y T k is a vector of phenotypic values of all observations in the kth part (test sample) and y T 2k is a vector of phenotypic values for all individuals excluding observations in the kth part (training sample). Corresponding to this partitioning of the sample, we have and The predicted phenotypic values in the test sample are Let j k ¼ y k 2 X kb2k be the "observed" polygenic effect (phenotypes of the test sample adjusted by the fixed effects or centered phenotypes) and be the predicted polygenic effect for the test sample. After all parts of the sample are predicted, we calculate the PRESS using The predictability is defined as where Figure 1 Comparison of predictabilities of the HAT and CV methods for seven agronomy traits in inbred rice. The first seven panels are the predictabilities of the HAT (blue) and CV (red) for the seven traits. The last panel shows the average predictabilities and the 95% confidence band of the HAT method for KGW from 100 random partitionings of the sample.

SS
is the total sum of squares of y adjusted by the fixed effects.

The HAT method
With the HAT method, we first defined the adjusted or centered phenotypic vector by the fixed effects j ¼ y 2 Xb as the "observed" j and defineĵ ¼ŷ 2 Xb as the predicted j; whereb is estimated from the whole sample. We then used the whole sample to predict the polygenic effectsĵ ¼ŝ 2 j AV 21 ðy 2 XbÞ ¼ŝ 2 j AV 21 j: Comparing the second form of the above equation (ĵ ¼ŝ 2 j AV 21 j) with the fixed model HAT function,ŷ ¼ Hy; we realized that j AV 21 is the HAT matrix of the random effects. Substituting V 21 by ðAs 2 j þ Is 2 Þ 21 and after a few steps of algebraic derivation leads to where l ¼ s 2 =s 2 j is the variance ratio. With eigen-decomposition for the A matrix, we have A ¼ UDU T ; A 21 ¼ UD 21 U T and UU T ¼ U T U ¼ I: Therefore, This expression (the second form) is exactly the one defined by Hastie et al. (2008) for the smoothing spline analysis given in Equation 3, where their X is replaced by the eigenvector U, their Q is replaced by the inverse of eigenvalue matrix D 21 (diagonal), and their smoothing parameter is our variance ratio. The HAT matrix is easy to compute because ðI þ lD 21 Þ 21 is diagonal. When some eigenvalues are zero, D 21 does not exist (very often), we reformulate it by DðD þ lIÞ 21 : Therefore, where d j is the jth eigenvalue of matrix A. Although the HAT method does not need to refit the model for each part predicted, it still needs to partition the sample into K parts if comparison with the traditional CV is of interest. Letê k ¼ j k 2ĵ k be the estimated residual errors for all individuals in the kth part and H R kk be the diagonal block of matrix H R corresponding to all individuals in the kth part. The predicted residual errors for the kth part are e k ¼ ðI2H R kk Þ 21ê k : Proof of this predicted residual error is provided in Appendix B. The PRESS under this random model becomes The predictability is measured by where SS ¼ P K k¼1 ðj k 2 jÞ T ðj k 2 jÞ is the total sum of squares for the centered y (adjusted by the fixed effects). The n-fold HAT approach is a special case where the kth part to be predicted contains only one individual, i.e., H R kk ¼ h R jj for k ¼ j: Therefore, the leave-one-out version of the PRESS is

PRESS
This predictability is roughly equal to the squared correlation between the fixed-effect-adjusted phenotypes and the predicted polygenic effects. The ERESS is ERESS ¼ P n j¼1ê 2 j and the usual R-square reported in regression analysis is R 2 ¼ 1 2 ERESS=SS; which is a measurement of model FIT, not predictability.
Generalized cross validation GCV (Golab et al. 1979) is an alternative method to correct the deflated residual error variance. The GCV-calculated residual error sum of squares is called generalized residual error sum of squares (GRESS), which is defined by whereĵ is the predicted polygenic effect from the whole sample. It is equivalent to dividing each estimated residual error by the average ð1 2 h jj Þ across all observations. Therefore, an intuitive expression of the above equation is where h ¼ P n j¼1 h jj =n is the average leverage value across all observations andê j ¼ j j 2ĵ j : The predictability is defined as Golab et al. (1979) stated that GRESS is a rotation-invariant PRESS. It is not intuitive to interpret GRESS and therefore we prefer to report PRESS and thus R 2 HAT :

Data availability
All data analyzed in this study have been previously published. Sources of these data are provided by the references cited in the text. The Framingham Heart Study Data were downloaded from NCBI dbGaP with an IRB number HS -11-159. The rice data along with the R codes are provided in Supplemental Files S1, S2, S3, S4, S5 and S6. Description of the supplemental files can be found in File S7.

Properties of the HAT method
Properties of the HAT method will be demonstrated using an experimental rice population consisting of 210 recombinant inbred lines (Yu et al. 2011). These lines were derived from the cross of two rice varieties. A total of 270,820 SNPs were used to infer breakpoints of the genome for each line, resulting in a total of 1619 bins. A bin is a haplotype block within which there are no breakpoints across the entire population. In the original analysis of Yu et al. (2011), each bin was treated as a genetic marker. In this study, we used all the 1619 bins to infer a 210 · 210 kinship matrix. The matrix represents the genetic relationships of the lines and is used to model the covariance structure of the polygene. The population size is reasonably small and enabled us to compare the HAT method with CV in great detail. Seven agronomic and 1000 metabolomic traits were included in the analysis. The agronomic traits are yield per plant (YD), tiller number per plant (TP), grain number per panicle (GN), 1000-grain weight (KGW), grain length (GL), grain width (GW), and heading day (HD). The first four traits (YD, TP, GN, and KGW) were field evaluated four times (two locations in 2 yr), and GL and GW were replicated twice (two different years), and HD was replicated three times (three different years). The phenotypic value of each trait for each line is the average of the replicates. The 1000 metabolites were measured from seeds (317) and leaves (683) with two biological replications (Gong et al. 2013). The phenotypic values of the metabolites are the average expression levels of the two replicates after log2 transformation.
Predictability of the HAT method was compared with that of the CV method starting at twofold and ending at n-fold incremented by one, as shown in Figure 1 for the seven agronomic traits. The two methods produced very similar values of R-squares, with a slight upward bias for the HAT method due to the use of l estimated from the whole sample. The biases are quite small for high predictability traits, e.g., KGW and GL. They appear to be large for low predictability traits such as YD and HD. However, this is partly due to the small scale of the y-axis (a visual effect). For example, the predictabilities of HAT and CV for trait KGW are 0.7564 and 0.7534, respectively, and the corresponding predictabilities for trait HD are 0.0774 and 0.0653. Figure 1 also shows that when the numbers of folds are small, the predictabilities vary wildly and the variation progressively reaches zero at n-fold. The variation is caused by the ways that the folds are partitioned within the sample. Therefore, when a low number of folds are used in CV, it is necessary to repeat the CV a few times to reduce this variation. Although multiple CV will cause extra computational time, the HAT method can easily evaluate this variation.
Since computing the HAT method is sufficiently fast, we were able to perform random partitioning of the sample 100 times within a few minutes for all folds running from 2 to n. The last panel of Figure  1 shows the mean and 95% confidence band for the replicated HAT predictability for trait KGW. The average predictability reaches a plateau at 10-fold, but the 95% band is still very wide. This result did not support the claim that the LOOCV seriously biased the predictability compared with K-fold CV (Hastie et al. 2008). Figure 2A shows the plot of predictability from n-fold CV against that from HAT for the seven agronomic traits of rice. The differences between the two methods are visually indistinguishable. We then compared the two methods for the 1000 metabolomic traits with n-fold CV. The CV method took a few days to complete the n-fold CV but the HAT method, again, took no more than a few minutes. The corresponding plots for the 1000 metabolomic traits are shown in Figure  2B. Except for three outliers, all points fall on the diagonal line. The three outliers show that the HAT prediction is overoptimistic.

Genomic hybrid prediction in rice
We used a hybrid population of rice (Huang et al. 2015) to demonstrate the application of the HAT method to genomic hybrid breeding. The population consists of 1495 hybrid rice with 10 agronomic traits measured in two locations in China (Hangzhou and Sanya). The 10 traits are YD, panicle number (PN), GN, seed setting rate (SSR), KGW, HD, plant height (PH), panicle length (PL), GL, and GW. The phenotypic value of each hybrid is the average of the two locations. We used 1.6 million SNPs to infer the kinship matrix and then performed predictions using both the HAT and CV methods. Although the n-fold sample partitioning can be easily accomplished with the HAT method, it would be too costly to do it with the CV method. Therefore, we compared the two methods under the 10-fold CV. We replicated the experiment 20 times per 10-fold CV to reduce the variation caused by random partitioning of the sample. The average of the 20 replicates presents the predictability for each method. The two replications of hybrid rice experiments allowed us to estimate trait heritability of the hybrid population using the traditional ANOVA method (Table 1). We partitioned the phenotypic variance into variance due to hybrids (genotypes) and variance due to residual error with systematic difference between the two locations excluded from the phenotype.
First, we compared the predictability of the CV method with the HAT method. Figure 2C shows the plot of the CV-generated predictability against the HAT-generated predictability. All the 10 points (one point per trait) fall on the diagonal line, indicating very good agreement between the two methods. We then compared the trait heritability (H2) from the two replicated environments with the predictability drawn from 10-fold CV, the predictability obtained from the HAT method (HAT), and the FIT. The plots are illustrated in Figure 2D. The R 2 of HAT and CV are the same (the red circles overlap with the blue circles). Both HAT and CV fall around the diagonal line with some upward biases compared to H2. The FIT are severely biased upwards and are not good representatives of H2 at all. Figure 3 shows a side-by-side comparison of H2 (trait heritability), R 2 of HAT, CV, and FIT for all 10 traits, where FIT is equivalent to genomic heritability (de los Campos et al. 2015). Different traits have very different H2, ranging from 0.08 (YD) to 0.92 (GL). The difference between HAT and CV is virtually zero across all traits and both are higher than H2 for the majority of the traits. For the three highly heritable traits (KGW, GL, and GW), the H2 is higher than or equal to HAT and CV. Interestingly, HAT and CV are substantially higher than H2 for HD.

Prediction of human height
We analyzed human height of 6161 subjects from the Framingham heart study (Dawber et al. 1951(Dawber et al. , 1963 with 0.5 million SNPs using the mixed model methodology incorporating the marker-inferred kinship matrix. The model included effects of generation (two levels) and gender (male and female) as fixed effects. The estimated polygenic and residual variances areŝ 2 j ¼ 9:2375 andŝ 2 ¼ 1:2617; respectively, yielding al ¼ 0:1365897 and an estimated genomic heritability of h 2 ¼ 0:8798: This genomic heritability is close to the reported gender average heritability of human height (0.75-0.88) (Silventoinen et al. 2003). The 10-fold CV and the HAT method gave predictabilities of 0:306360:0079 and 0:315160:0037; respectively. Note that the predictabilities are the averages of 20 replicated random partitions and thus there are small SEs associated with the average values. The predictability obtained from the leave-one-out HAT method is 0.3278, slightly higher than the 10-fold partitioning approach.
GCV and optimization of l Before we perform the following analysis, it is worthwhile to refresh our mind that the HAT method will slightly overestimate the predictability because of the approximation nature. We first used the human height trait as an example to demonstrate the difference between the HAT method and the GCV method. The REML estimate of the variance ratio isl ¼ 0:1366 and the corresponding predictability from the n-fold HAT method is R 2 HAT ¼ 0:3278: This REML estimate generates a GCV predictability of R 2 GCV ¼ 0:3536; different from that of the HAT method. We now treated l as a tuning parameter to maximize the predictability, as done by Mathew et al. (2015) in GCV for estimating breeding values. Using a grid search around the REML-estimated value (l ¼ 0:1366), we found that the maximum achievable predictability for the HAT method is R 2 MAX ¼ 0:3310 when l ¼ 0:218; leading to a gain of 0:3310 2 0:3278 ¼ 0:0032; which represents a ð0:3310 2 0:3278Þ=0:3278 1% gain in predictability. Although this gain is negligible, it demonstrates that the REML-estimated parameter does not give the maximum predictability. The good news is thatl is almost optimal, at least in this example. The corresponding maximum achievable predictability in GCV is R 2 MAX ¼ 0:3539 when l ¼ 0:158; leading to a gain of 0:3539 2 0:3536 ¼ 0:0003: Figure 4 shows the predictability profiles aroundl ¼ 0:1366: By tuning the parameter, the gain in predictability of the HAT method ( Figure 4A) is visible but the gain of the GCV method ( Figure 4B) is not recognizable.
To further compare the predictabilities of the HAT and GCV methods with their maximum achievable predictabilities, we used the "Brent" method of the "opim()" function in R to search for the optimal tuning parameter (l) for all 1000 metabolomic traits in the inbred rice population (210 lines). These optimal values of l may be called the maximum predictability estimates (MPE). Figure 5 illustrates the comparisons of predictabilities across all 1000 traits, where more than a dozen traits show visible gains in predictability by tuning the parameter around the REML-estimated value for the HAT method ( Figure  5A). Similar comparison is shown in Figure 5B for the GCV method where tuning the parameter achieves more than 20 visible gains in predictability. Figure 5C compares the predictabilities of GCV and HAT when the tuning parameter is fixed at the REMLestimated value. The two methods provided very similar predictabilities for all 1000 traits except a half dozen traits with visible differences. All three comparisons shown in Figure 5 have fitted R-squares at 0.9995 and the regression coefficients are not significantly different from one (P . 0.05) except C, where the regression coefficient is significantly . 1 (P , 0.05).

DISCUSSION
Very recently, Gianola and Schon (2016) published methods that are very similar to our HAT method to evaluate the predictability of a genomic selection model. They also recognized the approximation nature of the method when the smoothing parameter l is replaced by the estimated value from the whole sample. Their justification of the use of this whole sample-estimated parameter, particularly in LOOCV, is that the estimated l from the whole sample will not be much different from the ones obtained from the training samples that differ from the whole sample by just one observation. They actually investigated the variation of l across all training samples and found that the variance is indeed small. Gianola and Schon (2016) investigated the properties of the new methods in many different situations using an inbred population of wheat (n = 599) to see how the predictability changes when the training and test sample size ratio changes. These exhaustive investigations would take months or years to complete if the ordinary CV were carried out. In addition to BLUP, these authors also extended the method to RKHS (Gianola et al. 2006) and the Bayesian alphabetic series (Gianola 2013) by modifying the importance sampling schemes.
One important issue that was not addressed in Gianola and Schon (2016) is how much difference in predictability calculated between the fast method and the classical CV method can be expected. This question is fundamental because the new method represents a significant technical improvement in genomic selection and will be adopted widely soon after the GS community recognizes it. In our study, we particularly focused on this question and investigated the difference using seven traits from an inbred population of rice, 1000 metabolomic traits from the same inbred population, 10 traits from a hybrid population of rice, and one trait (human height) from a large human population. We found that the HAT method always provides a slightly biased predictability over that of the CV method. However, the bias is never sufficiently severe to distort the conclusion on the predictability of a model. For example, in the human height prediction, the 10-fold CV produced a predictability of 0:3063 and the corresponding number from the 10-fold HAT method was 0:3151: However, the model FIT is 0.8789. The HAT method gave a number much closer to the CV predictability than the model FIT.
In addition to comparing the differences between the HAT method and the ordinary CV, we also compared the new HAT method with the GCV method (Golab et al. 1979) and found that the two produced very similar results. Craven and Wahba (1979) compared GCV with CV and concluded that the smoothing parameter that maximizes the CV was amazingly close to the parameter that maximizes GCV. The GCV method has been available for almost four decades, but the genomic selection community, except Mathew et al. (2015), has never paid attention to it. Our study showed that both GCV and HAT can be applied to genomic selection. However, the HAT method directly addresses prediction of future individuals and therefore it is more intuitive to interpret the result. Hastie et al. (2008) claimed that LOOCV provides a biased prediction compared with CV with lower number of folds. We observed that when the number of folds is 10 or above, the predictability stabilizes (Figure 1, last panel). We did not observe a progressive increase of the predictability as the number of folds increases. Therefore, from our study, we recommend to perform LOOCV with the HAT method to avoid variation caused by random partitioning of the samples when the number of folds is small. When 10-fold or fivefold CV is carried out, the analysis will only be conducted 10 or 5 times, which may not be significant; therefore, the HAT method may lose its appeal. This statement may not be true considering the fact that the 10-fold CV must be run many times to reduce the variation caused by random partitioning of the samples. A multiple CV analysis for large samples is a significant burden to investigators. Therefore, the HAT method is a good alternative to CV to evaluate a genomic selection program. Figure 4 Tuning parameter (l ¼ s 2 =s 2 j ) that maximizes the genomic predictability (R 2 ) of human height. (A) predictability profile of the HAT method, where the red point represents the predictability (R 2 HAT ¼ 0:3278) when the tuning parameter takes the REML estimate (l ¼ 0:1366) and the blue point represents the maximum achievable predictability (R 2 HAT ¼ 0:3310) when the tuning parameter is l ¼ 0:2180: (B) predictability profile of the GCV method, where the red point represents the predictability (R 2 GCV ¼ 0:3536) when the tuning parameter takes the REML estimate (l ¼ 0:1366) and the blue point represents the maximum predictability (R 2 GCV ¼ 0:3539) when the tuning parameter is l ¼ 0:1580:GCV, generalized cross-validation; REML, restricted maximum likelihood.
We originally hoped to see a significant improvement in predictability by tuning the smoothing parameter around the REMLestimated parameter. It is disappointing that there were very few significant improvements from predictions of 1000 traits. The largest improvement occurred for the 422th metabolite with an improvement of ð0:6683 2 0:5615Þ=0:5615 ¼ 0:1902048 19% (see the red point most deviating away from the diagonal line in Figure 5A). The good news is that, in most cases, the REML estimate is close to the MPE and, therefore, the parameter does not need to be tuned. On the other hand, since the computation is simple and fast, why not go ahead to tune the parameter and, if lucky, we may get an improved predictability, like the 422th metabolite in the inbred rice population.
In mixed model prediction, the random effects are often the targets for prediction. This is the case in genomic prediction because the genetic values are treated as random effects. However, if the investigators are interested in prediction using the fixed effects only under the mixed model, the estimated marginal residual error needs to be adjusted by the leverage values from the fixed model hat matrix H F ¼ XðX T V 21 XÞ 21 X T V 21 (Schabenberger 2004). Let e k ¼ y k 2 X kb be the estimated marginal residual errors for individuals in the kth fold, the predicted marginal residual errors are approximated by e k ¼ ðI2H F kk Þ 21ê k ; where H F kk is the diagonal block of H F corresponding to observations in the kth fold. The MIXED procedure in SAS calls this method the noniterative influence diagnostics while the iterative influence diagnostics is through actual CV (refit model and reestimate covariance parameters). The noniterative and iterative influence diagnostics can be interpreted as the HAT method and the CV method, respectively. PROC MIXED does not provide influence diagnostics for prediction of random effects. If there is an interest in both the fixed and random effects for prediction, the HAT matrix should include both the fixed model part and the random model part of the HAT matrix, The estimated conditional residual errors areê k ¼ y k 2 X kb 2ĵ k and the predicted conditional residual errors are obtained by where H M kk is the diagonal block of H M corresponding to observations in the kth fold. When the mixed model includes multiple covariance structures, say S covariance structures, a similar H R matrix is used except that the s 2 j A and V matrices in H R are replaced by G ¼ P S s¼1 A s s 2 s and V ¼ P S s¼1 A s s 2 s þ Is 2 ; respectively, where A s is the sth covariance structure and s 2 s is the corresponding variance. An example of the multiple variance component model is the model with nonadditive variances that include dominance and epistasis (Xu 2013). Gianola and Schon (2016) also extended the new method to handle multiple kernels.
The HAT method applies to fixed models (exact result) and linear mixed models (approximate result). Is it possible to extend the HAT method to LASSO and PLS (partial lest squares)? An approximate extension may be possible by fixing the shrinkage parameter, like the extension to BLUP, but there is no exact extension. To carry out that approximate extension, we need to find the HAT function of the predicted y on the observed y, e.g.,ŷ ¼ H LASSO y andŷ ¼ H PLS y: In general, the HAT matrix is H ¼ @ŷ=@y (Schabenberger 2004), a Jacobian matrix holding each derivative of a predicted quantity with respect to an observed response.

ACKNOWLEDGMENTS
The author thanks two anonymous reviewers and the associate editor for their constructive comments and suggestions on the first version of the manuscript. The author is also grateful to Yanru Cui (postdoc) and Yang Xu (student) for their help in calculating the marker-inferred kinship matrices for the hybrid rice and human populations. The project was supported by a National Science Foundation Collaborative Research grant (DBI-1458515) to S.X. Figure 5 Comparison of predictability of REMLestimated l with the maximum achievable predictability by tuning l for 1000 metabolomic traits of rice. (A) Maximum achievable predictability of the HAT method by tuning l plotted against the predictability when l takes the REML estimate. (B) Maximum achievable predictability of the GCV method by tuning l plotted against the predictability when l takes the REML estimate. (C) Predictability of the GCV method plotted against the predictability of the HAT method when l takes the REML estimate. The red points indicate traits with visible differences in predictability between the method shown on the x-axis and the method shown on the y-axis. The linear regression equation is given at the bottom of each panel and the fitted r 2 of the regression is given at the top of each panel. GCV, generalized cross-validation; REML, restricted maximum likelihood.
We can see that the predicted residual errors are a linear function of the estimated residual errors. The next step is to simplify the linear function, Therefore, the predicted residual errors have been expressed as a simple linear function of the estimated residual errors, The predicted residual sum of squares (PRESS) is Let us define The PRESS is written as The PRESS is often translated into R-square to represent the predictability of a model, where SST is the total sum of squares of the response variable.

Proof of the HAT Method for PRESS in Mixed Models
Estimated random effects Let us define as the phenotypic values of the trait adjusted by the fixed effects, assuming that b is known. The estimated random effects are more appropriately called the fitted random effects. Let us define the estimated vector of random effects bỹ r ¼ KðK þ lIÞ 21 r ¼ Hr where H ¼ KðK þ lIÞ 21 is the HAT matrix, l ¼ s 2 =s 2 j is the variance ratio and K is the kinship matrix. In the main text, we used A in place of K. Here we used K again to be consistent with the genomic selection literature. Let us defineê j ¼ r j 2r j as the estimated residual error for the jth observation or jth block of observations. The predicted residual error for the jth block of individuals is