Imputation of Unordered Markers and the Impact on Genomic Selection Accuracy

Genomic selection, a breeding method that promises to accelerate rates of genetic gain, requires dense, genome-wide marker data. Genotyping-by-sequencing can generate a large number of de novo markers. However, without a reference genome, these markers are unordered and typically have a large proportion of missing data. Because marker imputation algorithms were developed for species with a reference genome, algorithms suited for unordered markers have not been rigorously evaluated. Using four empirical datasets, we evaluate and characterize four such imputation methods, referred to as k-nearest neighbors, singular value decomposition, random forest regression, and expectation maximization imputation, in terms of their imputation accuracies and the factors affecting accuracy. The effect of imputation method on the genomic selection accuracy is assessed in comparison with mean imputation. The effect of excluding markers with a large proportion of missing data on the genomic selection accuracy is also examined. Our results show that imputation of unordered markers can be accurate, especially when linkage disequilibrium between markers is high and genotyped individuals are related. Of the methods evaluated, random forest regression imputation produced superior accuracy. In comparison with mean imputation, all four imputation methods we evaluated led to greater genomic selection accuracies when the level of missing data was high. Including rather than excluding markers with a large proportion of missing data nearly always led to greater GS accuracies. We conclude that high levels of missing data in dense marker sets is not a major obstacle for genomic selection, even when marker order is not known.

some loci to have zero coverage in some individuals (Elshire et al. 2011). The proportion of missing data depends on the sequencing depth and library complexity. Greater sequencing depth leads to a smaller proportion of missing data but increases genotyping cost. Less-complex libraries, on the other hand, will have less missing data but a fewer markers. To generate a large number of markers at low cost, low sequencing depth is commonly used, leading to a large proportion of missing data points. Most analyses require a complete dataset; therefore, marker imputation is a necessary step before GBS data can be used for most purposes.
Imputation has been shown to increase power in association mapping studies (Marchini et al. 2007;Marchini and Howie 2010) and, for GS, imputation can enable the use of low-density genotyping without a major loss in accuracy because a closely related reference panel genotyped at high density can be used to impute markers not present in the low-density marker panel. (Habier et al. 2009;Weigel et al. 2010;Dassonneville et al. 2011;Mulder et al. 2012). Although several highly accurate and widely used imputation algorithms have been developed to assign allelic states of missing values in genotype data (reviewed by Pei et al. 2008 andMarchini et al. 2010), these algorithms were designed for human genetic data, and they require that the order of the markers be known because they are based on constructing haplotypes. For species lacking a reference genome and complete reference linkage map such as wheat, Triticum aestivum L., the majority of markers typed on a given population are unordered, and current genotype imputation methods cannot be used. Although for biparental populations linkage maps can be constructed, breeding populations for GS are derived from multiple parents and not well structured for developing genetic maps. Thus, alternative imputation strategies that are map-independent are necessary when GBS is used for species lacking a reference genome sequence and for populations unsuitable for linkage map construction. There are many general imputation methods that do not require any prior information about the variables to be imputed. Although these methods are used across many disciplines, they have not been tested for imputation accuracy of genome-wide marker data. It is also not known how imputation with a general and potentially less accurate method before GS model training will affect the GS model accuracy. However, we expect these imputation methods to improve the GS accuracy because during the imputation step, genotypic information from both the training and selection sets is used to estimate missing values. Thus, the validation set helps improve imputation of the training set and vice versa.
The objective of this study was to evaluate imputation strategies that do not require previous information about the order of the markers. The imputation methods compared were: mean imputation (MNI), k-nearest neighbors imputation (kNNI) (Troyanskaya et al. 2001), singular value decomposition imputation (SVDI) (Troyanskaya et al. 2001), expectation maximization imputation (EMI) (Dempster et al. 1977), and random forest regression imputation (RFI) (Stekhoven and Bühlmann 2011). By using array-based genotypic datasets with varying levels of simulated missing data, we compared these methods in terms of their imputation accuracy, computation time, and impact on GS prediction accuracy. The factors affecting imputation accuracy for each method at the marker genotype and individual genotype level were also examined. Finally, we determined whether excluding rather than including markers with high levels of missing data could lead to higher accuracy.

Original datasets
We used five different datasets consisting of genome-wide markers and breeding value estimates. These datasets are referred to as winter wheat (WW), spring wheat (SW), drought tolerant maize (DTM), North American barley (NAB), and stem rust resistant wheat (SRRW). The WW data (Supporting Information, File S2) consists of 374 elite inbred individuals originating from the Cornell winter wheat breeding program. The markers consisted of 1158 polymorphic diversity array technology (DArT) (Akbari et al. 2006) markers coded as "21" and "1." For a more detailed description of this dataset, refer to Heffner et al. (2011). The traits used for the evaluation of crossvalidated GS accuracies for WW were grain yield, height (HT), protein, and days to heading. The SW data are a historical dataset consisting of 599 elite inbred spring wheat lines originating from the International Maize and Wheat Improvement Center (CIMMYT) wheat breeding program. The markers consist of 1279 polymorphic DArT markers coded as "0" and "1" and the trait used for the evaluation of cross-validated GS accuracies was grain yield in CIMMYT megaenvironment 1. The DTM data consists of 264 tropical CIMMYT maize lines. The trait used to calculate cross-validated GS model accuracies for DTM was grain yield. The marker data consists of 1135 single-nucleotide polymorphisms coded as "21", "0", and "1." For more details about the SW and DTM datasets, or to access these datasets, refer to Crossa et al. (2010) The SRRW dataset consists of 360 recent, elite CIMMYT spring wheat lines that have been selected for quantitative resistance to stem rust caused by Puccinia graminis f.sp. tritici. The markers consist of more than 130,000 GBS polymorphisms. Three different versions of the SRRW GBS data, described in Table 1, were created based on different per-marker percent missing data thresholds. For the first version referred to as SRRW version NA20 (File S3), markers were excluded if they had more than 20% missing values, which resulted in 2014 total markers. For the second set and third sets, referred to as SRRW versions NA50 and NA70 (File S4 and File S5), markers were excluded if they had more than 50% and 70% missing data, respectively, and then 2014 markers were randomly selected. The percent of the data points that were missing in the original WW, SW, DTM, and NAB datasets was between 0.2 and 3%. This low level of pre-existing missing data was assumed to have a negligible effect on the imputation and GS accuracies and for these datasets the original marker data are referred to as version NA0.

Calculation of LD between marker pairs
For the original WW, SW, DTM, and NAB datasets, LD between all marker pairs was measured using the r 2 statistic, where r 2 between two markers was calculated using the formula: where D = x 11 -p 1 p 2 ; x 11 is the probability of observing the combination of allele 1 at marker j and allele 1 at marker l, p 1 is the probability of allele 1 at marker j, q 1 is the probability of allele 2 at marker j, p 2 is the probability of allele 1 at marker l, and q 2 is the probability of allele 2 at marker l. A maximum likelihood estimate of x 11 was obtained using an expectation maximum approach reviewed by Foulkes (2009). All calculations of the r 2 statistic were implemented in the R package "genetics" (Warnes et al. 2011).

Missing data simulation
For each of the WW, SW, DTM, and NAB datasets three versions of the genotypic data, summarized in Table 1, were created with different levels of simulated missing data. In each of the versions: NA20, NA50, and NA70, missing values were introduced at random but the maximum percent missing data at a given marker was set to 20%, 50%, and 70% respectively. Examples of the simulated markers sets are illustrated in the Figure S1. A total of 10 replicates of each simulated dataset were created, and the mean percent of total data points that are missing across the 10 replicates is shown in Table 1. The distribution of per-marker percent missing values from the SRRW data versions NA20, NA50, and NA70 were used to assign the percent missing at each marker for each of the WW, SW, DTM, and NAB datasets to produce versions NA20, NA50, and NA70, respectively. Across all the missing data versions of all the datasets, the percent missing per marker distribution had a long left tail and a large concentration of values near the threshold level.

Imputation methods
In all cases, the genotypic data were considered continuous variables. The methods MNI, kNNI, SVDI, EMI, and RFI were used to impute the simulated missing values. For all methods the input was an m x n genotype matrix M with m individuals and n markers. For MNI, each missing data-point x ij at a given marker j was replaced with the mean of the non-missing values at that marker.
For kNNI (Troyanskaya et al. 2001), the data points were imputed by replacing them with the weighted average of the data points at the k closest markers. Euclidean distance was used as the measure of marker distance. Euclidean distance between marker genotype vectors q ! and v ! of length m was defined as: In detail, (1) missing values were first replaced using MNI and the Euclidean distance between all of possible pairs of marker vectors was computed. Each marker was included in the marker matrix twice, both in its original and flipped state, to ensure that markers in negative LD would not be considered distant to the marker of interest.
(2) For each marker j, markers were sorted based on Euclidean distance to marker j.
(3) For each row i of marker j the weighted average of the k closest markers with nonmissing values at row i were used as an estimate of marker data point x ij . The weight of each marker was 1/d 2 were d is the Euclidean distance between marker j and the marker to be weighted. kNNI makes no assumptions about the distribution of the data.
For SVDI (Troyanskaya et al. 2001), a singular value decomposition of genotype matrix M was used to obtain a set of the k most significant Eigen-vectors of the markers. These k Eigen-vectors were then used as the predictors for linear regression estimation of the missing data points. SVDI was implemented in R (R Development Core Team 2011) using the package "bcv" (Perry 2009). The genotype matrix M can be described as: Where U has dimensions m · k, V has dimensions n · k, and S is a k · k diagonal matrix. U contains the left singular vectors with are equivalent to the Eigen-vectors of the markers. The corresponding singular values are in the diagonal elements of S. The singular values are equivalent to the square root of the Eigen-values. The k most significant Eigen-vectors of the markers were those with the k largest Eigen-values. The imputation procedure is described as follows: (1) Missing values were originally imputed using MNI.
(2) Singular value decomposition was used to estimate the k most significant Eigen-vectors of the markers:Û.
(3) For each marker j, linear regression coefficients of each column ofÛ were estimated by the multiple linear regression equation: Where Y is a column vector for marker j,Û is an m · k matrix of k Eigenvectors, b is a vector of regression coefficients and e is a random error term. Only individuals with nonmissing values in Y were used to estimate b. (4)Û and the estimates of the regression coefficients,b, were used to estimate the missing values at marker j. (5) Using the current version of the genotype matrix, we repeated steps two through four for a total of 10 iterations, sufficient to meet the convergence criteria, which was: RSS is the residual sum of squares between the nonmissing values and their SVDI model approximation. RSS 0 and RSS 1 are the RSS values of successive iterations. SVDI assumes that the genotype matrix is multivariate normal distributed. For the optimal k value calculation methods and results for both kNNI and SVDI, see File S1.
Optimal k values are listed in Table S1. For EMI, the nonmissing marker data were used to obtain maximum likelihood estimates of the vector of means,û, and covariance matrixX of the individuals based on the markers. These estimates were then used to obtain multiple linear regression estimates of the missing marker values.û, andX were then re-estimated and were used to re-estimate the missing marker values. This process was repeated until the difference between the new estimate and the previous estimate ofû þX ÃX T was 0.02 or less. EMI was implemented using the R package rrBLUP (Endelman 2011). For a more detailed description of this EMI algorithm, refer to . For a more through description of the EM imputation algorithm in general, refer to Dempster et al. (1977). For RFI, missing marker values were estimated using random forest regression (Breiman 2001) using all available data to predict the missing values for every marker. RFI was implemented in R using the package "MissForest" (Stekhoven and Bühlmann 2011). The imputation procedure was: (1) for marker matrix M, markers were sorted from lowest to highest percent missing and missing values were imputed using MNI. (2) At each marker j containing missing values, the nonmissing values, Y, were used to grow 100 random forest regression trees Q1. . . Q100. Each tree was grown using a bootstrapped sample of individuals Y and a random sample of ffiffiffiffiffiffiffiffiffiffiffiffiffi n 2 1 p marker predictors were used where n-1 is the number of markers excluding marker j. Each tree Q contains the terminal node values and a set of instructions for recursively partitioning the observations into the terminal nodes: these instructions include the split variables at each node, and the value of the split variable used for partitioning. (3) Missing values at marker j were imputed as: where x is an input vector.
(4) Marker j was then updated in marker matrix M by using theŶ values as the estimate of the missing values. (5) Steps two through four were repeated for each subsequent marker until all markers were imputed. (6) Then, using this imputed matrix, we repeated steps two through five until convergence or for a maximum of 10 iterations. Convergence was declared as soon as the DN increased for the first time where: M 1 and M 0 are the newly imputed and previously imputed marker matrices respectively. If the convergence criterion was met, M 0 was used as the final estimate of M. RFI makes no assumptions about the distribution of the data. The implementation of all imputation methods is demonstrated in File S6.

Imputation accuracy calculations
The per-marker imputation accuracy, R 2 m , was described using the R 2 value between predicted data points and the original data points for a given marker vector or individual vector x of length j. The R 2 was defined as The R 2 m , as well as the imputation R 2 of the individual genotypes, referred to as R 2 i , were calculated. For each dataset and missing data level, average R 2 i and R 2 m across the 10 missing data simulations were also calculated and referred to as R 2 i and R 2 m . To compare with imputation accuracies reported in other publications, for each R 2 m value, the equivalent percent correct was also calculated. Because imputed values were continuous, the percent correct for each marker could not be directly calculated. Instead, for each marker, equivalent percent correct values were determined by simulation using each marker's MAF and R 2 m (see File S1).

Computational time
For the first replicate of simulated missing datasets, whenever a dataset was imputed, the number of seconds required for imputation to be completed using one central processing unit was recorded. All jobs were submitted to the Computational Biology Service Unit at Cornell University, which uses (1)

Assessment of factors affecting imputation accuracy
For each imputation method factors affecting the imputation accuracy were assessed. A marker's minor allele frequency (MAF), number of nonmissing data points, and level of LD with other markers were considered as factors that could impact its imputation accuracy. The distance between an individual and its closest relative and the expected prediction error variance (PEV) were considered as factors affecting the imputation accuracy on an individual genotype basis. The impact of each of these factors was assessed for each imputation method using the WW, SW, DTM, and NAB datasets post imputation. First, the impact of MAF on the imputation accuracy was assessed. For each dataset-imputation method combination, R 2 m was averaged across dataset versions NA20, NA50, and NA70 and this overall estimate of marker imputation accuracy is referred to as R 2 m .The median R 2 m for each value of MAF rounded to the nearest tenth was calculated. The relationship between the median R 2 m and the MAF value was then plotted to characterize the relationship.
The impact of the number of nonmissing data points at a marker on the marker's imputation accuracy was assessed for each datasetimputation method combination using data from all 10 replicates and versions NA20, NA50 and NA70 combined. For each marker, the number of nonmissing data points was rounded to the nearest factor of 5, and for each value the median R 2 m was calculated. To determine the impact of the LD level with other markers on the imputation accuracy, markers were first classified as markers in low LD with all other markers or markers in at least moderate LD with at least one other marker. Markers whose highest r 2 statistic was less than 0.5 were considered to be in low LD with all other markers. A marker that had at least one r 2 statistic greater than or equal to 0.5 was considered to be in at least moderate LD with at least one other marker. The median R 2 m of markers in low LD and of markers in at least moderate LD with at least one other marker was calculated. The ratio of R 2 m for markers in low LD to the R 2 m for markers in at least moderate LD was then examined.
To assess the effect of the genetic distance between an individual and its closest relative on the individual genotype imputation accuracy, the Euclidian distance was calculated for each pair of individuals and the R 2 i of each dataset was measured for each simulated dataset and imputation method combination. The mean R 2 i values across all replicates, R 2 l , were averaged across versions NA20, NA50, and NA70 of a given dataset-imputation method combination to calculate an overall mean R 2 i for each individual which is referred to as R 2 l . The Euclidian distance between each individual and its closest relative, rounded to the nearest whole number was plotted against the median R 2 l to examine the relationship.
The relationship between PEV for the genetic values and the R 2 l was also examined. An individual's PEV is a measure of genetic connectedness to the other individuals (Kennedy and Trus 1993) where an individual's connectedness is determined by the number and strength of the genetic relationships between that individual and the other individuals in the dataset. For example, a low PEV indicates high connectedness and high degree of genetic relationship.
To measure an overall PEV value for each individual, a vector of PEVs was calculated for each marker using the mixed model equations (Searle et al. 1992) implemented in the R package "rrBLUP" (Endelman 2011). The genetic and error variance components were estimated using restricted maximum likelihood, and the genomic relationship matrix, excluding the response variable marker, was used as the covariance matrix between genotypes. The sum of the PEV vectors across all markers was used as the overall PEV vector. Because PEV is a reflection of the number and strength of the genetic relationships between individuals, it is expected to be a useful indicator for how well and individual's missing data can be imputed using all other individuals as a reference.

GS accuracy calculation
All 10 simulations of missing data versions NA20, NA50, and NA70 of the WW, SW, DTM, NAB, and SRRW marker sets were imputed with each of the imputation methods: MNI, kNNI, SVDI, EMI, and RFI. Then, each of the 10 replicates of the marker set-imputation method combinations was used to calculate the 10-fold cross validation GS accuracy for both Ridge-Regression (Whittaker et al. 2000) and Bayesian LASSO (de los Campos et al. 2009), see File S1. GS accuracies are computed as the Pearson's correlation between the phenotype estimated breeding values and the genomic-estimated breeding values. The mean accuracy for each marker set-imputation method-prediction model combination was computed. GS accuracies were also computed using version NA0 of the WW, SW, DTM, and NAB genotypic data.

LD between markers
For each dataset, the LD between marker pairs was quantified using the r 2 statistic. Markers that had at least one other marker associated with r 2 $ 0.5 were considered to be in at least moderate LD with at least one other marker. In the WW, SW, DTM, and NAB datasets, 62%, 74%, 12%, and 69% of the markers had at least one other marker in at least moderate LD, respectively. Comparatively, LD between markers was high in the SW, NAB, and WW datasets and much lower in the DTM dataset.

Imputation accuracy
The imputation accuracy reported as the median R 2 m is shown in Figure 1 for kNNI, SVDI, RFI, and EMI. For all dataset-imputation method combinations, R 2 m values were non-normal, and there were  Table 2. The population with the highest median R 2 m for each of the levels of missing data were the NAB population, whereas the lowest imputation accuracies were observed with the DTM population. As expected, median R 2 m values always decreased as the level of missing data increased. RFI always produced the highest accuracies; kNNI generally produced the second highest accuracies, followed by EMI and SVDI. The rankings were slightly different for the DTM dataset, where RFI was most accurate followed by EMI, SVDI, and kNNI. The rankings of the methods for each dataset according to the median percent correct are the same as those according to the median R 2 m ; however, the median percent correct values could not be compared across datasets because percent correct values are influenced by the MAF which differs among datasets.

Computational time
Large differences in the computational requirements for the imputation methods were observed (Table 3). kNNI, SVDI, and EMI required relatively little computation time on average, while RFI required at least 95x, 760x, and 65x more computation time than kNNI, SVDI, and EMI, respectively. For SVDI and kNNI, the computation time required for determining optimal k values was not included in the estimates of the average computational time because the computation time for optimal k estimation depends on the method used for estimation. The 10-fold cross validation approach that we used to estimate optimal k values for SVDI and kNNI requires approximately 50 runs of the SVDI and kNNI respectively. If the time required to estimate optimal k values for SVDI and kNNI was included in the total computational time, EMI would be the fastest of the four imputation methods.

Factors affecting imputation accuracy
MAF: For all datasets, R 2 m values for markers with MAF , 0.1 were low compared with that of markers with MAF . 0.1; however, the relationship between MAF and R 2 m for markers with MAF . 0.1 was different for each dataset (Figure 2). In general, R 2 m increased as MAF increased as long as MAF , 0.4; however, with the NAB dataset ( Figure  2D) there was no relationship between MAF and R 2 m for MAF . 0.1. Accuracy in terms of percent correct had a strong negative linear relationship with the MAF across all imputation methods and datasets. Markers with lower MAF values tended to have higher percent correct values (data not shown).

Number of nonmissing data points:
With almost all datasetimputation method combinations, as the number of non-missing data points increased, the R 2 m levels increased in a linear fashion ( Figure 3). The strength of this linear relationship was similar for all imputation methods; however, with the DTM dataset, R 2 m for kNNI and SVDI were close to zero regardless of the number of nonmissing data points.

LD between markers:
The ratio of the median imputation R 2 m for markers with no other markers in moderate LD to the median imputation R 2 m for markers with at least one other marker in moderate LD was always less than one (Table 4), indicating that the imputation accuracy for markers without markers in moderate LD was always lower than that for markers that had at least one other marker in moderate LD. Across all datasets, the R 2 m ratios for the two classes of markers was much smaller for kNNI compared to the other imputation methods, indicating that the imputation accuracy of kNNI was more strongly influenced by the level of LD between markers compared to the other methods. With the WW, SW, and NAB datasets the R 2 m ratios for the two classes of markers was similar for SVDI, RFI, and EMI indicating that the accuracy of these three methods is influenced by the level of LD between markers to a similar degree. However, with the DTM dataset, the R 2 m ratio for the two classes of markers was closer to one for SVDI compared to the other methods, indicating that for this dataset, the accuracy with SVDI was less affected by the LD between markers, compared to the other methods.
n Distance from the closest relative and PEV: Regardless of the dataset or the imputation method, the smaller the distance between an individual and its closest relative, the higher the R 2 l (Figure 4). One exception was observed with the DTM dataset, where for kNNI there was no relationship between the distance between an individual and its closest relative and R 2 l . We observed very similar trends between R 2 l and the overall PEV ( Figure S2). As an individual's PEV increased, indicating a decrease in the strength and number of genetic relationships between that individual and all other individuals, its R 2 l decreased in all cases except when the DTM dataset was imputed with kNNI.

Effect of imputation method on GS accuracy
In nearly all cases, GS accuracies did not differ greatly from one imputation method to another, with the exception of MNI, which sometimes led to much lower accuracies compared to all other methods when the NA70 dataset version was used ( Figure 5 and 6). Overall, GS accuracies were least affected by the imputation method for dataset version NA20, and most affected by the imputation method for dataset version NA70. The relative performance of each method in terms of GS accuracy after imputation depended on the dataset, and dataset version; however, RFI consistently performed well across all datasets. For the WW datasets, the relative performance of the imputation methods in terms of GS accuracy was inconsistent across the four traits tested.
For, a given dataset and dataset version, the rank of each method based on R 2 m , was not consistent with the rank based on GS accuracy using RR or BL post-imputation. The rank of the imputation methods, however, was consistent between the two different GS models. We also found that including rather than removing 'sparse' markers, those with large amounts of missing data, nearly always led to higher GS accuracies (methods and results described in File S1, Table S2, and Figure S3), especially when RFI, kNNI, or EMI were the imputation methods used ( Figure S4).

Imputation accuracy
This study found that map-independent imputation methods other than MNI can be surprisingly accurate, especially when LD between markers is high and the genotyped individuals are related. RFI was the most promising method overall because of its consistently high performance in terms of imputation accuracy and subsequent GS accuracy; however, it was the most computationally intensive method evaluated. kNNI, although less accurate than RFI, may be a good alternative to RFI if there are computational limitations to completing the imputation. It is likely that RFI and kNNI produced comparable levels of accuracy because both use a similar model free approach for imputation that involves weighting a selected set of k important variables according to a distance metric (Lin and Jeon 2006). The weighted average of these variables is the predicted value of the variable of interest. For kNNI the distance metric was the Euclidean distance and k was a fixed number across all variables. For RFI, the k important variables and their weights are determined by the splitting scheme of the tree that is determined using the response variable. The The median R 2 m obtained for a given MAF rounded to the nearest tenth is plotted for each dataset: (A) WW; (B) SW; (C) DTM; and (D) NAB. Each color and symbol represents a different imputation method: kNNI, orange triangles; SVDI, purple squares; RFI, red circles; and EMI, blue crosses.
increased accuracy but greater computational burden of the RFI method compared with kNNI is due to its adaptive weighting of variables that takes into account the response variable.
A possible reason that EMI and SVDI were less accurate than RFI and kNNI is that the genotypic datasets that we used may have violated multivariate normality, an underlying assumption for EMI and SVDI. Alternatively, EMI and SVDI may not have been as effective at ignoring uninformative predictors. If true, linear regression based imputation methods involving variable selection could be as accurate as kNNI or RFI. However, due to multicollinearity, attempts to test imputation based on subset selection methods such as stepwise regression were not successful. Regression imputation using variable selection methods which can cope with multicollinearity, such as the least absolute shrinkage and selection operator (Tibshirani 1996), would be interesting to test in future studies.
EMI performed consistently better than SVDI which is likely because EMI incorporates all the marker data as predictors whereas SVDI first used a data reduction step, potentially eliminating useful information. SVDI may have outperformed EMI if the datasets had a greater rate of genotyping error because it is expected to better cope with noisy data (Troyanskaya et al. 2001).
For all methods, average median imputation accuracies on an individual genotype basis R 2 l were not always homogenous across population sub-groups as illustrated in Figure S5, which shows individuals plotted according to the first two principal components of their marker genotypes and color coded according to their n Table 4 Ratios of median R 2 m of markers having no markers in moderate linkage disequilibrium (LD) to that of markers with at least one other marker in moderate LD imputation accuracy. With the DTM and WW datasets, small subgroups of individuals that clustered together according to the first two principal components of marker genotypes tended to have similar ranges of accuracy. However, with the SW and NAB datasets R 2 l was relatively homogenous across population subgroups. An association between R 2 l and population subgroup is undesirable because it may create or worsen an association between GS accuracy and population sub-group. Using large datasets with minimal population structure for imputation and GS is advocated to avoid heterogeneity of imputation and GS accuracies across subgroups of individuals.
Population structure may also lead to increased imputation accuracy for markers with high levels of population subdivision (Iwata and Jannink 2010) because an individual's allelic state can be predicted largely based population sub-group alone. Accuracy levels for datasets with many markers highly subdivided by population may be high largely because of structure; we therefore calculated R 2 m excluding markers with high levels of population subdivision as indicated by their Fst values, where high Fst indicates high population subdivision (for methods, see File S1). For markers with MAF . 0.1, on average, R 2 m excluding markers with the 25% highest Fst values were 0.9, 1.17, 1.02, and 0.9 times those of overall R 2 m for the WW, SW, DTM, and NAB datasets, respectively. Thus, for the WW and NAB datasets, the high imputation accuracies we observed may have been in small part due to population structure.
Comparing our imputation accuracy results with those of other studies is difficult because each study uses different populations of different sizes, levels of missing data, MAF distributions, and levels of LD between markers. In addition, accuracy reported as percent correct cannot be compared across datasets with different MAF distributions. Nevertheless, we assume that map-dependent imputation methods would outperform the map-independent methods that we evaluated (given the availability of an accurate genetic or physical map) because physically linked markers are used to predict missing values. These physically linked markers should be more reliable predictors compared to markers that are in LD but may not be physically linked. As genetic and physical maps develop for wheat and barley the assumption that map-dependent methods would outperform the map-independent methods can be tested.

Factors affecting imputation accuracy
Markers with very low MAF had low R 2 m values. There are two possible explanations for this observation. First, because of the way R 2 m is calculated, a single imputation error has a much larger negative impact on the R 2 m for markers with lower MAF values ( Figure S6). Thus, it is harder to achieve high R 2 m for markers with a low MAF. Second, individuals with the minor allele at a given marker are not well represented, making their marker genotype more difficult to predict. A similar relationship between MAF and R 2 m was also found by studies by Iwata and Jannink (2010) and Li et al. (2011), which used mapdependent imputation methods. Unlike R 2 m , imputation accuracy in terms of percent correct had a negative linear relationship with MAF (data not shown), this is because markers with lower MAF can always be imputed with a reasonably high percent correct based on the marker mean alone. Other studies of map-dependent imputation methods report a negative relationship between MAF and percent correct (Pei et al. 2008;Hickey et al. 2012). The number of nonmissing data points, analogous to reference panel size in other studies was found to positively impact the R 2 m . This finding is consistent with other studies in which researchers tested the effect of reference panel size on the imputation accuracy using mapdependent methods (Pei et al. 2008;Druet et al. 2010;Li et al. 2010). For RFI, EMI, and SVDI, which involve a model training step, fewer missing data points means that more individuals are available for model training. With kNNI, a smaller number of nonmissing data points at a given marker leads to a more accurate estimate of its distance from all other markers. However, with the DTM set there was no trend between accuracy and the number of nonmissing data points with kNNI. This was because accuracy with kNNI for this dataset was very low overall.
The presence of one or more markers in moderate LD (r 2 statistic $ 0.5) was a more important factor for kNNI compared with RFI, EMI, and SVDI because kNNI bases its predictions on a fixed number of close markers, whereas RFI, EMI, and SVDI use information from all markers in the dataset to generate predicted values for the missing data points. The LD between markers on a whole dataset basis also appeared to be an important factor affecting the R 2 m of all methods because accuracies with the DTM dataset, which had low levels of LD between markers overall, were much lower than accuracies with the WW, SW, and NAB datasets. Other publications that have evaluated the effect of LD on imputation accuracy for map-dependent methods have found similar trends (Pei et al. 2008;Hickey et al. 2012).
We found that imputation accuracy on an individual genotype level was negatively correlated with the distance from the closest relative in the dataset, and the PEV, which is an indication of the relationship between an individual and other genotypes. A similar relationship between imputation accuracy and relationship has been found by other studies of map-dependent imputation methods Zhang and Druet 2010;Hickey et al. 2012). It is clear that to ensure effective imputation, the dataset to be imputed should contain related individuals. If the dataset is suited for GS, it is likely that the individuals are already related. However, to increase the chances Figure 5 GS accuracy obtained using ridge regression after imputation. Mean GS accuracies obtained using the dataset versions NA0, NA20, NA50, having up to 0%, 20%, 50%, and 70% missing data per marker, respectively, imputed with either MNI (black stars), kNNI (orange triangles), SVDI (purple squares), EMI (blue crosses) and RFI (red circles) are shown for (A) WW-yield, (B) WW-height, (C) WW-protein, (D) WW-days to heading, (E) DTM, (F) SW, (G) NAB, and (H) SRRW datasets. Each plot has a different y-axis range. Error bars depict SE.
that an individual will have close relatives in the dataset, all available genotypic data for the germplasm pool of interest should be combined before imputation.

GS accuracy
The GS accuracies that we observed may be sufficiently high to lead to increased rates of genetic gain compared to phenotypic selection (PS), depending on the accuracy of PS and the selection cycle duration of both PS and GS. It is important to note that all GS accuracies reported for a given dataset are global estimates across all potential subpopulations. Based on other studies evaluating GS accuracies within and across subpopulations (Zhao et al. 2011, Heslot et al. 2012, Windhausen et al. 2012, this global accuracy estimate may be greater than the accuracy measured within individual subpopulations. Effect of imputation method on the GS accuracy Improved accuracy of GS after application of map independent imputation methods was another important finding of this study. Based on our results, unordered markers with missing data can be included in the dataset to improve accuracy through imputation with RFI, kNNI, EMI, or even SVDI rather than MNI. However, for datasets with low levels of missing data (up to 20% per marker), imputing with MNI is sufficient. Although our results do not support removing markers with high levels of missing data prior to GS, in many datasets markers with low levels of missing data may be sufficient to saturate the genome. With the datasets used in this study, the average number of markers with up to 20% and 50% missing data were 18 to 37 and 99 to 186 respectively, and these reduced marker sets were not sufficient to saturate the genome. Thus, including markers with larger amounts of missing data led to improved GS accuracies. Interestingly, a low median R 2 m was not reflective of the merit of imputation prior to GS. The median R 2 m for the datasets with up to 70% missing data per marker were the lowest of all the missing data levels; however we saw the greatest gain in GS accuracy from kNNI, SVDI, EMI, or RFI relative to MNI with this level of missing data. This was especially apparent for the DTM dataset, which had a median R 2 m near zero for most methods when there was up to 70% missing data per marker. However, RFI on this dataset produced GS model accuracies 1.3 times Figure 6 GS accuracy obtained using Bayesian least absolute shrinkage and selection operator after imputation. Mean GS accuracies obtained using the dataset versions NA0, NA20, and NA50, having up to 0%, 20%, 50%, and 70% missing data per marker respectively, imputed with either MNI (black stars), kNNI (orange triangles), SVDI (purple squares), EMI (blue crosses) and RFI (red circles) are shown for (A) WW-yield, (B) WW-height, (C) WW-protein, (D) WW-days to heading, (E) DTM, (F) SW, (G) NAB, and (H) SRRW datasets. Each plot has a different y-axis range. Error bars depict SE. greater than those achieved when MNI was used before GS. Surprisingly, the most accurate imputation method was not always the method that gave the greatest GS accuracy. This may be caused by nonrandom imputation errors. If some imputation errors are similar for related individuals, these nonrandom errors may able to capture some genetic relationships in the GS model. The idea that the imputation errors may capture some genetic relationships was suggested by a study by Weigel et al. (2010).
This study has important implications for species that lack a reference genome, complete reference map, and predesigned high-throughput genotyping platforms. First, unordered markers can be imputed with high levels of accuracy, and even greater accuracies may result if additional reference genotypes can be added to the dataset prior to imputation. Based on the results of this study, if a large number of marker genotypes are produced (so that markers are in LD with each other), and the population contains individuals with some genetic relationship, missing data can be imputed with reasonable accuracy even if the level of missing data are high; up to 70%. Future work to improve upon and reduce the computational burden of the most promising methods in this study, RFI and kNNI, would be especially useful if these methods are to be used widely. The second implication of this study is that a large proportion of missing data in dense marker sets is not a major concern for GS. As long as the marker density is sufficiently high, the accuracy does not appear to be strongly negatively affected. In cases where missing data does negatively impact the GS accuracy imputation using a method other than MNI before GS model training and validation can help improve the accuracy. Overall, mapindependent imputation shows promise for the feasibility of applying GS, enabled by emergent sequence-based genotyping technologies, to almost any species regardless of the availability of pre-existing genotyping resources.