Selecting sequence variants to improve genomic predictions for dairy cattle

Millions of genetic variants have been identified by population-scale sequencing projects, but subsets of these variants are needed for routine genomic predictions or genotyping arrays. Methods for selecting sequence variants were compared using simulated sequence genotypes and real July 2015 data from the 1000 Bull Genomes Project. Candidate sequence variants for 444 Holstein animals were combined with high-density (HD) imputed genotypes for 26,970 progeny-tested Holstein bulls. Test 1 included single nucleotide polymorphisms (SNPs) for 481,904 candidate sequence variants. Test 2 also included 249,966 insertions-deletions (InDels). After merging sequence variants with 312,614 HD SNPs and editing steps, Tests 1 and 2 included 762,588 and 1,003,453 variants, respectively. Imputation quality from findhap software was assessed with 404 of the sequenced animals in the reference population and 40 randomly chosen animals for validation. Their sequence genotypes were reduced to the subset of genotypes that were in common with HD genotypes and then imputed back to sequence. Predictions were tested for 33 traits using 2015 data of 3983 US validation bulls with daughters that were first phenotyped after August 2011. The average percentage of correctly imputed variants across all chromosomes was 97.2 for Test 1 and 97.0 for Test 2. Total time required to prepare, edit, impute, and estimate the effects of sequence variants for 27,235 bulls was about 1 week using less than 33 threads. Many sequence variants had larger estimated effects than nearby HD SNPs, but prediction reliability improved only by 0.6 percentage points in Test 1 when sequence SNPs were added to HD SNPs and by 0.4 percentage points in Test 2 when sequence SNPs and InDels were included. However, selecting the 16,648 candidate SNPs with the largest estimated effects and adding them to the 60,671 SNPs used in routine evaluations improved reliabilities by 2.7 percentage points. 
Reliabilities for genomic predictions improved when selected sequence variants were added; gains were similar for simulated and real data for the same population, and larger than previous gains obtained by adding HD SNPs. With many genotyped animals, many data sources, and millions of variants, computing strategies must efficiently balance costs of imputation, selection, and prediction to obtain subsets of markers that provide the highest accuracy.

can be targeted to specific traits such as genes related to fertility, thereby slightly improving reliability for daughter pregnancy rate by 0.2 percentage points when 39 single nucleotide polymorphisms (SNPs) were added to the marker set used for genomic prediction [9]. The number of sequenced animals should continue to increase as researchers examine more families and the costs of generating data continue to decrease.
Imputing, selecting, and predicting effects for millions of variants and many thousands of individuals require efficient computation. Computational costs, which are proportional to the number of variants multiplied by the number of individuals, could exceed the marginal benefits from adding more variants. Variants within or near genes should improve the reliability of predictions, and direct use of causal variants is preferred to using linked markers. Strategies to choose variants for inclusion on genotyping arrays of different densities or in routine predictions were developed and compared using simulated data for Holstein bulls. Here, we first examined simulated data and then real sequence genotypes from the 1000 Bull Genomes Project [10].
The goals of this study were to (1) compare the reliability of prediction from sequence, array, and combined data as well as different types of variants, (2) test the methods first on simulated data before applying them to real sequence data imputed for a large reference population, and (3) investigate editing, imputation, and computing strategies that are efficient for even larger genotyped populations.

Simulated sequence data
Our simulation was designed to closely mimic an actual large-scale sequencing project for cattle, in which a subset of ancestor bulls had WGS data, another subset of ancestor bulls had high-density (HD) SNP-array genotypes, and most bulls had medium-density genotypes. Sequence variants were simulated for 26,984 Holstein bulls in the US reference population in December 2014 using a pedigree file that included 112,905 animals, and the sequences were then reduced to mimic the actual available array genotypes. Among these animals, the 1000 bulls that had the most daughters had genotypes observed for 30 million sequence variants, and 773, 24,863 and 343 other bulls that had fewer daughters were genotyped with 600,000 (600 k), 60,000 (60 k) and 12,000 SNPs, respectively. Each simulated chip was an evenly spaced subset of the previous chip and the sequence variants. The 30 million variant sites were randomly located across 30 chromosomes each 100 million bases long, and all variants had two alleles. The genotypes were simulated using genosim software [11], which generates founding chromosomes with linkage disequilibrium (LD) and descendant chromosomes with recombination using the actual pedigree of the bulls. A parameter of 0.9998 was selected to generate average LD similar to that in the real sequence dataset, as in previous tests [12].
Editing reduced the list of variants to 8.4 million by removing SNPs with a minor allele frequency (MAF) lower than 0.01 and by pruning SNPs with LD higher than 0.95 with any retained neighbouring SNP, but all 0.5 million variants that were within or near the 10,000 (10 k) simulated QTL were retained. The QTL were located randomly across the genome, and the 25 variants on either side of each QTL were retained. No actual genes were simulated, only the QTL and other variants. If any of the 350 variants on either side of a specific marker were highly correlated with it (|r| > 0.95), editing based on LD retained one variant and removed all others that had an |r| higher than 0.95 with that variant. The 600 k SNPs were all retained to improve imputation, and the 505,210 SNPs that were within 2500 bases of a true QTL were retained to mimic bioinformatic selection using gene positions. The selected SNPs were imputed for all bulls. Strategies were compared to choose the most significant variants or those with the largest estimated variances or effect sizes for five independent traits, using individual regressions on each variant or multiple regression on all variants.
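The LD-based pruning described above can be sketched as a greedy scan in map order; the function name and the 0/1/2 allele-count coding are illustrative assumptions, not the actual editing software:

```python
import numpy as np

def prune_by_ld(genotypes, keep_mask, window=350, r_max=0.95):
    """Greedy pruning sketch: scan variants in map order and drop any
    variant whose |r| with an already-retained variant within `window`
    positions exceeds r_max. `keep_mask` marks variants that are always
    retained (array SNPs or variants near QTL)."""
    retained = []
    for j in range(genotypes.shape[0]):
        if keep_mask[j]:
            retained.append(j)
            continue
        ok = True
        for k in reversed(retained):
            if j - k > window:
                break  # outside the neighbourhood considered
            r = np.corrcoef(genotypes[j], genotypes[k])[0, 1]
            if abs(r) > r_max:
                ok = False
                break
        if ok:
            retained.append(j)
    return retained

# A duplicated variant (perfect LD) is pruned; an uncorrelated one is kept.
g0 = [0, 1, 2, 0, 1, 2, 0, 1]
g2 = [2, 0, 1, 1, 0, 2, 1, 0]
G = np.array([g0, g0, g2], dtype=float)
print(prune_by_ld(G, [False, False, False]))  # [0, 2]
```

Setting `keep_mask` for the middle variant would force it into the retained set regardless of its correlation with neighbours, mirroring how the 600 k array SNPs and near-QTL variants were exempted from pruning.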
Breeding values for five independent traits were simulated by summing across effects of the 10 k QTL. The five traits were not true replicates because the QTL locations did not vary, only the effects, mimicking quantitative inheritance in which each QTL may affect most traits very little but some traits more. A heavy-tailed distribution was generated by raising normally-distributed effects (q) to the power of 2.7 while preserving sign (q|q|^1.7), such that the largest effect contributed 3 to 13% of the genetic variance, the largest 10 effects contributed 20 to 34%, the largest 100 contributed 57 to 63%, and the largest 1000 contributed 90 to 93%. Actual traits may be controlled by QTL with smaller effects; however, most actual traits had at least one QTL as large as those simulated here [13]. Simulated phenotypes for the five independent traits had reliabilities equal to those for milk evaluations of the actual bulls.
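A minimal sketch of such heavy-tailed effects follows; the sign-preserving transform q|q|^1.7 and the equal-allele-frequency variance approximation (variance proportional to the squared effect) are simplifying assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n_qtl = 10_000

# Normally-distributed base effects, made heavy-tailed by raising the
# magnitude to the power 2.7 while preserving sign: q * |q|**1.7.
q = rng.standard_normal(n_qtl)
effects = q * np.abs(q) ** 1.7

# Share of genetic variance from the largest effects, assuming (for
# illustration only) equal allele frequencies so variance ~ effect**2.
v = np.sort(effects ** 2)[::-1]
share = np.cumsum(v) / v.sum()
print([round(share[k], 2) for k in (0, 9, 99, 999)])
```

The cumulative shares concentrate heavily in the top effects, qualitatively matching the simulated ranges quoted above.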
After imputing the 8.4 million edited variants for all bulls, the variants with the largest effect estimated by genomic prediction or the most significant variants from genome-wide association (GWA) analysis were selected. The oldest 17,896 bulls were used as the reference population, and true breeding values (TBV) of the 9088 younger bulls were used to validate predictions from the selected variants. In all tests, the phenotypes used for estimating effects and selecting variants were only from the truncated reference data so that validation phenotypes were independent and tests should be unbiased.
Many of the reference bulls and a few validation bulls had sequence data included in the 1000 Bull Genomes Project and used for variant discovery, which might bias estimates of allele frequency, but should not bias the phenotypic effects.
Variants can be selected based on the highest significance test, the largest absolute effect, or the largest genetic variance contributed by the locus, which is computed as 2p(1 − p)α², where p is the allele frequency and α is the allele substitution effect. All three methods were compared. Selecting the variants that contribute the most variance has more theoretical appeal and favours variants with higher MAF, which could also improve imputation accuracy. Using the nonlinear Bayes A algorithm of VanRaden et al. [12], the highest ranking markers were selected based on their largest effect or largest variance regardless of their location. Using GWA, the significance of each variant was tested conditional on neighbouring variants already included, and the tests were then combined for each of the five independent traits into an overall measure of significance. The single regression model in GWA was processed using MMAP [14, 15] and included the SNP as a fixed effect and breeding values as random effects modelled with pedigree relationships. Pedigree information was used rather than genomic relationships based on sequence data to separate the individual effect of each SNP from the random, polygenic effect. Multiple regression requires hundreds of iterations to converge, whereas GWA can test many variants without iteration.
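The per-locus variance and the two effect-based rankings can be illustrated with a small invented example (function name and numbers are hypothetical):

```python
import numpy as np

def variant_variance(p, alpha):
    # Genetic variance contributed by a biallelic locus: 2p(1-p) * alpha^2.
    return 2.0 * p * (1.0 - p) * alpha ** 2

# Invented example: ranking by |effect| vs. by contributed variance can
# disagree, because variance also rewards intermediate allele frequency.
p     = np.array([0.02, 0.30, 0.50])   # allele frequencies
alpha = np.array([0.90, 0.40, 0.35])   # substitution effects

by_effect   = np.argsort(-np.abs(alpha))
by_variance = np.argsort(-variant_variance(p, alpha))
print(by_effect.tolist(), by_variance.tolist())  # [0, 1, 2] [1, 2, 0]
```

The rare variant with the largest effect ranks first on effect size but last on variance, which is why variance-based selection tends to pick variants with higher MAF.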
Genomic predictions from 60 or 600 k SNPs were compared with predictions from additional markers that were also selected using Bayes A multiple regression. To mimic the selection process used to design the GeneSeek HD version 1 chip [16], the top 5000 HD SNPs for each of the five traits were selected, and the combined set of 23,600 (24 k) selected SNPs, after removing 1400 duplicates, was added to the 60 k SNPs. To mimic selection on net merit [17], another test selected 24 k SNPs with the largest variance averaged across the five traits instead of selecting the top SNPs for each trait and then combining them.
Selection based on sequence variants should improve accuracy more than selection on HD SNPs, but the previously genotyped SNPs must be retained during imputation because sequence variants are not available for most animals. Genomic predictions included the 600 k SNPs plus the 500 k sequence variants near QTL, totalling 1.1 million variants, which was similar to the analysis of real data by Hayes et al. [10]. The variants chosen in close proximity to QTL are referred to as the genic subset of WGS variants, although gene locations were not simulated, only QTL locations. Final tests of the simulated data added the 10 k true QTL to the 60 k SNPs, and an upper limit on reliability was obtained by using only the imputed QTL in prediction with no prior variance assigned to the markers, the parameter of the heavy-tailed distribution set to its true value, and polygenic variance set to 0% instead of the 10% used in other tests.

Real variants derived from population-scale WGS data
SNP and insertion-deletion (InDel) calls (sequence variants) from run 5 of the 1000 Bull Genomes Project [18] were released in July 2015. Sequence variants for 444 Holstein animals and HD imputed genotypes for 26,970 progeny-tested Holstein bulls were combined by imputation using findhap software (version 3) [19]. Run 5 identified 38 million SNPs and 1.7 million InDels in total, but many of those variants are monomorphic within the Holstein breed. InDels averaged 3 bp in length, with a maximum of 86 bp. Imputed sequence genotypes from the 1000 Bull Genomes Project data were set to missing if none of the three genotype probabilities (AA, AB, or BB) was higher than 0.98 as estimated by Beagle [20].
The HD genotypes of 2394 Holsteins mainly from North America, Italy, and Great Britain were used to impute genotypes of 590,363 other Holsteins that had genotypes obtained mainly by using SNP chips with 50,000 or fewer SNPs. The imputed HD genotypes of bulls used in this study were a subset of those animals. The original 777,962 HD SNPs were reduced to 312,614 by removing highly correlated markers with an |r| higher than 0.95 and by further editing before imputation with findhap (version 3) [12]. To verify direction and consistency of allele codes, genotypes called from sequences were matched to corresponding chip SNPs for 179 Holstein or red Holstein animals that had SNP genotypes imputed in the US database and sequences in the 1000 Bull Genomes Project database.
Variants with a MAF lower than 0.01, incorrect map locations, an excess of heterozygotes, or low correlations (|r| < 0.95) between sequence and HD genotypes for the same variant were removed. A few hundred sequence variants were removed in specific regions that were known to be mapped incorrectly in the UMD3.1 bovine reference assembly. Most map issues had been previously detected by using small sets of SNPs that were lowly correlated to adjacent sets within windows that had excessive total numbers of haplotypes [12].
After merging sequence and HD data, Mendelian conflicts between parents and progeny were set to missing for 0.01% of the genotypes. The percentage of conflicts was expected to be small because both the HD and sequence genotypes had been previously edited. About 1% of the HD imputed genotypes were unknown in the findhap output, and their allele frequencies were substituted when used in genomic prediction. All HD SNPs that were also in the sequence data were retained except in cases when the absolute correlation between HD SNPs was lower than 0.95. This editing step removed less than 1000 (0.3%) of the HD SNPs because a similar edit had previously been applied before imputation [12].
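A minimal sketch of the single-locus Mendelian conflict check, assuming 0/1/2 allele-count coding with −1 for missing (names and coding are hypothetical, not the actual pipeline):

```python
import numpy as np

def mendelian_conflicts(parent, progeny, missing=-1):
    """Flag single-locus parent-progeny conflicts for 0/1/2 genotype
    codes: opposite homozygotes (0 vs. 2) cannot share an allele.
    Conflicting genotypes would then be set to missing before
    imputation. Illustrative sketch only."""
    parent, progeny = np.asarray(parent), np.asarray(progeny)
    called = (parent != missing) & (progeny != missing)
    # Opposite homozygotes differ by exactly 2 allele counts.
    return called & (np.abs(parent - progeny) == 2)

par = np.array([0, 2, 1, 0, -1])
off = np.array([2, 0, 2, 1, 2])
print(mendelian_conflicts(par, off).tolist())  # [True, True, False, False, False]
```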
Three different sets of variants were imputed to test the use of candidate SNPs (Test 1), candidate SNPs and InDels (Test 2), and candidate SNPs, InDels, and intergenic and intronic variants (Test 3). Predictions and QTL discovery using Test 3 data will be reported separately. The initial edits for sequence genotypes used in Tests 1 and 2 were revised in Test 3 because imputation accuracy decreased when millions of intergenic and intronic variants were included. The new edits for Test 3 computed statistics across all samples to improve imputation accuracy instead of editing each animal individually. The VCF file contains three genotype probabilities from Beagle, and the editing done for Tests 1 and 2 simply retained any genotype with a probability higher than 0.98. The new edits were based on an individual probability higher than 0.95, and after processing all animals, a second edit deleted any variant that had more than 5% missing genotypes for low frequency variants (MAF < 0.10) or more than MAF/2 missing genotypes for more common variants. Thus, variants with MAF = 0.50 were not used if more than 25% of the called genotypes had a probability below 95%. The third new edit for Test 3 checked for Hardy-Weinberg equilibrium and deleted variants that had 1.5 times more heterozygotes than the expected fraction of 2p(1 − p). After these edits, only 3,148,506 variants remained.
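The Test 3 variant-level edits can be summarized in a small decision function; this is a sketch of the rules as described, not the production editing code:

```python
def keep_variant(maf, missing_frac, het_frac, p):
    """Sketch of the Test 3 variant-level edits described above.
    maf: minor allele frequency; missing_frac: fraction of genotypes
    with probability <= 0.95; het_frac: observed heterozygote fraction;
    p: allele frequency."""
    # Edit 2: missingness threshold depends on allele frequency.
    max_missing = 0.05 if maf < 0.10 else maf / 2.0
    if missing_frac > max_missing:
        return False
    # Edit 3: heterozygote excess relative to Hardy-Weinberg 2p(1-p).
    if het_frac > 1.5 * 2.0 * p * (1.0 - p):
        return False
    return True

# A variant with MAF = 0.50 is dropped if more than 25% of genotypes
# are low-confidence, matching the example in the text.
print(keep_variant(maf=0.50, missing_frac=0.30, het_frac=0.50, p=0.50))  # False
print(keep_variant(maf=0.50, missing_frac=0.20, het_frac=0.50, p=0.50))  # True
```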
Quality and orientation of calls were examined using 179 bulls that had both sequence and HD genotypes. After reversing the orientation of the HD SNPs to match sequence data and keeping the sequence instead of the HD genotype for animals that had both, the two datasets were combined, resulting in 27,235 animals. Quality of imputation was assessed by keeping 404 of the sequenced animals in the reference population and randomly choosing 40 animals as a test set. Their sequence genotypes were reduced to the subset of genotypes that were in common with HD genotypes and then imputed back to sequence. The percentage of imputed genotypes that matched the original genotypes was the simple measure of sequence imputation accuracy.
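The simple concordance measure can be written directly; genotype coding and names are illustrative:

```python
import numpy as np

def imputation_accuracy(true_geno, imputed_geno):
    """Percentage of imputed 0/1/2 genotypes matching the original
    sequence calls: the simple accuracy measure used above."""
    return 100.0 * np.mean(np.asarray(true_geno) == np.asarray(imputed_geno))

truth   = np.array([0, 1, 2, 1, 0, 2, 1, 2, 0, 1])
imputed = np.array([0, 1, 2, 1, 0, 2, 1, 2, 0, 2])  # one mismatch in ten
print(imputation_accuracy(truth, imputed))  # 90.0
```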
Genomic predictions were computed using deregressed evaluations from August 2011 for 33 traits and 19,575 bulls. Predictions were tested using 2015 data of 3983 bulls with daughters that were first phenotyped after August 2011. Reliabilities were estimated as the squared correlations of predictions with the deregressed evaluations, divided by the reliability of the deregressed evaluations to account for error variance, plus the difference between observed and expected reliability of parent average to account for selection [21]. Regressions of 2015 data on 2011 predictions were compared to the expected value of 1.0.
Test 1 combined 481,904 candidate sequence SNPs with HD genotypes for 312,614 markers, for a total of 762,588 variants. The candidate variants included 107,471 variants in exons, 9422 in splice sites, 35,242 in untranslated regions at the beginning and end of genes, 254,907 within a 2-kb region upstream, and 74,862 within a 1-kb region downstream, based on version 79 of the Ensembl gene annotation database [22] released in 2015 (ftp://ftp.ensembl.org/pub/release-79/gtf/bos_taurus). Test 2 also included any InDels that were located within genes or within the regions 2 kb upstream and 1 kb downstream of genes. Imputed data of Test 3 were used only for GWA because genomic predictions converged too slowly with more than 3 million variants, and the GWA results from real data will be reported separately. Additional file 1: Table S1 lists the variants included in each test.
A subset of variants was selected for potential use in routine genomic prediction by applying methods similar to those used previously to select the HD SNPs with the largest effects in the national evaluation [16], except that only Holstein data were used in the current test. The top 1000 SNPs by absolute effect size for each of the 33 traits were selected from the Test 1 analysis of 762,588 markers and merged to eliminate duplicates, giving 16,648 sequence variants that were added to the 60,671 markers used previously. However, 6584 (about 10%) of those previously used markers were not called as variable, and thus not reported, in the sequence data; the final test set therefore contained 70,735 markers.
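The per-trait selection and merge can be sketched as follows; the function name is hypothetical and random effects stand in for estimated marker effects:

```python
import numpy as np

def select_top_variants(effects, n_top=1000):
    """Select the n_top variants by absolute estimated effect for each
    trait, then merge across traits without duplicates (sketch of the
    selection described above). `effects` is variants x traits."""
    selected = set()
    for t in range(effects.shape[1]):
        selected.update(np.argsort(-np.abs(effects[:, t]))[:n_top].tolist())
    return sorted(selected)

# Random effects for illustration: the merged set is between n_top
# (complete overlap across traits) and n_top * n_traits (no overlap).
rng = np.random.default_rng(0)
eff = rng.standard_normal((500, 3))
top = select_top_variants(eff, n_top=10)
print(len(top))
```

With 33 traits and 1000 SNPs per trait, the merge of up to 33,000 selections yielded the 16,648 distinct variants reported above.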

Simulated sequence
Edits for MAF and LD removed 3.4 and 18.4 million variants, respectively, from the simulated WGS variants from our 1000 bull founder population, which reduced the variant list from 30 million initial simulated variants to 8.4 million that included the 600 k array SNPs and the 505,210 genic variants. For the 26.6 million variants with a MAF higher than 0.01, the maximum absolute correlation with any of the 350 variants on either side was on average equal to 0.96. High correlations improve imputation and also indicate that most QTL can be efficiently traced by nearby markers.
Average reliability of prediction was 28.4% based on the simulated parent average, 77.8 and 80.1% based on the 60 and 600 k chips, respectively, 79.2% based on the markers selected by GWA from the 600 k chip, and 87.2% based on only the 10 k imputed true QTL with no weight on the markers (Table 1). The reliability gain of 2.3 percentage points obtained for the 600 k compared with the 60 k SNPs is larger than reported earlier from either simulated (0.9) or actual (0.4) genomic predictions [12]. The previous results led to the conclusion that simply adding more markers resulted in only small improvements because the prior variance for each marker was smaller, causing more shrinkage of all marker effects. Also, the additional markers were imputed rather than directly observed.
In Table 1, the other variant subsets were selected using effects from multiple regression instead of GWA. Adding the 24 k SNPs from the 600 k set with the largest effects to the 60 k SNPs gave a reliability that was 2.2 percentage points higher than adding 24 k SNPs selected by GWA and 1.3 percentage points higher than using all 600 k SNPs, which was consistent with previous results from real data [16]. Selecting SNPs on effect variance was expected to be more efficient than selecting on effect size, but effect size resulted in slightly higher reliability (81.5 vs. 81.2%). The increased MAF should have improved imputation accuracy, but only 19% of the SNPs differed between the two selection strategies. Selecting 24 k SNPs based on an average of the five traits to mimic index selection (results not shown in Table 1) led to only about 50% of the markers being in common with the other two strategies and resulted in slightly lower reliability than selecting for each trait and then combining them (81.1 vs. 81.2%).
The genic subset of 1.1 million simulated sequence variants resulted in a reliability of 86.4%, which was much higher than the 81.5% obtained from the best strategy for selecting among the 600 k SNPs and only about 1 percentage point less than the 87.2% maximum obtained by using just the 10 k true QTL. This confirms that selection of variants near genes improves accuracy if all genes are known and all variation is associated with genes, which is in agreement with Pérez-Enciso et al. [6]. Including 1.1 million variants in routine evaluations or on chips is difficult, but 60 k SNPs plus the top 24 k SNPs chosen from the 1.1 million by multiple regression resulted in a reliability of 85.0%. If the 10 k true QTL were added to the 60 k SNPs but were not given extra prior variance, the reliability was only 84.5% because too much prior variance was assigned to the 60 k SNPs compared to the 10 k QTL. All tests of simulated data had regressions of TBV on genomic predictions that averaged 1.02 to 1.05 across the five traits, slightly higher than the expected value of 1.0; regressions on parent average averaged 0.98.
Computing resources are in Table 2 for each step run on an IBM X3850 X5 with 4 Intel X7560 CPUs (32 cores, 64 threads @ 2.27 GHz), and 512 GB of memory. Genotype simulation required 56 h with one thread and 210 GB of memory and the output was a 32-GB file. Calculation of linkage correlations between neighbouring sequence variants and pruning those that were highly correlated took 1 h with 10 threads and 27 GB of memory. Imputation of 8.4 million variants for 26,984 bulls required 38 h with 20 threads and 13 GB of memory and the output was a 220-GB file. Selection of variants by GWA required only 30 min with 30 threads and very little memory. Genomic prediction for 1.1 million variants and five traits required 22 h with five threads and 20 GB of memory. Thus, GWA was faster for selecting variants, but multiple regression selected marker sets that gave more reliable predictions.

Real variants
Edits for real as well as simulated sequence variants are documented in Table 3. In the real data, 20 million of the initial 39 million variants were removed because of low MAF, and another 13 million were removed because of high linkage with neighbouring variants. Further edits in Tests 1 and 2 retained only the HD markers, candidate SNPs, and candidate InDels. In Test 3, 3 million of the remaining variants with lower genotype probabilities were removed to improve imputation accuracy.

Table 1 Reliabilities for five simulated traits from ten sources of genetic information
Reliabilities expressed as percentages; 24 k markers selected from 600 k SNPs by GWA p value, multiple regression effect size (ES) or effect variance (EV); 24 k markers selected from sequence SNPs in or near genes (G) by effect size; 600 k markers plus 500 k SNPs in or near genes

Only 91% of the 60,671 chip SNPs currently used in official US evaluations were included in the sequence data. Some markers with a low MAF were expected to be missing, but the average MAF of the 9% that were missing and of the 91% that matched were both about 0.28 in Holsteins. The missing markers are evenly scattered across the chromosomes; therefore, they probably do not indicate reference genome misassemblies but likely result from edits during variant identification [18]. The individual correlations of HD genotypes with sequence genotypes were mostly near +1 or −1, which indicates good quality for the 91% of HD SNPs present in the sequence data. About half of the genotypes had reversed allele coding compared to the sequence variant calls because sequence alternate alleles are coded as differences from a Hereford cow-derived reference genome, whereas the array alleles were in Illumina TOP encoding. Average imputation accuracy was 97.2% of correct genotypes for the 762,588 variants in Test 1 across all chromosomes, with a maximum of 98.5% for bovine chromosomes BTA20 (BTA for Bos taurus chromosome) and BTA22 and a minimum of 94.9% for BTA15 and 95.0% for BTA4 (Fig. 1). The X chromosome was split into the pseudo-autosomal region, which was labelled as BTA30 and was poorly imputed, and the X-specific loci, labelled as BTA31; no Y loci were present. Imputation accuracy was 97.0% for the 1,003,453 variants that included InDels in Test 2 and 96.7% for the 3,148,506 variants that also included intronic and intergenic variants in Test 3. These percentages are inflated because they include the HD SNPs that were already present.
If HD SNPs are not counted, accuracies of 95.3, 95.6, and 96.4% for just the new variants were found for Tests 1, 2, and 3, respectively. The lower imputation accuracy for BTA12 in Test 3 was mainly caused by a gap between 72.4 and 75.2 Mb for which no SNPs were available on the HD array.
The total time required to prepare, edit, and impute the 762,588 variants for 27,235 animals ranged from 1 to 5 h per chromosome (Table 4) and was about 5 days for all 30 chromosomes. Data manipulation steps such as transposing the sequence data and merging with HD SNPs used one thread and took more time than the imputation, which used 20 threads and took less than 1 day.
Reliability of predictions improved by only 0.6 percentage points on average using 762,588 variants (481,904 candidate sequence variants and HD SNPs) compared with using HD SNPs only (Table 5). Inclusion of InDels decreased the advantage over HD SNPs to only 0.4 percentage points. Reliability improved by about 2.7 percentage points compared with 60 k SNPs only for the final set of 70,735 variants (the 60 k SNPs minus the 6584 markers not reported in the sequence data, plus the 16,648 selected sequence variants). The 60 k SNPs already included the best SNPs selected from the HD chip [16], which may explain why 60 k predictions slightly outperformed HD predictions. For most traits, regressions of validation data on genomic predictions were near the expected value of 1.0 and changed little with the selected subset of variants used (Table 6). For use with lower-density genotyping arrays, the list of 16,648 sequence variants was further restricted to 4822. Hand-made edits were applied to prevent too many candidate SNPs from all tagging the same QTL. Figure 2 provides an example for BTA5 of the SNPs that were retained or removed. The same list of 4822 SNPs was provided to Zoetis (Florham Park, NJ), GeneSeek (Lincoln, NE), and Genetic Visions (Middleton, WI) for potential inclusion on revised arrays. Benefits of adding the sequence SNPs directly to lower-density rather than only to medium- or higher-density arrays are that more young animals can be genotyped quickly and imputation loss can be avoided when including sequence SNPs in routine predictions. Re-genotyping or sequencing more reference animals could also help avoid imputation loss when estimating SNP effects for newly discovered variants.
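Although the restriction to 4822 SNPs was done with hand-made edits, the underlying principle of keeping one strong variant per region can be sketched in an automated form; the 250-kb window and function name are invented for illustration only:

```python
import numpy as np

def thin_candidates(positions, effects, min_gap=250_000):
    """Keep the variant with the largest absolute effect within each
    min_gap window: a simple automated stand-in for the hand-made edits
    that kept many candidate SNPs from tagging the same QTL.
    `positions` are base-pair positions on one chromosome."""
    order = np.argsort(-np.abs(np.asarray(effects)))  # strongest first
    kept = []
    for j in order:
        if all(abs(positions[j] - positions[k]) >= min_gap for k in kept):
            kept.append(int(j))
    return sorted(kept)

pos = [1_000_000, 1_050_000, 1_300_000, 5_000_000]
eff = [0.8, 0.9, 0.3, 0.5]
print(thin_candidates(pos, eff))  # [1, 2, 3] - variant 0 is too close to variant 1
```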

Comparison with previous studies
Previous studies used 5000 bulls with HD genotypes and 10 million variants from run 3 sequence data [8] or 4 million variants from run 4 [3], but sequence predictions in those studies had slightly lower reliabilities than predictions from BovineHD or BovineSNP50 genotypes. The HD genotypes in those studies were all observed, but HD genotypes used in our study were mostly imputed. Use of run 3 or run 4 instead of run 5 sequence data could explain their slightly negative instead of slightly positive gains. The results from those studies and ours suggest that errors in the sequence data or remaining reference assembly mistakes that altered the order of variant sites could account for the small changes in prediction reliability when hundreds of thousands or millions of sequence variants were added.
Our results indicate that adding smaller numbers of selected sequence variants can be useful in routine prediction even if the analysis of all variants is not more accurate or feasible, which is consistent with previous conclusions for sequence [2] or HD data [16, 23]. Brøndum et al. [2] added 1623 sequence variants selected by GWA from multiple breeds to a custom chip and reported gains in reliability that averaged about 2 percentage points. Small improvements (0.2 percentage points) from adding SNPs that are located in genes associated with fertility were observed by Ortega et al. [9], which is consistent with gains reported in this and earlier studies [12]. Using selected sequence variants and giving extra weight to candidate variants or QTL can improve predictions across breeds [5, 24-27], but the advantages of focusing on candidate variants decrease if not all QTL are in the variant set [6]. Multi-trait methods can detect QTL that single-trait methods might miss [28], and even uncorrelated traits can help separate QTL from markers if many independent traits are controlled by a limited number of QTL.

Comparison of simulated and real selection
Properties of the real sequence data from the 1000 Bull Genomes Project were similar to those of the data simulated by VanRaden and O'Connell [29]. LD and MAF distributions in the real and simulated data are compared in Figs. 3 and 4, respectively. Overall, results were similar for real and simulated data, but more of the variants in the real data have a very low MAF or are in very low LD with neighbouring variants. Average MAF was the same (0.20) for the HD and genic SNPs in Test 1 but was lower (0.15) for the InDels added in Test 2, which could have affected imputation accuracy. Edits for MAF and for high LD reduced the 30 million simulated SNPs to 8.4 million, whereas the same edits reduced the 39 million real variants to 6.3 million (Table 3). Our edits for Test 3 were similar to those of Calus et al. [3], who obtained 4.1 million variants from Holstein data in run 4. The simulated variants were for one breed with a common pedigree, whereas the real variants were discovered in a wide variety of breeds and only the variants that were polymorphic in Holsteins were retained. Also, the real data contain some false positive variants because of sequencing, alignment, and calling errors that are not modelled in the simulated data. Many variants had to be excluded from the Test 3 data because of low previous imputation accuracy, whereas the simulated data were of high quality for all variants. Gains in reliability from the use of real sequence data were smaller than from simulated data but higher than previous gains reported from HD data. Larger gains may be possible if the selected SNPs are added to arrays and genotyped directly with high accuracy instead of imputed from less accurate sequence data. Accuracies of genotypes from sequence variant calling can vary [30], whereas the error rate of Illumina BeadChip arrays is less than 1% for nearly all SNPs.

Computation
Most computing steps in Table 4 were programmed in Fortran for efficiency, but several steps were in SAS for convenience. The SAS program used to merge sequence and HD data took only 6 min for the shortest chromosome but 143 min for the longest one; it could be rewritten because it became a limiting step. Total times required for Tests 2 and 3 were only a little longer than those shown for Test 1 because imputation took a small fraction of the total time. Imputation of 8.4 million simulated variants took only 38 h with 20 threads for 26,984 reference bulls. Larger populations or variant sets can be imputed, but genomic predictions then become the limiting step. More research is needed on how to accurately and efficiently select the best subset of variants for routine use.

Economic benefit
Increasing the reliability of selection by 2.7 percentage points from 64.7 to 67.4% would add about $3 million per year to national genetic progress. Additional progress would be realized globally for foreign breeders that directly use the new genotyping arrays or that indirectly benefit by selecting breeding stock from the improved US population. Annual domestic progress is now about $50 per cow and would increase to $51 after multiplying by the accuracy ratio of 1.02, which is the square root of the reliability ratio (67.4/64.7). This higher accuracy has an annual national value of about $3 million because each year 3.3 million of the 9.2 million US dairy cows are replaced. These annual gains are permanent and will accumulate. The initial cost of generating the US sequence data for the 88 dairy bulls contributed to the 1000 Bull Genomes Project was $132,000 at current estimates of reagent costs (assuming a cost of ~$1500 per sample). The return on investment from this research is high and greatly increased because of data sharing. New animals will be directly genotyped for the selected variants and thus could have slightly higher reliability gains than in these tests that use imputed data, but most reference animals will still have imputed data. Re-genotyping old animals with the new arrays might be less expensive than additional sequencing to improve accuracy of imputation.
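The arithmetic above can be verified directly; all values are taken from the text, and rounding explains the approximate $3 million figure:

```python
import math

# Values taken from the text above.
rel_old, rel_new = 0.647, 0.674
accuracy_ratio = math.sqrt(rel_new / rel_old)       # square root of reliability ratio, ~1.02
annual_gain_per_cow = 50 * accuracy_ratio - 50      # extra dollars per cow per year
cows_replaced = 3_300_000                           # US replacements per year
national_value = annual_gain_per_cow * cows_replaced
print(round(accuracy_ratio, 3), round(national_value / 1e6, 1))  # 1.021 3.4
```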

Conclusions
Variant selection is needed because routine genomic predictions cannot impute and include all of the millions of sequence variants for all animals. Significant gains in reliability are possible if the true QTL can be identified or if bioinformatic methods can choose the regions that are more likely to contain causative variants. Because individual QTL have such small effects, large reference populations are needed with phenotypes for the relevant traits and observed or imputed genotypes for the QTL or closely linked variants. Testing many individual traits gives more power because the effect of each QTL may be detectable only for a few traits, but these same QTL may have smaller effects on several correlated traits. Assigning more prior variance to the QTL or to the newly selected variants can improve reliability when estimating effects, but the SNPs from previous arrays must be retained during imputation because genotypes of previous animals include only the SNPs and not the new variants.
Computation becomes a limiting factor as reference populations and target populations grow in size. Total computing time was only a few days with up to 1000 sequences and 30,000 reference bulls, but more than 150,000 reference cows and 800,000 young animals were not included. Multiple regressions used for genomic prediction were more accurate than GWA for selecting variants but required much more computation time. Imputation allows many more sequence variants to be tested, selected, and included in routine predictions to increase their reliability. For both the simulated and real data, gains from selecting and including candidate sequence variants were larger than from selecting HD SNPs.