Development of a high-density, 2M SNP genotyping array and 670k SNP imputation array for the domestic horse

Robert J. Schaefer, Mikkel Schubert, Ernest Bailey, Danika L. Bannasch, Eric Barrey, Gila Kahila Bar-Gal, Gottfried Brem, Samantha A. Brooks, Ottmar Distl, Ruedi Fries, Carrie J. Finno, Vinzenz Gerber, Bianca Haase, Vidhya Jagannathan, Ted Kalbfleisch, Tosso Leeb, Gabriella Lindgren, Maria Susana Lopes, Nuria Mach, Artur da Câmara Machado, James N. MacLeod, Annette McCoy, Julia Metzger, Cecilia Penedo, Sagi Polani, Stefan Rieder, Imke Tammen, Jens Tetens , Georg Thaller, Andrea Verini-Supplizi, Claire M. Wade, Barbara Wallner, Ludovic Orlando , James R. Mickelson, Molly E. McCue

Here we describe the steps taken, including careful and extensive variant filtering, to create both a 2 million (2M) SNP array and a 670 thousand (670k) SNP array from over 23 million variants discovered from whole genome sequence of 153 horses representing 24 breeds (See Table S1). Genotypes from the 2 million probe genotyping array (MNEc2M) in a cohort of 332 horses, from 20 actively-researched or economicallyimportant breeds that represent known genetic diversity in the domestic horse [17], were used to select SNPs for inclusion on the commercially-available 670k probe array (MNEc670k). The MNEc670k array was designed for accurate genotype imputation up to the higher density, 2M SNP set present on the MNEc2M array. We report summary statistics, broken down by breed, for both arrays as well as preliminary results on genotype imputation performance from the MNEc670k array to the SNP density on the MNEc2M array.

Variant discovery from whole genome sequence
Whole genome, 100 base pair (bp), paired-end Illumina HiSeq reads were generated for 153 horses (including Twilight), representing 24 distinct breeds, at a depth between 1.7X and 64X, with a median depth of 13X (Table S1). Read mapping was performed using the PALEOMIX genome mapping protocol [26] to efficiently process samples in parallel and to assess individual sample quality. Each sample was mapped to the EquCab2.0 reference genome, which had been extended with an additional 7,850 de novo assembled scaffolds (See Methods), to produce a total of nearly 48 billion unique reads being aligned to the nuclear genome (See Table S1). Variants were identified by extending the PALEOMIX framework to identify SNPs using two variant callers (see Methods). To maximize efficiency of variant calling in individual breeds, and to minimize bias due to variable sequencing depth of coverage, individuals were broken up into 16 variant calling groups (see Table S1, Variant Calling Group columns) by estimated depth of coverage and breed.
Variants were called using permissive parameters in both the GATK UnifiedGenotyper [27] as well as SAMtools 'mpileup' utilities [28]. Approximately 23 million potential SNPs were called by both GATK and SAMtools. These 23 million SNPs were kept for further analysis and validation (See Table 1; Supp. File 1).

Precision of GATK QUAL scores for variants identified by both callers
We evaluated the performance of GATK QUAL scores by comparing genotypes generated from WGS to genotypes generated from the legacy Equine 54K SNP chip in 23 horses at two different sequencing depths.
Variants detected on chromosome 1 (ECA1) by both WGS and 54k SNP chip were ranked by decreasing QUAL score. The proportion of genotype calls that agreed between 54K and WGS (i.e., precision) was compared based on ranked QUAL score (i.e., recall) [29]. For each ranked variant we evaluated the precision at that point, which is the cumulative proportion of genotypes that agree between the two genotyping methods (Fig. 1). We note that this approach evaluates the concordance between the two genotyping technologies and is thus unable to assess the accuracies of the underlying genotypes themselves.
However, this approach does take into account the imbalance of false positives/negatives between genotypes called by WGS and by the 54K SNP chip [30].
Considering 100 percent of variants detected by both technologies, genotypes had an overall precision of ~99% for 12X-called variants and 94-95% for 6X-called variants (See Fig. 1, x-axis). This is consistent with results from a similar study that compared de novo variant genotypes to array based genotypes in the Franches-Montagnes horse breed [25]. However, many SNPs with very high QUAL scores (e.g., SNPs within the top 10 percent; Fig. 1) had disagreeing genotype calls between WGS and 54K SNP chip.

), in both 6X
MOR and STD comparisons, the proportion of matching genotypes increases, indicating that there are many variants with low QUAL scores and high precision. These results indicated that QUAL scores alone did not adequately rank variants, and that additional metrics were necessary to improve the reliability of SNPs ultimately chosen for the higher density genotyping arrays.

Identifying gold standard reference set for variant recalibration
In addition to QUAL type quality scores, GATK outputs additional metrics such as depth, quality of depth, Fisher strand position, map quality rank sum, and read position rank sum for each variant based on read statistics (See Methods for details of these metrics). These values were used to train linear mixed models, using the GATK VariantRecalibrator [31], to assign a composite quality score (VQSLOD) to help detect type I and II errors.
Training these models required a "gold standard" reference set of known genotypes across multiple individuals. Lacking this resource in the domestic horse, we defined three high confidence, putative "gold standard" datasets with which to train the VariantRecalibrator: 1) SNPs on the legacy SNP50 chip; 2) WGS variants which were seen in four or more (4+) calling groups (Table S1); and 3) WGS variants that were in the top one percent of QUAL scores. Models were trained on features from "gold standard" variants present on chromosomes 2-32 and used to calculate VQSLOD scores for SNPs on chromosome 1. To assess the performance of VSQLOD scores generated by each training set, scores from each group were compared to each other in addition to QUAL scores using horses genotyped on the 54K SNP Chip.
SNPs on chromosome 1 (ECA1) were ranked either by decreasing QUAL or by decreasing VQSLOD score generated from each of the three gold standard training groups. Fig. 2 shows the proportion of matching genotypes between WGS and SNP50 platforms for each of the four groups. With the exception of 6X Morgans, SNPs with high VQSLOD scores agreed more often with genotypes called on the 54K SNP chip compared to SNPs with high QUAL scores alone. For example, the top 10% highest scoring VQSLOD scores had a higher concordance rate than QUAL scores in all cases except for MOR6X, where only two of the recalibrated scores marginally outperformed QUAL scores (See Fig. 2; see discussion). Additionally, VQSLOD variants did not drop below the overall discordance rate, showing an overall better ranking than QUAL scores alone (red curve, Fig. 2).
While an exact differentiation between the gold-standard training groups remains unclear (See Discussion), recalibrated VQSLOD scores were better able to differentiate high scoring, false positives than QUAL scores alone, making them a more informative metric for selecting the final SNPs for use on the highdensity commercial array. VQSLOD scores were calculated for all 23 million variants, in all remaining horses, using WGS variants called in the 4+ breed calling groups (described above) as the training set (candidate 23M SNP set). Scored variants were then filtered in several steps (described below) to select the subset of SNPs to be included on the new higher density arrays.

Preliminary 5M SNP selection for genotyping probe design
We analyzed the 23M candidate SNP set with the goal of generating a target set of ~5 million highconfidence SNPs that would be compatible with array design. Following criteria provided by Affymetrix, 780 SNPs were filtered because they fell within repetitive regions; 9,342,733 SNPs were filtered because they were within 20 base pairs (bp) of another variant; and 4,858,273 SNPs were filtered for having a minor allele frequency below 1% in the 153 WGS horses. After filtering, approximately 10 million SNPs remained for further design selection (hereafter referred to as the 10M SNP Set, Table 1). Of these ~10 million arraycompatible candidates, 48,485 SNPs (65%) overlapped with the sites available from the ~74k legacy SNP sets (although all legacy SNP sites were included in MNEc2M/670k array design). Supp. Fig. 1 shows the alternative allele frequency distribution of the 23M de novo and 10 million design candidate SNPs (10M SNP set). The mean allele frequency is below 0.1, though there is a long tail of SNPs with a high alternative allele frequency. Interestingly, there are 445,421 SNPs (23M set) and 46,455 SNPs (10M set) where the alternative allele frequency is 1.0. This is likely due to errors in the EquCab 2.0 reference sequence (Sanger reads from Twilight) as Twilight's whole genome Illumina sequence data (collectively at more than 25X), and whole genome sequences of all other horses, contained the alternate allele. These SNPs were excluded from array design.
To ensure backwards compatibility, we started with the ~74k SNPs that were present on the two legacy (54K/65K) arrays. To fill the remaining target of ~5 million probe design candidates, we added SNPs from the 10M SNP set based on even genome distribution, informativeness among the variant calling groups, and linkage disequilibrium (LD). LD was calculated for all pairs of SNPs within 10kb of each other throughout the genome. SNPs that were in high LD (r 2 > 0.90) were filtered based on whether or not they were present in draft or pony variant calling groups to control for the fewer number of samples used in variant discovery. Previous SNP designs [2] show an underrepresentation of informative SNPs from these groups, which leads to poor mapping resolution in draft and pony breed groups (See Table S2). SNPs discovered in both draft and ponies were prioritized over those discovered in one group and over those absent in both. If SNPs were discovered in an equal number of priority groups, VQSLOD scores were used to break ties.
After applying these SNP candidate criteria, 5,443,950 SNPs (5M set, see Table 1) were kept, of which 2,199,467 SNPs occurred in pony calling groups and 2,782,917 SNPs occurred in draft calling groups. A total of 1,695,347 SNPs occurred in both ponies and drafts. Flanking sequences (35 bp upstream and downstream) for these 5M filtered SNPs were submitted to Affymetrix for SNP probe design analysis.
SNP selection for the high density MNEc2M SNP array SNP conversion recommendations for the 5M SNPs were provided from Affymetrix in both forward and reverse strand directions. SNPs were assigned to one of four groups based on decreasing probability of successful probe design: 'recommended', 'neutral', 'not recommended', and 'not possible'. To achieve an even distribution of ~2 million SNPs, the equine reference genome (~2.7 Gb) was divided into approximately 54,000 50 kb windows and a target of 37 SNPs per window was established. SNPs within each window were chosen for inclusion in the MNEc2M SNP set using a greedy algorithm. Briefly, the ~74,000 SNPs on previous generation arrays, as well as SNPs within the equine major histocompatibility complex (MHC) region (ECA20:28.7-33.6Mb), were given VIP status and were automatically included in both forward and reverse strand directions. SNPs were added to windows until the target of 37 SNPs was met. If a window had more than 37 candidates, SNPs were selected based on by Affymetrix recommendation group (See Methods for details). In total, given the above criteria, 2,001,826 high quality SNPs (2M SNP set) were chosen and submitted for probe design to comprise the MNEc2M genotyping array (Supp. File 2).

SNP selection for the MNEc670k array
A cohort of 347 horses were genotyped on the MNEc2M high-density array using DNA isolated from blood or hair roots (see Methods). The ~2M SNP genotypes were split across three different physical arrays.
Genotypes were called using Affymetrix Power Tools and sample quality control was assessed according to Affymetrix best practices (See Methods). For each sample, the three physical arrays were assessed separately for quality control. In total, 320 samples passed in all three arrays while 27 samples had one or more arrays that did not pass quality control. Of the 27 samples that failed, 7 samples had two passing arrays, 5 samples had one passing array, and 15 DNA samples failed to meet genotyping quality control metrics on all three physical arrays. Failed arrays were removed from the analysis, however if a sample had at least one array, genotypes from those arrays were retained. In total, viable genotypes were produced for 332 horses.
Of the failed arrays, 25 had DNA isolated from hair roots and 2 had DNA isolated from blood. Hair root DNA had a lower average DNA concentration (Pico Green) when re-hydrated (2.7 ± 2.9 ng/µL) than did DNA samples isolated from blood (43.1 ± 55.5 ng/µL). To determine if sample origin (blood versus hair roots) or DNA concentration, or both, were associated with failure to pass genotyping quality control metrics, a logistic regression for sample success on DNA concentration and blood/hair status was performed. (Supp. Figure 5). Both DNA source (p ≤ 4.05e-04) and DNA concentration (p ≤ 2.77e-10) significantly influenced the probability of samples to produce quality genotypes (see Discussion).
After quality control steps for samples, genotypes were called for all 2,001,826 SNPs. Genotypes were assessed for clustering quality using the Metrics.R script provided by Affymetrix (see Methods). In total, 92.2% of SNPs on the MNEc2M array passed quality control producing a set of 1,846,988 high-quality SNPs (1.8M; see Table 1) genotyped on the 332 horses remaining in the analysis (see Methods).
SNPs exclusive to a single WGS variant calling group were validated (polymorphic), on average, at a rate of 80%. However, validation rates were much higher in SNPs discovered in multiple calling groups (>96%) (See Table S3). Genotypes produced from the 332 horses with passing genotypes on the MNEc2M array ("SNP" genotypes) were combined with genotypes discovered from the 153 whole genome sequence horses ("WGS" genotypes) to create a dataset containing 1.8M genotypes for 485 horses ("WGS+SNP" genotypes). These variants were analyzed to select a subset of SNPs for inclusion on the MNEc670k array, with the intent that this array would be designed for genotype imputation to the full 2M SNP set.
To maximize the information content of the MNEc670k array SNPs, multi-marker r 2 statistics were calculated on the 1.8M high-quality candidate SNPs to identify 'tagging SNPs' that allowed for efficient reconstruction of genomic haplotypes in the 485 horses both within and across breeds. Tagging SNPs that reconstructed haplotypes across all 485 horses (inter-breed tag SNPs) were identified using the software FastTagger [32]. In total, 355,903 tagging SNPs were needed to reconstruct haplotypes present across all the horses in the cohort with a minor allele frequency (MAF) > 0.01 and tag-SNP r 2 > 0.99. These SNPs were included on the MNEc670k imputation array (see Table S4; Inclusion criteria: Inter).
Haplotype tagging SNPs were also examined at the breed-specific level. Horses were split into 15 tagging breed groups based on the minimal sample size necessary to perform SNP tagging (see Table S5). Tagging SNPs were identified separately in each tagging breed group using FastTagger, then a subset of population specific tag SNPs were identified using the software Multi-Pop-TagSelect [33]. Combined, a total of 1,754,075 SNPs were needed to reconstruct fine-level, breed-specific haplotypes in all of these breed groups (MAF > 0.10 and tag SNP r 2 > 0.90; See Supp. File 3). Breeds varied in the number of tag SNPs needed to reconstruct haplotypes. Table 2 shows the number of tagging SNPs required to reconstruct haplotypes in each of the tagging breed groups. The Ponies, Draft and Quarter Horses (tagging breed groups) required the most tagging SNPs, each requiring over 350,000 SNPs to reconstruct breed-specific haplotypes, while Thoroughbreds, Icelandic and Lusitano tagging breed groups each required less than 150,000 tag SNPs, to reconstruct haplotypes.
Tagging SNPs that were informative in 5 or more tagging breed groups were included on the MNEc670k array (n=206,822; see Table S4; inclusion criteria: Intra). An additional 13,993 SNPs that tagged haplotypes in the breeds requiring the largest number of tag SNPs were also included in the array (inclusion criteria: Diverse). Additionally, 7,394 SNPs were included due to their location within the equine MHC region

Imputation accuracy from the MNEc670k SNP set to the MNEc2M SNP
Genotype imputation accuracy from the MNEc670k array to the MNEc2M array was quantified using the 485 horse reference population. For each of the tagging breed groups a subset of 1/3 random individuals were masked down to the MNEc670k SNP set. Genotypes were then imputed to the MNEc2M SNP using the reference population after removing the individuals being imputed (See Methods). Imputation accuracy was measured as total genotype concordance across imputed individuals (correctly inferred genotypes/total number of imputed genotypes; see Methods for details).
Genotype imputation accuracy from the MNEc670k SNP set to the MNEc2M SNP set ranged between 96.6 and 99.4% in the 15 breeds tested (see Table 3). Tagging breed groups with over 99% mean imputation

SNP properties of the MNEc2M and MNEc670k arrays SNP informativeness and Inter-SNP distance
Inter-SNP distance was calculated for the MNEc2M, as well as the MNEc670k, SNP sets at different levels of MAF (see Fig. 3 and Table 4) to assess the distribution of SNPs across the genome. On average, 1,250 and 3,756 bp, respectively, separated variants on the two arrays. Informativeness, defined as the number of SNPs with at least one heterozygote, was calculated for the same MAF cut-offs (Table 4). Inter-SNP distance, as well as informativeness, were broken down by breed for both the MNEc2M SNP set (See Supp.

Alternate Allele Frequency
Frequency of alternate (non-reference) SNP alleles were calculated on both arrays using the full 485 sample dataset (WGS+SNP), genotypes derived from whole genome sequence only (WGS Only), and from samples genotyped on the 2M test array described above (SNP Only; see Table S5). Kernel density estimations (KDE) of the alternate allele frequency of SNPs on the MNEc2M array showed a mean frequency between 0.20 and 0.28 (See Fig. 4 and Table 5). In general, regardless of genotyping source, alternative allele frequencies shared similar distributions. Frequency distributions exhibit long tails indicating substantial numbers of samples with non-reference alleles. Alternate allele (ALT) distributions also exhibit a lower median frequency (red line) than mean frequency (red bar), a common property of right skewed distributions [34]. Alternate allele frequency distribution was also broken down by breed for both the MNEC2M array (Fig. 5) and the MNEc670k array (Supp. Fig 4) using genotypes derived from the WGS+SNP sample set. Allele frequency is balanced across breeds, though there were minute differences.
For example, the median MAF for Thoroughbreds was 3-12% lower than all other breeds, though this is not unexpected given the reference genome is a Thoroughbred. Despite minor differences, all breeds had long tails and similar distributions of allele frequencies indicating a balanced and representative SNP selection for GWAS.

Linkage disequilibrium decay by breed
Linkage disequilibrium was measured using genotype r 2 between all 1.8M SNPs within 1Mb of one another within and across breeds. LD across breeds (i.e., the WGS+SNP sample sets) (see Fig 6, SNP and WGS curves) decayed faster than LD within any given breed (remaining curves). Within-breed calculations demonstrated that Quarter Horses and the Pony breeds had the lowest LD between SNPs at long distances, decaying to below 0.10 at 1Mb, while Thoroughbreds had the highest LD at all distances considered.

Discussion
Our goal was to provide a high-quality, standardized SNP array designed for imputation to overcome limitations in SNP density that under-power many genome mapping projects in the horse. To do this, we utilized whole genome sequencing data from 156 horses representing actively-researched and economically-important breeds that were collaboratively collected by 17 laboratories within the equine genetics community. These data, in turn, enabled a large scale, whole genome sequence mapping pipeline with de-novo variant calling resulting in a starting set of over 23 million variants. Variant filtering was based on genome coverage, breed representation, linkage disequilibrium, and feasibility of array probe design. This resulted in the successful design of a fully backwards compatible, high-density, 2 million SNP genotyping array (MNEc2M). Using a test cohort of 485 horses, we identified "tag" SNPs that reconstructed haplotypes both across diverse breeds and within breeds, then selected a subset of ~670k SNPs for design of the commercially available MNEc670k array. The use of tagging SNPs in this commercial array ensures its utility as an imputation tool up to a SNP density of at least 2 million.
Several successful GWAS in other agricultural animal species have been reported using this same Affymetrix ® Axiom ® HD genotyping array technology [35][36][37]. High-density ~670k genotyping arrays were used in domestic cattle to identify structural variation [38,39]. Other studies have leveraged this technology to discover and refine loci associated with production traits [40][41][42]. Based on reports from other domestic animals, coupled with preliminary analyses performed here, we anticipate similar performance and increased power for map-based studies to soon follow in the horse.

Assembly correction of the equine genome
Initial filtering for array compatible SNPs immediately reduced the pool from ~23 million SNPs that were discovered from de novo variant calling to 10 million SNPs compatible with array design. These purely technical constraints, such as removing all variants that were in close proximity to one another, substantially reduced the amount of variants considered here. While we thoroughly characterized the SNPs that were ultimately included on the genotyping array, the entire 23M SNP set has been submitted to dbSNP and additional analyses are being performed to further enhance knowledge of equine genetic diversity at the WGS level. This includes investigation of the > 400,000 sites where all WGS alleles differed from the reference allele determined by Sanger sequencing of Twilight. If the discrepancies in these data are verified, this information will be useful for further annotation and error correcting of the current equine genome reference assembly. Here, we focused on variants that were compatible with array design, however, we anticipate that improved genome annotation and finer scale haplotype maps will result from this larger SNP set.

Determining highly precise, gold standard SNP sets
Quality control and evaluation of conversion rates on the previous arrays have shown that many of the SNPs assayed on the legacy arrays are non-informative or not polymorphic in many breeds (See Table S2).
These SNPs could truly occur at low frequencies within those populations, could have been singleton mutations in the individuals included in the legacy SNP discovery panel, or may have been false positive sites where no true genomic variation occurs. Compared to the SNP discovery panel used in the legacy arrays, the cohort of 153 horses used here provided a valuable opportunity to identify a precise set of SNPs that would be informative across many breeds.
While variant discovery from WGS had substantially higher overall depth of coverage than the SNP50 discovery effort, many breeds still had a modest number of individuals or relatively low average depth of coverage (Table S1). Still, genotype comparisons at variant sites from WGS to legacy arrays showed an overall high congruency rate (94% at 6X and 99% at 12X; Fig. 1), however, many SNPs with high QUAL score did not agree between WGS and legacy arrays. Since we were unable to determine the source of error in genotype mismatches between the WGS discovered variants and the legacy SNPs, especially those with high QUAL scores, we used additional information such as how many variant calling groups a SNP was discovered in as well as sequencing read level information to determine the criteria for including WGS discovered SNPs on the MNEc2M and MNEc670k arrays.
Distinguishing marginally higher quality variants, even at overall validation rates of over 94% becomes important when selecting a small fraction of variants to be included in a commercial array. Selecting a final subset (670k) containing only 6% of the initial ~10 million array-compatible variants posed an opportunity to control for overall false discovery rate during SNP selection. Variant recalibration allowed successful identification of variants with high QUAL scores that were less discordant between WGS and SNP50 genotypes. In the absence of high-confidence training SNP sets (i.e., dense HapMap data), we defined several different "gold standard" SNP sets for variant recalibration. While variant recalibration performed better than QUAL scores alone, there was not a single training set that clearly out-performed the others (Fig. 2). However, using recalibrated VQSLOD scores coupled with a focus on identifying high precision SNPs, did result in a high conversion rate of probes (92.3%) on the 2M test array. As more WGS data are generated in more horses allowing for further comparisons between SNPs genotyped using different technologies, and SNPs discovered here can be further validated, the ideal SNP training set for variant recalibration will become more formalized.

Genotyping success in blood versus hair root DNA
DNA samples genotyped on the MNEc2M array were primarily derived from blood, but also came from hair roots. Both DNA sources had failed samples, however, a substantially higher fraction of hair root samples produced poor genotyping rates. Hair root DNA also tended to have lower DNA concentrations when samples were re-hydrated. To maximize information, we chose to genotype all submitted samples, even though samples from hair roots did not meet minimal DNA quantity guidelines specified by Affymetrix, though samples were dropped if they did not meet genotyping quality control thresholds.
To determine if sample origin (blood versus hair roots), DNA quantity, or both, were associated with failure to pass genotyping quality control metrics, a logistic regression for sample success on DNA concentration (Pico Green) and blood/hair status was performed (Supp. Figure 5). It is clear that higher DNA concentrations increased the probability of genotyping success. However, while blood samples were more likely to produce passing samples regardless of DNA concentration, at adequate concentrations, hair root samples were also highly likely to produce passing genotypes.
We also noted variation in genotyping quality of specific SNPs based on tissue origin. During array design, SNPs were not disqualified that potentially performed better in blood versus hair root, e.g.
'PolyHighResolution' in one and not the other (Table S4). While it is difficult to determine if certain probes perform better using DNA isolated from one tissue versus the other, in the samples tested here, DNA isolated from blood was preferred over hair root when DNA concentration is low or questionable.

Setup for precision imputation
Although the legacy equine arrays were not designed for imputation, several previous studies have demonstrated their utility in imputing genotypes, which, if performed on a larger scale across many breeds, could greatly improve the chances of success in horse genome mapping studies. A study in Standardbreds, Quarter Horses and Thoroughbreds showed high fidelity imputation from legacy arrays (54k and 65k) to a higher marker density (74k) using a reference population of 248 horses [3]. Another study found that genotype imputation in Thoroughbreds was feasible from a very low density (1-3K markers) to a legacy SNP set (70K), although it was impacted by the minor allele frequency of the SNPs being imputed, as well as potentially complex LD structure [43].
We masked variants genotyped on the MNEc2M array down to the MNEc670k SNP set in a test cohort of 485 horses and found a high overall imputation accuracy (96-99%) up to the 2M density, across several different breeds of horses (See Table 3). While our imputation scenarios focused on imputing to the MNEc2M SNP set, imputation to higher SNP densities (>2M) is feasible. A recent study using 44 Franches-Montagnes and Warmbloods imputed SNPs from the legacy SNP50 genotypes to nearly 13M variants with 95% accuracy [44].
As sample size continues to plague the characterization of complex equine phenotypes, despite falling genotyping costs, the development of this affordable, backwards compatible, 670K imputation array will allow the inclusion of already genotyped individuals in higher-powered analysis, providing an economical trade-off for studies that must choose between more markers or more samples. We expect that imputation will complement future GWAS studies and mapping studies designed based on haplotype structure. In the future, improvements in imputation software as well as development of equine-specific imputation protocols will further increase accuracy of genotype imputation. Furthermore, as additional WGS are integrated into reference populations, imputation performance will continue to improve.

Conclusions
Here we report, through a community effort, the leveraging of WGS data from 153 individuals with a 13X median coverage to achieve new variant discovery in the domestic horse. The use of WGS enabled us to generate orders of magnitude more data than previous technologies allowed. Empirical SNP properties were used to produce composite scores for each SNP based on machine learning approaches, allowing better detection of false positive variant calls that could not be achieved using generic QUAL scores alone. Thus, from a starting set of over 23 million SNPs, we identified a set of ~5 million SNPs which were considered for array design. With probe design recommendations from Affymetrix, we further filtered this list to 2M SNPs (MNEc2M), which adequately represented the breeds used in variant discovery, were evenly spaced across the genome, and had the highest chance of conversion in the array implementation.
Using a test cohort of 332 horses, we used the MNEc2M SNP set to identify haplotype tagging SNPs both within and across breeds. Choices made in 2M array design, particularly with regard to variant filtering, resulted in greater than 92% of the SNPs on the 2M array (MNEc2M) returning high-quality genotypes.
Filtering further, we designed the MNEc670k genotyping array to contain SNPs which allowed for accurate imputation. Together these genotyping platforms represent the next generation of genomic array technologies for the domestic horse.

Whole-genome sequencing and mapping pipeline
Paired-end Illumina Hi-Seq 100 base pair whole genome sequences were generated for 153 horses representing 24 distinct breeds (Table S1). Raw reads for each individual were mapped using the PALEOMIX pipeline [26] and aligned to an extended EquCab version 2.0 genome (see below).
Specifically, reads for each individual were processed separately by sequencing lane and filtered for quality control using the program Adapter Removal [45], which removed PCR adapters, trimmed low quality basepairs, and collapsed overlapping paired-end reads into a single high-quality read, and the resulting reads were filtered by length. Passing reads were mapped to each reference file using BWA [46]. Paired-end reads for which the mate was filtered were mapped in single-end mode. PCR duplicates were detected and removed, the resulting bam-files were merged, and reads were realigned around detected indels.

Extended EquCab 2.0 reference genome
An extended version of the EquCab2 reference genome was used which included the 31 autosomal chromosomes, ECAX, and equine chromosome unknown 1 (ChrUn1) [1], together with an additional 7,850 contigs designated as ChrUn2 generated from unmapped Twilight genomic DNA reads which were de novo assembled using the Velvet assembler [47]. These additional contigs were required to meet any of the following criteria: 1) longer than 1,000 base pairs with no BLAST alignment (bit score > 99) to the human, canine, or bovine genomes; 2) longer than 1,000 base pair and either a single BLAST alignment to a human, canine, or bovine chromosome or multiple alignments that mapped to a single chromosome; 3) between 500 and 999 base pairs and a BLAST alignment to a single human, canine, or bovine chromosome with a bit score >499; 4) between 500 and 999 bp with alignment to a single chromosome where the total coverage of the aligned region included more than 80% of the coding length of an existing human, canine, or bovine annotated protein-coding gene.

PALEOMIX vcf pipeline implementation and python source code
Programs within PALEOMIX are abstracted as nodes within the program to allow files to be generated within a temporary directory and only to be moved to the final directory upon successful completion. In addition to nodes for the read alignment mapping programs, additional nodes were created to run variant calling programs implemented by GATK [27] and SAMtools [28]. Additional PALEOMIX nodes were created to perform variant recalibration as well as to assess precision versus recall for Morgan and Standardbred breed groups ( Fig. 1 and Fig. 2). Source code for the extended PALEOMIX nodes are available at https://github.com/schae234/pypeline.

Variant calling and validation PALEOMIX nodes
The PALEOMIX computational framework was further extended to process alignments and produce variant calls. Individuals were split into 16 different calling groups based on both breed and sequencing depth in order to minimize biases due to coverage and population stratification. Variant call files (.vcf) were produced for each group using both GATK Unified Genotyper with" -stand_call_conf" set to 30, "-stand_emit_conf" set to 10, and "-dcov" set to 200. Variants called by GATK were retained if they were independently called by SAMtools "mpileup" with default calling parameters. This allowed possible false negatives to be assessed downstream using other quality metrics. A total of ~23M variants called independently by both GATK and SAMtools were retained for further analysis.

Variant recalibration (VQSLOD scores)
Both GATK UnifiedGenotyper as well as SAMtools assign generic quality scores (QUAL) to each discovered variant, which is the posterior probability that a true variant exists given the pileup of reads at a given locus using base pair quality and expected allelic distribution of samples. Using GATK, additional metrics were generated for each variant including coverage (DP), quality of depth (QD), Fisher strand bias (FS), mapping quality rank sum (MQRankSum) and read position rank sum (ReadPosRankSum).
Definitions from the GATK manual for each metric are below. These parameters were used to train linear mixed models using GATK VariantRecalibrator which produces a composite variant quality metric called a VQSLOD score which can be summarized as using the formula VQSLOD ~ DP + QD + FS + MQRankSum + ReadPosRankSum. We defined three "gold standard" datasets: 1) SNPs previously genotyped on the 54K SNP chip, 2) variants called in four or more calling 11.21-11.99X, mean=11.63X). These "gold standard" training SNPs were used to assess the precision of different quality metrics of de novo variant calling.

Preliminary 5M high-density SNP selection
A preliminary list of approximately 5 million SNPs was prepared for probe design (conversion) recommendation by Affymetrix. VQSLOD scores were generated for all ~23 million initial bi-allelic candidate SNPs. SNPs were filtered out if they 1) fell within repetitive regions designated by RepeatMasker 3.3.0 [48] (780 SNPs); 2) fell within 20bp of another variant (9,342,733 SNPs); and 3) had fewer than 2 observed instances of minor allele (4,858,273). Of the remaining 11,435,936 SNPS, further filtering was applied to obtain equal coverage throughout the genome. Pairwise LD was calculated between SNPs within 10kb windows. If SNPs within a window had an r 2 value of over 0.90, priority was given to SNPs that were in called in Draft or Pony groups. If a SNP was in an equal number of priority groups, VQSLOD scores were used to break ties. After all filtering criteria were applied 5,443,950 SNPs remained.

High-density 2M SNP selection
From the ~5 million candidate SNPs, Affymetrix provided four classes of recommendations based on the probability the SNP would be convertible through probe design ('recommended', neutral, 'nonrecommended', 'not possible'). A recommended SNP has: a probability of conversion > 0.60, no interfering polymorphisms within 24 bases, and unique flanking sequence. Non-recommended SNPs have: either duplicate flanking sequence, a probability of conversion < 0.40, interfering polymorphisms within 21 bases, or more than 2 interfering polymorphisms within 24 bases. A 'not possible' designation is given to probes which probes cannot be created, and neutral recommendations cover all other cases. Furthermore, SNPs with alleles of A/T or C/G required two probes to differentiate between allele states and were tagged as allele-specific SNPs. Recommendations were generated for both forward and reverse strands based on the above criteria. Groups of SNPs were ranked within 50kb windows and probe design criteria were chosen using a greedy algorithm until 37 SNPs were chosen per window using the following criteria: 1) VIP SNPs previously designed on 54/65K chip (regardless of recommendation or strand), 2) SNPs for known Mendelian traits such as coat color (regardless of recommendation or strand), 3) SNPs designated as 'recommended' by Affymetrix and designable with one probe, 4) SNPs with a 'neutral' recommendation from Affymetrix designable with one probe, 5) any SNP within the equine MHC region, 6) SNPs requiring multiple probes. If a SNP was equally designable in forward and reverse strand, the forward strand was arbitrarily chosen. Using these criteria, 2,001,826 SNPs were chosen for array design.

2M SNP test array and sample quality control
Using probes designed according to the above criteria, 347 horses from 20 breeds were genotyped on the 2M array using DNA isolated from blood using the Gentra PureGene Blood kit. DNA from hair roots was isolated using the Gentra Puregene Blood Kit with a modified version of the Gentra Puregene Mouse Tail purification protocol. The amount of Proteinase K was increased to 20 µL, isopropanol to 650 µL, and ethanol to 500 µL. The DNA hydration solution was reduced to 20-30 µL depending on size of DNA pellet.
Quality control metrics were calculated on arrays grouped by tissue type (blood vs hair) as well as array batches, and arrays were dropped for various reasons at multiple QC steps. "DishQC" evaluates call rates between A/T probes and C/G probes of non-polymorphic sites to assess background probe contrast and was calculated using Affymetrix Power Tools. Samples with DQC scores below 0.60 were removed from further evaluation. Passing arrays were then genotyped using approximately 20,000 high-confidence probes  Table S5). FastTagger was run with parameters to differentiate tag-SNPs within each population at a resolution of 0.10 minor allele frequency and 0.90 r 2 . Representative SNPs from each population specific set of tag SNPs were collapsed using the program `Multi-Pop-TagSelect` which assesses overlap between sets of tag SNPs and identifies the subset of SNPs which near-minimally spans the set of populations [33]. Intra-breed specific SNPs were kept as long as they were tagged haplotypes in five or more breed groups.
Quarter Horses, Ponies, Morgans, and Standardbreds needed a much higher number of intra-breed specific SNPs to tag haplotypes, indicating they were the most diverse breeds. Additional tag SNPs (n=13,993) were included in the final commercial array if they tagged haplotypes in three or more of these diverse breed groups.  Table 5 for values).

MNEc2M Breed Specific Alternate Allele Frequency
Alternate allele frequencies from variants present o n the MNEc2M chip were split by breed group. Samples (WGS+SNP) were split into 15 tagging breed groups (See Table S5). Breed groups with asterisk (*) indicate a combination of studbook breeds . Boxplots show median (red line) and mean (red box) of the alternate allele frequency distribution.  Tables   Table 1 SNP Sets used at various steps in array design Variants discovered from whole genome sequencing were filtered at various steps for quality control or using array design criteria. Six distinct sets of variants ranging from the initial ~23M high-quality, variants discovered from WGS to the 670k variants available in the commercial genotyping array are described throughout this manuscript.    Table S5).

Supp. File 1 Position of variants discovered from WGS data
File contains chromosome, bp position, distance from previous discovered SNP and alternate allele frequency for the 23 million SNP set.

Supp. File 2 MNEc670k SNP information
Contains information on MNEc670k SNPs. Includes Affymetrix probe ID, genetic coordinates, MNEc Identifier, tissue origin indicating high quality genotyping in either hair root or blood, Affymetrix assigned genotyping cluster resolution, and inclusion criteria: Intra -SNP tags intra-breed haplotype Inter -SNP tags inter-breed haplotype

MHC -SNP within Equine MHC region
Diverse -SNP tags haplotype in 3 or more of diverse breeds that need many tagging SNPs (Quarter Horse, Pony, Morgan, Standardbred; See Table S4 ) Density -SNP included to target at least 8 SNPs per 50kb genomic window (targeting uniform genomic coverage)

Supp. File 3 Breed specific tagging SNPs
File containing informative tagging SNPs broken down by breed.

Supplementary Figures
Supp. Fig. 1 Alternate allele frequency for 23M and 10M SNP sets.
Histograms show the alternative allele frequency for the ~23 million SNPs discovered by WGS and the ~10 million SNPs that were compatible with array design (see Table 1). A high number of SNPs were observed at 100% alternate allele frequency in a sample cohort that included deep sequencing (see Table S1) of Twilight (the horse used for the reference genome) indicating likely Sanger sequence errors.
Distance between SNPs on the MNEc2M array were calculated by breed group (See Table S5) at various minor allele frequency cutoffs (0, 0.01, 0.03, 0.05, 0.10). Breed groups with asterisk (*) indicate a combination of studbook breeds.

Supp. Fig. 3 MNEc670k Inter-SNP distance and informativeness by breed
Distance between SNPs on the MNEc670k array were calculated by breed group (Table S5)

MNEc670k Breed Specific Alternate Allele Frequency
Alternate allele frequencies from variants present on the MNEc670k chip were split by breed group. Samples (WGS+SNP) were split into 15 taggin g breed groups (See Table S5). Breed groups with asterisk (*) indicate a combination of studbook breeds .
Supp. Figure 5 Logistic regression of DNA Sample success A logistic regression was fit, predicting the probability of a sample passing quality control using DNA concentration (Pico Green) and sample source as independent variables.