Suitability of Single-Nucleotide Polymorphism Arrays Versus Genotyping-By-Sequencing for Genebank Genomics in Wheat

Genebank genomics promises to unlock valuable diversity for plant breeding, but one key question must be answered first: which marker system is most suitable for fingerprinting entire genebank collections? Using wheat as a model species, we tested for the presence of an ascertainment bias and investigated its impact on estimates of genetic diversity and prediction ability obtained using three marker platforms: simple sequence repeat (SSR), genotyping-by-sequencing (GBS), and array-based SNP markers. We used a panel of 378 winter wheat genotypes, comprising 190 elite lines and 188 plant genetic resources (PGR), which were phenotyped in multi-environment trials for grain yield and plant height. We observed an ascertainment bias for the array-based SNP markers, which led to an underestimation of the molecular diversity within the population of PGR. In contrast, the marker system played only a minor role for the overall picture of the population structure and the precision of genome-wide predictions. Interestingly, we found that rare markers contributed substantially to the prediction ability. Combined with the expectation that valuable novel diversity is most likely rare, this suggests that markers with low minor allele frequency deserve careful consideration in the design of a pre-breeding program.


Minor Allele Frequency (MAF)
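For the SNP array and GBS datasets, the marker values were coded as 0, 1, 2, and NA (missing). If n genotypes are scored at a locus, the frequency of the counted allele is
$$AF = \frac{2 n_2 + n_1}{2n},$$
in which $n_2$, $n_1$, and $n_0$ are the numbers of genotypes with the values "2", "1", and "0", respectively, so that $n = n_2 + n_1 + n_0$. Equivalently,
$$AF = \frac{1}{2n} \sum_{k=1}^{n} x_k,$$
in which k indexes the genotypes and $x_k$ refers to the marker value of the kth genotype. The minor allele frequency is then
$$MAF = \min(AF,\ 1 - AF).$$
For each SSR allele i, the allele frequency $AF_i$ is its frequency of occurrence in the population. Because a minor allele frequency cannot be higher than 0.5, we set
$$MAF_i = \min(AF_i,\ 1 - AF_i).$$
We used the average of $MAF_i$ over the alleles of one SSR marker as the MAF of that marker.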
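The MAF calculation can be illustrated with a minimal Python sketch. It assumes the standard allele-count formula for 0/1/2-coded markers, with NA represented as `None` and excluded from n, and it assumes that the per-allele SSR value is min(AF_i, 1 - AF_i). The function names are hypothetical and are not taken from the original analysis.

```python
from typing import Optional, Sequence


def allele_frequency(values: Sequence[Optional[int]]) -> float:
    """Frequency of the counted allele at one biallelic locus.

    Marker values are coded 0, 1, 2; None stands for NA and is skipped.
    """
    scored = [v for v in values if v is not None]
    if not scored:
        raise ValueError("locus has no scored genotypes")
    # Sum of marker values equals 2*n2 + n1; each genotype carries 2 alleles.
    return sum(scored) / (2 * len(scored))


def minor_allele_frequency(values: Sequence[Optional[int]]) -> float:
    """MAF of a biallelic marker: the smaller of AF and 1 - AF."""
    af = allele_frequency(values)
    return min(af, 1.0 - af)


def ssr_marker_maf(allele_freqs: Sequence[float]) -> float:
    """Marker-level MAF of one SSR marker (assumed definition).

    Averages min(AF_i, 1 - AF_i) over all alleles of the marker.
    """
    if not allele_freqs:
        raise ValueError("marker has no alleles")
    per_allele = [min(f, 1.0 - f) for f in allele_freqs]
    return sum(per_allele) / len(per_allele)
```

For example, `minor_allele_frequency([2, 2, 1, 0, None, 2])` uses five scored genotypes with an allele count of 7, giving AF = 0.7 and a MAF of about 0.3.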

Population Heterozygosity (H)
Once the frequency of each allele has been determined, H can be calculated to measure the variation in gene frequencies. H is applicable to any population regardless of the number of alleles at a locus or the pattern of evolutionary forces, and to any organism regardless of its mode of reproduction or ploidy level (Nei, 1973). H is calculated as follows (Nei, 1973; Nagy et al., 2012):
$$H = 1 - \sum_{i=1}^{a} AF_i^2,$$
where $AF_i$ is the frequency of the ith allele and $a$ corresponds to the total number of alleles. The biallelic nature ($AF_1 + AF_2 = 1$) of SNP markers (SNP array and GBS datasets) reduces H to:
$$H = 1 - AF_1^2 - AF_2^2 = 2\,AF_1 (1 - AF_1).$$
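As a minimal sketch, H can be computed from a list of allele frequencies, assuming the usual definition of Nei (1973) in which H is one minus the sum of squared allele frequencies. The function name and the tolerance on the frequency sum are illustrative choices.

```python
from typing import Sequence


def heterozygosity(allele_freqs: Sequence[float]) -> float:
    """Expected heterozygosity at one locus: 1 - sum of squared frequencies."""
    if abs(sum(allele_freqs) - 1.0) > 1e-8:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(f * f for f in allele_freqs)
```

For a biallelic marker this agrees with 2 * AF_1 * (1 - AF_1): frequencies of 0.5 and 0.5 give H = 0.5, the maximum for two alleles.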

Polymorphism Information Content (PIC)
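For each marker locus, PIC is defined as (Botstein et al., 1980; Nagy et al., 2012):
$$PIC = 1 - \sum_{i=1}^{a} AF_i^2 - \sum_{i=1}^{a-1} \sum_{j=i+1}^{a} 2\,AF_i^2\,AF_j^2,$$
where $AF_i$ is the frequency of the ith allele and $a$ is the total number of alleles at the locus. PIC can therefore be regarded as the heterozygosity corrected for partially informative matings (Hildebrand et al., 1992).
For both the SNP array and GBS datasets, with $AF_1 + AF_2 = 1$:
$$PIC = 1 - AF_1^2 - AF_2^2 - 2\,AF_1^2\,AF_2^2 = 2\,AF_1 AF_2\,(1 - AF_1 AF_2).$$
The maximum PIC for biallelic markers is 0.375, reached at $AF_1 = AF_2 = 0.5$. Markers with more alleles can be more informative, and their PIC can correspondingly be higher.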
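The biallelic maximum of 0.375 can be checked with a short Python sketch, assuming the PIC definition of Botstein et al. (1980): one minus the sum of squared allele frequencies, minus the sum over allele pairs of 2 * AF_i^2 * AF_j^2. The function name is hypothetical.

```python
from itertools import combinations
from typing import Sequence


def polymorphism_information_content(allele_freqs: Sequence[float]) -> float:
    """PIC at one locus, following the Botstein et al. (1980) definition."""
    if abs(sum(allele_freqs) - 1.0) > 1e-8:
        raise ValueError("allele frequencies must sum to 1")
    squares = [f * f for f in allele_freqs]
    # Expected heterozygosity part.
    h = 1.0 - sum(squares)
    # Correction summed over all unordered pairs of distinct alleles.
    correction = sum(2.0 * si * sj for si, sj in combinations(squares, 2))
    return h - correction
```

With two equally frequent alleles the function returns 0.375, while four equally frequent alleles give 0.703125, illustrating why multi-allelic markers such as SSRs can reach higher PIC values.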

Rogers' Distance (RD)
Rogers' distance (RD) is used as an index of the genetic distance between two genotypes, based on the alleles within each marker system:
$$RD = \frac{1}{m} \sum_{i=1}^{m} \sqrt{\frac{1}{2} \sum_{j=1}^{n_i} (p_{ij} - q_{ij})^2},$$
with m referring to the number of loci, $n_i$ being the number of alleles at the ith locus, and $p_{ij}$ and $q_{ij}$ corresponding to the frequencies of the jth allele at the ith locus in the two genotypes. RD is a measure of gene diversity. We also estimated the genetic similarity as 1 - RD.
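A minimal Python sketch of the pairwise distance follows, assuming the classical form of Rogers' distance: the average over loci of the square root of half the summed squared differences in allele frequencies. Each genotype is represented as a list of per-locus allele-frequency lists; this data layout and the function name are illustrative assumptions.

```python
import math
from typing import Sequence


def rogers_distance(p: Sequence[Sequence[float]],
                    q: Sequence[Sequence[float]]) -> float:
    """Rogers' distance between two genotypes.

    p[i][j] and q[i][j] are the frequencies of allele j at locus i
    in the first and second genotype, respectively.
    """
    if len(p) != len(q) or not p:
        raise ValueError("both genotypes need the same, non-zero number of loci")
    total = 0.0
    for p_locus, q_locus in zip(p, q):
        if len(p_locus) != len(q_locus):
            raise ValueError("allele lists differ in length at a locus")
        squared = sum((a - b) ** 2 for a, b in zip(p_locus, q_locus))
        total += math.sqrt(0.5 * squared)
    return total / len(p)
```

Two homozygous lines that carry different alleles at every locus have RD = 1, identical lines have RD = 0, and the genetic similarity is obtained as `1 - rogers_distance(p, q)`.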

Single-Kernel model
The general form of the single-kernel model is the following:
$$y = 1_n \mu + Z_A \alpha + e, \tag{10}$$
in which y is the vector of phenotypic values, μ is the overall mean, α corresponds to the additive effects, e is the vector of residuals, n denotes the number of lines, p indicates the number of markers, $Z_A$ represents the design matrix with the dimension (n × p) for the additive effects of the SNP array, GBS, or SSR markers, each fitted individually, and $1_n$ is the vector of length n containing only ones.
The additive model (10) is equivalent to a GBLUP model:
$$y = 1_n \mu + g + e,$$
where $g \sim N(0, A\sigma_G^2)$ and $e \sim N(0, I\sigma_e^2)$, and A is the numerator relationship matrix, calculated from the marker data according to VanRaden (2008). This A matrix is used as the kernel matrix (K) for the single-kernel method in the R package BGLR.
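The construction of the kernel matrix can be sketched in Python. The text cites VanRaden (2008) without stating which variant was applied, so this sketch assumes the first method described there: markers coded 0/1/2 are centered by twice the allele frequency, and the cross-product is scaled by 2 Σ p_j (1 - p_j). The function name is hypothetical, and missing values are assumed to have been imputed beforehand.

```python
from typing import List, Sequence


def vanraden_matrix(markers: Sequence[Sequence[float]]) -> List[List[float]]:
    """Relationship matrix from an n x p marker matrix coded 0/1/2.

    Assumes VanRaden's (2008) first method and no missing values.
    """
    n = len(markers)
    p = len(markers[0])
    # Allele frequency of the counted allele at each marker.
    freqs = [sum(row[j] for row in markers) / (2 * n) for j in range(p)]
    scale = 2.0 * sum(f * (1.0 - f) for f in freqs)
    if scale == 0.0:
        raise ValueError("all markers are monomorphic")
    # Center each marker column by twice its allele frequency.
    centered = [[row[j] - 2.0 * freqs[j] for j in range(p)] for row in markers]
    # Scaled cross-product of the centered marker matrix with itself.
    return [[sum(a * b for a, b in zip(ci, ck)) / scale for ck in centered]
            for ci in centered]
```

The result is a symmetric n × n matrix. In a multi-platform analysis, one such matrix would be built per marker system and supplied as a separate kernel.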

Multi-Kernel model
To test whether the different marker systems are complementary to each other, we also used a model combining two or three marker datasets. In general, the multi-kernel model uses all three marker datasets:
$$y = 1_n \mu + Z_{SNP}\,\alpha + Z_{GBS}\,\beta + Z_{SSR}\,\gamma + e.$$
Here α, β, and γ are the additive effects for the SNP array, GBS, and SSR markers, respectively. $Z_{SNP}$, $Z_{GBS}$, and $Z_{SSR}$ are the design matrices with the dimension (n × p) for the additive effects of each marker dataset. The row dimension of $Z_{SNP}$, $Z_{GBS}$, and $Z_{SSR}$ is n, the number of genotypes. Their column dimensions are the numbers of markers (p) of the respective marker datasets.
The equivalent GBLUP model of the multi-kernel model (11) is:
$$y = 1_n \mu + g_{SNP} + g_{GBS} + g_{SSR} + e,$$
where $g_{SNP} \sim N(0, A_{SNP}\,\sigma_{G_{SNP}}^2)$, $g_{GBS} \sim N(0, A_{GBS}\,\sigma_{G_{GBS}}^2)$, $g_{SSR} \sim N(0, A_{SSR}\,\sigma_{G_{SSR}}^2)$, and $e \sim N(0, I\sigma_e^2)$, and $A_{SNP}$, $A_{GBS}$, and $A_{SSR}$ are the numerator relationship matrices, also calculated according to VanRaden (2008). These numerator relationship matrices are used as kernel matrices for the multi-kernel methods in the R package BGLR.
We also tested all three possible combinations of two marker datasets: SNP array plus GBS markers, SNP array plus SSR markers, and SSR plus GBS markers. For this purpose, we simply removed the unused marker effect from the model (13).

Linkage disequilibrium
Linkage disequilibrium (LD) in the form of the squared Pearson coefficient of correlation ($r^2$) was used as a measure of non-random association between two different loci (Hill and Robertson, 1968).
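As a minimal sketch, the LD measure between two loci can be computed as the squared Pearson correlation of their marker values across genotypes. The sketch assumes 0/1/2-coded markers without missing values; the function name is hypothetical.

```python
from typing import Sequence


def ld_r2(x: Sequence[float], y: Sequence[float]) -> float:
    """Squared Pearson correlation between the marker values of two loci."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("both loci need the same number (>= 2) of genotypes")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("r2 is undefined for a monomorphic locus")
    return (sxy * sxy) / (sxx * syy)
```

Two loci with identical or exactly opposite marker values across all genotypes give $r^2$ = 1, and loci whose values vary independently in the sample give $r^2$ = 0.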

Supplementary Figure 2.
Distributions of polymorphism information content (PIC) (x-axis) for SNP array (SNP), genotyping-by-sequencing (GBS) and SSR markers. Results are shown for the total population (All), the elite lines (Elite), and the plant genetic resources (PGR).