Dissecting the Practical Soybean Breeding Pipeline by Developing the High-Throughput Functional Array ZDX1

Microarray technology facilitates rapid, accurate, and economical genotyping. Here, using resequencing data from 2,214 representative soybean accessions, we developed the ZDX1 high-throughput functional soybean array, containing 158,959 SNPs, covering 90.92% of soybean genes and sites related to agronomically important traits. We genotyped 817 soybean accessions using ZDX1, including parental lines, non-parental lines, and progeny from a practical breeding pipeline. The non-parental lines had the highest genetic diversity, and 235 SNPs were identified as fixed in the progeny. Previously unknown soybean cyst nematode-resistant and early-maturity accessions were identified using allele combinations. Notably, we found that the breeding index was a good indicator for progeny selection: superior progeny were derived from crosses between more distantly related parents, with at least one parent having a higher breeding index. Based on this rule, two varieties were directionally developed. Meanwhile, redundant parents were screened out and potential combinations were formulated. GBLUP analysis showed that markers in genic regions tended to give higher accuracy in predicting four agronomic traits than either whole-genome or intergenic markers. We then used progeny to expand the training population, which increased the prediction accuracy of breeding selection by 32.1%. Collectively, our work provides a versatile array for highly accurate evaluation, selection, and prediction of both parents and progeny that can greatly accelerate soybean breeding.


Introduction
The goal of crop breeding is to develop plant varieties with ideal traits, such as higher yield, improved quality, and enhanced environmental adaptability (Liu et al. 2020a). Although commercially produced soybean [Glycine max (L.) Merr.] yield has increased, due in part to the breeding of new varieties (Rincker et al. 2014), the yield increase per unit area has not changed significantly for the past few decades (Liu et al. 2020a). This shows that reliance on traditional phenotyping methods to develop new varieties has limitations (Barabaschi et al. 2016). Innovative genotyping platforms can accelerate the identification, evaluation, and use of elite germplasm. In particular, SNP arrays provide a high allele detection rate (Rasheed et al. 2017) and enable rapid, low-cost, high-throughput genotyping, and several soybean arrays have been successively developed. For example, the 50K soybean array was used to genotype 96 elite, landrace, and wild accessions and subsequently identify candidate genomic regions shaped by domestication or recent selection (Song et al. 2013). Similarly, this array was used to identify protein- and oil-related loci via GWAS analysis of 298 strains (Hwang et al. 2014). The 180K soybean array was used to show that morphologically intermediate soybeans are natural hybrids between cultivated and wild soybean (Lee et al. 2015), while GWAS based on the 355K array identified a candidate interval on chromosome 20 that affected grain weight. These related studies have increased our understanding of soybean genetics and have laid the foundation for the application of SNP arrays in breeding programs.
One of the key challenges facing plant breeders is the selection of suitable parents for generating sufficiently rich genetic variation to allow a maximal selection response during the breeding cycle in self-pollinating crops (Ji et al. 2018). To meet this challenge, new and more effective breeding strategies that combine phenotypic data with high-throughput genotyping have been developed to better identify prospective germplasm and to evaluate progeny (Varshney et al. 2014). Soybeans of different types (Pandey et al. 2017) and from different sources (Marrano et al. 2019) can be distinguished using microarrays in order to analyze the genetic relationships between different materials, and consequently provide a basis for determining the most suitable parents. Linking genotypic information with agronomically valuable traits that are not easily scored (Rasheed et al. 2017) can also help in the early evaluation of parents and the identification of desirable progeny. With the development of microarrays, genome-wide selection (GS) based on a large number of markers can be more informative and robust in selecting for complex traits controlled by multiple genes, such as yield, seed quality, and disease resistance, accelerating the translation of genotypic data to phenotypic selection in the field.
However, there are relatively few reports describing the integration of high-throughput sequencing with the main breeding process.
Currently, there is an urgent need for a functional SNP array that covers the entire soybean genome and also contains representative and agronomically important sites to facilitate genetic research and molecular breeding.
Here, we screened representative SNPs from a wide range of germplasm resources to develop the 'Zhongdouxin No. 1' (ZDX1) functional array. Using a breeding population of 817 accessions, including parental lines, non-parental lines, and progeny, we demonstrate the use of this array for improving steps in breeding, including screening for new genetic resources, population diversity analysis, optimizing hybrid combinations, and progeny selection. The ZDX1 array described in this work, together with the associated breeding selection strategies, can accelerate all steps in the breeding process.

Materials And Methods
SNP detection, filtering, and selection for array development
Using resequencing data from 2,214 soybean accessions as the basic information and based on the Illumina platform, we obtained the VCF file by alignment to the reference genome Wm82.a2.v1 (Gmax_275_v2.0) and obtained 11,048,862 initial polymorphic SNP sites, including commercialized array sites, important gene sites, QTL and GWAS sites, and important trait functional sites. After removal of sites with a missing rate of >0.1 and a degree of heterozygosity >15%, 9,092,282 sites were retained. We then screened 2,379,054 sites according to the rules of 'no interference SNP sites in 35 bp around each site' and 'retaining sites with MAF≥0.01'. We deleted the non-polymorphic sites and the sites with interference within 50 bp on either side, keeping the tiling order=1 sites, and 2,039,377 sites remained. We deleted the sites with errors, tested 41 sliding window gradients for site screening, and selected the 4,800 bp window. The principle for site selection was "priority + Illumina score ≥0.4 + non-AT/GC selection site (if there is no non-AT/GC site, then select the site with higher priority)", with the priorities defined as follows:
I. Excellent QTL sites, GWAS sites, important genes (VIP), selective genes, common genes, terminator/alternative splicing/non-synonymous mutation sites.
II. Interspecies and intraspecific subgroup unique sites.
III. Selection interval (domestication) sites.
IV. Whole genome coverage sites.
V. Gap-filling sites.
After genotyping and clustering with GenomeStudio software (GenomeStudio 2008), and testing and adjusting the typing signal which was >3, we finally obtained 158,959 SNP sites for ZDX1.
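The first filtering step (missing rate, heterozygosity, and MAF thresholds) can be illustrated with a minimal sketch. It assumes a genotype matrix coded 0/1/2 with NaN for missing calls; the function name `filter_snps` and the coding scheme are illustrative assumptions, not the pipeline actually used for ZDX1, and the flanking-sequence and window-based selection rules are not covered.

```python
import numpy as np

def filter_snps(genotypes, max_missing=0.10, max_het=0.15, min_maf=0.01):
    """Return a boolean mask of SNPs that pass all three thresholds.

    genotypes: (n_samples, n_snps) array coded 0/1/2 (copies of the
    alternate allele), with np.nan marking missing calls (assumed coding).
    """
    g = np.asarray(genotypes, dtype=float)
    n_called = (~np.isnan(g)).sum(axis=0)
    missing_rate = 1.0 - n_called / g.shape[0]
    safe_n = np.maximum(n_called, 1)             # avoid division by zero
    het_rate = (g == 1).sum(axis=0) / safe_n     # NaN never equals 1
    alt_freq = np.nansum(g, axis=0) / (2.0 * safe_n)
    maf = np.minimum(alt_freq, 1.0 - alt_freq)
    return (missing_rate <= max_missing) & (het_rate <= max_het) & (maf >= min_maf)
```

Applying the mask as `genotypes[:, filter_snps(genotypes)]` keeps only the passing columns.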

Plant materials and phenotypic data collection
The plants used in this study were 817 accessions from the actual breeding population, including 77 parental lines, 169 non-parental lines, and 571 progeny lines. The experiment used a randomized block design with one control line ('Neidou4hao' or 'Keshan1hao') every 20 lines, a row spacing of 0.65 meters, and a row length of 3 meters. Six quantitative traits were investigated. Emergence (VE) was defined as the date of emergence of the cotyledons. Beginning maturity (R7) was defined as the days from emergence to when one pod on the main stem has reached mature pod color (Fehr et al. 1971); for each row, the date was recorded when 50% of the plants met this condition (Qiu 2006). In each plot, 20 consecutive plants were harvested from a section with no missing seedlings, and the seed yield (SY), 100-seed weight (SW), protein content, and oil content were measured as previously described. One qualitative trait, leaf shape, was recorded as either narrow or broad leaflet (Qiu 2006).
between two SNPs was set to 1,000, and the correlation coefficient (r²) of alleles was calculated to measure LD at each group level. The LD decay rate was defined as the chromosomal distance at which the average r² dropped to half its maximum value. The kinship matrix was calculated using the VanRaden method in GAPIT software to obtain the genetic relationships between lines in the population.
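For readers unfamiliar with the VanRaden kinship calculation mentioned above, the following is a minimal sketch of the commonly cited first VanRaden estimator, G = ZZ' / (2 Σ p(1 − p)). It assumes complete 0/1/2 genotypes and is not the GAPIT implementation; the function name is hypothetical.

```python
import numpy as np

def vanraden_kinship(genotypes):
    """Genomic relationship matrix G = Z Z' / (2 * sum(p * (1 - p))).

    genotypes: (n_lines, n_snps) array coded 0/1/2 with no missing
    values (assumption made for this sketch).
    """
    m = np.asarray(genotypes, dtype=float)
    p = m.mean(axis=0) / 2.0          # allele frequency of each SNP
    z = m - 2.0 * p                   # centre each SNP column
    return z @ z.T / (2.0 * np.sum(p * (1.0 - p)))
```

Off-diagonal entries of this matrix can be negative for pairs that share fewer alleles than expected from the population frequencies, which is why kinship values below zero indicate distantly related lines.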
After filtering the parental and non-parental lines for a typing success rate >0.9 and pruning for LD with --indep-pairwise 50 5 0.5 (a. consider a window of 50 SNPs; b. calculate LD between each pair of SNPs in the window; c. remove one of a pair of SNPs if the LD is greater than 0.5; d. shift the window 5 SNPs forward and repeat the procedure), we obtained 8,940 loci. We used PLINK v2.1.1 for PCA and R software to draw the PCA plots.
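The windowed pruning procedure in steps a to d can be sketched as follows. This is a simplified illustration, not PLINK's algorithm: it always drops the later SNP of a correlated pair, and it assumes complete 0/1/2 genotypes with no monomorphic SNPs. The function name `ld_prune` is hypothetical.

```python
import numpy as np

def ld_prune(genotypes, window=50, step=5, r2_max=0.5):
    """Return a boolean mask of SNPs kept after sliding-window LD pruning."""
    g = np.asarray(genotypes, dtype=float)   # (n_samples, n_snps), coded 0/1/2
    n_snps = g.shape[1]
    keep = np.ones(n_snps, dtype=bool)
    for start in range(0, n_snps, step):     # d. shift the window by `step`
        # a. SNPs in the current window that are still retained
        idx = [i for i in range(start, min(start + window, n_snps)) if keep[i]]
        for pos, a in enumerate(idx):
            if not keep[a]:
                continue
            for b in idx[pos + 1:]:
                if not keep[b]:
                    continue
                # b. squared genotype correlation as the LD measure
                r = np.corrcoef(g[:, a], g[:, b])[0, 1]
                if r * r > r2_max:
                    keep[b] = False          # c. drop one SNP of the pair
    return keep
```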

Best linear unbiased estimates and breeding index
Based on the phenotypic data obtained from multi-location field trials over several years, the asreml package in R was used to calculate the best linear unbiased estimates (BLUEs) of the phenotypes for genomic selection (He et al. 2016) and for the breeding index.
We propose a selection index as a metric for breeding, which we named the breeding index (BI). The index is a linear combination of predicted values of comprehensive traits, each having a unique weight, as shown below:
$$I_j = \sum_{k=1}^{5} w_k z_{jk}$$
where $I_j$ is the selection index score for individual $j$, $w_k$ is the economic weight for the $k$th trait for $k = 1, 2, \ldots, 5$, and $z_{jk}$ is the standardized predicted value for trait $k$ from the $j$th individual accession, calculated by subtracting the trait mean and dividing by the SD. We included all five traits in the selection index, in the following order: beginning maturity, 100-seed weight, protein, oil, and seed yield. Among the 246 parents, those in the top 1/3, middle 1/3, and bottom 1/3 of BI values from high to low were designated as high parents, medium parents, and low parents, respectively, with 82 accessions in each group. In addition, 'rate over best-parent' means the proportion of progeny with better performance than that of the 'best' parent.
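As an illustration of the index described above (a weighted sum of standardized trait values followed by a split into thirds), the sketch below may help. The weights passed in are placeholders chosen by the caller, since the economic weights are not given here, and both function names are hypothetical.

```python
import numpy as np

def breeding_index(trait_values, weights):
    """Weighted sum of standardized trait values for each line.

    trait_values: (n_lines, n_traits) predicted values (e.g. BLUEs).
    weights: one weight per trait (hypothetical values; a negative weight
    can be used for a trait where smaller is better).
    """
    x = np.asarray(trait_values, dtype=float)
    z = (x - x.mean(axis=0)) / x.std(axis=0, ddof=1)   # standardize per trait
    return z @ np.asarray(weights, dtype=float)

def tercile_classes(bi):
    """Label lines 'high', 'medium', or 'low' by thirds of the BI ranking."""
    order = np.argsort(-np.asarray(bi, dtype=float))   # highest BI first
    labels = np.empty(len(order), dtype=object)
    for cls, chunk in zip(("high", "medium", "low"), np.array_split(order, 3)):
        labels[chunk] = cls
    return labels
```

With 246 lines, `tercile_classes` yields three groups of 82, matching the grouping of the parents described above.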
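Broad-sense heritability was estimated as
$$H^2 = \frac{\sigma_g^2}{\sigma_g^2 + \sigma_{ge}^2 / e + \sigma_\varepsilon^2 / (e r)}$$
where $\sigma_g^2$ is the variance among soybean lines, $\sigma_{ge}^2$ is the genotype-by-environment interaction variance, $\sigma_\varepsilon^2$ is the residual variance, and $e$ and $r$ are the number of environments and replications within environments, respectively.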
The Pearson correlation coefficient between the predicted and observed phenotypes (rMP) was estimated, and the prediction accuracy (rGS) was calculated by dividing rMP by the square root of the broad-sense heritability (Lehermeier et al. 2013). When comparing the prediction performance of genic-region, intergenic-region, and whole-genome markers, the following strategies were adopted for marker sampling. Among the 69,022 loci  Table 1).
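The accuracy calculation described above can be sketched in a few lines. The heritability function assumes the usual entry-mean form built from the line, genotype-by-environment, and residual variance components together with the numbers of environments and replications; the function and argument names are illustrative.

```python
import numpy as np

def broad_sense_heritability(var_g, var_ge, var_e, n_env, n_rep):
    """Entry-mean heritability (assumed form) from variance components."""
    return var_g / (var_g + var_ge / n_env + var_e / (n_env * n_rep))

def prediction_accuracy(predicted, observed, h2):
    """rGS: Pearson correlation rMP divided by the square root of H^2."""
    r_mp = np.corrcoef(predicted, observed)[0, 1]
    return r_mp / np.sqrt(h2)
```

Because the observed phenotype is itself a noisy measure of genetic value, dividing by the square root of heritability rescales the correlation toward the correlation with the true genetic value.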
We next examined the genomic distribution of the 158,959 high-quality SNPs and found that they were evenly distributed across the 20 soybean chromosomes. The number of SNP sites on each chromosome ranged from 6,085 to 9,314, of which 90.23% fell within 10 kb (6.0 kb average distance) (Supplemental Table 2). In addition, SNP number showed a highly significant positive correlation with chromosome length, with a Pearson's coefficient of 0.98 (p = 8.61E-14) (Fig. 1b). We mapped 64,435 of the candidate SNPs to 50,592 annotated genes, accounting for 90.92% of the total number of predicted genes in the soybean reference genome (Fig. 1c). In addition, another  (Fig. 1d). Collectively, these data showed that the SNPs selected for ZDX1 were evenly distributed across the soybean chromosomes and that the array had high gene coverage and utilization, thus enabling population structure analysis, whole-genome-based selection, and other related studies.
In addition, the ZDX1 array was also designed to retain high-priority loci, including 2,402 SNPs for genes related to important traits and 627 SNPs for genes that underwent domestication or improvement (Supplemental Table 3).
Analysis using SoyBase (https://soybase.org) showed that the candidate SNPs for ZDX1 also included 953 SNPs in QTL intervals, 547 GWAS-identified SNPs, and 110,811 SNPs that differed between ecological groups (Supplemental Table 4). Moreover, 3,869 SNPs from two low-density arrays, the 1.5K BeadChip (Shen et al. 2005) and the BARCSoySNP6K array (Song 2014), were also included (Supplemental Table 5). Compared with the three high-density arrays SoySNP50K, 180K AXIOM®, and NJAU 355K SoySNP, the ZDX1 array contains 134,737 unique sites (Fig. 1e), with a specificity rate as high as 84.8%. In addition, 14 important functional sites (causal SNPs) related to traits such as growth period, resistance to cyst nematodes, leaf type, pod setting habit, seed coat color, seed dormancy, and phosphorus efficiency (Supplemental Table 6) were selected for the array, thus facilitating identification of economically and agronomically valuable traits and screening for elite germplasm.
As a final step in marker selection, we evaluated the accuracy of the marker information. To this end, we first determined the site detection rates in 817 well-established breeding materials (Supplemental Table 7) and found that the detection rate for each sample was between 84.40% and 95.98%, with an average of 95.19%. In addition, three DNA samples were randomly selected and genotyped twice, and the genotype similarity between the replicates was > 99.9% (Supplemental Table 8). Taken together, these results confirmed that the high-density ZDX1 array was both accurate and repeatable.
Analysis of genetic diversity of breeding population and screening of fixed sites in breeding improvement
Subsequently, we applied the ZDX1 array for genotyping in a test population of 817 breeding lines from a soybean breeding program, including 77 parental lines, 169 non-parental lines, and 571 stable progeny lines developed using the pedigree method after crossing. To analyze the genetic diversity of the three subpopulations, we conducted linkage disequilibrium analysis (LD; indicated by r²). The results showed that r² decayed faster in the non-parental lines than in the progeny and parental lines, and the distance at which r² decayed by half was 244 kb, 276 kb, and 303 kb, respectively (Fig. 2a). These results indicated that the genetic diversity among the non-parental lines was higher than that of the parental lines and progeny, and that these lines could therefore help to broaden the genetic diversity of the parental lines. Consistent with the results of the LD analysis, PCA (Fig. 2b) confirmed that the distribution of the non-parental line subgroup was more scattered, further indicating higher genetic diversity.
The MAF results showed that 46,376 sites in the test population were completely fixed, and that 38,625 of these sites (83.29%) differed between groups from geographically separated ecological regions. Furthermore, the percentages of fixed sites in the non-parental lines, parental lines, and progeny were 34.72%, 41.79%, and 34.63%, respectively (Supplemental Table 9, Fig. S3). To further clarify which sites were selected and fixed during the breeding process, we selected 6,579 sites that were polymorphic across the 817 accessions but fixed in the progeny subgroup (MAF = 0). It is worth noting that the minor allele types corresponding to these sites were the same across all three subgroups. Statistical analysis showed that the MAF values of the parental lines ranged between 0 and 0.0390, while the MAF values of the non-parental lines ranged between 0 and 0.1317 (Fig. 2c).
Among them, 235 sites were identified where the MAF values of both the parental and the non-parental lines were > 0.01, of which 109 sites were located in genic regions spanning 95 genes (Supplemental Table 10). Taken together, these results suggested that these apparently informative SNP sites were fixed during the breeding process and can be targeted for selection in future breeding.
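The screening logic for sites fixed during breeding (MAF of zero in the progeny, MAF above 0.01 in both the parental and non-parental lines) can be written as a short sketch. It assumes complete 0/1/2 genotype matrices per subgroup and does not check that the minor allele is the same in each subgroup; the function names are hypothetical.

```python
import numpy as np

def maf(genotypes):
    """Minor allele frequency per SNP for a 0/1/2 genotype matrix."""
    alt = np.asarray(genotypes, dtype=float).mean(axis=0) / 2.0
    return np.minimum(alt, 1.0 - alt)

def fixed_in_progeny(parents, non_parents, progeny, min_maf=0.01):
    """Mask of SNPs fixed in progeny but still variable in both other groups."""
    return (
        (maf(progeny) == 0)
        & (maf(parents) > min_maf)
        & (maf(non_parents) > min_maf)
    )
```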
Germplasm screening for breeding target traits using functional sites in ZDX1 array
To identify elite germplasm using the functionally informative SNP sites, we used fourteen SNP sites from the array to screen the test population, among which five were found to be non-polymorphic. These five marker sites included sites for stem termination (Dt1/Gmt 1-ta, Dt1/Gmt 1-ab) and seed coat color (Gm850). Among the six sites related to maturity, e4-oto and GmGPRR3b/Tof12 were completely fixed, and the MAF values for e1-fs, e1-as, e3-fs, and e4-keshuang were between 0.001 and 0.192. In addition, the MAF values for three sites associated with cyst nematode resistance, rhg1-a/GmSNAP18, Rhg4/GmSHMT08, and GmSNAP11, were between 0.012 and 0.017, while the MAF value for the leaf-shape Ln/ln site was 0.203. These findings suggested that these unfixed sites could be related to the genetic diversity of the phenotype, and potentially control traits that are desirable for breeders (Fig. S4).
Three nematode resistance-associated SNPs, Gm18_1643660, Gm08_8361148, and Gm11_32970174 (located in the rhg1, Rhg4, and SCN3-11 genes, respectively), were also covered by the array (Table 1). We found that the frequency of alleles for enhanced disease resistance in the tested materials was relatively low: 1.22%, 1.71%, and 1.47%, respectively. These three sites occurred in eight allelic combinations among the 817 accessions of the diversity panel, and seven accessions carried all of the resistance loci, including three known resistant varieties, 'Kangxian1hao', 'Kangxian5hao', and 'Kangxian8hao'. In addition to these accessions, three varieties not previously known to carry nematode resistance were also identified, 'Shundou5hao', 'Qinong1hao', and 'Fengdou 23', as well as the progeny 'HJ15-863'. The proportion of resistant progeny was extremely low, potentially due to the difficulty of large-scale phenotypic identification and the lack of directional selection against SCN in the progeny.
Using breeding index and genetic distance to explore the method of parental selection
To illustrate how the ZDX1 array can improve the parental selection process, we next used genotype data to generate a kinship matrix for all accessions, which revealed pairwise kinship values ranging from −0.54 to 2.56 (with larger values indicating closer kinship; see Fig. S6). Analysis of R7 (beginning maturity), SW (100-seed weight), protein content, oil content, and SY (seed yield) in the 298 progeny whose parents were both among the parental lines showed that the rate over best-parent of each trait was negatively, although not significantly, correlated with the genetic relationship between the parents (p = 0.30-0.97), with correlation coefficients (rhd) of −0.02 to −0.42, suggesting that the more distant the parental relationship, the greater the possibility that a higher proportion of progeny would outperform the parental lines.
In addition, the mean value of each trait among progeny was positively correlated with the average parental value, with correlation coefficients (rpo) between 0.33 and 0.73. Among them, the mean oil content and seed weight of the progeny showed an extremely significant (p < 0.01) correlation with the corresponding parental means (Fig. 3), which indicated that the use of elite parents in hybrid combinations allows the selection of elite progeny.
While high yield is the most important goal in soybean breeding, traits such as maturation time and seed quality (i.e., protein and oil contents) should also be comprehensively considered during selection. To this end, we included all five traits in the selection index, defined as the breeding index (BI), with which we scored the parental lines and the 298 progeny whose parents were both among the parental lines. Based on BI values, the parents could be categorized as high, medium, or low types (Supplemental Table 7). The 30 (top 10%) high-performance progeny could then be divided into five types based on the parental BI. Using this system, we identified two high×high types, 11 high×medium types, nine high×low types, three medium×medium types, and five medium×low types. Of these 30 progeny, 73.3% had at least one high-type parent (Fig. 4).
When evaluating new lines with BI, the commonly used control varieties 'Keshan1hao' (BI = 0.53) and 'Neidou4hao' (BI = 0.25) were both rated as "High" type. This standard was also used to select two new varieties, 'Mengdou1137' and 'Mengdou640', that passed national certification. Both varieties were generated from "High×Low" parental combinations, and the average genetic relationship between their parents was relatively distant (−0.15). These results indicated that selecting more distantly related parents, at least one of which has a strong multiple-trait index, is more likely to produce progeny with the highest composite agronomic performance for these traits. This provides a reference for selecting suitable parents in the breeding of self-pollinating crops.
Following identification of candidate parental lines with suitable genetic distance and high-performance phenotypes, we also needed to enable efficient breeding decisions by eliminating redundant germplasm accessions from the diversity panel, which would otherwise result in considerable genetic redundancy in the selected parental subgroup. We counted the parental lines of the bottom 30 progeny (bottom 10%); among these parents, 12, including 'Dengke4hao' and 'Hujiao1120', did not give rise to any excellent progeny (top 10%) (Supplemental Table 12) and can therefore be excluded from future breeding. Meanwhile, using different kinship thresholds ranging from 0.5 to 1.0, we identified the non-parental lines with higher similarity to the parental lines used in crosses (Fig. S7A), and finally eliminated 21 redundant non-parental lines, including 'Mei1' and 'Nenao08-1092', based on kinship scores of > 1.0 (Supplemental Table 13).
After screening out redundant parents, improved combinations were proposed for future breeding. The high-performing progeny lines should also be included in the parent nursery. We therefore selected the accessions with the top 10% of BI values and calculated the number of potential combinations that could be formulated using different kinship thresholds ranging from −0.5 to 0.0 (Fig. S7B). Using a kinship score of < −0.3 as the standard, we selected 46 high-potential combinations for use in future breeding experiments (Supplemental Table 14). By eliminating redundant parents and formulating potential combinations, the parent population structure is optimized and breeding efficiency is improved.
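The combination-selection rule described above (take the accessions with the top 10% of BI values, then keep pairs whose kinship is below a threshold) can be sketched as follows. The function name, argument names, and the rounding used to size the top group are assumptions made for illustration.

```python
from itertools import combinations

def candidate_crosses(names, bi, kinship, top_fraction=0.10, max_kinship=-0.3):
    """List (parent_a, parent_b, kinship) for distantly related top-BI pairs.

    names: accession names; bi: breeding index per accession;
    kinship: square matrix (nested lists or array) indexed like `names`.
    """
    n_top = max(2, round(len(names) * top_fraction))
    ranked = sorted(range(len(names)), key=lambda i: bi[i], reverse=True)[:n_top]
    pairs = [
        (names[i], names[j], kinship[i][j])
        for i, j in combinations(ranked, 2)
        if kinship[i][j] < max_kinship
    ]
    return sorted(pairs, key=lambda p: p[2])   # most distant pairs first
```

Sorting by kinship puts the most distantly related combinations at the top of the list, which is the order in which a breeder would probably review them.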
Different strategies based on ZDX1 array improve the accuracy of genomic selection in theoretical and actual breeding
We next explored the efficiency of different strategies to improve the accuracy of genomic selection using the ZDX1 array. GBLUP (genomic best linear unbiased prediction) analysis based on the ZDX1 array gave prediction accuracies for the five traits of beginning maturity, seed weight, protein content, oil content, and seed yield of 0.79, 0.73, 0.78, 0.77, and 0.69, respectively; these scores were all significantly higher than those of ABLUP (pedigree-based best linear unbiased prediction), which uses pedigree relationships, and HBLUP (combined best linear unbiased prediction), which uses both pedigree relationships and genotype data (p < 0.01) (Fig. 5a, Supplemental Table 15). These results indicated that the genomic information provided by the array reflects the population structure better than pedigree relationships.
We subsequently identified 33,756, 33,733, and 33,761 sites that were respectively selected as marker subsets from genic regions, intergenic regions, and the whole genome. GBLUP analysis showed that these three marker sampling methods did not differ significantly in their accuracy for predicting yield. For the other four traits, the accuracy of prediction using markers in genic regions was 2.33% higher than that of markers in intergenic regions, with highly significant differences among methods for each of the four traits (p < 0.01). Markers in genic regions were also more accurate, by an average of 0.57%, than those sampled from across the whole genome, and were significantly more accurate for predicting 100-seed weight, protein content, and oil content (p < 0.01). Furthermore, using only the 33,756 SNPs in genic regions also significantly improved the predictive accuracy (p < 0.01) for 100-seed weight and protein and oil contents compared with using all 69,022 SNPs (Fig. 5b, Supplemental Table 15). These results showed that the genic-region sites on the array carry more useful genetic information. For most traits, sampling SNP markers from gene-encoding regions can reduce the number of requisite markers while also improving the accuracy of genomic selection.
To evaluate the efficiency of ZDX1 in predicting progeny in actual breeding, we selected the 246 parents as Training Population I, and 283 of the 571 progeny, bred in 2015, as Predicted Population I. We used the ZDX1 array to predict the top 50% of the 283 progeny. The prediction accuracy for the five traits in these high-value lines ranged from 0.30 to 0.45 (Circle1). We then used the 141 selected high-value progeny to expand Training Population I into Training Population II, which was used to predict the 288 progeny bred in 2016 (Predicted Population II) (Fig. 5c). The results showed that, with the exception of yield, the predictive accuracy was improved for these traits, ranging from 0.48 to 0.67 (Circle2), while the average accuracy was significantly increased by 32.1% (p = 0.024) (Fig. 5d, Supplemental Table 15). Collectively, these results demonstrate that the predictive accuracy of breeding decisions based on the ZDX1 array can be improved by establishing a model using the parental lines and continuously expanding the model with high-performing progeny.
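To make the GBLUP prediction step more concrete, here is a minimal sketch of predicting untested lines from a genomic relationship matrix. It is a simplification under stated assumptions: the heritability `h2` is supplied by the caller rather than estimated, the intercept is taken as the training mean, and the function name is hypothetical. It does not reproduce the mixed-model software used in the study.

```python
import numpy as np

def gblup_predict(G, y_train, train_idx, test_idx, h2=0.5):
    """Predict phenotypes of test lines from training lines via GBLUP.

    G: genomic relationship matrix over all lines.
    y_train: phenotypes of the training lines, ordered as train_idx.
    h2: assumed heritability; sets the shrinkage lambda = (1 - h2) / h2.
    """
    G = np.asarray(G, dtype=float)
    y = np.asarray(y_train, dtype=float)
    lam = (1.0 - h2) / h2
    mu = y.mean()                                   # simple intercept estimate
    G_tt = G[np.ix_(train_idx, train_idx)]          # training x training block
    G_pt = G[np.ix_(test_idx, train_idx)]           # test x training block
    alpha = np.linalg.solve(G_tt + lam * np.eye(len(train_idx)), y - mu)
    return mu + G_pt @ alpha
```

Expanding the training population, as in the second prediction cycle, corresponds to appending the selected progeny to `train_idx` and `y_train` before predicting the next cohort.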

Discussion
Pan-genomic studies in soybean have shown that the use of a large number of accessions is more conducive to . Previous studies have shown that SNPs with higher MAF should be the first choice for array design, which is another advantage of our array, together with the robustness of the Illumina bead chip system (Gunderson 2009). Although arrays with similar or higher density have been used for soybean genotyping (Lee et al. 2015; Wang et al. 2016), we paid more attention to the distribution of the SNP sites throughout the genome. Owing to this even genomic distribution, the ZDX1 array offers high practicability and functionality. In particular, its extremely high coverage of annotated genes and many important sites makes it useful for correlation analysis and genetic mapping, while the moderate density reduces costs. These characteristics of the ZDX1 array contribute to its versatility and reliability for soybean breeding and genetic research.
When screening germplasm for potential use as parents, phenotypic identification is time-consuming and laborious, and the results are greatly affected by the environment. Therefore, molecular marker-assisted selection represents an efficient and effective method for screening target traits (Barabaschi et al. 2016). Previous studies have shown that some breeding materials which perform well locally are unique in terms of genotype and may therefore contribute useful genetic variation to breeding programs (Iquira et al. 2010). Other studies have shown that a lack of genetic diversity can hinder efforts to increase yield potential in new varieties (Hegstad et al. 2019).
We hypothesized that non-parental lines or progeny with different genotypes could be used to expand the pool of parental lines, while the frequency of genetic resources with a higher degree of similarity to the parental lines could be reduced. If the genetic variation and distance between least-related accessions are sufficiently large in the parent population, then a progeny population with greater genetic variation can be obtained (Mikel et al. 2010). Our SNP array data suggested that the greater the genetic distance among parents, the higher the rate over best-parent of progeny, although this correlation was not statistically significant. Intuitively, to obtain progeny with higher absolute trait values, the parental lines should also show high performance for those traits. In summary, our results suggest a strategy for assembling parents with a greater chance of obtaining excellent progeny, while avoiding blindly formulating a large number of suboptimal combinations.
In this study, the predictive accuracy provided by GBLUP (i.e., based on genomic information) was higher than that of the ABLUP and HBLUP models, reaching an average of 0.75, and the accuracy of predicting beginning maturity, 100-seed weight, protein, oil, and seed yield was similar to or higher than that in previously reported results (Supplemental Table 16). These findings further indicate that the SNPs used in ZDX1 are broadly representative of soybean varieties used in various breeding studies. Moreover, genomic information can better reflect the genetic structure of the breeding population than pedigree relationships. although the sampling strategy is complex and varies from population to population. Here, we found that sampling SNPs located within genic regions was more informative than sampling SNPs from intergenic or random regions, and therefore marker effects do not need to be considered for each different population. Indeed, sampling SNPs from genic regions can maintain or even significantly improve the accuracy of prediction and reduce sequencing costs (Song et al. 2020). In addition, expanding the training group can also improve the accuracy of predictions (Hao et al. 2019). In the current study, we expanded the training population of parental lines to include progeny with higher predicted values for traits of interest, which increased the accuracy of predicting beginning maturity, 100-seed weight, protein content, and oil content among progeny by 32.1% on average.
Marker selection, BLUPs modeling, and expanding the training set based on marker data from the ZDX1 SNP array can thus improve the accuracy of predicting progeny phenotypes.
In traditional plant breeding, breeders mainly rely on phenotype and experience, which may be confounded by a range of factors (Barabaschi et al. 2016). Molecular breeding is therefore considered the best option for improving breeding efficiency. However, molecular techniques have thus far failed to effectively integrate high-throughput genotyping with the whole breeding process. In this study, we propose an optimization strategy to comprehensively improve the breeding processes of parental evaluation, selection for crosses, and progeny selection using the ZDX1 array (Fig. 6). The low availability of phenotypically ideal germplasm accessions that can be used as parental lines represents a major bottleneck in breeding progress, so it is necessary to introduce new, high-quality germplasm that affects any given trait (Liu et al. 2020a).

Fig. 3 Mean value of parents and progeny, and the rate over best-parent of progeny, for five traits plotted against genetic distance. The blue diamonds represent the average parental value, the red circles represent the average progeny value, and the yellow triangles represent the rate over best-parent of progeny. The genetic distance is the mean value under different rates over best-parent; rhd represents the correlation coefficient between the rate over best-parent of progeny and the genetic relationship between parents; and rpo represents the correlation coefficient between the mean value of progeny and the mean value of parents.

Fig. 5 Different strategies based on the ZDX1 array in genomic selection. (a) The prediction accuracy (rGS) of three models for five traits with 100 repetitions using 5-fold cross-validation. The prediction accuracy is shown as the mean value ± standard deviation. (b) Prediction accuracy of selected sites for genic-region, whole-genome, and intergenic-region markers. The prediction accuracy is shown as the mean value ± standard deviation. (c) Simulation of the process of predicting progeny performance from parental resources in actual breeding, and of the prediction process after using progeny to expand the training population.

Fig. 6 Optimized scheme for using genome-wide molecular marker breeding combined with array screening. Germplasm resources are introduced from a resource bank, redundant accessions are eliminated through genetic diversity analysis, and accessions with excellent alleles are retained. Germplasm accessions with a higher breeding index (BI) are used as one of the candidate parents in cross breeding, and the superior resources are further screened for those with highly distant genetic relationships for cross breeding. A microarray is then used for F1 identification, hybrid segregation combined with phenotypic selection, and whole-genome selection. Germplasm with high breeding values and excellent multiple traits can also be used as recurrent parents. When germplasm with specific traits is used for backcross improvement, functional markers can be used for foreground selection and microarrays for genome-wide background scanning, combined with phenotypic selection, so that excellent stable lines can be selected. The green dashed boxes indicate the commonly used breeding method, and the boxes enclosed by solid yellow lines represent the improved scheme proposed in this study.
