Efficacy of plant breeding using genomic information

Genomic selection (GS) proposed by Meuwissen et al. more than 20 years ago, is revolutionizing plant and animal breeding. Although GS has been widely accepted and applied to plant and animal breeding, there are many factors affecting its efficacy. We studied 14 real datasets to respond to the practical question of whether the accuracy of genomic prediction increases when considering genomic as compared with not using genomic. We found across traits, environments, datasets, and metrics, that the average gain in prediction accuracy when genomic information is considered was 26.31%, while only in terms of Pearson's correlation the gain was of 46.1%, while only in terms of normalized root mean squared error the gain was of 6.6%. If the quality of the makers and relatedness of the individuals increase, major gains in prediction accuracy can be obtained, but if these two factors decrease, a lower increase is possible. Finally, our findings reinforce genomic is vital for improving the prediction accuracy and, therefore, the realized genetic gain in genomic assisted plant breeding programs.


INTRODUCTION
Genomic selection (GS) is a predictive methodology proposed by Meuwissen et al. (2001), which has revolutionized plant breeding because breeders can select potential candidates before they are planted in the field. Researchers train a statistical machine learning model with recorded phenotypic and genotypic information of a reference population (also called training population), then engage it to make predictions regarding individuals for which only genotypic information is available (Desta & Ortiz, 2014). This predictive methodology is called genomic selection because it requires high-density markers across the whole genome and depends on the assumption that at least one genetic marker will be in linkage disequilibrium with a causative quantitative trait locus for the trait of interest (Meuwissen et al., 2001). Successful implementation of GS depends on many factors, including: (a) the heritability of the trait of interest, (b) relationships of the individuals in the training and testing set (Habier et al., al. 2007), (c) the size of the training and testing sets, (d) the goal of prediction, for example, tested lines in tested environments, untested lines in tested environments, tested lines in untested environments or untested lines in untested environments, (e) the statistical machine learning method implemented for obtaining the predictions, (f) marker density, and (g) population structure (Frouin et al., 2019).
Advantages of GS over traditional (phenotypic) selection include the following: (a) the possibility of selecting individuals before they are planted, (b) the potential for selecting superior genotypes with higher precision, (c) reduced costs, in part by saving resources required for extensive phenotyping, (d) less time needed for variety development by reducing the cycle length, (e) the ability to substantially increase the selection intensity, thus providing scenarios for capturing greater gain per unit time, (f) the ability to select difficult to measure traits, and (g) increased selection process accuracy.
Because GS promises more precise selection and shorter generation intervals to improve breeding efficiency (Ortiz, 2015), it is being implemented in many crops such as wheat, maize, cassava, rice, chickpea, and groundnut Huang et al., 2019;Roorkiwal et al., 2016;Wolfe et al., 2017). Nevertheless, practical application of GS is complex because successful implementation is determined by the predictive accuracy of the statistical machine learning methods employed.
Our goal in this research was to exhibit the importance of the genomic markers information to improve the prediction accuracy in plant breeding trials. Using 14 real datasets (some with several traits), we compared the prediction performance of not using (Method A) and using (Method B) genomic information in the prediction models. For the benchmark, we used the Bayesian best linear unbiased predictor model (BLUP)

Core Ideas
• Although genomic selection (GS) has been widely accepted and applied to plant, there are many factors affecting its efficacy. • Genomic prediction models are built based on the relationship between genetic markers and the traits of interest using a training population.
• If the quality of the makers and relatedness of the individuals increase, major gains in prediction accuracy can be obtained, but if these two factors decrease, a lower increase is possible.
with the effects of environment, genotypes, and genotypes × environment interaction.

Bayesian BLUP model (Model A)
The model that does not include genomic information is as follows: where represents the response variable measured in the jth line and ith environment, and μ is a general intercept. The environments ( ) are considered random variables with identically and independent normal distribution (iid) (0, σ 2 ) with mean 0 and variance σ 2 , the cultivars ( ) are iid random variables with normal distribution (0, σ 2 ) with mean 0 and variance σ , 2 , the cultivar × environment interaction ( ) are iid random variables with normal distribution (0, σ 2 ) with mean 0 and variance σ 2 , and ϵ are random error components in the model assumed to be iid with mean 0 and variance σ 2 .

Bayesian genomic best linear unbiased predictor (GBLUP) model (Model B)
The implemented single-trait statistical genomic model was as follows: where , μ, and ϵ are defined as in equation (1), but , = 1, … , , are the random effects of genotypes, are the random effects of genotype × environment interaction. Furthermore, it is assumed that where is the genomic relationship matrix (linear kernel) computed as proposed by VanRaden (2008) and T denotes matrix transpose. The implementation of both models was carried out in the R statistical software (R Core Team, 2022) using the Bayesian generalized linear regression library of Pérez and de los Campos (2014).

Datasets
Datasets wheat EYT_1, EYT_2, and EYT_3 These datasets are elite yield trial (EYT) cycles 2013-2015 from the Global Wheat Program of the International Maize and Wheat Improvement Center (CIMMYT) and belong to elite yield trials (EYT) established in four different cropping seasons with four or five environments in each.
The lines involved in this study correspond to years 2013-2014, 2014-2015, and 2015-2016. The EYT_1, EYT_2, and EYT_3, contain 776, 775, and 964 lines, respectively. The experimental design was alpha lattice, and the lines were sown in 39 trials, each covering 28 lines and two checks in six blocks with three replications. In these datasets, several traits were available for some environments and lines. We included four traits measured for each line in each environment: days to heading (number of days from germination to 50% spike emergence), days to maturity (number of days from germination to 50% physiological maturity or the loss of the green color in 50% of the spikes), plant height, and grain yield (GY). For full details of the experimental design and how the best linear unbiased estimates were computed, see Juliana et al. (2018). Wheat lines from datasets EYT_2, and EYT_3 were evaluated in five environments, and wheat lines from dataset EYT_1 was evaluated in four environments. For EYT_1, the environments were bed planting with five irrigations (Bed5IR), early heat (EHT), flat planting and five irrigations (Flat5IR), and late heat (LHT). For EYT_2, the environments were bed planting with two irrigations (Bed2IR), Bed5IR, EHT, Flat5IR, and LHT, while for dataset EYT_3, the environments evaluated were Bed2IR, Bed5IR, Flat5IR, flat planting with drip irrigation (FlatDrip) and LHT.
Genome-wide markers for the 2515 (776 + 775 + 964) lines in the two datasets were obtained using genotyping-bysequencing (GBS) (Elshire et al., 2011;Poland et al., 2012) at Kansas State University using an Illumina HiSeq2500. After filtering, 2038 markers were obtained from an initial set of 34,900 markers. The imputation of missing markers data was carried out using LinkImpute (Money et al., 2015) and implemented in trait analysis by association evolution and linkage (TASSEL) (Bradbury et al., 2007), version 5. Lines with more than 50% missing data were removed, and a total of 2515 lines across the three datasets were used in this study.

Dataset Groundnut
The phenotypic dataset reported by Pandey et al. (2020) contains information on the phenotypic performance for various traits in four environments. In our study, we assessed predictions using the trait seed yield per plant for 318 lines in four environments denoted as Environment1 (ENV1) The dataset is balanced, with a total of 1272 assessments in each line included once in each environment. Marker data were available for all lines and 8268 single nucleotide polymorphism (SNP) markers remained after quality control (each marker was coded with 0, 1, or 2).

Dataset Maize
This dataset was included in Souza et al. (2017); and is 722 (with 722 × 4 = 2888 observations) maize hybrids obtained by crossing 49 inbred lines. The hybrids were evaluated in four environments (Env1-Env4) in Piracicaba and Anhumas, São Paulo, Brazil, in 2016 where only GY was the trait included in this study. The hybrids were evaluated using an augmented block design, with two commercial hybrids as checks to correct for microenvironmental variation. At each site, two levels of nitrogen (N) fertilization were used. The experiment conducted under ideal N conditions received 100 kg ha −1 of N (30 kg ha −1 at sowing and 70 kg ha −1 in a coverage application) at the V8 plant stage, while the experiment with low N received 30 kg ha −1 of N at sowing. The parent lines were genotyped with an Affymetrix Axiom Maize Genotyping Array of 616 K SNPs. Markers with minor allele frequency (MAF) of 0.05 were removed. After quality control, 54,113 SNPs were available.

Dataset Disease
This dataset contains 438 wheat lines for which three diseases were recorded: Pyrenophora tritici-repentis (PTR) which causes infections known as helminthosporiosis, tan spot, yellow leaf spot, or yellow leaf blotch. Second, Parastagonospora nodorum, a major fungal pathogen of wheat fungal taxon which includes several plant pathogens affecting the leaves and other parts of the plant. Third, Bipolaris sorokiniana, which is economically important as the cause of seedling diseases, common root rot and spot blotch of barley, wheat, and several other crops. The 438 wheat lines were evaluated in the greenhouse for several replicates. The replicates were considered as different environments (Env1, Env2, Env3, Env4, Env5, and Env6). The total number of observations were 438 × 6 = 2628 observations for which the three traits were measured.
DNA samples were extracted from each line, following the manufacturer's protocol. DNA samples were genotyped using 67,436 SNPs. For a given marker, the genotype for the ith line was coded as the number of copies of a designated marker-specific allele carried by the ith line (absence = zero and presence = one). SNP markers with unexpected heterozygous genotypes were recoded as either AA or BB. We included markers that had less than 15% missing values. Then, we imputed the markers using observed allelic frequencies.
We also removed markers with MAF < 0.05. After quality control and imputation, a total of 11,617 SNPs were available for analysis.

Datasets wheat YT_1-6
Spring wheat lines selected for GY (only trait included in this study) analyses from CIMMYT first year yield trials (YT) were used as the training population to predict the quality of lines selected from EYT for GY analyses in a second year. The analyses were conducted for GY using six sets of data, as reported below: More details of these datasets can be found in Ibba et al. (2020).
All the lines were genotyped using GBS (Poland et al., 2012). The TASSEL v.5 GBS pipeline was used to call marker polymorphisms, and a MAF of 0.01 was used for SNP discovery. The resulting 6,075,743 unique tags were aligned to the wheat genome reference sequence (RefSeq v.1.0) with an alignment rate of 63.98%. After filtering for SNPs with homozygosity > 80%, p-value for Fisher's exact test < 0.001 and χ 2 value lower than the critical value of 9.2, we obtained 78,606 GBS markers that passed at least one of those filters. These markers were further filtered for less than 50% missing data, greater than a 0.05 MAF and less than 5% heterozygosity in all the datasets. Markers with missing data were imputed using the "expectation-maximization" algorithm in the "R" package rrBLUP (Endelman, 2011).

Dataset Indica
The rice dataset was reported by Monteverde et al. (2019) and contains information on the phenotypic performance of four traits (GY; PHR, percentage of head rice recovery; GC, percentage of chalky grain; and PH, plant height) in three environments (years 2010, 2011, and 2012). Three hundred twenty seven lines were evaluated in each year and 54 environmental covariates. This dataset is balanced, with 981 assessments with each line included in each environment. Marker data were available for all lines and 16,383 SNP markers remained after quality control (each marker was coded with 0, 1, or 2).

Dataset Japonica
This rice dataset belongs to the tropical Japonica population, was reported by Monteverde et al. (2019) and contains phenotypic information on the same four traits as for the Indica population (GY, PHR, GC, and PH), but evaluated in five environments (years 2009(years , 2010(years , 2011(years , 2012(years , and 2013(years ). For 2009(years , 2010(years , 2011(years , 2012(years , and 2013 lines were evaluated, respectively. Also, the same 54 environmental covariates measured in the Indica dataset (Dataset 13), were also not used. Since this dataset is not balanced, 1051 assessments were evaluated in the five environments. Marker data were available for 320 lines and 16,383 SNP markers remained after quality control (each marker was coded with 0, 1, and 2).

Data availability
The phenotypic and genomic information for the non-wheat datasets (datasets from Groundnuts, Maize, Disease, Indica, and Japonica) employed on this study can be downloaded from the link: https://github.com/osval78/PLS-data. The two wheat datasets, EYT_1-3 and YT_1-6 can be found in Juliana et al. (2018) and Ibba et al. (2020), respectively.

Metrics for evaluation of prediction accuracy
In each of the 14 datasets, the seven outer fold cross-validation was implemented (Montesinos-López et al., 2022). For this reason, 7 − 1 folds were assigned to the training set and the remaining were assigned to the testing set, until each of the seven folds were used as testing set once. Next, the average of the seven folds was reported as prediction performance using Pearsonťs correlation (Cor) and the normalized root mean square error (NRMSE), computed as NRMSE = RMSĒ , with denoting the observed value , whilê( ) represents the predicted value for observation and̄is the mean value of the observed values. To compare the two methods using (Method B) and not using (Method A) genomic information was computed the relative efficiencies in terms of NRMSE as follows: where RMSE Method A and NRMSE Method B denotes the NRMSE of Methods A and B, respectively. While in terms of Cor, the RE was computed as follows: Under both ways to compute the RE, if RE NRMSE (or RE Cor ) > 1, the best prediction performance in terms of NRMSE(or Cor) was obtained using Method B, but when RE NRMSE (or RE Cor ) < 1, the best method was Method A. When RE NRMSE (or RE Cor ) = 1, both methods were equally efficient.

RESULTS
Results are provided in five sections, the first three provide results for datasets EYT_2, EYT_3, and Groundnut, respectively, and the last section provides summary across all datasets. The prediction performance of using (Method B) and not using (Method A) genomic information is compared to the prediction modeling in terms of NRMSE and Pearsonťs correlation (Cor). The final section contains Supporting Information that provide results (tables and figures) for the other 11 datasets: Disease, EYT_1, Indica, Japonica, Maize, and YT_1-YT_6. General comments on the results from these datasets are given at the end of this section.

Dataset-EYT_2
The prediction performance across traits for each environment and across environments (Global) is provided for this EYT_2′s dataset under the 7-fold random cross-validation (7FCV) across the four traits. Results are shown in Table 1 and Figure 1a,b in terms of NRMSE and Figures 1c,d for Cor between observed and predictive values. As seen in Table 1 and Figure 1a The observed relative efficiency (RE) of comparing Method B versus Method A in terms of NRMSE for each environment and across environments were 1.053 (BED2IR), 1.062 (BED5IR), 1.051 (EHT), 1.037 (Flat5IR), 1.109 (LHT), and 1.060 (Global). This indicates Method B outperformed Method A in all environments mentioned by 5.3% (BED2IR), 6.2% (BED5IR), 5.1% (EHT), 3.7% (Flat5IR), 10.9% (LHT), and 6.0% (Global) ( Table 1 and Figure 1c,d). While the RE of comparing the Method B versus Method A for each environment and across environments in terms of Cor were 1.058 (BED2IR), 1.033 (BED5IR), 1.037 (EHT), 1.035 (Flat5IR), 1.130 (LHT), and 1.005 (Global). This indicates Method B outperformed Method A in all the traits and environments in terms of Cor mentioned by 5.8% (BED2IR), 3.3% (BED5IR), 3.7% (EHT), 3.5% (Flat5IR), 13.0% (LHT), and 0.5% (Global). This indicates Method B is considerably better in prediction performance than Method A since the RE were greatly superior to one in both metrics (NRMSE and Cor).

Dataset-EYT_3
The prediction performance across traits for each environment and across traits and environments (Global) is provided for this EYT_3′s dataset under the 7FCV. Results are shown in Table 2 and Figure 2a,d.

Groundnut
As seen in  Figure 3a,c).
The RE of comparing Method B versus Method A in terms of NRMSE for each environment and across environments was 1.038 (ALIYARNAGAR_R15), 1.069 (ICRISAT_PR15-16), 1.148 (ICRISAT_R15), 1.113 (JAL-GOAN_R15), and 1.089 (Global). These results indicate Method B outperformed Method A in all environments by 3.8% (ALIYARNAGAR_R15), 6.9% (ICRISAT_PR15-16), 14.8% (ICRISAT_R15), 11.3% (JALGOAN_R15), and 8.9% (Global). For more details, see Table 3 and T A B L E 3 Dataset Groundnut. Prediction performance across traits with (Method B) and without (Method A) genomic information for each environment and across environments (Global) for dataset Groundnut in terms of normalized root mean squared error (NRMSE) and Pearson correlation (Cor) under a 7FCV (7 fold cross-validation) strategy.  . Method B's prediction performance is superior to Method A since the RE's were greatly improved in both metrics (NRMSE and Cor). More details are given in Table 3 and Figure  3b,d.

Across datasets
Next, we provide a meta-analysis across traits, environments, and datasets for both metrics (NRMSE and Cor) resulting of implementing a 7FCV strategy in each dataset. As seen in Table 4, in terms of NRMSE the performance for Method A (0.832) was worse than Method B (0.781). In terms of Cor, the performance of Method A (0.170) was worse than Method B (0.248) as well. For details, see Table 4 and Figure 4a,c. The RE of comparing Method B versus Method A in terms of NRMSE strategy was 1.066, which indicates Method B outperformed Method A by 6.6% across traits, environments, and datasets. For more details, see Table 4 and Figure 4b.  (Table 4 and Figure 4d).
T A B L E 4 Prediction performance across traits, environments and datasets with (Method B) and without (Method A) genomic information in terms of normalized root mean squared error (NRMSE) and Pearson correlation (Cor) under a 7FCV (7 fold cross-validation) strategy.

3.5
Disease, EYT_1, Indica, Japonica, Maize, YT_1-6 Results from dataset Disease show the prediction performance across traits for each of the six locations included in the trial and based on both criteria (NRMSE and Cor) does not give Method B any superiority over Method A when genomic is not employed for the prediction of unobserved individuals (Supporting Information). Table S1 and Figure S1 show these results, specifically Figure S1B,C.
Results for wheat dataset EYT_1 show a significant advantage of the model that uses genomic information (Method B) over the non-genomic predictions (Method A) in terms of both matrics: RE for NRMSE and Pearson's correlation between observed and predicted values. See details in Table  S2 and Figure S2A-C where for each of the four environments Bed5IR, EHT,Flat5IR,LHT,and Global Genomic prediction (Method B) showed superiority over the case that genomic is not used for prediction (Method A).
Results from dataset Indica are given in in Table S3 and Figure S3A-C. Based on NRMSE, Method A was worse than Method B in years 2011, 2012, and Global but both methods gave similar prediction accuracy for predicting year 2010 ( Figure S3B). However, for criterion RE_Cor Method A was worse than Method B in all 3 years (2010, 2011, and 2012) and Global. For dataset Japonica the prediction based on genomic (Method B) was superior to the method not using genomic (Method A) in 3 years 2010, 2011, and 2013 based on criterion RE NRMSE and on when predicting all the years, except 2012 based on criterion RE_Cor (Table S4, Figure S4A-D). However, for dataset Maize no increase in prediction accuracy was exhibited for Method B against Method A, as shown in Table S5 and Figure S5A-D.
For the datasets YT_1-YT_6 that included spring wheat datasets from 2013 to 2020 (Table S6-S11 and Figures S6A-D and S11A-D) the prediction given by Method B outperformed Method A for each pair of years of each of the cases, including Global. It was also observed that the clear superiority of Method B over Method A for each of these data sets is shown under both metrics used to evaluate the performance of the two methods RE_NRMSE and RE_Cor.

DISCUSSION
This study evaluated how GS prediction accuracy can increase with and without makers. For more precision, we examined this question within a multi-environment framework (with a predictor considering environment + genotype + genotype × environment interaction) using 14 public datasets of plant breeding programs using GS. On average, across traits, environments, and datasets within the NRMSE, we identified a genetic gain of 6.6% when using markers as opposed to their absence. For Cor, a genetic gain of 46.1% was identified when using markers, as opposed to their absence. The average of the metrics results in a genetic gain value of 26.35% when using markers as opposed to their absence.
Our results support proponents (Frouin et al., 2019;Ortiz, 2015) of GS who observe GS has the potential to transform plant and animal breeding because prediction accuracy drops an average of 26.35% when markers are not employed. The distinction between successful and unsuccessful imple-mentation of GS can be discerned based on this variation. Several factors affect the successful implementation of the GS methodology by influencing the genomic-enable prediction. Three of these factors are discussed as follows: 1. the relatedness of the individuals in the training and testing set (Rincent et al., 2012;Schopp et al., 2015), because unrelatedness individuals points to a low prediction accuracy, even with high-quality markers. The degree of relatedness between individuals in the training and testing populations is important to guarantee high prediction accuracy in the context of genomic prediction because it determines the extent to which the genetic information learned from the training population can be effectively applied to the testing population. Genomic prediction models are built based on the relationship between genetic markers and the traits of interest using a training population. Therefore, if the individuals in the testing population are too closely related to those in the training population, the genomic predictions may be overly optimistic, leading to poor performance when applied to unrelated populations. On the other hand, if the individuals in the testing population are too distantly related to those in the training population, the genomic predictions may be inaccurate due to the lack of shared genetic information. Therefore, selecting an appropriate level of relatedness between the training and testing populations is essential to ensure the generalizability and accuracy of genomic prediction models (Habier et al., 2007;Rincent et al., 2012;Schopp et al., 2015); 2. the quality of markers used in genomic prediction is of paramount importance because the accuracy of genomic prediction depends on the ability of markers to capture the underlying genetic variation of the trait of interest. Inaccurate or low-quality markers can lead to biased or imprecise predictions, resulting in reduced effectiveness and efficiency of genomic selection programs. Therefore, ensuring the quality of markers is critical in achieving the desired outcomes of genomic prediction in animal and plant breeding programs (Goddard, 2009); 3. the coverage of markers also is of paramount importance to the successful implementation of GS because it determines the extent to which the genetic variation across the genome is captured by the markers. A higher marker density and wider genome coverage result in better representation of the genetic variation, leading to more accurate and reliable predictions of an individual's genetic merit for the traits of interest. On the other hand, low marker density or uneven distribution of markers across the genome can result in biased and imprecise predictions, limiting the effectiveness of genomic selection. Therefore, ensuring adequate marker coverage is crucial for achieving the desired outcomes of genomic selection in animal and plant breeding programs (Spindel & McCouch, 2016). The Plant Genome The three factors just pointed out why, in terms of correlation when markers are used, an improvement of more than 46.1% is observed when using genomic compared when not using genomic. It is documented that genomic information (a) measures relatedness arising from more distant ancestors (than those included in a historical pedigree) and (b) captures the variation in realized kinship arising from the stochastic effects of Mendelian sampling and recombination (Yu et al., 2017).
Even as our results corroborate the power of GS, we are aware many breeding programs will have problems implementing GS because it depends on the ability to optimizing all potential factors. Due to the easy of measuring aditional input data such as environmental information and various data types including transcriptomics, proteomics, and metabolomics, among others, there is a resonable expectation that GS, can be implemented with significantly reduced levels of uncertainty. Increasing empirical evidence shows prediction accuracy improves significantly by adding complementary prediction model inputs. For example, Wu et al. (2022), used three types of inputs of different omics data (genomics, metabolomics, and transcriptomics,) as predictors and found the integration of the three inputs improved prediction accuracy in barley.

CONCLUSIONS
In this study we evaluated 14 datasets to determine the gain in prediction performance which can be achieved using genomic information in the prediction models. For each dataset we implemented the GBLUP model with the same predictor with (Method B) and without (Method A) genomic information. We found in some datasets the gain in prediction accuracy when including the genomic information was almost null, but in other datasets, the gain was more than 70%. Across traits, environments, and datasets we found that for NRMSE the gain was around 6.6% and for Cor, around 46.1%, meaning across metrics the gain is around 26.35%. Our findings corroborate that the quality, good coverage of the genomic information, and a moderate to high degree of relatedness among the individuals in the training and testing set it is of paramount importance to take advantage of GS. Our research also illustrates that the quality of genomic information and a low degree of relatedness between the training and testing set can greatly compromise the efficacy of GS.