Performance prediction of crosses using estimated breeding values for regions of soybean production in Brazil

ABSTRACT The aim of this study was to use the performance prediction of crosses in a group of conventional soybean genotypes to obtain the breeding value (BV), to evaluate the correlation between the prediction and the actual productive potential of the progeny generated by this method in experimental tests for different seasons and environments, and to determine whether the methodology is efficient in generating progeny of high productive potential for the soybean macro-regions (SMR) and soil and climate regions (SCR) of Brazil. A total of 481 conventional elite genotypes were selected as parents, the BV were generated, and crosses were predicted using the restricted maximum likelihood/best linear unbiased prediction mixed-model procedure (REML/BLUP). In 2019, the predicted crosses and the advancement of the F1 and F2 segregating populations were carried out, and the populations were sent to the breeding programs of a private company in Passo Fundo-RS, Cambé-PR, Rio Verde-GO, Lucas do Rio Verde-MT and Porto Nacional-TO, where they were sown during the 2019/2020 crop season. During the 2020/2021 season, 1868 progeny were selected and tested in experimental trials at these locations. The progeny were again tested during the 2021/2022 season in experimental trials in 50 environments in SCR throughout Brazil. The analysis showed a very weak to moderate correlation, indicating little efficiency for the prediction model used in this study. It is suggested that the prediction model be revised to include a greater number of variables, such as the kinship matrix, so that the BV of the genotypes can be estimated more accurately, especially when the aim is to select progeny in early generations with a high degree of heterozygosity.


INTRODUCTION
The principal aim of a breeding program is to develop genetic combinations between parents (RESENDE; ALVES, 2021) and then develop cultivars that are superior to current cultivars grown and selected in areas that represent the growing region of the species (FRITSCHE-NETO, 2013).
Many cultivar development programs grow a large number of crosses in the expectation that one might result in a superior genetic combination that can be released as a new cultivar. With an effective method of choosing the parents, the number of crosses could be considerably reduced (BORÉM; MIRANDA, 2013).
Since its inception, plant improvement has been based on the visual selection of individuals, i.e., selection based on phenotypic value only (ALLARD, 1999).
It is presumed that the use of mixed models can predict the genotypic value, also known as the breeding value (BV), increasing the efficiency of the selective processes in plant breeding. The REML/BLUP (Restricted Maximum Likelihood/Best Linear Unbiased Prediction) mixed-model method is currently used for studying families, allowing the genetic parameters to be estimated and the genotypic values of the families to be predicted (RESENDE, 2002). According to Pinheiro et al. (2013), there is a growing number of reports on the use of REML/BLUP in soybean improvement.
Heffner, Sorrells and Jannink (2009) describe how the success of a genetic improvement program depends on the accuracy of predicting the genetic value (BV) from phenotypic values; however, such predictive procedures still need to demonstrate their practical effectiveness in field tests so that they can be incorporated into routines as a tool to aid and advance genetic improvement programs. The selection of good parents is the key to success in plant breeding, with individuals to be used as future parents being originally selected based on their superior genotypic value (BERNARDO, 2020), so that methodologies which employ this information to satisfactorily predict the performance of crosses of high genetic potential might be promising.
Furthermore, performance prediction in crosses of proven value can make plant breeding programs more efficient in terms of genetic gains, reducing the costs of selection processes, and of developing and obtaining new cultivars, thereby increasing the supply of new cultivars that can be recommended for the different agricultural regions.
The aim of this study was to predict the performance of crosses in a group of conventional soybean genotypes from a private breeding program to obtain the breeding value (BV) of the genotype combination. It also aimed, by means of experimental testing in different seasons and environments, to evaluate the correlation between the result of the prediction and the actual productive potential of the progeny generated by this method, and thereby determine whether the methodology is efficient in generating progeny of high productive potential for the soybean macro-regions (SMR) and soil and climate regions (SCR) of Brazil.

MATERIAL AND METHODS
All the genotypes in this study are conventional and come from the soybean breeding programs of GDM Genética do Brasil S.A (GDM) implemented in each SMR and SCR in Brazil, as per the third approximation (KASTER; FARIAS, 2012).
A total of 481 elite genotypes were selected as parents for predicting the crosses, including 31 genotypes from breeding program M1, related to SMR 1; 120 genotypes from breeding program M2, related to SMR 2; 118 genotypes from breeding program M3, related to SMR 3; 176 genotypes from breeding program M4, related to SMR 4; and 36 genotypes from breeding program M5, related to SMR 5.
The REML/BLUP mixed-model procedure was used with the ASReml-R statistical package (BUTLER et al., 2017) to predict the genetic values of the genotypes (BV), and the performance of the crosses was predicted employing the Shiny package of the R software (R CORE TEAM, 2016), as per the following model:
$$y_{ijk} = \mu + G_i + E_j + T_{k(j)} + e_{ijk}$$
where $y_{ijk}$ is the observation; $\mu$ is the overall mean value; $G_i$ is the fixed effect of the ith genotype; $E_j$ is the random effect of the jth environment (combination of locality + crop year + sowing date); $T_{k(j)}$ is the effect of the kth trial in the jth environment; and $e_{ijk}$ is the error associated with the experimental unit of the ith genotype in the kth trial in the jth environment, with a different variance $\sigma^2_{e_j}$ for each environment j.
The predicted crosses and the generation advancement of the F1 and F2 populations were carried out in 2019 in Porto Nacional-TO. After harvesting, the populations were sent to the GDM breeding programs in Passo Fundo-RS (M1), Cambé-PR (M2), Rio Verde-GO (M3), Lucas do Rio Verde-MT (M4), and Porto Nacional-TO (M5), based on the crossing guidelines for each SMR and the aims of each breeder. In the 2019/2020 season, these segregating F2 populations were sown in experimental trials (POP trials) that included controls, to characterize the relative maturity group (MGP) of each progeny based on comparative phenotypic observations.
In all, 1868 F3 progeny were selected, as follows: 206 progeny from 26 M1 pedigrees with an MGP ranging from 5.0 to 6.8; 202 progeny from 106 M2 pedigrees with an MGP ranging from 6.0 to 7.0; 679 progeny from 120 M3 pedigrees with an MGP ranging from 6.5 to 7.6; 732 progeny from 180 M4 pedigrees with an MGP ranging from 7.4 to 8.5; and 49 progeny from 28 M5 pedigrees with an MGP ranging from 8.0 to 8.7.
In the 2020/2021 season, the F3 progeny were sown in the respective breeding programs to evaluate phenotype and yield (YLD) in trials (MROW trials) consisting of 50 treatments containing progeny with no replications, and 10 treatments containing controls repeated between each trial. The plot layout comprised two rows of five meters at a spacing of half a meter between rows and half a meter between plots (plant corridor) in an augmented block design (ABD), as described by Federer (1956). To analyze the data, the R software (R CORE TEAM, 2016) was used, employing mixed-model methodology and considering random effects, in a model where $\mu$ is the overall mean and $G_i$ is the fixed effect of the ith genotype.
For the 2021/2022 season, the F4 progeny were sown in experimental trials (RETEST trials) in 50 environments throughout Brazil, corresponding to the SCR in each SMR under the responsibility of the breeding programs, as shown in Table 1, allowing information on the genotype x environment interaction (GxE) to be captured. The RETEST trials comprised 50 treatments containing progeny with no replications, and 10 treatments containing controls that were repeated between each trial, using the ABD experimental scheme in plots of four rows of five meters, at a spacing of half a meter between rows and half a meter between plots. The data were analyzed using the R software (R CORE TEAM, 2016) to obtain the final yield of each progeny, considering the dataset from all the environments, using mixed-model methodology and including random effects, as per the formula:
$$y_{ijk} = \mu + G_i + E_j + B_{k(j)} + e_{ijk}$$
where $y_{ijk}$ is the observation; $\mu$ is the overall mean; $G_i$ is the fixed effect of the ith genotype; $E_j$ is the random effect of the jth environment; $B_{k(j)}$ is the effect of the kth block in the jth environment; and $e_{ijk}$ is the error associated with the experimental unit of the ith genotype in the kth block of the jth environment.
To compare the effectiveness of the predictions (YLD BV) in relation to the actual potential of the progeny in the trials (YLD MROW and YLD RETEST), correlation analysis was used, which, as described by Henriques (2011), studies the relationship between a dependent variable and other independent variables, expressed by one equation that associates them all. According to Martins (2014), the coefficient of determination ($R^2$) gives the percentage of the variability of the dependent variable that can be explained as a function of the variability of the independent variable. The square root of the coefficient of determination corresponds to the magnitude of the correlation coefficient (r), which varies between 0 and 1. A value of zero means that there is no linear relationship between the variables (HENRIQUES, 2011). When interpreting the correlations, three aspects should be considered: magnitude, direction, and significance (NOGUEIRA et al., 2012). The linear regression model is represented by:
$$Y = \beta_0 + \beta_1 x + e$$
where $Y$ is the dependent variable; $\beta_0$ is the intercept (value of $Y$ for $x = 0$); $\beta_1$ is the slope of the line (which can be positive, negative or zero); $x$ is the independent variable; and $e$ is the error due to random effects. The degree of correlation between the variables (YLD BV, YLD MROW and YLD RETEST) was analyzed as per Devore (2006) and is shown in Table 2.
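As a minimal sketch of the regression and correlation quantities described above, the following Python snippet fits the line by ordinary least squares and returns the intercept, the slope, r and R². The yield values are hypothetical and are not taken from the trials, and the function name is illustrative only.

```python
import math

def simple_linear_regression(x, y):
    """Fit Y = b0 + b1 * x by ordinary least squares.

    Returns (b0, b1, r, r_squared), where r is Pearson's correlation
    coefficient and r_squared is the coefficient of determination.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b1 = sxy / sxx                  # slope of the line
    b0 = mean_y - b1 * mean_x       # intercept (value of Y for x = 0)
    r = sxy / math.sqrt(sxx * syy)  # carries the sign of the slope
    return b0, b1, r, r ** 2

# Hypothetical predicted (YLD BV) and observed (YLD MROW) yields, in kg/ha.
yld_bv = [3400.0, 3550.0, 3600.0, 3700.0, 3900.0]
yld_mrow = [3300.0, 3800.0, 3450.0, 3500.0, 3750.0]

b0, b1, r, r_squared = simple_linear_regression(yld_bv, yld_mrow)
print(f"intercept={b0:.1f} slope={b1:.3f} r={r:.3f} R2={r_squared:.3f}")
```

Note that r keeps the sign of the slope, so it lies between -1 and 1, while R² is its square and lies between 0 and 1; the share of variability explained is given by R², not by r.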
The correlations were analyzed using the R v4.2.1 software (R CORE TEAM, 2016), independently in the following different scenarios: Scenario M1, Scenario M2, Scenario M3, Scenario M4, Scenario M5, and jointly in a general scenario, to study the methodology applied in each SMR and breeding program.
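The difference between analyzing each scenario independently and analyzing all scenarios jointly can be illustrated with a small sketch. The data below are invented and chosen so that the correlation within each scenario is zero while the pooled correlation is high, simply because the scenarios differ in their mean values; this is an illustration of the analysis layout, not a statement about the data of this study.

```python
import math
from collections import defaultdict

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical records: (scenario, predicted value, observed yield),
# in arbitrary units.
records = [
    ("M1", 1.0, 5.0), ("M1", 2.0, 4.0), ("M1", 3.0, 5.0),
    ("M2", 6.0, 9.0), ("M2", 7.0, 10.0), ("M2", 8.0, 9.0),
]

# Group the records by scenario (breeding program).
groups = defaultdict(lambda: ([], []))
for scenario, predicted, observed in records:
    groups[scenario][0].append(predicted)
    groups[scenario][1].append(observed)

# Correlation within each scenario, then over the pooled data.
per_scenario_r = {s: pearson_r(x, y) for s, (x, y) in groups.items()}
pooled_r = pearson_r([rec[1] for rec in records], [rec[2] for rec in records])

print(per_scenario_r)
print(round(pooled_r, 3))
```

In a layout like this, a pooled coefficient mostly reflects differences between programs, which is one reason to report the scenarios separately as well as jointly.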

RESULTS AND DISCUSSION
In Scenario M1, the result of 0.3278 for r shows that there is a weak correlation between YLD BV and YLD MROW, i.e., around 32.7% of the values of YLD BV explain the result of YLD MROW. The result of 0.2206 for r shows that there is a weak correlation between YLD BV and YLD RETEST, with around 22% of the values of YLD BV explaining the result of YLD RETEST, as shown in Table 3.
The results showed a weak correlation between YLD BV and YLD MROW (0.088). The values of the variables are highly dispersed and distant from the line, which makes this correlation effectively weak. The correlation between YLD BV and YLD RETEST is also weak and only minimally positive (0.086), with highly dispersed data, as shown in Figure 1.
Table 3 - Correlation analysis between the results of YLD BV x YLD MROW and YLD BV x YLD RETEST for Scenario M1
Table 4 - Correlation analysis between the results of YLD BV x YLD MROW and YLD BV x YLD RETEST for Scenario M2
In Scenario M2, the result of 0.1428 for r shows that there is a very weak correlation between YLD BV and YLD MROW, i.e., 14.2% of the values for YLD BV explain the result of YLD MROW. The result of 0.030 for r shows that there is a very weak correlation between YLD BV and YLD RETEST, with only 3% of the values for YLD BV explaining the result of YLD RETEST, as shown in Table 4.
The results also showed a very weak correlation between YLD BV and YLD MROW (-0.070) with highly dispersed data. The correlation between YLD BV and YLD RETEST is also weak (0.022) with highly dispersed data, as shown in Figure 2.
In Scenario M3, the result of 0.0360 for r shows that there is a very weak correlation between YLD BV and YLD MROW, where only 3.6% of the YLD BV values explain the result of YLD MROW. The result of 0.010 for r shows that there is a very weak correlation between YLD BV and YLD RETEST, with only 1% of the YLD BV values explaining the result of YLD RETEST, as shown in Table 5.
In this scenario there was a very weak correlation between YLD BV and YLD MROW (0.006) with highly dispersed data. The correlation between YLD BV and YLD RETEST is also very weak (-0.003), again showing highly dispersed data, as shown in Figure 3.
In Scenario M4, the result of 0.0574 for r showed that there is a very weak correlation between YLD BV and YLD MROW, i.e., 5.7% of the values for YLD BV explain the result of YLD MROW. The result of 0.0223 for r shows that there is a very weak correlation between YLD BV and YLD RETEST, with only 2.2% of the values for YLD BV explaining the result of YLD RETEST, as shown in Table 6. There was also a very weak correlation between YLD BV and YLD MROW (0.031) showing high data dispersion. The correlation between YLD BV and YLD RETEST is also characterized as very weak (0.012) and with high data dispersion, as shown in Figure 4.
For Scenario M5, the result of 0.0141 for r shows that there was a very weak correlation between YLD BV and YLD MROW, i.e., only 1.4% of the values for YLD BV explain the result of YLD MROW. The result of 0.0812 for r shows that there is a very weak correlation between YLD BV and YLD RETEST, with approximately 8.1% of the values for YLD BV explaining the result of YLD RETEST, as per Table 7.
The correlation between YLD BV and YLD MROW is very weak (0.004) and shows high data dispersion. The correlation between YLD BV and YLD RETEST is also very weak (-0.039) with high data dispersion (Figure 5).
The general scenario is shown in Table 8. The result of 0.4800 for r shows that there is a moderate correlation between YLD BV and YLD MROW, where around 48% of the values for YLD BV explain the result of YLD MROW. The result of 0.0447 for r shows that there is a very weak correlation between YLD BV and YLD RETEST, with approximately 4.4% of the values for YLD BV explaining the result of YLD RETEST.
The correlation between YLD BV and YLD MROW is classified as moderate and negative (-0.138) and is shown in Figure 6. The correlation between YLD BV and YLD RETEST was also very weak (-0.035) and with highly dispersed data.
As the environmental effect exerts a great influence on the behavior of the germplasm, it was certainly one of the causes of the differences in YLD between and in each SMR. One probable explanation for this low and negative correlation may be connected to the prediction model used in the study, which may not be adjusted or suitable for estimating the correct values for BV. Another possible explanation could be related to the number of data environments of each parent used in predicting the crosses, since the data from one environment might be used against the data from up to 133 different environments, depending on the phase of each parent in the breeding program.
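The effect of the number of environments per parent can be sketched with the textbook shrinkage form of BLUP. The snippet assumes a much simpler setting than the model fitted in this study: the genotype effect is treated as random, the variance components are known, each genotype has one observation per environment, and there is no GxE term. The variance values and yields are hypothetical.

```python
def blup_shrinkage(var_g, var_e, n_env):
    """Weight applied to a genotype's deviation from the overall mean.

    Under the simplifying assumptions stated above, the weight is
    var_g / (var_g + var_e / n_env), i.e. the reliability of a mean
    taken over n_env environments.
    """
    return var_g / (var_g + var_e / n_env)

def blup(genotype_mean, overall_mean, var_g, var_e, n_env):
    """Predicted genotypic deviation: shrunken deviation from the mean."""
    weight = blup_shrinkage(var_g, var_e, n_env)
    return weight * (genotype_mean - overall_mean)

# Hypothetical variance components (genetic = 1, residual = 4).
VAR_G, VAR_E = 1.0, 4.0

# Two parents with the same observed deviation (+300 kg/ha) but very
# different amounts of data: 1 environment versus 133 environments.
for n_env in (1, 133):
    weight = blup_shrinkage(VAR_G, VAR_E, n_env)
    value = blup(3800.0, 3500.0, VAR_G, VAR_E, n_env)
    print(f"n_env={n_env:3d} weight={weight:.3f} predicted deviation={value:.1f}")
```

Under these assumptions, a parent observed in a single environment has its deviation shrunk far more strongly toward the mean than a parent observed in 133 environments, so the two predicted values carry very different amounts of information when they are combined to predict a cross.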
Despite being a viable alternative, because of the low availability of seeds during the initial stages of plant breeding programs, as described by Duarte, Vencovsky and Dias (2001), the ABD experimental model may have influenced the results, since the experimental error associated with the lack of replications can be significant (SILVA; SILVA, 1999). One way to reduce the experimental error might be to change to a layout that reduces the border effect, since, as described in Silva, Souza and Montenegro (1991), this effect can result in low experimental precision. However, these improvements are subject to the volume of seeds available in the initial generations of the breeding programs.
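To make the role of the controls in an augmented block design concrete, the sketch below applies the classical intra-block adjustment, in which the yield of each unreplicated progeny is corrected by the block effect estimated from the repeated controls. It assumes that every control appears in every block; the trials in this study were analyzed with mixed models, so this is only an illustration of the principle, and all names and values are hypothetical.

```python
def adjust_progeny(blocks):
    """Adjust unreplicated progeny yields using repeated controls.

    blocks maps a block id to a dict with two entries:
      "checks":  {control name: yield}, the same controls in every block
      "progeny": {progeny name: yield}, each progeny in one block only
    The block effect is the mean of the controls in that block minus the
    mean of the controls over all blocks; it is subtracted from each
    progeny yield observed in the block.
    """
    check_means = {
        block: sum(data["checks"].values()) / len(data["checks"])
        for block, data in blocks.items()
    }
    grand_check_mean = sum(check_means.values()) / len(check_means)

    adjusted = {}
    for block, data in blocks.items():
        block_effect = check_means[block] - grand_check_mean
        for name, observed in data["progeny"].items():
            adjusted[name] = observed - block_effect
    return adjusted

# Hypothetical yields in kg/ha: block 1 is a favorable block, block 2 is not.
trial = {
    1: {"checks": {"C1": 3600.0, "C2": 3400.0}, "progeny": {"P1": 3700.0}},
    2: {"checks": {"C1": 3200.0, "C2": 3000.0}, "progeny": {"P2": 3300.0}},
}

adjusted_yields = adjust_progeny(trial)
print(adjusted_yields)
```

In this example the two progeny differ by 400 kg/ha in raw yield, but the controls show that the blocks themselves differ by the same amount, so the adjusted values coincide. With no replication of the progeny, any error in the estimated block effect passes directly into the adjusted values.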
Another causal factor of the very weak to moderate correlation between variables may be related to the size and number of the different environments under analysis. Lima et al. (2008) describe how the genotype x environment (GxE) interaction is one of the main complicating factors in the work of breeders. To reduce the effect of the GxE interaction, it is necessary to conduct experiments in a greater number of locations, evaluating the strength of the interaction and its possible impact when selecting and recommending genotypes.
Most agronomic characteristics are controlled by several genes, where the environment affects expression of phenotypic traits, especially quantitative traits such as YLD (LEITE et al., 2016). The prediction accuracy for any one characteristic across various environments can differ due to the GxE interaction (WANG et al., 2018). Furthermore, most characteristics of agronomic importance have low heritability (BORÉM; MIRANDA, 2013). Because of this, one alternative might be the joint analysis of multiple traits that, according to Alimi et al. (2013), can improve prediction accuracy when using highly correlated characteristics, especially for some of the characteristics of low heritability. Heritability is positively related to prediction accuracy.
As these are progeny of generations F3 and F4 with a high degree of heterozygosity, the variation in YLD response is expected, since the alleles are still undergoing genetic recombination. As stated by Mendonça et al. (2020), it is difficult to select for quantitative characteristics during the initial stages of breeding due to the high level of heterozygosity and the large number of new progeny. The correlation between the analyzed variables may show an increase in magnitude and a change in direction as the degree of heterozygosity is reduced.
To accelerate the approach to homozygosity, methods of rapid generation advancement can be efficiently adopted, advancing two, three or even four generations in the same year, depending on the cycle of the progeny and the region to be cultivated, and on such techniques and supplementary tools as greenhouses, heating and ventilation systems, supplementation or suppression of the light to change the photoperiod, the use of hormones, the early harvesting of seeds, and changes in the CO2 concentration in controlled environments, among others (CASTRO et al., 2016). Furthermore, as stated by Bauer and León (2008), considering information of the parents or pedigree tends to be more efficient than only considering information of the progeny, as is the case when using a kinship matrix, as reported by Resende and Alves (2021) and Clark et al. (2012). When using parents that are genetically highly dissimilar in crosses to increase genetic variability, the process of progeny selection in early generations becomes more complex and challenging. As such, the response of these correlations may be associated with divergent crosses; there is a need for a better understanding of the genetic dissimilarity between parents and the direct and indirect effects on the progeny.
In the correlations in each of the presented scenarios, the data were highly dispersed, which culminated in a low correlation between the variables, indicating that the regression model may not fit the data.
From the genetic point of view, another factor to be considered in the results is related to the germplasm used in this study, which is essentially conventional. According to the International Service for the Acquisition of Agri-biotech Applications (2018), around 94% of the soybean planted in Brazil is genetically modified (GMO). The wide adoption of new biotechnologies has made practically all the companies carrying out soybean breeding, whether public or, particularly, private, migrate their research to GMOs, causing a drastic reduction in the commercial availability of conventional cultivars, and severely limiting and narrowing the genetic base.
In the germplasm of a breeding program that includes superior genotypes, it cannot necessarily be considered that these will inevitably generate agronomically superior progeny when used as parents, although recurrent selection proves to be efficient in most cases.
The correlation between YLD BV and YLD MROW in the general scenario, with an r value of 0.4800, was moderate, showing that, for this scenario, predicting crosses can be used as one way of identifying crosses of greater potential, or even of discarding the worst crosses, providing it is used cautiously. As a comparison, Mendonça et al. (2020) achieved models that reached predictive abilities of between 0.40 and 0.56, thereby allowing low-intensity selection to be applied in F2. As a result, half of the progeny could be discarded without major losses, showing that with the use of genomic prediction, it is possible to select for quantitative characteristics during the initial stages of breeding.
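The idea of low-intensity selection with a moderately accurate predictor can be explored with a small simulation. The sketch below assumes that predicted and true values follow a standard bivariate normal distribution with correlation r, discards the half of the candidates with the lowest predicted values, and reports the share of the truly best 10% that is retained. The distributional assumption, the sample size, the seed and the 10% threshold are illustrative choices, not results from this study or from Mendonça et al. (2020).

```python
import math
import random

def retained_fraction(r, n=20000, top_share=0.10, seed=42):
    """Share of the truly best candidates kept after discarding the
    half with the lowest predicted values.

    Predicted and true values are simulated as standard normal
    variables with correlation r (an assumption of this sketch).
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        predicted = rng.gauss(0.0, 1.0)
        noise = rng.gauss(0.0, 1.0)
        true_value = r * predicted + math.sqrt(1.0 - r * r) * noise
        pairs.append((predicted, true_value))

    # Keep the half of the candidates with the highest predicted values.
    by_predicted = sorted(range(n), key=lambda i: pairs[i][0], reverse=True)
    kept = set(by_predicted[: n // 2])

    # Identify the candidates that are truly in the top share.
    by_true = sorted(range(n), key=lambda i: pairs[i][1], reverse=True)
    best = by_true[: int(n * top_share)]

    return sum(1 for i in best if i in kept) / len(best)

for r in (0.0, 0.48):
    print(f"r={r:.2f} share of true top 10% retained={retained_fraction(r):.2f}")
```

With r = 0 the predictor is uninformative and about half of the best candidates are lost; as r increases, a larger share of them survives the cull, which is the sense in which a moderate correlation can support discarding the worst crosses, provided it is used cautiously.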
Although the use of phenotypic data in this study to predict the performance of crosses and identify better crosses that might generate progeny of agronomically superior characteristics has not proven to be highly efficient to the point of being widely applied, several other studies show extremely positive results when using this methodology, such as Xu, Zhu and Zhang (2014) with rice, Daetwyler et al. (2014) with wheat, and Mendonça et al. (2020) working with segregating populations and soybean progeny. Most of the reported results derive from the study of homozygous populations (DUHNEN et al., 2017; JARQUÍN et al., 2014; ZHANG et al., 2016), and consider different crops and the GxE interaction. However, this does not include the complete set of situations for which prediction can be used. As such, little information on performance is available during the early stages of breeding to predict segregating progeny or populations (MENDONÇA et al., 2020).
Certainly, with advances in the processes of genetic improvement, the adoption of new methodologies and equipment, new biotechnological tools, statistical models, improved predictive models, high-throughput phenotyping, and reductions in the costs of genotyping and data analysis, the time and resources spent on obtaining progeny of high productive potential and with superior agronomic characteristics tend to be reduced, thereby increasing genetic gain per year of breeding. As such, the genetic improvement of plants will continue to make a considerable contribution to increasing productivity.

CONCLUSION
The model for predicting the performance of crosses using estimates of the breeding value was not very efficient in initially identifying crosses with a high potential for generating agronomically superior soybean progeny for the soybean regions of Brazil. The correlations between YLD BV (estimates of the breeding value) x YLD MROW (F3 progeny) and YLD BV x YLD RETEST (F4 progeny) were classed as very weak to moderate.
Terms of the mixed models: $E_j$ is the random effect of the jth environment (combination of locality + crop year + sowing date); effect of the kth trial in the jth environment; error associated with the experimental unit of the ith genotype in the kth trial in the jth environment; effect of the kth block in the jth environment; error associated with the experimental unit of the ith genotype in the kth block of the jth environment, with variance $\sigma^2_{e_j}$ for each environment j.
Terms of the linear regression model: Y = dependent variable; β0 = intercept (value of Y for x = 0); β1 = slope of the line (can be positive, negative or zero); x = independent variable; e = error due to random effects. The degree of correlation between the variables (YLD BV, YLD MROW and YLD RETEST) was analyzed as per Devore (2006) and is shown in Table 2.

Figure 2 - Correlation between the results of YLD BV x YLD MROW and YLD BV x YLD RETEST for Scenario M2

Figure 3 - Correlation between the results of YLD BV x YLD MROW and YLD BV x YLD RETEST for Scenario M3

Figure 5 - Correlation between the results of YLD BV x YLD MROW and YLD BV x YLD RETEST for Scenario M5

Figure 6 - Correlation between the results of YLD BV x YLD MROW and YLD BV x YLD RETEST for the general scenario
(2014) with rice, Daetwyler et al. (2014) with wheat, and Mendonça et al. (2020) working with segregating populations and soybean progeny.

Table 1 - Environments, SMR, SCR and number of progeny sown in the 2021/2022 season
A. Dallastra et al.

Table 7 - Correlation analysis between the results of YLD BV x YLD MROW and YLD BV x YLD RETEST for Scenario M5
Note: *p < 0.1; **p < 0.05; ***p < 0.01
Table 8 - Correlation analysis between the results of YLD BV x YLD MROW and YLD BV x YLD RETEST for the general scenario and breeding programs
year, depending on the cycle of the progeny and the region to be cultivated, and on such techniques and supplementary tools as greenhouses, heating and ventilation systems, supplementation or suppression of the light to change the photoperiod, the use of hormones, the early harvesting of seeds, and changes in the CO2 concentration in controlled environments, among others. In this case, as the interest of