Development of Optimized Phenomic Predictors for Efficient Plant Breeding Decisions Using Phenomic-Assisted Selection in Soybean

The rate of advancement made in phenomic-assisted breeding methodologies has lagged those of genomic-assisted techniques, which is now a critical component of mainstream cultivar development pipelines. However, advancements made in phenotyping technologies have empowered plant scientists with affordable high-dimensional datasets to optimize the operational efficiencies of breeding programs. Phenomic and seed yield data was collected across six environments for a panel of 292 soybean accessions with varying genetic improvements. Random forest, a machine learning (ML) algorithm, was used to map complex relationships between phenomic traits and seed yield and prediction performance assessed using two cross-validation (CV) scenarios consistent with breeding challenges. To develop a prescriptive sensor package for future high-throughput phenotyping deployment to meet breeding objectives, feature importance in tandem with a genetic algorithm (GA) technique allowed selection of a subset of phenotypic traits, specifically optimal wavebands. The results illuminated the capability of fusing ML and optimization techniques to identify a suite of in-season phenomic traits that will allow breeding programs to decrease the dependence on resource-intensive end-season phenotyping (e.g., seed yield harvest). While we illustrate with soybean, this study establishes a template for deploying multitrait phenomic prediction that is easily amendable to any crop species and any breeding objective.


Introduction
Soybean [Glycine Max (L.) Merr.] breeding programs have improved the crop genetic potential, while producers have modified their agronomic methods to increase seed yield (SY) [1][2][3][4][5]. While genomic-assisted breeding methods are now more routinely applied in large resource-rich breeding organizations, the development of phenomic-assisted breeding methods is in relative infancy and is amendable for costeffective deployment [6]. High-throughput phenomics has been proposed as a solution to lessen the throughput capacity, mechanical, and resource limitations that exist in plant breeding programs associated with phenotyping [7]. Studies have shown high correlation between phenomic traits collected using digital sensors and manually collected measurements [8,9] suggesting phenomic data can be acquired on a wide spatiotemporal scale by leveraging the technological advancements made in sensor technology with ground and aerial-based phenotyping platforms [10]. Empowered with phenomic data that was previously difficult or impossible to collect across an expansive spatiotemporal scale, scientists have begun disentangling the genetic architecture of traits through genomic studies [8,11,12], prediction of target trait performance using genomic [13][14][15][16], and phenomic prediction strategies [9,15,[17][18][19][20]. However, increasing soybean seed yield and on-farm profitability is the primary objective of soybean breeding programs making seed yield an important trait to target in both cultivar and germplasm breeding efforts utilizing phenomics tools that can lead to reduced environmental and genotype testing.
Research has been conducted across several crop species, including soybean, demonstrating the use of phenomic tools to measure traits such as canopy temperature (CT) [16], canopy area [17], and canopy spectral reflectance [18][19][20][21] for 2 Plant Phenomics seed yield prediction. For a phenomic trait to be a useful predictor of seed yield, it must have the following attributes: (a) high genetic correlation with seed yield indicating that the shared additive genetic variation is captured in the phenomic trait, and (b) must be highly repeatable and heritable [22,23]. Given the complexity of physiological processes responsible for seed yield [2][3][4][5] it is likely that a myriad of phenomic traits are required for accurate seed yield prediction across a wide spatiotemporal scale. Studies including phenomic traits in multivariate genomic selection (GS), designed to leverage the shared genetic correlation between traits, have shown increased prediction accuracy proposing the added advantage of including phenomic traits in evaluating candidate genotypes over using yield alone to measure breeding value [14][15][16]. However, more information is needed on deploying high-dimensional phenomic information to compare the predictability of phenomic traits simultaneously for use in seed yield prediction since breeding programs rely on identifying elite cultivars through empirical as well as prediction based approaches [24].
Given the throughput capacity of high-throughput phenotyping platforms to collect multiple sensor information simultaneously, plant scientists are often left with a highdimensional phenomic data cube [25]. The ability to handle large amounts of complex data and to capture complex nonlinear relationships between phenomic predictors and seed yield makes machine learning (ML) a viable mathematical tool [9,26]. Random forest [27] (RF), an ensemble learning ML method, provides the added benefit of using multiple decision trees to model complex trait relationships and the ability to concurrently gauge feature importance to enable the user to glean insights on how predictions were made. In addition to predicting seed yield, identifying an informative subset of predictors is important to reduce data redundancy, minimize sensor cost, and reduce the computational demand required for processing and analysis [28]. Random forest approaches provide simpler interpretability, although advances in deep learning models include explainability of features used in the models for phenotype and this is a rapidly advancing field [29]. In addition to prediction, optimization routine is needed for efficient phenomics based predictors to minimize cost and temporal requirements of data collection. Genetic algorithm (GA) is an optimization algorithm that has been used to identify informative hyperspectral wavebands for disease classification [9,26,28], wheat yield and nitrogen status prediction [30], and corn pollen shed detection [31]. GA is designed to mimic natural evolutionary processes to evaluate the performance (fitness) of a group (population) of predictors (chromosomes) and using selection theory to "breed" a new generation of individuals of each generation using a fitness metric to guide the search process so that only the most elite individuals are recombined until some criteria are met [32]. The premise of GA imparts it the ability to select a subset of hyperspectral wavebands to be concurrently deployed on multisensor payload on aerial based platforms for SY prediction, identification of useful genetic diversity [11,33,34] (for a more in-depth review on this subject see [35,36]), and breeding decisions for population advancement and line selection. While significant strides have been made in the use of the visible and near-infrared spectrum, exploring the extent of the spectrum which is currently collectable remains an elusive target. This work is motivated by the overall need to explore soybean genetic diversity for SY, development of phenomic predictors of SY across growth and development stages using multiple sensors, and data analytic approaches to glean informative pieces of information from a large dataset. Additionally, there is a need to minimize the cost and dedicated resources required for germplasm breeding efforts and to increase the operational efficiency. Therefore, the objectives of this research were (1) to explore and assess the importance of phenomic traits for SY prediction using a diverse set of 292 soybean accessions, (2) to use machine learning and optimization methods to develop prediction models enabling in-season SY prediction and identify informative subset of hyperspectral wavebands for potential phenomic applications to improve SY, and (3) to test and validate prediction models for multiple trait based SY selection. Since most of the yield prediction studies have relied on vegetation indices and canopy traits (area and temperature), we looked at an integrated approach of optimizing the selection of traits and expanding our search space to include individual wavebands.
We propose a framework that is easily transferable to different crops species and breeding program that is looking to fuse ML-based analytics and optimization tools with high-dimensional phenomics data to develop economical but scalable sensor solutions to be deployed using modern phenotyping platforms. These findings present germplasm breeders with an approach to expand testing capacity while improving the accuracy of yield estimation, critical to efficiently mine genetic diversity and drive genetic gain.

Germplasm.
We evaluated 292 diverse soybean accessions from 19 different countries adapted to the maturity group (MG) late I to early III (Table S1). Accessions were sourced from the soybean core collection of the USDA Soybean Germplasm Collection [37] and parents of the Soybean Nested Association Mapping (SoyNAM) panel [38] consisting of 260 and 32 accessions, respectively. These accessions were selected to represent the genetic diversity available to the US soybean breeders and can be classified into three genetic backgrounds (https://www.soybase.org/SoyNAM/): (1) elite, (2) diverse, and (3) plant introduction (PI). Elite cultivars consisted of public breeding lines developed by breeders across the US, diverse lines were developed through crossing elite and PI germplasm, and PI germplasm consisted of publically available lines from the USDA germplasm collection.

Experimental Design.
The data included in this study was collected across six locations over two years (2016 and 2017 growing seasons) (Table S2), and these environment-year combinations are henceforth referred to as environments. To manage spatial variability, an alpha-lattice design was created uniquely at each environment and consisted of two replications with 30 incomplete blocks. Experimental plots Plant Phenomics 3 were established with a GPS enabled ALMACO (Nevada, IA, USA) cone planter equipped with four row units (76 cm row spacing) and seeded to a length of 4.6 m with 0.9 m alley between plots. Plots were seeded at a rate of 296 K seeds ha −1 . Once seedling emergence was complete, the number of plants from a 1 m section from a randomly selected portion of the middle two rows was recorded for each plot to determine suboptimal plots for this study. Plots with low seedling emergence determined by observations more than two interquartile ranges below the first quartile were discarded (14 out of 3504 total plots across the six environments).

Phenotypic Data Collection and Processing.
In each environment, plots were phenotyped for physiomorphological (phenomic) traits at two time points during the growing season when plots reached the following approximate growth stages: S1: flowering (R1-R2) and S2: pod set (R3-R4) [39]. The inability and impracticality to collect crop growth stage specific data per plot motivated us to collect across the important crop development stages: flowering and pod set. We selected these two approximate growth stages due to the important phenological stages that impact final seed yield as suggested by previous research [2][3][4][5]. We ensured that stage specific data were collected as per the two stages by recording per genotype growth stage at each environment (from the first replication) for each set of phenotypic data collected in the study.
During the 2016 growing season, phenomic traits were collected manually using appropriate sensors and equipment. Through a preliminary study (data not presented), it was determined that four to six hours per sensor per environment was required to collect data depending on walking speed and weather conditions. To optimize data collection by minimizing time required for multiple sensor data collection, we constructed a mobile phenotyping platform similar to [15] and deployed during the 2017 growing season. All physiomorphological traits were collected from the middle two rows and data were collected by pushing/pulling the phenotyping buggy up and down passes while simultaneously collecting data from multiple sensors (canopy temperature, canopy area, and canopy spectral reflectance).
Canopy temperature (CT) was measured at four environments using a FLIR VUE Pro R (FLIR, Goleta, CA, USA) infrared camera with a 9 mm lens and 640 × 512 pixel resolution on cloudless days when wind speed was <2.24 m s −1 . The sensor was suspended 2.0 m above the soil surface in the nadir position and 8-bit JPG image recorded. Plot CT values were extracted using a custom MATLAB (R2017a) script to remove soil background values and mean thermal temperature in degrees Celsius computed for the canopy area remaining after image thresholding. CT data was then corrected for changes in ambient temperature by normalizing by pass which has been shown to increase repeatability [15].
Canopy area (CA) was determined using Canopeo app [40] in MATLAB to estimate fractional green canopy area from RGB images. JPG images were acquired using a Canon T5i camera (Canon U.S.A. Inc., Huntington, NY, USA) with an 18 to 55 mm lens suspended 2.0 m above the soil surface and 20 ∘ from nadir. One image was recorded per plot with camera lens zoom fully retracted and camera facing plot so that a landscape image was taken to capture a long transect of the middle two rows. To ensure consistent image quality, images were collected in automatic mode (Program AE) to automatically control both aperture and shutter speeds to maintain consistent exposure value.
Canopy spectral reflectance was measured using a Field-Spec5 4 Hi-Res (ASD Inc., Boulder, CO, USA) spectroradiometer which measures 2150 spectral wavebands from 350 to 2500 nm. Data was collected by positioning the fiber-optic cable 1 m above the canopy in the nadir position and two reflectance measurements were recorded from each of the middle two rows on cloudless days ±2 h of solar noon and calibrated every 20 minutes during data collection by normalizing to a white reference panel (Specralon5, Labsphere Inc., North Dutton, NH, USA).
We processed the data as follows: Data Processing Step 1: calculated average reflectance for each plot by averaging the two observations. Data Processing Step 2: computed repeatability for individual wavebands across all locations using the following equation [24]: Where 2 is the genotypic variance, 2 is the variance attributed to genotype environment interaction, 2 is the residual variance, is the number of replications, and is the number of environments. Data Processing Step 3: removed wavebands with 2 < 0.3. Data Processing Step 4: calculated vegetative indices (VI) previously characterized to be associated with different physiological traits (Table S3).
Data Processing Step 5: computed the mean reflectance across blocks of 10 nm regions (R) across the 1780 wavebands to produce 178 averaged wavebands. We chose to average every 10 nm to reduce multicollinearity between adjacent wavebands and to identify spectral regions with resolution consistent with customizable miniaturized multispectral cameras currently publicly available for research and breeding applications.
Seed yield (SY, kg ha −1 ) was measured from the middle two rows of four row plot by machine harvest using ALMACO SPC20 combine after plots had reached physiological maturity (R8). Seed moisture was measured of harvested plots to adjust plot SY values to 13% moisture. Preharvest shattering was scored for each plot on 1 (no shattering) and 5 (more than 50% of plants had more than 50% of seed loss) scale and yield observations with preharvest shattering score of ≥4 were removed from further analysis (27 out of 3504 total plots across the six environments).

Statistical Model.
A mixed linear model was fit to the alpha-lattice design to test model effects and obtain genotypic best linear unbiased predictions (BLUPs) of studied traits using the R package lme4. A mixed linear model was fit with the form: where is a vector of observed phenotypes, is the grand mean, is the effect of the th environment, is the effect of the th replicate, ( ) is the effect of the th incomplete block nested within the th replicate, is the effect of the th genotype, × is the effect of G x E, and is the residual error and is assumed to be normally and independently distributed, with mean zero and variance 2 . Assumptions of ANOVA were tested using Shapiro Wilk test and Bartlett's test using base functions in R. Residuals were normally distributed with homogenous variance. To identify inconsistencies in the data, outliers were removed by calculating studentized residuals for each observation of each trait and outliers excluded from the analysis with values ±3.
Analysis of variance (ANOVA) for seed yield was conducted to evaluate the effect of genotype, termed as fixed, and all remaining termed as random using a mixed linear model with the same as that for (2). Additionally, a two-way ANOVA Dunnett's test was used to compare PI and diverse accessions with elite genotypes as the control and adjusted P-values computed for comparison between each genotype and the control (elite genotypes). Accessions with statistically similar seed yield were defined as P > 0.05.
To deal with missing data at some locations and unbalanced sample size of phenomic information among accessions due to weather or logistical constraints during phenotyping (Table S4), genotype BLUPs were computed using two methods (also see Cross-Validation Section below): Method 1: from four out of six environments, byenvironment BLUPs, were computed as they had complete datasets.
Method 2: across-environment BLUPs were computed for all six environments.
These preprocessing steps of BLUP computation were motivated with the intention to compare phenomic prediction model accuracy when a complete training set is assembled across all environments versus a scenario where environments have sparse phenomic information. Both these scenarios are endemic to germplasm and cultivar development programs conducting multiple environment testing. Method 1 BLUPs were computed by removing all terms associated with environment, while Method 2 BLUPs were computed using (2) with all terms considered random.

Genetic Correlation and SNP-Based Heritability.
Genetic correlations ( ) between seed yield and phenomic traits were computed using multivariate mixed models [13]. SNP-based heritability (ℎ 2 ) [41] was calculated using a mixed linear model with the form: Where is a vector of BLUP phenotypic values computed from method 2 for the trait of interest, is a scalar intercept, is an incidence matrix for the random genotype term, is a vector of random effects corresponding to genotypes [ ∼ (0, 2 )] [ ∼ 0, 2], where A is the additive genomic relationship matrix [42], and E is a vector of residuals. Genotypic data for all 292 genotypes was obtained from the publicly available Illumina Infinium SoySNP50K Bead-Chip database (https://soybase.org/snps/). Single nucleotide polymorphism (SNP) markers with missing rate >10% were removed from the analyses and the remaining missing SNPs imputed using BEAGLE version 3.3.1 with default settings in synbreed package [43]. After imputation, SNPs with minor allele frequency (MAF) <5% were removed leaving 35,512 SNPs. Unlike conventional estimates of heritability, A is used to calculate marker-based genetic variance ( 2 ) associated with genotypes and ℎ 2 computed using: where 2 is the residual variance (for a more in-depth review see [13,42,44]). The R package sommer [45] was used to compute the A matrix, genetic correlation, and ℎ 2 using the built-in pin function and standard error estimates were computed simultaneously.

Phenomic Prediction Pipeline.
In this study, we developed an analytical pipeline using RF algorithm for prediction of SY (response variable) using phenomic traits (predictor variables). Predictive ability of phenomic traits for SY prediction was determined by partitioning predictor traits into three cohorts: (1) canopy (CA and CT), (2) VI, and (3) wavebands. For each cohort, predictor variables were independent factors. Models were trained using (a) canopy alone, (b) VI alone, (c) canopy and VI together, and (4) wavebands alone (see Data Processing Step 5 above). Essentially, sensor combinations that can be easily deployed onto payloads were the key driver in exploring prediction performance for these combinations of sensors. The caret package [46] implemented in R was used for model training and hyperparameters tuned using the tunelength function. To gauge model performance during training, repeated (n=5) 10-fold cross-validation was used and the coefficient of determination (R 2 ) and root mean square error (RMSE) for out-of-bag (OOB) samples are reported. Predictions were then projected onto an independent dataset (see Cross Validation section below) not included in model training and consisting of only phenomic traits. Variable importance was computed using the varImp function and mean importance is reported.

Cross-Validation (CV).
To evaluate model performance, we used two cross-validation (CV) scenarios to emulate phenomic prediction in plant breeding programs ( Figure 1): CV1: from all environments, 80% of accessions (n=234) were included in model training set and 20% (n=58) were kept in the testing set.
CV2: this was used for per environment prediction cross-validation and the four environments with complete datasets were included. For each of these four environments, 80% of accessions from the other three environments were used for model training, while 20% of accession for that environment was used for testing; i.e., for Environment#2, model training was done on 80% of random accessions from Environments# 1, 3, and 4, and testing was done on 20% of remaining accession from Environment#2. For CV1 and CV2, the training and testing procedures were repeated 10 times and the mean accuracy for each CV-Method combination is reported. Training and testing sets were compiled for each CV iteration and training data used to parameterize model and prediction made onto the test set following model training. Two preprocessing methods were used to parameterize RF prediction models (see Statistical Model section), and we then tested two CV scenarios to emulate prediction challenges faced by breeders in field trials with unbalanced data. From a practical application viewpoint, the CV1 strategy is a scenario where phenomic data is collected on all genotypes while yield is collected on a subset of lines and breeders may wish to estimate the rank performance of untested genotypes not phenotyped for yield but with available physiological trait data. The CV2 strategy is deployable where breeders are interested in predicting rank performance of untested accessions (no seed yield data) and untested environments (unseen environment) with no seed yield but with phenomic traits. The CV2 strategy is an improvement to leave-oneenvironment-out [47] situation as we excluded test genotypes from model training.
Model prediction accuracy is reported using Spearman rank correlation coefficient between observed values and predicted values of the test set computed by recording the mean values across all 10 training-testing iterations and all folds of CV. Cross-validation schemes were developed in R using in-house script.

Predictor Optimization.
To identify spectral reflectance wavebands and validate previous findings, we used a genetic algorithm (GA) optimization approach with RF-based predictor as the underlying function evaluator to identify a subset of wavebands capable of being deployed using a multispectral camera. The objective was to identify four wavebands common across the two growth stages (S1 and S2) that maximized seed yield rank correlation while deploying one multispectral camera; therefore our search space spanned the set of 356 wavebands (178 wavebands per growth stage) while ultimately picking the four most optimal wavebands. We chose to identify four wavebands as this is consistent with the current offering of third-party vendors providing customizable cameras that can be used as a selection tool for phenomic-assisted breeding selection processes. Details of the GA process are outlined in Table S5. As GA is a computationally intensive process and prior results showed higher prediction accuracy using Method 1 BLUPs, we limited future analyses to this subset and therefore only Method 1 results are presented. Furthermore, the GA approach was not used in Method 2 (for developing a regression model) due to insufficient dataset size. Using the same training and testing data in the aforementioned phenomic prediction section and once terminal conditions were met, a RF model was retrained and prediction performance assessed by predicting seed yield on the complete testing set using the four selected wavebands and Spearman rank correlation 6 Plant Phenomics was computed. To supplement wavebands, we selected the VI with the highest in the respective CV scenario and canopy (temperature and area) traits for each CV scenario. In addition to reporting Spearman rank correlation for the test set, we measured breeding success outcome given a hypothetical selection intensity of 20% through a confusion matrix: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). From these values, classification metrics relevant to plant breeding were computed from the confusion matrix output: Spearman rank correlation and confusion matrix results are reported from a study of the effect of training population size using variable training set size: 80% (234 genotypes), 60% (175 genotypes), 40% (117 genotypes), and 20% (58 genotypes) for the optimized RF prediction model in the two CV procedures. Mean predictive performance was assessed for each training population size.

Seed Yield Performance.
A significant effect of genotype, environment, and their interaction was observed (Table S6). Mean SY of 2113 kg ha −1 was observed across the 292 accessions with elite germplasm (4008 kg ha −1 ) having superior SY followed by diverse (3570 kg ha −1 ) and PI (1968 kg ha −1 ). The extent of seed yield performance was extensive: 566-3537 kg ha −1 within the PI cohort, 2979-3991 kg ha −1 within diverse accessions, and 3335-4542 kg ha −1 within the elite accessions. Three diverse accessions were not significantly different compared to the mean performance of the elite accessions. While the most extensive trait variation was observed for PI, there was an overlap in performance of the three groups ( Figure 2). PI597482 (from South Korea) had the highest SY (3537 kg ha −1 ) within the cohort.

Phenomic-Enabled Yield Prediction.
Overall, we observed the following trends: (1) phenomic data collected at two growth stages during the growing season was predictive of SY rank at maturity, (2) the use of by-environment BLUPs had improved prediction accuracy compared to using acrossenvironment BLUPs for predicting seed yield, (3) RF model had improved prediction accuracy when training data was included from the same environment in which the test genotypes were evaluated, and (4) a wide range in prediction accuracy was observed among predictor cohorts demonstrating the need for identification of the best predictors to optimize sensor deployment (Figure 4). Higher rank correlation in CV1 was observed when compared to CV2, and higher rank correlation in Method 1 was observed in comparison to Method 2. The fourway classification of Method (1 and 2) and CV (1 and 2) showed that there was an increase in rank correlation from canopy (0.35) < waveband (0.49) < VI (0.67) < canopy + VI (0.68) (Figure 4). Canopy rank correlation increased by 62% with the addition of VIs (canopy + VI) and minimal change was observed between canopy + VI and VI (<1% difference). Method 1 (training set using by-environment BLUPs) had 18% higher rank correlation than Method 2 (across-environment BLUPs). CV1 (unknown accessions) had 22% higher rank correlation when compared to CV2 (unknown accession in unknown environment). Maximum rank correlation was observed for canopy + VI in Method 1 (0.76) and Method 2 (0.68). Moderate rank correlation (0.49) was observed using 178 raw reflectance wavebands per growth stage. When wavebands were considered, higher rank correlation was observed in Method 1 compared to Method 2 and CV1 compared to CV2 (34% higher in each). Variable importance analysis revealed CA and VREI2 were most important for models trained using canopy and VIs, respectively (Table S8). Wavebands in the visible to near-infrared region were most important overall and were consistent across CV scenarios and preprocessing methods (Figure 3(c)). Wavebands collected at S2 growth stage had higher importance than those collected in S1. Waveband 715 nm was identified as the most important across all growth stages. In Method 1, wavebands in the shortwave infrared region were also important to model prediction.

Phenomic Predictor Optimization and Its Application.
The majority of selected wavebands GA step were in the visible region: 405 nm, 435 nm, 705 nm, 715 nm, two in nearinfrared region: 795 nm, 815 nm, and one in the shortwave infrared region: 2255 nm. The most predictive bands for CV 1 were 435 nm, 705 nm, 815 nm, 2255 nm, while for CV2 were 405 nm, 705 nm, 715 nm, 795 nm. Based on our results on and feature importance analysis, and the ease of deployment of different sensors, VREI2, CA, and CT were chosen along with most predictive wavebands for testing their SY prediction performance ( Figure 5). Prediction performance (Spearman correlation) of CV1 and CV2 was 0.74 and 0.33, respectively. A slight increase in rank performance was noticed in CV1 when GA generated bands were used (rank correlation increased by 0.03) and a slight decrease observed in CV2 (rank correlation decreased by 0.11). High specificity (SPE) was observed among all models ranging from 0.81 to 0.94 and was slightly higher for models trained in CV1 (0.92) compared to CV2 (0.87). Similarly, moderate to high F score (FS) and balanced accuracy (BAC) was observed for all CVmodel combinations with higher values for CV1 compared to CV2.
As the amount of training data was reduced from 80% to 20%, models including wavebands + VI + canopy have consistently higher performance for rank correlation (28% higher) and classification metrics (18% higher). Spearman rank correlation decreased slightly for both models trained in CV1 (waveband + VI + canopy: 0.04 reduction, wavebands alone: 0.07 reduction) when comparing prediction performance trained using 80% of the data when compared to using just 20%. Minimal decrease in SPE was observed with just an average decrease in performance of 0.01 when using the minimum amount of training data, compared to using 80%. The largest change was observed for BAC and FS with an average reduction of 0.03 and 0.06, respectively. The largest change was observed when wavebands alone were used for model training in CV2 resulting in a 10% and 26% reduction in BAC and FC, respectively.

Discussion
Breeders and geneticists aim to utilize previously unused genetic accessions in cultivar development, and phenomicassisted breeding approaches have the potential to enhance the integration of genetic diversity in most mainstream programs [36]. Phenomic-assisted approaches can allow breeders to manipulate the genetic gain equation, particularly genetic variation and selection intensity. For improving SY using diverse accession, as a first step, there is a need to establish the relationship between phenomic traits with SY using high-throughput phenotyping techniques and advanced data analytics including machine learning [9]. These approaches need to work in conjunction with in-season SY prediction, but more importantly performance ranking that is the crux of trait selection in plant breeding programs. We identified a cohort of PI accessions with high yield, further demonstrating the wealth of genetic diversity available to soybean breeders in the germplasm collection. These results are consistent with a broader body of research demonstrating the utility of germplasm collection for modern breeding efforts for biotic [48][49][50] and abiotic [51][52][53] resistance and performance traits [33,[54][55][56]. The presence of genetic variation for SY makes this panel of 292 accessions relevant for study objectives as it covers a broader range of performance and background.

Phenomic-Enabled Yield Prediction. Moderate to high ℎ 2
for all traits suggest that phenomic trait measurements are repeatable making them useful in plant breeding pipelines. Spearman-rank correlation coefficient was used to assess model test performance as plant breeders are generally focused on correctly identifying top performers in early to mid-stages of testing pipeline instead of predicting actual SY [23]. The identification of best predictors for phenomicenabled rank correlation is important to maximize prediction accuracy thereby maximizing the detection of useful germplasm for use in cultivar development and also for selection of pure lines in breeding families from multienvironment tests.
Plant breeders often rely on multienvironment trials to evaluate cultivar performance in a target environment, quantify GxE interaction, and/or determine cultivar stability [57]. On average, we observed 18% higher prediction accuracy when training data consisted of BLUPs generated on a by-environment basis when compared to using acrossenvironment BLUPs. The use of mixed models for computing BLUPs is a staple in plant breeding statistical analyses and a main feature of the method is its ability to handle missing or unbalanced data, a common occurrence in multienvironment trials (MET) [24]. When complete data is generated in all environments, a single stage analysis [58] is preferred to preserve the environmental effect in the data. Nonetheless, assembling complete data in all environments is often not the case and therefore relying on the properties of the BLUP method is necessary to remove the experimental design effect from the estimates and simultaneously taking advantage of the amendable variance-covariance structure for genotypeby-environment (GxE) interactions [24]. Additionally, there is a setting off of prediction based selection and resource optimization which are popularizing experimental designs such as partial replication design in plant breeding programs [59]. The RF model accuracy was 22% higher when prediction was made in locations included in model training. We observed that RF models had higher prediction accuracy when by-environment BLUPs were used in model training; moderate accuracy levels were still attainable even when environments with sparse data were included in model training indicating the reaction norm across locations for phenomic trait relationships with SY was somewhat consistent in each environment. These findings demonstrate the impact that environment has on genotype performance and is evidence of the importance for having training data in environments reflective of the target breeding area.
The variation in prediction accuracy among predictor cohorts across the two preprocessing methods and two CV scenarios suggests that multiple trait information can help gain operational efficiencies. We observed moderate (S1: 0.33, S2: 0.25) between CA and SY is lower than previous studies [17] although the trait genetic correlation was observed in a biparental population. CT exhibited negative (-0.44) with yield and shows congruence with previous studies [16,53,54]. We observed dissimilarity between some phenomic traits with previously reported [5,17] canopy traits (CA and CT) produced only modest prediction accuracies. We observed a significant improvement when VIs were included in the model. Among VIs, VREI2 had the largest in magnitude (S1: -0.77, S2: -0.75) and is associated with chlorophyll concentration, water content, and canopy leaf area [60] and lends support to the utility of VREI2 as a yield predictor VI [11] since gain in genetic yield potential in soybean has been associated with an increase in canopy chlorophyll concentration [2,4,61]. Moreover, we report moderate to high in the shortwave infrared region, a region associated with plant water potential [62]. Research in wheat [63,64] and corn [18,65] using VIs associated with plant water content in shortwave infrared waveband regions has shown good correlation with yield; however, similar reports in soybean are lacking warranting additional investigation to associate shortwave infrared canopy spectral reflectance with yield especially to develop water deficit tolerant cultivars. Since majority of 292 accessions belonged to PI accessions, it was not surprising to see the value of chlorophyll based VI as an important predictor. For cultivar development programs, the role of chlorophyll based VI needs to be investigated prior to implementation in breeding selection.
The combination of high repeatability and genetic correlation makes phenomic traits useful in indirect selection for SY. Additionally, our results reveal that canopy spectral reflectance wavebands can be useful for yield prediction as reported by [19] and suggest that informative wavebands may be identified to design a multispectral camera for use in extremely high-throughput aerial-based phenotyping. Phenomic prediction has the potential to disrupt conventional breeding testing pipelines by integrating information on important biological processes across a spatiotemporal scale to enable in-season yield assessment and optimizing plant breeding operation efficiencies [7] and requires an interdisciplinary approach.

Phenomic Predictor Optimization and Its Application.
Optimizing the deployment of phenomic sensors specific to the breeding target is an important objective to maximize prediction accuracy while reducing the operational costs associated with data collection. However, there remains a gap in the current understanding of the utility of a multisensor approach for SY prediction to identify the optimal sensors for use in soybean germplasm breeding efforts.
Our results show the utility of canopy spectral reflectance for use in SY rank prediction using wavebands and VIs and are consistent with previous research findings made in soybean [11,20,21,66] and other crop species [19,30,67,68] for trait prediction; however, the utility of waveband reflectance as a predictor has not been extensively studied. Therefore, we chose to identify four wavebands which can allow the design of multispectral camera consistent with the current options available from industry providers offering customizable waveband selection of multispectral cameras. To do this, a genetic algorithm (GA) approach was used to identify wavebands for SY prediction. GA has been used for a wide variety of objectives in agriculture for variable and waveband selection [28,32,69,70] but limited work has been done for use in prediction of SY. Research has shown good prediction performance of models using all measured wavebands in wheat [19], but our results suggest that a subset can be used to achieve comparable prediction performance (Figure 4). This finding is likely due in part to the multicollinearity associated with neighboring wavebands allowing a subset of wavebands to capture the variation in entire electromagnetic spectrum [30,71]. While previously the waveband regions we report have been shown to be correlated in the visible and near-infrared regions of the electromagnetic spectrum [11,21,66], GA methodology enabled us to identify specific wavebands for SY prediction. The observation of wavebands in the shortwave infrared region important for yield prediction warrants additional research to explore this portion of the electromagnetic spectrum along with the need for future research to determine the physiological basis of wavebands and their prediction. The next step in SY prediction deployment in a breeding pipeline is the motivation to increase model prediction accuracy by combining multiple sensors as well as resolving challenges on spectral reconstruction from images [72,73]. While selected hyperspectral wavebands can be deployed on highthroughput phenotyping platforms using multispectral cameras, a multisensor approach needs to be tested to determine if it can maximize model prediction accuracy.
Past studies have established the use of single sensorbased prediction methods in plant breeding activities [14, 16, 18-20, 65, 74, 75] and multisensor based prediction in wheat [15]; however, there is little information on the use of multisensor based prediction in soybean. Thus, we selected VI VREI2, CA, and CT as these traits can be collected in tandem with a multispectral camera and have demonstrated strong and/or moderate to high feature importance to SY. Thus, we observed maximum prediction accuracy when a multisensor based model was used for prediction of SY. Thus, we propose this framework to deploy a multisensor based approach by relying on feature importance parameters and optimization procedures to maximize target trait prediction accuracy.
To determine the value of these approaches for use in plant breeding operations we varied the training/testing split and used a hypothetic selection intensity of 20%; both operational decisions breeding programs attempt to optimize [23]. These findings indicate that, when training data is collected from the same environments in which testing is done, phenomic prediction can be effective to correctly rank genotypes for SY. Moreover, high SPE (ability of the model to correctly identify accessions that did not meet our imposed selection criteria according to ground-truth yield data) was achieved regardless of both the CV scenario and the amount of training data used. While only slightly lower performance was observed for other classification metrics (BAC and FS), our results continue to suggest the efficacy of such phenomic prediction methodologies for breeding decision making. We anticipate that phenotyping and data analytics operability difficulties may need to be resolved for multiple sensor payload and balancing with area coverage of aerial systems and real-time of quick-turn around analytics and remain an area of research interest.
In order for phenomic traits to be informative predictors of target traits high genetic correlation among targetpredictor traits ( ) and high predictor trait heritability (ℎ 2 ) [23] are needed. Continued work is needed to provide insight into the attribution of phenomic traits for phenomic predictive ability and establishing the biological and physiological association between target traits with predictor traits. Future research is warranted to determine program and trait specific predictors, and such research requires larger datasets. As the hardware and analytics pipelines advance through continued improvement in high-throughput phenotyping, larger datasets will be achievable.
As a selection tool, our approach permits SY rank prediction and will allow the evaluation of specific trait efficiencies to identify useful germplasm on a per-trait basis and design future crossing combinations that assemble desirable traits together. This is a keystone concept in the process of physiological trait based breeding [76,77]. Overall, our findings suggest that a customized suite of phenomic sensors can advance germplasm and cultivar breeding efforts while reducing the cost and resource requirements and advance the integration of phenomic-assisted breeding approaches. The approach we propose can be utilized in breeding programs to identify informative waveband combinations tailored to the specific breeding objective for the design of customizable multispectral sensors. Our approach can be utilized as standalone but does not preclude the use of wavebands that have been traditionally used to compute various VIs.
While GS and other modern tools will remain an attractive arsenal in a breeder toolbox, the cost of GS assisted breeding can be out of reach for majority of programs in minor crops and in non-GM crops [7] and therefore cost affordable phenomic-assisted breeding approaches present exciting avenues for trait improvement including a multiobjective optimization scenario [78].

Data Availability
Data are available upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.

Authors' Contributions
Kyle Parmley and Asheesh K. Singh conceptualized the research; Kyle Parmley and Koushik Nagasubramanian conducted statistical analyses with contributions from Asheesh K. Singh, Baskar Ganapathysubramanian, and Soumik Sarkar; Kyle Parmley and Asheesh K. Singh wrote the manuscript with inputs from all coauthors.
phenotyping. Computing support from ISU HPC computing clusters is sincerely acknowledged. Table S1: description of accessions, country of origin, and genetic background included in this study. 292 accessions were selected from the USDA Soybean Core Collection from MGI-III. Table S2: description of testing environment locations, planting date, seed yield (SY) performance, and climactic summary statistics. Soybean accessions were phenotyped in these environments for use in downstream phenomic prediction. Table S3: description of vegetation indices (VI) computed from canopy hyperspectral reflectance. Observations consisted of two measurements recorded within 2 hours of solar noon and mean reflectance averaged. VIs were used alongside other phenomic information for inseason seed yield prediction [79][80][81][82][83][84][85]. Table S4: description of phenotypic traits and instruments used for phenotypic characterization of a diverse panel of soybean evaluated in six environments. Table S5: details of genetic algorithm (GA) procedure used for selection of hyperspectral wavebands for identifying the most informative wavebands to allow intelligent design of a miniaturized hyperspectral camera for deployment on high-throughput phenotyping platforms. Table S6: ANOVA results of fixed effects for mixed linear model where seed yield (SY) was the response variable. SY was collected from 292 genotypes grown in six environments across central Iowa and measured by combine harvest. Table  S7: genetic correlation ( ) and SNP-based heritability of phenomic traits and seed yield and phenomic trait, respectively. Phenomic information was collected from 292 diverse soybean accessions grown in six environments across central Iowa and data collected during the growing seasons at two approximate growth stages. Table S8: phenomic traits feature importance computed from random forest model using two cross-validation scenarios while seed yield was used as the response variables. Phenomic traits were collected at two approximate growth stages and used to predict seed yield during the growing season to enable in-season selection. Feature importance was used to select the most informative vegetation indices and to identify other useful predictors of seed yield. Table S9: Spearman rank correlation obtained after random forest model prediction (seed yield = dependent variable) performance of predictors trained with remotely sensed phenomic traits (canopy traits, waveband, vegetation indices, and combination) in 292 soybean genotypes grown at six environments and data collected at two growth stages in each environment. Tabular data correspond to Figure 4. Table  S10: Spearman rank correlation and classification metrics of random forest model test prediction using only optimized wavebands and selected canopy traits. Applicability of using phenomic prediction in plant breeding operations was tested using four training/testing splits (80/20, 60/40, 40/60, and 20/80) and performance metrics were computed for each split. Seed yield and phenomic predictor trait data were collected from 292 genotypes grown in six environments and data collected at two growth stages in each environment. Tabular data correspond to Figure 5. (Supplementary Materials)