Number of trials to estimate the condition number in rye traits

ABSTRACT Multicollinearity must be diagnosed in multivariate analyses. Among the available indicators, the condition number quantifies the degree of multicollinearity. This study sought to determine the number of measurements (trials) necessary to estimate the condition number of linear correlation matrices between rye traits. Five uniformity trials were carried out with ‘BRS Progresso’ rye, and eight morphological traits and eight productive traits were evaluated, forming two groups. In each group of traits, six cases (combinations of traits) were planned and the multicollinearity diagnosis was performed. Repeatability analyses were performed using analysis of variance, principal component analysis, and structural analysis, and the number of measurements (trials) was determined for different levels of precision. Higher repeatability coefficients of the condition number were obtained by the principal component methods (based on correlation and variance and covariance matrices) and structural analysis based on the variance and covariance matrix. More measurements (trials) are necessary to estimate the condition number in productive traits than in morphological ones. One trial is enough to estimate the condition number with a minimum accuracy of 80% in morphological and productive traits of rye, whereas at least three trials are required for 95% accuracy.


INTRODUCTION
Multivariate analysis techniques allow researchers to better understand phenomena described by multiple measures of the studied individuals. More reliable parameter estimates are obtained when assumptions are met, and in multivariate analyses, multicollinearity must be investigated (HAIR et al., 2009). Multicollinearity can be understood as the linear relationship between traits; when present at high levels, performance and parameter prediction decrease in most linear methods because information is shared between the traits (DORMANN et al., 2013). In most multivariate techniques, multicollinearity increases the variance of the estimated parameters (FIGUEIREDO FILHO et al., 2011), resulting in parameter estimates of low reliability (HAIR et al., 2009), overestimated statistics and excessive false positives (GOODHUE; LEWIS; THOMPSON, 2017), or even an inadequate interpretation of the results (ALVES; CARGNELUTTI FILHO; BURIN, 2017; DORMANN et al., 2013; TOEBE; CARGNELUTTI FILHO, 2013).

MATERIALS AND METHODS
Five uniformity trials (without applying treatments) were conducted with rye (Secale cereale L.); the cultivar utilized was 'BRS Progresso', which is intended for grain production (EMBRAPA, 2013). The experimental area belongs to the Department of Plant Science of the Federal University of Santa Maria (29°42'S, 53°49'W; 95 m altitude). The climate of the region is classified as humid subtropical Cfa, with hot summers and no defined dry season, according to Köppen (ALVARES et al., 2013). The soil of the region is classified as Typical Dystrophic Brunogray Argisol (Argissolo Bruno-Acinzentado distrófico típico) (SANTOS et al., 2018).
In each uniformity trial (sowing season), 100 plants were randomly collected and evaluated, except in T4 (fourth season), in which 90 plants were evaluated, totaling 490 plants. The evaluations were performed on the stems of each plant collected (the primary stem and secondary stems or tillers), obtaining values for eight morphological and eight productive traits. In total, 1,136 stalks were evaluated (i.e., 193, 370, 242, 169, and 162 in T1, T2, T3, T4, and T5, respectively). The values were obtained by counting the number of nodes, spikelets, and grains per spike; measuring the length of the stem, stalk, and spike (cm); and weighing the fresh and dry mass of the stem, stalk, spike (grain and straw mass), and grain (g).
The following morphological traits were evaluated in each plant: 1) plant height (cm), the mean distance from the base of the plant to the last spikelet across all stalks of the plant; 2) stem length (cm), the mean distance from the base of the plant to the flag leaf node across all stalks of the plant; 3) peduncle length (cm), the mean distance between the flag leaf node and the spike insertion in the peduncle across all stalks of the plant; 4) fresh mass of the aerial part (g), the mean mass of the aerial part of all stalks of the plant; 5) total fresh mass of the aerial part (g), the sum of the mass of the aerial part of all stalks of the plant; 6) the ratio of the mean fresh mass of stalk + leaves + peduncle to the total fresh mass of the aerial part; 7) number of stalks, the main stem plus the number of tillers; and 8) number of nodes per stem, the number of nodes of the plant divided by the number of stalks.
The following productive traits were evaluated in each plant: 1) spike length (cm), the mean length of the spikes on the plant; 2) grain mass (g), the sum of the grain mass of all spikes on the plant; 3) 100-grain mass (g); 4) number of grains, the sum of the number of grains on all spikes on the plant; 5) number of grains per spike, the number of grains on the plant divided by the number of spikes on the plant; 6) number of spikelets, the sum of the number of spikelets in all spikes on the plant; 7) number of spikelets per spike, the number of spikelets on the plant divided by the number of spikes on the plant; and 8) the ratio of the mass of grains per stem to the total fresh mass of the aerial part.
Six cases were planned for each trait group (morphological and productive), formed by combinations of the eight traits (p = 8) taken p_i at a time, C(p, p_i), with p_i = 2, 3, 4, 5, 6, and 7 traits. That is, in each trait group, the first case, identified as case 2, comprised the 28 combinations of eight traits taken two at a time (C(8, 2) = 28). In the following cases, one trait was added at a time, giving combinations of three (C(8, 3)), four (C(8, 4)), and so forth, up to the last case with seven combined traits (C(8, 7)). Therefore, 28, 56, 70, 56, 28, and 8 combinations were obtained for the cases containing 2, 3, 4, 5, 6, and 7 traits, respectively. A total of 492 combinations were obtained, 246 in the cases for the morphological trait group and 246 in the cases for the productive trait group.
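The combination counts above can be checked with a few lines of Python. This is only an illustrative sketch: the trait labels are hypothetical placeholders, not the names used in the study.

```python
from itertools import combinations
from math import comb

P = 8                # traits per group (morphological or productive)
CASES = range(2, 8)  # cases with 2 to 7 combined traits


def combinations_per_case(p=P, cases=CASES):
    """Return {number of combined traits: C(p, number of combined traits)}."""
    return {k: comb(p, k) for k in cases}


counts = combinations_per_case()
per_group = sum(counts.values())  # combinations within one trait group
total = 2 * per_group             # two trait groups

# Enumerate the actual trait subsets for case 2 (hypothetical labels).
traits = [f"trait{i}" for i in range(1, P + 1)]
case2 = list(combinations(traits, 2))

print(counts)
print(per_group, total, len(case2))
```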
Next, the condition number (CN) estimates were obtained for each combination within each case, trait group, and trial. The CN was obtained as the ratio of the largest (λ_max) to the smallest (λ_min) eigenvalue of Pearson's linear correlation matrix between the traits (CRUZ; REGAZZI; CARNEIRO, 2012; GUJARATI; PORTER, 2011). As a rule of thumb, the CN indicator divides multicollinearity into classes: weak (CN ≤ 100); moderate to strong (100 < CN ≤ 1,000); and severe (CN > 1,000) (MONTGOMERY et al., 2012).
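As an illustration of this definition, the sketch below computes CN from a plants × traits data matrix with NumPy and applies the three classes. The data are simulated and the function names are hypothetical; the study itself used Microsoft Excel and R.

```python
import numpy as np


def condition_number(data):
    """CN = lambda_max / lambda_min of the Pearson correlation matrix.

    data: 2-D array with plants in rows and traits in columns.
    A perfectly collinear set of traits gives lambda_min = 0 (CN undefined).
    """
    corr = np.corrcoef(data, rowvar=False)
    eig = np.linalg.eigvalsh(corr)  # eigenvalues in ascending order
    return eig[-1] / eig[0]


def classify(cn):
    """Rule-of-thumb classes quoted in the text."""
    if cn <= 100:
        return "weak"
    if cn <= 1000:
        return "moderate to strong"
    return "severe"


# Simulated example: two strongly related traits plus an unrelated one.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 1))
data = np.hstack([x, x + 0.1 * rng.normal(size=(100, 1)),
                  rng.normal(size=(100, 1))])

cn = condition_number(data)
print(round(cn, 1), classify(cn))
```

For two traits the eigenvalues of the correlation matrix are 1 + |r| and 1 − |r|, so CN = (1 + |r|)/(1 − |r|); with only two traits, CN exceeds 100 only when |r| is above roughly 0.98.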
Repeatability analysis for CN was performed in each case (combined traits) and trait group (morphological and productive), totaling 12 repeatability analyses (six cases × two trait groups). The different combinations of traits within each case and trait group were considered to be the observed "subjects" and the trials (sowing seasons) the "repeated measures".
As an example, to estimate the repeatability coefficient (rc) of CN in case 2 of the morphological trait group, the 140 CN estimates (28 combinations of eight traits taken two at a time × five trials) were arranged in a matrix of 28 rows (combinations) and 5 columns (trials). The same number of estimates was obtained for the productive trait group because it also comprises eight traits. Therefore, for each trait group, 140, 280, 350, 280, 140, and 40 CN estimates were obtained for the cases with 2, 3, 4, 5, 6, and 7 combined traits, respectively.
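The layout of subjects (combinations) by repeated measures (trials) can be sketched as follows. The trial data here are random numbers generated only to show the shapes involved, and `cn_matrix` is a hypothetical helper, not a function from the study.

```python
from itertools import combinations

import numpy as np


def cn_matrix(trials, k):
    """CN estimates for every combination of k traits in every trial.

    trials: list of 2-D arrays (plants x p traits), one array per trial.
    Returns an array with C(p, k) rows (combinations) and one column per trial.
    """
    p = trials[0].shape[1]
    combos = list(combinations(range(p), k))
    out = np.empty((len(combos), len(trials)))
    for j, data in enumerate(trials):
        corr = np.corrcoef(data, rowvar=False)
        for i, cols in enumerate(combos):
            sub = corr[np.ix_(cols, cols)]  # correlation matrix of the subset
            eig = np.linalg.eigvalsh(sub)   # ascending eigenvalues
            out[i, j] = eig[-1] / eig[0]
    return out


# Simulated stand-in for five trials with eight traits each.
rng = np.random.default_rng(1)
plants_per_trial = [100, 100, 100, 90, 100]
trials = [rng.normal(size=(n, 8)) for n in plants_per_trial]

m2 = cn_matrix(trials, 2)
m7 = cn_matrix(trials, 7)
print(m2.shape, m2.size, m7.shape, m7.size)
```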
For each of the cases with 2, 3, 4, 5, 6, and 7 combined traits and each trait group (morphological and productive), the rc and coefficient of determination (R²) were estimated by analysis of variance (ANOVA), principal components based on the correlation matrix (PCR), principal components based on the variance and covariance matrix (PCS), structural analysis based on the theoretical eigenvalue of the correlation matrix (SAR), and structural analysis based on the theoretical eigenvalue of the variance and covariance matrix (SAS) (CRUZ; REGAZZI; CARNEIRO, 2012).
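The text names these estimators without stating their formulas. As a rough sketch, assuming the forms commonly given for the correlation-based versions, PCR uses rc = (λ1 − 1)/(η − 1), with λ1 the largest eigenvalue of the correlation matrix between the η trials, and SAR uses rc = (α'Rα − 1)/(η − 1), with α = (1/√η, ..., 1/√η) as the theoretical eigenvector. The data and function names below are hypothetical.

```python
import numpy as np


def rc_pcr(x):
    """Repeatability from the largest eigenvalue of the correlation matrix.

    x: 2-D array with subjects (combinations) in rows and trials in columns.
    Assumed form: rc = (lambda_1 - 1) / (eta - 1).
    """
    eta = x.shape[1]
    corr = np.corrcoef(x, rowvar=False)
    lam1 = np.linalg.eigvalsh(corr)[-1]
    return (lam1 - 1) / (eta - 1)


def rc_sar(x):
    """Repeatability from the theoretical eigenvector of the correlation matrix.

    Assumed form: rc = (alpha' R alpha - 1) / (eta - 1), alpha_j = 1/sqrt(eta).
    This equals the mean of the off-diagonal correlations.
    """
    eta = x.shape[1]
    corr = np.corrcoef(x, rowvar=False)
    alpha = np.full(eta, 1 / np.sqrt(eta))
    return (alpha @ corr @ alpha - 1) / (eta - 1)


# Simulated CN-like matrix: 28 combinations x 5 trials.
rng = np.random.default_rng(2)
subject = rng.normal(10, 3, size=(28, 1))      # stable combination effect
cn_est = subject + rng.normal(size=(28, 5))    # plus trial-to-trial noise

print(round(rc_pcr(cn_est), 3), round(rc_sar(cn_est), 3))
```

Because λ1 is the maximum of α'Rα over unit vectors, the PCR estimate under these assumed forms is never below the SAR estimate.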
In the ANOVA method, the following model was considered:

$$CN_{ij} = m + C_i + E_j + \varepsilon_{ij} \qquad (1)$$

where: CN_ij is the estimate of the condition number for the i-th combination in the j-th trial; m is the general mean; C_i is the effect of the i-th combination under the influence of the trial; E_j is the effect of the trial at the j-th measurement; and ε_ij is the experimental error established by the effects of the j-th trial in the i-th combination.
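For this additive model, repeatability is usually taken as the share of variance due to combinations, rc = σ²_C/(σ²_C + σ²_ε), with σ²_C = (MS_C − MS_res)/η from a two-way ANOVA without replication. The sketch below follows that assumption; the text does not spell out the estimator, and the data are simulated.

```python
import numpy as np


def rc_anova(x):
    """Repeatability from a two-way ANOVA without replication.

    x: 2-D array with combinations in rows and trials in columns.
    Assumed estimator: rc = s2_c / (s2_c + ms_res),
    with s2_c = (ms_comb - ms_res) / eta.
    """
    n, eta = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1, keepdims=True)  # combination means
    col_means = x.mean(axis=0, keepdims=True)  # trial means
    resid = x - row_means - col_means + grand
    ms_comb = eta * ((row_means - grand) ** 2).sum() / (n - 1)
    ms_res = (resid ** 2).sum() / ((n - 1) * (eta - 1))
    s2_c = (ms_comb - ms_res) / eta
    return s2_c / (s2_c + ms_res)


# Simulated example: 28 combinations x 5 trials.
rng = np.random.default_rng(3)
comb_effect = rng.normal(0, 3, size=(28, 1))
trial_effect = rng.normal(0, 1, size=(1, 5))
x = 10 + comb_effect + trial_effect + rng.normal(size=(28, 5))

rc = rc_anova(x)
print(round(rc, 3))
```

Because the trial effect is removed as a separate source of variation, adding a constant to every value of one trial leaves rc unchanged.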
The mean rc and R² of the cases were compared between trait groups (morphological and productive) within each method by the Student's t-test for independent samples at a 5% significance level.
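With six cases per trait group, a pooled-variance t-test has 6 + 6 − 2 = 10 degrees of freedom, which matches the value reported in the table notes; the fractional 6.15 reported there for PCS is what an unequal-variance (Welch-type) correction would produce. The sketch below computes both statistics with the standard library only. The rc values are invented for illustration and are not the study's estimates.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t(a, b):
    """Student's t statistic and degrees of freedom (equal variances)."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * variance(a) + (n2 - 1) * variance(b)) / (n1 + n2 - 2)
    t = (mean(a) - mean(b)) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


def welch_t(a, b):
    """Welch's t statistic and Satterthwaite degrees of freedom."""
    v1, v2 = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1 ** 2 / (len(a) - 1) + v2 ** 2 / (len(b) - 1))
    return t, df


# Hypothetical mean rc per case (cases 2 to 7) in each trait group.
morphological = [0.95, 0.96, 0.97, 0.96, 0.95, 0.94]
productive = [0.90, 0.91, 0.89, 0.92, 0.90, 0.88]

t_p, df_p = pooled_t(morphological, productive)
t_w, df_w = welch_t(morphological, productive)

T_CRIT_10DF = 2.228  # two-sided 5% critical value of t with 10 df
print(round(t_p, 2), df_p, round(df_w, 2), abs(t_p) > T_CRIT_10DF)
```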
For each case, method, and trait group, the number of measurements or trials (η_m) needed to estimate the condition number at different coefficients of determination (R² = 0.80, 0.85, 0.90, 0.95, and 0.99) was determined using the equation below (CRUZ; REGAZZI; CARNEIRO, 2012):

$$\eta_m = \frac{R^2 (1 - rc)}{(1 - R^2)\, rc} \qquad (2)$$

where: η_m is the number of measurements (trials), rc is the repeatability coefficient, and R² is the coefficient of determination (R² = 0.80, 0.85, 0.90, 0.95, and 0.99). The means of η_m of the cases were compared between the trait groups (morphological and productive) within each method and R² by the Student's t-test for independent samples at a 5% significance level. The analyses were performed using Microsoft Excel® and R software (R CORE TEAM, 2021).
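A short sketch of this calculation, assuming the usual expression η_m = R²(1 − rc)/((1 − R²) rc) and rounding up to whole trials. The rc values passed in are examples only, and the function names are hypothetical.

```python
from math import ceil

R2_LEVELS = (0.80, 0.85, 0.90, 0.95, 0.99)


def n_trials(rc, r2):
    """Number of measurements needed to reach determination r2 given rc."""
    if not 0 < rc < 1 or not 0 < r2 < 1:
        raise ValueError("rc and r2 must lie strictly between 0 and 1")
    return r2 * (1 - rc) / ((1 - r2) * rc)


def trials_table(rc, levels=R2_LEVELS):
    """Whole number of trials (rounded up) for each precision level."""
    return {r2: ceil(n_trials(rc, r2)) for r2 in levels}


# Example values of rc (illustrative, not taken from the study's tables).
low = n_trials(0.707, 0.80)   # about 1.66, so two trials
high = n_trials(0.90, 0.95)   # about 2.11, so three trials
print(round(low, 2), round(high, 2))
print(trials_table(0.90))
```

The required number of trials grows quickly as R² approaches 1, which is why the 0.99 level is far more demanding than 0.95 for the same rc.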

RESULTS AND DISCUSSION
The number of combined traits provided different estimates for the condition number (CN), with the highest values in the cases with the highest number of combined traits (Table 1). In combinations with two traits (case 2), weak multicollinearity was observed, with means of 2.61 (1.0 ≤ CN ≤ 16.6) and 5.85 (1.0 ≤ CN ≤ 78.3) in the morphological and productive trait groups, respectively. In case 7, however, the CN means were roughly 163 and 80 times higher than the means in case 2, with values of 426.78 (75.6 ≤ CN ≤ 1,067.5) and 468.25 (144.7 ≤ CN ≤ 1,170.8), respectively. The percentage of combinations with CN ≤ 100 decreased toward the cases with a higher number of combined traits. An extreme case was observed in case 7 (seven traits combined) of the productive trait group, with no combination with CN ≤ 100.
Regardless of the trait group, we observed that, on average, the CN increased as more traits were used in the multicollinearity diagnosis. Additionally, the ranges were larger in the cases with more combined traits. This may be related to using all possible combinations in each case: the greater the number of traits in the group under analysis, the greater the chance of including strongly related traits. In an extreme case, high CN (CN ≥ 144.7) was verified in all combinations of productive traits in case 7. Multicollinearity can be understood as the linear relationship between traits and information sharing (DORMANN et al., 2013). A high magnitude of the correlation between traits can indicate the presence of high multicollinearity levels, and researchers must pay closer attention when correlations exceed |r| = 0.7 (DORMANN et al., 2013).
Based on five trials, repeatability coefficients (rc) equal to or greater than 0.707 were observed for CN estimation, with a minimum accuracy of 92.4% in predicting its true value (R² = 0.924), regardless of the number of combined traits (cases), trait group, and repeatability method (Table 2). A high value of R² indicates that the mathematical model used to determine repeatability was efficient (CAVALCANTE et al., 2012).
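The link between these two figures can be checked by hand. Assuming the usual relation between the number of trials and the coefficient of determination, η_m = R²(1 − rc)/((1 − R²) rc), solving for R² and substituting η_m = 5 and the rounded rc = 0.707 gives:

```latex
R^2 = \frac{\eta_m \, rc}{1 + (\eta_m - 1)\, rc}
    = \frac{5 \times 0.707}{1 + 4 \times 0.707}
    = \frac{3.535}{3.828}
    \approx 0.923
```

This is in line with the reported minimum of 0.924; the small difference is presumably due to rc being rounded to three decimals.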
Regardless of the method and trait group, the mean R² of the cases ranged between 0.959 and 0.997. These high R² values indicate that all methods accurately estimated rc. Differences between the mean R² of the cases among the trait groups were verified by the Student's t-test for independent samples at a 5% significance level when the repeatability analysis was performed using PCR, PCS, and SAR. Hence, these methods estimated CN repeatability more accurately in morphological traits than in productive ones. No repeatability studies were found for the rye crop or for CN estimation. Similar high values and magnitudes of R² for the same trait were observed [...]
When analyzing the rc estimates between cases in each trait group and method, no pattern was observed with the decreasing or increasing number of combined traits. However, lower rc magnitudes were observed only in case 7 and only when estimated by ANOVA and SAS. Similar to what was observed with the mean R², higher rc means were observed in the group of morphological traits compared to the group of productive traits according to the Student's t-test for independent samples at a 5% significance level when rc was obtained by PCR, PCS, and SAR.
In summary, the highest repeatability coefficients of the condition number were estimated by the principal component methods (based on the correlation and the variance and covariance matrices, PCR and PCS, respectively) and by structural analysis based on the correlation matrix (SAR). The mean estimates of the repeatability coefficient do not differ between the groups of morphological and productive traits when obtained by analysis of variance (ANOVA) and structural analysis based on the variance and covariance matrix (SAS), but the mean is higher for the group of morphological traits when the coefficient is estimated by principal components based on the correlation matrix (PCR), principal components based on the variance and covariance matrix (PCS), and structural analysis based on the theoretical eigenvalue of the correlation matrix (SAR).
The present study obtained the highest repeatability values for CN estimation using principal component methods (PCR and PCS) and structural analysis based on the correlation matrix (SAR). These methods seem suitable for rc estimation for CN in rye traits. High repeatability estimates indicate that, with a relatively small number of measurements, it is possible to estimate the true value of a given trait (CARGNELUTTI FILHO et al., 2004). This is because the higher the rc estimate, the greater the predictability that values very close to the estimates of previous events will occur in subsequent measurements (CRUZ; REGAZZI; CARNEIRO, 2012).
The minimum number of measurements or trials (η_m) for CN estimation varied according to the method, the case (number of traits combined), the level of precision (R², coefficient of determination), and the trait group (morphological and productive) (Table 3). Comparing the mean η_m of the cases between morphological and productive traits in each method and R², we observed significantly higher η_m in the group of productive traits with the PCR, PCS, and SAR methods at all R² levels (R² = 0.80, 0.85, 0.90, 0.95, and 0.99) according to the Student's t-test for independent samples at a 5% significance level.
Given the recommendation to use principal component methods and the high rc and R² values obtained in this study, the principal component methods were used in inferences for the number of trials (η_m) to estimate CN. It should be emphasized that, in this study, the estimates of η_m were similar to each other by the PCR, PCS, and SAR methods (Table 3).
The PCR, PCS, and SAR methods indicate that a single trial is enough to estimate CN in morphological traits with at least 95% accuracy. In productive traits, at the same level of accuracy (minimum R² of 0.95), up to three trials are needed, depending on the number of traits. The number of trials obtained in this study is lower than those reported as necessary to evaluate agronomic traits in other crops. For 95% accuracy, three to 12 evaluation cycles (CAVALCANTE et al., 2012), 11 to 49 measurements in elephant grass genotypes (SOUZA et al., 2017), two evaluations in wheat (PAGLIOSA et al., 2014), and two to 18 measurements in soybean (MATSUO et al., 2012) were found to be necessary.
Fewer trials can be used, although the researcher must give up accuracy. A single trial can therefore be used to estimate CN with 80% accuracy in almost all cases and trait groups. The exception occurs in the cases with seven morphological and productive traits when the rc of the CN is determined by the ANOVA and SAS methods.
It is up to the researcher to choose the adequate number of measurements (trials), considering the availability of material, manpower, and the desired precision. When defining the number of trials, the results of previous experiments and studies of sample size, plot size, relationships between traits, multicollinearity diagnosis, and other information about the crop must be considered.
Using a greater number of traits may result in greater predictability in CN estimation. On the other hand, it may lower the precision of the parameter estimates of multivariate analyses, because a higher number of traits also leads to a higher degree of multicollinearity, requiring some procedure to reduce CN to values below 100. In most multivariate techniques, parameter estimates become unreliable in the presence of multicollinearity (HAIR et al., 2009) or the results are misinterpreted (ALVES; CARGNELUTTI FILHO; BURIN, 2017; DORMANN et al., 2013; TOEBE; CARGNELUTTI FILHO, 2013); this is because information is shared among the traits, which increases the variance of the estimated parameters (FIGUEIREDO FILHO et al., 2011).
A larger number of trials is needed to estimate CN in productive traits than in morphological traits, although using different numbers of trials for each group is not practical. Thus, in experiments with rye or any other crop, adopting a single number of trials facilitates planning and experimental conduct. Adopting the highest η_m value ensures that the minimum precision is reached regardless of the trait group.
For the morphological and productive traits of 'BRS Progresso' rye, a single trial is enough to estimate the CN with 80% accuracy, except for the case with seven combined traits when the repeatability analysis is performed using the ANOVA and SAS methods. When higher accuracy is sought, at least three trials are required to estimate CN with 95% accuracy, regardless of the number of traits and trait group.

CONCLUSIONS
1. Fewer trials are needed for the cases with a higher number of combined traits. However, the larger the number of traits, the larger the condition number estimate will also be;
2. One trial is enough to estimate the condition number with at least 80% accuracy in morphological and productive traits of rye. At least three trials are necessary for 95% accuracy.

Figure 1 - Representation of a rye (Secale cereale L.) plant and details of the evaluated parts

R²: coefficient of determination; PCS: principal components based on the covariance matrix; SAR: structural analysis based on the correlation matrix; SAS: structural analysis based on the covariance matrix.

Table 1 - Minimum, mean, median, maximum, and range (maximum - minimum) of the condition number and percentage of combinations with weak multicollinearity (PCWM) in combinations of morphological and productive traits (cases) in five trials of 'BRS Progresso' rye (Secale cereale L.)

Case¹ | NCT² | No. of Trials | No. of Comb.³ | Minimum | Mean | Median | Maximum | Range | PCWM (%)⁴

¹Case: number of traits combined. ²NCT: number of combinations in each case and trial. ³No. of Comb.: number of combinations in each case × number of trials. ⁴PCWM: percentage of combinations with weak multicollinearity, CN ≤ 100 (MONTGOMERY et al., 2012).

¹Case: number of combined traits. ²Means of the number of trials (measurements) with different letters in the row (comparison between groups of morphological and productive traits within each method and level of accuracy, R²) differ at a 5% significance level according to the Student's t-test for independent samples with 10 degrees of freedom (except in the PCS method, with 6.15 degrees of freedom).