Sample size to evaluate the degree of multicollinearity in rye morphological traits

ABSTRACT Investigation of multicollinearity allows parameters in multivariate analysis to be estimated with higher precision and with biological interpretation. In order to generate reliable estimates of the degree of multicollinearity, it is necessary to use an appropriate sample size. Thus, the objectives of this study were to determine the sample size (number of plants) necessary to estimate the indicators of the degree of multicollinearity - condition number (CN), correlation matrix determinant (DET), and variance inflation factor (VIF) - in morphological traits of rye and to verify the variability of the sample size between the indicators. Five and three uniformity trials were conducted with the cultivars BRS Progresso and Temprano, respectively. Eight morphological traits were evaluated in 780 plants in eight trials. For each trial, 22 cases were selected among the 28 formed by the combination of eight traits, taken six by six, totaling 176 cases. In each case, 197 sample sizes were planned (20, 25, 30, ..., 1,000 plants). For each size, 2,000 resampling procedures with replacement were performed, CN, DET, and VIF were determined in each resample, and the average of the 2,000 estimates was calculated. For each case and indicator (CN, DET, and VIF), the sample size was determined through three models: modified maximum curvature method and linear and quadratic segmented models with plateau response. Sample size varies among the indicators, with larger sample sizes required for DET, followed by CN and VIF, in that order, with at least 180, 116 and 85 plants, respectively.


INTRODUCTION
Rye (Secale cereale L.) belongs to the Poaceae family, with important use of its grains in human and animal diet, as a soil cover crop (SAPIRSTEIN; BUSHUK, 2016) and as forage crop (BAIER, 1994), with early supply of fodder at the end of autumn (PAULINO; CARVALHO, 2004), a time when other winter forage cereals are not yet at the ideal point for grazing. It is a crop with interesting characteristics to integrate crop rotation systems. It has high resistance to diseases and drought, tolerance to sandy and acidic soils (MORRISON, 2016), assists in the maintenance of soil water content (BASCHE et al., 2016) and exerts allelopathic or retarding effect on the germination of spontaneous plants (ABOU CHEHADE et al., 2021).
Breeding strategies can be obtained by knowing the correlation between crop characteristics (LAIDIG et al., 2017), and multivariate statistical techniques can be used as auxiliary tools. For the estimates of the parameters of the analysis to be reliable, it is necessary to assess the degree of multicollinearity between the predictor traits. Multicollinearity can be interpreted as the strong relationship between predictors and affects the precision with which coefficients are estimated (GUJARATI; PORTER, 2011; MONTGOMERY; PECK; VINNING, 2012). Inadequate interpretation of the parameters in canonical correlation analysis (ALVES; CARGNELUTTI FILHO; BURIN, 2017) as well as results with no biological meaning and estimates with no interpretation in path analysis (TOEBE; CARGNELUTTI FILHO, 2013) have been observed in studies conducted in the presence of multicollinearity.
Given the importance of the diagnosis of multicollinearity, it needs to be accurately estimated, which can be achieved using adequate sample size. The determination of sample size for agronomic characteristics has been carried out in studies with rye (BANDEIRA et al., 2018a; 2018b) and showy rattlepod (TOEBE et al., 2017a), as well as in the estimation of the correlation between traits of maize (OLIVOTO et al., 2017a) and parameters in path analysis in cherry tomato (SARI et al., 2018). In these studies, larger sample sizes promote greater precision, with reduced gain above the sample size determined. For rye, no studies determining the sample size necessary for the diagnosis of multicollinearity were found. In a study with rye crop, the diagnosis of multicollinearity was made with 128 observations (NOURAEIN, 2019), whereas in other crops, such as wheat (JANMOHAMMADI; SABAGHNIA; NOURAEIN, 2014), maize (OLIVOTO et al., 2017a; 2017b), showy rattlepod (TOEBE et al., 2017a), cherry tomato (SARI et al., 2018) and sunflower (FOLLMANN et al., 2019), the diagnosis was made with 45 to 1,180 observations. Therefore, the diagnosis of multicollinearity has been performed with different sample sizes, which generates estimates of lower or higher precision. Some inferences have been made regarding sample size in the diagnosis of the degree of multicollinearity in maize traits (OLIVOTO et al., 2017a), as well as investigations regarding the interference of multicollinearity in path analysis in maize (TOEBE; CARGNELUTTI FILHO, 2013) and cherry tomato (SARI et al., 2018). Additionally, Olivoto et al. (2017a) and Sari et al. (2018) pointed out that insufficient sample sizes incorrectly estimate the degree of multicollinearity. However, these studies did not determine the appropriate sample size for estimating multicollinearity in rye traits.
This study was conducted given the varied number of observations used in the diagnosis of multicollinearity and the existing indications that larger sample sizes are needed. It is assumed that it is possible to determine the sufficient sample size (number of plants) for the diagnosis of the degree of multicollinearity and that this size differs between the indicators condition number, determinant and variance inflation factor. Thus, the objectives of this study were to determine the sample size (number of plants) necessary to estimate the indicators of the degree of multicollinearity - condition number (CN), determinant (DET) and variance inflation factor (VIF) - in morphological traits of rye and to assess the variability of sample size between the indicators.

MATERIAL AND METHODS
Eight uniformity trials were conducted with rye crop (Secale cereale L.), consisting of five sowing times with the cultivar BRS Progresso (T1, T2, T3, T4 and T5) and three sowing times with the cultivar Temprano (T6, T7 and T8) in the winter crop season of 2016. These trials were conducted in an experimental area located in Santa Maria - RS (29º42' S, 53º49' W and 95 m altitude). According to Köppen's classification, the climate of the region is Cfa - humid subtropical, with hot summers and no defined dry season (ALVARES et al., 2013). The soil of the region is classified as Argissolo Vermelho distrófico arênico (Ultisol) (SANTOS et al., 2018).
The experimental area was homogeneously prepared and soil fertility was corrected with the application of 500 kg ha⁻¹ of fertilizer (5-20-20 NPK formulation). Two rye cultivars were sown: BRS Progresso, intended for grain production; and Temprano, intended for soil cover and as forage plant. The seeds of each cultivar were sown broadcast in an area of 320 m² (20 m × 16 m) in the first sowing time, whereas in the other sowing times, each cultivar was sown in an area of 375 m² (25 m × 15 m).
The sowing times were planned to meet the recommendation of planting from March to July (BAIER, 1994). For both cultivars and at all sowing times, a density of 455 seeds m⁻² was used. Top-dressing fertilization was performed when the plants were between the stages of three and four developed leaves, using 25 kg ha⁻¹ of nitrogen. The other cultural practices were carried out according to the need and to the management recommendations for rye crop (BAIER, 1994).
In each uniformity trial, 100 plants at physiological maturity were randomly collected, except for trials four and eight. In these trials, 90 plants were evaluated, corresponding to the cultivar BRS Progresso in the fourth sowing time and the cultivar Temprano in the third sowing time. In each plant, the following morphological traits were evaluated: number of stems plant⁻¹ (NSP = main stem + tillers); number of nodes plant⁻¹ (NNP = sum of the number of nodes of the stems); number of nodes stem⁻¹ (NNS = NNP/NSP); plant stem length, in cm (PSL = average length of stems); plant peduncle length, in cm (PPL = average length of stem peduncles); plant ear length, in cm (PEL = average length of ears); main stem height, in cm (MSH); and plant stem height, in cm (PSH = average height of the stems). PPL was defined as the stem portion between the last node and the ear insertion in the stem; PEL as the portion between the ear insertion in the stem and the last spikelet; and MSH and PSH as the portion between the base of the plant and the last spikelet. In this study, the data of plants of each trial were considered as the master sample.
Rev. Caatinga, Mossoró, v. 36, n. 1, p. 215-225, jan.-mar., 2023
For each trial, 28 cases were planned, obtained by combining eight traits taken six by six (Table 1). In each case, with the data from the master sample, the degree of multicollinearity was estimated by the indicators condition number (CN), correlation matrix determinant (DET), and variance inflation factor (VIF). CN was obtained as the ratio between the highest eigenvalue ($\lambda_{max}$) and the lowest eigenvalue ($\lambda_{min}$) of the correlation matrix (GUJARATI; PORTER, 2011),

$$CN = \frac{\lambda_{max}}{\lambda_{min}}$$

and classified as weak (CN ≤ 100), moderate to strong (100 < CN ≤ 1,000) and severe multicollinearity (CN > 1,000) (MONTGOMERY; PECK; VINNING, 2012). Problems due to multicollinearity may exist for DET lower than 0.00001 (FIELD, 2009) and for VIF greater than or equal to ten, where

$$VIF_j = \frac{1}{1 - R_j^2}$$

and $R_j^2$ is the multiple coefficient of determination of a given variable with the other explanatory variables (GUJARATI; PORTER, 2011). CN and DET are indicators with interpretation for all variables, while VIF has the advantage of informing the variance inflation for each variable, and the highest VIF value was considered in this study.
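The three indicators can be computed directly from a correlation matrix. The sketch below is only an illustration, assuming NumPy is available; the function names `indicators_from_corr` and `classify_cn` are hypothetical and are not taken from the study, whose analyses were run in R. The VIF values are read from the diagonal of the inverse correlation matrix, which is equivalent to 1/(1 − R_j²).

```python
import numpy as np

def indicators_from_corr(corr):
    """Return (CN, DET, highest VIF) for a correlation matrix."""
    corr = np.asarray(corr, dtype=float)
    eigenvalues = np.linalg.eigvalsh(corr)   # ascending order for symmetric matrices
    cn = eigenvalues[-1] / eigenvalues[0]    # lambda_max / lambda_min
    det = np.linalg.det(corr)
    vif = np.diag(np.linalg.inv(corr))       # VIF_j = 1 / (1 - R_j^2)
    return float(cn), float(det), float(vif.max())

def classify_cn(cn):
    """Classes of multicollinearity by condition number, as cited in the text."""
    if cn <= 100:
        return "weak"
    if cn <= 1000:
        return "moderate to strong"
    return "severe"
```

For an exactly singular correlation matrix, `np.linalg.inv` raises `LinAlgError`, so a real analysis would need to handle that situation separately.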
Table 1. Traits combined in each case obtained by combining eight morphological traits of rye (Secale cereale L.) and the respective degree of multicollinearity (condition number - CN) of the master sample for each trial (two cultivars at different sowing times), evaluated in the 2016 season, Santa Maria, RS, Brazil.
The sample size was determined for estimating the indicators of the degree of multicollinearity -CN, DET and VIF -for each of the 176 cases.For this, in each case, 197 sample sizes were planned.The first planned sample size was composed of observations of 20 plants.The other planned sample sizes were obtained with the increment of five plants, up to the last size, containing 1,000 plants.Thus, in each case, the sample sizes of 20, 25, 30, ..., 1,000 plants were planned.Then, for each planned sample size, 2,000 resampling procedures with replacement were performed, and CN, DET and VIF were estimated in each one.After that, the mean degree of multicollinearity of each indicator in each planned sample size was calculated.
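The resampling scheme described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the R code of the study; the function names, the `indicator` callable (any function that maps a list of plants to CN, DET or VIF) and the fixed seed are assumptions made for the example.

```python
import random

def planned_sample_sizes(start=20, stop=1000, step=5):
    """Planned sample sizes: 20, 25, 30, ..., 1,000 plants."""
    return list(range(start, stop + 1, step))

def mean_indicator_by_size(master, indicator, sizes, n_resamples=2000, seed=1):
    """Average an indicator over resamples with replacement, per planned size.

    master    : list of plants (e.g. tuples of trait values) of one trial
    indicator : hypothetical callable returning one number for a sample
    """
    rng = random.Random(seed)
    means = {}
    for size in sizes:
        total = 0.0
        for _ in range(n_resamples):
            sample = rng.choices(master, k=size)  # draw plants with replacement
            total += indicator(sample)
        means[size] = total / n_resamples
    return means
```

The dictionary returned by `mean_indicator_by_size` pairs each planned size with the mean of the resampled estimates, which is the input expected by the sample size models.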
Finally, three models were fitted: modified maximum curvature method (MMCM), segmented linear model with plateau response (LMPR) and segmented quadratic model with plateau response (QMPR). In these three models, the mean of the indicator (CN, DET or VIF) (dependent variable, Yᵢ) was fitted as a function of the planned sample sizes (independent variable, Xᵢ). For each case, indicator and model (176 × 3 × 3 = 1,584 situations), the sample size (n), the degree of multicollinearity obtained in the fitting corresponding to n (CN(n), DET(n) and VIF(n)) and the adjusted coefficient of determination (R²ₐ) were determined.
Coefficients a and b for MMCM were determined by the expression of Equation 1, where Xᵢ is the independent variable, that is, the planned sample sizes (20, 25, 30, ..., 1,000 plants), and Yᵢ is the dependent variable referring to the value (mean of 2,000 estimates) of each indicator of the degree of multicollinearity.
The sample size (n) was determined according to Equation 2 (MEIER; LESSMAN, 1971) and the estimate of the multicollinearity corresponding to n according to Equation 3, where a and b are the model parameters.
Regarding the functions with plateau response, Equation 4 was considered for LMPR and Equation 5 was considered for QMPR, where Xᵢ is the independent variable, that is, the planned sample sizes (20, 25, 30, ..., 1,000 plants); Yᵢ is the dependent variable referring to the value (mean of 2,000 estimates) of the degree of multicollinearity of each indicator; a, b and c are the parameters of the models; εᵢ is the error associated with the i-th observation; P is the plateau; and n is the estimate of the sample size and the point of union between the two functions.
The n parameter was determined considering the union between the two lines for LMPR and QMPR according to Equation 6. For the estimation of the degree of multicollinearity corresponding to n (Y(n)), the estimate of P was considered for LMPR and Equation 7 was considered for QMPR, where â, b̂ and ĉ are the estimates of the model parameters.
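As an illustration of how a quadratic model with plateau response yields n, the sketch below fits one common parameterization by grid search. It rests on assumptions that are not confirmed by the text: the curve and the plateau are taken to join continuously and smoothly at n, so that Y = P + c·(min(X, n) − n)², and the candidate values of n are restricted to the planned sample sizes. The function name `fit_quadratic_plateau` is hypothetical and this is not the fitting routine used in the study.

```python
def fit_quadratic_plateau(x, y):
    """Grid-search fit of y = P + c * (min(x, n) - n)**2.

    Expanding the square gives a = P + c*n**2 and b = -2*c*n for x <= n,
    so the join point satisfies n = -b / (2*c) and the plateau is P.
    """
    y_bar = sum(y) / len(y)
    best = None
    for n in sorted(set(x))[1:]:            # keep at least one point on the curve
        w = [(min(xi, n) - n) ** 2 for xi in x]
        w_bar = sum(w) / len(w)
        sww = sum((wi - w_bar) ** 2 for wi in w)
        swy = sum((wi - w_bar) * (yi - y_bar) for wi, yi in zip(w, y))
        c = swy / sww                       # least-squares slope on w
        plateau = y_bar - c * w_bar         # least-squares intercept = P
        sse = sum((yi - plateau - c * wi) ** 2 for wi, yi in zip(w, y))
        if best is None or sse < best["sse"]:
            best = {"n": n, "plateau": plateau, "c": c, "sse": sse}
    return best
```

For a fixed n the model is linear in P and c, which is why each candidate can be solved by simple least squares. A nonlinear least-squares routine would estimate n on a continuous scale instead.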
For each trial, indicator and model, the minimum, maximum and mean values of the sample size (n), of the degree of multicollinearity obtained in the fitting of the model for n (Y(n) = CN(n) or DET(n) or VIF(n)) and of the adjusted coefficient of determination (R²ₐ) were calculated among the 22 cases. The means of R²ₐ for each indicator were taken into account for choosing the model to be used in the inference of n. After defining the model, the mean estimates of the sample size of each trial were compared through the Scott-Knott means comparison test, at 5% significance level, and the means of the sample size among the indicators of the same model were compared at 5% significance level by Student's t-test for independent samples. The fits by QMPR of the degree of multicollinearity of the three indicators, for the cases of lowest and highest degree of multicollinearity, were graphically presented. Statistical analyses were carried out in R software (R CORE TEAM, 2019).

RESULTS AND DISCUSSION
The existence of a severe degree of multicollinearity (CN > 1,000) (MONTGOMERY; PECK; VINNING, 2012) was verified in the master sample of the trials when considering the eight morphological traits of rye, with values of condition number (CN) higher than 1.5×10¹⁶ (Table 1). Similarly, severity was verified for all cases when combining seven traits. For the 28 cases obtained by the combination of six traits, weak (CN ≤ 100), moderate to strong (100 < CN ≤ 1,000) and severe multicollinearity (CN > 1,000) (MONTGOMERY; PECK; VINNING, 2012) were observed for the master sample.
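The number of cases generated by each combination size can be checked with a short enumeration. The sketch below uses the trait abbreviations defined in Material and Methods; the variable names are illustrative, and it does not reproduce how 22 of the 28 six-trait cases were selected.

```python
from itertools import combinations

# Trait abbreviations as defined in Material and Methods
TRAITS = ["NSP", "NNP", "NNS", "PSL", "PPL", "PEL", "MSH", "PSH"]

cases_six = list(combinations(TRAITS, 6))     # eight traits taken six by six
cases_seven = list(combinations(TRAITS, 7))   # eight traits taken seven by seven

print(len(cases_six), len(cases_seven))
```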
The degree of multicollinearity obtained by the CN, correlation matrix determinant (DET) and variance inflation factor (VIF) of the master sample, in the 22 cases and in each trial, was presented only in summarized form in Table 2, which showed: 12.36 ≤ CN ≤ 1,401.96; 0.000019 ≤ DET ≤ 0.165307; and 3.25 ≤ VIF ≤ 196.27, with greater variability of multicollinearity estimates among the cases observed for the indicator DET (coefficient of variation, CV ≥ 163.46%). The estimates obtained by the other two indicators also showed high variability, but of lower magnitudes (63.99% ≤ CV ≤ 87.74% for CN and 63.51% ≤ CV ≤ 89.01% for VIF). This variability of multicollinearity estimates was due to the cases, which are formed by the combination of eight traits taken six by six. A study with rye characteristics to assess the relationship between grain yield and yield and morphological components reported variability in the estimates of multicollinearity (1.37 ≤ VIF ≤ 452), and traits with VIF > 10 were removed from the regression model (NOURAEIN, 2019). In a trial with wheat crop, it was not necessary to eliminate traits because VIF was lower than 1.46 (JANMOHAMMADI; SABAGHNIA; NOURAEIN, 2014). However, in maize, the VIF estimate was higher than 195.58, using all observations or mean values per plot in the diagnosis of multicollinearity (OLIVOTO et al., 2017b).
No studies with rye crop in which diagnoses were made by CN or DET were found. In other crops, low degree of multicollinearity was observed in sunflower traits (CN = 9.64) (FOLLMANN et al., 2019) and severe multicollinearity was observed in morphological traits of showy rattlepod (CN = 1,113.08) (TOEBE et al., 2017a) and maize hybrids (CN > 1,000) (TOEBE; CARGNELUTTI FILHO, 2013; OLIVOTO et al., 2017b; TOEBE et al., 2017b). The DET was used for the diagnosis in maize traits using all observations (DET = 3.02×10⁻⁶) and mean values of plots (DET = 1.26×10⁻⁷) (OLIVOTO et al., 2017b). In cherry tomatoes, DET values between 0.00002 and 0.02500 were obtained in a study on the impact of sample size on the degree of multicollinearity (SARI et al., 2018). These studies demonstrate that high estimates of multicollinearity can be obtained, regardless of the indicator.
Both in this study and in the other studies presented above, the estimates varied from the absence of multicollinearity to high multicollinearity. As multicollinearity can be defined as the relationship between traits (MONTGOMERY; PECK; VINNING, 2012), the different levels of multicollinearity are due to the traits and their interrelationships. Thus, the researcher should conduct a survey in the literature and choose sets of traits that capture as much as possible the variability of the phenomenon under study and that have the lowest degree of multicollinearity. Therefore, it is important to know the traits that, when combined, can cause collinearity, so that they are not measured only to be eliminated later when conducting multivariate analysis. Thus, in order to avoid evaluating traits that may cause problems due to multicollinearity, the researcher should choose traits from any of the cases with CN ≤ 100 (Table 1).
The sample sizes (n) in each case and trial were obtained by fitting the degree of multicollinearity according to the sample size using three models, and the means for each trial are presented in Table 3. The worst fits for the three indicators were verified in the modified maximum curvature method (MMCM). For this model, the mean adjusted coefficients of determination (R²ₐ) of each trial and indicator differed from those of the other two models at the 5% probability level by Student's t-test for independent samples (Table 4): 0.56 ≤ R²ₐ ≤ 0.68, 0.58 ≤ R²ₐ ≤ 0.74 and 0.50 ≤ R²ₐ ≤ 0.64 for the indicators CN, DET and VIF, respectively.
The segmented linear (LMPR) and quadratic (QMPR) models with plateau response showed estimates of R²ₐ ≥ 0.74 and very similar to each other, but with superiority of the means of R²ₐ for QMPR in the fit of CN and VIF, at 5% probability level (Table 4). For this model, the mean estimates of R²ₐ between the trials for the CN, DET and VIF indicators were 0.90 ≤ R²ₐ ≤ 0.91, 0.80 ≤ R²ₐ ≤ 0.92 and 0.90 ≤ R²ₐ ≤ 0.92, respectively. Models are considered to be of good fit when R²ₐ values are greater than 0.80.
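For reference, the adjusted coefficient of determination used to compare the models can be computed from observed and fitted values as sketched below. The function name is hypothetical, and the number of parameters to count for each model (`n_params`, including the intercept or plateau) is an assumption left to the user, since the text does not state it.

```python
def adjusted_r2(observed, fitted, n_params):
    """Adjusted R² = 1 - [SSE / (N - n_params)] / [SST / (N - 1)].

    n_params is the number of estimated parameters, intercept included
    (an assumption of this sketch, to be set per model).
    """
    n_obs = len(observed)
    mean_obs = sum(observed) / n_obs
    sse = sum((o - f) ** 2 for o, f in zip(observed, fitted))   # residual sum of squares
    sst = sum((o - mean_obs) ** 2 for o in observed)            # total sum of squares
    return 1.0 - (sse / (n_obs - n_params)) / (sst / (n_obs - 1))
```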
Due to the superiority obtained by QMPR in fitting the degree of multicollinearity as a function of sample size, this model was selected to determine n for the CN, DET and VIF indicators.For each indicator, the fits by QMPR of the trials that had the lowest and highest degree of multicollinearity among the data of the master sample were presented in graphs (Figure 1).
The n necessary for the diagnosis of the degree of multicollinearity between morphological traits of rye, obtained through QMPR, varied among the 176 cases (8 trials × 22 cases trial⁻¹). For CN, the mean n among the trials and cases was 116, with the mean values of n within each trial in the range 97 ≤ n ≤ 141. The estimation of the degree of multicollinearity by the DET indicator requires a larger sample size (n), or number of plants, with an overall mean of 180 and trial means of 36 ≤ n ≤ 859. Among the three indicators, the detection of the multicollinearity degree by VIF requires the lowest n, with an overall mean of 85 plants and trial means of 68 ≤ n ≤ 99. Due to the significant differences in n means between indicators, it can be affirmed that there is variability among the estimates by CN, DET and VIF indicators in morphological traits of rye. This demonstrates the need to also consider, in the experimental planning, the indicator to be used in the diagnosis of multicollinearity. Given the significant difference and aiming at greater precision, the larger n should be used, with n = 180 plants (mean number of plants obtained by DET).
It can also be observed that there is variability in the sample size estimates to detect the degree of multicollinearity among the trials (T1 to T8). Thus, sowing time has an effect on the average estimates of n in the same cultivar (T1 to T5 for the cultivar BRS Progresso and T6 to T8 for the cultivar Temprano) and among the cultivars. When comparing the estimates of n between sowing times, there was no consistent pattern in which trial had the highest mean of n from one indicator to another. Considering the CN indicator, the highest means were observed in the trials corresponding to the first sowing time in both cultivars (T1 and T6) and the second sowing time for the cultivar Temprano (T7); for DET, the highest means were observed in the trials corresponding to the fifth sowing time for BRS Progresso (T5) and the second sowing time for Temprano (T7); and for VIF, in the second sowing time in both cultivars (T2 and T7). Effects of sowing time and rye cultivar were also observed in studies to determine the sample size to estimate the mean value of morphological traits and in flowering stage (BANDEIRA et al., 2018a; 2018b).
No studies with rye crop in which the sample size was determined for the diagnosis of multicollinearity were found. Some inferences have been made in studies with maize and cherry tomato, indicating that insufficient sample sizes could incorrectly estimate the degree of multicollinearity (OLIVOTO et al., 2017a; SARI et al., 2018).
Olivoto et al. (2017a) point out that problems caused by multicollinearity can be mitigated by using all observations to generate the correlation matrix, instead of using the mean values. The authors used data considering all observations or grouped data for the mean and found that the lower the number of observations (use of means), the greater the inaccuracy in the estimates. Sari et al. (2018) found the need for sample sizes greater than 45 plants to estimate multicollinearity by the DET indicator, with a 5% probability of error using the bootstrap methodology with a 95% confidence interval, and that when using sample size greater than 135 plants there would be no interference of the sample in the diagnosis of the degree of multicollinearity.
However, the present study found, for morphological traits of rye, the need for a sample size of at least 180 plants, a value higher than that reported by Sari et al. (2018) in cherry tomato traits. This difference may be associated with the species, evaluated traits or methodology used in the determination of sample size. Either way, the results of this study point to the need for a sample size greater than 135 plants in rye. Therefore, further investigations on sample size for the diagnosis of multicollinearity in the most diverse agricultural crops should be carried out to check for possible variability of n.
Significant differences between the means of sample size, at 5% significance level, were verified by Student's t-test for independent samples, in the comparisons of CN × DET, CN × VIF and DET × VIF (Table 4), considering the values of 176 sample sizes (8 trials × 22 cases trial⁻¹) for each indicator. These results confirm that higher values of n or number of plants are necessary when using the DET indicator, followed by CN and VIF, for the diagnosis of the degree of multicollinearity in correlation matrices of rye morphological traits.
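The statistic behind a comparison of this kind can be sketched as follows. The example assumes the classical pooled-variance form of Student's t-test for independent samples; whether the study used this form or a variant with unequal variances is not stated, and the function name is hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t_statistic(sample_a, sample_b):
    """Student's t statistic and degrees of freedom for two independent samples.

    Assumes equal variances in the two groups (pooled variance), which is an
    assumption of this sketch rather than a detail confirmed by the text.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    df = n_a + n_b - 2
    pooled_var = ((n_a - 1) * variance(sample_a) + (n_b - 1) * variance(sample_b)) / df
    t = (mean(sample_a) - mean(sample_b)) / sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return t, df
```

Under this assumption, comparing two indicators with 176 sample sizes each would give 350 degrees of freedom; the p-value then comes from the t distribution, which is not computed here.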
In this study, it was found that different sample sizes are necessary when diagnosing multicollinearity by the indicators condition number, correlation matrix determinant and variance inflation factor in morphological traits of rye. Aiming at greater precision, larger sample sizes should be prioritized, adopting the average size determined for the correlation matrix determinant indicator (n = 180 plants). As a method to determine the sample size, it is not recommended to use the modified maximum curvature method, but rather the segmented quadratic model with plateau response. Other models should be investigated for the possibility of use in determining the sample size.

CONCLUSIONS
There is variability in sample size between the indicators condition number (CN), correlation matrix determinant (DET) and variance inflation factor (VIF) for the diagnosis of the degree of multicollinearity in morphological traits of rye, with increase in the following order: VIF, CN and DET, which require at least 85, 116 and 180 plants, respectively.If there is interest in greater precision, a larger sample size should be prioritized, with the adoption of sample size obtained for the DET indicator.
Eight traits combined six by six: C(8, 6).