Linear mixed model to identify the relationship between grain yield and other yield related traits and genotype selection for sorghum

Sorghum is the most popular crop in arid and semi-arid areas, especially in Sub-Saharan African countries. Genotype effects, environmental and the interaction of genotype by environmental factors have an influence on phenotypic traits. The aim of the study is to identify the relationship between grain yield and other yield-related traits and select the genotypes which perform better in grain yield as well as to examine the association between the uncorrelated phenotypic traits and grain yield via mixed model. The data was generated using a lattice square design. Principal component analysis was used to generate uncorrelated variables for the mixed model. The study revealed that there was a difference in grain yield due to the treatment and there was a pairwise relationship among the phenotypic variables. 77.12% of the total variance of the original phenotypic variables was explained by the first three principal components and decided to use PCAs as input variables for the mixed model. All PCs had significant effects on grain yield as well as grain yield variability due to random effects associated with genotypes, genotype interaction by treatment, and replication within the treatment. The variability of grain yield due to genotype effect was explained about 45.73%, the variation of grain yield due to the interaction of genotype by the treatment was also explained about 39.06% and 1.55% of the variation of grain was explained by replication within treatment. The best performer genotypes recommended for mass production were G40 (Genotype 40), G186 (Genotype 186) and G196 (Genotype 196) without any constraint of environment. The genotypes recommended for mass production under irrigation conditions were G40 (Genotype 40), G62 (Genotype 62) and G192 (Genotype 192). G26 (Genotype 26), G55 (Genotype 55) and G49 (Genotype 49) were the genotypes recommended for mass production under stress conditions. Overall, the study recommends using a mixed model to fit the grain yield, and future work will focus on to evaluate the performance of genotypes under different environments and years of production.


Introduction
Sorghum (Sorghum bicolor (L.) Moench) is an important staple food for people in tropical arid and semi-arid regions of Africa, Asia and South America [1][2][3]. As a result, it is found to be the fifth most important cereal crop after wheat, maize, rice and barley [4][5][6]. It is also one of the most popular cereals in production around arid and semi-arid areas, especially in sub-Saharan African countries. It is the most important cereal crop in moisture-deficit areas of eastern Ethiopia and is widely distributed throughout the country. Out of the grain cereal crops, sorghum is a major food crop in Ethiopia and is high in land coverage and volume of production [7,8].
Phenotypic traits are the response of the genotypic effects, environmental and the interaction of genotype by environmental factors and the traits are measured using phenotypic characteristics which are multivariate data as they are used for the phenotypic traits and also correlated to each other. Correlation analysis is very important to measure the association between yield and some other component traits and also indicates collinearity among the predictors of the yield traits. When there are correlations, which are interdependence among the independent variables of the grain yield, between two independent variables that the values of the pairwise analysis [1,9].
A study conducted on sorghum variety performance in the lowlands of Wag and Lasta in Ethiopia and found that the yield had a direct and indirect relationship with pheno-agronomic traits and the result also indicates that there is a correlation between the phenoagronomic traits [8].
Studies indicate that the phenotypic variables are correlated to each other and principal component analysis is a statistical tool to transform these correlated variables into uncorrelated (independent) new set of variables which maximizes the total variance of the phenotypic variables under the constraints being orthogonal to all preceding components [10][11][12]. The principal component analysis is important to screen out the correlation between phenotypic and genotype diversity, and it is very interesting to use as an input for further multivariate analysis techniques such as multivariate regression, clustering analysis and factor analysis [13,14].
Several studies applied multivariate data analysis such as principal component and cluster analysis to identify the correlation between the morphological traits and the genotype and the correlations among the phenotypic variables which found the presence of a strong positive correlation among the phenotypic characteristics such as head shape, midrib color, panicle exertion, glume color, presence of awns, grain color, glume cover and thousand seeds weight and strong negative correlation among days of flowering, leaf color, seed size, number of green leaves, grain color, thousand seed weight, awn presence, glume color, and inflorescence compactness [11,15,16].
The studies conducted by different investigators found that the yield was positive and significantly related to leaf length, leaf breadth and the number of leaves and indicated that these variables had a direct effect on yield. The study used principal component analysis to transform the morphological traits into a set of new traits having maximum variance with the constraints to be orthogonal to all preceding components, and cluster analysis was also used to classify the genotypes according to the Euclidian distance by ward's minimum variance linkage method [4,[17][18][19][20].
Studies noticed that the presence of association indicates that the traits are conditioned by the same set of genes in either positive or negative directions [8]. However, none of the studies mentioned the statistical models that quantify the relations between one phenotypic variable which is to be considered as a response variable, and the other phenotypic variables as well as covariates, considering these variables as predictor variables. Therefore, it is important to understand the linear or non-linear relationship between the grain yield and other agronomic and morphological traits of sorghum which provides insight to quantify important agronomic characteristics having difficulty measuring depending on easily measurable agronomic traits.
This study aims to identify the relationship between the sorghum yield and other yield-related traits using a model contains fixed and random effects and select the genotypes performing better using a mixed model examining the effect of yield related traits on grain yield.
This article is confined to 5 sections. The first section deals with an introduction to the study including a literature review and the aim of the study. Next, the second section describes the data and model formulations used for data analysis. The result of the study, which is presented in section 3, explains the main results of the study while the fourth section discusses and compares the results with other scientific works. Finally, the conclusion and the recommendation of the study are provided.

Site description and experimental design
Genotypes were collected from the Ethiopian Biodiversity Institute (EBI). The institute was established in 1976 as a plant genetic resource center with a bilateral technical cooperation agreement between the Government of Ethiopia and Germany and in 1994 institute of Biodiversity conservation was established as well as nowadays its name is Ethiopian Biodiversity Institute (EBI) [21]. The institute is located in Addis Ababa, which is the capital of Ethiopia, with 9.0335206 latitudes and 38.7822931 longitudes. One hundred ninety-six genotypes were obtained from the institute that were used to identify the agronomic characteristics of sorghum in North East Ethiopia [20]. The experiment was held at the Kobo site of the Sirinka agriculture research center in 2014/2015 in Ethiopia. The site is located in the North Wollo zone in Amhara Regional state and far 40 km from Woldia, the center of the North Wollo Zone and 561 km far from Addis Ababa which is the capital of Ethiopia. The design of the experiment was an incomplete block design called lattice square design that contained 14 blocks, and each block contains 14 experimental units (plot) where the genotypes were applied and two replications for each treatment level that was a non-irrigated condition having a lack of water and the condition having sufficient [20,22]. The general layout of the design of the experiment is shown in Table 1 below.

Data nature
The data collection procedure was planting 196 genotypes to identify sorghum genotypes capable of resisting the drought and genotypes having higher production of sorghum under the given treatment. From the experiment, 14 phenotypes characteristics of sorghum were measured under the experiment. In general, the counted phenotypic characteristics were the number of days of emerging (DE), days of flowering (DF), days of maturity (DM), the number of green leaves (NGL) and grain filling period (GFP) measured. However, panicle height (PH), panicle length (PL), panicle yield (PY), thousand seed weight (TSW), grain yield (GY), aboveground dry matter and harvest index (HI), which is the ratio of the grain yield and the above-ground dry matter, were the measured phenotypic characteristics of the experiment [23,24]. This study considered grain yield as a response variable and the input variables are treatment and principal components will be obtained from Principal component analysis considered as a fixed effect, and replication within treatment, genotype effect and the interaction of genotype by treatment are incorporated as random effects.

Principal component analysis (PCA)
Principal component analysis (PCA), which transforms several correlated variables into several uncorrelated variables, is a multivariate technique for examining the relationships among several quantitative variables [10,15,25]. PCA is not an end by itself but it is an input for further multivariate analysis such as cluster analysis, factor analysis and multivariate regression [10]. Consider p set of correlated variables X 1 , X 2 , X 3 …, X p with covariance matrix Σ and assume p set of variables m 1 , m 2 , m 3 …, m p which are uncorrelated variables to each other and are linear combinations of the correlated variables, the principal component analysis has a form The matrix notation of the principal component is Table 1 Lattice square design layout of the data.
The variance of m i is a ′ i Σa i and covariance between m i and m k is a ′ i Σa k . Assume Σ has the eigenvalue-eigenvector pairs (λ 1 , e 1 ), (λ 2 , e 2 ) , ⋯, (λ p , e p ) for all λ i ≥ 0; ∀i. The ith principal component is given by The sum of the variance of m is the sum of the eigenvalue of the covariance matrix and the proportion of the total variance due to the kth principal component is the kth eigenvalue divided by the total eigenvalue. The principal component, having the highest eigenvalue, has the highest variance proportion relatively [10,25].

Linear mixed models
Linear mixed models are very flexible to fit the models which incorporate both fixed and random effects. Suppose that y is the response variable and consider the fixed effects which include the principal components (m i ; i = 1, 2, 3) and the treatment and the random effects including the replication within treatment, the genotype effect and the interaction of genotype by treatment [26]. The linear mixed model is given by Refs. [27][28][29]: Where y ijkln is the outcome of the grain yield of the nth observation of ith treatment of jth replication and lth block of the kth genotype; μ is the overall mean; β 1 , β 2 , and β 3 are the coefficients for the 1st, 2nd and 3rd principal components which are considered to be fixed effects for the linear mixed model; α i is the treatment effect which is a fixed effect; ρ j(i) ω k , (αω) ik and ε ijkln are the random effects of replication within treatment, genotype, the interaction of treatment by genotype and the random error, respectively, and these are from a normal distribution with mean zero and variances σ 2 ρ(α) , σ 2 ω , σ 2 αω and σ 2 ε , respectively. The fixed effects and random effects are estimated using the concept of a best linear unbiased estimator (BLUE) and best linear unbiased predictor (BLUP) methods respectively, which both have the minimum variance among all linear unbiased estimators and minimum mean square error among all linear unbiased predictor respectively [12,27,30].
Linear mixed model parameters are Maximum Likelihood (ML), Restricted Maximum Likelihood (REML) and Minimum Norm Quadratic Unbiased Estimator (MINQUE) [31,32]. In this case, MINQUE is used to estimate the parameters of the random effects of the mixed model as the method does not require the normality assumption of the data [33]. The estimated values of the random effects help to estimate the parameters of the fixed effects of the mixed model by using generalized least squares (GLS) [34,35].
Model diagnosis of the linear mixed model is the most important visualization to assess the agreement between the model and the data which is evaluated by marginal and conditional residual plots to test normality, linearity, and homoscedasticity of the residuals and detect the outlier values. The other measure of diagnosis is the measure of influential diagnosis which helps to assess the extreme/ outlier observation [32]. The analysis of this study was then performed using a mixed procedure of the SAS system (version 9.4) and R software (version R4.1.2).

Correlation coefficient analysis
The correlation analysis clearly shows the presence of positive and negative correlation among the phenotypic variables and also depicts the association between the grain yield and the other yield-related traits of sorghum. Fig. 1 presents the results of the Pearson correlation coefficient which is used to test the null hypothesis that the population correlation coefficient is zero (there is no correlation coefficient among the variables). It shows the presence of a relationship among the phenotypic variables which displays the presence of a significant correlation between the grain yield and the possible phenotypic variables. The pairwise correlation indicates that there is a high correlation among the variables and explains the presence of multicollinearity among the predictors such as the morphological variables (number of green leaves, panicle length, panicle width, panicle yield, harvest index) had a direct strong association pair wisely and the relationship among the phenology variables (seedling vigorous, days of emerging, days of flowering, days of maturity and grain filling period). The correlation between the grain yield and the morphological variables indicates the presence of a strong positive relationship except for panicle height which had an indirect strong association with the grain yield whereas the association between the grain yield and the phenological variables shows a negative relationship except for the grain filling period. The phenotypic variables which have a direct relationship with the grain yield were panicle exertion (r = 0.89), number of green leaves (r = 0.81), panicle length (r = 0.74), panicle width (r = 0.82), thousand seed weight (r = 0.86) and panicle yield (r = 0.90), and the phenotypic variables having an inverse relationship with grain yield were seedling vigorous (r = − 0.88), days of flowering (r = − 0.55), days of maturity (r = − 0.36), panicle height (r = − 0.57) and days of emerging (r = − 0.08 (P value < 0.05). The pairwise correlation analysis indicates the presence of a relationship between the phenotypic variables and the grain yield. In addition, the presence of a strong correlation among the predictors indicates the precedence of multi-collinearity.

Principal component analysis
The result indicates the eigenvalues of the correlation matrix with the corresponding proportions of the components and the cumulative proportion of the components. 56.16% of the total variance was explained by the 1st principal component, and 13.9% of the total variance was explained by the 2nd component. 7.4% of the total variation was explained by the third principal component. The first three principal components had eigenvalues greater than 1 and explained more than three-fourths of the total variance. 77.46% of the total variance was explained by the first three principal components ( Table 2). The first principal component had a larger positive association with the number of green leaves, panicle exertion, panicle length, panicle width, thousand seed weight and harvest index; however, it had a larger negative association with seedling vigorous and panicle height. Besides, these phenotypic characteristics explained 54.33% of the total variation of the phenotypic characteristics. The second principal component had days of maturity, grain filling period, number of green leaves and above-ground dry matter, and those variables explained 14.86% of the total variation of the phenotypic characteristics, which were more related to the 2nd principal component of the phenotypic characteristics of the sorghum production. The phenotypic variables such as days of emerging and days of flowering were related to the 3rd principal component which explain 7.92% of the total variance of phenotypic characteristics. The new variables which were transformed from the original phenotypic variables, and which are multivariate and correlated data, are uncorrelated and the dimension of the phenotypic variables was reduced into 3 sets of new variables. Fig. 2 presents the principal component against the eigenvalue (left side of the plot) and the principal components against the proportion of the variances (cumulative) which shows an elbow after three principal components and the first three principal components contained approximately 78%, which is indicated by the dotted line of Fig. 2 (right hand side), of the total variance of the original variables. From this, we used three principal components which are important variables and input predictors (fixed effect for the grain yield) for the mixed model.

Results of mixed model
According to the result in Table 3, there was a linear association between the grain yield and the phenotypic components. The grain yield of sorghum has a direct relation with the first, second and third principal components of the phenotypic characteristics of sorghum having a positive estimate for the coefficient of the components and the P-values suggest to reject the null hypothesis that the effect of the components are significant on grain yield as P value < 0.05. A unit change of the first, second and third components of the phenotypic characteristics provided the increment of grain yield by 0.09841, 0.03223 and 0.01783 in log scale respectively. The average grain yields (in log scale) for irrigated and non-irrigated were 1.0197 and 0.9425 respectively. This shows that the presence of irrigation affects the mean difference in grain yield in the log scale compared to the absence of irrigation (non-irrigated). There was a significant difference in grain yield among the treatment (P value < 0.05). Random effect replication within treatment has an insignificant effect on the grain yield variability and 1.55% of the total variance of grain yield was associated with the variability of replication within treatment (P value > 0.05). The genotype effect has a significant effect on grain yield variability (P value < 0.05) and 45.73% of the variability of grain yield is associated with random effect genotype. The total variance of the grain yield, which is 39.73%, was related to the interaction of genotype by treatment and the effect of the random effect has a significant effect on grain yield variability. The variability of grain yield due to the block random effect was very small which causes to omit the block effect from the analysis.  Table 4 shows the top and least ten genotypes performance in grain yield. According to the result, the top three best-performer genotypes were G40 (Genotype 40), G186 (Genotype 186) and G196 (Genotype 196) with their BLUP estimates 0.1229, 0.1059 and 0.1016 respectively whereas G41 (Genotype 41), G66 (Genotype 66) and G136 (Genotype 136) were least three worst performer genotypes and their BLUP estimates of the genotypes were − 0.1927, − 0.1741 and − 0.1391 respectively. The result indicates the standard error of the BLUP estimates was small (0.036). Table 5 presents the performance of the genotypes that coveys the BLUP estimates of the genotypes under irrigation conditions. This explained the genotypes which were appropriate genotypes for the wet environment. The result shows that G40 (Genotype 40), G62 (Genotype 62) and G192 (Genotype 192) were the top best performer genotypes under irrigated wet conditions and the BLUP estimates these genotypes were 0.1055, 0.09396 and 0.08959 respectively with relatively small standard error (0.038). On the other hand, the least three worst performer genotypes were G41 (Genotype 41), G55 (Genotype 55) and G2 (Genotype 2) with the corresponding BLUP estimates of − 0.1505, − 0.1219 and − 0.1184 respectively. Table 6 indicates the best and worst performer genotypes under stress conditions. The result shows the top three best performer genotypes were G26 (Genotype 26), G55 (Genotype 55) and G49 (Genotype 49) and their estimates are 0.09087, 0.08751 and 0.085 with standard error 0.037. The least three worst performer genotypes are G66 (Genotype 66), G137 (Genotype 137) and G156 (Genotype 156) with their BLUP estimates of the three genotypes − 0.1439, − 0.107, − 0.09075. As a result, Genotype 26, Genotype 55 and Genotype 49 are recommended for the mass production of sorghum under stress conditions whereas Genotype 66, Genotype 137 and Genotype 156 are not recommended for the mass production of sorghum under stress conditions.

Discussion of the study
The paper discussed and described the dimensional reduction of variables and produced an uncorrelated new set of variables, which used the principal component variables as an input to estimate the response variable, in this case, grain yield, for the linear mixed model which is a very flexible way of identifying the relationship between the response variable and predictor variables. The Table 3 Tests on fixed and random effects significance for grain yield in log scale.  results of the descriptive statistics indicate the difference in grain yield of the treatment (irrigated and non-irrigated) and the observation of grain yield under irrigated conditions is more consistent than the observation of grain yield under non-irrigated conditions. This result is consistent with other studies which were investigated the impact of drought on sorghum production and the effect of the genotypes on sorghum production [20,36,37]. The correlations among the phenotypic variables are significant which indicates the presence of association among the phenotypic variables and an indication of developing a statistical linear relationship between one trait, in this case, grain yield, and other related traits. This study agreed with the studies which indicated the relationship between grain yield and other yield-related traits [14,17,38].
The principal component analysis is the most important multivariate data analysis tool which helps to generate uncorrelated variables from correlated variables by maximizing the variance of original variables. PCA1 explained 54.33% of the total variance of the phenotypic traits, and PCA2 also explained 14.86% of the total variance of the phenotypic traits such that PCA1 and PCA2 are uncorrelated (independent) to each other. PCA3, which is uncorrelated to the other preceding two PCAs, explained 7.92% of the total variance of the phenotypic variable. The first three principal components explained 77.12% of the total variance of the phenotypic variables [13,20,38]. The principal component analysis is mainly important to find independent variables from correlated variables and essential to data reduction for high dimensional data. These new independent variables were used as input variables for statistical models to model the response variable. A study was conducted by other investigators who used the principal components as independent variables for linear models [12].
The result of the linear mixed model presented the quantitative measure of grain yield against the fixed effects namely the treatment effect (irrigation), the principal components (PCA1, PCA2 and PCA3) and the random effects, that were used in a linear mixed model, are replication within irrigation, genotype and the interaction of genotype and treatment [36]. The fixed effects were statistically significant on grain yield. The grain yield increased due to principal components as there were a linear combination of the other yield related traits and these showed the presence of a direct relationship between grain yield and other yield related traits. The variability of grain yield was significantly associated with the effect of genotype, the interaction of genotype by treatment and the replication within treatment. The genotypes may generate different production under several environmental conditions, for instance, a genotype that can produce high yield under irrigation, may not produce better yield under stress conditions that is due to the presence of different capacities in drought resistance among genotypes [36,37]. The result of the random effects showed that G40 (Genotype 40), G186 (Genotype 186) and G196 (Genotype 196) were the best performer genotypes without any constraint of the environmental conditions and the result recommends these genotypes for the mass production in Ethiopia. This agreed with the study in the same region [20,36]. Moreover, the performance of genotypes is different under the treatment levels. The top three best-performer genotypes under irrigation (sufficient water supply) were G40 (Genotype 40), G62 (Genotype 62) and G192 (Genotype 192) which the study recommends for mass production in irrigated regions. The recommended genotypes for mass production under stress (insufficient water supply) are G26 (Genotype 26), G55 (Genotype 55) and G49 (Genotype 49). These genotypes were found to resist the  Table 6 The performance of the genotype under stress condition. impact of drought which was revealed by the shortage of water during the growth of the genotypes [20].

Conclusion
The study indicates that the effect of treatment was on grain yield and also shows the presence of a pairwise relationship among the phenotypic variables, particularly the correlation coefficient between grain yield and the other phenotypic variables that have statistically significant association. Based on the correlation analysis, the phenotypic variables were correlated and principal component analysis was used to generate new uncorrelated (independent) phenotypic variables. The principal component analysis shows that 77.12% of the total variance of the original variables was explained by the three principal components. The outcomes of the principal component analysis were used as independent variables (fixed effects) for linear mixed model. The production of grain yield under irrigation is relatively high compared with the production of the grain yield under stress condition that leads to concluded the impact of drought on grain yield. The effect of principal components was important to estimate the grain yield in which all principal components had a direct linear relationship to the grain yield as the coefficients of the components are positive. The variability of grain yield due to random effects was associated with replication within treatment, genotypes and the interaction of genotype by treatment. According to the proportion of the total variation of the grain yield, the genotype and the interaction of genotype by treatment have high contribution on the variability of grain yield with respect to their contribution as well as the contribution of the replication within the treatments in grain yield is very small relative to other random effects. The best performer genotypes recommended for mass production were G40 (Genotype 40), G186 (Genotype 186) and G196 (Genotype 196) without any constraint of environment. Under irrigated condition, G40 (Genotype 40), G62 (Genotype 62) and G192 (Genotype 192) are recommended genotypes for mass production. On the other hand, in stress condition, G26 (Genotype 26), G55 (Genotype 55) and G49 (Genotype 49) are the genotypes recommended for mass production. Overall, the study recommended the use of the linear mixed model to fit grain yield against various correlated phenotypic characteristics. The future work will focus on to evaluate the performance of genotypes under different environments (location) and years of production.

Author contribution statement
Mulugeta Tesfa: Temesgen Zewotir: Denekew Bitew Belay: Analyzed and interpreted the data; Wrote the paper. Solomon Assefa Derese: Conceived and designed the experiments; Performed the experiments; Contributed reagents, materials, analysis tools or data.
Hussien Shimelis: Conceived and designed the experiments; Contributed reagents, materials, analysis tools or data.

Data availability statement
Data included in article/supp. Material/referenced in article.

Additional information
No additional information is available for this paper.

Declaration of competing interest
We have no known competing financial interest or personal relationships.