Multitrait index based on factor analysis and ideotype‐design: proposal and application on elephant grass breeding for bioenergy

This study proposes a new multitrait index based on factor analysis and ideotype‐design (FAI‐BLUP index), and validates its potential for the selection of elephant grass genotypes for energy cogeneration. Factor analysis was carried out; the factorial scores of each ideotype were then designed according to the desirable and undesirable factors, and the spatial probability was estimated from the genotype‐ideotype distance, enabling genotype ranking. To quantify the potential of the FAI‐BLUP index, genetic gains were predicted and compared with those of the Smith‐Hazel classical index. The FAI‐BLUP index ranks genotypes on multiple traits, free from multicollinearity, and does not require assigning weights, unlike the Smith‐Hazel classical index and its derived indices. Furthermore, the genetic correlation, positive or negative, within each factor was taken into account, preserving the relationships between traits and giving biological meaning to the ideotypes. The FAI‐BLUP index indicated the 15 elephant grass genotypes with the highest performance for conversion to bioenergy via combustion, and predicted balanced and desirable genetic gains for all traits. In addition, the FAI‐BLUP index predicted gains of approximately 62% of direct selection, simultaneously for all traits that are desired to be increased, and approximately 33% for traits that are desired to be decreased. The genotypes selected by the FAI‐BLUP index have potential to improve all traits simultaneously, whereas the Smith‐Hazel classical index predicted gains of 66% for traits that are desired to be increased and −32% for traits that are desired to be decreased, and so does not have potential to improve all traits simultaneously. The FAI‐BLUP index provides an unambiguous selection process and can be used in any breeding programme aiming at selection based on multiple traits.


Introduction
Elephant grass [Pennisetum purpureum Schum.; synonym Cenchrus purpureus (Schumach.) Morrone] may be used as forage or as a bioenergy crop (Fontoura et al., 2015; Rocha et al., 2017b). Besides the morphoagronomic traits, elephant grass breeders have focused on high-quality nutritional attributes for animal feed, emphasizing high nitrogen concentration, low fibre levels and high digestibility (Rengsirikul et al., 2013). In turn, for bioenergy purposes, the selection of quality attributes occurs in the opposite direction, that is, low nitrogen concentration and high fibre levels.
Experienced plant breeders keep in mind a plant ideotype that leads them to select high-performance plants. In this context, the idea behind the ideotype is that it provides breeders with an ultimate target for selection, thereby replacing the stepwise trial-and-error method and consequently increasing plant performance (Van Oijen & Höglind, 2016).
Selecting high-performance genotypes for multiple traits simultaneously can be a hard task. The first index for simultaneous selection was proposed by Smith (1936) for plant breeding, and by Hazel (1943) for animal breeding. This index is based on the selection of unknown genetic values. Thus, phenotypic values and genetic covariances are needed to determine how a vector of weights has to be chosen in order to maximize the correlation between unknown genetic values and phenotypic values (Hazel et al., 1994).
Currently, the Smith-Hazel classical index has been successfully used. However, one of the difficulties of applying this index is the lack of a procedure to weight the traits of economic importance (Cerón-Rojas et al., 2006; Stephens et al., 2012). Some parameters have been assigned as relative economic weights, such as the coefficient of genetic variation (Bhering et al., 2012) and heritability; in other cases, they can be randomly assigned (Zhang et al., 2009; Stephens et al., 2012).
In the selection process, plant breeders usually handle multiple traits (Santchurn et al., 2012, 2014). However, multicollinearity problems will certainly appear and be another obstacle faced by the Smith-Hazel classical index. According to Prunier et al. (2015), multicollinearity between traits is a systemic issue in multivariate analyses, and is likely to cause serious difficulties for the proper interpretation of the results, with the risk of erroneous conclusions, misdirected research and inefficient conservation measures. Furthermore, the Smith-Hazel classical index does not take advantage of the genetic correlations between traits.
Several methods are designed to incorporate collinear variables, such as: factor analysis, principal component analysis, principal component regression, partial least squares, etc. Factor analysis may produce uncorrelated or orthogonal axes between final factors scores, and therefore they are free from multicollinearity (Dormann et al., 2013). This method concentrates in the first few new latent variables, and usually the less important latent variables are discarded, leading to dimensional reduction (Dormann et al., 2013), which consequently simplifies the analysis.
The theoretical foundations of structural equation modelling (SEM) arise by joining the traditional technique of factor analysis (Exploratory Factor Analysis) with the ideotype-design (Confirmatory Factor Analysis). SEM allows the use of the correlations (covariances) between the traits. In this way, the influence that one trait exerts on another is computed, providing information on its magnitude and sense.
In light of the aforementioned, this study proposes a new multitrait index based on factor analysis and genotype-ideotype distance (FAI-BLUP index), and validates the potential of FAI-BLUP index on the selection of elephant grass genotypes for conversion to bioenergy via combustion.

Method description
Exploratory factor analysis. Principal component analysis was used to extract the factor loadings from the genetic correlation matrix, which was obtained from the genetic values. The varimax criterion (Kaiser, 1958) was used for the analytic rotation, and the weighted least squares method (Bartlett, 1938) was used to calculate the factor scores.
Ideotypes-design (step by step). The number of ideotypes was defined based on the combination of desirable and undesirable factors for the objective of the selection. The number of ideotypes is given by the following expression: $NI = 2^{n}$, in which: NI = number of ideotypes; n = number of factors.
The number of factors (n) used to design the ideotypes should be equal to the number of eigenvalues (variances of the principal components) greater than or equal to one (Kaiser, 1958). It also indicates the number of coordinates that should be calculated (i.e. n factorial scores) for each ideotype. Each ideotype is then described by a combination of desirable and undesirable factors. If, for example, three eigenvalues are higher than one (n = 3), eight ideotypes (NI = 2³ = 8) will be designed, each with three coordinates.
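As an illustration of the counting rule above, the following minimal Python sketch enumerates every desirable/undesirable combination for n = 3 factors. The function name and the D/U labels are hypothetical, and the order produced here is simply the enumeration order of `itertools.product`; it is not claimed to match the ideotype numbering used in the tables of this study.

```python
from itertools import product

def enumerate_ideotypes(n_factors):
    """List all 2**n_factors combinations of desirable (D) and undesirable (U) factors."""
    return list(product("DU", repeat=n_factors))

# Three retained factors give 2**3 = 8 ideotypes, each with three coordinates.
ideotypes = enumerate_ideotypes(3)
for number, combination in enumerate(ideotypes, start=1):
    print(f"ideotype {number}: {' '.join(combination)}")
```

The first combination (D D D) corresponds to an ideotype in which every factor is desirable.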
The factor score is a linear combination of standard genetic values (BLUP means) weighted by the canonical loadings obtained by the factor analysis. Therefore, a desirable factor should have the desirable genetic values (of the data set) for all traits under selection. The desirable genetic values may be the maximum, minimum, mean or a specific genetic value. An undesirable factor should have undesirable genetic values (maximum, minimum, mean or a specific genetic value) for all traits under selection. Thus, each ideotype and its coordinates (factorial scores) can be designed.
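The design of ideotype coordinates may be sketched as follows. This is a simplified illustration under stated assumptions, not the routine used in this study: the standardized genetic values, the score weights, the assignment of each trait to one factor, and the selection directions are all hypothetical, and the desirable value of each trait is taken as the maximum or minimum observed in the data set.

```python
from itertools import product

# Hypothetical standardized genetic values (rows: genotypes; columns: traits
# in the order lignin, nitrogen, total dry biomass).
z = [
    [1.2, -0.8, 0.5],
    [-0.3, 1.1, -1.0],
    [-0.9, -0.3, 0.5],
]
# Hypothetical score weights (rows: traits; columns: factors). Real weights
# would come from the factor analysis.
weights = [
    [0.6, 0.1],
    [-0.5, 0.0],
    [0.1, 0.9],
]
factor_of_trait = [0, 0, 1]   # assumed factor membership of each trait
direction = [+1, -1, +1]      # +1 = trait to be increased, -1 = decreased

def ideotype_scores(pattern):
    """Factor scores of one ideotype; pattern[f] is True if factor f is desirable."""
    target = []
    for t, column in enumerate(zip(*z)):
        desirable = pattern[factor_of_trait[t]]
        # Desirable factor: extreme in the selection direction; otherwise the opposite.
        want_max = (direction[t] > 0) == desirable
        target.append(max(column) if want_max else min(column))
    n_factors = len(weights[0])
    # Each score is a linear combination of the target values.
    return [sum(weights[t][f] * target[t] for t in range(len(target)))
            for f in range(n_factors)]

ideotypes = {p: ideotype_scores(p) for p in product([True, False], repeat=2)}
for pattern, scores in ideotypes.items():
    print(pattern, [round(s, 2) for s in scores])
```

With two factors the sketch yields four ideotypes; the pattern (True, True) plays the role of the ideotype in which all factors are desirable.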
Multitrait index based on factor analysis and genotype-ideotype distance (FAI-BLUP index). After the ideotypes were determined, the distance from each genotype to each ideotype (genotype-ideotype distance) was estimated and converted into spatial probability, enabling genotype ranking. The following expression was used:

$$P_{ij} = \frac{1/d_{ij}}{\sum_{i=1;\,j=1}^{i=n;\,j=m} 1/d_{ij}}$$

in which: $P_{ij}$ = probability of the $i$th genotype ($i = 1, 2, \ldots, n$) being similar to the $j$th ideotype ($j = 1, 2, \ldots, m$); $d_{ij}$ = genotype-ideotype distance from the $i$th genotype to the $j$th ideotype, based on the standardized mean Euclidean distance.
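The conversion of distances into spatial probabilities can be sketched in Python as below. The sketch assumes that the probability is the inverse genotype-ideotype distance divided by the sum of inverse distances over all genotype-ideotype pairs, and that the factor scores have already been standardized. All scores are hypothetical, and a genotype that coincides exactly with an ideotype (zero distance) is not handled.

```python
import math

def mean_euclidean(a, b):
    """Mean Euclidean distance between two vectors of factor scores."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))

def spatial_probabilities(genotype_scores, ideotype_scores):
    """Inverse-distance probabilities P[i][j], normalized over all pairs."""
    inverse = [[1.0 / mean_euclidean(g, ideo) for ideo in ideotype_scores]
               for g in genotype_scores]
    total = sum(sum(row) for row in inverse)
    return [[value / total for value in row] for row in inverse]

# Hypothetical standardized factor scores (two factors).
genotypes = [(1.0, 0.5), (-0.5, 0.0), (0.0, -1.0)]
ideotypes = [(1.5, 1.0), (-1.5, -1.0)]  # ideotype 1 first

probabilities = spatial_probabilities(genotypes, ideotypes)
# Rank genotypes by their probability of resembling ideotype 1 (column 0).
ranking = sorted(range(len(genotypes)),
                 key=lambda i: probabilities[i][0], reverse=True)
print(ranking)
```

Because the normalizing constant is shared by all genotypes, the ranking for a given ideotype depends only on the distances to that ideotype.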
The theoretical basis of FAI-BLUP index. In addition to handling multicollinearity and dispensing with weight assignment (both owing to factor analysis), the FAI-BLUP index takes into account the correlation structure obtained from the data and directs selection toward the genotypes closest to the one hypothesized by the breeder (the ideotype). The joining of the exploratory factor analysis (EFA, i.e. derived from the data) and the ideotype-design (which resembles confirmatory factor analysis, CFA) fits the theoretical basis of structural equation modelling (SEM).
In EFA, the researcher has a large set of traits and hypothesizes that the observed traits may be linked together by virtue of the traits correlations (unknown underlying structure); the aim of an EFA is to uncover this structure (Ullman, 2006) and lead to dimensional reduction (Dormann et al., 2013).
In a CFA, the researcher has a strong idea about the number of factors, the relations among the factors, and the relationship between the factors and the measured traits; hence, factor extraction and rotation are not part of CFA (Ullman, 2006). The goal of the CFA is to test a factor structure that is hypothesized a priori and verified empirically (or by tests), rather than derived from the data (Ullman, 2006; Lei & Wu, 2007). However, in order to use this theoretical foundation in the selection process of breeding programmes, the ideotype-design must be used, as it considers all the desirable relationships between the traits and the trait values that the breeder wishes to achieve.
One of the main advantages of SEM is that it can be used to study the relationships among latent constructs (factors) that are indicated by multiple measures. It is also applicable to both experimental and non-experimental data, as well as cross-sectional and longitudinal data (Ullman, 2006; Lei & Wu, 2007). Structural equation modelling has been applied to multitrait genetic breeding by Gianola & Sorensen (2004), Valente et al. (2010), Rosa et al. (2011) and Valente et al. (2013).
Comparison of the FAI-BLUP index with the Smith-Hazel classical index. The Smith-Hazel classical index was used to validate the potential of the FAI-BLUP index by comparing both indices. The Smith-Hazel index aims at determining how a vector of weights has to be chosen to maximize the correlation between unknown genetic values and phenotypic values. This can be achieved by solving the equation:

$$b = P^{-1} G a$$

in which: b = vector of the weightings of the index to be estimated; a = vector of known relative economic weights (in this study, the coefficient of genetic variation was used as the relative economic weight, with a positive or negative sign according to the desired sense of selection for each trait); P = matrix of phenotypic variance-covariance; G = matrix of genotypic variance-covariance.
Multicollinearity diagnosis was carried out on the matrix of phenotypic correlation in accordance with the recommendations of Montgomery & Peck (1992), and variables were discarded to solve multicollinearity problems so that the Smith-Hazel classical index could be applied.
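For a concrete reading of the Smith-Hazel equation, b = P⁻¹Ga, the following Python sketch solves it for two traits only. The variance-covariance matrices and economic weights are hypothetical values chosen for easy arithmetic, and the closed-form 2 × 2 solver is an illustrative shortcut rather than the procedure implemented in GENES.

```python
def solve_2x2(matrix, vector):
    """Solve matrix @ x = vector for a 2 x 2 matrix using its closed-form inverse."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("P is singular; the index weights are not defined")
    return [(d * vector[0] - b * vector[1]) / det,
            (a * vector[1] - c * vector[0]) / det]

def smith_hazel_weights(P, G, a):
    """Index weights b = P^-1 G a for two traits."""
    Ga = [sum(G[i][k] * a[k] for k in range(len(a))) for i in range(len(G))]
    return solve_2x2(P, Ga)

# Hypothetical inputs: trait 1 is to be increased, trait 2 decreased.
P = [[4.0, 1.0], [1.0, 2.0]]   # phenotypic variance-covariance
G = [[2.0, 0.5], [0.5, 1.0]]   # genotypic variance-covariance
a = [1.0, -1.0]                # relative economic weights with sign

b = smith_hazel_weights(P, G, a)
print(b)

# Index value of one genotype from its (hypothetical) trait values.
index_value = sum(weight * value for weight, value in zip(b, [10.0, 4.0]))
```

Note how the weights depend entirely on the economic weights in `a`; this is the assignment step that the FAI-BLUP index avoids.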
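One common way of diagnosing multicollinearity is the condition number of the correlation matrix, that is, the ratio between its largest and smallest eigenvalues. The sketch below covers only the two-trait case, where the eigenvalues of the correlation matrix are 1 + |r| and 1 − |r|. The thresholds (below 100 weak, 100 to 1000 moderate to strong, above 1000 severe) are the ones usually attributed to Montgomery & Peck; they are an assumption here, since the text does not list the criteria that were applied.

```python
def condition_number_two_traits(r):
    """Condition number (largest / smallest eigenvalue) of [[1, r], [r, 1]]."""
    if abs(r) >= 1:
        raise ValueError("|r| must be below 1 for a non-singular matrix")
    return (1 + abs(r)) / (1 - abs(r))

def classify_multicollinearity(condition_number):
    """Assumed thresholds: < 100 weak, 100-1000 moderate to strong, > 1000 severe."""
    if condition_number < 100:
        return "weak"
    if condition_number <= 1000:
        return "moderate to strong"
    return "severe"

# Hypothetical phenotypic correlations between two traits.
for r in (0.5, 0.995, 0.9995):
    cn = condition_number_two_traits(r)
    print(r, round(cn, 1), classify_multicollinearity(cn))
```

With more than two traits the eigenvalues would have to be obtained numerically, but the classification step is unchanged.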
Comparisons between the Smith-Hazel classical index and the FAI-BLUP index were carried out by means of predicted genetic gains. To make a more valid comparison, the predicted genetic gains were calculated using the genotypes indicated by the Smith-Hazel classical index, based on genetic values (SH-BLUP); and using the genotypes indicated by the FAI-BLUP index. Selection intensity was 15% (15 genotypes selected).

Method application
Experimental information. The experiment was carried out at the experimental field of Embrapa Dairy Cattle Research Center, located in the municipality of Coronel Pacheco, MG, Brazil (21°33′18″S, 43°15′51″W, at 417 m asl), in a red-yellow latosol soil with the following chemical properties: pH (H₂O) = 5.4; H+Al = 2.31 cmol_c dm⁻³; P (Mehlich 1) = 1.1 mg dm⁻³; K = 23 mg dm⁻³; and the following exchangeable cations (cmol_c dm⁻³): Al³⁺ = 0.2; Ca²⁺ = 1.4; Mg²⁺ = 0.7. The planting was carried out in December 2011, in 0.20 m deep furrows, and 80 kg ha⁻¹ of P₂O₅ fertilizer was applied at planting. After the establishment stage, 30 days after planting, elephant grass was cut to 0.30 m stubble height (uniformity harvest). The first of the three growth periods started at this time. Maintenance fertilization was carried out with 300 kg ha⁻¹ of the N-P₂O₅-K₂O formulation (20:05:20 blended granular fertilizer), after the uniformity harvest and after the first evaluation cutting. Fertilization was carried out according to the soil analysis.
Three evaluation cuttings were carried out for this study. As the material was intended as bioenergy feedstock, the 1st (28 September 2012) and 2nd (4 June 2013) cuttings were harvested at 250 days, and the 3rd (15 April 2014) at 315 days after the previous harvest.
Genetic material and experimental design. One hundred genotypes of the Active Elephantgrass Germplasm Bank (BAGCE) were evaluated. Plots (1.5 m × 4 m) consisted of a single 4 m row. Rows were planted side by side, spaced 1.5 m apart. Plots were allocated in a simplex lattice design, with two replications.
Evaluated Traits. The following traits were measured: mean height (m), obtained from the arithmetic mean of the height of three randomly sampled plants in each plot, measured from the ground level to the curve of the last completely expanded leaf; phenotypic vigor (1 to 5), obtained using a grading scale ranging from 1 to 5 (5 = high vigor; 1 = low vigor); stalk diameter (mm), obtained from the arithmetic mean of five plants randomly taken in the useful plot, measured at 10 cm from the ground level with a digital caliper; total green biomass (Mg ha⁻¹), obtained from a cut at 7.5 cm stubble height in a 3 m section from the middle of the rows, using a gasoline-powered trimmer, and then harvested by hand. The 3 m section was immediately weighed in the field to provide estimates of total green biomass. Total dry biomass (Mg ha⁻¹) was quantified by multiplying the total green biomass by the dry matter concentration (%).
Before cutting the experimental plots, random samples of three complete plants from each plot were collected. Then, these samples were preprocessed in a stationary forage cutter and dried in a forced air circulation oven at 56°C for 72 h. After drying, samples were ground (1 mm) in a Wiley type grinder and sent to the biomass analysis laboratory for the chemical analyses described below: acid detergent fibre (g kg⁻¹), neutral detergent fibre (g kg⁻¹), cellulose (g kg⁻¹), lignin (g kg⁻¹) and hemicellulose (g kg⁻¹), determined following the methodology proposed by Goering & Van Soest (1967); cellulose/lignin ratio (dimensionless), given by the ratio between cellulose and lignin contents; in vitro digestibility of the dry biomass (g kg⁻¹), determined following the methodology used by Tilley & Terry (1963); nitrogen (g kg⁻¹), determined following the methodology proposed by the Association of Official Analytical Chemists (AOAC, 1975); ash (g kg⁻¹), determined according to the methodology proposed by Silva & Queiroz (2002); calorific value (MJ kg⁻¹), determined using an IKA C-5000 calorimeter; dry matter concentration (g kg⁻¹), obtained by sampling three complete plants from each plot, which were weighed (fresh weight) and then dried in a kiln until weight stabilization. Samples were weighed again (dry weight), and the dry matter concentration was then determined as the ratio between dry weight and fresh weight. This trait was used as a common denominator for the estimation of cellulose, lignin, hemicellulose, in vitro digestibility of the dry biomass, nitrogen, ash and calorific value.

Statistical analyses
The mixed model methodology was adopted for statistical analyses via REML/BLUP (restricted maximum likelihood/best linear unbiased prediction), according to Patterson & Thompson (1971) and Henderson (1975).
The statistical model was denoted by:

$$y = Xm + Zg + Wb + Ti + Qp + \varepsilon$$

where: y = data vector; m = vector of the effects of the measurement-replication combination (assumed as fixed) added to the overall mean; g = vector of genetic effects (assumed as random); b = vector of block effects (assumed as random); i = vector of the genotype × measurement effects; p = vector of the permanent environment effects (random); ε = vector of residuals (random); X, Z, W, T and Q represent the incidence matrices for these effects.
For the random effects of the model, significance was assessed by the likelihood ratio test, using the chi-square statistic with one degree of freedom. Genetic values (BLUP means) were predicted for each one of the 100 genotypes based on the 16 traits evaluated.
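The likelihood ratio test for one random effect can be illustrated with the short Python sketch below. It assumes that the test statistic is the difference between the deviance (−2 log-likelihood) of the model without the effect and that of the full model, and it uses the identity that, for one degree of freedom, the upper-tail chi-square probability equals erfc(√(x/2)). The deviance values are hypothetical.

```python
import math

def likelihood_ratio_test(deviance_reduced, deviance_full):
    """Return (LRT statistic, p-value) for a chi-square test with 1 degree of freedom."""
    lrt = deviance_reduced - deviance_full
    if lrt < 0:
        raise ValueError("the reduced model cannot fit better than the full model")
    # For 1 df, P(chi-square > x) = erfc(sqrt(x / 2)).
    return lrt, math.erfc(math.sqrt(lrt / 2.0))

# Hypothetical deviances for one trait, without and with the genotype effect.
lrt, p_value = likelihood_ratio_test(1250.0, 1243.2)
print(round(lrt, 2), round(p_value, 4))
print("significant at 5%" if p_value < 0.05 else "not significant at 5%")
```

An effect is declared significant at the 5% level when the statistic exceeds about 3.84, the corresponding chi-square critical value.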

Software
The software Selegen-REML/BLUP (Resende, 2007) was used for the deviance analysis and for the prediction of genetic values, genetic variance-covariance, genetic correlation and coefficient of genetic variation. The Smith-Hazel classical index was computed in the software GENES (Cruz, 2013). The R software (R Development Core Team 2015) was used for the principal component analysis, factor analysis and genotype-ideotype distances (spatial probability). The FAI-BLUP index analysis routine for R (Routine S1) is available in the supporting information.

Genetic variability
A significant genotype effect (P < 0.05) was detected by the joint deviance analysis of the three cuts for 15 traits, the exception being hemicellulose. As significant genetic variability among genotypes is essential for genetic progress, hemicellulose content was not used in subsequent analyses.
Table 1 shows the eigenvalues and cumulative frequency for the 15 principal components obtained from the genetic correlation matrix. Only the first three principal components had eigenvalues higher than one, and thus, according to Kaiser's criterion (Kaiser, 1958), the data may be condensed (dimensional reduction) into three factors. The cumulative frequency for the first three principal components, or communality mean (common variance), was higher than 78%, indicating that three factors are sufficient to represent 78% of all the variability (Table 1).
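Kaiser's criterion and the cumulative variance can be computed directly from the eigenvalues, as in the Python sketch below. The eigenvalues are a small hypothetical example with five traits, not the values of Table 1, and the threshold is taken as "greater than or equal to one".

```python
def kaiser_summary(eigenvalues):
    """Return (number of retained factors, share of variance they explain)."""
    total = sum(eigenvalues)  # equals the number of traits for a correlation matrix
    retained = [value for value in eigenvalues if value >= 1.0]
    return len(retained), sum(retained) / total

# Hypothetical eigenvalues of a 5 x 5 genetic correlation matrix.
eigenvalues = [2.4, 1.3, 0.8, 0.3, 0.2]
n_factors, cumulative_share = kaiser_summary(eigenvalues)
print(n_factors, f"{100 * cumulative_share:.0f}%")
```

In this toy case two factors are retained and together they account for 74% of the total variance; the same calculation with the 15 eigenvalues of the study is what leads to three factors.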

Exploratory factor analysis
After varimax rotation (Table 2), high genetic correlation within the first factor was observed among the traits cellulose, lignin, cellulose/lignin ratio, acid detergent fibre, neutral detergent fibre, in vitro digestibility of the dry biomass, and nitrogen level, and this factor was named the qualitymetric factor. For the second factor, high genetic correlation was observed among mean height, phenotypic vigor, total green biomass and total dry biomass, and it was named the phytometric factor. The third factor, in which stalk diameter, dry matter, ash and calorific value are strongly correlated, was named the energymetric factor. Genetic correlations among traits within a factor can occur in the same and/or opposite direction.

Ideotype-design
To design ideotype 1 (the specific ideotype for direct combustion; Table 3), the maximum standard genetic values were used for the traits cellulose and lignin content, dry matter, calorific value, stalk diameter, mean height, phenotypic vigor, total green biomass, total dry biomass, acid detergent fibre and neutral detergent fibre; and the minimum standard genetic values were used for the traits in vitro digestibility of the dry biomass, cellulose/lignin ratio, nitrogen level, and ash content. The other ideotypes (ideotypes 2 to 8) were designed following the same reasoning, considering the combinations of desirable and undesirable factors, as shown in Table 3.

FAI-BLUP index
The FAI-BLUP index allows genotype ranking (associated with spatial probability) based on multiple traits, free from multicollinearity. In addition, the genetic correlation, positive or negative, within each factor was taken into account, preserving the relationships between traits and giving biological meaning to the ideotypes.
In the elephant grass case, the ideotypes designed may indicate the selection of genotypes for different purposes; for example, ideotype 1 is suitable for conversion to bioenergy via combustion, ideotype 2 is suitable for the production of second generation ethanol, and ideotypes 6 and 8 are suitable for forage uses (cut-and-carry and grazing, respectively), as presented and discussed by Rocha et al. (2017b).
A breeding programme based on ideotype focuses on multiple traits simultaneously. This method differs from other multivariate approaches in plant breeding, such as the Smith-Hazel classical index, which tends to focus on few traits and not on the underlying plant morphology and physiology. Focusing directly on few traits statistically simplifies the problem (Van Oijen & Höglind, 2016); however, important information may not be considered in the data analyses.
Simultaneous selection methods, such as the Smith-Hazel classical index and other indices derived from it, may be used provided that there is no multicollinearity problem in the phenotypic covariance matrix. When comparing the Smith-Hazel classical index with the FAI-BLUP index, the diagnosis of multicollinearity was carried out, and some traits were discarded from the Smith-Hazel classical index to solve this problem: total green biomass, cellulose/lignin ratio, neutral detergent fibre, and acid detergent fibre. Multicollinearity among traits will certainly appear when handling several traits, producing inflated errors. Inflated errors result in inaccurate tests of significance for the predictors, meaning that important predictors may not be significant even if they are truly influential. Moreover, if one predictor rather than another collinear predictor is dropped from the model, the selection process may proceed on a wrong trajectory, with the risk of erroneous conclusions (Dormann et al., 2013; Prunier et al., 2015).
Trait abbreviations (table footnote): LIG = Lignin (g kg⁻¹); C/L = Cellulose/Lignin ratio; ADF = Acid detergent fibre (g kg⁻¹); NDF = Neutral detergent fibre (g kg⁻¹); IDBD = In vitro digestibility of the dry biomass (g kg⁻¹); N = Nitrogen (g kg⁻¹); HGT = Mean height (m); PHV = Phenotypic vigor (1-5); TGB = Total green biomass (Mg ha⁻¹); TDB = Total dry biomass (Mg ha⁻¹); STD = Stalk diameter (mm); DM = Dry matter (g kg⁻¹); ASH = Ash (g kg⁻¹) and CAV = Calorific value (MJ kg⁻¹).
Methods for detecting and solving multicollinearity problems established for multiple regression can also be applied in SEM (Lei & Wu, 2007). Thus, factor analysis, which produces uncorrelated or orthogonal axes among the final factors, generates scores free from multicollinearity (Dormann et al., 2013). In addition, there is no need to assign weights in the FAI-BLUP index, as is required in the Smith-Hazel classical index and its derived indices. Therefore, the index proposed in this study solves the major problems of the Smith-Hazel classical index.
Table 4 presents the comparison between the indices in terms of predicted genetic gain. For the traits of the qualitymetric factor, the FAI-BLUP index predicted gains in the desirable sense for all traits, and of greater magnitude than those of the SH-BLUP index. The SH-BLUP index predicted genetic gains in the undesirable sense for lignin, cellulose/lignin ratio, in vitro digestibility of the dry biomass, and nitrogen level.
The SH-BLUP index predicted the greatest gains for total green biomass and total dry biomass (Table 4). This is explained by the greater relative economic weight attributed to total dry biomass (Table 4) and by the high positive genetic correlation between total green biomass and total dry biomass (Table 2). However, the FAI-BLUP index estimated the greatest genetic gain for other traits.
For the energymetric factor, the FAI-BLUP index predicted superior genetic gains for stalk diameter, ash content, and calorific value. The FAI-BLUP index predicted a smaller gain than the SH-BLUP index only for dry matter; however, all the gains were in the desired sense for both indices (Table 4).
Falconer & Mackay (1996) identified two main causes of genetic correlations: pleiotropy and genetic linkage. The second is the most appropriate to explain the genetic gains predicted by the FAI-BLUP index, owing to the quantitative nature of the traits and to the factor analysis. According to Wang et al. (2009), factor analysis explains the covariances or correlations among the traits in the factors; moreover, it can be used to understand what constructs underlie the data. Genetic linkage can only be broken by means of repeated cycles of meiosis (Luby & Shaw, 2009), and therefore desirable gains for all traits can be achieved considering the genotypes selected by the FAI-BLUP index.
Direct selection provides the maximum predicted gain considering one trait at a time. In comparison, the FAI-BLUP index provided gains of approximately 62% of direct selection, simultaneously for all traits that are desired to be increased, and of approximately 33% for traits that are desired to be decreased. The genotypes selected by the FAI-BLUP index have potential to improve all traits simultaneously, whereas the Smith-Hazel classical index predicted 66% and −32% for the traits that are desired to be increased and decreased, respectively, and so does not have potential to improve all traits simultaneously. The FAI-BLUP index provides a more balanced prediction of genetic gains for all traits.
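The expression of index gains as a percentage of direct selection can be sketched as follows. The sketch assumes that the predicted gain for a trait is the mean genetic value of the selected genotypes minus the overall mean, which is not stated explicitly in the text; the genetic values and the sets of selected genotypes are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)

def predicted_gain(values, selected):
    """Mean genetic value of the selected genotypes minus the overall mean."""
    return mean([values[i] for i in selected]) - mean(values)

def percent_of_direct_selection(values, index_selected, increase=True):
    """Gain under index selection as a percentage of the direct-selection gain."""
    k = len(index_selected)
    # Direct selection takes the k best genotypes for this single trait.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=increase)
    direct_gain = predicted_gain(values, order[:k])
    return 100.0 * predicted_gain(values, index_selected) / direct_gain

# Hypothetical genetic values of six genotypes for one trait.
values = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

print(percent_of_direct_selection(values, [5, 3]))                  # trait to increase
print(percent_of_direct_selection(values, [0, 2], increase=False))  # trait to decrease
print(percent_of_direct_selection(values, [5, 3], increase=False))  # wrong direction
```

A negative percentage, as in the last call, indicates a gain in the undesirable sense, which is how a value such as −32% for traits to be decreased can be read.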
The FAI-BLUP and the SH-BLUP indices share only seven of the 15 elephant grass genotypes with the highest performance for combustion (Table 4). Besides, the FAI-BLUP index without multicollinearity (i.e. when the traits that cause multicollinearity in the matrix of phenotypic variance-covariance are excluded) still provides superior gains in the desirable sense for all traits when compared with the SH-BLUP index. Thus, traits that cause multicollinearity work as auxiliary traits, improving the FAI-BLUP index.
In this study, a multitrait index (FAI-BLUP index) is proposed based on structural equation modelling theory. The FAI-BLUP index indicated the 15 elephant grass genotypes with the highest performance for combustion based on multiple traits, without assigning weights and free from multicollinearity, and balanced genetic gains were predicted in the desirable sense for all measured traits. The Smith-Hazel classical index predicted genetic gains in the undesirable sense for lignin, cellulose/lignin ratio, in vitro digestibility of the dry biomass and nitrogen. Moreover, in general, the genetic gains predicted by the FAI-BLUP index were superior to those predicted by the Smith-Hazel classical index. Therefore, the FAI-BLUP index is a new technical tool that can be applied in genetic breeding programmes. In addition, an analysis routine for the free R software has been developed, so that breeders can use the FAI-BLUP index on their own data.