Machine Learning Applications and Optimization of Clustering Methods Improve the Selection of Descriptors in Blackberry Germplasm Banks

Machine learning (ML) and its multiple applications have comparative advantages for improving the interpretation of knowledge on different agricultural processes. However, there are challenges that impede proper usage, as can be seen in phenotypic characterizations of germplasm banks. The objective of this research was to test and optimize different ML-based analysis methods for the prioritization and selection of morphological descriptors of Rubus spp. A total of 55 descriptors were evaluated in 26 genotypes, and the weight and discriminating capacity of each one were determined. ML methods such as random forest (RF), support vector machines in their linear and radial forms, and neural networks were optimized and compared. Subsequently, the results were validated with two discriminating methods and their variants: hierarchical agglomerative clustering and K-means. The results indicated that RF presented the highest accuracy (0.768) of the methods evaluated, selecting 11 descriptors based on the purity (Gini index), importance, number of connected trees, and significance (p value < 0.05). Additionally, the K-means method with optimized descriptors based on RF had greater discriminating power on Rubus spp. accessions according to the evaluated statistics. This study presents one application of ML for the optimization of specific morphological variables for plant germplasm bank characterization.


Introduction
Machine learning (ML) is a form of artificial intelligence (AI) that gives machines the ability to learn through the use of algorithms and a training process [1] and is used in tandem with big data technologies and high-performance computing [2,3], which, together with information and communication technologies (ICTs), the Internet of Things (IoT), and deep learning, among other tools, have created new opportunities for data-intensive science. These tools are being applied in multiple areas, including agriculture, as an emerging technology [2,4]. They helped create what is now known as digital agriculture, a new agricultural revolution [2,5].
The practical applications of ML in agriculture are broad and span various fields, such as yield prediction [6], pest detection, classification, monitoring, and management [3,7], and species recognition [8]. ML is therefore a powerful tool within digital agriculture, providing information for correct decision-making, including the study and efficient use of tropical genetic resources. The ML approach in agriculture presents many challenges because agriculture is highly affected by environmental factors that cannot be controlled, demanding rigorous processes and extensive data for validation. The objective of this study was to evaluate different machine learning techniques for the selection of morphological variables in 26 Rubus spp. accessions cultivated under tropical conditions and for the optimization of discrimination methods to increase the differentiation capacity of accessions.

Rubus spp. Accession Genotypes and Agronomic Management
A total of 26 blackberry accessions (Rubus spp.) were selected from the germplasm bank at the La Selva Research Center in Rionegro, Antioquia, Colombia, located at an altitude of 2100 masl with an average temperature of 16 °C and 74.83% relative humidity (Longitude: 075°24′51.9″ and Latitude: 06°07′52.7″), in a lower montane humid forest life zone (bh-MB).
The selected accessions had interspecific morphological variation at the species level and were obtained as a representative sample of all the variation in the Rubus spp. germplasm bank collection. The origin of each accession was as follows: cultivated natives (CN) with 18 accessions; introduced (I) with 6 accessions; and wild native (WN) with 2 accessions (Table 1). The plant material was collected in the flowering phenological phase; the primary floriferous stems were obtained after renewal pruning to guarantee their quality. For each accession, five plants were selected, and three floriferous primary stems were taken from each one. The sanitary and renewal pruning were standard for all accessions, as were the fertilization practices. When there was a water deficit, water was added with automatic drip irrigation.

Morphological Descriptors
The selected morphological descriptors were based on previous studies on morphological characterizations carried out by Evans and Weber [22], combined with the proposals of Ligarreto-Moreno, Espinosa, Barrero and Medina [21] for species from the Rubus L. genus. Once selected, they were grouped using typology criteria, botanical terminology, and measurement scales. These descriptors were verified with taxonomic keys of the genus Rubus L. [20] and the characters for the distinction of varieties of blackberry [21]. This validation confirms that the morphological traits are associated with stable phenotypic expression within the genus Rubus and with a low environmental effect.
The descriptors were recorded at the plant and organ levels, including qualitative variables such as growth habit, organ shape, and other vegetative structures. In addition, organ colors were recorded using the Royal Horticultural Society's color chart for plants [23]. The quantitative variables focused on morphometric measurements such as length (linear dimension) and length and width (horizontal dimensions). Discrete variables were also used for the vegetative structures in the plant organs (Table 2). The Rubus spp. germplasm bank currently does not have validated morpho-agronomic descriptors, meaning much methodological effort is needed for characterization, which is why three ML tools were used to discriminate and determine the weight of the descriptors in terms of diversity for optimal clustering of the evaluated accessions. This study used an internal optimization process for the ML algorithms random forest (RF), linear (SVMl) and radial (SVMr) support vector machines, and neural networks (NN).
The data matrix consisted of 55 columns (morphological descriptors) by 78 rows (26 accessions with 3 replicates each). In addition, the categorical descriptors were one-hot encoded to obtain a numerical representation. The data were randomly divided into two data sets, (i) training (75%) and (ii) testing (25%), to be evaluated under the different ML methods.
The RF algorithm is widely used as a classifier given its simplicity in terms of the parameters required for decision making [24]. It was implemented using the libraries randomForest [25], caret [26], ranger [27], and h2o [28], available for the free software R.
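The encoding and splitting step described above can be illustrated with a minimal Python sketch (the study itself was carried out in R). The toy values, the level names given to V23, the seed, and the helper names `one_hot` and `split_train_test` are assumptions made for this example and are not the authors' data or code.

```python
import random

def one_hot(rows, column):
    """Replace one categorical column with 0/1 indicator columns."""
    levels = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        new_row = {key: value for key, value in row.items() if key != column}
        for level in levels:
            new_row[f"{column}_{level}"] = 1 if row[column] == level else 0
        encoded.append(new_row)
    return encoded

def split_train_test(rows, train_fraction=0.75, seed=1):
    """Shuffle the rows with a fixed seed and cut them into two sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy matrix: 26 accessions x 3 replicates = 78 rows.
shapes = ["straight", "curved", "hooked"]  # assumed level names for V23
rows = [
    {"accession": a, "replicate": r, "V23": shapes[a % 3], "V24": 1.0 + 0.1 * a}
    for a in range(26)
    for r in range(3)
]

encoded = one_hot(rows, "V23")
train, test = split_train_test(encoded)
print(len(encoded), len(train), len(test))  # 78 58 20
```

With 78 rows, a 75% cut gives 58 training rows and 20 testing rows; each encoded row carries exactly one active indicator column for the categorical descriptor.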
The optimization of the RF algorithm and the determination of the importance of the morphological descriptors (numerical and non-numerical) were carried out through a multi-step analysis. The first step was evaluating the number of trees (set of combinations from 100 to 1000), the size through the number of nodes (1 to 4000), and the hyperparameter alpha (0 to 20) in terms of the error rate, Bayes error, and out-of-bag root-mean-square error (OOB-RMSE), using the caret, ranger, and h2o libraries. A total of 500 trees, 4000 nodes, and an alpha of 4 were selected as a balance between robustness, stabilization of the error rate, and computational performance. Subsequently, the importance of the morphological descriptors was determined by calculating the confusion matrices [29] and then their global accuracy [30] using the caret package. These parameters identified the weight and interactions within the decision tree for each variable. As a complement, the importance of the descriptors was evaluated by calculating the mean decrease in the Gini index. This process was corroborated using the following metrics: mean minimal depth, number of nodes, MSE increase, node purity increase, number of trees, times a root, root variable, interaction occurrences, unconditional mean minimal depth, and significance (p < 0.05), evaluated with the randomForestExplainer library [31]. Finally, these metrics were graphed. The type of ensemble used in the RF algorithm for the selection of the models was bagging.
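To make the Gini-based importance criterion concrete, the following Python sketch computes the Gini impurity of a set of class labels and the impurity decrease produced by a single threshold split on one descriptor. In a random forest, decreases of this kind are accumulated over every split that uses a given variable and averaged across trees. The function names, toy values, and threshold are assumptions for illustration only; the study used the R packages cited above.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_decrease(values, labels, threshold):
    """Impurity of the parent node minus the size-weighted impurity of its children."""
    left = [lab for val, lab in zip(values, labels) if val <= threshold]
    right = [lab for val, lab in zip(values, labels) if val > threshold]
    n = len(labels)
    children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(labels) - children

labels = ["A", "A", "B", "B"]         # hypothetical accession classes
informative = [1.0, 1.2, 3.0, 3.1]    # descriptor that separates the classes
uninformative = [1.0, 3.0, 1.0, 3.0]  # descriptor unrelated to the classes

print(gini_decrease(informative, labels, threshold=2.0))    # 0.5
print(gini_decrease(uninformative, labels, threshold=2.0))  # 0.0
```

A descriptor that separates the classes cleanly removes all of the parent impurity (0.5 here), whereas an unrelated descriptor removes none, which is why larger mean decreases indicate more discriminating descriptors.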
The second evaluated algorithm was SVM, which belongs to the general category of kernel methods and is widely used in classification and regression because of its high precision and capacity to handle high-dimensional data [32]. The linear and radial forms of the SVM algorithm (SVMl and SVMr) were evaluated using the e1071 library [33] and optimized in two steps: (i) selection of parameters and (ii) final training [34]. To select the fitness parameters, the classification error and the mean square error of the regression were determined using the e1071 library. In the case of SVMr, gammas of 0.5, 1, 2, 3, 4, and 5 were tested, and the computational cost was determined with the sum of the hyperparameters and an indirect measurement of the computational simplicity of the code.
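The role of gamma in the radial form can be seen from the kernel itself. The sketch below evaluates the radial basis function kernel, exp(-gamma * ||x - y||^2), for the gamma values listed above. The two points and the helper name `rbf_kernel` are assumptions for illustration; the model fitting in the study was done with e1071 in R.

```python
import math

def rbf_kernel(x, y, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

gammas = [0.5, 1, 2, 3, 4, 5]    # values tested for SVMr in the text
x, y = (0.0, 0.0), (1.0, 1.0)    # hypothetical standardized observations

similarities = {g: rbf_kernel(x, y, g) for g in gammas}
for g, s in similarities.items():
    print(f"gamma={g}: kernel value={s:.4f}")
```

Larger gamma values make the kernel value fall off faster with distance, so each training observation influences a smaller neighborhood; small values such as 0.5 give smoother decision boundaries.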
The third algorithm was NN, commonly used in identifying and predicting patterns between multiple variables [35]. It was implemented with the nnet [36], NeuralNetTools, and RSNNS packages [37]. The importance and sensitivity of the descriptors were evaluated using the Garson and Olden algorithms and the Lek profile method; then, the network was optimized through a step-by-step process [35]: (i) normalization of entries, standardization of responses, and evaluation of the influence of outliers; (ii) network architecture, which includes the size or number of units in the hidden layer, the number of nodes in each layer, inclusion of bias layers, and weights or inputs; (iii) decay, by decreasing the specific weight of the regularization in the neural network; and (iv) interactions, by evaluating different amounts of interactions. Additionally, the correlation matrix between variables was calculated [35].
With the optimization of each algorithm (RF, SVMr, SVMl, and NN), the results obtained in the selection of descriptors and their ability to adequately classify the accessions of Rubus spp. were evaluated based on training/validation accuracy and training/validation misclassification error using the area under the receiver operating characteristic (ROC) curve (AUC) [38], implemented in the free software R [39] with custom code. The accuracy quantified by AUC is considered a good metric that has been used in the comparison of ML algorithms [40].
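The connection weights approach attributed to Olden can be sketched for a network with one hidden layer and one output: the importance of an input is the sum, over hidden units, of the product of its input-to-hidden weight and the corresponding hidden-to-output weight, so the sign of the contribution is preserved. The weights below are invented for illustration and the function name `olden_importance` is hypothetical; the study obtained these values with R packages.

```python
def olden_importance(input_hidden, hidden_output):
    """Connection weights importance for a single-hidden-layer, single-output network.

    input_hidden[i][j] is the weight from input i to hidden unit j;
    hidden_output[j] is the weight from hidden unit j to the output.
    """
    return [
        sum(w * v for w, v in zip(weights, hidden_output))
        for weights in input_hidden
    ]

# Hypothetical weights: three descriptors, two hidden units.
input_hidden = [
    [2.0, -1.0],   # descriptor 1
    [0.5, 0.5],    # descriptor 2
    [-1.5, -1.0],  # descriptor 3
]
hidden_output = [1.0, 2.0]

importance = olden_importance(input_hidden, hidden_output)
print(importance)  # [0.0, 1.5, -3.5]

# Keep descriptors whose absolute importance exceeds a chosen cut-off.
selected = [i + 1 for i, value in enumerate(importance) if abs(value) > 2.5]
print(selected)  # [3]
```

Because positive and negative contributions can cancel (descriptor 1 above), this measure differs from Garson's algorithm, which works with absolute weights.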
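As a simple companion to the evaluation criteria above, the sketch below builds a confusion matrix from actual and predicted classes and derives the global accuracy and the misclassification error from it. It does not compute the AUC. The class labels reuse the origin codes from the text (CN, I, WN) only as toy categories, and the predictions are invented.

```python
def confusion_matrix(actual, predicted):
    """Rows are actual classes, columns are predicted classes."""
    labels = sorted(set(actual) | set(predicted))
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return labels, matrix

def global_accuracy(matrix):
    """Share of observations on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total

# Hypothetical test-set classes and predictions.
actual = ["CN", "CN", "CN", "I", "I", "WN"]
predicted = ["CN", "CN", "I", "I", "I", "CN"]

labels, matrix = confusion_matrix(actual, predicted)
accuracy = global_accuracy(matrix)
print(labels)                  # ['CN', 'I', 'WN']
print(matrix)                  # [[2, 1, 0], [0, 2, 0], [1, 0, 0]]
print(round(accuracy, 3))      # 0.667
print(round(1 - accuracy, 3))  # 0.333 (misclassification error)
```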

Rubus spp. Germplasm Bank Genotype Clustering Methods
The tools selected for the analysis of discrimination of the accessions in the Rubus spp. germplasm bank included hierarchical agglomerative clustering, using the Ward D2 method as the clustering union strategy, and k-means [41]. The analyses were developed in the R software [39] with custom code and the help of the libraries vegan [42], pvclust [43], ape [44], and rgl [45]. The consolidated morphological descriptors in Table 2 were used as discriminant variables.
In order to optimize the discrimination of genotypes based on morphological descriptors, a multistage analysis was carried out: (i) Standardization of variables by means of the Z score using the clusterSim library [46], guaranteeing equal or similar measurement scales, which matters especially for measures of dissimilarity that are sensitive to magnitude, such as the Euclidean distance [47,48]. (ii) For both discrimination methods, the optimal number of clusters was estimated with the gap statistic [49] using the libraries factoextra [50] and stats [51]. (iii) Since there are no validated morphological descriptors for the germplasm bank, the effect of the number of variables in the discrimination methods was determined using two data sets: (a) all variables (Table 2) and (b) those selected using the RF algorithm (V23, V44, V24, V49, V25, V27, V50, V29, V28, and V42).
All combinations of morphological descriptors used in the discriminating methods, including those resulting from the ML algorithms, were evaluated for their ability to discriminate each accession, with replicates used in order to avoid artifacts such as the effect of environmental conditions on the differential expression of morphological markers. In addition, the results were taxonomically corroborated to detect anomalies in the algorithms evaluated.
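The reason for standardizing before computing Euclidean distances can be shown with a small Python sketch. Two invented descriptors on very different scales are compared before and after a Z-score transformation; the sample standard deviation is assumed here, and the values and names are illustrative rather than taken from the study.

```python
import math
from statistics import mean, stdev

def z_scores(column):
    """Center on the mean and scale by the sample standard deviation."""
    m, s = mean(column), stdev(column)
    return [(x - m) / s for x in column]

def euclidean(a, b):
    """Euclidean distance between two observations."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical descriptors for three accessions.
leaf_length_mm = [80.0, 120.0, 100.0]
prickle_count = [2.0, 8.0, 5.0]

raw = list(zip(leaf_length_mm, prickle_count))
standardized = list(zip(z_scores(leaf_length_mm), z_scores(prickle_count)))

raw_distance = euclidean(raw[0], raw[1])
std_distance = euclidean(standardized[0], standardized[1])
length_share_raw = (raw[0][0] - raw[1][0]) ** 2 / raw_distance ** 2

print(round(raw_distance, 3))      # 40.447
print(round(length_share_raw, 3))  # 0.978
print(round(std_distance, 3))      # 2.828
```

On the raw scale, about 98% of the squared distance comes from leaf length alone; after standardization both descriptors contribute equally to the distance between the first two accessions.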

Selection and Optimization of Machine Learning Algorithms for the Prioritization and Selection of Morphological Descriptors in Rubus spp.
Table 3 shows the ability of each machine learning method to prioritize the importance of morphological variables and to discriminate the studied blackberry accessions after an internal optimization process. The RF algorithm had the best performance based on the test statistics. In decreasing order of ability to discriminate adequately based on the appropriate selection of descriptors, the algorithms were RF (descriptive and numerical variables), RF (numerical variables), neural networks (NN), radial support vector machine (SVMr), and linear support vector machine (SVMl), with area under the curve (AUC) classification accuracy values of 0.76, 0.64, 0.31, 0.21, and 0.09, respectively. The RF algorithm, after optimization based on the number of trees, size, and reduction of the hyperparameters, determined that the quantitative and qualitative descriptors of greatest importance for the discrimination of blackberry genotypes were those with the highest value for the mean decrease in the Gini index (Figures 1a and 2a), minimum depth within the decision forest (Figures 1b and 2b), maximum number of connected nodes (Figures 1c and 2c), increase in purity (Figures 1d and 2d), and number of most frequent interactions in the decision trees (Figures 1e and 2e). According to these criteria, the variables in decreasing order of importance were V44, V24, V42, V49, V25, V29, V50, V27, V28, and V26 for the quantitative descriptors and V23, V44, V9, V24, V53, V22, V25, V49, V28, V50, and V26 for the quantitative-descriptive descriptors (Figure 2a-e).
The optimized SVMl, varying parameter C as a balance between the maximization of the algorithm margin and the error, with a selection factor of 0.04, determined that the descriptors in increasing order of importance were V51, V29, V19, V7, V27, V32, V46, V31, and V23 (Figure 3a,b). For the SVMr, a selection factor of 50 in the kernel function and a gamma of 0.5 minimized the error at the lowest possible computational cost, which prioritized the descriptors in decreasing order as V19, V37, V46, V48, V23, V30, V42, V44, V51, V7, V32, and V31 (Figure 3b).
For the NN, the relationship between the decay (the decrease in the specific weight of the regularization of the network) and the number of hidden layers, quantified using the bootstrap indicator, showed that the highest value of occurrences was obtained with a weight decay of 0.1 and 5 layers (Figure 4b). The relationship between the variables and the network, determined using Olden's connection weights algorithm, showed that the descriptors with importance greater than 2.5 in absolute value were V30, V34, V32, V31, V49, V12, V50, V26, V33, V36, V48, and V19 (Figure 4a). Additionally, as a result of the correlation analysis, two groups were found: (i) directly proportional relationships and (ii) inversely proportional relationships (Figure 4c). The descriptors associated with the flower, such as V49 and V50 (inversely proportional), helped differentiate the accessions at the inter- and intraspecific levels. The leaf descriptors, such as V31 and V32 (directly proportional), were able to discriminate using a single descriptor. Grouping also occurred in five clusters associated with the importance value of the variables: high importance, V34 to V44; medium importance, V26 to V33; and low importance, V12 to V25. The importance of variables V27, V37, V43, and V44 was notable in all groups (Figure 4d).

Genotype Discriminating Methods from the Rubus spp. Germplasm Bank
The gap statistic was maximized when the number of clusters for the standardized hierarchical agglomerative clustering, non-standardized hierarchical agglomerative clustering, standardized k-means, and non-standardized k-means methods was 10, 6, 10, and 7, respectively (Figure 5), indicating a uniform, non-random distribution of the accessions within each group. Additionally, the effect of standardization on the two methods tested was notable (Table 4). Superior performance was found in the K-means and hierarchical agglomerative clustering grouping when the reduction of variables was carried out using the RF method, indicating that the selection process was highly informative (Figure 6 and Table 4).

The evaluation of the behavior and discrimination capacity of the hierarchical agglomerative clustering and k-means methods and their variants, based on the Normalized Variation Index (NVI), Adjusted Rand Index (ARI), Separation Index (IS), Calinski-Harabasz Index (CH), Entropy (EN), and Pearson Gamma (PG) indices, showed that the method with the best performance was K-means with standardized and reduced variables as a function of the RF optimization process. It was followed by the K-means method with all standardized variables, standardized agglomerative hierarchical grouping with optimization of variables using RF, agglomerative hierarchical grouping with all standardized variables, and non-standardized K-means and agglomerative hierarchical grouping (Table 4).
Table 4 (excerpt). K-means (standardized and with reduced variables based on random forest process): 0.32122, 0.46730, 1.05589, 20.30752, 2.19465, 0.14359. Normalized Variation Index (NVI); Adjusted Rand Index (ARI); Separation index (IS); Calinski-Harabasz Index (CH); Entropy (EN); Pearson Gamma (PG). HC: Hierarchical agglomerative clustering.
Figure 7 shows the relationship between the descriptors prioritized by the RF algorithm and the results of the discrimination using the K-means method, without contradictory variations in the discriminating morphological characteristics for each group or accession.
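Of the indices listed above, the Adjusted Rand Index is the most direct to illustrate: it measures the agreement between two partitions of the same observations, corrected for the agreement expected by chance. The Python sketch below is a minimal implementation of the standard pair-counting formula; the toy labelings are invented, and which reference partition the authors compared against is not stated here, so none is assumed.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(a, b):
    """Adjusted Rand Index between two labelings of the same observations."""
    n = len(a)
    sum_cells = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        # Degenerate case, e.g. both partitions place everything in one group.
        return 1.0
    return (sum_cells - expected) / (maximum - expected)

reference = [0, 0, 1, 1]  # hypothetical reference grouping
same_groups = [1, 1, 0, 0]  # identical partition, different label names
crossed = [0, 1, 0, 1]  # partition that splits every reference group

print(adjusted_rand_index(reference, same_groups))  # 1.0
print(adjusted_rand_index(reference, crossed))      # -0.5
```

The index is invariant to how clusters are named, equals 1 for identical partitions, is close to 0 for random agreement, and can be negative when agreement is worse than chance.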

Selection and Discrimination of Descriptors Applied to the Rubus spp. Germplasm Bank Using Machine Learning Tools
Of the analyzed machine learning tools, RF presented the best performance when compared to SVMl, SVMr, and NN for the ability to prioritize highly discriminating descriptors of accessions from the Rubus genus. This algorithm has been widely used in classification processes given its good behavior, simplicity in terms of requirements and parameters, computational optimization capacity, high precision, and robustness to noise, among other reasons [55]. The classification values (AUC) found in this study agree with similar processes where the RF algorithm was used to determine classes associated with different phenotypes and landscape uses with information from remote sensors, among other uses [56].
Machine learning is one of the most used tools today in various fields, including agriculture [2,57], and has established itself as one of the most effective methods for detecting and predicting patterns [58]. Its generalized use poses a challenge for the user community in terms of adequate application; an optimization and analysis process is needed for each algorithm, such as the one developed in this study.
The biological interpretation of the optimization and selection of morphological descriptors in the Rubus spp. germplasm bank indicates that the descriptors selected with the RF algorithm were variable and informative within the genus Rubus, maximizing the contrast in the phenotypic discrimination [59]. These results suggest that some numerical morphometric characteristics largely define the interspecific morphological variability of the evaluated Rubus accessions, especially those related to the size of the plant organs.
The numerical descriptors with the highest level of discrimination, such as V26 (number of stingers in the stem internode) and V24 (length of the base of the stinger in the stem), are characteristics related to V23 (shape of the stinger in the stem), which makes variations in this vegetative structure a very valuable indicator that easily discriminates accessions or materials. Descriptors related to the size and arrangement of the leaf on the stem were also very important. Since species are represented by accessions and intraspecific variability, our results generally confirmed that the descriptors associated with the leaf and stem (Figure 7) tend to be the most informative [60]. Studies on Rubus subgenus Rubus highlight these descriptors in the determination of qualitative and quantitative variation among accessions [60]. These results suggest that many of the numerical descriptors prioritized by the RF method show the possibility of generating scale ratios, which is useful for comparing intervals, differences, and derivatives in absolute or dimensionless values [60]. Therefore, prioritizing these descriptors would allow the generation of composite indicators and more informative comparisons between collections and plant germplasm banks from different regions [61].
Likewise, when descriptive variables were incorporated, both interspecific and intraspecific characteristics helped discriminate the phenotype of the Rubus genus materials. With this combination, the descriptors V9 (primary color of the stem surface), V23 (shape of the stinger in the stem), and V53 (primary color of the petals) contributed greatly to the definition of the morphological descriptors for this genus, highlighting the importance of the shape of the stinger in the stem, which has proven to be a highly discriminating characteristic both within and between species [21].
The 11 phenotypic variables allowed the discrimination of accessions, considered as the minimum morphological characters that would facilitate the study of Rubus germplasm. This optimization will allow characterizing the phenotypic traits of the accessions and covering a large number in a short time, reducing the time for the characterization of the entire germplasm bank [17].

Selection of the Genotype Clustering Method of the Rubus spp. Germplasm Bank
In the hierarchical agglomerative clustering and K-means, the standardization process combined with the determination of the optimal number of clusters presented a better discriminant power for Rubus spp. accessions. If the initial population and its distribution have large distances between individuals, produced by the use of non-standardized values, the number of clusters produced by the Gap statistic tends to be low and, therefore, the discrimination power is lower [49]. This affects the K-means and hierarchical agglomerative clustering methods since the Gap calculation includes the logarithm of the Sum of Square Errors in modal distribution in all terms as a quotient [62].
The NVI, ARI, CH, and PG statistics indicated that the greatest effectiveness was seen with the standardized K-means variants with reduced variables, standardized K-means with all variables, and standardized hierarchical agglomerative clustering with all the variables, because said statistics related the number of classes, their internal normalized variations and the order of the scores between the different classes formed. Therefore, they were strongly influenced by the number of clusters, the distances between individuals, and the nature of the grouping methods [63].
The K-means method performed better than hierarchical agglomerative clustering. This result was due to the reassignment nature of the K-means method, which allows each permutation to have an individual assigned to a group, independent of the group it was assigned to in the immediately previous permutation, contrary to the hierarchical methods, where individuals are assigned to a cluster depending on the initial parameters and remain in that group until the end of the analysis, creating subgroups in lower hierarchies [64].
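The reassignment behavior described above can be seen in a minimal Python sketch of the standard k-means iteration (Lloyd's algorithm). The points and the deliberately poor starting centroids are invented, and the function names are hypothetical; the study ran k-means in R.

```python
import math

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of its members (unchanged if it has none)."""
    new_centroids = []
    for k, old in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == k]
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids

def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update until the labels stop changing."""
    labels, history = None, []
    for _ in range(max_iter):
        new_labels = assign(points, centroids)
        history.append(new_labels)
        if new_labels == labels:
            break
        labels = new_labels
        centroids = update(points, labels, centroids)
    return labels, centroids, history

points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
start = [(0, 0), (0, 1)]  # both starting centroids sit in the same group

labels, centroids, history = k_means(points, start)
print(history[0])  # [0, 1, 0, 1, 1, 1]
print(labels)      # [0, 0, 0, 1, 1, 1]
```

The second point starts in cluster 1 and is moved to cluster 0 once the centroids are updated. An agglomerative hierarchical method has no equivalent step, since a merge made at one level is never undone at a later level.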
This study indicated that the K-means method and the reduction of variables using the RF algorithm are an excellent alternative for descriptor optimization and discrimination of accessions from the Rubus spp. germplasm bank, with high potential for use in fast, efficient characterizations in other germplasm banks. Even with the promising results presented here, these methods require internal validation, proper selection of the combination of variables, and specific clustering models for replication in different plant matrices [65].
Knowledge of the morphological variability of germplasm improves the understanding of the relationship between structural morphology and the corresponding functional botany [66]. When financial or human resources are limited, less relevant characters can be eliminated with objective criteria, such as the process performed in this work. In addition, morphological descriptors must be easily determined and have a constant phenotypic expression in all environments, that is, high heritability and low environmental influence. Optimization of descriptors makes information available quickly and accurately, promoting efficient conservation management and maximizing the use of financial resources [13]. Based on the previous assumptions, our work constitutes an important contribution to the evaluation of the morphological variation of the Rubus germplasm with statistical, botanical, and taxonomic validity.

Conclusions
The correct optimization process of the RF algorithm allowed stable morphological descriptors with high taxonomic concordance to be selected, thus eliminating redundant and obsolete descriptors that present a high cost-benefit ratio. The adequate combination of discriminant morphological descriptors with the optimal parameters of the K-means clustering method showed a promising approach to discriminate different materials from a population with high phenotypic variability, such as the Rubus spp. germplasm bank. This is particularly valuable since it is the first report on the use of machine learning tools and optimization of discriminant methods for the prioritization of quantitative and qualitative morphological descriptors and the ability to differentiate genotypes from plant germplasm banks for the Rubus genus in tropical environments.
Data Availability Statement: Data associated with the morphological characterization of Rubus genus germplasm are part of the country's genetic resources; the nation's protection laws do not allow the publication of specific results without prior authorization. In addition, the R code generated during the current study is available from the corresponding author on reasonable request. As a group, we are working on the development of an R package and a Jupyter notebook for Python.