Probabilistically sampled and spectrally clustered plant species using phenotypic characteristics

Phenotypic characteristics of a plant species refers to its physical properties as cataloged by plant biologists at different research centers around the world. Clustering species based upon their phenotypic characteristics is used to obtain diverse sets of parents that are useful in their breeding programs. The Hierarchical Clustering (HC) algorithm is the current standard in clustering of phenotypic data. This algorithm suffers from low accuracy and high computational complexity issues. To address the accuracy challenge, we propose the use of Spectral Clustering (SC) algorithm. To make the algorithm computationally cheap, we propose using sampling, specifically, Pivotal Sampling that is probability based. Since application of samplings to phenotypic data has not been explored much, for effective comparison, another sampling technique called Vector Quantization (VQ) is adapted for this data as well. VQ has recently generated promising results for genotypic data. The novelty of our SC with Pivotal Sampling algorithm is in constructing the crucial similarity matrix for the clustering algorithm and defining probabilities for the sampling technique. Although our algorithm can be applied to any plant species, we tested it on the phenotypic data obtained from about 2,400 Soybean species. SC with Pivotal Sampling achieves substantially more accuracy (in terms of Silhouette Values) than all the other proposed competitive clustering with sampling algorithms (i.e. SC with VQ, HC with Pivotal Sampling, and HC with VQ). The complexities of our SC with Pivotal Sampling algorithm and these three variants are almost the same because of the involved sampling. In addition to this, SC with Pivotal Sampling outperforms the standard HC algorithm in both accuracy and computational complexity. We experimentally show that we are up to 45% more accurate than HC in terms of clustering accuracy. The computational complexity of our algorithm is more than a magnitude less than that of HC.


INTRODUCTION
Genetic diversity has been an important foundation of plant breeding from the inception of agriculture since it helps develop new plants to meet the growing food demand globally. The breeding process is a complex combination of multiple stages (Louwaars, 2018). The first stage involves discovery of the native characteristics where the selection of diverse parent donors is of paramount importance (Swarup et al., 2020). One way plant genetic diversity can be studied is by using their phenotypic characteristics (physical characteristics). This kind of analysis can be relatively easily done because a sufficiently large amount of data is available from different geographical areas. In the phenotypic context, which is our first focus, a few characteristics that play an important role are Days to 50% Flowering,Days to Maturity,Plant Height,100 Seed Weight,Seed Yield Per Plant,Number of Branches Per Plant,etc. Cluster analysis is an important tool to describe and summarize the variation present between different plant species (Painkra et al., 2018). Thus, clustering can be used to obtain diverse parents which, as mentioned above, is of utmost importance. It is obvious that after clustering, the species present in the same cluster would have similar characteristics, while those present in different clusters would be diverse. Phenotypic data for the species of different plants (e.g., Soybean, Wheat, Rice, Maize, etc.) usually have enough variation for accurate clustering. However, if this data is obtained for the species of the same plant, then clustering becomes challenging due to less variation in the data, which forms our second focus.
Hierarchical Clustering (HC) is a traditional and standard method that is currently being used by plant biologists for grouping of phenotypic data (Painkra et al., 2018;Sharma et al., 2014;Kahraman, Onder & Ceyhan, 2014). However, this method has a few disadvantages. First, it does not provide the level of accuracy required for clustering similar species (Rokach, 2009). Second, HC is based upon building a hierarchical cluster tree (also called dendrogram), which becomes cumbersome and impractical to visualize when the data is too large. The second most common clustering algorithm that is being currently used widely is Unweighted Pair Group Method using Arithmetic Mean (UPGMA). This algorithm is a variant of HC, and hence, has the same two disadvantages as discussed above.
To overcome these two disadvantages, in this paper, we propose the use of the Spectral Clustering (SC) algorithm. SC is mathematically sound and is known to give one of the most accurate clustering results among the existing clustering algorithms (Luxburg, 2007). For genotypic data, we have recently shown substantial accuracy improvements by using SC as well (Shastri et al., 2019). Furthermore, unlike HC, SC does not generate the intermediate hierarchical cluster tree. To the best of our knowledge, this algorithm has not been applied to phenotypic data in any of the previous works (see the Literature Review section below).
HC, as well as SC, both are computationally expensive. They require substantial computational time when clustering large amounts of data (Luxburg, 2007;Mullner, 2013). Hence, we use sampling to reduce this complexity. Probability-based sampling techniques have recently gained a lot of attention because of their high accuracy at reduced cost (Tille, 2006). Among these, Pivotal Sampling is most commonly used, and hence, we apply it to phenotypic data (Chauvet, 2012). Like for SC, using Pivotal Sampling for phenotypic data is also new. Recently, Vector Quantization (VQ) has given promising results for genotypic data (Shastri et al., 2019). Hence, here we adapt VQ for phenotypic data as well. This also serves as a good standard against which we compare Pivotal Sampling.
To summarize, in this article, we develop a modified SC with Pivotal Sampling algorithm that is especially adapted for phenotypic data. The novelty of our work is in constructing the crucial similarity matrix for the clustering algorithm and defining the probabilities for the sampling technique. Although our algorithm can be applied to any plant species, we test it on around 2,400 Soybean species obtained from Indian Institute of Soybean Research, Indore, India (Gireesh et al., 2015). In the experiments, we perform four sets of comparisons. First, we show that use of Pivotal Sampling does not deteriorate the cluster quality. Second, our algorithm outperforms all the proposed competitive clustering algorithms with sampling in terms of the accuracy (i.e. modified SC with VQ, HC with Pivotal Sampling, and HC with VQ). The computational complexities of all these algorithms are similar because of the involved sampling. Third, our modified SC with Pivotal Sampling doubly outperforms HC, which as earlier, is a standard in the plant studies domain. In terms of the accuracy, we are up to 45% more accurate. In terms of complexity, our algorithm is more than a magnitude cheaper than HC. Fourth and finally, we demonstrate the superiority of our algorithm by comparing it with two previous works that are closest to ours.
The rest of this article is organized as follows. Section "Literature Review" provides a brief summary of the previous studies on phenotypic data. The standard algorithms for Pivotal Sampling and SC are discussed in Section "Sampling and Clustering Algorithms". Section "Implementing Pivotal Sampling and Modified Spectral Clustering for Phenotypic Data" describes the crucial adaptations done in Pivotal Sampling and SC for phenotypic data. The data description, validation metric, and the experimental set-up are presented in Section "Methodology". Section "Results and Discussion" gives the experimental results. Finally, conclusions and future work are provided in Section "Conclusions and Future Work".

LITERATURE REVIEW
In this section, we present some relevant previous studies on phenotypic data and the novelty of our approach. Broadly, these studies can be classified into two categories. The first category consists of the works that identify relationships between the different phenotypic characteristics (for example, lower plant height may relate to lower plant yield or vice versa). These works are discussed in Section "First Category Previous Studies". The second category consists of the studies that identify the species with dissimilar phenotypic characteristics for the breeding program. These studies are discussed in Section "Second Category Previous Studies". Finally, we present a set of works that belong to both the categories in Section "Both Categories Previous Studies".
First category previous studies Immanuel et al. (2011) in 2011 measured nine characteristics of 21 Rice species. Grain Yield (GY) was kept as the primary characteristic, and its correlations with all others were obtained. It was observed that characteristics like Plant Height (PH), Days to 50% Flowering (DF), Number of Tillers per Plant (NTP), Filled Grains per Panicle (FGP) and Panicle Length (PL) were positively correlated with GY. The remaining characteristics were negatively correlated with GY. Divya et al. (2015) in 2015 recorded 21 characteristics of two Rice species. The authors investigated the association between Infected Leaf Area (ILA), Blast Disease Susceptibility (BDS), Number of Tillers per Plant (NTP), Grain Yield (GY) and others. The authors concluded that, for example, (a) ILA had a significant positive correlation with leaf's BDS, (b) NTP exhibited the highest association with GY. Gireesh et al. (2015) in 2015 analyzed eight characteristics of 3,443 Soybean species. The authors sampled the species using two methods, and correlations of all the characteristics with each other for both the samples were estimated. It was observed that, for example, Days to 50% Flowering (DF) was positively correlated with Days to Pod Initiation (DPI) in both the samples, while Number of Pods Per Plant (NPPP) showed a negative correlation with Nodes Per Plant (NPP). Huang et al. (2018) in 2018 studied six characteristics of 206 Soybean species. These characteristics were correlated with the three types of leaves; elliptical leaves, lanceolate leaves and round leaves. The authors deduced that Soybean plants with lanceolate leaves had maximum average Plant Height (PH), Number of Pods per Plant (NPP), Number of Branches per Plant (NBP), and 100-Seed Weight (SW), while Soybean plants with other two types of leaves had lower values of these characteristics.
Carpentieri-Pipolo, de Almeida Lopes & Degrassi (2019) in 2019 investigated 45 phenotypic characteristics of a Soybean specie. The authors then studied the effect of 20 bacteria isolated from roots, leaves, and stems on these characteristics (i.e. whether the bacteria had positive or negative activity on (correlation with) the 45 characteristics). For example, Enterobacter Ludwigii (EL) bacteria, which is isolated from leaves, showed a positive correlation with 25 characteristics (e.g., Plant Growth Promotion (PGP)) and a negative correlation with remaining 20 characteristics (e.g., Phenylacetic Acid (PAC) assimilation). For better exposition, the above five studies are summarized in Table 1. Sharma et al. (2014) in 2014 performed clustering of 24 synthetic Wheat species. Cluster analysis was performed using HC, and the species were grouped into three clusters using the polymorphic Inter Simple Sequence Repeat (ISSR) markers. The authors argued that species belonging to different clusters were diverse in terms of heat tolerance, and could be used to develop better heat tolerant specie. Kahraman, Onder & Ceyhan (2014) in 2014 analyzed the field performance of 35 Common Bean species by grouping them. The authors used HC, and the species were clustered into three groups based upon the matrix of relationship between the species.

Second category previous studies
The species belonging to different clusters were considered diverse, and were used to select promising species for breeding. Painkra et al. (2018) in 2018 performed clustering of 273 Soybean species. Here, the authors used HC, and the species were grouped into seven clusters using Pearson Correlation Coefficient. According to the authors, the species belonging to the distant clusters were more diverse such that choosing them maximized heterosis 1 in crossbreeding. Islam et al. (2020) in 2020 clustered ten Upland Rice species. Here, HC was used and the species were grouped into three clusters using a similarity coefficient between the species. The authors identified the two best species that could be used to obtain new species having higher plant yield. As earlier, here also, we summarize the above four studies in Table 2 below.
Both categories previous studies Fried, Narayanan & Fallen (2018) in 2018 analyzed 11 characteristics of 49 Soybean species. The authors determined correlations between the root characteristics and other phenotypic characteristics. For example, Shoot Dry Weight (SDW) and Chlorophyll Index (CI) were positively correlated with Total Root Length (TRL) and Total Root Surface Area (TRSA), while Plant Height (PH) was negatively correlated with TRSA and Average Root Diameter (ARD). In this work, Principal Component Analysis (PCA) biplot was used to separate the species into seven clusters. According to the authors, this research was critical for Soybean improvement programs since it helped select species with the improved root characteristics. Stansluos et al. (2019) in 2019 analyzed 22 phenotypic characteristics for 11 Sweet Corn species. For example, the authors showed a positive and significant correlation of Yield of Marketable Ear (YME) with Ear Diameter (ED) and Number of Marketable Ear (NME), while a negative correlation between YME and Thousand Kernel Weight (TKW). Cluster analysis was performed using HC, and the corn species were grouped into four clusters using the Ward Linkage. The authors inferred substantial variation in morphological and agronomic capabilities of different species. Again, we summarize the above two studies in Table 3 below.
With the focus on the study of genetic diversity using phenotypic data, we have multiple novel contributions as below.
1. We focus on the second category above, and perform grouping of several thousand species as compared to a few hundred in the articles cited above. Note that from the first category, Gireesh et al. (2015) did work with about three thousand species, and we do compare one aspect of our work with this previous work (more on this in the point 2a below).
2. Clustering becomes computationally expensive when the size of the data is very large. Hence, sampling is required to make the underlying algorithm scalable. Thus, we perform clustering on the sampled data rather than the full one, which is not done in any of the articles above. We have two more innovations in this aspect as below.
a) We use a probability-based sampling technique (Pivotal Sampling as mentioned earlier) that is highly accurate, and forms a completely new contribution. We demonstrate the superiority of our sampling by comparing it with the one done in Gireesh et al. (2015). This comparison is discussed towards the end of the Results section. Please note that Gireesh et al. (2015) only performed sampling and did not cluster their data. b) HC, which is the most common clustering algorithm (and some other sporadically used algorithms like k-means and UPGMA), do not provide the level of accuracy needed. Again, as earlier, we develop a variant of the SC algorithm, which is considered highly accurate, especially for phenotypic data. Use of SC in this context is also completely new. We show the dominance of our clustering algorithm over the one proposed in the most recent past work by Islam et al. (2020) towards the end of the Results section. Again, please note that Islam et al. only performed clustering and did not sample their data.

SAMPLING AND CLUSTERING ALGORITHMS
In this section, we briefly discuss the standard algorithms for Pivotal Sampling and SC in the two subsections below.

Pivotal sampling
This is a well-developed sampling theory that handles complex data with unequal probabilities. The method is attractive because it can be easily implemented by a sequential procedure, i.e. by a single scan of the data (Deville & Tille, 1998). Thus, the complexity of this method is OðnÞ, where n is the population size. It is important to emphasize that the method is independent of the density of the data. Consider a finite population U of size n with its each unit identified by a label i = 1, 2,…, n. A sample S is a subset of U with its size, either being random (N(S)) or fixed (N). Obtaining the inclusion probabilities of all the units in the population, denoted by π i with i = 1, 2,…, n, forms an important aspect of this unequal probability sampling technique.
The pivotal method is based on a principle of contests between units (Tille, 2006). At each step of the method, two units compete to get selected (or rejected). Consider unit i with probability π i and unit j with probability π j , then we have the two cases as below.
1. Selection step (π i + π j ≥ 1): Here, one of the units is selected, while the other one gets the residual probability π i + π j − 1 and competes with another unit at the next step. More precisely, if (π i , π j ) denotes the selection probabilities of the two units, then (1) 2. Rejection step (π i + π j < 1): Here, one of the units is definitely rejected (i.e. not selected in the sample), while the other one gets the sum of the inclusion probabilities of both the units and competes with another unit at the next step. More precisely, This step is repeated for all the units present in the population until we get the sample of size N(S) or N. The worst-case occurs when we obtain the last sample (i.e. N th sample) in the last iteration.

Spectral clustering
Clustering is one of the most widely used techniques for exploratory data analysis with applications ranging from statistics, computer science, and biology to social sciences and psychology etc. It is used to get a first impression of data by trying to identify groups having "similar behavior" among them. Compared to the traditional algorithms such as k-means, SC has many fundamental advantages. Results obtained by SC are often more accurate than the traditional approaches. It is simple to execute and can be efficiently implemented by using the standard linear algebra methods. The algorithm consists of four steps as below (Luxburg, 2007).
1. The first step in the SC algorithm is the construction of a matrix called the similarity matrix. Building this matrix is the most important aspect of this algorithm; better its quality, better the clustering accuracy (Luxburg, 2007). This matrix captures the local neighborhood relationships between the data points via similarity graphs and is usually built in three ways. The first such graph is a ε-neighborhood graph, where all the vertices whose pairwise distances are smaller than ε are connected. The second is a k-nearest neighborhood graph, where the goal is to connect vertex v i with vertex v j if v j is among the k-nearest neighbors of v i . The third and the final is the fully connected graph, where each vertex is connected with all the other vertices. Similarities are obtained only between the connected vertices. Thus, similarity matrices obtained by the first two graphs are usually sparse, while the fully connected graph yields a dense matrix. Let the n vertices of a similarity graph be represented numerically by vectors a 1 , a 2 ,…, a n , respectively. Here, each a i 2 R m is a column vector for i = 1,…, n. Also, let a l i and a l j denote the l th elements of vectors a i and a j , respectively, with l = 1,…, m. There exist many distance measures to build the similarity matrix (Cha, 2007). We describe some common ones below using the above introduced terminologies. a) City block distance: (Cha, 2007) It is the special case of the Minkowski distance with p = 1. b) Euclidean distance: (Cha, 2007) It is the ordinary straight line distance between two points in the Euclidean space. It is again the special case of the Minkowski distance, where the value of p is taken as 2. Thus, it is given by c) Squared Euclidean distance: (Cha, 2007) It is the square of the Euclidean distance, and is given by d) Cosine distance: (Cha, 2007) It measures the cosine of the angle between two non-zero vectors, and is given by where, ||·|| denotes the Euclidean norm of a vector. e) Correlation distance: (Szekely, Rizzo & Bakirov, 2007) It captures the correlation between two non-zero vectors, and is given by where, a i and a j are the means of a i and a j multiplied with a vector of ones, respectively, and t signifies the transpose operation.
f) Hamming distance: (Norouzi, Fleet & Salakhutdinov, 2012) It measures the number of positions at which the corresponding values of two vectors are different, and is given by 2. Next, a matrix called the Laplacian matrix is constructed. This matrix is either non-normalized or normalized. The non-normalized Laplacian matrix is defined as where W is the similarity matrix and D is a diagonal matrix whose elements are obtained by adding together the elements of all the columns for every row of W. Normalized Laplacian matrix is again of two types: the symmetric Laplacian (L sym ) and the random walk Laplacian (L rw ). Both these matrices are closely related to each other and are defined as Henceforth, the non-normalized Laplacian matrix is referred to as the Type-1 Laplacian, L sym as the Type-2 Laplacian, and L rw as the Type-3 Laplacian. In the literature, it is suggested to use the normalized Laplacian matrix instead of the non-normalized one, and specifically the Type-3 Laplacian (Luxburg, 2007).
3. Once we have the Laplacian matrix, we obtain the first k eigenvectors u 1 ,…, u k of this matrix, where k is the number of clusters.
4. Finally, these eigenvectors are clustered using the k-means clustering algorithm.

IMPLEMENTING PIVOTAL SAMPLING AND MODIFIED SPECTRAL CLUSTERING FOR PHENOTYPIC DATA
Here, we first present the application of Pivotal Sampling to obtain the samples from phenotypic data. Subsequently, we implement our modified SC algorithm on the same data. Consider that the phenotypic data of a plant consists of n species with each specie evaluated for m different characteristics/traits. These characteristics may have categorical (non-numerical) or numerical values. Hence, we need to convert the categorical values into numerical ones. For this, we use the label encoder method (Hancock & Khoshgoftaar, 2020). This method transforms non-numerical labels into numerical values between 0 and (number of categories) -1. For example, if a characteristic has three possible labels; poor, good, and very good, we use 0, 1, and 2 to represent them, respectively. As discussed in Section "Pivotal Sampling", Pivotal Sampling requires that the inclusion probabilities (i.e. π i for i = 1,…, n), of all the species in the population U, be computed before a unit is considered for a contest. The set of characteristics associated with a specie can be exploited in computing these probabilities. To select a sample of size N, where N ( n, we obtain these probabilities as Deville & Tille (1998) where ϰ i can be a property associated with any one characteristic (or a combination of them) of the i th specie. Obtaining π i in such a way also ensures that P n i¼1 p i ¼ N, i.e. we get exactly N selection steps, and in-turn, exactly N samples.
In our implementation, we use the deviation property of the species, which is discussed next. Since different characteristics have values in different ranges, we start by normalizing them as below (Jain, Nandakumar & Ross, 2005;Shastri, Tamrakar & Ahuja, 2018).
Here, ðX j Þ i and (x j ) i are the normalized value and the actual value of the j th characteristic for the i th specie, respectively with j = 1,…, m and i = 1,…, n. Furthermore, max(x j ) and min(x j ) are the maximum and the minimum values of the j th characteristic among all the species. Now, the deviation for the i th specie is calculated using the above normalized values as Here, maxðX j Þ denotes the maximum normalized value of the j th characteristic among all the species. Practically, a relatively large value of dev i indicates that the i th specie is less important, and hence, its probability should be small. Thus, the inclusion probability of a specie is calculated by taking ϰ i ¼ 1 dev i in Eq. (13) or Thus, if the sum of probabilities of two species under consideration is greater than or equal to 1, we follow the selection step as discussed in Section "Pivotal Sampling". On the other hand, we follow rejection step when this sum is less than 1. This process is repeated until we obtain N species.
Next, we discuss the clustering of these N species into k clusters. Similar to the standard SC algorithm discussed in Section "Spectral Clustering", the first step in our modified SC is to obtain the similarity matrix. As mentioned earlier, this is the most important aspect of this algorithm since the better the matrix quality, the better the clustering accuracy. For this, we consider these N species as the vertices of a graph. Let vector p i contain the normalized values of all the characteristics (m) for the i th specie. Thus, we have N such vectors corresponding to the N species selected using Pivotal Sampling. That is, In our implementation, we use a fully connected graph to build the similarity matrix, i.e. we obtain similarities among all the N species.
We define the similarity between the vectors p1 and p2 (without loss of generality, representing the species 1 and 2, respectively) as the inverse of the distance between these vectors obtained by using the distance measures mentioned in "Spectral Clustering". This is intuitive because smaller the distance between any two species, larger the similarity between them and vice versa. We denote this distance by d p1p2 . We build this matrix of size N × N by obtaining the similarities among all the N species. The next step is to compute the Laplacian matrix, which when obtained from the above-discussed similarity matrix, generates poor eigenvalues, 2 and in-turn poor corresponding eigenvectors that are required for clustering 3 . Thus, instead of taking only the inverse of d p1p2 , we also take its exponent, i.e. we define the similarity between the species 1 and 2 as e −d p1p2 (Ng, Jordan & Weiss, 2002;Nemade et al., 2018). This, besides fixing the poor eigenvalues/eigenvectors problem, also helps perform better clustering of the given data. Further, we follow the remaining steps as discussed in "Spectral Clustering".
Above, we discussed the clustering of N sampled species into k clusters. However, our goal is to cluster all n species and not just N. Hence, there is a need to reverse-map the remaining n − N species to these k clusters. For this, we define the notion of average similarity, which between the non-clustered speciep and the cluster C l is given as Here, #(C l ) denotes the number of species present in C l and q is a specie originally clustered in C l by our modified SC algorithm with Pivotal Sampling. We obtain the average similarity ofp with all the k clusters (i.e. with C l for l = 1,…, k), and associate it with the cluster with whichp has the maximum similarity.
Next, we perform the complexity analysis of our algorithm. Since Pivotal Sampling and SC form the bases of our algorithm, we discuss the complexities of these algorithms before ours. Thus, the overall complexity of our algorithm is OðnN þ N 3 þ N 2 mÞ. Here, we have kept three terms because any of these can dominate (here, n ) N, m).
When we compare complexity of our algorithm with that of HC, which is Oðn 3 Þ, it is evident that we are more than a magnitude faster than HC. We revisit this complexity analysis after discussing data in the next section, which supports our claim further.

METHODOLOGY
In this section, we first briefly discuss the data used for our experiments. Next, we check the goodness of our sampling technique by estimating a measure called the population total. The hypothesis related to this is as follows: for a particular sampling technique, if the estimate (or approximation) of the population total using the samples is close to the actual population total, then that sampling technique is considered good in an absolute sense. Finally, we describe the clustering set-up, where the below are analyzed.
a) The Validation Metric. It is hypothesized that a good clustering is one where clusters are compact and well-separated.
b) The Ideal Number of Clusters. The hypothesis related to this is as follows: given a set of eigenvalues of the Laplacian matrix, we can exploit the differences between these eigenvalues to obtain the ideal number of clusters.
c) The Suitable Distance Measures. For building the similarity matrix and the Laplacian matrix, it is hypothesized to chose those matrices that give the best value for the validation metric.

Data description
As mentioned in Introduction, our techniques can be applied to any plant data, however, here we experiment on phenotypic data of Soybean species. This data is taken from Indian Institute of Soybean Research, Indore, India, and consists of 29 different characteristics/ traits for 2,376 Soybean species (Gireesh et al., 2015). Among these, we consider the following eight characteristics that are most important for higher yield: Early Plant Vigor ( Hence, we convert these two categorical values into numerical ones using the label encoder method discussed in the previous section. A snapshot of this phenotypic data for a few Soybean species is given in Appendix A. Here, we also perform validation of this data by comparing it with a similar dataset. Next, we compare the complexities of our algorithm and HC using the selected data; see Table 4. It is evident from this table that our algorithm achieves substantial savings.

Sampling discussion
To inspect the quality of our sampling techniques, we estimate a measure called the population total, which is the addition of values of a particular characteristic for all the n units (species here) present in the population U. For example, if "Plant Height (PH)" is the characteristic of interest, then the population total is the addition of PH values for all the n species. Mathematically, the exact (or actual) population total for a characteristic of interest x j is given as where, as earlier, (x j ) i is the value of the j th characteristic for the i th specie and U is the set of all species. By the definition of this measure (and also for two more measures listed below in this section), we work with original (non-normalized) values of the characteristics rather than normalized ones. Also, based upon the same argument, we work with only those characteristics that are originally numerical.
In this work, we use two different estimators to compute an approximation of the population total from the sampled data. Closer the value of an estimator to the actual value, better the sampling. First is the Horvitz-Thompson (HT)-estimator (also called πestimator), which is defined as Horvitz & Thompson (1952) where, π i is the inclusion probability of the i th specie as evaluated in Section "Implementing Pivotal Sampling and Modified Spectral Clustering for Phenotypic Data" and S is the set of sampled species. Another estimator that we use is the Hájek-estimator. It is usually considered better than the HT-estimator and is given as Hájek (1971) Y here, as earlier, n is the total number of species. The actual population total and the values of the above two estimators for six characteristics (that have numerical values) when using Pivotal Sampling and 500 samples are given in Table 5 (see columns 3, 4, and 6, respectively). From this table, it is evident that the approximate values of the population total are very close to the corresponding actual values. Thus, Pivotal Sampling works well in an absolute sense. Here, we also compute the values of the two estimators when using VQ (see columns 5 and 7). We can notice from these results that VQ also works reasonably well, but Pivotal Sampling is better.

Clustering setup
Here, first, we describe the criteria used to check the goodness of the generated clusters. There are two categories of metrics available for the validation of clustering algorithms. One category includes the metrics that require prior knowledge of the cluster labels (Fahad et al., 2014). On the other hand, metrics from the second category do not have this requirement (Fahad et al., 2014;Rousseeuw, 1987). In this work, the ideal cluster labels are not available, and hence, we use a metric called Silhouette Value (from the second category) for validation of our clustering algorithms (Rousseeuw, 1987).
Clustering is considered good if the obtained clusters are compact and well-separated. Silhouette Value captures both these aspects well by computing the intra-cluster similarity and the inter-cluster similarity. Consider that we have k clusters represented as C 1 ,…, C k , and we want to obtain the Silhouette Value of the i th data point present in the cluster C 1 . For this, we compute the average distance between this data point and all the other points in the cluster C 1 . This distance is denoted as a(i). Next, we compute the average distance between the i th data point and all the other points in clusters C 2 ,…, C k . This distance is denoted as b(i). Then, for this point, Silhouette Value is computed as below (Rousseeuw, 1987).
sðiÞ ¼ bðiÞ À aðiÞ maxfaðiÞ; bðiÞg : As evident from the above discussion, the intra-cluster similarity is captured by a(i), and the inter-cluster similarity is captured by b(i). This value usually lies between minus one to plus one because the denominator of Eq. (21) is always greater than its numerator. Silhouette Value for the overall clustering is obtained by averaging the Silhouette Values of all the data points. If this value tends towards a positive one, then the clustering is considered to be good. On the other hand, if this value tends towards a negative one, then the clustering is considered poor.
Second, we determine the ideal number of clusters by using the eigenvalue gap heuristic (Luxburg, 2007;Kong et al., 2010). If λ 1 , λ 2 , …, λ n are the eigenvalues of the matrix used for clustering (e.g., the Laplacian matrix), then often the initial set of eigenvalues, say k, have a considerable difference between the consecutive ones in this set. That is, j i À iþ1 j 6 % 0 for i = 1,…, k − 1. After the k th eigenvalue, this difference is usually approximately zero. According to this heuristic, this k gives a good estimate of the ideal number of clusters.
For this experiment, without loss of generality, we build the similarity matrix using the Euclidean distance measure on the above discussed phenotypic data. As mentioned earlier, it is recommended to use the Type-3 Laplacian matrix (Luxburg, 2007). Hence, we use its eigenvalues for estimating k. Figure 1 represents the graph of the first fifty smallest eigenvalues (in absolute terms) of this Laplacian matrix. On the x-axis, we have the eigenvalue number, and on the y-axis its corresponding value. From this figure, we can see that there is a considerable difference between the first ten consecutive eigenvalues. After the tenth eigenvalue, this difference is very small (tending to zero). Hence, based upon the earlier argument and this plot, we take k as ten.
To corroborate this choice more, we experiment with k as twenty and thirty as well. As expected, and discussed in-detail later in this section, Silhouette Values for these numbers of clusters are substantially lower than those for ten clusters.
Third, and final, we perform experiments to identify the suitable similarity measures to build the similarity matrix, and also verify that, as recommended, the Type-3 Laplacian matrix is the best. Table 6 below gives Silhouette Values of our modified SC for all seven similarity measures and three Laplacians when clustering the earlier presented phenotypic data into 10, 20, and 30 clusters.
From this table, it is evident that Silhouette Values for the Euclidean, Squared Euclidean, City-block and Correlation similarity measures and the Type-3 Laplacian matrix are the best. Hence, we use these four similarity measures and this Laplacian matrix. Also, as mentioned earlier, Silhouette Values decrease for twenty and thirty cluster sizes.

RESULTS AND DISCUSSION
Using the earlier presented dataset, and sampling-clustering setups, we compare our proposed algorithm (i.e. modified Spectral Clustering (SC) with Pivotal Sampling) with the existing variants in four ways. Again, as earlier, we use Silhouette Values for comparison. Quantifying statistical difference between different Silhouette Values is a hard task.
In general, the more closer these values are to one, the better is the clustering (see Section "Clustering Setup"). First, we demonstrate that use of sampling with modified SC does not deteriorate the quality of clustering. Second, we compare our algorithm with modified SC with Vector Quantization (VQ), Hierarchical Clustering (HC) 4 with Pivotal Sampling and HC with VQ for a sample size of 500. Since the results for modified SC with VQ come out to be closest to our algorithm, next, for broader appeal we compare these two algorithms for a sample size of 300. Third, we compare our algorithm with the current best in literature for this kind of data (i.e. HC without sampling) for both the sample sizes of 500 and 300. Fourth and finally, as discussed in the Literature Review section, we compare our sampling with that in Gireesh et al. (2015) and our clustering with the one in Islam et al. (2020).
Initially, we calculate the loss of accuracy incurred because of Pivotal Sampling in our algorithm. This loss for both the sample sizes and cluster size ten is listed in Table 7. Columns 1 and 2 give the sample sizes and the similarity measures chosen, respectively. Columns 3 and 4 give the Silhouette Values for modified SC without sampling (from Table 6) and our algorithm, respectively. The last column gives the percentage loss of accuracy. We can observe from this data that the loss of accuracy for one type of similarity measure (Correlation) is almost as low as −2% for both the sample sizes. This is considered acceptable because we are still better than the existing best algorithm (HC without sampling; please see Table 10 and its accompanying discussion below).
Here, we also perform a statistical test to support the above conjecture that using Pivotal Sampling does not substantially deteriorate the accuracy of our modified SC. For this, we use the ANOVA (analysis of variance) test (Rutherford, 2011). This test uses the variance between the different groups and the variance within each group to compute a value called the F-value, which is then compared with a standard estimate called F-critical. If F-value is less than F-critical, then it is inferred that the means of all the groups are equal. The two groups for us refer to the modified SC results (column 3) and the modified SC with Pivotal Sampling results (column 4). The F-values here (using the Silhouette Values of the two groups) come out to be 0.3432 and 0.4202 for N = 500 and N = 300, respectively. Both these values are less than the F-critical value given in the F-distribution table of Beyer (2019), which is 5.9873. Thus, using the above mentioned ANOVA test theory, we infer that that the mean Silhouette Value of modified SC is similar to the mean Silhouette Value of modified SC with Pivotal Sampling for both the sample sizes.
The results for the second set of comparisons are given in Table 8. Columns 2 and 3 give the similarity measures and the number of clusters chosen, respectively. Columns 4 and 5 give Silhouette Values of modified SC with Pivotal Sampling and VQ, respectively, while columns 6 and 7 give Silhouette Values of HC with Pivotal Sampling and VQ, respectively.
When we compare our algorithm (values in the fourth column, and highlighted in bold) with other variants, it is evident that we are clearly better than modified SC with VQ and HC with VQ (values in the fifth and the seventh columns); all our values are higher than those from these two algorithms.
When we compare our algorithm with HC with Pivotal Sampling (values in the sixth column), we again perform better for many cases. However, for some cases, our algorithm performs worse than HC with Pivotal Sampling (highlighted with a Ã ). Upon further analysis (discussed below), we realize that segregation of species by HC with Pivotal Sampling into fewer clusters than practically observed, results in these set of Silhouette Values getting wrongly inflated. To further assess the quality of the proposed technique, we present the distribution of species into different clusters (after reverse-mapping) for HC with Pivotal Sampling and our algorithm. Without loss of generality, this comparison is done using the Squared Euclidean similarity measure and cluster size thirty. The results for HC with Pivotal Sampling are given in Fig. 2 and for our algorithm are given in Fig. 3. In both the figures, on the x-axis, we have the cluster number and on the y-axis, the number of species present in them.
As evident, Fig. 2 depicts a very skewed distribution, i.e. most species are segregated into only a few clusters, while the remaining clusters contain only one or two species. At a broader level, this biased distribution of species obtained by HC with Pivotal Sampling is correct since all species belong to the same plant. On the contrary, the distribution in Fig. 3  is fairly equal. That is, our algorithm equally distributes all species between the different clusters. At a finer level, this distribution is better since our algorithm is able to perform a more detailed clustering, i.e. it splits the bigger clusters into multiple smaller ones, which better captures the similarity between species. This is also the reason for the inflation of Silhouette Values of HC with Pivotal Sampling in Table 8 since the intra-cluster similarity for solitary specie is zero leading to its respective Silhouette Value to become one (the maximum possible; see Eq. (21)). Thus, our algorithm also outperforms HC with Pivotal Sampling, which from Table 8 was not very evident.
Next, as mentioned earlier, to further demonstrate the applicability of our work, we also present the results with a sample size 300. Since modified SC with VQ turns out to be our closest competitor, we compare our algorithm with this one only. This comparison is given in Table 9, with its columns mapping the respective columns of Table 8. As evident from Table 9, our modified SC with Pivotal Sampling substantially outperforms modified SC with VQ (see values in columns 4 and 5). As earlier, third, we compare the results of our algorithm (modified SC with Pivotal Sampling) with the currently popular clustering algorithm in the plant studies domain (i.e. HC without sampling). For this set of experiments, without loss of generality, we use the cluster size of ten. The results of this comparison are given in Table 10, where the first four columns are self-explanatory (based upon the data given in Tables 8 and 9 earlier). In the last column of this table, we also evaluate the percentage improvement in our algorithm over HC. As evident from this table, our algorithm is up to 45% more accurate than HC for both the sample sizes. As earlier, our algorithm also has the crucial added benefit of reduced computational complexity as compared to HC.
Fourth and finally, as mentioned in the Literature Review section, we also compare our work with two previous works that are closest to ours. With the dataset almost the same as used by us, that is, a slightly larger phenotypic data for Soybean species, Gireesh et al. (2015) performed Principal Component and Power Core based samplings to identify relationships between the different phenotypic characteristics (first category as in Section "Literature Review"). We compare our sampling results with the best from Gireesh et al. (2015) in Appendix B, which demonstrates the superiority of our sampling method. Islam et al. (2020) performed HC on phenotypic data for Rice species (second category as in Section "Literature Review"). In Appendix C, we apply modified SC on this dataset to again demonstrate that our clustering technique is better.

CONCLUSIONS AND FUTURE WORK
We present the modified Spectral Clustering (SC) with Pivotal Sampling algorithm for clustering plant species using their phenotypic data. We use SC for its accurate clustering and Pivotal Sampling for its effective sample selection that in-turn makes our algorithm scalable for large data. Since building the similarity matrix is crucial for the SC algorithm, we exhaustively adapt seven similarity measures to build such a matrix. We also present a novel way of assigning probabilities to different species for Pivotal Sampling.
We perform four sets of experiments on about 2,400 Soybean species that demonstrate the superiority of our algorithm. First, we compare the Silhouette Values of modified SC without and with Pivotal Sampling, and show that the difference between these values is not significant. Second, when compared with the competitive clustering algorithms with samplings (SC with Vector Quantization (VQ), Hierarchical Clustering (HC) with Pivotal Sampling, and HC with VQ), Silhouette Values obtained when using our algorithm are higher. Third, our algorithm doubly outperforms the standard HC algorithm in terms of clustering accuracy and computational complexity. We are up to 45% more accurate and an order of magnitude faster than HC. Fourth and finally, we illustrate the excellence of our algorithm by comparing it with two previous works that are closest to ours.
Since the choice of the similarity matrix has a significant impact on the quality of clusters, in the future, we intend to adapt other ways of constructing this matrix such as Pearson χ 2 , Squared χ 2 , Bhattacharyya, Kullback-Liebler etc. (Cha, 2007). Furthermore, we also plan to observe the performance of Cube Sampling, which is another probabilistic sampling technique with data analysis properties complementary to Pivotal Sampling (Tille, 2006). Both Pivotal and Cube belong to the balanced sampling category, i.e. they satisfy Y ≈ Y′ HT and Y ≈ Y′ Hájek (recall Eqs. (18)-(20)). Cube Sampling automatically obtains the samples (without specifying the sample size), which does not happen in Pivotal. As mentioned earlier, our algorithm is developed to work well for phenotypic data of all plant species. This is because different species vary only in the number of characteristics and the type of characteristics, both of which do not affect our algorithm. We have preliminarily discussed this aspect for Maize and Rice in Appendix C, with extensive experiments for these two plants planned for future.

APPENDIX A
Here, we first present phenotypic data of the Soybean species used for our experiments. Please see Table A1 below. Next, we validate this data. For this, we compare our species data with a similar Soybean species data from (Gireesh et al., 2015) for the common set of phenotypic characteristics; Plant Height (PH), Number of Pods Per Plant (NPPP), and Days to Pod Initiation (DPI). This comparison is done using standard statistical metrics and is given in Table A2 below.
From this table, it is evident that the Standard Deviation (SD), Coefficient of Variance (CV), and Mean of our data and the data from the previous work are very close (for all three characteristics of PH, NPPP, and DPI). The slight variation in the metrics between the two data for all the characteristics is due to the difference in the ranges of the respective characteristics (due to the slightly differing selection of the species by the two works).

APPENDIX B
Here, we compare our sampling technique with those proposed by Gireesh et al. (2015) for a similar dataset. As earlier, we do Pivotal Sampling on 2,376 Soybean species while Gireesh et al. (2015) performed the Principal Component Score (PCS) and the Power Core (PC) samplings on 3,443 Soybean species. Since the samples obtained by the PC method are better, we compare our results with this sampling only.
This comparison is done using the statistical metrics of Standard Deviation (SD), Coefficient of Variance (CV) and Mean, and is given in Table A3 below. Since the metrics of our sampled data are more closer to our respective full data as compared to the metrics of the previous works' sampled data to its respective full data, our sampling is better.

APPENDIX C
In the article, we have demonstrated the usefulness of our algorithm on the species of the Soybean plant, here we demonstrate our algorithms' applicability to the species of the other two plants (Maize and Rice). The phenotypic data for the Maize species is given in Table A4, and for the Rice species is given in Table A5.
We can observe from Tables A1, A4, and A5 that there is a set of common phenotypic characteristics for the three plant species. Also, the values of all the characteristics are either categorical or numerical. As mentioned earlier, the categorical values can be easily converted to numerical ones. Since the input to our algorithm is a matrix built using the phenotypic data for given species, it can be applied to any of these plants.
To demonstrate the usefulness of our algorithm to the two new plant species, without loss of generality, we perform clustering of Rice species using our modified SC. For this, we use the data from Islam et al. (2020), where the authors have used HC to cluster ten Rice species into three clusters. Hence, we also cluster these ten species into three clusters using our modified SC. In Islam et al. (2020), the output is in the form of a hierarchical tree, which is non-numerical, and hence, difficult to compare. Thus, we compute Silhouette Values for our modified SC and HC. This data for the four similarity measures are given in Table A6. As evident from this table, our algorithm substantially outperforms HC.