A Review of Unsupervised K-Value Selection Techniques in Clustering Algorithms

Purpose: Automatic grouping of data according to certain characteristics is made possible by clustering algorithms, which makes them an essential tool when working with large datasets. However, although they are unsupervised tools, they generally require the specification of the number of clusters to be formed, k, a task that may be simple for a human but is quite complex to automate. Although the most commonly used k-value selection techniques offer acceptable results, they are not without shortcomings, suggesting that there is ample room for improvement. This paper briefly introduces clustering techniques, discusses the main shortcomings of conventional k-value selection techniques and examines the advantages and limitations of nine promising alternatives presented in recent years. Design/methodology/approach: An evaluation of the main shortcomings of classic k-value estimation techniques has been carried out, and the newest proposals have been explained and compared. Findings: New k-value estimation indices and methodologies proposed by authors yield better results, extending the use of these techniques to large volumes of data and to complex shapes and structures. However, no generic methodology able to overcome all the described shortcomings has yet been developed. Research limitations/implications: This research is limited to the newest techniques for k-value estimation, including proposals published since 2019. Older proposals have not been considered, as the newest ones overcome the former’s shortcomings. A review of k-value estimation techniques published in 2019 is cited in the text as a base reference. Practical implications: Although the examples listed in the text apply to industry, the techniques described and discussed in this review are applicable to any area of science that can benefit from the use of clustering techniques. Originality/value: To date, there has been no paper comparing the new k-value estimation techniques. Although there are literature reviews comparing the classical methods, these methods are nowadays nearly obsolete due to the complexity of the data usually faced.


Introduction
Clustering is a key machine learning process that aims to group sets of unlabeled objects according to their characteristics, in order to build subsets of data known as clusters. Each cluster is formed by a collection of data that, in terms of the considered features, are similar to each other and differ from the rest of the data belonging to the dataset. Specifically, the features of the observations are represented numerically. Therefore, the similarity between two points can be measured as the distance between them (e.g., Euclidean distance). Thus, clustering algorithms will attempt to group observations in such a way as to maximize the similarity between group members as well as the difference with members of other groups.
Data availability and quality keep increasing due to technological advancements, automation, and the pervasive use of interconnected devices. However, large volumes of data are not useful if they cannot be easily managed and if conclusions cannot be drawn from them. Hence, clustering techniques are essential in data mining, enabling the handling of substantial amounts of data based on common characteristics. As an example of this, numerous developments can be found in recent literature in which data clustering is a necessary tool within the research, covering industry-related areas such as risk and quality assessment (Er-Kara, Oktay-First & Ghadge, 2020; Orak, Akkoyunlu & Can, 2020), logistics (Pegado-Bardayo, Lorenzo-Espejo, Muñuzuri & Aparicio-Ruiz, 2023), or production optimization (Hong, Lee, Cho, Jang & Kim, 2023), among others.
As this tool finds application across various fields, the characteristics of different datasets and the needs of data scientists can vary significantly, giving rise to the development of numerous clustering algorithms. These algorithms have been conventionally classified according to whether they employ a partitional approach, in which observations are segregated into a previously specified number of groups, or a hierarchical strategy, in which clusters are created iteratively, either by building them from individual observations and merging them into larger clusters, or by dividing a cluster containing the entire set into smaller clusters until individual clusters are obtained (Saxena, Prasad, Gupta, Bharill, Patel, Tiwari et al., 2017). This means that, while in the latter approach the final output consists of dendrograms expressing relationships between all observations in the dataset, in the former one each observation is assigned exclusively to one cluster.
One of the most widely used clustering algorithms in machine learning and data analysis is the K-means clustering algorithm, due to its simplicity, ease of implementation, and computational efficiency. K-means is a partitioning technique aimed at dividing a dataset into k distinct, non-overlapping subsets. Grouping is done by minimizing the sum of distances between each object and the centroid of its cluster. The naïve version of this algorithm, proposed by Lloyd (1982), follows these steps:
1. Initialization: k points (centers) are initially placed in the data domain, either randomly or following an initialization method.
2. The Voronoi diagram of the k centers is computed and all points inside each cell are assigned to its corresponding center.
3. The center of each Voronoi cell is substituted by the mean value of the points corresponding to that cell.
Steps 2 and 3 are repeated until a stopping criterion is met, usually a maximum number of iterations. The algorithm has converged, however, once the assignments no longer change.
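The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function name lloyd_kmeans, the random-sample initialization, the seed and the toy dataset are all assumptions made for the example.

```python
import math
import random


def lloyd_kmeans(points, k, max_iter=100, seed=0):
    """Naive K-means (Lloyd's iteration) on a list of equal-length tuples."""
    rng = random.Random(seed)
    # Step 1: initialization, here by sampling k distinct observations (assumption).
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest center (its Voronoi cell).
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        # Convergence: the assignments did not change since the last pass.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each center to the mean of the points in its cell.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cell keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels


# Toy data (hypothetical): two well-separated groups of three points each.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = lloyd_kmeans(data, k=2)
```

The loop stops either when the assignments stabilize or when max_iter is reached, which mirrors the two stopping conditions mentioned in the text.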
This algorithm has given rise to extensions that seek to adapt the technique to more complex or extensive datasets, such as BFR (Bradley, Fayyad & Reina, 1998) or Fuzzy C-means algorithms. Partitional algorithms also encompass another widely used group of medoid-based algorithms, such as PAM (Kaufman & Rousseeuw, 1990), CLARA (Kaufman & Rousseeuw, 1990) or CLARANS (Ng & Han, 2002). Unlike K-means, where cluster centers are represented by the mean value of data points in each cluster, in K-medoids cluster centers are actual data points, specifically the most centrally located or "medoid" point within a cluster. Among them, the PAM algorithm (Partitioning Around Medoids) represents the simplest approach, which proceeds with the following steps:
1. Initialization: k of the observations are selected as initial medoids.
2. Each remaining observation is assigned to its closest medoid.
3. For each medoid m, for each non-medoid o:
i) Swap m and o, and recalculate the cost function (sum of the distances of the points to their medoids).
ii) If the total cost of the configuration increased in the previous step, undo the swap.
Step 3 is repeated until no improvement is achieved in the objective function.
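The swap search of PAM can be illustrated with the following sketch. It is a simplified reading of the procedure, under stated assumptions: the first k observations serve as initial medoids, a swap is kept only when the cost strictly decreases (so that the loop is guaranteed to terminate), and the names total_cost and pam are hypothetical.

```python
import math
from itertools import product


def total_cost(points, medoids):
    """Sum of distances from each point to its nearest medoid."""
    return sum(min(math.dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k):
    """Basic PAM swap search; medoids are stored as indices into points."""
    medoids = list(range(k))  # simplistic initialization (assumption)
    cost = total_cost(points, medoids)
    improved = True
    while improved:
        improved = False
        for slot, o in product(range(k), range(len(points))):
            if o in medoids:
                continue
            candidate = medoids.copy()
            candidate[slot] = o  # swap one medoid with a non-medoid
            new_cost = total_cost(points, candidate)
            if new_cost < cost:  # otherwise the swap is undone
                medoids, cost = candidate, new_cost
                improved = True
    return sorted(medoids), cost


# Toy data (hypothetical): two groups whose most central points are indices 0 and 3.
data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
medoids, cost = pam(data, k=2)
```

Because every candidate swap requires a full recomputation of the cost, this naive form is expensive on large datasets, which is the motivation for sampling-based variants such as CLARA and CLARANS.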
Clustering tasks with popular algorithms such as k-means and k-medoids are essential in data analytics and are considered unsupervised tasks. However, they require the specification of the number of clusters to be formed a priori, and said value directly affects the result. This can become a problem when dealing with large volumes of data, which is often the case due to the very purpose of these techniques.
There are several approaches in the literature for identifying the optimal number of clusters, but there is still much room for improvement. The classical approach to estimating this k value involves performing a brute-force search. The first step is to establish a range of variation of k, i.e., all the values considered for k. Then, the data is clustered for each value within the range. Finally, the accuracy of the result is evaluated by using a clustering validation index, inferring the final value of k according to the score evaluation.
The main disadvantage of this technique lies in its computational cost, as the algorithm has to be run once for every value considered in the range of k. This range should not be too small, as many options would be left unexplored, nor too large, as this may imply high execution times.
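The brute-force search described above can be sketched as follows, using the Silhouette score as the validation index. Everything here is an assumption made for illustration: the helper names, the deterministic farthest-point initialization (chosen only so that the example is reproducible) and the toy dataset with three groups.

```python
import math


def farthest_point_init(points, k):
    """Deterministic initialization: repeatedly add the point farthest from all centers."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, max_iter=100):
    """Lloyd's iteration; returns the cluster label of each point."""
    centers = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new == labels:
            break
        labels = new
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels


def silhouette(points, labels):
    """Mean Silhouette score; points in singleton clusters score 0 by convention."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1 or len(clusters) == 1:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)


def select_k(points, k_range):
    """Cluster once per candidate k and keep the k with the best index value."""
    scores = {k: silhouette(points, kmeans(points, k)) for k in k_range}
    return max(scores, key=scores.get), scores


# Toy data (hypothetical): three compact groups of four points each.
data = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 0), (10, 1), (11, 0), (11, 1),
        (5, 9), (5, 10), (6, 9), (6, 10)]
best_k, scores = select_k(data, range(2, 6))
```

The cost issue is visible in select_k: the clustering routine runs once per candidate value, so widening the range increases the runtime proportionally.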
Also, the commonly used indices to evaluate each iteration show weaknesses and are sometimes not sophisticated enough to achieve good results on complex datasets. Yuan and Yang (2019) list some of the most widespread classical validation indices used for estimating this parameter. The main ones currently in use are the Silhouette Score (S), Elbow Method, Gap Statistic, and Calinski-Harabasz (CH), according to the authors.
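For reference, two of these indices have compact standard definitions, given here in their usual textbook form; the symbols are not taken from the reviewed papers. For an observation i, a(i) is the mean distance to the other members of its cluster and b(i) is the smallest mean distance to the members of any other cluster; n is the number of observations, and B_k and W_k are the between-cluster and within-cluster dispersion matrices for a partition into k clusters.

```latex
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad
S = \frac{1}{n} \sum_{i=1}^{n} s(i)

\mathrm{CH}(k) = \frac{\operatorname{tr}(B_k) / (k - 1)}{\operatorname{tr}(W_k) / (n - k)}
```

In both cases larger values indicate more compact and better separated clusters, so the estimated k is the value in the explored range that maximizes the index.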
All of them offer acceptable results, but they fail to solve the problem satisfactorily with certain point distributions. Observations in a dataset will show both similarities and differences among them, generating multiple clusters. In ideal cases, the clusters will be clearly differentiated; however, there may be instances in which two or more clusters have fuzzy boundaries (overlapping), or in which a cluster encompasses different "subgroups" (hierarchy), as illustrated in Figure 1. When these situations occur, the classical indices tend to underestimate the number of clusters, resulting in a loss of information that may be relevant in future analyses. These shortcomings, together with the computational complexity of the brute-force approach, have encouraged researchers to design alternatives for this process, which is essential for an appropriate handling of data.
Figure 1. Cluster overlapping and hierarchy
This paper gathers more sophisticated methodologies proposed in recent years that attempt to address the weaknesses of the classical techniques. The following section compiles novel indices and strategies to estimate this parameter, and discussion and conclusions are given in Section 3.

Trends in K-value Estimation Methodologies
New developments in cluster validation indices as well as new methodologies designed to estimate the number of clusters are presented hereafter.

New Cluster Validation Indices
This section includes the most relevant studies focused exclusively on the improvement of classic validation indices or the design of new ones.
Some commonly used scores do not perform as desired when cluster "shapes" are not perfectly defined. The Silhouette score (Rousseeuw, 1987), for example, can be affected by irregular distances between clusters, while the Calinski-Harabasz index (Calinski & Harabasz, 1974) is highly sensitive to outliers. The proposal by Wang and Xu (2019) tries to solve these fluctuations by identifying peak points in the Silhouette and Calinski-Harabasz indices and combining them into a single metric called Peak Weight Index (PWI), which seeks to balance both indices by applying weights. The research shows promising results; however, it cannot be considered fully unsupervised, as peak boundaries need to be specified according to the distribution of the dataset.
Yang, Lee, Choi and Joo (2020) focus their research on the Gap statistic (Tibshirani, Walther & Hastie, 2001) as one of the most reliable indices in the estimation of k and describe the main weaknesses in its performance in order to improve it. The identified shortcomings are cluster overlapping, hierarchy within a cluster (i.e., clusters that may be formed by two or more smaller clusters, which classical methodologies are unable to identify), and high computational time. To overcome the first two aspects, the research proposes a new metric that evaluates the evolution of the Gap value with k, based on the premise that the Gap(k) function increases at constant or accelerating speed as k is incremented, up to the point where k reaches its optimal value. At that point, the growth of Gap(k) suddenly slows down. This deceleration of the Gap statistic (Dacc statistic) is calculated as follows:
(1)
However, statistical measurements sometimes fail to obtain proper results when data is not clearly separated or presents asymmetries. Aiming to overcome this, Rojas-Thomas, Santos and Mora (2017) propose an index adapted to real data patterns, based on the inner cohesion of clusters and the distance to others. The methodology divides clusters into sub-clusters based on Principal Component Analysis (PCA), and the minimum spanning tree is obtained from the resulting centroids. Here, the concept of cohesion is introduced. To measure the cohesion between two adjacent sub-clusters in the spanning tree, the arc that connects them is evaluated: the greater the dispersion of data in the center of this arc, the lower the cohesion between these sub-clusters is (and vice versa). Finally, the cluster validation index is constructed by combining the distances between all the clusters being evaluated. To assess the distance between two clusters, the closest pair of sub-centroids (one from each cluster) is searched according to Euclidean distance, and the distance between them is calculated by adding the costs of the spanning tree branches joining them. In terms of scalability, the experimental results show that, as the number of clusters increases, the index's performance level decreases.
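The idea of measuring the distance between two sub-centroids along the branches of the spanning tree, instead of in a straight line, can be illustrated with the sketch below. It covers only that single step and is not the authors' index: the use of Prim's algorithm, the function names and the example coordinates are assumptions, and the branch cost is taken to be the plain Euclidean length, without the cohesion weighting described in the text.

```python
import math


def minimum_spanning_tree(nodes):
    """Prim's algorithm on the complete Euclidean graph; returns adjacency lists."""
    adj = {i: [] for i in range(len(nodes))}
    in_tree = {0}
    while len(in_tree) < len(nodes):
        # Cheapest edge leaving the current tree.
        u, v = min(
            ((i, j) for i in in_tree for j in range(len(nodes)) if j not in in_tree),
            key=lambda e: math.dist(nodes[e[0]], nodes[e[1]]),
        )
        adj[u].append(v)
        adj[v].append(u)
        in_tree.add(v)
    return adj


def tree_path_length(nodes, adj, start, goal):
    """Sum of branch lengths on the unique tree path from start to goal."""
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, dist = stack.pop()
        if node == goal:
            return dist
        for nxt in adj[node]:
            if nxt != parent:
                stack.append((nxt, node, dist + math.dist(nodes[node], nodes[nxt])))
    raise ValueError("goal is not connected to start")


# Hypothetical sub-centroids laid out along an L-shaped path.
sub_centroids = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
tree = minimum_spanning_tree(sub_centroids)
along_tree = tree_path_length(sub_centroids, tree, 0, 3)
straight = math.dist(sub_centroids[0], sub_centroids[3])
```

In this example the path along the tree is longer than the straight-line distance, which is what lets a tree-based measure follow the shape of the data rather than cut across empty regions.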
Also based on distance concepts, Abdalameer, Alswaitti, Alsudani and Isa (2022) present a novel index, based in this case on Euclidean distances. Two features are considered when evaluating clustering accuracy in this index: the distance between each point within a cluster and its centroid, namely Distance Within Cluster (DWC), and the Distance Between Centroids (DBC). Thus, good clustering will minimize DWC while maximizing DBC. Applying this concept, the authors design a new metric, namely Validity Clustering Index based on Mean of clustered Data (VCIM), that aims to achieve more accurate and computationally cheaper estimations of k, obtained as:
(2)
where DBC_total and DWC_total represent the overall DBC and DWC for the dataset, respectively, at each iteration. The use of this metric is therefore limited to clustering algorithms that use Euclidean distances, but in these cases the results obtained are satisfactory, and better than those obtained with classical metrics in terms of accuracy.
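The two ingredients of this index can be computed as in the sketch below. It only illustrates the DWC and DBC quantities as described in the text and does not reproduce the VCIM formula itself; aggregating by plain sums, as well as the function names and the toy partitions, are assumptions.

```python
import math
from itertools import combinations


def centroid(members):
    """Coordinate-wise mean of a list of points."""
    return tuple(sum(c) / len(members) for c in zip(*members))


def dwc_dbc(clusters):
    """clusters: list of lists of points. Returns (total DWC, total DBC)."""
    centroids = [centroid(m) for m in clusters]
    # DWC: Euclidean distance from every point to the centroid of its cluster.
    dwc = sum(math.dist(p, c) for m, c in zip(clusters, centroids) for p in m)
    # DBC: Euclidean distance between every pair of centroids.
    dbc = sum(math.dist(c1, c2) for c1, c2 in combinations(centroids, 2))
    return dwc, dbc


# Two hypothetical partitions of the same four points.
good = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
bad = [[(0, 0), (10, 0)], [(0, 2), (10, 2)]]
good_dwc, good_dbc = dwc_dbc(good)
bad_dwc, bad_dbc = dwc_dbc(bad)
```

The partition that keeps nearby points together has the smaller DWC and the larger DBC, which is the behaviour the index rewards.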
Xie, Lawniczak and Gan (2022) try to solve the k-value underestimation of classic algorithms with an effective modification of the standard iterative method that uses the Gap statistic. As previously discussed, the Gap statistic stands out among the classic indices for being the most sophisticated; however, there are two reasons that lead this technique to estimate low k values. These are the standard deviations of the data to be clustered and the local fluctuations of this statistic, which have a direct impact on the evolution graph and, therefore, on the estimation of k. To address this, the study proposes to smooth the curves of this graph. In order to do so, the authors propose benefiting from the power-law relationship between the Gap value and k, so that the derivative of the smooth function can be used to approximate the differential of the Gap statistic. This smooth curve allows overcoming the aforementioned fluctuations, thus avoiding the underestimation of the number of clusters by this statistic.
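For context, the standard Gap statistic of Tibshirani et al. (2001), on which the proposals above build, can be written as follows. The notation is the usual one and is not taken from the reviewed papers: C_r is the r-th cluster with n_r members, d_{ii'} is the distance between observations i and i', E*_n denotes the expectation under a reference distribution without cluster structure, and s_{k+1} is the simulation standard error of the reference term.

```latex
\mathrm{Gap}_n(k) = E_n^{*}\!\left[\log W_k\right] - \log W_k, \qquad
W_k = \sum_{r=1}^{k} \frac{1}{2 n_r} \sum_{i, i' \in C_r} d_{ii'}

\hat{k} = \min \left\{ k : \mathrm{Gap}_n(k) \ge \mathrm{Gap}_n(k+1) - s_{k+1} \right\}
```

Because the selection rule picks the smallest k that satisfies the inequality, a large s_{k+1} or a local dip in the Gap curve can stop the search early, which is consistent with the two sources of underestimation described above.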
Finally, a validity index based on the point pairs with fewer shared nearest neighbors (ANCV) is proposed by Duan, Ma, Zhou, Huang and Wang (2023), following the mentioned approaches that evaluate compactness within clusters and separation between clusters.
To calculate this index, an initial search is carried out to identify small, loose clusters within actual clusters, and their compactness is obtained as an indicator for the entire cluster. Subsequently, the average distance between pairs of data points at the intersection of two clusters is used as the between-cluster separation, making the index performance less influenced by the cluster shape. These measurements are obtained using equations (3) and (4):
(3)
(4)
where K is the number of clusters formed, com(c_i) is the within-cluster compactness for cluster i, and sep(c_m, c_n) is the average distance between all pairs of between-cluster augmented non-shared nearest neighbors. Both compactness and separation are combined in the final index ANCV.
(5)
Experiments show that it outperforms the classic indices; however, the quality of this index may degrade when the clustering results are incorrect for the actual number of clusters, and thus the authors consider improving it to achieve optimal results in all clustering situations.

New K-value Selection Methodologies
Classical methods are computationally expensive, mainly because the clustering algorithms are run iteratively for the whole range of possible k values. Therefore, authors have shown interest in creating new alternatives to these methodologies that still offer competitive results.
The computational complexity problem is compounded when dealing with big data. Alibuhtto and Mahat (2020) present a local search algorithm to find a local optimum based on distances between centroids in big data. The main point of this technique is to establish a stop criterion. The research proposes an estimation of a threshold value based on Euclidean distance so that the clustering algorithm is run until an acceptable value is found. This technique avoids evaluating the clustering over the whole range of possible k values, thus streamlining the handling of large volumes of data.
Also trying to solve this problem, Ri, Kang, Kim, Choe and Han (2022) propose the Ratio of Variance to Range (RVR) and Dispersion-Width Ratio (DWR) separation measurement metrics as key to identifying different populations in a dataset. The authors performed Monte Carlo simulations to study the behavior of DWR, initially on samples from a single population (cluster) and subsequently adding different populations. The analysis of the evolution of the DWR value revealed that for each new cluster it is possible to observe a boundary in the graph, which allows the estimation of k. The results are promising, as this technique is able to reduce the runtime considerably, but it still requires improvements to achieve more reliable results on some types of data, such as heterogeneously distributed, sparse, and abnormal data.
Lastly, trying to address the inefficiency in execution times in conjunction with the accuracy of the results, Patil and Baidari (2019) propose to estimate k based on "depth difference" (DeD), following a similar approach to that presented by Abdalameer et al. (2022) in the previous section. Data depth measures a median in a multivariate dataset, which is considered the deepest point in the given dataset. This metric assigns values from 0 to 1 to each point in the dataset according to their centrality. Then, the aforementioned Distance Between Clusters and Distance Within Cluster are obtained and averaged, and finally, DeD is calculated as the difference between them.
Unlike classical methods, DeD does not employ any clustering algorithm for partitioning data, but rather iterates on the function itself, achieving significantly lower run times and more accurate results than those obtained with indices such as Calinski and Harabasz (1974), Krzanowski and Lai (1988), Silhouette, and Gap.

Discussion and Conclusions
The increasing availability of data due to the growing presence of technologies in daily tasks enables its use and exploitation for several multidisciplinary purposes. However, the handling of large data volumes is challenging, and clustering techniques are usually required in order to group data according to certain characteristics. This article reviews trends and new developments in unsupervised methodologies for estimating the optimal number of clusters in a dataset. There are widespread simple methods that present acceptable results in some cases, but they show numerous shortcomings.
Table 1 summarizes all approaches reviewed in this article, including the identified problems and each technique's limitations. Note that (I) and (M) correspond to new Index and Methodology, respectively.
Six novel indices and three methodologies are discussed in this paper, each of which brings the aforementioned shortcomings to light. These novel improvements get closer to an optimal solution to this problem, offering new approaches that mainly speed up the execution time and/or offer more accurate results, overcoming obstacles such as the underestimation of clusters, hierarchy within clusters, detection of small clusters or fuzzy shapes, or even classical issues in real datasets such as the presence of outliers.
The findings of this review highlight the absence of a universally applicable approach to the aforementioned challenges: after a global comparison, it is observed that solving the identified shortcomings requires a trade-off in accuracy, running time, or applicability. The latest advances in this field considerably facilitate and enhance these estimations, and thus, although there is no global solution, data scientists can refer to the table to identify the approach that best aligns with their needs based on their dataset's characteristics.
Moreover, the table shows that there are still some aspects that can be improved and limitations that suggest that further advancements and refinements are still possible, underscoring the value and potential of this field of study.

Table 1. Comparative table of new methodologies and indices to estimate k value