ALTERNATIVE TERMINATION CRITERION FOR K-SPECIFIED CRISP DATA CLUSTERING ALGORITHMS

This paper analyzes the performance of a termination criterion for a pre-clustering algorithm built on k-specified crisp data partitioning (namely k-means). The results are evaluated using clustering validity indices. The termination criterion allows data with any number of clusters to be analyzed. Moreover, in contrast to the known validity indices, the introduced criterion makes it possible to analyze data that make up a single cluster.


Introduction and related work
Clustering refers to the process of partitioning a data set of objects into groups (clusters) so that the objects within a particular cluster are highly similar to each other but very dissimilar to objects in other clusters. Clustering methods have been classified into four types [15]: partitioning clustering, hierarchical clustering, density-based clustering and grid-based clustering. In addition, based on the relationship of each object to the clusters, we can distinguish crisp from fuzzy clustering.
The most fundamental version of cluster analysis is partitioning, which organizes the objects of a data set into several groups that are either mutually exclusive (no point in the data set belongs to more than one cluster) or jointly exhaustive (every point belongs to some cluster and is associated with other objects according to membership levels). This approach usually requires some background knowledge, namely an input parameter (the number of clusters), as the starting point of the partitioning process. For some partitioning algorithms (k-means, k-medoids, etc.) the user-defined initial parameter k (the number of clusters) simultaneously serves as the stopping criterion of the clustering.
Stopping criteria for optimal clustering have been a topic of discussion over the last decades, which has led to an increase in research aimed at confirming their usefulness [5,8]. For partitioning clustering methods, the stopping criteria are based on a predefined threshold or termination criterion, such as the number of iterations, the number of clusters, etc.
In order to quantify clustering optimality, a procedure for evaluating the results of a clustering algorithm (cluster validity) is used. In the case of partitioning clustering, the only way to avoid the strong influence of the user on the clustering result is to use a pre-processing step (pre-clustering) or a post-processing step (result validation). As a consequence, the resulting clustering configuration is obtained without a priori understanding of the internal structure of the data, but on the other hand it requires some estimate of its validity.
The distinctive feature of clustering is that it finds structure in the investigated data, but its disadvantage is that it may introduce additional, redundant structure into these data. Clustering finds structures even in data that have none a priori (overclustering), which leads to artifacts, that is, erratic results of cluster finding. In this case pre-clustering is used to find the "best" number of clusters. The best-known pre-clustering algorithm is the canopy clustering algorithm [10]. The aim of this algorithm is to find the approximate number of clusters, which serves as input information for further clustering algorithms. The disadvantage of this algorithm is the heuristic definition of two thresholds (distances T1 and T2). The only logical way to obtain valid results while eliminating the user's influence on the clustering results is to use clustering validity indices.
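To make the role of the two canopy thresholds concrete, the following is a minimal Python sketch of canopy clustering in its commonly described form; it is not the implementation of [10]. The function name `canopy`, the choice of the first remaining point as the next center and the Euclidean distance are assumptions made for illustration; `t1` is the loose threshold and `t2` the tight one.

```python
import math

def canopy(points, t1, t2):
    """Return a list of (center, members) canopies; assumes t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 (loose threshold) must exceed t2 (tight threshold)")
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)        # assumption: take the first remaining point
        members = [center]
        still_free = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                members.append(p)        # within the loose threshold: joins this canopy
            if d >= t2:
                still_free.append(p)     # outside the tight threshold: stays available
        remaining = still_free
        canopies.append((center, members))
    return canopies

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(len(canopy(points, t1=1.0, t2=0.5)))
```

The number of canopies returned depends directly on the chosen `t1` and `t2`, which illustrates the heuristic dependence noted above.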
In [1,16], three approaches to the investigation of cluster validity are described. The first one is based on external criteria, which consist in comparing the results of cluster analysis to externally known results, such as externally provided class labels. The second approach is based on internal criteria and serves to estimate the goodness of clustering results without reference to external information. The third approach (relative criteria) is based on the estimation of the clustering structure by comparing different input parameter values for the same algorithm, e.g., the number of clusters. Most of the validity indices require sequential substitution of the input parameter and are based on finding the "best" index value. Different indices give different results in different situations. This paper, however, focuses on a mixed sample of validity indices for k-specified data partitioning clustering, which are used for comparison with the termination criterion proposed for k-specified data partitioning clustering algorithms. The termination criterion makes it possible to perform partitioning up to a certain step for the optimal determination of the number of clusters and to keep an important balance between underclustering and overclustering.

Termination criterion for the pre-clustering algorithm
The pre-clustering algorithm, as opposed to other existing algorithms, does not require input parameters or threshold values for the correct determination of the number of clusters. Pre-clustering is the procedure of checking whether the input data can be clustered. The main part of the published pre-clustering algorithm [12], the decision rule, determines whether the input data set contains one or two clusters. The decision rule has been implemented in the termination criterion [11] for the determination of any number of clusters.
For the pre-clustering algorithm and its termination criterion we use the following notation: n is the number of objects, p is the number of attributes, k is the number of clusters, $X = \{x_i,\ i = 1, 2, \ldots, n\}$ stands for the data set containing n objects in a p-dimensional space, $K_q$ is the q-th cluster, where q = 1, 2, ..., k, and $x_i^{(q)}$ is the i-th object of cluster $K_q$. The advantage of the pre-clustering algorithm is that it does not require setting the initial parameter (the number of clusters). The pre-clustering algorithm is based on the application of crisp partitioning algorithms, in this case k-means. However, the k-means algorithm can be replaced by any other crisp partitioning algorithm. It should also be noted that the input parameter for partitioning (k, the number of clusters) is not set by the user; at every step of partitioning it is set automatically to the default value k = 2.
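The control flow described above (repeated partitioning with k fixed at 2, each split accepted or rejected by a decision rule) can be sketched in Python as follows. This is only an illustration under stated assumptions: the published decision rule [12] and termination criterion [11] are not reproduced here, so `placeholder_rule` (a within-cluster sum-of-squares ratio with an arbitrary 0.5 cutoff), the `min_size` guard, the deterministic seeding in `two_means` and all function names are hypothetical stand-ins. Unlike the published rule, the placeholder does rely on a threshold.

```python
import math

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def sse(points):
    """Sum of squared Euclidean distances to the centroid."""
    c = centroid(points)
    return sum(math.dist(p, c) ** 2 for p in points)

def two_means(points, iters=20):
    """Crisp k-means with k fixed at 2; seeds: first point and the point farthest from it."""
    centers = [points[0], max(points, key=lambda p: math.dist(p, points[0]))]
    groups = [list(points), []]
    for _ in range(iters):
        groups = [[], []]
        for p in points:
            nearer_first = math.dist(p, centers[0]) <= math.dist(p, centers[1])
            groups[0 if nearer_first else 1].append(p)
        if not groups[0] or not groups[1]:
            break                      # degenerate split, e.g. all points identical
        new_centers = [centroid(g) for g in groups]
        if new_centers == centers:
            break                      # assignments are stable
        centers = new_centers
    return groups

def placeholder_rule(parent, groups, max_ratio=0.5):
    """HYPOTHETICAL decision rule (not the rule of [12]): accept the split only
    if it removes more than half of the within-cluster sum of squares."""
    before = sse(parent)
    if before == 0:
        return False
    return sum(sse(g) for g in groups) / before < max_ratio

def count_clusters(points, rule=placeholder_rule, min_size=6):
    """Bisect recursively with k = 2 until the rule rejects every further split."""
    if len(points) < min_size:
        return 1
    groups = two_means(points)
    if not groups[0] or not groups[1] or not rule(points, groups):
        return 1
    return sum(count_clusters(g, rule, min_size) for g in groups)
```

Because the recursion starts by asking whether the whole data set should be split at all, a sketch of this shape can return a single cluster, which is the case that the validity indices discussed in this paper cannot handle.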

Numerical results of pre-clustering algorithm validation
In this section the characteristics of the data sets are described. Thereafter, the validation of the termination criterion of the pre-clustering algorithm is presented.
Artificial #1: two-attribute data set containing 100 objects with Gaussian distribution, where all data objects make up one globular group.
Artificial #2: data set similar to the previous one, but distinguished by the presence of three well-separated groups at equal distance from each other.
Artificial #3: data set based on a longitudinal distribution of objects in an elongated group.
Iris: the well-known four-attribute data set, whose attributes are the length and the width of the sepals and petals of iris flowers.
Artificial #4: artificial two-attribute data set containing 100 objects generated with a normal distribution and forming three well-separated globular clusters.
Artificial #5: artificial two-attribute data set containing 500 objects in the form of three concentric ring clusters. The three classes are labeled "core", "first ring" and "second ring", respectively.
Tested data sets are shown in Figure 1.
In this paper, internal validity measures [3] (the Davies-Bouldin index, the Dunn index, the index called the "silhouette statistic", the average within-cluster distance and cluster density) are used to validate the termination criterion.
The Dunn index [4] is defined as the ratio of the minimal intercluster distance to the maximal intracluster distance. The Dunn index takes values in the interval [0, ∞) and should be maximized. Rousseeuw [14] introduced the Silhouette index. The maximum value of this index is used to determine the optimal number of clusters in the data. The Silhouette index is not defined for k = 1 (only one cluster).
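As an illustration, the Dunn index can be computed directly in its usual form: the smallest distance between points of different clusters divided by the largest distance between points of the same cluster. The Python sketch below assumes Euclidean distance, at least two clusters and at least one cluster with two distinct points; the function name `dunn_index` is chosen here for illustration.

```python
import math
from itertools import combinations

def dunn_index(clusters):
    """clusters: list of clusters, each a list of coordinate tuples."""
    # Smallest distance between two points belonging to different clusters.
    min_inter = min(
        math.dist(p, q)
        for a, b in combinations(clusters, 2)
        for p in a
        for q in b
    )
    # Largest distance between two points of the same cluster (cluster diameter).
    max_intra = max(
        math.dist(p, q)
        for c in clusters
        for p, q in combinations(c, 2)
    )
    return min_inter / max_intra

compact = [[(0, 0), (1, 0)], [(4, 0), (5, 0)]]
print(dunn_index(compact))
```

With a single cluster there is no pair of different clusters, so the numerator is undefined; like the Silhouette index, this form of the Dunn index cannot be evaluated for k = 1.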
The average within-cluster distance [13] is calculated by averaging the distance between the centroid and all examples of a cluster. As clusters get more compact, this measure decreases. Of course, as the number of clusters increases, the average distance will decrease naturally anyway, and so this measure can be difficult to interpret.
The cluster density measure [7] considers each cluster in turn, finds the average of the distances between all pairs of points in the cluster and multiplies it by the number of points in the cluster. This results in a measure that is equivalent to a distance per point within the cluster and which is, therefore, similar to a density. This measure tends to zero as the number of clusters increases, but smaller values indicate more compact clusters.
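The two measures just described can be sketched in a few lines of Python. The sketch follows the verbal definitions given here and is not taken from RapidMiner: Euclidean distance, averaging the centroid distance over all points of all clusters, and returning 0 for a one-point cluster are assumptions, as are the function names.

```python
import math
from itertools import combinations

def avg_within_cluster_distance(clusters):
    """Mean distance between each point and the centroid of its own cluster."""
    total, n = 0.0, 0
    for c in clusters:
        center = tuple(sum(x) / len(c) for x in zip(*c))
        total += sum(math.dist(p, center) for p in c)
        n += len(c)
    return total / n

def cluster_density(cluster):
    """Average pairwise distance in one cluster, multiplied by its size."""
    pairs = list(combinations(cluster, 2))
    if not pairs:
        return 0.0                      # assumption: a single point has density 0
    mean_pair = sum(math.dist(p, q) for p, q in pairs) / len(pairs)
    return mean_pair * len(cluster)

one = [[(0, 0), (2, 0), (10, 0), (12, 0)]]
two = [[(0, 0), (2, 0)], [(10, 0), (12, 0)]]
print(avg_within_cluster_distance(one), avg_within_cluster_distance(two))
print(cluster_density(one[0]), [cluster_density(c) for c in two])
```

Splitting the same four points into two clusters lowers both measures, which reflects the remark above that they shrink as k grows and are therefore hard to interpret on their own.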
The Gini index [6] for measuring class inequality is also used as a validation index. A Gini coefficient of 1 (or 100%) expresses maximal inequality among values.
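For reference, the Gini coefficient can be computed from the mean absolute difference between all pairs of values. The Python sketch below applies it to cluster sizes; treating cluster sizes as the values being compared is an assumption made for illustration, since the exact quantity is not specified here, and the function name `gini` is likewise hypothetical.

```python
def gini(values):
    """Gini coefficient: sum of |a - b| over all ordered pairs / (2 * n * sum)."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2 * n * total)

print(gini([50, 50, 50]))    # equal cluster sizes
print(gini([0, 0, 0, 4]))    # all objects in one of four clusters
```

For a finite number n of values the maximum of this formula is (n - 1)/n, so the value 1 is approached only as n grows.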
Simulations were carried out using the RapidMiner software. The scheme of the validation process of the pre-clustering algorithm is shown in Figure 2. The Data Set operators generate the artificial data, whilst the Iris data set is read from the RapidMiner samples repository. Connecting the data to the input of the Loop operator is straightforward. The Loop Parameters operator produces multiple partitionings, from k = 1 up to a maximum number of clusters defined by the user (k = 6). The measure type is set to numerical Euclidean distance. The Log operator is a very important part of RapidMiner, as it allows data to be recorded during execution. The values returned in the log are converted to real values, where necessary, to make later analysis easier.

Fig. 2. Scheme of the validation of the pre-clustering algorithm
The results of the validation process for the Iris data set using the pre-clustering algorithm based on the crisp k-means algorithm are shown in Figure 3.
The graph presented in Figure 3 shows how the internal validity measures vary as different clusterings are compared. All of the validity measures together indicate that k = 2 is a strong candidate for the best clustering. This is encouraging since, in this case, the Dunn index was not given the correct result in advance.

Fig. 3. Internal validity measures as a function of k for Iris data set. The x axis is the value of k and the "best" number of clusters is estimated using the elbow method. This graph was produced using the Series Multiple plotter and consequently, the y axes are normalized to make the ranges of each series match
The idea of the elbow method [9] is to choose the k at which the value of a validity index decreases or increases abruptly. This produces an "elbow effect" in the graph. The number of clusters is chosen at this point, hence the name "elbow criterion". The elbow method is a heuristic and, as such, it may or may not work well in a particular case. Sometimes there is more than one elbow, or no elbow at all. In those situations the user usually ends up determining the best k by evaluating how well the partitioning algorithm performs the clustering. Validation results can also be displayed in numerical form (see Table 1), where the best index values, and accordingly the candidate numbers of clusters, are marked in red. The value $k_p$ denotes the number of clusters found by the pre-clustering algorithm with the termination criterion. Due to the limitation on the article size, additional metrics (accuracy, classification error, f-measure) as well as the results of the pre-clustering algorithm based on other crisp partitioning algorithms (k-medoids, kernel k-means, etc.) cannot be presented.
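One simple way to automate the elbow heuristic is to pick the k at which the curve bends most sharply, measured by the absolute second difference of the index values. The Python sketch below is one possible formalization and not the procedure of [9]; the function name `elbow` and the sample values are made up for illustration.

```python
def elbow(ks, values):
    """Return the k whose index value shows the sharpest bend (largest |second difference|)."""
    best_k, best_bend = None, -1.0
    for i in range(1, len(values) - 1):
        bend = abs(values[i - 1] - 2 * values[i] + values[i + 1])
        if bend > best_bend:
            best_k, best_bend = ks[i], bend
    return best_k

ks = [1, 2, 3, 4, 5, 6]
avg_distance = [100.0, 40.0, 35.0, 32.0, 30.0, 29.0]   # hypothetical index values
print(elbow(ks, avg_distance))
```

Since a bend needs a neighbor on each side, the first and last values of k can never be selected, so an approach of this kind cannot report k = 1.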
The external validity measures, which compare previously known clusters with the clusters produced by the clustering algorithm, are also not presented. In this paper the data are presented in visual form in two- or three-dimensional space; however, external validity measures (the Rand, Jaccard, Fowlkes-Mallows and adjusted Rand indices) could be used to compare the results against the known clusters as ground truth.
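As an example of such an external measure, the Rand index counts the fraction of object pairs on which the known labeling and the clustering agree (both place the pair together, or both place it apart). A minimal Python sketch follows; the function name `rand_index` and the toy label lists are illustrative only and do not come from the experiments reported here.

```python
from itertools import combinations

def rand_index(labels_true, labels_pred):
    """Fraction of object pairs treated the same way by both labelings."""
    agree = 0
    total = 0
    for i, j in combinations(range(len(labels_true)), 2):
        same_true = labels_true[i] == labels_true[j]
        same_pred = labels_pred[i] == labels_pred[j]
        if same_true == same_pred:
            agree += 1
        total += 1
    return agree / total

print(rand_index([0, 0, 1, 1], [1, 1, 0, 0]))   # same partition, labels renamed
print(rand_index([0, 0, 1, 1], [0, 1, 1, 1]))
```

The index depends only on which objects are grouped together, not on the label names, so a clustering that matches the known classes up to renaming receives the maximum score.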

Conclusion
Briefly summarizing, the pre-clustering algorithm with the termination criterion is a good alternative to the well-known clustering validity indices. Its considerable advantage is the ability to analyze data that make up one cluster. The pre-clustering algorithm also has disadvantages. One of them is the dependence of its parameters on the calculated distances. When objects are significantly scattered, anomalies or isolated clusters may occur, which makes it difficult to obtain adequate results, as can be seen in Table 1 for the Artificial #5 data set.

Fig. 1. Artificial (a, b, c, e, f) two-attribute data sets and (d) the real-life Iris data set, which contains 150 objects and three classes of iris

The pre-clustering algorithm with the termination criterion, where partitioning is based on crisp k-means clustering.
Input: X: a data set containing n objects with p attributes.
Output: Number of clusters k in the form of an acyclic connected graph.

Table 1. The "best" number of clusters is determined from the validity index values labeled in red