Abstract

The k-means algorithm is sensitive to outliers. In this paper, we propose a robust two-stage k-means clustering algorithm based on the observation point mechanism, which can accurately discover the cluster centers without the disturbance of outliers. In the first stage, a small subset of the original data set is selected based on a set of nondegenerate observation points. The subset is a good representation of the original data set because it contains only the higher-density points of the original data set and excludes the outliers. In the second stage, we apply the k-means clustering algorithm to the selected subset and take the resulting cluster centers as the true cluster centers of the original data set. Based on these centers, the remaining data points of the original data set are assigned to the clusters whose centers are closest to them. The theoretical analysis and experimental results show that the proposed clustering algorithm has lower computational complexity and better robustness than the k-means clustering algorithm, thus demonstrating its feasibility and effectiveness.

1. Introduction

Clustering is an important research branch of data mining. The k-means algorithm is one of the most popular clustering methods [1]. When performing k-means clustering, we usually use a local search to find the solution [2, 3], i.e., selecting $k$ points as the initial cluster centers $c_1, c_2, \ldots, c_k$ and then optimizing them by an iterative process to minimize the following objective function (see, for example, [4, 5]):

\[
J = \sum_{i=1}^{k} \sum_{j=1}^{n_i} \bigl\| X_j^{(i)} - c_i \bigr\|^2, \qquad (1)
\]

where $X_j^{(i)}$ is the $j$-th data point belonging to the $i$-th cluster $C_i$ and $n_i$ is the number of points in $C_i$. It is well known that the solution of equation (1) is affected by the initial values of $c_1, c_2, \ldots, c_k$.
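To make the local search concrete, the following minimal sketch (our own illustrative Python with NumPy, not the authors' implementation) shows the standard Lloyd-type iteration that alternately assigns points to their nearest center and recomputes each center as the mean of its cluster, which is the usual way of minimizing equation (1).

```python
import numpy as np

def lloyd_kmeans(X, init_centers, max_iter=100, tol=1e-6):
    """Minimize the k-means objective by alternating assignment and update steps."""
    centers = init_centers.copy()
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest current center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center becomes the mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(len(centers))
        ])
        if np.linalg.norm(new_centers - centers) < tol:
            centers = new_centers
            break
        centers = new_centers
    return centers, labels
```

The result clearly depends on `init_centers`, which is exactly the sensitivity to initialization discussed above.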

In order to choose the initial cluster centers properly, the k-means++ algorithm [6] picks out a set of points as the initial centers whose distances to each other are as large as possible. However, this method for choosing the initial centers is sensitive to outliers [7–9]. Some methods use subsets of the original data set to determine the initial centers. For instance, the CLARA [10] and CLARANS [11] algorithms use PAM [12] to calculate the initial cluster centers from random subsets of the original data set. The sampling-based methods weaken the sensitivity because the sampling process can discard some outliers in the original data set, but it cannot guarantee that all outliers are ignored in the sampling process. Therefore, the remaining outliers in the subsets still affect the clustering results.
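To make this sensitivity concrete: in the standard k-means++ formulation, each new center is sampled with probability proportional to the squared distance to the nearest center already chosen, so a far-away outlier has a high chance of being selected. The sketch below is our own illustration of that seeding rule, not code taken from [6].

```python
import numpy as np

def kmeans_pp_seeding(X, k, rng=None):
    """Standard k-means++ seeding: D^2-weighted sampling of the initial centers."""
    rng = np.random.default_rng(rng)
    centers = [X[rng.integers(len(X))]]          # first center: uniform at random
    for _ in range(k - 1):
        d2 = np.min(
            np.linalg.norm(X[:, None, :] - np.array(centers)[None, :, :], axis=2) ** 2,
            axis=1,
        )
        # A distant outlier has a large d2, hence a high probability of being chosen.
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```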

Automatic clustering algorithms are attracting more and more attention from the academic community, e.g., the density-based spatial clustering of applications with noise (DBSCAN) algorithm [13–15], the depth difference-based clustering algorithm [16], and Tanir’s method [17]. Recently, a new automatic clustering algorithm named I-nice was proposed in [18]. Inspired by the observation point mechanism of the I-nice algorithm, we propose a two-stage k-means clustering algorithm in this paper to find the cluster centers from a subset of the original data set with all outliers removed. In the first stage, we select a small subset of the original data set based on a set of nondegenerate observation points. The subset contains only the higher-density points of the original data set and does not contain the outliers. Therefore, it is a good representation of the original data set for finding the proper cluster centers. In the second stage, we perform the k-means algorithm on the subset to obtain a set of cluster centers, and then the other points in the original data set can be clustered accordingly.

Selecting the subset in the first stage is based on a set of nondegenerate observation points assigned in the data space $\mathbb{R}^d$, where $d$ is the dimension of the data points. For each observation point, we compute the set of distances between it and all data points in the original data set. This set of distances forms a distance distribution with respect to the observation point. From the distance distribution, we identify the dense areas and extract the subset of data points lying in those dense areas. Then, we take the intersection of the subsets of data points extracted from the dense areas of all the distance distributions. After refining this intersection subset, we obtain a subset of the original data set without outliers. Therefore, it can be used to find the proper cluster centers. Finally, we conduct convincing experiments to validate the effectiveness of our proposed algorithm, and the experimental results demonstrate that it is robust to outliers.
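The first-stage idea can be sketched as follows (our own simplified Python illustration; the names, the bin width, and the dense-area test are placeholders for the precise procedure detailed in Section 3): compute one distance set per observation point, mark the points lying in a dense area of each distance set, and keep only the points that are dense with respect to every observation point.

```python
import numpy as np

def distance_sets(X, observation_points):
    """One generated distance set per observation point (rows: points, columns: observation points)."""
    return np.linalg.norm(X[:, None, :] - observation_points[None, :, :], axis=2)

def dense_mask(distances, bin_width):
    """Mark points whose distance falls into an above-average-density bin
    (a simplified stand-in for the grid-based selection of Section 3.1.2)."""
    bins = np.floor(distances / bin_width).astype(int)
    counts = np.bincount(bins)
    return counts[bins] > counts.mean()

def candidate_subset(X, observation_points, bin_width=0.05):
    """Keep only the points that are dense with respect to *every* observation point."""
    D = distance_sets(X, observation_points)
    keep = np.all([dense_mask(D[:, i], bin_width) for i in range(D.shape[1])], axis=0)
    return X[keep]
```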

The remainder of this paper is organized as follows. We describe the related mathematical principles of our algorithm in Section 2. The details of the two-stage k-means clustering algorithm and its pseudocode are presented in Section 3. In Section 4, we present a series of experiments to validate the feasibility of our proposed algorithm. Finally, we summarize the conclusions and future work in Section 5.

2. Mathematical Principles

Definition 1. Suppose $\mathbb{X} = \{X_1, X_2, \ldots, X_N\} \subset \mathbb{R}^d$ is a data set including $N$ data points with $d$ dimensions. Given an observation point $O \in \mathbb{R}^d$, we say $\mathbb{X}_O = \{d(X_i, O) \mid i = 1, 2, \ldots, N\}$ is a generated distance set of $\mathbb{X}$ with respect to the observation point $O$, where $d(X, Y)$ denotes the Euclidean distance between $X$ and $Y$.
Given a data set $\mathbb{X}$ and an observation point $O$, we have
\[
|d(X, O) - d(Y, O)| \le d(X, Y)
\]
by the triangle inequality for every $X, Y \in \mathbb{X}$. Hence, the distance between two data points in $\mathbb{X}$ is at least as large as the difference of their corresponding two distances in $\mathbb{X}_O$. Therefore, for any positive number $r$ and a point $X \in \mathbb{X}$, the number of points in $\mathbb{X}$ with distances to $X$ less than $r$ is not greater than the number of points in $\mathbb{X}_O$ whose distances to $d(X, O)$ are less than $r$. In particular, if $X$ is a proper cluster center in $\mathbb{X}$, then $d(X, O)$ will be a point in $\mathbb{X}_O$ which has many points close to it. That is to say, if $X$ is a dense point in $\mathbb{X}$, it also corresponds to a dense point in $\mathbb{X}_O$.
Unfortunately, the converse is not true. Because two elements of $\mathbb{X}_O$ that have a small difference may correspond to two points in $\mathbb{X}$ that have a large distance, a proper cluster center of $\mathbb{X}_O$ may not correspond to a proper cluster center of $\mathbb{X}$. Hence, we can deduce that $\mathbb{X}_O$ retains only partial clustering information of $\mathbb{X}$. In order to obtain more clustering information of $\mathbb{X}$, one possible way is to choose more observation points to generate more distance sets and then combine all those pieces of clustering information together. This is the main idea behind our new algorithm. We provide the following two theorems to guarantee the correctness of the abovementioned statements.

Definition 2. Given a set of points $\{O_0, O_1, \ldots, O_d\} \subset \mathbb{R}^d$, where $O_i = (o_{i1}, o_{i2}, \ldots, o_{id})$ for $i = 0, 1, \ldots, d$, define the generating matrix $A$ of $\{O_0, O_1, \ldots, O_d\}$ as
\[
A = \begin{pmatrix}
o_{11} - o_{01} & o_{12} - o_{02} & \cdots & o_{1d} - o_{0d} \\
o_{21} - o_{01} & o_{22} - o_{02} & \cdots & o_{2d} - o_{0d} \\
\vdots & \vdots & \ddots & \vdots \\
o_{d1} - o_{01} & o_{d2} - o_{02} & \cdots & o_{dd} - o_{0d}
\end{pmatrix}.
\]
If the determinant of $A$ is not equal to zero, we say $A$ is nondegenerate and $\{O_0, O_1, \ldots, O_d\}$ is a set of nondegenerate points.

Theorem 1. Suppose $\{O_0, O_1, \ldots, O_d\}$ is a set of nondegenerate points and let $X, Y \in \mathbb{R}^d$. If
\[
d(X, O_i) = d(Y, O_i)
\]
for all $i = 0, 1, \ldots, d$, then $X = Y$.

Proof. Assume $X = (x_1, x_2, \ldots, x_d)$ and $Y = (y_1, y_2, \ldots, y_d)$. For each $i = 0, 1, \ldots, d$, we have $d(X, O_i) = d(Y, O_i)$, i.e.,
\[
\sum_{j=1}^{d} (x_j - o_{ij})^2 = \sum_{j=1}^{d} (y_j - o_{ij})^2.
\]
Expanding both sides and subtracting the equation for $i = 0$ from the equation for each $i = 1, 2, \ldots, d$, we can obtain
\[
\sum_{j=1}^{d} (o_{ij} - o_{0j})(x_j - y_j) = 0, \qquad i = 1, 2, \ldots, d.
\]
Thus, we have $A(X - Y)^{T} = 0$. Since the coefficient matrix $A$ is nondegenerate, we can get that $X - Y = 0$. Thus, $X = Y$.

Remark 1. For the convenience of calculation, we can choose $\{O_0, O_1, \ldots, O_d\}$ with $O_0 = (0, 0, \ldots, 0)$ and $O_i = e_i$ (the $i$-th standard unit vector) as the set of nondegenerate observation points. In fact, if $X = (x_1, x_2, \ldots, x_d)$, then for each $i$, it has
\[
d(X, O_i)^2 = d(X, O_0)^2 - 2x_i + 1.
\]
Thus, if we have obtained the distance between $X$ and $O_0$, then computing the square of the distance between $X$ and $O_i$ reduces to three addition operations, which decreases the time complexity of generating the distance sets.
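Assuming the observation points of Remark 1 ($O_0$ the origin and $O_i = e_i$), the identity above lets all $d + 1$ distance sets be generated from a single pass over the data, as the following sketch (our own code, not from the paper) illustrates.

```python
import numpy as np

def generated_distance_sets(X):
    """Distances of each point to O_0 = 0 and O_i = e_i, using
    d(X, O_i)^2 = d(X, O_0)^2 - 2*x_i + 1 to avoid recomputing full norms."""
    sq_to_origin = np.sum(X ** 2, axis=1)                   # d(X, O_0)^2, shape (N,)
    sq_to_axes = sq_to_origin[:, None] - 2.0 * X + 1.0      # d(X, O_i)^2, shape (N, d)
    sq_all = np.column_stack([sq_to_origin, sq_to_axes])
    return np.sqrt(np.maximum(sq_all, 0.0))                 # clip tiny negatives from rounding
```

As a quick check, the last d columns agree with `np.linalg.norm(X[:, None, :] - np.eye(X.shape[1])[None, :, :], axis=2)` up to floating-point error.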

Remark 2. If the number of observation points is less than $d + 1$, Theorem 1 does not hold true. For example, in $\mathbb{R}^2$, if we choose only the two observation points $O_0 = (0, 0)$ and $O_1 = (1, 0)$ and take $X = (0, 1)$ and $Y = (0, -1)$, then it has
\[
d(X, O_i) = d(Y, O_i), \qquad i = 0, 1,
\]
but $X \neq Y$. In general, fewer than $d + 1$ observation points lie in a hyperplane, and two points that are mirror images with respect to that hyperplane cannot be distinguished by their generated distances. By Theorem 1, we can confirm that all different cluster centers can be distinguished by choosing a set of nondegenerate points as the observation points. Thus, $d + 1$ is the minimum number of observation points needed to distinguish all the cluster centers of the original data set.

Theorem 2. Suppose $X, Y \in \mathbb{R}^d$ and $\varepsilon > 0$. Let $\{O_0, O_1, \ldots, O_d\}$ be the observation points of Remark 1 and set
\[
a_i = d(X, O_i), \qquad b_i = d(Y, O_i)
\]
for $i = 0, 1, \ldots, d$, and suppose $a_i \le M$ and $b_i \le M$ for all $i$. If, for each $i$,
\[
|a_i - b_i| < \varepsilon,
\]
then
\[
d(X, Y) < 2M\sqrt{d}\,\varepsilon.
\]

Proof. Write $X = (x_1, x_2, \ldots, x_d)$ and $Y = (y_1, y_2, \ldots, y_d)$. For $i = 1, 2, \ldots, d$, we have
\[
a_i^2 = a_0^2 - 2x_i + 1, \qquad b_i^2 = b_0^2 - 2y_i + 1.
\]
Solving this system of equations results in
\[
x_i - y_i = \frac{(a_0^2 - b_0^2) - (a_i^2 - b_i^2)}{2}, \qquad i = 1, 2, \ldots, d.
\]
Then, we can obtain
\[
|x_i - y_i| \le \frac{|a_0 - b_0|(a_0 + b_0) + |a_i - b_i|(a_i + b_i)}{2} < 2M\varepsilon,
\]
which yields $d(X, Y) = \bigl(\sum_{i=1}^{d} (x_i - y_i)^2\bigr)^{1/2} < 2M\sqrt{d}\,\varepsilon$.

Remark 3. If we normalize the original data set $\mathbb{X}$, for example, by performing the min-max normalization on $\mathbb{X}$ so that every component lies in $[0, 1]$, then every generated distance is at most $\sqrt{d}$ (so we may take $M = \sqrt{d}$), and we can deduce that $d(X, Y) < 2d\varepsilon$.

Remark 4. Suppose $\mathbb{X}_O$ is a generated distance set of $\mathbb{X}$ with respect to the observation point $O$. We cannot confirm whether two elements in $\mathbb{X}_O$ that have a small difference correspond to two data points in $\mathbb{X}$ that also have a small distance. But by Theorem 2, if all the pairs of generated distances of $X$ and $Y$ have a small difference, then $X$ must have a small distance to $Y$. This can be used to adjust the density of the selected subset.

Remark 5. The observation point mechanism aims to transform the original multidimensional data points into one-dimensional distance values, which is different from the landmark point or representative point mechanisms. The landmark points [19] are the core of the landmark-based spectral clustering (LSC) algorithm, which generates some representative data points as landmarks and represents the remaining data points as linear combinations of these landmarks. The representative points [20] are a subset of the original data set used in the ultrascalable spectral clustering (U-SPEC) algorithm to alleviate the huge computational burden of spectral clustering. The observation points are designed to enhance the robustness of k-means clustering, whereas the landmark points and representative points are used to speed up spectral clustering.

3. The Proposed Two-Stage k-Means Clustering Algorithm

Given a data set $\mathbb{X}$ with $N$ objects, we want to partition $\mathbb{X}$ into $k$ clusters. The main idea of our two-stage k-means clustering algorithm is that we only need to deal with a small subset of $\mathbb{X}$ that has a clustering structure similar to that of $\mathbb{X}$. In order to select a proper subset with the abovementioned property, we need to discard all outliers in $\mathbb{X}$ and retain a portion of the points that are close to the cluster centers.

3.1. Description of Algorithm
3.1.1. Generating Distance Sets in the First Stage

First of all, we conduct the normalization operation on the original data set $\mathbb{X}$. Set $\mathbb{X} = \{X_1, X_2, \ldots, X_N\}$, where $X_i = (x_{i1}, x_{i2}, \ldots, x_{id})$ for $i = 1, 2, \ldots, N$. Suppose
\[
m_j = \min_{1 \le i \le N} x_{ij}, \qquad M_j = \max_{1 \le i \le N} x_{ij}, \qquad j = 1, 2, \ldots, d, \qquad L = \max_{1 \le j \le d} (M_j - m_j).
\]

Then, we transform $\mathbb{X}$ into $\mathbb{X}' = \{X_1', X_2', \ldots, X_N'\}$ with $X_i' = \bigl((x_{i1} - m_1)/L, \ldots, (x_{id} - m_d)/L\bigr)$ corresponding to $X_i$. Obviously, this transformation of $\mathbb{X}$ is a composition of a translation transformation and a dilation transformation. The dilation factor is the same for each dimension; hence, the dilation transformation does not change the cluster structure. Because the translation transformation also does not change the cluster structure of a data set, the cluster structure of $\mathbb{X}'$ is exactly the same as that of $\mathbb{X}$. We also note that the value of every component of $X_i'$ is in the interval $[0, 1]$.
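A minimal sketch of this normalization (our own code, assuming, as described above, a per-dimension translation by the minimum and a single dilation factor shared by all dimensions):

```python
import numpy as np

def normalize_common_dilation(X):
    """Translate each dimension by its minimum and divide by one common factor,
    so the cluster structure is preserved and all components fall in [0, 1]."""
    mins = X.min(axis=0)
    ranges = X.max(axis=0) - mins
    L = ranges.max()              # the same dilation factor for every dimension
    return (X - mins) / L
```

Using a single factor L (rather than a per-dimension factor) is what keeps the transformation a similarity, so clusters are only shifted and uniformly shrunk, never distorted.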

Let
\[
O_0 = (0, 0, \ldots, 0), \qquad O_i = e_i, \quad i = 1, 2, \ldots, d,
\]
be the set of observation points. Denote by $\mathbb{D}_i$ the generated distance set of $\mathbb{X}'$ with respect to the observation point $O_i$, and we get $d + 1$ sets $\mathbb{D}_0, \mathbb{D}_1, \ldots, \mathbb{D}_d$. For each data point $X' \in \mathbb{X}'$, we have actually mapped it to a $(d + 1)$-dimensional vector $\bigl(d(X', O_0), d(X', O_1), \ldots, d(X', O_d)\bigr)$. Theorem 1 shows that we can identify $X'$ by this $(d + 1)$-dimensional vector, and hence, it is reasonable to expect that the clustering structure of $\mathbb{X}'$ can be deduced from those distance sets.

3.1.2. Selecting a Representative Subset of $\mathbb{X}'$ in the First Stage

For each $\mathbb{D}_i$, we can get a set $\mathbb{A}_i$ consisting of all candidate higher-density points of $\mathbb{D}_i$ by using grid-based clustering methods (e.g., [21]). For example, first, we arrange $\mathbb{D}_i$ in ascending order. Second, a fixed value $h$ is selected to be a quantile of the first-order difference sequence of the sorted distances. Third, for each $s$ in $\mathbb{D}_i$, we count the number of elements of $\mathbb{D}_i$ in the interval of length $h$ that contains $s$. Thus, we obtain a sequence of positive integers, where each member indicates the relative density of the corresponding element of $\mathbb{D}_i$. Finally, we select those $s$ in $\mathbb{D}_i$ such that the corresponding integer is either a local maximum or beyond a threshold.

In the following experiments, we set $h$ as two times the $p$-th percentile of the set $\mathrm{diff}(\tilde{\mathbb{D}}_i)$ for some $p$, where $\tilde{\mathbb{D}}_i$ is the rearrangement of $\mathbb{D}_i$ in ascending order and $\mathrm{diff}(\tilde{\mathbb{D}}_i)$ is the sequence of first-order differences of $\tilde{\mathbb{D}}_i$. Denote by $N$ the cardinality of $\mathbb{X}$. If $N$ is small, we usually choose a smaller $p$; if $N$ is very large, we choose a bigger $p$; otherwise, we choose a proper $p$ between them.
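The following sketch (our own code; the `count_threshold` parameter and the handling of degenerate gaps are our additions) follows the procedure above: sort one distance set, set the bin width $h$ to twice the $p$-th percentile of the first-order differences, count the elements per bin, and keep the bins whose counts are local maxima or exceed a threshold.

```python
import numpy as np

def dense_intervals(distances, p=50, count_threshold=None):
    """Mark the distances lying in dense bins of width h, where h is two times
    the p-th percentile of the first-order differences of the sorted distances."""
    order = np.sort(distances)
    h = 2.0 * np.percentile(np.diff(order), p)
    if h <= 0:                                   # degenerate case: many duplicated distances
        return np.ones(len(distances), dtype=bool)
    edges = np.arange(order[0], order[-1] + h, h)
    counts, _ = np.histogram(distances, bins=edges)
    padded = np.r_[-1, counts, -1]               # pad so the boundary bins can be local maxima
    local_max = (counts >= padded[:-2]) & (counts >= padded[2:])
    keep_bin = local_max if count_threshold is None else (local_max | (counts >= count_threshold))
    bin_index = np.clip(np.digitize(distances, edges) - 1, 0, len(counts) - 1)
    return keep_bin[bin_index]
```

Applying this function to each of the $d + 1$ distance sets yields the masks that define $\mathbb{A}_0, \mathbb{A}_1, \ldots, \mathbb{A}_d$ used below.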

Now, we have obtained $d + 1$ sets $\mathbb{A}_0, \mathbb{A}_1, \ldots, \mathbb{A}_d$, each one containing all the higher-density points of the corresponding distance set. By the triangle inequality, we have the following property:

If there is an $i$ such that $d(X', O_i)$ is not in $\mathbb{A}_i$, then $X'$ cannot be a higher-density point of $\mathbb{X}'$.

According to this property, we can select the subset $\mathbb{B}$ of $\mathbb{X}'$ consisting of those points whose distances to the $i$-th observation point are in $\mathbb{A}_i$ for all $i = 0, 1, \ldots, d$.

For each point $X' \in \mathbb{B}$, we have mapped it to a $(d + 1)$-dimensional vector. By Remark 4 of Theorem 2, all the pairs of corresponding components of two points that belong to the same cluster will have a small difference between them. But it is possible that there are some data points for which some components have a small difference with those of one cluster center while the other components have a small difference with those of another cluster center. In such a case, a few outliers may be missed by the above selection criterion. To discard those few outliers in $\mathbb{B}$ and decrease the number of elements of $\mathbb{B}$, we need to refine $\mathbb{B}$. We have the following criterion according to Remark 4 of Theorem 2.

Suppose $Y \in \mathbb{B}$ has already been selected. Given a data point $X \in \mathbb{B}$, if, for every $i$, $d(X, O_i)$ and $d(Y, O_i)$ have a small difference, then we discard $X$.

We denote the selected subset by $\mathbb{B} = \{Y_1, Y_2, \ldots, Y_M\}$ and set $\mathbb{S} = \varnothing$ and $\mathbb{T} = \varnothing$.

We also need a counter to indicate the density of each data point of $\mathbb{B}$. Firstly, we put $Y_1$ into $\mathbb{S}$ and set its counter to 1. We then sequentially choose the remaining data points in $\mathbb{B}$ and dynamically construct $\mathbb{S}$ and $\mathbb{T}$ according to the following process. Suppose we choose $Y$ from $\mathbb{B}$; then we compute the distance between $Y$ and each data point in $\mathbb{S}$. If some of these distances are less than a threshold value δ, we add 1 to the counter of each data point corresponding to these distances and then discard $Y$. Meanwhile, if the counter of a data point in $\mathbb{S}$ becomes bigger than another threshold value n, we remove this data point from $\mathbb{S}$ and add it into $\mathbb{T}$. But if every point in $\mathbb{S}$ has a distance to $Y$ bigger than δ, we continue to check whether there is a point in $\mathbb{T}$ whose distance to $Y$ is less than δ; we discard $Y$ if there is any, and we add $Y$ to $\mathbb{S}$ (with counter 1) if not. Finally, we obtain a set $\mathbb{T}$ that closely represents the original data set and whose size is much smaller than that of the original data set. Furthermore, none of the outliers of the original data set is included in this selected subset.
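A sketch of this refinement is given below (our own code; `delta` and `n` stand for the two thresholds δ and n described above, and `B` is assumed to be a NumPy array of the candidate points).

```python
import numpy as np

def refine(B, delta, n):
    """Split the candidate set B into a buffer S (with per-point counters) and a dense set T,
    following the sequential process described in the text."""
    S, counts, T = [B[0]], [1], []
    for Y in B[1:]:
        d_to_S = np.linalg.norm(np.array(S) - Y, axis=1) if S else np.array([])
        near = np.where(d_to_S < delta)[0]
        if near.size > 0:
            for j in near:                       # reinforce the nearby buffered points, discard Y
                counts[j] += 1
            for j in sorted(np.where(np.array(counts) > n)[0], reverse=True):
                T.append(S.pop(j))               # promote sufficiently dense points from S to T
                counts.pop(j)
        elif T and np.min(np.linalg.norm(np.array(T) - Y, axis=1)) < delta:
            continue                             # Y is already represented by a dense point in T
        else:
            S.append(Y)                          # Y is genuinely new: buffer it with counter 1
            counts.append(1)
    return np.array(T) if T else np.array(S)     # fall back to S in the degenerate case of an empty T
```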

3.1.3. Clustering $\mathbb{T}$ and $\mathbb{X}$ in the Second Stage

Since the selected set $\mathbb{T}$ has discarded all outliers and has a much smaller size than the original data set, the running time decreases significantly when performing the k-means algorithm on it. Furthermore, because the subset closely represents the original data set, the obtained cluster centers are also suitable to be chosen as the cluster centers of the original data set. Once we have identified the cluster centers, it is easy to cluster the whole data set. The pseudocode of our proposed algorithm is presented in Algorithm 1.

Input:
 The number of clusters: k;
 The percentile number: p;
 The original data set: $\mathbb{X}$;
Method:
 Normalize $\mathbb{X}$ and generate $\mathbb{X}'$;
 for i = 0 to d do
  Set $\mathbb{D}_i = \{d(X', O_i) \mid X' \in \mathbb{X}'\}$, where $O_0 = (0, \ldots, 0)$ and $O_i = e_i$ for $i \ge 1$;
  Generate $\tilde{\mathbb{D}}_i$ by rearranging $\mathbb{D}_i$ in ascending order and set $q_i$ to be the p-th percentile of $\mathrm{diff}(\tilde{\mathbb{D}}_i)$;
  Set $h_i = 2 q_i$ and $[a_i, b_i] = [\min \mathbb{D}_i, \max \mathbb{D}_i]$;
  Equally divide the interval $[a_i, b_i]$ into intervals of length $h_i$;
  Let $\mathbb{A}_i$ be the union of those intervals that contain a local maximum number of elements of $\mathbb{D}_i$;
 end for
 Select the subset $\mathbb{B} \subseteq \mathbb{X}'$: $X' \in \mathbb{B}$ if and only if $d(X', O_i) \in \mathbb{A}_i$ for all $i = 0, 1, \ldots, d$;
 Refine $\mathbb{B}$ and obtain the subsets $\mathbb{S}$ and $\mathbb{T}$;
 Perform the k-means algorithm on $\mathbb{T}$;
 Assign each $X' \in \mathbb{X}'$ to the nearest center of the clusters obtained from $\mathbb{T}$;
Output:
 The result of clustering.
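The second stage is standard: run k-means on the refined subset and assign every point to its nearest resulting center. The following minimal sketch illustrates the last two steps of Algorithm 1; the use of scikit-learn is our own choice, as the paper does not prescribe an implementation.

```python
import numpy as np
from sklearn.cluster import KMeans

def second_stage(T, X_norm, k, random_state=0):
    """Cluster the refined subset T, then label the whole normalized data set
    by the nearest of the obtained centers."""
    km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(T)
    centers = km.cluster_centers_
    dists = np.linalg.norm(X_norm[:, None, :] - centers[None, :, :], axis=2)
    return centers, dists.argmin(axis=1)
```

Because the centers are found on the outlier-free subset, the final assignment of the full data set is not dragged toward the outliers.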
3.2. Analysis of Computational Complexity

In this section, we analyze the computational complexity of the proposed algorithm. When running the classical k-means algorithm, each iteration needs to compute the distances between every data point in the whole data set and the updated cluster centers, which has a time complexity of $O(Nkd)$. In our algorithm, the time cost of the first stage mainly consists of four parts. The first part is to generate the $d + 1$ one-dimensional distance sets, which has a time complexity of $O(Nd)$. The second part is to find the intervals that contain the local maxima of the distance densities, which requires sorting each distance set and therefore has a time complexity of $O(dN\log N)$. The third part is to select $\mathbb{B}$, which has a time complexity of $O(Nd)$. In the fourth part, we refine $\mathbb{B}$ and obtain the subset $\mathbb{T}$, which has a time complexity of $O(N_{\mathbb{B}}^2)$, where $N_{\mathbb{B}}$ denotes the cardinality of $\mathbb{B}$. Thus, the time complexity of the first stage is $O(dN\log N + N_{\mathbb{B}}^2)$. In the second stage, we perform the k-means algorithm on $\mathbb{T}$, and each iteration has a time complexity less than $O(N_{\mathbb{T}}kd)$, where $N_{\mathbb{T}}$ denotes the cardinality of $\mathbb{T}$. Because $N_{\mathbb{B}} \ll N$ and $N_{\mathbb{T}} \ll N$, the total time complexity of the new algorithm is much lower than that of the classical k-means algorithm on the whole data set.

We note that the time complexity of the fourth part in the first stage is usually much less than $O(N_{\mathbb{B}}^2)$. Since many data points are discarded while constructing the sets $\mathbb{S}$ and $\mathbb{T}$, we do not have to compute the distances to all data points in $\mathbb{B}$.

4. Experimental Results and Analysis

In this section, we conduct a series of experiments on 6 synthetic data sets and 3 benchmark data sets (UCI [22] and KEEL [23]) to validate the effectiveness of the proposed two-stage k-means clustering algorithm. The synthetic data sets can be downloaded from BaiduPan (https://pan.baidu.com/s/1MfS8JfQdJLHYSlpZdndLUQ) with the extraction code “p3mc.” We first present the clustering results of our proposed algorithm and the k-means algorithm on two synthetic data sets, i.e., data set #1 and data set #2. The experimental results are shown in Figures 1 and 2. For simplicity, we only use the experimental results on data set #1 to explain the advantage of our proposed algorithm. There are two clusters in data set #1, where each cluster includes 41 data points. The data points obey 2-dimensional normal distributions with mean vectors (3, 11) and (12, 5) and their respective covariance matrices. There are also two outliers in data set #1.

Figure 1(b) gives the selected data points of the normalized data corresponding to data set #1 shown in Figure 1(a). In Figure 1(b), we can see that the outliers have been removed in the first stage of our proposed method. Figure 1(c) shows the clustering result of the k-means algorithm. We can see that the outliers seriously affect the clustering result of the k-means algorithm, although there are only two outlier points in data set #1. The clustering result of our proposed method is presented in Figure 1(d), where the cluster centers are found correctly without the disturbance of outliers. Similar results can be found in Figure 2 for data set #2, which includes 7 clusters and 10 outliers. The experimental results show that our proposed two-stage k-means clustering algorithm is not sensitive to outliers and obtains better clustering results than the k-means clustering algorithm.

Furthermore, we choose another four synthetic data sets as shown in Figure 3 (only a 2-dimensional illustration) and three real-world data sets to compare the clustering performance of our proposed algorithm with that of the k-means algorithm. The details of these data sets and the experimental results are summarized in Table 1, where N is the number of elements of the data set, t is the proportion of outliers in the data set, k is the number of clusters, d is the dimension of the data points, p is the percentile number, and $N_{\mathbb{T}}$ is the cardinality of the selected subset; the remaining columns report the adjusted Rand index (ARI) and time consumption of the k-means algorithm and the ARI and time consumption of our proposed algorithm. From Table 1, we can see that our proposed algorithm obtains larger ARIs with lower time consumption than the k-means clustering algorithm on these synthetic data sets. For the real data sets without outliers, our algorithm obtains ARIs comparable to those of the k-means algorithm. Nevertheless, the ARIs of the k-means algorithm are severely degraded when outliers are deliberately added to the real data sets, while the experimental results in Table 1 demonstrate that our proposed clustering algorithm is robust to the outliers. Table 2 shows the details of the comparison on four large-scale synthetic data sets. The variables in Table 2 have the same meaning as those in Table 1. The comparison of time consumption between our proposed algorithm and the k-means algorithm in Table 2 shows that our algorithm consumes less time than the k-means algorithm. In particular, the superiority of our proposed method in time consumption is more obvious for data sets with larger size and dimension. Furthermore, the most time-consuming procedure in our algorithm, i.e., the selection of the high-density distances for each generated distance set, can be run in parallel, which makes our algorithm easy to extend to clustering tasks on large-scale data sets.

In addition, we provide a real application, i.e., tyre inclusion identification, to validate the clustering performance of our proposed clustering algorithm. Figure 4 shows two tyres with different kinds of inclusions, where each picture includes 1027 × 768 pixels. Figures 5 and 6 present the clustering results of our proposed algorithm and the k-means clustering algorithm on Tyre #1 and Tyre #2, respectively. From these figures, we can see that our proposed method accurately identifies the cluster centers without the disturbance of outliers. The inclusions in the tyres are clearly recognized by our proposed algorithm, while the k-means clustering algorithm does not find the inclusions distinctly; e.g., Figures 6(b), 6(d), and 6(f) include not only the inclusions but also the tyre traces. Overall, the experimental results demonstrate that our algorithm achieves better clustering performance than the classical k-means clustering algorithm when handling clustering tasks disturbed by outliers.

5. Conclusions and Future Work

In this paper, we proposed a robust two-stage k-means clustering algorithm which can accurately identify the cluster centers without the disturbance of outliers. As a direct application of the observation point mechanism of I-nice [18], we select a small subset from the original data set based on a set of nondegenerate observation points in the first stage. In the second stage, we use the k-means clustering algorithm to cluster the selected subset and take the resulting cluster centers as the true cluster centers of the original data set. The theoretical analysis and experimental verification demonstrate the feasibility and effectiveness of the proposed clustering algorithm. Future studies will focus on three directions. First, we will try to use the k-nearest neighbors (kNN) method to improve the selection of observation points. Second, we will seek further real applications for the two-stage k-means clustering algorithm. Third, we will extend our proposed algorithm to cluster big data based on the random sample partition model [24].

Data Availability

The data used in our manuscript can be accessed by readers via our BaiduPan (https://pan.baidu.com/s/1MfS8JfQdJLHYSlpZdndLUQ) with the extraction code “p3mc.”

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This paper was supported by the National Key R&D Program of China (2017YFC0822604-2), Basic Research Foundation of Strengthening Police with Science and Technology of the Ministry of Public Security (2017GABJC09), and Scientific Research Foundation of Shenzhen University for Newly-introduced Teachers (2018060).