SCALABLE CLUSTERING BY TRUNCATED FUZZY c-MEANS

Abstract. Most existing clustering algorithms are slow for dividing a large dataset into a large number of clusters. In this paper, we propose a truncated fuzzy c-means (FCM) algorithm to address this problem. The main idea behind our proposed algorithm is to keep only a small number of cluster centers for each data point during the iterative process of the FCM algorithm. Our numerical experiments on both synthetic and real datasets show that the proposed algorithm is much faster than the original FCM algorithm and that its accuracy is comparable to that of the original FCM algorithm.

1. Introduction. Data clustering refers to a process of dividing a set of items into homogeneous groups or clusters such that items in the same cluster are similar to each other and items from different clusters are distinct [10,1]. As one of the most popular tools for data exploration, data clustering has found applications in many scientific areas such as bioinformatics [21,26], actuarial science and insurance [11,13], and image segmentation [20,25], to name just a few.
During the past six decades, many clustering algorithms have been developed by researchers from different areas. These clustering algorithms can be divided into two groups: hard clustering algorithms and fuzzy clustering algorithms. In hard clustering algorithms, each item is assigned to one and only one cluster; in fuzzy clustering algorithms, each item can be assigned to one or more clusters with some degrees of membership. Examples of hard clustering algorithms include the k-means algorithm [27], which is one of the most widely used clustering algorithms. The FCM (fuzzy c-means) algorithm [9,4,3] is a popular fuzzy clustering algorithm.

GUOJUN GAN, QIUJUN LAN AND SHIYANG SIMA
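The FCM algorithm is formulated to minimize an objective function. Let $X = \{x_1, x_2, \ldots, x_n\}$ be a dataset containing $n$ points. Let $k$ be the desired number of clusters. Then the objective function of the FCM algorithm is defined as
$$Q(U, Z) = \sum_{i=1}^{n} \sum_{l=1}^{k} u_{il}^{\alpha} \|x_i - z_l\|^2, \qquad (1)$$
where $U = (u_{il})_{n \times k}$ is an $n \times k$ fuzzy $k$-partition matrix, $\alpha > 1$ is the fuzzifier, $Z = \{z_1, z_2, \ldots, z_k\}$ is a set of $k$ centers, and $\|\cdot\|$ is the $L_2$-norm or Euclidean distance. Here a fuzzy $k$-partition of a dataset of $n$ points is an $n \times k$ matrix that satisfies the following conditions:
$$u_{il} \in [0, 1], \quad i = 1, 2, \ldots, n, \ l = 1, 2, \ldots, k,$$
$$\sum_{l=1}^{k} u_{il} = 1, \quad i = 1, 2, \ldots, n,$$
$$\sum_{i=1}^{n} u_{il} > 0, \quad l = 1, 2, \ldots, k.$$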
Similar to the k-means algorithm, the FCM algorithm employs an iterative process to minimize the objective function.
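The iterative process can be illustrated with a minimal Python sketch of one FCM iteration, which alternates between updating the memberships for fixed centers and updating the centers for fixed memberships. The function name `fcm_step`, the list-of-lists data layout, and the small constant `eps` added to the squared distances to avoid division by zero are assumptions made for this illustration; they are not taken from the implementation used in the paper.

```python
def fcm_step(X, Z, alpha=2.0, eps=1e-6):
    """One FCM iteration: memberships from centers, then centers from memberships."""
    n, k, d = len(X), len(Z), len(X[0])
    U = []
    for x in X:
        # squared Euclidean distance from x to every center (eps is an assumption)
        dist = [sum((a - b) ** 2 for a, b in zip(x, z)) + eps for z in Z]
        # membership is proportional to dist^(-1/(alpha-1)), normalized per row
        w = [dl ** (-1.0 / (alpha - 1.0)) for dl in dist]
        total = sum(w)
        U.append([wl / total for wl in w])
    new_Z = []
    for l in range(k):
        # each center is the u^alpha-weighted mean of all n points
        den = sum(U[i][l] ** alpha for i in range(n))
        new_Z.append([sum(U[i][l] ** alpha * X[i][j] for i in range(n)) / den
                      for j in range(d)])
    return U, new_Z
```

Note that the sketch stores all $nk$ memberships and evaluates all $nk$ distances in every call, which is the cost discussed in the following paragraph.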
The FCM algorithm has some advantages over the k-means algorithm. For example, the FCM algorithm can reduce the number of local minima of the objective function [22]. However, the FCM algorithm is not efficient for dividing a large dataset into many clusters. Examples of such situations include clustering millions of web pages into a thousand categories [5] and clustering hundreds of thousands of insurance policies into a thousand clusters in order to select a thousand representative policies [11,13]. This inefficiency is caused by the following two factors. First, the FCM algorithm needs to store the full fuzzy partition matrix, which contains $nk$ elements. Second, the FCM algorithm needs to calculate $nk$ distances at each iteration.
In this paper, we propose a modified version of the FCM algorithm, called the TFCM (Truncated FCM) algorithm, to address the aforementioned drawback of the FCM algorithm. In the TFCM algorithm, a subset of the full fuzzy partition matrix is stored and the number of distance calculations at each iteration is reduced. The idea of the TFCM algorithm stems from the insight that when $k$ is large, a data point belongs to only a few clusters with high degrees of membership. As a result, we can ignore the clusters with low degrees of membership while preserving the overall quality of the clustering.
The remaining part of this paper is organized as follows. In Section 2, we give a brief review of relevant work. In Section 3, we introduce the TFCM algorithm in detail. In Section 4, we demonstrate the performance of the TFCM algorithm using numerical experiments. Finally, we conclude the paper with some remarks in Section 5.
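To make the potential savings concrete, the following back-of-the-envelope calculation uses the sizes of the variable annuity experiment in Section 4.2 ($n = 10{,}000$, $k = 200$) with $T = 6$, together with the fact stated in Section 3 that each iteration examines $2T$ centers per data point. It counts only stored memberships and distance evaluations. The index sets, the sorting cost, and the number of iterations are ignored, so this is a rough sketch rather than a runtime prediction.

```latex
\begin{align*}
\text{FCM, stored memberships:} \quad & nk = 10{,}000 \times 200 = 2{,}000{,}000 \\
\text{TFCM, stored memberships:} \quad & nT = 10{,}000 \times 6 = 60{,}000 \\
\text{FCM, distances per iteration:} \quad & nk = 2{,}000{,}000 \\
\text{TFCM, distances per iteration:} \quad & 2nT = 120{,}000 \\
\text{ratio of distance counts:} \quad & \frac{nk}{2nT} = \frac{k}{2T} = \frac{200}{12} \approx 16.7
\end{align*}
```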

2. Related work.
As one of the most popular fuzzy clustering algorithms, the FCM algorithm was originally proposed by [9] and later modified by [4]. Many improvements of the FCM algorithm have been proposed since its introduction. In this section, we give a brief review of research work related to the efficiency of the FCM algorithm.
[6] proposed the AFCM (Approximate FCM) algorithm by replacing some variates in the FCM equations with integer-valued or real-valued estimates. The AFCM algorithm was developed to process digital images interactively. In the implementation of the AFCM algorithm, the fuzzy memberships $u_{il}$ are approximated by real numbers with three decimal places and stored as integers in [0, 1000] in memory. In addition, the AFCM algorithm stores six internal tables in memory and uses a table lookup approach to eliminate the use of exponentiation operators in the updating of the cluster centers and fuzzy memberships. Experimental results show that the runtime of each iteration of the AFCM algorithm is reduced approximately to one sixth of that of a literal implementation of the FCM algorithm.
[7] proposed a multistage random sampling FCM algorithm, called the mrFCM algorithm, to reduce the runtime of the FCM algorithm. The mrFCM algorithm consists of two phases. In the first phase, the FCM algorithm is applied to a series of subsamples selected randomly from the whole dataset in order to find good initial cluster centers. In the second phase, the standard FCM algorithm with the initial cluster centers obtained from the first phase is applied to partition the whole dataset.
[19] proposed the psFCM (partition simplification FCM) algorithm to speed up the FCM algorithm by simplifying the computation at each iteration and reducing the number of iterations. Similar to the mrFCM algorithm [7], the psFCM algorithm also consists of two phases. In the first phase, the kd-tree method is first used to partition the whole dataset into small blocks. All points in a block are represented by the centroid of the block. In this way, a large dataset is reduced to a simplified dataset that is much smaller than the original dataset. Then the FCM algorithm is applied to the simplified dataset to obtain the actual cluster centers. In the second phase, the FCM algorithm with the cluster centers obtained from the first phase is applied to partition the original dataset.
[23] proposed a modified version of the FCM algorithm by eliminating the need to store the fuzzy partition matrix $U$. In the modified version, updating the cluster centers and updating the fuzzy memberships are combined into a single step. The original FCM algorithm has a time complexity of $O(nk^2d)$, but this modified version reduces the time complexity to $O(nkd)$, where $n$, $k$, and $d$ denote the number of data points, the desired number of clusters, and the number of attributes, respectively.
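The idea of combining the two updates so that the matrix $U$ never has to be stored can be sketched as follows. This is only an illustration in the spirit of the description above, not the implementation of [23]: the function name `fcm_step_no_matrix`, the constant `eps`, and the data layout are assumptions. Each membership row is computed, accumulated into running sums for the new centers, and then discarded.

```python
def fcm_step_no_matrix(X, Z, alpha=2.0, eps=1e-6):
    """Return updated centers without materializing the n-by-k membership matrix."""
    k, d = len(Z), len(X[0])
    num = [[0.0] * d for _ in range(k)]   # running sums of u^alpha * x
    den = [0.0] * k                       # running sums of u^alpha
    for x in X:
        # one membership row, computed in O(kd) and thrown away afterwards
        dist = [sum((a - b) ** 2 for a, b in zip(x, z)) + eps for z in Z]
        w = [dl ** (-1.0 / (alpha - 1.0)) for dl in dist]
        total = sum(w)
        for l in range(k):
            ua = (w[l] / total) ** alpha
            den[l] += ua
            for j in range(d):
                num[l][j] += ua * x[j]
    return [[num[l][j] / den[l] for j in range(d)] for l in range(k)]
```

The sketch still evaluates all $nk$ distances in each pass, so it saves memory but not distance calculations.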
[24] proposed the PFCM (Parallel FCM) algorithm for clustering large datasets by using the Message Passing Interface (MPI). In the PFCM algorithm with $P$ processors, a dataset of $n$ points is divided into $P$ blocks of equal size so that each processor processes $n/P$ data points. The fuzzy partition matrix is also divided into $P$ sections so that the memberships of the local data points of a processor are stored in that processor's memory. An MPI call is used to pass messages if a computation requires data points stored in other processors. The PFCM algorithm is an example of speeding up the FCM algorithm through hardware. Another example of speeding up the FCM algorithm through hardware is to use the graphics-processing unit (GPU) [28].
[17] proposed the geFFCM (generalized extensible fast fuzzy c-means) algorithm to cluster very large datasets. The geFFCM algorithm is similar to the mrFCM algorithm [7] and the psFCM algorithm [19] in the sense that a divide-and-conquer strategy is used by all three algorithms. In the geFFCM algorithm, a subsample $X_{SS}$ is drawn from the original dataset $X$ without replacement such that the number of features for which $X_{SS}$ and $X$ agree is not less than a specified number. Then the standard FCM algorithm is applied to $X_{SS}$ to obtain the cluster centers. Finally, the cluster centers are used to obtain the fuzzy partition matrix of the original dataset.
[18] compared three different implementations of the FCM algorithm for clustering very large datasets. In particular, [18] compared the random sample and extension FCM, single-pass FCM, and on-line FCM. In addition, kernelized versions of the three algorithms were also compared.
[29] proposed the FCM++ algorithm to improve the speed of the FCM algorithm by using the seeding mechanism of the K-means++ algorithm [2].
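For readers unfamiliar with the seeding mechanism of K-means++ [2], a minimal sketch is given below: the first center is chosen uniformly at random, and each further center is chosen with probability proportional to the squared distance to the nearest center chosen so far. The function name `kmeanspp_seeds` and the fixed random seed are assumptions for illustration, and any FCM++-specific details of [29] are not reproduced. The sketch assumes that the dataset contains at least $k$ distinct points.

```python
import random

def kmeanspp_seeds(X, k, rng=None):
    """Pick k initial centers from X by squared-distance-weighted sampling."""
    rng = rng or random.Random(0)
    centers = [list(rng.choice(X))]
    # d2[i] is the squared distance from X[i] to its nearest chosen center
    d2 = [sum((a - b) ** 2 for a, b in zip(x, centers[0])) for x in X]
    while len(centers) < k:
        # sample the next center with probability proportional to d2
        i = rng.choices(range(len(X)), weights=d2, k=1)[0]
        centers.append(list(X[i]))
        d2 = [min(old, sum((a - b) ** 2 for a, b in zip(x, X[i])))
              for old, x in zip(d2, X)]
    return centers
```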
Almost all of the aforementioned algorithms aim at speeding up the FCM algorithm for large datasets. These algorithms do not scale well when the desired number of clusters is large. The algorithm proposed by [23] reduces the time complexity of the FCM algorithm from $O(nk^2d)$ to $O(nkd)$. However, the algorithm still needs to calculate $nk$ distances at each iteration. In the next section, we propose the truncated fuzzy c-means algorithm to approximate the FCM algorithm when the desired number of clusters is large.
3. The TFCM algorithm. In this section, we introduce the TFCM (Truncated Fuzzy c-means) algorithm.
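Let $X = \{x_1, x_2, \ldots, x_n\}$ be a dataset containing $n$ points. Let $k$ be the desired number of clusters. A fuzzy partition matrix $U = (u_{il})_{n \times k}$ of dividing $X$ into $k$ clusters is an $n \times k$ matrix that satisfies the following conditions:
$$u_{il} \in [0, 1], \qquad \sum_{l=1}^{k} u_{il} = 1, \qquad \sum_{i=1}^{n} u_{il} > 0, \qquad i = 1, 2, \ldots, n, \ l = 1, 2, \ldots, k.$$
Let $T$ be an integer such that $1 \le T \le k$. Let $\mathcal{U}_T$ be the set of fuzzy partition matrices $U$ such that each row of $U$ has at most $T$ nonzero entries. In other words, $U \in \mathcal{U}_T$ if $U$ is a fuzzy partition matrix such that
$$|\{l : u_{il} > 0\}| \le T \qquad (3)$$
for each $i = 1, 2, \ldots, n$, where $|\cdot|$ denotes the number of elements in a set. Then the objective function of the TFCM algorithm is defined as
$$P(U, Z) = \sum_{i=1}^{n} \sum_{l=1}^{k} u_{il}^{\alpha} \left( \|x_i - z_l\|^2 + \epsilon \right), \quad U \in \mathcal{U}_T, \qquad (4)$$
where $\|\cdot\|$ is the $L_2$-norm or Euclidean distance, and $\epsilon$ is a small positive number used to prevent division by zero. Let $I_i = \{l : u_{il} > 0\}$ for $i = 1, 2, \ldots, n$. Then we can rewrite the objective function (4) as
$$P(U, Z) = \sum_{i=1}^{n} \sum_{l \in I_i} u_{il}^{\alpha} \left( \|x_i - z_l\|^2 + \epsilon \right). \qquad (5)$$
From Equation (5) we see that the main difference between the TFCM algorithm and the fuzzy c-means algorithm is the constraint given in Equation (3). If $T = 1$, then the TFCM algorithm becomes the k-means algorithm. If $T = k$, then the TFCM algorithm becomes the fuzzy c-means algorithm.
Theorem 3.1. Given a set of centers $Z$, the fuzzy partition matrix $U \in \mathcal{U}_T$ that minimizes the objective function (4) is given by
$$u_{il} = \begin{cases} \dfrac{\left( \|x_i - z_l\|^2 + \epsilon \right)^{-\frac{1}{\alpha - 1}}}{\sum_{s \in I_i} \left( \|x_i - z_s\|^2 + \epsilon \right)^{-\frac{1}{\alpha - 1}}}, & l \in I_i, \\ 0, & l \notin I_i, \end{cases} \qquad (6)$$
where $I_i$ is the set of indices of the $T$ centers that are closest to $x_i$, i.e.,
$$I_i = \{l_1, l_2, \ldots, l_T\}, \qquad (7)$$
with $(l_1, l_2, \ldots, l_k)$ being a permutation of $(1, 2, \ldots, k)$ such that
$$\|x_i - z_{l_1}\| \le \|x_i - z_{l_2}\| \le \cdots \le \|x_i - z_{l_k}\|.$$
Proof. For each $i = 1, 2, \ldots, n$, let $I_i$ be the set of indices defined in Equation (7). We first show that for these $I_i$, the optimal weights are given in Equation (6). Since the rows of a fuzzy partition matrix are independent of each other, the objective function (4) is minimized if for each $i = 1, 2, \ldots, n$, the following function is minimized:
$$P_i(u_i, I_i) = \sum_{l \in I_i} u_{il}^{\alpha} \left( \|x_i - z_l\|^2 + \epsilon \right)$$
subject to $\sum_{l \in I_i} u_{il} = 1$, where $u_i = (u_{i1}, u_{i2}, \ldots, u_{ik})$. Using the method of Lagrange multipliers, we can obtain the optimal weights by minimizing the following function:
$$P_i(u_i, \lambda, I_i) = \sum_{l \in I_i} u_{il}^{\alpha} \left( \|x_i - z_l\|^2 + \epsilon \right) + \lambda \left( \sum_{l \in I_i} u_{il} - 1 \right).$$
We can obtain the optimal weights given in Equation (6) by solving the equations obtained by taking derivatives of $P_i(u_i, \lambda, I_i)$ with respect to $\lambda$ and $u_{il}$ for $l \in I_i$ and equating the derivatives to zero.
Now we show that
$$P_i(u_i^*, I_i) \le P_i(v_i^*, J_i), \qquad (9)$$
where $J_i$ is an arbitrary subset of $\{1, 2, \ldots, k\}$ such that $|J_i| \le T$, and $u_i^*$ and $v_i^*$ are the optimal weights obtained from Equation (6) when the underlying index sets are $I_i$ and $J_i$, respectively. Since $u_i^*$ is the vector of optimal weights, we have
$$P_i(u_i^*, I_i) = \left( \sum_{l \in I_i} \left( \|x_i - z_l\|^2 + \epsilon \right)^{-\frac{1}{\alpha - 1}} \right)^{-(\alpha - 1)}.$$
Similarly, we have
$$P_i(v_i^*, J_i) = \left( \sum_{l \in J_i} \left( \|x_i - z_l\|^2 + \epsilon \right)^{-\frac{1}{\alpha - 1}} \right)^{-(\alpha - 1)}.$$
Since $\alpha > 1$, $I_i$ contains the indices of the $T$ centers that are closest to $x_i$, and $|J_i| \le T$, we have
$$\sum_{l \in I_i} \left( \|x_i - z_l\|^2 + \epsilon \right)^{-\frac{1}{\alpha - 1}} \ge \sum_{l \in J_i} \left( \|x_i - z_l\|^2 + \epsilon \right)^{-\frac{1}{\alpha - 1}},$$
which shows that the inequality given in Equation (9) is true. This completes the proof.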
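The membership update of Theorem 3.1 for a single data point can be sketched in Python as follows. The function name `truncated_memberships`, the sparse dictionary representation of a row of $U$, and the default values of `alpha` and `eps` are assumptions made for this illustration. The sketch computes the distances to all $k$ centers, which corresponds to the initialization phase only.

```python
import heapq

def truncated_memberships(x, Z, T, alpha=2.0, eps=1e-6):
    """Return {center index: membership} for the T centers closest to x."""
    # squared distances plus eps, as in the TFCM objective function
    dist = [sum((a - b) ** 2 for a, b in zip(x, z)) + eps for z in Z]
    # indices of the T closest centers (the set I_i)
    nearest = heapq.nsmallest(T, range(len(Z)), key=dist.__getitem__)
    # weights proportional to dist^(-1/(alpha-1)), normalized over I_i only
    w = {l: dist[l] ** (-1.0 / (alpha - 1.0)) for l in nearest}
    total = sum(w.values())
    return {l: wl / total for l, wl in w.items()}
```

With `T=1` the single kept membership equals 1, which matches the remark that the TFCM algorithm reduces to the k-means algorithm in that case.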
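Theorem 3.2. Given a fuzzy partition matrix $U \in \mathcal{U}_T$, the set of centers $Z$ that minimizes the objective function (4) is given by
$$z_{lj} = \frac{\sum_{i=1}^{n} u_{il}^{\alpha} x_{ij}}{\sum_{i=1}^{n} u_{il}^{\alpha}} \qquad (10)$$
for $l = 1, 2, \ldots, k$ and $j = 1, 2, \ldots, d$, where $d$ is the number of features, $z_{lj}$ is the $j$th component of $z_l$, and $x_{ij}$ is the $j$th component of $x_i$. The proof of Theorem 3.2 is omitted as it is similar to that of the corresponding result for the FCM algorithm.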
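Because each row of $U$ has at most $T$ nonzero entries, the center update of Theorem 3.2 only needs to visit those entries. A minimal sketch under stated assumptions: the function name `update_centers` is hypothetical, each row of $U$ is stored as a dictionary mapping center indices to memberships, and a center that receives no weight keeps its previous position (the paper does not say how such a case is handled).

```python
def update_centers(X, U, Z_old, alpha=2.0):
    """U[i] is a dict {center index: membership}; returns the new centers."""
    k, d = len(Z_old), len(X[0])
    num = [[0.0] * d for _ in range(k)]
    den = [0.0] * k
    for x, row in zip(X, U):
        for l, u in row.items():          # only the (at most T) nonzero memberships
            ua = u ** alpha
            den[l] += ua
            for j in range(d):
                num[l][j] += ua * x[j]
    # assumption: a center with zero total weight keeps its previous position
    return [[num[l][j] / den[l] for j in range(d)] if den[l] > 0 else list(Z_old[l])
            for l in range(k)]
```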
Algorithm 1: Pseudo-code of the TFCM Algorithm.
The pseudo-code of the TFCM algorithm is given in Algorithm 1. The TFCM algorithm consists of two phases: the initialization phase and the iteration phase. In the initialization phase, we initialize the cluster centers to be $k$ data points randomly selected from the dataset. We also calculate the distances between all data points and all initial centers in order to calculate the weights $U$. In the iteration phase, for each data point $x_i$, we only calculate the distances between the data point and $2T$ cluster centers, which include the $T$ existing centers saved in $I_i$ and $T$ centers randomly selected from the remaining $k - T$ centers. Among the $2T$ centers, we only keep the $T$ centers that are closest to the point $x_i$ and save them for the next iteration. The advantage of selecting some centers from the remaining centers is that it helps us to find the true center for a data point if the true center was not selected in the initialization phase or previous iterations.
Regarding how to choose a value for the parameter $T$, a good starting point is $d + 1$, where $d$ is the dimension of the underlying dataset. The reason to use $T = d + 1$ is that $d + 1$ is the number of vertices of a $d$-dimensional simplex. In a $d$-dimensional dataset, a point can be surrounded by $d + 1$ sphere-shaped clusters. For example, in a one-dimensional dataset, a data point has two closest clusters. In a two-dimensional dataset, a data point has three closest clusters.
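The two phases described above can be put together in a compact, self-contained Python sketch. It is an illustration under stated assumptions rather than the authors' Java implementation: the function names, the sparse dictionary rows, the default values of `eps` and `delta`, the rejection sampling used to draw the random candidate centers, and the rule that an unused center keeps its position are all choices made here for brevity.

```python
import random

def sqdist(x, z):
    return sum((a - b) ** 2 for a, b in zip(x, z))

def weights(x, Z, idx, alpha, eps):
    # memberships over the index set idx, as in Equation (6)
    w = {l: (sqdist(x, Z[l]) + eps) ** (-1.0 / (alpha - 1.0)) for l in idx}
    total = sum(w.values())
    return {l: wl / total for l, wl in w.items()}

def sample_others(k, kept, T, rng):
    # draw up to T distinct indices outside `kept` by rejection sampling
    m = min(T, k - len(kept))
    chosen = set()
    while len(chosen) < m:
        l = rng.randrange(k)
        if l not in kept:
            chosen.add(l)
    return chosen

def tfcm(X, k, T, alpha=2.0, eps=1e-6, delta=1e-6, n_max=1000, seed=0):
    rng = random.Random(seed)
    d = len(X[0])
    Z = [list(x) for x in rng.sample(X, k)]       # k random data points
    # initialization phase: all k distances per point, keep the T closest
    U = []
    for x in X:
        idx = sorted(range(k), key=lambda l: sqdist(x, Z[l]))[:T]
        U.append(weights(x, Z, idx, alpha, eps))
    prev = 0.0
    for _ in range(n_max):
        # center update, Equation (10); unused centers stay where they are
        num = [[0.0] * d for _ in range(k)]
        den = [0.0] * k
        for x, row in zip(X, U):
            for l, u in row.items():
                ua = u ** alpha
                den[l] += ua
                for j in range(d):
                    num[l][j] += ua * x[j]
        Z = [[num[l][j] / den[l] for j in range(d)] if den[l] > 0 else Z[l]
             for l in range(k)]
        # membership update over the T kept centers plus up to T random others
        obj = 0.0
        for i, x in enumerate(X):
            kept = set(U[i])
            cand = kept | sample_others(k, kept, T, rng)
            idx = sorted(cand, key=lambda l: sqdist(x, Z[l]))[:T]
            U[i] = weights(x, Z, idx, alpha, eps)
            obj += sum(u ** alpha * (sqdist(x, Z[l]) + eps) for l, u in U[i].items())
        if abs(obj - prev) < delta:               # change in objective below delta
            break
        prev = obj
    return U, Z
```

A call such as `tfcm(X, k=200, T=6)` returns the sparse memberships and the centers. Sorting the at most $2T$ candidates per point is the overhead that makes the truncated variant unattractive when $k$ is small.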

Parameter Default Value
Table 1. Default values for some parameters required by the TFCM algorithm.
A list of default values for the parameters required by the TFCM algorithm is given in Table 1. The parameters $N_{max}$ and $\delta$ are used to terminate the algorithm. The FCM algorithm usually converges in a few iterations. Since the TFCM algorithm only calculates the distances between data points and a small subset of the centers, it may need more iterations to converge. Hence we suggest setting the default value of the maximum number of iterations to 1000. The parameter $\alpha$ is the fuzzifier, which should be larger than 1.

4. Experimental evaluation.
In this section, we present some numerical results to demonstrate the performance of the TFCM algorithm in terms of speed and accuracy. We also compare the performance of the TFCM algorithm to that of the FCM algorithm. We implemented both the TFCM algorithm and the FCM algorithm in Java. In order to make a relatively fair comparison between the TFCM algorithm and the FCM algorithm, we used the same sets of initial cluster centers and the same criteria to terminate the algorithms.

4.1. Results on synthetic data. To show that the TFCM algorithm works, we created two synthetic datasets, which are summarized in Table 2. Both synthetic datasets are two-dimensional datasets. One dataset contains four clusters and the other dataset contains 100 clusters. Figure 1 shows the two datasets. Since we know the labels of the data points of the two synthetic datasets, we use the corrected Rand index [8,14,15,16] to measure the accuracy of the clustering algorithms. The corrected Rand index, denoted by $R_c$, ranges from −1 to 1. A higher value of the corrected Rand index indicates a more accurate clustering result.
Table 3. Runtime and accuracy of the TFCM algorithm and the FCM algorithm when applied to the first synthetic dataset 10 times with different initial cluster centers. In the TFCM algorithm, $T = 3$ and other parameters were set to the default values given in Table 1. The runtime is measured in seconds. The numbers in parentheses are the corresponding standard deviations.
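The corrected Rand index used as the accuracy measure in this subsection can be computed from the contingency table of the true labels and the predicted hard labels. The sketch below uses the standard pair-counting formula for the corrected (adjusted) Rand index. The function name `corrected_rand` is an assumption, the fuzzy memberships are assumed to have been converted to hard labels beforehand, and the degenerate case in which both partitions consist of a single cluster (zero denominator) is not handled.

```python
from collections import Counter
from math import comb

def corrected_rand(labels_true, labels_pred):
    """Corrected Rand index of two hard partitions given as label sequences."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))   # contingency table cells
    rows = Counter(labels_true)                      # row sums
    cols = Counter(labels_pred)                      # column sums
    sum_cells = sum(comb(c, 2) for c in pairs.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)      # value expected by chance
    maximum = 0.5 * (sum_rows + sum_cols)
    return (sum_cells - expected) / (maximum - expected)
```

The index does not depend on how the clusters are numbered, so a clustering that recovers the true groups under different labels still scores 1.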
Table 3 shows the speed and accuracy of the TFCM algorithm and the FCM algorithm when applied to the first synthetic dataset. Since both algorithms use random initial cluster centers, we ran the two algorithms 10 times to alleviate the impact of initial cluster centers on the performance measures.
The first synthetic dataset contains 400 data points and four clusters of equal size. When the number of clusters $k$ was set to small numbers (e.g., 2 and 4), the TFCM algorithm was slower than the FCM algorithm on average. However, when $k$ was set to 8, the TFCM algorithm outperformed the FCM algorithm on average in terms of speed. In terms of accuracy, both the TFCM algorithm and the FCM algorithm produced a corrected Rand index of 1 when $k$ was set to the true number of clusters. In addition, the standard deviations of the corrected Rand index are zero when $k = 2$ and 4, indicating that the clustering results were not affected by the initial cluster centers. However, when $k = 8$, the standard deviations become positive, indicating that the clustering results were affected by the initial cluster centers.
The test results on the first synthetic dataset show that when $k$ is small, the TFCM algorithm is slower than the FCM algorithm. This is expected as the implementation of TFCM involves sorting distances. The additional runtime caused by sorting exceeds the runtime saved by performing fewer distance calculations.
Table 4. Runtime and accuracy of the TFCM algorithm and the FCM algorithm when applied to the second synthetic dataset 10 times. The runtime is measured in seconds. The numbers in parentheses are the corresponding standard deviations.
Table 4 shows the speed and accuracy of the two algorithms when applied to the second synthetic dataset 10 times. The second synthetic dataset contains 5000 data points, which are contained in 100 clusters. Each cluster contains 50 points. Table 4(a) shows the speed and accuracy of the TFCM algorithm when $T = 3$. Comparing Tables 4(a) and 4(c), we see that the TFCM algorithm was significantly faster than the FCM algorithm. For example, it only took the TFCM algorithm about 6.869 seconds to finish clustering the dataset when $k = 50$ and $T = 3$, while it took the FCM algorithm 71.877 seconds to finish clustering the dataset when $k = 50$. For $k = 50$, 100, and 200, the average accuracy of the TFCM algorithm when $T = 3$ is close to that of the FCM algorithm.
When we increased $T$ from 3 to 6, the average accuracy of the TFCM algorithm increased slightly for $k = 100$ and 200. This is reasonable because when $k$ is large, increasing $T$ helps find the true cluster centers. Comparing Table 4(b) and Table 4(c), we see that the average corrected Rand index of the TFCM algorithm is higher than that of the FCM algorithm when $k$ was set to the true number of clusters. This might be caused by the fact that we only used 10 runs to calculate the average accuracy. If we use 100 runs to calculate the average corrected Rand index, the FCM algorithm may be more accurate than the TFCM algorithm.
If we look at the average runtime for different $k$ in Tables 4(a), 4(b), and 4(c), we see that the average runtime when $k$ was set to the true number of clusters is lower than that when $k$ was set to other numbers. For example, the average runtime of the TFCM algorithm when $k = 100$ and $T = 3$ was 5.084 seconds, which is lower than the average runtime when $k = 50$ and $k = 200$. We see a similar pattern of runtime for the FCM algorithm. This is caused by the fact that when $k$ is not set to the true number of clusters, it takes both algorithms more iterations to converge on average.

4.2. Results on real data. As we mentioned in the introduction section of this article, data clustering was used to divide a large portfolio of variable annuity contracts into hundreds of clusters in order to find representative contracts for metamodeling [11,13]. Existing clustering algorithms are slow for dividing a large dataset into hundreds of clusters. In this subsection, we apply the TFCM algorithm to divide a large portfolio of variable annuity contracts into hundreds of clusters.
The variable annuity dataset was simulated by a Java program [12]. The dataset contains 10,000 variable annuity contracts. The original dataset contains categorical variables. We converted the categorical variables into binary dummy variables and normalized all numerical variables to the interval [0,1]. The resulting dataset has 22 numerical features. Since the dataset has no labels, we cannot use the corrected Rand index to measure the accuracy of the clustering results. To compare the clustering results of this dataset, we use the within-cluster sum of squares defined as
$$\mathrm{WSS} = \sum_{l=1}^{k} \sum_{x \in C_l} \|x - z_l\|^2,$$
where $C_1, C_2, \ldots, C_k$ are $k$ hard clusters obtained from the fuzzy membership matrix $U$ and $z_l$ is the average of the data points in the cluster $C_l$. For fixed $k$, a lower value of WSS indicates a better clustering result.
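A minimal sketch of the within-cluster sum of squares is given below. The paper does not state how the hard clusters are derived from $U$; the sketch assumes that each point is assigned to the cluster with its largest membership. The function name `wss` and the sparse dictionary rows are likewise assumptions for illustration.

```python
def wss(X, U):
    """Within-cluster sum of squares; U[i] maps center index -> membership."""
    clusters = {}
    for x, row in zip(X, U):
        label = max(row, key=row.get)     # assumption: largest membership wins
        clusters.setdefault(label, []).append(x)
    total = 0.0
    for pts in clusters.values():
        d = len(pts[0])
        # the cluster mean, not the fuzzy center, is used as z_l
        mean = [sum(p[j] for p in pts) / len(pts) for j in range(d)]
        total += sum((p[j] - mean[j]) ** 2 for p in pts for j in range(d))
    return total
```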
We applied the TFCM algorithm to this dataset with different values of $T$. The default value of $T$ for this dataset is 23 because the dimension of this dataset is 22. We also tested the TFCM algorithm with $T = 3$, 6, 12, and 46. The results are shown in Table 5.
Table 5(f) shows the result of the FCM algorithm when applied to the variable annuity dataset. From this table we see that it took the FCM algorithm about 756.378 seconds to divide the dataset into $k = 200$ clusters. The standard deviation of the runtime is also large, indicating that the convergence of the FCM algorithm is sensitive to the initial cluster centers.
Tables 5(a)-5(e) give the results of the TFCM algorithm when applied to the variable annuity dataset with different values of $T$. From these tables we see that the runtime of the TFCM algorithm increases when $T$ increases. The tables also show that the TFCM algorithm achieved the best result when $T = 6$. For example, when $T = 6$ and $k = 200$, the within-cluster sum of squares (WSS) produced by the TFCM algorithm is 721.29, which is close to WSS = 697.841 produced by the FCM algorithm. When $T$ increases from 6 to 46, the WSS measure increases. This is counterintuitive because we expect WSS to decrease when $T$ increases. The reason WSS increased when $T$ increased from 6 to 46 is that we used two criteria to terminate the algorithm: $\delta$ and $N_{max}$. When $T$ is large, we expect the TFCM algorithm to need more iterations before the change of the objective function falls below $\delta$. Since we terminated the TFCM algorithm when the number of iterations reached $N_{max} = 1000$, the clustering result was still suboptimal.
Table 5. Speed and accuracy of the TFCM algorithm and the FCM algorithm when applied to the variable annuity dataset 10 times. The runtime is in seconds. The numbers in parentheses are the corresponding standard deviations.
If we compare Table 5(b) and Table 5(f), we see that the TFCM algorithm is more than 20 times faster than the FCM algorithm. For example, it took the FCM algorithm about 756 seconds on average to divide the dataset into 200 clusters, but it only took the TFCM algorithm about 32 seconds on average to divide the dataset into 200 clusters. In summary, the numerical experiments show that the TFCM algorithm outperformed the FCM algorithm in terms of speed when the desired number of clusters is large. The accuracy of the TFCM algorithm is close to that of the FCM algorithm.
5. Concluding remarks. In some situations, we need to divide a large dataset into a large number of clusters. For example, we need to divide millions of web pages into thousands of categories [5] and divide a large portfolio of insurance policies into hundreds of clusters in order to select representative policies [11,13]. Most existing algorithms are not efficient when used to divide a large dataset into a large number of clusters.
In this paper, we proposed a truncated fuzzy c-means (TFCM) algorithm to address the problem when both the number of data points and the desired number of clusters are large. The TFCM algorithm is similar to the FCM algorithm in the initialization phase. However, the TFCM algorithm is different from the FCM algorithm in the iteration phase, where the TFCM algorithm only keeps a subset of cluster centers for each data point and only calculates the distances between each data point and a subset of cluster centers. Our numerical experiments on both synthetic and real datasets show that the TFCM algorithm outperforms the FCM algorithm significantly in terms of speed when the desired number of clusters is large. In addition, the accuracy of the TFCM algorithm is comparable to that of the FCM algorithm.
We implemented the TFCM algorithm in a straightforward way according to the pseudo-code given in Algorithm 1. The speed of the TFCM algorithm can be further improved by using the technique introduced by [23]. This technique allows us to combine the step of updating the cluster centers and the step of updating the fuzzy memberships into a single step. In the future, we would like to incorporate this technique into the TFCM algorithm and compare the TFCM algorithm with other algorithms mentioned in Section 2.
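As a quick check of the quoted speedups, the ratios below use the average runtimes reported in the text: 71.877 and 6.869 seconds for the second synthetic dataset with $k = 50$ (Section 4.1), and the rounded values of 756 and 32 seconds for the variable annuity dataset with $k = 200$.

```latex
\begin{align*}
\text{second synthetic dataset, } k = 50: \quad & \frac{71.877}{6.869} \approx 10.5 \\
\text{variable annuity dataset, } k = 200: \quad & \frac{756}{32} \approx 23.6
\end{align*}
```

The second ratio is consistent with the statement that the TFCM algorithm is more than 20 times faster on this dataset.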

Algorithm 1.
Initialize the set of cluster centers $Z^{(0)}$ by selecting $k$ data points from $X$ randomly;
for $i = 1$ to $n$ do
    Calculate the distance between $x_i$ and all $k$ centers;
    Let $I_i$ be the subset of $\{1, 2, \ldots, k\}$ such that the corresponding $T$ centers are closest to $x_i$;
    Update the weights $u_{il}^{(0)}$ for $l \in I_i$ according to Equation (6);
end
$s \leftarrow 0$;
$P^{(0)} \leftarrow 0$;
while True do
    Update the set of cluster centers $Z^{(s+1)}$ according to Equation (10);
    for $i = 1$ to $n$ do
        Select $T$ centers with indices in $\{1, 2, \ldots, k\} \setminus I_i$ randomly, and let $J_i$ be the set of their indices;
        Calculate the distance between $x_i$ and centers with indices in $I_i \cup J_i$;
        Update $I_i$ with the indices of the $T$ centers that are closest to $x_i$;
        Update the weights $u_{il}^{(s+1)}$ for $l \in I_i$ according to Equation (6);
    end
    $P^{(s+1)} \leftarrow P(U^{(s+1)}, Z^{(s+1)})$;
    if $|P^{(s+1)} - P^{(s)}| < \delta$ or $s \ge N_{max}$ then
        Break;
    end
    $s \leftarrow s + 1$;
end

Figure 1. Two synthetic datasets. The first dataset contains 4 clusters and the second dataset contains 100 clusters.

Table 2. Summary of the two synthetic datasets.