Abstract

The k-modes clustering algorithm has been widely used to cluster categorical data. In this paper, we first analyze the k-modes algorithm and its dissimilarity measure. On this basis, we propose a novel dissimilarity measure, named GRD. GRD considers not only the relationships between an object and all cluster modes but also the differences among attributes. Finally, experiments were conducted on four real data sets from the UCI Machine Learning Repository. The results show that GRD achieves better performance than the two existing dissimilarity measures used in the k-modes and Cao’s algorithms.

1. Introduction

Clustering is an important technique in data mining whose main task is to group the given data based on some similarity/dissimilarity measure [1]. Most clustering techniques largely use distance to measure the dissimilarity between objects [2–4]. However, these methods work only on data sets with numeric attributes, which limits their use in solving categorical data clustering problems [5].

Some researchers have made great efforts to quantify the relationships among different categorical attributes. Guha et al. [6] proposed a hierarchical clustering method termed ROCK, which can measure the similarity between a pair of objects [7]. In ROCK, the number of links between two objects is computed as the number of their common neighbors [8]. However, two deficiencies remain: (1) two involved parameters must be assigned in advance and (2) massive computation is involved [9]. For these reasons, researchers have developed new algorithms such as QROCK [10], DNNS [11], and GE-ROCK [12] to modify or improve the ROCK algorithm. To remove the numeric-only limitation of the k-means algorithm, Huang et al. [13, 14] proposed the k-modes algorithm, which extends the k-means algorithm by using (1) a simple matching dissimilarity measure for categorical attributes; (2) modes in place of means for clustering; and (3) a frequency-related strategy to update modes so as to minimize the clustering costs [15]. In fact, the idea of simple matching has been used in many clustering algorithms, such as the fuzzy k-modes algorithm [16], the fuzzy k-modes algorithm with fuzzy centroid [17], and the k-prototypes algorithm [14]. However, simple matching often results in clusters with weak intrasimilarity [18] and disregards the dissimilarity hidden among categorical values [19].

In this paper, a Global-Relationship Dissimilarity (GRD) measure for the k-modes clustering algorithm is proposed. Instead of simple matching, this dissimilarity measure considers not only the relationships between an object and all cluster modes but also the differences among attributes. The clustering effectiveness of k-modes based on GRD (KBGRD) is demonstrated on four standard data sets from the UCI Machine Learning Repository [20].

The remainder of this paper is organized as follows. A detailed review of the dissimilarity measure used in k-modes is presented and analyzed in Section 2. In Section 3, the new dissimilarity measure GRD is proposed. Section 4 describes the details of the KBGRD algorithm. Section 5 illustrates the performance and stability of KBGRD. Finally, a concluding remark is given in Section 6.

2. The Dissimilarity Measure in the k-Modes Algorithm

2.1. Categorical Data

Structured data can be stored in a table in which each row represents a fact about an object, and practical data usually contain categorical attributes [21]. We first define the term “data set” [22].

Definition 1 (data set). A data set information system can be expressed as a quadruple $IS = (U, A, V, f)$, which satisfies the following: (1) $U = \{x_1, x_2, \ldots, x_n\}$ is a nonempty set of data objects, called a universe; (2) $A = \{a_1, a_2, \ldots, a_m\}$ is a nonempty set of categorical attributes; (3) $V$ is the union of all attribute domains, that is, $V = \bigcup_{j=1}^{m} V_{a_j}$, where $V_{a_j} = \{a_j^{(1)}, a_j^{(2)}, \ldots, a_j^{(n_j)}\}$ is the value domain of attribute $a_j$ and is finite and unordered, and $n_j$ is the number of categories of attribute $a_j$ for $1 \le j \le m$; (4) $f$ is a mapping function, which can be formally expressed as $f: U \times A \to V$ with $f(x_i, a_j) \in V_{a_j}$.

2.2. k-Modes Dissimilarity Measure

The k-modes clustering algorithm is an improvement of the k-means algorithm [4] obtained by using a simple matching dissimilarity measure for categorical data, and it adopts a frequency-related strategy to update modes during clustering so as to minimize the clustering costs. These extensions remove the numeric-only limitation of the k-means algorithm and enable the clustering process to be applied to large categorical data sets from real-world databases [22].

Definition 2. Let $IS = (U, A, V, f)$ be a categorical data set information system as defined in Definition 1, and let $Z = \{z_1, z_2, \ldots, z_k\}$ be the set of cluster modes. For any object $x_i \in U$ and cluster mode $z_l$ ($1 \le l \le k$), $d(x_i, z_l)$ is the simple matching dissimilarity measure between object $x_i$ and the mode of the $l$th cluster, which is defined as follows:
\[ d(x_i, z_l) = \sum_{j=1}^{m} \delta(x_{ij}, z_{lj}). \tag{1} \]
In (1), $\delta(x_{ij}, z_{lj})$ can be expressed as
\[ \delta(x_{ij}, z_{lj}) = \begin{cases} 0, & x_{ij} = z_{lj}, \\ 1, & x_{ij} \ne z_{lj}. \end{cases} \]
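For illustration, (1) can be sketched in Java (the language used for the experiments in Section 5); the String-array encoding of objects and modes is our own assumption:

// Simple matching dissimilarity (Eq. (1)): counts the attributes on which
// an object and a cluster mode disagree. Objects and modes are assumed to
// be encoded as String arrays of equal length m (one entry per attribute).
public static int simpleMatching(String[] object, String[] mode) {
    int d = 0;
    for (int j = 0; j < object.length; j++) {
        if (!object[j].equals(mode[j])) {
            d++; // delta(x_ij, z_lj) = 1 when the categorical values differ
        }
    }
    return d;
}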

Table 1 shows nine objects with four attributes and three initial cluster modes. To determine the appropriate cluster for a given object, the dissimilarities between the object and the three cluster modes must be computed. According to (1), the three dissimilarities turn out to be equal. Therefore, it is impossible to determine exactly to which cluster the object should be assigned.

The dissimilarity between an object and a cluster mode should take into account the relationships between the object and all cluster modes as well as the differences among attributes. When the k-modes dissimilarity measure computes the dissimilarity on a given attribute, it simply matches the object against the mode and ignores the differences among attributes. Take attribute “A4” in Table 1: almost all objects and cluster modes take the value “E” on it, so a mismatch on “A4” is more informative and should contribute more to the dissimilarity than one on the other attributes. However, the k-modes dissimilarity treats all attributes equally.

3. Global-Relationship Dissimilarity Measure

Definition 3. Let $IS = (U, A, V, f)$ be a categorical data set information system as defined in Definition 1, and let $Z = \{z_1, z_2, \ldots, z_k\}$ be the set of cluster modes. For any object $x_i \in U$ and cluster mode $z_l$ ($1 \le l \le k$), the new dissimilarity measure between object $x_i$ and the mode of the $l$th cluster is defined by (2), in which $m$ is the dimension number of the data set and the similarity function is defined by (3), subject to the constraint in (4), where $k$ is the number of cluster modes; the remaining term in (3) is given by (5) and satisfies (6).

As shown in Table 1, the dissimilarities between the object and the three cluster modes must be computed to determine to which cluster the object should be assigned. According to (2)–(6), the three dissimilarities can be obtained, and the dissimilarity to the mode of the second cluster is the smallest. Hence, the object can be assigned to cluster “2” definitely.

4. KBGRD Algorithm

In this section, we give the concrete procedure of the k-modes based on GRD (KBGRD) algorithm. In addition, the computational complexity of KBGRD is analyzed.

4.1. KBGRD Algorithm Description

Definition 4. Let $IS = (U, A, V, f)$ be a categorical data set information system as defined in Definition 1, and let $Z = \{z_1, z_2, \ldots, z_k\}$ be the set of cluster modes. The k-modes algorithm uses the k-means paradigm to cluster categorical data. The objective function of the k-modes algorithm is defined as follows:
\[ F(W, Z) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}\, d(x_i, z_l), \quad \text{subject to } \sum_{l=1}^{k} w_{li} = 1,\ 1 \le i \le n. \tag{7} \]
In (7), $k$ is a known cluster number; $W = [w_{li}]$ is a $k$-by-$n$ matrix; $w_{li} \in \{0, 1\}$ is a binary variable that indicates whether object $x_i$ belongs to the $l$th cluster: $w_{li} = 1$ if $x_i$ belongs to the $l$th cluster and $0$ otherwise; and $z_l = (z_{l1}, z_{l2}, \ldots, z_{lm})$ is the $l$th cluster mode with categorical attributes $a_1, a_2, \ldots, a_m$.
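A minimal Java sketch of (7), reusing the simpleMatching routine above as a stand-in for whichever dissimilarity measure is plugged into the algorithm (GRD in the case of KBGRD):

// Objective function F(W, Z) (Eq. (7)): the sum of each object's
// dissimilarity to the mode of its own cluster. assignment[i] holds the
// cluster index l with w_li = 1 for object i.
public static double objective(String[][] objects, String[][] modes,
                               int[] assignment) {
    double f = 0.0;
    for (int i = 0; i < objects.length; i++) {
        f += simpleMatching(objects[i], modes[assignment[i]]);
    }
    return f;
}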

4.2. Update and Convergence Analysis

The steps of the KBGRD algorithm are presented below. Here $Z^{(t)}$ and $W^{(t)}$ denote the cluster modes and the membership matrix at the $t$th iteration, respectively. (1) Randomly select $k$ distinct objects from $U$ as the initial modes $Z^{(1)}$. Determine $W^{(1)}$ such that $F(W, Z^{(1)})$ is minimized according to (8). Set $t = 1$. (2) Determine $Z^{(t+1)}$ such that $F(W^{(t)}, Z^{(t+1)})$ is minimized according to (9). If $F(W^{(t)}, Z^{(t+1)}) = F(W^{(t)}, Z^{(t)})$, then stop; otherwise, go to step (3). (3) Determine $W^{(t+1)}$ such that $F(W^{(t+1)}, Z^{(t+1)})$ is minimized according to (8). If $F(W^{(t+1)}, Z^{(t+1)}) = F(W^{(t)}, Z^{(t+1)})$, then stop; otherwise, set $t = t + 1$ and go to step (2).

In each iteration, $W$ and $Z$ are updated by the following formulae.

When $Z$ is given, $W$ is updated by (8) for $1 \le i \le n$ and $1 \le l \le k$:
\[ w_{li} = \begin{cases} 1, & \text{if } d(x_i, z_l) \le d(x_i, z_h) \text{ for all } 1 \le h \le k, \\ 0, & \text{otherwise}. \end{cases} \tag{8} \]
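The rule in (8) assigns each object to a cluster whose mode is nearest to it; a sketch under the same assumptions as above:

// Membership update (Eq. (8)): assign each object to a nearest mode.
// Ties are broken in favor of the lowest cluster index.
public static int[] assign(String[][] objects, String[][] modes) {
    int[] assignment = new int[objects.length];
    for (int i = 0; i < objects.length; i++) {
        int best = 0;
        int bestDist = simpleMatching(objects[i], modes[0]);
        for (int l = 1; l < modes.length; l++) {
            int dist = simpleMatching(objects[i], modes[l]);
            if (dist < bestDist) {
                bestDist = dist;
                best = l;
            }
        }
        assignment[i] = best;
    }
    return assignment;
}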

And when $W$ is given, $Z$ is updated as follows:
\[ z_{lj} = a_j^{(r)}, \quad \text{where } \sum_{i:\, x_{ij} = a_j^{(r)}} w_{li} \ge \sum_{i:\, x_{ij} = a_j^{(q)}} w_{li} \text{ for } 1 \le q \le n_j, \tag{9} \]
for $1 \le l \le k$ and $1 \le j \le m$. Here, $a_j^{(r)} \in V_{a_j}$, and $n_j$ is the number of categories of attribute $a_j$ for $1 \le j \le m$.
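Equation (9) amounts to taking, attribute by attribute, the most frequent category among the objects currently assigned to a cluster; a sketch:

import java.util.HashMap;
import java.util.Map;

// Mode update (Eq. (9)): for each cluster l and attribute j, pick the
// category with the highest frequency among the objects with w_li = 1.
// A slot stays null if a cluster happens to be empty.
public static String[][] updateModes(String[][] objects, int[] assignment,
                                     int k) {
    int m = objects[0].length;
    String[][] modes = new String[k][m];
    for (int l = 0; l < k; l++) {
        for (int j = 0; j < m; j++) {
            Map<String, Integer> freq = new HashMap<>();
            for (int i = 0; i < objects.length; i++) {
                if (assignment[i] == l) {
                    freq.merge(objects[i][j], 1, Integer::sum);
                }
            }
            int bestCount = -1;
            for (Map.Entry<String, Integer> e : freq.entrySet()) {
                if (e.getValue() > bestCount) {
                    bestCount = e.getValue();
                    modes[l][j] = e.getKey();
                }
            }
        }
    }
    return modes;
}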

Now we consider the convergence of the KBGRD algorithm.

Theorem 5. Let $\hat{Z}$ be fixed. Then $F(W, \hat{Z})$ is minimized when $W$ is updated by (8).

Proof. For a given $\hat{Z}$, we have $F(W, \hat{Z}) = \sum_{l=1}^{k} \sum_{i=1}^{n} w_{li}\, d(x_i, \hat{z}_l)$. Updating $W$ by (8) assigns each object to a mode of minimum dissimilarity, and the dissimilarities of different objects to the modes are independent of one another. So $W$ updated by (8) minimizes $F(W, \hat{Z})$.

Theorem 6. Let $\hat{W}$ be fixed. Then $F(\hat{W}, Z)$ is minimized when $Z$ is updated by (9).

Proof. For a given $\hat{W}$, $F(\hat{W}, Z)$ can be rewritten so that each cluster-attribute pair contributes an inner sum that depends only on the choice of $z_{lj}$. Note that all inner sums are nonnegative and independent. Minimizing $F(\hat{W}, Z)$ is then equivalent to maximizing each inner sum. When $z_{lj}$ is chosen according to (9), each inner sum is maximized. So $Z$ updated by (9) minimizes $F(\hat{W}, Z)$.

Theorem 7. The KBGRD algorithm converges in a finite number of iterations.

Proof. Firstly, we note that there is only a finite number of potential cluster modes, since each mode is an element of $V_{a_1} \times V_{a_2} \times \cdots \times V_{a_m}$; consequently, the number of possible sets of $k$ cluster modes is finite too.
Secondly, each possible mode set appears at most once in the iteration process of the KBGRD algorithm. Suppose not; then there exist $t_1 \ne t_2$ with $t_1 < t_2$ such that $Z^{(t_1)} = Z^{(t_2)}$. Since the membership matrix that minimizes the objective function for a given mode set is determined by (8), we have $W^{(t_1)} = W^{(t_2)}$ and hence $F(W^{(t_1)}, Z^{(t_1)}) = F(W^{(t_2)}, Z^{(t_2)})$. However, the objective value strictly decreases between consecutive iterations, for otherwise the algorithm stops according to steps (2) and (3) of the KBGRD algorithm; thus $F(W^{(t_1)}, Z^{(t_1)}) > F(W^{(t_2)}, Z^{(t_2)})$, a contradiction. That is, $Z^{(t_1)} = Z^{(t_2)}$ never occurs.
So the KBGRD algorithm converges in a finite number of iterations.

4.3. Pseudocodes and Complexity Analysis

The pseudocodes of KBGRD algorithm are presented in Pseudocode 1.

Input: data set U and initial cluster number k;
Output: k clusters.
Sub function Cluster(U, modes)
Begin:
(1) for i = 1 to n do // n is the number of objects.
(2)   for l = 1 to k do // k is the number of clusters.
(3)     Calculate d(x_i, z_l) according to Eqs. (2)–(6);
(4)     if (d(x_i, z_l) == min_{1 <= h <= k} d(x_i, z_h)) then
(5)       Classify the i-th object into the l-th cluster;
(6)   end for
(7) end for
End
Sub function Fun()
Begin:
(1) SumDissimilarity = 0;
(2) for l = 1 to k do // k is the number of clusters.
(3)   for i = 1 to n do // n is the number of objects.
(4)     Accumulate SumDissimilarity according to Eq. (7);
(5)   end for
(6) end for
(7) return SumDissimilarity;
End
Main function
Begin:
(1) Randomly choose k distinct objects from U as initial modes;
(2) Cluster(U, modes);
(3) newDissimilarity = Fun(); // calculating the value of F(W, Z).
(4) do
(5)   oldDissimilarity = newDissimilarity;
(6)   Update modes according to Eq. (9);
(7)   Cluster(U, modes);
(8)   newDissimilarity = Fun();
(9) while (newDissimilarity != oldDissimilarity);
End

The subfunction Cluster() computes the dissimilarity between each object and each cluster mode and classifies every object into the cluster with the minimum dissimilarity. The subfunction Fun() computes the value of the objective function.

In fact, the main function is a controller that drives the iterations of the algorithm. We first choose $k$ distinct objects as initial modes. Line 2 performs the initial clustering; Line 3 computes the original clustering result and “newDissimilarity.” Lines 4–9 iteratively update the modes and clusters, and the iteration stops when “newDissimilarity” no longer changes.
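Putting the earlier sketches together, the control flow of the main function might look as follows in Java; taking the first k objects as initial modes is a simplification of the random selection in Line 1:

import java.util.Arrays;

// Driver loop of Pseudocode 1: alternate the membership update (Eq. (8))
// and the mode update (Eq. (9)) until the objective value (Eq. (7))
// stops changing.
public static int[] kbgrdLoop(String[][] objects, int k) {
    String[][] modes = Arrays.copyOfRange(objects, 0, k); // crude init
    int[] assignment = assign(objects, modes);
    double newDissimilarity = objective(objects, modes, assignment);
    double oldDissimilarity;
    do {
        oldDissimilarity = newDissimilarity;
        modes = updateModes(objects, assignment, k);
        assignment = assign(objects, modes);
        newDissimilarity = objective(objects, modes, assignment);
    } while (newDissimilarity != oldDissimilarity);
    return assignment;
}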

Referring to the pseudocodes as shown in Pseudocode 1, the computational complexity of KBGRD algorithm is analyzed as follows. We only consider the major computational steps.

We first consider the computational complexity of the two subfunctions. The computational complexity of computing the dissimilarities between all objects and all cluster modes is $O(kmn)$, where $k$ is the number of modes, $n$ is the number of objects in the data set, and $m$ is the dimension of the data set. The computational complexity of assigning the $i$th object to the $l$th cluster is $O(k)$. So the computational complexity of updating all clusters is $O(kmn) + O(kn)$, that is, $O(kmn)$. The computational complexity of computing the objective function is $O(kmn)$.

Suppose that the number of iterations is $t$; then the whole computational cost of the KBGRD algorithm is $O(tkmn) + O(tkmn)$, that is, $O(tkmn)$. This shows that the computational cost is linearly scalable with the number of objects, the number of attributes, and the number of clusters.

5. Experimental Analysis

5.1. Experimental Environment and Evaluation Indexes

The experiments are conducted on a PC with an Intel i3 processor and 4 GB of memory running the Windows 7 operating system. All algorithms are coded in Java on Eclipse.

To evaluate the efficiency of the clustering algorithms, two evaluation indexes, Accuracy (AC) and RandIndex, are employed in the experiments.

Let $C$ be the set of classes in the data set and $P$ be the set of clusters generated by the clustering algorithm. Given a pair of objects in the data set, we refer to it as (1) a, if both objects belong to the same class in $C$ and to the same cluster in $P$; (2) b, if the two objects belong to the same class in $C$ but to two different clusters in $P$; (3) c, if the two objects belong to two different classes in $C$ but to the same cluster in $P$; (4) d, if both objects belong to two different classes in $C$ and to two different clusters in $P$. Let $a$, $b$, $c$, and $d$ be the numbers of pairs of types a, b, c, and d, respectively. RandIndex [23] is defined as follows:
\[ \mathrm{RandIndex} = \frac{a + d}{a + b + c + d}. \]
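RandIndex can be computed directly from two label arrays (true classes and generated clusters) by enumerating all object pairs:

// RandIndex: the fraction of object pairs on which the class partition
// and the cluster partition agree, i.e., type-(a) and type-(d) pairs.
public static double randIndex(int[] classes, int[] clusters) {
    int n = classes.length;
    long agree = 0, total = 0;
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            boolean sameClass = classes[i] == classes[j];
            boolean sameCluster = clusters[i] == clusters[j];
            if (sameClass == sameCluster) {
                agree++; // counts a-pairs and d-pairs
            }
            total++;
        }
    }
    return (double) agree / total;
}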

Accuracy (AC) is defined as follows:
\[ \mathrm{AC} = \frac{\sum_{l=1}^{k} a_l}{n}, \]
where $k$ is the number of clusters, $n$ is the number of objects, and $a_l$ is the number of objects that are correctly assigned to the $l$th cluster ($1 \le l \le k$).
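The paper does not spell out how clusters are matched to classes when counting $a_l$; one common choice, assumed here, is to credit each cluster with its majority class:

import java.util.HashMap;
import java.util.Map;

// Clustering accuracy (AC): map each cluster to its most frequent true
// class and count the objects that carry that class, i.e., a_l.
public static double accuracy(int[] classes, int[] clusters, int k) {
    int correct = 0;
    for (int l = 0; l < k; l++) {
        Map<Integer, Integer> freq = new HashMap<>();
        for (int i = 0; i < classes.length; i++) {
            if (clusters[i] == l) {
                freq.merge(classes[i], 1, Integer::sum);
            }
        }
        int majority = 0;
        for (int count : freq.values()) {
            majority = Math.max(majority, count);
        }
        correct += majority; // a_l for cluster l
    }
    return (double) correct / classes.length;
}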

Four categorical data sets from the UCI Machine Learning Repository are used to evaluate the clustering performance: QSAR Biodegradation (QSAR), Chess, Mushroom, and Nursery. The relevant information about these data sets is tabulated in Table 2.

5.2. Experimental Results and Analysis

In the experiments, we compare the KBGRD algorithm with the original k-modes algorithm and Cao’s algorithm [24]. The three algorithms are run in turn on all data sets. Each algorithm requires the number of modes (ClusterNum) as an input parameter, and we randomly select ClusterNum distinct objects as the initial cluster modes. The number of iterations of each algorithm is limited to 500.

Note that there are very few missing values in the Mushroom data set; we use the optimal completion strategy to deal with them. In the optimal completion strategy, the missing values in the data set are viewed as additional variables [25, 26].

Firstly, we set ClusterNum to the number of classes of each data set. The average RandIndex over ten runs on the four data sets for the three algorithms is summarized in Table 3, and the corresponding average AC is summarized in Table 4. As shown in Tables 3 and 4, KBGRD achieves the highest RandIndex and AC; that is, it performs better than the other algorithms under the same conditions.

In real-world applications, the number of initial cluster modes is unknown. We therefore evaluated clustering stability by setting different values of ClusterNum (10, 15, 20, 25, 30, and 35) for each data set and used RandIndex to evaluate the clustering results. The average RandIndex over ten runs on the four data sets for the three algorithms is summarized in Tables 5–8, whose last columns show the average RandIndex of each algorithm over the six ClusterNum settings. As shown in Tables 5–8, KBGRD achieves the highest RandIndex; that is to say, it performs better than the other algorithms on all four data sets. Additionally, KBGRD has the highest stability compared with the other algorithms.

6. Conclusion

This paper analyzes the advantages and disadvantages of the k-modes algorithm for categorical data. On this basis, we propose a novel dissimilarity measure (GRD) for clustering categorical data, which is used to improve the performance of the existing k-modes algorithm. The computational complexity of the KBGRD algorithm has been analyzed and is linear in the number of data objects, attributes, and clusters. We have tested the KBGRD algorithm on four real data sets from UCI. The experimental results show that the KBGRD algorithm is effective and stable in clustering categorical data sets.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was supported by the National Science Foundation of China under the Grants of 61402363 and 61472319, Education Department of Shaanxi Province Key Laboratory Project under the Grant of 15JS079, Xi’an Science Program Project under the Grant of CXY1509(7), Beilin district of Xi’an Science and Technology Project under the Grant of GX1625, and CERNET Innovation Project under the Grant of NGLL20150707.