Abstract

Because traditional fuzzy clustering validity indices need to specify the number of clusters and are sensitive to noise data, we propose a validity index for fuzzy clustering, named CSBM (compactness separateness bipartite modularity), based on bipartite modularity. CSBM enhances robustness by combining intraclass compactness and interclass separateness and can automatically determine the optimal number of clusters. To estimate the performance of CSBM, we carried out experiments on six real datasets and compared CSBM with six other prominent indices. Experimental results show that the CSBM index performs best in terms of robustness while accurately detecting the number of clusters.

1. Introduction

Recently, in order to reveal valuable knowledge and patterns behind data, data mining has become increasingly crucial in many fields such as automatic categorization of text documents [1, 2], grouping search engine results [3, 4], analyzing time series data [5, 6], and others [7–10]. As one of the vital techniques of data mining, clustering divides a group of samples into multiple clusters, so that elements in the same cluster are as similar as possible and elements in different clusters are as dissimilar as possible.

In fuzzy clustering, represented by the FCM (fuzzy C means) [11] algorithm, the value of membership degree is fuzzy, which means that a sample is allowed to belong to multiple clusters with different probabilities. It is more consistent with the rule of sample distribution than the hard clustering logic; therefore, the fuzzy clustering research has constantly been ongoing and innovative. As yet, a large number of fuzzy clustering algorithms have been increasingly improved in accuracy, efficiency, robustness, and other aspects, which significantly boosts the development of data mining. At the same time, the validity index used to measure the clustering quality of fuzzy clustering, as an indispensable part of algorithm research, plays a growing important role in fuzzy clustering.

Recently, the achievement on clustering validity index is pretty fruitful. Hu et al. [12] proposed a clustering validity index by combining intraclass compactness with intercluster separateness, which reduces the impact of noise data well. Chen and Pi [13] proposed a nondistance validity index based on fuzzy membership degree by mixing with compactness and separateness, which improves the identification of overlapping clusters while weakening the sensitivity to noise data. Although these indices enhance the robustness by combining intraclass compactness with interclass separateness, it is still necessary to manually specify the number of clusters owing to the restrictions of the FCM algorithm. Zhang et al. [14], regarding the fuzzy membership degree and the bipartite modularity as the global and local attributes, respectively, proposed a weighted global-local validity based index (WGLI). The WGLI can automatically determine the number of clusters but is vulnerable to noise data because of its higher dependency on the fuzzy membership degree of the FCM algorithm.

Motivated by the above analysis, in this paper, we apply the bipartite modularity to the constructed bipartite network based on the clustering result of the FCM algorithm and evaluate the final partition result combining intraclass compactness with interclass separateness at the same time. The proposed validity index fuses the bipartite modularity with intraclass compactness and interclass separateness, which is not only able to enhance the robustness of clustering results but also able to determine the optimal number of clusters automatically.

2.1. FCM Algorithm

The FCM algorithm divides N data samples $x_i$ into C fuzzy clusters by means of the fuzzy partition method and then calculates the centre of each cluster. Its objective function, shown in formula (1), is minimized through an iterative process:

$$J_m(U, V) = \sum_{c=1}^{C} \sum_{i=1}^{N} u_{ci}^{m}\, d_{ci}^{2} \tag{1}$$

where $v_c$ represents the center of the c-th cluster, $d_{ci} = \lVert x_i - v_c \rVert$ is the Euclidean distance from $v_c$ to sample i, m is the fuzzy parameter with $m > 1$, and $u_{ci}$ indicates the membership degree of sample i belonging to cluster c, subject to

$$\sum_{c=1}^{C} u_{ci} = 1, \quad u_{ci} \in [0, 1], \quad i = 1, \dots, N. \tag{2}$$

Then, the expressions for $u_{ci}$ and $v_c$ can be obtained using the Lagrange multiplier method as follows:

$$u_{ci} = \frac{1}{\sum_{k=1}^{C} \left( d_{ci} / d_{ki} \right)^{2/(m-1)}} \tag{3}$$

$$v_c = \frac{\sum_{i=1}^{N} u_{ci}^{m}\, x_i}{\sum_{i=1}^{N} u_{ci}^{m}} \tag{4}$$

where $d_{ci}$ and $d_{ki}$, respectively, represent the Euclidean distances from the cluster centers c and k to the sample i.
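As an illustration, the alternating updates in formulas (3) and (4) can be sketched in a few lines of NumPy. This is a minimal sketch of standard FCM, not the authors' implementation:

```python
import numpy as np

def fcm(X, C, m=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal FCM sketch: alternate the centre update (formula (4)) and
    the membership update (formula (3)) until the memberships stabilise."""
    rng = np.random.default_rng(seed)
    U = rng.random((C, X.shape[0]))
    U /= U.sum(axis=0)                       # columns of U sum to 1
    for _ in range(max_iter):
        Um = U ** m
        V = Um @ X / Um.sum(axis=1, keepdims=True)   # centres v_c
        d = np.linalg.norm(X[None, :, :] - V[:, None, :], axis=2)
        d = np.fmax(d, 1e-12)                # guard against zero distances
        # u_ci = 1 / sum_k (d_ci / d_ki)^(2/(m-1))
        U_new = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2 / (m - 1))).sum(axis=1)
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return U, V
```

On well-separated data the returned membership matrix U assigns each sample a dominant cluster, and V holds the fuzzy centres.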

2.2. Fuzzy Clustering Validity Index

The fuzzy clustering validity index can measure the clustering performance and is fairly significant in the fuzzy clustering research. The prevalent fuzzy clustering validity indices are shown in Table 1.

2.3. Bipartite Modularity

Newman and Girvan [23] introduced modularity to measure the strength of community structure in a single network. However, real-world networks do not exist only in this simple form: in metabolic networks [24], pathological networks [25], and the World Wide Web, for example, there may be one-to-many and even many-to-many relationships between the vertices and the divided communities instead of a simple one-to-one relationship. For such networks, by extending and concretizing the matrix-based method that Newman used, Barber [26] defined a null model and proposed a bipartite modularity for bipartite networks under special constraints: the vertices are divided into two disjoint sets, and each edge connects two vertices from these two sets, respectively. The disadvantages of this bipartite modularity are that the number of communities must be determined in advance and that the numbers of communities of the two vertex types must be equal. Guimerà et al. [27] proposed a bipartite modularity by adopting the idea of modularity maximization, transforming the module identification problem into a combinatorial optimization problem. Unfortunately, the module structure of one vertex type must be manually specified, which affects the accuracy of the partition.

Then, Murata [22] defined a modified bipartite modularity, which allows random connections between two types of vertices, and a community containing a type of vertex can correspond to one or more communities of the other type of vertex. When there is a complete one-to-one relationship between communities of different types of vertices, the bipartite modularity reaches its maximum.

Let G be a bipartite network, in which M denotes the total number of edges and V is the set of all vertices. The bipartite network is divided into communities of X-vertices and Y-vertices, whose numbers are $L_X$ and $L_Y$, respectively. $V^X = \{V_1^X, \dots, V_{L_X}^X\}$ and $V^Y = \{V_1^Y, \dots, V_{L_Y}^Y\}$ are defined as the sets of communities of X-vertices and Y-vertices, in which the elements $V_l^X$ and $V_p^Y$ represent a community of X-vertices and Y-vertices, respectively. Let A be the adjacency matrix with elements A(i, j): if the vertices i and j are connected, A(i, j) = 1; otherwise, A(i, j) = 0.

Assuming that the two communities $V_l$ and $V_p$ are different from each other, that is, $V_l \cap V_p = \varnothing$, the (normalized) number of edges connecting vertices from $V_l$ and $V_p$, denoted by $e_{lp}$, and its row sum, denoted by $a_l$, can be respectively expressed as

$$e_{lp} = \frac{1}{2M} \sum_{i \in V_l} \sum_{j \in V_p} A(i, j) \tag{5}$$

$$a_l = \sum_{p} e_{lp}. \tag{6}$$

The bipartite modularity $Q_B$ is defined as

$$Q_B = Q_X + Q_Y \tag{7}$$

where

$$Q_X = \sum_{V_l \in V^X} \left( e_{l, f(l)} - a_l\, a_{f(l)} \right), \qquad f(l) = \arg\max_{p} e_{lp} \tag{8}$$

$$Q_Y = \sum_{V_p \in V^Y} \left( e_{p, g(p)} - a_p\, a_{g(p)} \right), \qquad g(p) = \arg\max_{l} e_{pl}. \tag{9}$$

$Q_B$ is the sum of the bipartite modularity in two different directions (from X-vertices to Y-vertices and vice versa); in each direction, the expected-value term subtracts the number of edges that would connect the corresponding communities of X-vertices and Y-vertices at random from the number of edges actually observed between them. The larger the value of $Q_B$, the stronger the community structure of the bipartite network and the better the result of community detection.
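To make the definition concrete, the following sketch computes a Murata-style $Q_B$ from a biadjacency matrix and two community assignments, following the quantities $e_{lp}$, $a_l$, and $Q_B$ described in this section. It is an illustrative reconstruction, not the authors' implementation:

```python
import numpy as np

def bipartite_modularity(A, x_comm, y_comm):
    """Murata-style bipartite modularity Q_B (a sketch).
    A: |X| x |Y| (possibly weighted) biadjacency matrix.
    x_comm[i] / y_comm[j]: community labels of X-vertex i / Y-vertex j."""
    M = A.sum()                                  # total edge weight
    Lx, Ly = max(x_comm) + 1, max(y_comm) + 1
    # e[l, p]: fraction of edge weight between X-community l and Y-community p
    e = np.zeros((Lx, Ly))
    for i, l in enumerate(x_comm):
        for j, p in enumerate(y_comm):
            e[l, p] += A[i, j]
    e /= 2 * M
    ax, ay = e.sum(axis=1), e.sum(axis=0)        # row/column sums a_l, a_p
    # each community pairs with its strongest counterpart (argmax of e)
    qx = sum(e[l].max() - ax[l] * ay[e[l].argmax()] for l in range(Lx))
    qy = sum(e[:, p].max() - ay[p] * ax[e[:, p].argmax()] for p in range(Ly))
    return qx + qy
```

For a perfect one-to-one correspondence between communities, $Q_B$ reaches its maximum, consistent with the property stated above.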

2.4. Constructing Bipartite Network

According to the membership degree matrix and the C clusters obtained from the FCM algorithm, a weighted bipartite network can be constructed: the X-vertices are the cluster centers, the Y-vertices are the sample points, and the weighted edges are the membership degrees. Applying the bipartite modularity to this network, we have $L_X = L_Y = C$, and the adjacency matrix A(i, j) can be defined as

$$A(i, j) = \begin{cases} u_{ci}, & u_{ci} \ge \alpha \\ 0, & \text{otherwise} \end{cases} \tag{10}$$

where α = 0.7 and $u_{ci}$ denotes the fuzzy membership degree obtained from the FCM algorithm.

Suppose that the FCM algorithm runs on a dataset including 10 sample points and 4 clusters. According to formula (10), a bipartite network can be constructed from the cluster centers and sample points, as shown in Figure 1.

In Figure 1, the top nodes (the X-vertices) represent the communities composed of cluster centers, and the bottom nodes (the Y-vertices) represent the communities composed of the divided sample points. In addition, the weights of the edges are the corresponding membership degrees, which give the values of A(i, j). According to formulas (5)–(10), the value of $Q_B$ can then be calculated.
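For illustration, the construction of the weighted biadjacency matrix from the FCM membership matrix can be sketched as follows. The thresholded form with α = 0.7 is an assumed reading of formula (10), reconstructed from the description in the text:

```python
import numpy as np

def build_biadjacency(U, alpha=0.7):
    """Weighted biadjacency between cluster centres (X-vertices) and samples
    (Y-vertices). Assumed reading of formula (10): keep the edge weight u_ci
    when the membership reaches the threshold alpha; drop the edge otherwise."""
    return np.where(U >= alpha, U, 0.0)
```

Feeding this matrix, together with the community assignments, to a bipartite-modularity routine then yields $Q_B$ for the clustering.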

3. Fuzzy Clustering Validity Index CSBM

In this paper, we propose a fuzzy clustering validity index named CSBM. This index combines three components: (1) bipartite modularity, (2) intraclass compactness, and (3) interclass separateness. First, CSBM builds a bipartite network based on the clustering results of FCM and applies the bipartite modularity to this network. Second, CSBM evaluates the clustering results by combining intraclass compactness and interclass separateness.

Compared with the conventional validity indices, the index CSBM can enhance the robustness on the one hand and automatically determine the optimal number of clusters on the other hand.

3.1. Intraclass Compactness

The intraclass compactness of the index CSBM, denoted NC, is defined as follows:

NC (novel compactness) improves on the performance of the partition coefficient PC by measuring the compactness of each cluster c separately. The larger the value of NC, the higher the intraclass compactness and the better the result of the fuzzy partition.

3.2. Interclass Separateness

In order to reduce the impact of noise data on the clustering result, the interclass separateness is measured by the distance between different fuzzy clusters and is defined as follows:

The threshold $T_o$ is used to eliminate the noise points on the cluster boundary, and $O_{abi}(C, U)$ represents the separateness between samples of two clusters a and b. $u_{ai}$ and $u_{bi}$ indicate the membership degrees of sample point i belonging to clusters a and b, respectively. The smaller the value of $O_{abi}(C, U)$, the lower the coverage degree between clusters a and b and the higher the separation degree between the two clusters. SEP (separateness) is the sum of the cluster separation degrees over all sample points in the fuzzy membership matrix; the smaller the value of SEP, the better the result of the fuzzy partition.

3.3. Index CSBM

The objective function used to calculate the index CSBM is defined as follows:

An adjustment factor is introduced to scale the value of CSBM. The larger the value of NC and the smaller the value of SEP, the better the clustering result; accordingly, the larger the values of NC − SEP and $Q_B$, the larger the value of CSBM. The better the clustering quality of the FCM algorithm, the more accurately the optimal number of clusters is identified.

The calculation process of the fuzzy clustering validity index CSBM is as follows:

Input: the dataset S and the threshold ε used in the FCM algorithm
Output: the clustering result
(1) Run the FCM algorithm on the given dataset
(2) Construct the weighted bipartite network according to the membership degrees $u_{ci}$ and the cluster centres $v_c$ obtained from formulas (3) and (4) and the adjacency matrix A(i, j) from formula (10)
(3) Calculate the bipartite modularity $Q_B$ according to formulas (5)–(10)
(4) Calculate CSBM according to formulas (11)–(15)
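Since the index is used to select the number of clusters, the overall procedure amounts to scanning candidate values of C and keeping the one that maximises CSBM. The sketch below shows only this orchestration; `run_fcm` and `csbm_index` are hypothetical stand-ins for the FCM run and the index computation described in the text, since the paper's formulas (11)–(15) are not reproduced here:

```python
def select_cluster_number(X, c_max, run_fcm, csbm_index):
    """Pick the number of clusters that maximises the CSBM index.
    run_fcm(X, c) -> (U, V) and csbm_index(X, U, V) -> float are
    hypothetical stand-ins for the steps described in the text."""
    best_c, best_val = None, float("-inf")
    for c in range(2, c_max + 1):      # C = 1 is not a meaningful partition
        U, V = run_fcm(X, c)
        val = csbm_index(X, U, V)
        if val > best_val:
            best_c, best_val = c, val
    return best_c
```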

4. Experiments

4.1. Datasets

In order to verify the effectiveness and validity of the index CSBM, we select six datasets from the UCI database. The detailed information on these six datasets is shown in Table 2.

4.2. Evaluation Criteria for Clustering Result

F-measure (FM) and entropy (EN) are used to evaluate the clustering result of the FCM algorithm. F-measure is often used to evaluate the partition result of clustering algorithms and can be defined as

$$F = \frac{2PR}{P + R}$$

where P represents the precision, i.e., the proportion of the relevant files retrieved by the system to the total number of files retrieved, and R represents the recall, i.e., the proportion of the relevant files retrieved by the system to the total number of relevant files in the system. In general, precision and recall trade off against each other; by considering both factors, the F-measure reveals the overall performance. The value range of the F-measure is [0, 1], and the larger its value, the better the clustering result.
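A minimal computation of precision, recall, and the F-measure, using the standard definitions (not code from the paper):

```python
def precision_recall(retrieved, relevant):
    """Precision: fraction of retrieved items that are relevant.
    Recall: fraction of relevant items that were retrieved."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    return hits / len(retrieved), hits / len(relevant)

def f_measure(p, r):
    """Harmonic mean 2PR / (P + R); 0 when both are 0."""
    return 0.0 if p + r == 0 else 2 * p * r / (p + r)
```

For example, retrieving 4 items of which 2 are among 3 relevant ones gives P = 0.5, R = 2/3, and F = 4/7.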

Entropy, also known as Shannon entropy [28], is defined over a set of discrete probabilities $p_i$ which, in the case of sending messages, are the probabilities that a particular message is actually sent. The entropy of a message system measures the average amount of information per message and can be defined as follows:

$$EN = -\sum_{i} p_i \log p_i.$$

The value range of entropy is [0, 1], and the smaller its value, the better the clustering result.
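The entropy computation can be sketched as follows; normalising by the logarithm base (so the value lands in [0, 1]) is an assumption matching the stated range:

```python
import math

def entropy(probs, base=None):
    """Shannon entropy EN = -sum_i p_i log p_i.  With base = len(probs)
    (the default here) the value is normalised into [0, 1]."""
    base = base or len(probs)
    return -sum(p * math.log(p, base) for p in probs if p > 0)
```

A uniform distribution gives the maximum value 1, and a degenerate distribution (one certain outcome) gives 0.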

4.3. Experimental Results

In order to evaluate the accuracy of CSBM in predicting the optimal number of clusters, we compare it with six prevalent fuzzy clustering validity indices: PC, PE, MPC, MPE, CO, and WGLI. Tables 3–8 show the values of the index CSBM and the comparative indices obtained from running the FCM algorithm on the six datasets; the values of F-measure and entropy are also included, confirming the practicability and effectiveness of the FCM algorithm. The bold values in the tables denote the index values corresponding to the optimal number of clusters.

The experimental results can be interpreted from the following three aspects:

(1) On the iris, wine, and zoo datasets (Tables 3, 4, and 7, respectively), the index CSBM outperforms the other indices; that is, its matching classification number indicates the best partition result. On the iris and wine datasets (Tables 3 and 4), the other indices produce their best index values when the cluster number equals 2, although both datasets actually have 3 classes. Table 7 lists the results on the zoo dataset, which has 7 classes; however, WGLI, PC, MPC, and CO all reach their best values at 4 clusters, and PE and MPE suggest that the best number of clusters is 3.

(2) On the wpbc and Hayesroth datasets (Tables 5 and 6, respectively), some indices, including CSBM, attain their optimal values, while others are less satisfactory. On the wpbc dataset, 5 indices find the true number of classes accurately: CSBM, WGLI, PC, PE, and MPE; the other two, MPC and CO, find more clusters. On the Hayesroth dataset, only CSBM and CO produce the accurate number of clusters, and the other 5 indices produce fewer.

(3) On the glass dataset (Table 8), none of the indices achieves the desired result, but compared with the others, the value predicted by CSBM, 5, is closest to the true value, 6. The second best is MPC, whose predicted number of clusters is 3. The remaining indices, WGLI, PC, PE, MPE, and CO, all reach their best values at 2 clusters.

According to the above experimental results, the index CSBM performs better than these comparative indices in terms of predicting the optimal number of clusters.

In order to verify the robustness of the index CSBM, we add noisy data to each dataset at a rate of 10% and then run the FCM algorithm again. The trend of each index's value with the number of clusters is shown in Figure 2. Meanwhile, OC (original clusters) denotes the original classification number of each dataset, and the optimal numbers of clusters determined by all the indices before and after adding noise data are shown in Table 9.

The numbers outside and within parentheses, respectively, represent the optimal number of clusters determined by the corresponding indices before and after the addition of noise data, and the bold values indicate that the number of clusters matches the standard value.

It can be seen from Figure 2 and Table 9 that the index CSBM is less vulnerable to the added noise data than the other comparative indices and can still determine the optimal number of clusters accurately. Meanwhile, owing to the noise data, the results of the indices WGLI, PC, PE, and MPE on the zoo and glass datasets, of MPC on the wine, wpbc, and Hayesroth datasets, and of CO on the wine, Hayesroth, and glass datasets all change to different degrees. According to Figure 2, after adding noise data the values of some indices remain unchanged, such as WGLI on the iris dataset, and the values of some indices move slightly closer to the standard values, such as CO on the wine dataset. However, some indices deviate further from the standard values, such as MPC on the wine dataset. Overall, although in theory the addition of noise data should shift each index substantially away from the standard values, the actual results are relatively random and follow no consistent pattern, which further verifies the excellent robustness of the index CSBM.

5. Conclusions

In this paper, a new fuzzy clustering validity index named CSBM is proposed. The index modifies the intraclass compactness and interclass separateness of conventional indices, which enhances robustness and weakens the impact of noise data. At the same time, by integrating bipartite modularity, the optimal number of clusters for fuzzy clustering can be predicted more accurately. Six datasets from the UCI database are selected to validate the feasibility and validity of the index CSBM. The results show that the index CSBM performs better than the comparative indices in terms of clustering accuracy and robustness and predicts the optimal number of clusters more precisely.

Data Availability

All the datasets used in this paper are derived from the UCI (University of California Irvine) Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets.html).

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to thank the members of the IR&DM Research Group at Henan Polytechnic University for their invaluable advice, which helped complete this paper. The authors would also like to acknowledge the support of the Foundation for Scientific and Technological Project of Henan Province under Grant 172102210279.