Modified Fuzzy Gap statistic for estimating preferable number of clusters in Fuzzy k-means clustering

doi:10.1263/jbb.105.273

Journal of Bioscience and Bioengineering

Volume 105, Issue 3, March 2008, Pages 273-281

https://doi.org/10.1263/jbb.105.273

In clustering methods, estimating the optimal number of clusters is important for subsequent analysis. Without detailed biological information on the genes involved, evaluating the number of clusters becomes difficult, and we have to rely on an internal measure based on the distribution of the data in the clustering result. The Gap statistic has been proposed as a superior method for estimating the number of clusters in crisp clustering. In this study, we proposed a modified Fuzzy Gap statistic (MFGS) and applied it to fuzzy k-means clustering. To estimate the number of clusters, fuzzy k-means clustering with the MFGS was applied to two artificial data sets with noise and to two experimentally observed gene expression data sets. For the artificial data sets, the MFGS was more robust against noise than other internal measures when estimating the optimal number of clusters. Moreover, it could be used to estimate the optimal number of clusters in the experimental data sets. These results confirm that the proposed MFGS is a useful method for estimating the number of clusters for microarray data sets.
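As background for the crisp Gap statistic mentioned above, the following is a minimal sketch of how Gap(k) and its error term s_k are usually computed in the original formulation by Tibshirani et al. It is not the MFGS proposed in this study, whose exact definition is not reproduced here. The function names within_dispersion and gap_and_error are hypothetical, every cluster is assumed to be non-empty, and the reference values log W*_k are assumed to come from clustering B data sets drawn from a null (e.g. uniform) distribution.

```python
import math
import statistics


def within_dispersion(clusters):
    """Pooled within-cluster sum of squared distances to each cluster mean (W_k).

    `clusters` is a list of non-empty clusters, each a list of equal-length points.
    """
    total = 0.0
    for points in clusters:
        dim = len(points[0])
        centroid = [sum(p[d] for p in points) / len(points) for d in range(dim)]
        total += sum((p[d] - centroid[d]) ** 2 for p in points for d in range(dim))
    return total


def gap_and_error(log_w_observed, log_w_reference):
    """Return (Gap(k), s_k) for one value of k.

    log_w_observed  : log W_k of the real data
    log_w_reference : list of log W*_k values, one per reference data set (B in total)
    """
    b = len(log_w_reference)
    gap = statistics.fmean(log_w_reference) - log_w_observed
    sd = statistics.pstdev(log_w_reference)  # population SD over the B references
    return gap, sd * math.sqrt(1.0 + 1.0 / b)
```

A larger Gap(k) means that the observed clustering is tighter than would be expected for structureless data with the same k. The error term s_k reflects the variability across the reference data sets.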

Section snippets

Data sets

To evaluate the various internal measures, we analyzed two artificial data sets, the same data sets with added noise, and two experimentally obtained data sets: a leukemia data set and a yeast data set.

Artificial data sets

We prepared two artificial data sets: C4D3 (4 clusters, 3-dimensional, as shown in Fig. 1) and C4D5 (4 clusters, 5-dimensional; data not shown).

Leukemia data set

We used the same data set for gene expression in acute leukemia as that analyzed by Golub et al. (28) by hierarchical clustering. These data, obtained from 38 patients,

Estimation results for number of clusters in C4D3 and C4D5 data sets

As shown in Fig. 1, we prepared two artificial data sets (C4D3 and C4D5), each having 4 clusters. When the MFGS and the Gap statistic were used for the C4D3 data set, the values of the modified Gap(k) (MFGap(k)) and Gap(k), calculated using Eq. 6, increased rapidly until k = 4 (see Fig. 2A, B). By the definition of the MFGS and the Gap statistic (Eq. 7), each point in these plots has an error bar (s_k), whereas the plots for PC, FHV, and XB have none (Fig. 2C–E). Since the minimum k
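To illustrate how an error bar s_k can enter the choice of k, the sketch below implements the selection rule of the original Gap statistic: take the smallest k for which Gap(k) >= Gap(k+1) - s_{k+1}. Whether Eq. 7 of this paper matches this rule exactly is an assumption, since the text above is truncated. The function name choose_k is hypothetical, and the candidate values of k are assumed to be consecutive integers in ascending order.

```python
def choose_k(ks, gaps, errors):
    """Return the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}, or None if no k qualifies.

    ks     : consecutive candidate numbers of clusters, ascending
    gaps   : Gap(k) for each entry of ks
    errors : s_k for each entry of ks
    """
    for i in range(len(ks) - 1):
        if gaps[i] >= gaps[i + 1] - errors[i + 1]:
            return ks[i]
    return None


# Illustrative values only (not taken from the paper): the gap rises until k = 4
# and then levels off, so the rule stops at k = 4.
example_ks = [1, 2, 3, 4, 5]
example_gaps = [0.1, 0.5, 0.9, 1.5, 1.45]
example_errors = [0.05, 0.05, 0.05, 0.05, 0.05]
print(choose_k(example_ks, example_gaps, example_errors))  # prints 4
```

Indices without an error term, such as PC, FHV, and XB, are instead read directly at the extremum of the index, so no such tolerance applies to them.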

References (36)

R.J. Cho et al.
A genome-wide transcriptional analysis of the mitotic cell cycle
Mol. Cell
(1998)
K. Hakamada et al.
A preprocessing method for inferring genetic interaction from gene expression data using Boolean algorithm
J. Biosci. Bioeng.
(2004)
T. Hanai et al.
Application of bioinformatics for DNA microarray data to bioscience, bioengineering and medical field
J. Biosci. Bioeng.
(2006)
P.J. Rousseeuw
Silhouettes: a graphical aid to the interpretation and validation of cluster analysis
J. Comput. Appl. Math.
(1987)
Y.-I. Kim et al.
A cluster validation index for GK cluster analysis based on relative degree of sharing
Inf. Sci.
(2004)
Y. Okada et al.
Knowledge-assisted recognition of cluster boundaries in gene expression data
Artif. Intell. Med.
(2005)
M.K. Pakhira et al.
Validity index for crisp and fuzzy clusters
Pattern Recognit.
(2004)
R.J.G.B. Campello
A fuzzy extension of the Rand index and other related indexes for clustering and classification assessment
Pattern Recogn. Lett.
(2007)
S. Chu et al.
The transcriptional program of sporulation in budding yeast
Science
(1998)
P.T. Spellman et al.
Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization
Mol. Biol. Cell
(1998)
Y. Maki et al.
An integrated comprehensive workbench for inferring genetic networks: voyagene
J. Bioinform. Comput. Biol.
(2004)
M.B. Eisen et al.
Cluster analysis and display of genome-wide expression patterns
Proc. Natl. Acad. Sci. USA
(1998)
S. Tavazoie et al.
Systematic determination of genetic network architecture
Nat. Genet.
(1999)
P. Tamayo et al.
Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation
Proc. Natl. Acad. Sci. USA
(1999)
S. Tomida et al.
Analysis of expression profile using fuzzy adaptive resonance theory
Bioinformatics
(2002)
D. Dembele et al.
Fuzzy C-means method for clustering microarray data
Bioinformatics
(2003)
A.P. Gasch et al.
Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering
Genome Biol.
(2002)
R.J. Cho et al.
Transcriptional regulation and function during the human cell cycle
Nat. Genet.
(2001)

