Modified Fuzzy Gap statistic for estimating preferable number of clusters in Fuzzy k-means clustering

https://doi.org/10.1263/jbb.105.273

In clustering methods, estimating the optimal number of clusters is important for subsequent analysis. Without detailed biological information on the genes involved, evaluating the number of clusters becomes difficult, and we have to rely on an internal measure based on the distribution of the data in the clustering result. The Gap statistic has been proposed as a superior method for estimating the number of clusters in crisp clustering. In this study, we propose a modified Fuzzy Gap statistic (MFGS) and apply it to fuzzy k-means clustering. To estimate the number of clusters, fuzzy k-means clustering with the MFGS was applied to two artificial data sets with noise and to two experimentally observed gene expression data sets. For the artificial data sets, the MFGS was more robust against noise than other internal measures when estimating the optimal number of clusters. Moreover, it could be used to estimate the optimal number of clusters in the experimental data sets. These results confirm that the proposed MFGS is a useful method for estimating the number of clusters in microarray data sets.
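For orientation, the following is a minimal sketch of the standard Gap statistic for crisp k-means, not the MFGS itself, whose definition is not given in this excerpt. It assumes uniform reference data drawn over the bounding box of the observed data, a plain Lloyd-style k-means, and the usual selection rule (the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}). The function names `kmeans_wss`, `gap_statistic`, and `choose_k` are hypothetical.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans_wss(points, k, rng, n_iter=25):
    """Plain Lloyd k-means; returns the within-cluster sum of squares W_k."""
    centers = rng.sample(points, k)
    for _ in range(n_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # New center = mean of its group; an empty group keeps its old center.
        centers = [[sum(col) / len(g) for col in zip(*g)] if g else c
                   for g, c in zip(groups, centers)]
    return sum(min(sq_dist(p, c) for c in centers) for p in points)

def gap_statistic(points, k_values, n_refs=10, seed=0):
    """Return {k: (Gap(k), s_k)} using uniform reference sets (an assumption)."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    result = {}
    for k in k_values:
        log_w = math.log(kmeans_wss(points, k, rng))
        ref_logs = []
        for _ in range(n_refs):
            ref = [[rng.uniform(lo, hi) for lo, hi in zip(lows, highs)]
                   for _ in points]
            ref_logs.append(math.log(kmeans_wss(ref, k, rng)))
        mean_ref = sum(ref_logs) / n_refs
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_refs)
        # Gap(k) = E*[log W_k] - log W_k; s_k inflates sd for simulation error.
        result[k] = (mean_ref - log_w, sd * math.sqrt(1 + 1 / n_refs))
    return result

def choose_k(result):
    """Smallest k with Gap(k) >= Gap(k+1) - s_{k+1}; assumes consecutive k."""
    ks = sorted(result)
    for k, k_next in zip(ks, ks[1:]):
        gap_next, s_next = result[k_next]
        if result[k][0] >= gap_next - s_next:
            return k
    return None
```

Calling `choose_k(gap_statistic(points, range(1, 8)))` on a list of coordinate lists returns the estimated k, or `None` if no candidate satisfies the rule. Because this k-means uses a single random start, a practical version would probably repeat each fit several times and keep the lowest W_k.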

Section snippets

Data sets

To evaluate the various internal measures, we analyzed two artificial data sets, the same data sets with added noise, and two experimentally obtained data sets: a leukemia data set and a yeast data set.

Artificial data sets

We prepared two artificial data sets: C4D3 (4 clusters, 3-dimensional, as shown in Fig. 1) and C4D5 (4 clusters, 5-dimensional; data not shown).
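A data set of this general shape could be generated as sketched below. The excerpt does not describe how C4D3 and C4D5 were actually constructed, so the Gaussian clusters, the centre range, the spread, the cluster sizes, and the uniform background noise are all illustrative assumptions, and `make_clustered_data` is a hypothetical helper.

```python
import random

def make_clustered_data(n_clusters=4, n_dims=3, points_per_cluster=50,
                        spread=0.5, noise_points=0, seed=0):
    """Return (points, labels); label -1 marks uniform background noise."""
    rng = random.Random(seed)
    # Assumed layout: cluster centers drawn uniformly from [-5, 5]^d.
    centers = [[rng.uniform(-5, 5) for _ in range(n_dims)]
               for _ in range(n_clusters)]
    points, labels = [], []
    for label, center in enumerate(centers):
        for _ in range(points_per_cluster):
            # Isotropic Gaussian scatter around the cluster center.
            points.append([rng.gauss(c, spread) for c in center])
            labels.append(label)
    for _ in range(noise_points):
        # Noise points spread uniformly over a slightly larger box.
        points.append([rng.uniform(-6, 6) for _ in range(n_dims)])
        labels.append(-1)
    return points, labels

# Stand-ins for a noisy 3-D set and a clean 5-D set, each with 4 clusters.
c4d3, c4d3_labels = make_clustered_data(n_clusters=4, n_dims=3, noise_points=20)
c4d5, c4d5_labels = make_clustered_data(n_clusters=4, n_dims=5, seed=1)
```

Fixing the seed keeps the data reproducible, which makes it easier to compare internal measures on the same points with and without noise.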

Leukemia data set

We used the same data set for gene expression in acute leukemia as that analyzed by Golub et al. (28) by hierarchical clustering. These data, obtained from 38 patients,

Estimation results for number of clusters in C4D3 and C4D5 data sets

As shown in Fig. 1, we prepared two artificial data sets (C4D3 and C4D5), each with 4 clusters. When the MFGS and the Gap statistic were applied to the C4D3 data set, the values of the modified Gap(k) (MFGap(k)) and Gap(k), calculated using Eq. 6, increased rapidly until the number of clusters, k, reached 4 (see Fig. 2A, B). By the definition of the MFGS and the Gap statistic (Eq. 7), each point in these plots has an error bar (s_k). In contrast, the plots for PC, FHV, and XB have no error bars (Fig. 2C–E). Since the minimum k
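Two of the comparison measures named above can be computed directly from a fuzzy membership matrix. The sketch below uses the textbook definitions of the partition coefficient (PC) and the Xie-Beni index (XB); the excerpt does not show the exact formulas used in the paper, so treat the details, including the fuzzifier default `m = 2`, as assumptions. FHV is omitted because it needs the determinant of each cluster's fuzzy covariance matrix.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def partition_coefficient(u):
    """PC = (1/N) * sum of squared memberships.

    u[i][j] is the membership of point j in cluster i; each column sums to 1.
    PC ranges from 1/c (maximally fuzzy) to 1 (crisp); larger is better.
    """
    n_points = len(u[0])
    return sum(v * v for row in u for v in row) / n_points

def xie_beni(points, centers, u, m=2.0):
    """XB = fuzzy compactness / (N * minimum squared center separation).

    Smaller values indicate compact, well-separated clusters.
    Requires at least two distinct centers.
    """
    n_points = len(points)
    n_clusters = len(centers)
    compactness = sum(u[i][j] ** m * sq_dist(points[j], centers[i])
                      for i in range(n_clusters) for j in range(n_points))
    separation = min(sq_dist(centers[i], centers[k])
                     for i in range(n_clusters)
                     for k in range(i + 1, n_clusters))
    return compactness / (n_points * separation)
```

Both functions return a single number per clustering, with no spread across reference data sets. That is consistent with the observation above that the PC and XB plots carry no error bars, whereas Gap-type statistics come with s_k.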

References (36)

  • Y. Maki et al. An integrated comprehensive workbench for inferring genetic networks: voyagene. J. Bioinform. Comput. Biol. (2004)
  • M.B. Eisen et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. USA (1998)
  • S. Tavazoie et al. Systematic determination of genetic network architecture. Nat. Genet. (1999)
  • P. Tamayo et al. Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proc. Natl. Acad. Sci. USA (1999)
  • S. Tomida et al. Analysis of expression profile using fuzzy adaptive resonance theory. Bioinformatics (2002)
  • D. Dembele et al. Fuzzy C-means method for clustering microarray data. Bioinformatics (2003)
  • A.P. Gasch et al. Exploring the conditional coregulation of yeast gene expression through fuzzy k-means clustering. Genome Biol. (2002)
  • R.J. Cho et al. Transcriptional regulation and function during the human cell cycle. Nat. Genet. (2001)

Cited by (26)

    • WeDIV – An improved k-means clustering algorithm with a weighted distance and a novel internal validation index

      2022, Egyptian Informatics Journal
      Citation Excerpt :

      Furthermore, several new internal validation indices have been proposed in recent years. Arima et al. [61] proposed a modified Gap index in fuzzy k-means clustering and applied it to gene expression datasets. Mur et al. [62] designed the GS index by combining the Sil index and the concept of local scaling to estimate the number of clusters in spectral clustering.

    • Automatic pattern recognition of ECG signals using entropy-based adaptive dimensionality reduction and clustering

      2017, Applied Soft Computing Journal
      Citation Excerpt :

      Krista Rizman Žalik [17] developed a COr index for clusters widely differing in density or size. In recent years, some new concepts, such as granulation error [34], gap statistic [35], non-local spatial information [36], fuzzy partition stability[37] and geometrical compactness [38], have been involved in the development of fuzzy indexes. More fuzzy indices can be found in Refs. [18] and [21].

    • A novel automatic picture fuzzy clustering method based on particle swarm optimization and picture composite cardinality

      2016, Knowledge-Based Systems
      Citation Excerpt :

      Scanning: This is the simplest way which tries each number of clusters in a given range for clustering and takes one having the best clustering quality in terms of validity indices as the final number of clusters. This approach was used in the works of Alp Erilli et al. [1], Arima et al. [2], Fang & Wang [11], Fujita et al. [12], Lee & Olafsson [15], and Liang et al. [16]. However, computational complexity is the main drawback of this approach since it has to assess all candidates to find the best one.
