Information Sciences
Volume 267, 20 May 2014, Pages 16–34

Probabilistic cluster structure ensemble

https://doi.org/10.1016/j.ins.2014.01.030

Abstract

Cluster structure ensemble focuses on integrating multiple cluster structures extracted from different datasets into a unified cluster structure, instead of aligning the individual labels of the clustering solutions derived from multiple homogeneous datasets, as in the cluster ensemble framework. In this article, we design a novel probabilistic cluster structure ensemble framework, referred to as the Gaussian mixture model based cluster structure ensemble framework (GMMSE), to identify the most representative cluster structure from the dataset. Specifically, GMMSE first applies the bagging approach to produce a set of variant datasets. Then, a set of Gaussian mixture models is used to capture the underlying cluster structures of these datasets. GMMSE applies K-means to initialize the parameter values of each Gaussian mixture model, and adopts the Expectation-Maximization (EM) approach to estimate them. Next, the components of the Gaussian mixture models are viewed as new data samples, which are used to construct the representative matrix capturing the relationships among components. The similarity between two components, with respect to their respective Gaussian distributions, is measured by the Bhattacharyya distance function. Afterwards, GMMSE constructs a graph based on the new data samples and the representative matrix, and searches for the most representative cluster structure. Finally, we design four criteria to assign the data samples to their corresponding clusters based on the unified cluster structure. The experimental results show that (i) GMMSE works well on synthetic datasets and real datasets from the UCI machine learning repository, and (ii) GMMSE outperforms most previous cluster ensemble approaches.

Introduction

With the development of cluster ensemble techniques, a growing number of related approaches have been successfully applied to different fields [32], [33], [11], [64], [39], [55], such as medicine, bioinformatics, and multimedia data mining. For example, Iam-On et al. [32] proposed a link-based cluster ensemble approach based on the similarity between clusters, and successfully applied it to both artificial and real datasets. They also applied this approach to solve the categorical data clustering problem [33]. Christou [11] explored an optimization-based cluster ensemble approach formulated in terms of intra-cluster criteria, and applied it to the TSPLIB benchmark datasets. Yu et al. [64] studied a knowledge-based cluster ensemble approach applied to cancer discovery from gene expression profiles. Mimaroglu and Aksehirli [39] designed a divisive clustering ensemble approach called DICLENS, which is able to identify the number of clusters automatically and achieved good performance on gene expression datasets. Weber et al. [55] gave a general definition of optimal clustering related to overlapping clustering solutions, which is useful for cluster ensemble approaches.

Compared with traditional clustering algorithms, cluster ensemble approaches represent a more effective technique, since they can generate a unified clustering solution from multiple clustering solutions in the ensemble and thereby improve the effectiveness, stability and robustness of the clustering process. Most previous cluster ensemble approaches focus on aligning the labels of data samples derived from diverse clustering solutions, and do not consider fusing multiple cluster structures obtained from various data sources into a unified structure. The cluster structure, which summarizes information about the distribution of the data samples, is more useful in many scenarios. For example, as time passes, some data sources will gradually change, which leads to variation in the labels of data samples across clustering solutions. In this scenario, the cluster structure of the data is more important than the labels of the data samples. This raises an interesting question: how to construct a cluster structure ensemble and identify the most representative cluster structure among the datasets.

A cluster structure ensemble approach has many useful applications. For example, in the area of the mobile Internet, multiple sensors generate many datasets, each with its own cluster structure. At the same time, the cluster structures of these datasets share a large number of similar characteristics. How to construct a unified cluster structure that captures the similarity of the cluster structures in the datasets generated by multiple sensors is an interesting problem deserving intensive exploration. As another example, the objective of clustering analysis on lung cancer datasets is to assign samples to their corresponding classes. The lung adenocarcinoma dataset in [7] contains 203 samples assigned to 5 classes: adenocarcinoma, small-cell lung cancer, pulmonary carcinoids, squamous cell lung carcinomas, and normal lung tissues. Since a large number of datasets have been produced by different research groups in the area of lung cancer research [16], a natural question is how to find the most representative cluster structure among the cluster structures obtained from the different datasets.

In this paper, we design a new probabilistic cluster structure ensemble framework, referred to as the Gaussian mixture model based cluster structure ensemble framework (GMMSE), to identify the most representative cluster structure from the dataset. Specifically, GMMSE first integrates the bagging technique, the K-means algorithm and the Expectation-Maximization approach to generate diversity and estimate the various cluster structures from different data sources. Then, it adopts the normalized cut algorithm [47], together with the representative matrix constructed from the set of cluster structures of the different data sources, to find the most representative cluster structure. Finally, GMMSE applies four assignment criteria, namely the nearest Gaussian model criterion (NGM), the average Gaussian model criterion (AGM), the nearest group center criterion (NGC) and the Gaussian model based majority voting criterion (GMV), to assign the data samples to their corresponding clusters based on this most representative cluster structure. The experimental results show that GMMSE achieves good performance on both synthetic datasets and real datasets from the UCI machine learning repository.
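
As a concrete illustration of the fusion step, the following sketch groups the Gaussian components using their affinity matrix. It is a minimal sketch, not the authors' implementation: scikit-learn's SpectralClustering with a precomputed affinity is used as a stand-in for the normalized cut algorithm [47] (it solves the standard spectral relaxation of normalized cut), and the representative matrix S is assumed to have been built already (its construction is sketched separately below).

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def group_components(S, k):
    """Partition GMM components into k groups, given affinity matrix S."""
    S = 0.5 * (S + S.T)  # enforce symmetry before spectral clustering
    model = SpectralClustering(n_clusters=k, affinity='precomputed',
                               random_state=0)
    return model.fit_predict(S)  # one group label per component
```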

The contribution of the paper is fourfold. First, we propose a Gaussian mixture model based cluster structure ensemble framework (GMMSE) to identify the most representative cluster structure. Second, four criteria are designed to assign data samples to their corresponding clusters based on this representative cluster structure. Third, the time and space complexity of GMMSE are analyzed. Fourth, the representative matrix is designed to capture the relationships among the components of the Gaussian mixture models, where the Bhattacharyya distance function is adopted to measure the similarity between two components with respect to their respective Gaussian distributions.
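
The Bhattacharyya distance between two Gaussian components N(μ1, Σ1) and N(μ2, Σ2) has a standard closed form: D_B = (1/8)(μ1 − μ2)ᵀ Σ⁻¹(μ1 − μ2) + (1/2) ln(det Σ / √(det Σ1 det Σ2)), where Σ = (Σ1 + Σ2)/2. The sketch below computes it with NumPy; turning the distance into an affinity via exp(−D_B) is our illustrative assumption, since the snippets here do not specify the exact conversion used to fill the representative matrix.

```python
import numpy as np

def bhattacharyya_distance(mu1, cov1, mu2, cov2):
    """Closed-form Bhattacharyya distance between two multivariate Gaussians."""
    cov = 0.5 * (cov1 + cov2)
    diff = mu1 - mu2
    term1 = 0.125 * diff @ np.linalg.solve(cov, diff)
    # log-determinants computed stably via slogdet
    logdet = np.linalg.slogdet(cov)[1]
    logdet1 = np.linalg.slogdet(cov1)[1]
    logdet2 = np.linalg.slogdet(cov2)[1]
    term2 = 0.5 * (logdet - 0.5 * (logdet1 + logdet2))
    return term1 + term2

def affinity(mu1, cov1, mu2, cov2):
    # Illustrative assumption: map distance to a (0, 1] similarity.
    return np.exp(-bhattacharyya_distance(mu1, cov1, mu2, cov2))
```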

The remainder of the paper is organized as follows. Section 2 reviews related work on cluster ensemble approaches. Section 3 describes the Gaussian mixture model based cluster structure ensemble framework, and analyzes the time and space complexity of GMMSE. Section 4 evaluates the performance of GMMSE through experiments on synthetic datasets, as well as several real datasets from the UCI machine learning repository. Section 5 draws conclusions and describes possible future work.

Section snippets

Related works

Recently, ensemble learning has been gaining more and more attention, since such approaches can provide more accurate, stable and robust final clustering results compared with traditional approaches. Most ensemble learning approaches [45], [24], [43] can be categorized into three types: supervised learning ensemble, semi-supervised learning ensemble and unsupervised learning ensemble. Supervised learning ensemble, also called classifier ensemble, includes a…

Gaussian mixture model based cluster structure ensemble

Fig. 1 provides an overview of the Gaussian mixture model based cluster structure ensemble framework (GMMSE). Specifically, GMMSE applies a re-sampling technique, for example the bagging technique, to generate a set of new datasets F1, F2, …, FB from the original dataset F in the first step. In the second step, the underlying cluster structure of each new dataset Fb (b ∈ {1, …, B}) is captured by the Gaussian mixture model Θb. GMMSE adopts the K-means algorithm to initialize the values of the…
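
A minimal sketch of these first two steps, assuming full covariances and k components per mixture: scikit-learn's GaussianMixture estimates its parameters by EM and, with init_params='kmeans', initializes them from a K-means run, matching the description above. The helper name fit_structures and the bootstrap details are our own illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_structures(F, B, k, seed=0):
    """Fit one GMM per bootstrap replicate of dataset F (n x m)."""
    rng = np.random.default_rng(seed)
    n = F.shape[0]
    models = []
    for b in range(B):
        idx = rng.integers(0, n, size=n)  # bagging: sample n rows with replacement
        gmm = GaussianMixture(n_components=k, covariance_type='full',
                              init_params='kmeans',  # K-means initialization
                              random_state=b)
        models.append(gmm.fit(F[idx]))    # parameters estimated by EM
    # Each fitted component (mean, covariance) becomes a "new data sample".
    means = np.vstack([g.means_ for g in models])       # shape (B*k, m)
    covs = np.vstack([g.covariances_ for g in models])  # shape (B*k, m, m)
    return means, covs
```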

Experiment

We conduct a number of experiments on synthetic datasets in Table 2 and real datasets from the UCI machine learning repository in Table 3 to evaluate the performance of GMMSE.

Table 2 shows the details of the synthetic datasets (where k is the number of clusters, n is the number of data samples, m is the number of attributes, and ς is the number of noisy attributes), which are generated by different Gaussian distributions with randomly selected centers, and the covariance matrices are of the…
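
Since the snippet is cut off before the covariance details, the following generator is only a sketch under an assumed spherical covariance: it draws k random cluster centers, samples Gaussian clusters around them, and appends ς uninformative noise attributes, mirroring the parameters (k, n, m, ς) of Table 2.

```python
import numpy as np

def make_synthetic(k, n, m, n_noisy, spread=10.0, sigma=1.0, seed=0):
    """k Gaussian clusters in m informative dims plus n_noisy noise dims."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-spread, spread, size=(k, m))  # randomly selected centers
    labels = rng.integers(0, k, size=n)                  # cluster membership
    X = centers[labels] + sigma * rng.standard_normal((n, m))  # spherical covariance (assumed)
    noise = rng.uniform(-spread, spread, size=(n, n_noisy))    # noisy attributes
    return np.hstack([X, noise]), labels
```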

Conclusion and future work

In this paper, we propose a new cluster structure ensemble framework, referred to as the Gaussian mixture model based cluster structure ensemble framework (GMMSE), for identifying the most representative cluster structure from data. Compared with previous cluster ensemble approaches, GMMSE integrates multiple structures from different data sources into a unified structure, and applies four new assignment criteria to distribute data samples to their corresponding clusters. We perform a thorough…

Acknowledgments

The authors are grateful for the constructive advice on the revision of the manuscript from the anonymous reviewers. The work described in this paper was partially funded by the grant from the Hong Kong Scholars Program (Project No. XJ2012015) and the outstanding talent training plan of South China University of Technology, and supported by grants from the National Natural Science Foundation of China (NSFC) (Project Nos. 61003174, 61070090, 61273363, 61379033), the NSFC-Guangdong Joint Fund…

References (71)

  • J. Azimi, X. Fern, Adaptive cluster ensemble selection, in: Proceedings of the 21st International Joint Conference on...
  • N. Bassiou et al., Speaker diarization exploiting the eigengap criterion and cluster ensembles, IEEE Trans. Audio Speech Lang. Process. (2010)
  • A. Bhattacharjee et al., Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses, Proc. Natl. Acad. Sci. (2001)
  • A. Bhattacharyya, On a measure of divergence between two statistical populations defined by their probability distributions, Bull. Calcutta Math. Soc. (1943)
  • L. Breiman, Bagging predictors, Mach. Learn. (1996)
  • L. Breiman, Random forests, Mach. Learn. (2001)
  • I.T. Christou, Coordination of cluster ensembles via exact methods, IEEE Trans. Pattern Anal. Mach. Intell. (2011)
  • T.H. Cormen et al., Introduction to Algorithms (2009)
  • T.M. Cover et al., Elements of Information Theory (2006)
  • A.P. Dempster et al., Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc. (1977)
  • C. Domeniconi et al., Weighted cluster ensembles: methods and analysis, ACM Trans. Knowl. Discovery Data (TKDD) (2009)
  • L. Dyrskjot et al., Loss-of-heterozygosity analysis of small-cell lung carcinomas using single-nucleotide polymorphism arrays, Nat. Biotechnol. (2000)
  • X.Z. Fern, C.E. Brodley, Random projection for high dimensional data clustering: a cluster ensemble approach, in: Proc....
  • X.Z. Fern et al., Cluster ensemble selection, Stat. Anal. Data Min. (2008)
  • A.L.N. Fred et al., Combining multiple clusterings using evidence accumulation, IEEE Trans. Pattern Anal. Mach. Intell. (2005)
  • K. Ganchev, J. Graca, J. Blitzer, B. Taskar, Multi-view learning over clustered and non-identical outputs, in: Proc....
  • J. Gao et al., Graph-based consensus maximization among multiple supervised and unsupervised models, Adv. Neural Inform. Process. Syst. (2009)
  • N. García-Pedrajas et al., Supervised subspace projections for constructing ensembles of classifiers, Inform. Sci. (2012)
  • R. Ghaemi et al., A survey: clustering ensembles techniques, World Acad. Sci. Eng. Technol. (2009)
  • D. Greene et al., Ensemble non-negative matrix factorization methods for clustering protein–protein interactions, Bioinformatics (2008)
  • T.K. Ho, The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Mach. Intell. (1998)
  • X. Hu et al., Microarray gene cluster identification and annotation through cluster ensemble and EM-based informative textual summarization, IEEE Trans. Inform. Technol. Biomed. (2009)
  • N. Iam-On et al., LinkCluE: a MATLAB package for link-based cluster ensembles, J. Stat. Softw. (2010)
  • N. Iam-On et al., LCE: a link-based cluster ensemble method for improved gene expression data analysis, Bioinformatics (2010)
  • N. Iam-On et al., A link-based approach to the cluster ensemble problem, IEEE Trans. Pattern Anal. Mach. Intell. (2011)
Cited by (32)

    • An evidence accumulation based block diagonal cluster model for intent recognition from EEG

      2022, Biomedical Signal Processing and Control
      Citation excerpt:

      Nowadays, clustering is continuously studied and applied in different fields such as pattern recognition [2–5], image segmentation [6–8], data mining [9,10], etc. The probabilistic mixture model is an effective density tool and one of the most commonly used clustering tools [11–13], which can be understood as a convex combination of multiple independent probabilistic models. Regardless of the structural complexity of the data distribution, the local characteristics of the data distribution can be described by adding components due to the application of multiple independent probability distributions.

    • A sequential ensemble clusterings generation algorithm for mixed data

      2018, Applied Mathematics and Computation
      Citation excerpt:

      Semi-supervised ensemble clustering methods use some prior knowledge of the data sets provided by experts in the consensus functions [23,24]. Structure ensemble, firstly proposed by Yu et al. [25,26], can integrate multiple cluster structures extracted from different base clusterings into a unified structure for large-scale data. This paper mainly focuses on the research of base clusterings generation strategy.

    • Cluster ensemble selection with constraints

      2017, Neurocomputing
      Citation excerpt:

      Clustering ensemble typically first produces a large set of clustering results of a given data set and then combines them using a consensus function to create a final clustering that is considered to encompass all of the information contained in the ensemble [9,10,21,11,15,16,18,27–29,35,33]. Recently, some studies propose to learn a cluster structure ensemble which aims to integrate multiple cluster structures extracted from different datasets into a unified cluster structure [36,34,37]. Traditionally, all the generated clustering solutions are integrated to produce the final ensemble.
