Probabilistic cluster structure ensemble
Introduction
With the development of cluster ensemble techniques, a growing number of related approaches have been successfully applied in fields such as medicine, bioinformatics, and multimedia data mining [32], [33], [11], [64], [39], [55]. For example, Iam-On et al. [32] proposed a link-based cluster ensemble approach based on the similarity between clusters and successfully applied it to both artificial and real datasets; they also applied this approach to the categorical data clustering problem [33]. Christou [11] explored an optimization-based cluster ensemble approach formulated in terms of intra-cluster criteria and applied it to the TSPLIB benchmark datasets. Yu et al. [64] studied a knowledge-based cluster ensemble approach for cancer discovery from gene expression profiles. Mimaroglu and Aksehirli [39] designed a divisive clustering ensemble approach called DICLENS, which identifies the number of clusters automatically and achieves good performance on gene expression datasets. Weber et al. [55] gave a general definition of optimal clustering for overlapping clustering solutions, which is useful for cluster ensemble approaches.
Compared with traditional clustering algorithms, cluster ensemble approaches are more effective because they generate a unified clustering solution from the multiple solutions in the ensemble, improving the effectiveness, stability, and robustness of the clustering process. However, most previous cluster ensemble approaches focus on aligning the labels of data samples across diverse clustering solutions and do not consider fusing the multiple cluster structures obtained from various data sources into a unified structure. The cluster structure, which summarizes the distribution of the data samples, is more useful in many scenarios. For example, as time passes, some data sources gradually change, causing the labels of data samples to vary across clustering solutions; in such a scenario, the cluster structure of the data matters more than the labels of individual samples. This raises an interesting question: how can we construct a cluster structure ensemble and identify the most representative cluster structure among the datasets?
A cluster structure ensemble approach has many useful applications. For example, in the area of the mobile Internet, multiple sensors generate many datasets, each with its own cluster structure, and these structures share a large number of similar characteristics. How to construct a unified cluster structure that captures the similarity among the cluster structures of the datasets generated by multiple sensors is an interesting problem deserving intensive exploration. As another example, the objective of clustering analysis on lung cancer datasets is to assign samples to their corresponding classes. The lung adenocarcinoma dataset in [7] contains 203 samples assigned to 5 classes: adenocarcinoma, small-cell lung cancer, pulmonary carcinoids, squamous cell lung carcinomas, and normal lung tissues. Since many datasets have been produced by different research groups in lung cancer research [16], the question arises of how to find the most representative cluster structure among the cluster structures obtained from these different datasets.
In this paper, we design a new probabilistic cluster structure ensemble framework, referred to as the Gaussian mixture model based cluster structure ensemble framework (GMMSE), to identify the most representative cluster structure from the data. Specifically, GMMSE first integrates the bagging technique, the K-means algorithm, and the Expectation–Maximization approach to generate diversity and estimate the various cluster structures of different data sources. Then it adopts the normalized cut algorithm [47], applied to a representative matrix constructed from the set of cluster structures, to find the most representative cluster structure. Finally, GMMSE applies four assignment criteria, namely the nearest Gaussian model criterion (NGM), the average Gaussian model criterion (AGM), the nearest group center criterion (NGC), and the Gaussian model based majority voting criterion (GMV), to assign the data samples to their corresponding clusters based on this most representative cluster structure. Experimental results show that GMMSE achieves good performance on both synthetic datasets and real datasets from the UCI machine learning repository.
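The first two stages (bagging for diversity, then a K-means-initialized Gaussian mixture refined by EM on each replicate) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `fit_bootstrap_gmms`, the parameter values, and the use of scikit-learn are all our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def fit_bootstrap_gmms(X, n_models=10, n_components=3, seed=0):
    """Fit one GMM per bootstrap replicate of X (an illustrative sketch
    of the diversity-generation stage)."""
    rng = np.random.default_rng(seed)
    models = []
    for b in range(n_models):
        # Bagging: draw n samples with replacement from the original data.
        idx = rng.integers(0, len(X), size=len(X))
        Xb = X[idx]
        # K-means supplies the initial component means; EM then refines
        # the means, covariances, and mixing weights.
        km = KMeans(n_clusters=n_components, n_init=10, random_state=b).fit(Xb)
        gmm = GaussianMixture(n_components=n_components,
                              means_init=km.cluster_centers_,
                              random_state=b).fit(Xb)
        models.append(gmm)
    return models
```

Each fitted model exposes its estimated cluster structure through `means_`, `covariances_`, and `weights_`, which is the raw material for the representative matrix in the next stage.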
The contribution of the paper is fourfold. First, we propose a Gaussian mixture model based cluster structure ensemble framework (GMMSE) to identify the most representative cluster structure. Second, four criteria are designed to assign data samples to their corresponding clusters based on this representative structure. Third, the time and space complexity of GMMSE are analyzed. Fourth, a representative matrix is designed to capture the relationships among the components of the Gaussian mixture models, with the Bhattacharyya distance adopted to measure the similarity between two components with respect to their respective Gaussian distributions.
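For two components with Gaussian distributions N(mu1, Sigma1) and N(mu2, Sigma2), the Bhattacharyya distance has the closed form D_B = (1/8)(mu1 - mu2)^T S^{-1} (mu1 - mu2) + (1/2) ln(det S / sqrt(det Sigma1 * det Sigma2)), where S = (Sigma1 + Sigma2)/2. A minimal sketch (the function name is ours; the paper does not specify an implementation):

```python
import numpy as np

def bhattacharyya_gaussian(mu1, cov1, mu2, cov2):
    """Bhattacharyya distance between N(mu1, cov1) and N(mu2, cov2).

    D_B = 1/8 (mu1-mu2)^T S^-1 (mu1-mu2)
        + 1/2 ln( det(S) / sqrt(det(cov1) * det(cov2)) ),
    with S = (cov1 + cov2) / 2.
    """
    S = (cov1 + cov2) / 2.0
    diff = mu1 - mu2
    # Mahalanobis-style term: solve S x = diff instead of inverting S.
    term1 = diff @ np.linalg.solve(S, diff) / 8.0
    term2 = 0.5 * np.log(np.linalg.det(S) /
                         np.sqrt(np.linalg.det(cov1) * np.linalg.det(cov2)))
    return term1 + term2
```

A similarity entry for the representative matrix can then be derived, for instance as exp(-D_B), so that identical components obtain similarity 1 and dissimilar components decay toward 0.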
The remainder of the paper is organized as follows. Section 2 reviews work related to cluster ensemble approaches. Section 3 describes the Gaussian mixture model based cluster structure ensemble framework and analyzes the time and space complexity of GMMSE. Section 4 evaluates the performance of GMMSE through experiments on synthetic datasets, as well as several real datasets from the UCI machine learning repository. Section 5 draws conclusions and outlines future work.
Related works
Recently, ensemble learning has been gaining more and more attention, since such approaches provide more accurate, stable, and robust final clustering results than traditional approaches. Most ensemble learning approaches [45], [24], [43] fall into three categories: supervised learning ensemble, semi-supervised learning ensemble, and unsupervised learning ensemble. Supervised learning ensemble, also called classifier ensemble, includes a
Gaussian mixture model based cluster structure ensemble
Fig. 1 provides an overview of the Gaussian mixture model based cluster structure ensemble framework (GMMSE). Specifically, in the first step GMMSE applies a re-sampling technique, for example the bagging technique, to generate a set of new datasets from the original dataset F. In the second step, the underlying cluster structure of each new dataset is captured by a Gaussian mixture model. GMMSE adopts the K-means algorithm to initialize the values of the
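Once a representative (affinity) matrix over all mixture components is available, the components can be grouped by a spectral method. Below is a minimal sketch using scikit-learn's `SpectralClustering` on a precomputed affinity; its normalized-Laplacian embedding stands in here for the normalized cut algorithm of [47] and is not a reproduction of the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def group_components(affinity, k):
    """Group mixture components into k representative clusters by spectral
    clustering on a precomputed affinity (representative) matrix.

    Each row/column of `affinity` corresponds to one Gaussian component
    drawn from any of the bootstrap models; entries are similarities,
    e.g. exp(-Bhattacharyya distance)."""
    sc = SpectralClustering(n_clusters=k, affinity='precomputed',
                            random_state=0)
    return sc.fit_predict(affinity)
```

Components assigned to the same group can then be summarized (for example by their most central member) to form the most representative cluster structure.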
Experiment
We conduct a number of experiments on synthetic datasets in Table 2 and real datasets from the UCI machine learning repository in Table 3 to evaluate the performance of GMMSE.
Table 2 shows the details of the synthetic datasets (where k is the number of clusters, n is the number of data samples, m is the number of attributes, and the number of noisy attributes is also listed), which are generated from different Gaussian distributions with randomly selected centers, and the covariance matrices are of the
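A generator in the spirit of Table 2 might look as follows; the center range, the unit-variance spherical covariance, and the uniformly distributed noisy attributes are simplifying assumptions of this sketch, not the paper's exact settings.

```python
import numpy as np

def make_synthetic(k, n, m, m_noise, seed=0):
    """Generate n samples from k Gaussian clusters with randomly selected
    centers over m informative attributes, plus m_noise noisy attributes
    carrying no cluster information."""
    rng = np.random.default_rng(seed)
    centers = rng.uniform(-10, 10, size=(k, m))       # random cluster centers
    labels = rng.integers(0, k, size=n)               # cluster membership
    X = centers[labels] + rng.normal(0.0, 1.0, size=(n, m))
    noise = rng.uniform(-10, 10, size=(n, m_noise))   # uninformative columns
    return np.hstack([X, noise]), labels
```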
Conclusion and future work
In this paper, we propose a new cluster structure ensemble framework, referred to as the Gaussian mixture model based cluster structure ensemble framework (GMMSE), for identifying the most representative cluster structure from data. Compared with previous cluster ensemble approaches, GMMSE integrates multiple structures from different data sources into a unified structure and applies four new assignment criteria to assign data samples to their corresponding clusters. We perform a thorough
Acknowledgments
The authors are grateful for the constructive advice on the revision of the manuscript from the anonymous reviewers. The work described in this paper was partially funded by the grant from the Hong Kong Scholars Program (Project No. XJ2012015) and the outstanding talent training plan of South China University of Technology, and supported by grants from the National Natural Science Foundation of China (NSFC) (Project Nos. 61003174, 61070090, 61273363, 61379033), the NSFC-Guangdong Joint Fund
References (71)
- et al., On voting-based consensus of cluster ensembles, Pattern Recognit. (2010)
- et al., A decision-theoretic generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. (1997)
- et al., Moderate diversity for better cluster ensembles, Inform. Fusion (2006)
- et al., A scalable framework for cluster ensembles, Pattern Recognit. (2009)
- et al., A dynamic classifier ensemble selection approach for noise data, Inform. Sci. (2010)
- et al., From cluster ensemble to structure ensemble, Inform. Sci. (2012)
- et al., Ensembling neural networks: many could be better than all, Artif. Intell. (2002)
- M.F. Amasyali, O. Ersoy, The performance factors of clustering ensembles, in: IEEE 16th Signal Processing,...
- A. Asuncion, D.J. Newman, UCI Machine Learning Repository, Irvine, CA: University....
- et al., Cumulative voting consensus method for partitions with variable number of clusters, IEEE Trans. Pattern Anal. Mach. Intell. (2008)
- Speaker diarization exploiting the eigengap criterion and cluster ensembles, IEEE Trans. Audio Speech Lang. Process.
- Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma sub-classes, Proc. Natl. Acad. Sci.
- On a measure of divergence between two statistical populations defined by their probability distributions, Bull. Calcutta Math. Soc.
- Bagging predictors, Mach. Learn.
- Random forests, Mach. Learn.
- Coordination of cluster ensembles via exact methods, IEEE Trans. Pattern Anal. Mach. Intell.
- Introduction to Algorithms
- Elements of Information Theory
- Maximum likelihood from incomplete data via the EM algorithm, J. Roy. Stat. Soc.
- Weighted cluster ensembles: methods and analysis, ACM Trans. Knowl. Discovery Data (TKDD)
- Loss-of-heterozygosity analysis of small-cell lung carcinomas using single-nucleotide polymorphism arrays, Nat. Biotechnol.
- Cluster ensemble selection, Stat. Anal. Data Min.
- Combining multiple clusterings using evidence accumulation, IEEE Trans. Pattern Anal. Mach. Intell.
- Graph-based consensus maximization among multiple supervised and unsupervised models, Adv. Neural Inform. Process. Syst.
- Supervised subspace projections for constructing ensembles of classifiers, Inform. Sci.
- A survey: clustering ensembles techniques, World Acad. Sci. Eng. Technol.
- Ensemble non-negative matrix factorization methods for clustering protein–protein interactions, Bioinformatics
- The random subspace method for constructing decision forests, IEEE Trans. Pattern Anal. Mach. Intell.
- Microarray gene cluster identification and annotation through cluster ensemble and EM-based informative textual summarization, IEEE Trans. Inform. Technol. Biomed.
- LinkCluE: a MATLAB package for link-based cluster ensembles, J. Stat. Softw.
- LCE: a link-based cluster ensemble method for improved gene expression data analysis, Bioinformatics
- A link-based approach to the cluster ensemble problem, IEEE Trans. Pattern Anal. Mach. Intell.