Elsevier

Pattern Recognition

Volume 45, Issue 8, August 2012, Pages 3034-3044

Vector quantization based approximate spectral clustering of large datasets

https://doi.org/10.1016/j.patcog.2012.02.012

Abstract

Spectral partitioning, recently popular for unsupervised clustering, is infeasible for large datasets due to its computational complexity and memory requirement. Therefore, approximate spectral clustering of data representatives (selected by various sampling methods) has been used. Alternatively, we propose to use neural networks (self-organizing maps and neural gas), which have been shown successful in quantization with small distortion, as preliminary sampling for approximate spectral clustering (ASC). We show that they usually outperform k-means sampling (which was previously shown superior to various sampling methods) in terms of the clustering accuracy obtained by ASC. More importantly, for quantization based ASC, we introduce a local density-based similarity measure, constructed without any user-set parameter, which achieves accuracies superior to those of the commonly used distance-based similarity.

Highlights

► We use neural network based quantization for approximate spectral clustering (ASC).
► Neural networks usually outperform k-means in terms of clustering accuracy achieved by ASC.
► We propose a local density-based similarity, CONN (constructed without any parameter), for quantization prototypes.
► Compared to distance-based similarity, CONN achieves superior accuracies for ASC.

Introduction

Unsupervised clustering aims to find distinct groups in a dataset, often without a priori information on their structures. A common approach is to construct parametric models based on a known number of clusters. Among these, the most popular method is k-means clustering (and its variants), which minimizes the total distance of data samples to their corresponding cluster centroids. Another parametric approach is the expectation-maximization algorithm for Gaussian mixture models, which optimizes both the cluster centroids and the cluster variances. However, real datasets often do not fit parametric models, which in turn requires nonparametric clustering methods [1].

Recently, spectral clustering [2], [3], [4], which exploits pairwise similarities of data samples through eigendecomposition of their similarity matrix, has been shown to be successful in several areas such as information retrieval and computer vision [5], [6]. It has advantageous properties, such as the extraction of irregularly shaped clusters without parametric models and easy implementation, and it is supported by theoretical and empirical studies [7], [8]. Detailed reviews of spectral clustering can be found in [9], [10]. For large datasets, however, its use is limited since it is often infeasible due to its computational complexity of O(N³) and memory requirement of O(N²), where N is the number of samples to be clustered [11].

In order to apply spectral methods to the clustering of large datasets, one approach is to use distributed systems, parallelizing spectral clustering over many computers to overcome the memory use and computational complexity [12]. This in turn requires additional resources that must be scaled with the size of the dataset. A novel approach, applicable only to the segmentation of large images, is to apply spectral clustering to non-overlapping small blocks of the image and combine the resulting partitionings by a stochastic ensemble [13]. However, the common naive approach is to reduce the number of data samples using data representatives (either sampled from the data samples or obtained by their quantization), and then apply spectral clustering to those representatives rather than to the data samples directly [11], [14], [15], [16], [17], [18], producing an approximate spectral clustering (ASC). Fowlkes et al. [14] use random sampling with the Nyström method, and hence may produce different partitionings on each run. Bezdek et al. [15] use progressive sampling, which has a tendency to over-sample [18], whereas Wang et al. [16], [18] use selective sampling. Wang et al. [11] also compare different sampling algorithms for spectral clustering and conclude that selective and k-means sampling outperform the random sampling approach. Additionally, Yan et al. [17] use k-means and random projection trees as sampling methods and show experimentally that vector quantization can be successfully used to select data representatives for fast ASC with only a slight decrease in clustering accuracy. Moreover, Belabbas and Wolfe [19] provide a theoretical justification for using vector quantization to determine the data representatives for approximate spectral clustering.
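To make the quantization-then-cluster idea concrete, the following is a minimal sketch of ASC, not the procedure of any of the cited papers: it uses scikit-learn's KMeans and SpectralClustering, a Gaussian prototype similarity, and illustrative values for n_prototypes and sigma.

```python
# Minimal ASC sketch: quantize N samples down to n << N prototypes,
# spectrally cluster the prototypes, then label each sample by its
# best-matching prototype. Parameter values are illustrative only.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def approximate_spectral_clustering(X, n_prototypes=100, n_clusters=3, sigma=1.0):
    # Step 1: vector quantization (k-means here; SOM or neural gas in this paper).
    km = KMeans(n_clusters=n_prototypes, n_init=10).fit(X)
    W = km.cluster_centers_                      # prototypes, shape (n, D)

    # Step 2: Gaussian similarity between prototypes.
    d2 = ((W[:, None, :] - W[None, :, :]) ** 2).sum(-1)
    S = np.exp(-d2 / (2.0 * sigma ** 2))

    # Step 3: spectral clustering of the small n x n similarity matrix.
    proto_labels = SpectralClustering(n_clusters=n_clusters,
                                      affinity='precomputed').fit_predict(S)

    # Step 4: each sample inherits the label of its nearest prototype.
    return proto_labels[km.predict(X)]
```

The gain is that the eigendecomposition is performed on an n×n prototype matrix instead of the N×N sample matrix, reducing the O(N³) cost to O(n³) with n ≪ N.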

Self-organizing maps (SOMs) [20] and neural gas [21] are two neural networks that can be used for effective vector quantization of large datasets. Contrary to k-means quantization, which is based on iterative adaptation of the centroids (the best-matching units, BMUs) alone, SOMs and neural gas cooperatively adapt the best-matching units together with their neighbors (determined by a neighborhood function), to reflect the data topology as faithfully as possible with the given number of quantization prototypes. On the one hand, SOMs use a rigid (usually 2D or 3D) grid structure to define the neighborhood relations. This also enables the visualization of high-dimensional data spaces, without dimensionality reduction, since prototypes that are neighbors in the grid are (ideally) expected to be neighbors in the data space as well. On the other hand, neural gas defines the neighborhood function in the data space by ranking the prototypes according to their distances to the presented data sample, without any forced layout. Thanks to their quantization based on cooperative adaptation, SOMs and neural gas have been successfully used in prototype-based data analysis [22], [23]. Our first contribution in this study is to utilize the quantization property of SOMs and neural gas as preprocessors for approximate spectral clustering of large datasets, and to show that they are usually superior to k-means quantization in terms of the accuracies achieved by ASC.
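The contrast between the two neighborhood definitions is easiest to see in a single online training step. The sketch below is ours, with hypothetical learning rate and width parameters (eps, sigma, lam) that would decay over training in any real implementation:

```python
# One online training step for SOM and neural gas (illustrative sketch;
# annealing schedules for eps, sigma and lam are omitted).
import numpy as np

def som_step(W, grid, v, eps=0.5, sigma=1.0):
    # W: (n, D) prototypes; grid: (n, 2) fixed grid coordinates of the units.
    bmu = np.argmin(((W - v) ** 2).sum(1))       # best-matching unit
    g2 = ((grid - grid[bmu]) ** 2).sum(1)        # neighborhood on the rigid grid
    h = np.exp(-g2 / (2.0 * sigma ** 2))
    W += eps * h[:, None] * (v - W)              # BMU and grid neighbors adapt

def neural_gas_step(W, v, eps=0.5, lam=1.0):
    # Neighborhood defined in the data space by the distance ranking to v.
    d2 = ((W - v) ** 2).sum(1)
    ranks = np.argsort(np.argsort(d2))           # 0 for the BMU, 1 for runner-up, ...
    h = np.exp(-ranks / lam)
    W += eps * h[:, None] * (v - W)              # all prototypes adapt by rank
```

In both cases the BMU and its neighbors move toward the sample v; only the definition of "neighbor" differs, which is what makes SOMs visualizable on their grid and neural gas free of a forced layout.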

In general, another challenge in spectral clustering is the construction of the similarity matrix for eigendecomposition. Even though this matrix can be defined in different ways [9], a common approach is to define pairwise similarities s(i,j) using a Gaussian function based on the (often Euclidean) distances d(xi,xj) of data samples xi and xj, i.e.

s(i,j) = exp(−d(xi,xj)² / (2σ²))

where σ is a decaying parameter determining the neighborhood. Alternatively, a recent method [24] defines s(i,j) by including the common-near-neighbor count CNN(i,j) (the number of data samples in the intersection of the ϵ-neighborhoods of xi and xj), as

s(i,j) = exp(−d(xi,xj)² / (2σ²(CNN(i,j)+1)))

and shows superior clustering accuracies. However, both approaches require setting σ, which has to be determined properly for the best possible partitioning with spectral clustering. Ref. [3] recommends trying various σ values to find the optimum, whereas [25] uses a cluster ensemble approach to merge partitionings obtained with different σ. Instead, automated setting of σ (a different σi for each sample xi) has also been used [26], [6], [18], defining σi as the distance to the kth nearest neighbor of data sample xi. However, this approach introduces another parameter to be set by the user, often specific to the dataset [24]. To overcome this challenge for vector quantization based approximate spectral clustering, we define a similarity matrix based on the local data distribution, without any user-defined parameters, as our second contribution in this study.
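For reference, the three distance-based similarity choices discussed above can be sketched as follows; sigma, eps, and k are exactly the user-set parameters that motivate the proposed parameter-free similarity, and the helper names are ours:

```python
# Distance-based similarities: plain Gaussian, the CNN-weighted variant
# of [24], and the self-tuning sigma_i of [26]. Sketches, not the
# implementations used in the paper's experiments.
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_similarity(X, sigma):
    D = cdist(X, X)
    return np.exp(-D ** 2 / (2.0 * sigma ** 2))

def cnn_similarity(X, sigma, eps):
    # CNN(i,j): number of samples in the intersection of the
    # eps-neighborhoods of x_i and x_j.
    D = cdist(X, X)
    nbr = (D <= eps).astype(int)
    CNN = nbr @ nbr.T
    return np.exp(-D ** 2 / (2.0 * sigma ** 2 * (CNN + 1)))

def self_tuning_similarity(X, k):
    # sigma_i = distance of x_i to its k-th nearest neighbor;
    # s(i,j) = exp(-d(x_i,x_j)^2 / (sigma_i * sigma_j)).
    D = cdist(X, X)
    sig = np.sort(D, axis=1)[:, k]    # column 0 is the self-distance
    return np.exp(-D ** 2 / (sig[:, None] * sig[None, :]))
```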

The paper is organized as follows. First, we briefly discuss spectral clustering methods in Section 2; then, in Section 3, we describe self-organizing maps and neural gas, the vector quantization methods used for approximate spectral clustering in this study. In Section 4, we describe our similarity matrix derived from the local data distribution. In Section 5, we show the effectiveness of the proposed approaches using three synthetic datasets from [27], six real datasets from the UCI Machine Learning Repository [28], and five large datasets. We conclude in Section 6.

Section snippets

Spectral clustering

Spectral clustering methods are associated with relaxed optimization of graph-cut problems, using a graph Laplacian matrix, L [2], [3], [4]. We refer to the tutorials [9], [10] (and references therein) for a detailed overview of different methods. Below, we describe the method in [3] used in this study, since several studies indicate that there is no clear advantage among different spectral methods as long as a normalized graph Laplacian is considered [9], [8].

Let G=(V,S) be a weighted,
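Since the snippet above is truncated, a minimal sketch of the normalized spectral clustering of [3], as it is commonly formulated, is given below; it operates on a precomputed similarity matrix S and may differ in detail from the exact variant used in the paper:

```python
# Normalized spectral clustering in the style of Ng et al. [3]:
# embed the data with the leading eigenvectors of D^(-1/2) S D^(-1/2),
# normalize rows, and run k-means in the embedding.
import numpy as np
from sklearn.cluster import KMeans

def njw_spectral_clustering(S, k):
    # S: (n, n) symmetric similarity matrix; k: number of clusters.
    d = S.sum(axis=1)                                  # degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    L_sym = D_inv_sqrt @ S @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L_sym)                 # ascending eigenvalues
    U = vecs[:, -k:]                                   # k largest eigenvectors
    U = U / np.linalg.norm(U, axis=1, keepdims=True)   # unit-length rows
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```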

Vector quantization by neural networks

This section briefly describes the two neural networks, self-organizing maps [20] and neural gas [21], which are used as vector quantization methods for selecting the data representatives to be clustered using spectral methods.

A similarity measure for vector quantization prototypes

Vector quantization methods encode a data manifold M ⊂ R^D with a finite set of prototypes (codebook vectors) W = {w1, w2, …, wn}, wi ∈ R^D, i = 1, 2, …, n. A data sample v ∈ M is characterized by the prototype wi which is the best matching unit for v (Eq. (6)). The prototypes partition the data space into subregions, called Voronoi polygons Vi,

Vi = {v ∈ M : ‖v − wi‖ ≤ ‖v − wj‖ ∀j}

The prototypes are the centers of their corresponding polygons, representing the data samples in their Vi. A common way of representing the
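To make the prototype encoding concrete, the sketch below computes the BMU assignment and a connectivity matrix in the spirit of the CONN similarity; it assumes CONN(i,j) counts the samples for which wi and wj are the best- and second-best-matching prototypes, which is our reading of the parameter-free construction (the paper gives the exact definition):

```python
# BMU encoding and a CONN-style connectivity similarity (our sketch):
# each sample votes for the edge between its best- and second-best-
# matching prototypes, so the similarity reflects local data density.
import numpy as np
from scipy.spatial.distance import cdist

def conn_similarity(X, W):
    # X: (N, D) data samples; W: (n, D) quantization prototypes.
    n = W.shape[0]
    D = cdist(X, W)                        # sample-to-prototype distances
    order = np.argsort(D, axis=1)
    bmu1, bmu2 = order[:, 0], order[:, 1]  # best and second-best units
    CONN = np.zeros((n, n))
    for i, j in zip(bmu1, bmu2):
        CONN[i, j] += 1                    # accumulate density evidence
        CONN[j, i] += 1                    # keep the matrix symmetric
    return CONN
```

Note that no parameter beyond the prototypes themselves enters this construction, in contrast to the σ, ϵ, or k of the distance-based similarities above.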

Experimental results

First, we evaluate the proposed neural network based ASC and the density-based CONN similarity, using the three synthetic datasets [27] described above and six real datasets from the UCI Machine Learning Repository [28]. These datasets are small enough to be clustered by spectral clustering directly, and ASC is unnecessary for them. However, owing to their easily interpretable or well-known cluster structures, they serve for qualitative evaluation of ASC and the CONN similarity. We compare the

Conclusion

We show that vector quantization based approximate spectral clustering (VQASC) achieves the high accuracies of spectral clustering (SC), exploiting the advantages of SC for large datasets where SC itself is infeasible due to its computational complexity and memory requirement. We also show that neural networks (SOM or neural gas) often provide better quantization than k-means (a sampling method shown superior for ASC in an earlier study [11]) for obtaining high clustering quality with VQASC. For the

References (41)

  • A. Qin et al., Enhanced neural gas network for prototype-based clustering, Pattern Recognition (2005).
  • E. Merényi et al., Learning highly structured manifolds: harnessing the power of SOMs.
  • J. Shi et al., Normalized cuts and image segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence (2000).
  • A. Ng et al., On spectral clustering: analysis and an algorithm.
  • M. Meila, J. Shi, A random walks view of spectral segmentation, in: Eighth International Workshop on Artificial...
  • K. Ersahin et al., Segmentation and classification of polarimetric SAR data using spectral graph partitioning, IEEE Transactions on Geoscience and Remote Sensing (2010).
  • R. Kannan et al., On clusterings: good, bad and spectral, Journal of the ACM (2004).
  • D. Verma, M. Meila, A Comparison of Spectral Clustering Algorithms, Technical Report UW TR CSE-03-05-01, University of...
  • U. von Luxburg, A Tutorial on Spectral Clustering, Technical Report TR-149, Max Planck Institute for Biological...
  • L. Wang et al., Approximate spectral clustering.

Kadim Tasdemir received his B.S. degree from Bogazici University, Istanbul, Turkey in 2001; his M.S. degree from Istanbul Technical University in 2003; and his Ph.D. degree from Rice University, Houston, TX in 2008. He has been a researcher at the European Commission Joint Research Centre (JRC), Institute for Environment and Sustainability (IES) since October 2009. He received the IES Best Young Scientist Award in 2011. Currently, he works on automated control methods for monitoring agricultural resources using remote sensing imagery. Before JRC, he worked as an Assistant Professor at the Department of Computer Engineering, Yasar University, Izmir, Turkey in 2008–2009. During 2003–2008, he was a research assistant at Rice University, where he developed visualization and clustering methods using neural computation for detailed knowledge discovery, sponsored by the NASA Applied Information Systems Research Program. He was also awarded the Rice University Robert Patten Award for his contributions to graduate life. During 2001–2003, he was a research assistant at Istanbul Technical University, where he worked on a license plate recognition project. His research interests include detailed knowledge discovery from high-dimensional and large datasets (especially remote sensing imagery) using machine learning and data mining, self-organized learning, computer vision and pattern recognition.
