
Pattern Recognition

Volume 42, Issue 9, September 2009, Pages 2020-2028

Clustering with r-regular graphs

https://doi.org/10.1016/j.patcog.2008.11.022

Abstract

In this paper, we present a novel graph-based clustering method in which we decompose a (neighborhood) graph into disjoint r-regular graphs and then refine the result by optimizing the normalized cluster utility. We solve the r-regular graph decomposition using linear programming. However, this simple decomposition suffers from inconsistent edges when clusters are not well separated. We optimize the normalized cluster utility in order to eliminate inconsistent edges or to merge similar clusters into a group, following the principle of the minimal K-cut. The method is especially useful in the presence of noise and outliers. Moreover, it detects the number of clusters within a pre-specified range. Numerical experiments with synthetic and UCI data sets confirm the useful behavior of the method.

Introduction

Clustering is a fundamental technique in exploratory data analysis that arises in many applications such as data mining, image segmentation, information retrieval, and bioinformatics [9]. Although many successful clustering algorithms exist, clustering remains a difficult problem because the notion of a cluster depends strongly on the context as well as on the purpose of clustering. The main unsolved and controversial issues in clustering can be summarized as follows [10]. First, selecting an appropriate measure of similarity or dissimilarity between data points is data-dependent; there are no guidelines that allow us to choose the best one among the diverse measures available. Second, widely accepted conceptual definitions of a cluster exist, but it is difficult to derive from them an operational definition leading to a concrete algorithm. The operational definition is also related to selecting the optimal number of clusters, which is usually unknown and difficult to estimate in real-world applications. Third, the quality of clustering is very sensitive to background noise and outliers, so for robust clustering it is important to remove them or to discriminate them from actual clusters. Fourth, many clustering algorithms that iteratively minimize an error cost suffer from getting trapped in local minima. Last, hyperparameters such as the kernel width in spectral clustering are not easy to tune manually [20].

In this paper, we present a novel r-regular graph-based clustering algorithm that alleviates the aforementioned problems. In Section 2, we review graph-based clustering algorithms. In Section 3, we give an operational definition of a cluster in terms of an underlying graph structure and the dissimilarity between vertices; the optimal clusters are determined by maximizing the normalized cluster utility. We emphasize that the structure of r-regular graphs plays an important role both in eliminating inconsistent edges and in calculating the normalized cluster utility. In Section 4, we explain how to decompose a graph into disjoint r-regular graphs. Our proposed clustering algorithm, which incorporates the decomposed r-regular graph into the definition of a cluster, is presented in Section 5. Numerical experimental results with synthetic and UCI data sets are provided in Section 6, showing the useful behavior of our method. Finally, conclusions are drawn and discussions are given in Section 7.


Related work

In this section, we give an overview of graph-based partitional clustering algorithms related to our work; extensive literature surveys on clustering can be found in [11], [27]. In general, graph-based partitional clustering algorithms consist of three steps, summarized below:

  (1) Construct an underlying graph to capture the geometric structure among data points.

  (2) Remove inconsistent edges according to a rule.

  (3) Identify clusters from the resulting connected subgraphs.
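The three generic steps above can be sketched in code. This is a minimal illustration, not the paper's algorithm: the graph here is a k-nearest-neighbor graph, and the inconsistency rule (pruning edges longer than a fixed multiple of the mean edge length) is one simple choice among many; the parameters `k` and `factor` are assumptions of this sketch.

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def graph_based_clustering(X, k=5, factor=2.0):
    """Generic three-step graph-based partitional clustering sketch:
    (1) build a k-nearest-neighbor graph over the data points;
    (2) remove 'inconsistent' edges, here crudely defined as edges
        longer than `factor` times the mean edge length;
    (3) report connected components of the pruned graph as clusters."""
    n = X.shape[0]
    D = cdist(X, X)                              # pairwise Euclidean distances
    rows, cols, vals = [], [], []
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:      # step (1): k nearest neighbors
            rows.append(i); cols.append(j); vals.append(D[i, j])
    vals = np.array(vals)
    keep = vals <= factor * vals.mean()          # step (2): prune long edges
    A = csr_matrix((vals[keep],
                    (np.array(rows)[keep], np.array(cols)[keep])),
                   shape=(n, n))
    A = A.maximum(A.T)                           # symmetrize (undirected graph)
    n_clusters, labels = connected_components(A, directed=False)  # step (3)
    return n_clusters, labels
```

On two well-separated groups of points, the pruned k-nearest-neighbor graph falls apart into one connected component per group, which is exactly the failure-prone behavior the paper's refinement step is designed to make robust.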

Various graph

Clusters, dissimilarity, and cluster validation

We present an operational definition of a cluster based on an underlying graph structure, leading to a novel dissimilarity measure between data points, which enables us to define an optimization criterion that quantitatively validates the quality of clustering.

We consider an undirected weighted graph G = (V, E), where V = {v1, v2, …, vn} is a set of vertices (nodes) and E = {eij} is a set of edges, with each edge eij weighted by the Euclidean distance between vi and vj. The weighted adjacency matrix of a graph G,
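Although the section is truncated here, the weighted adjacency matrix just introduced is straightforward to compute: entry (i, j) is the Euclidean distance between vi and vj, so the matrix is symmetric with a zero diagonal. A minimal sketch (the three sample points are an illustration only):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Weighted adjacency matrix of the complete graph G = (V, E):
# entry (i, j) holds the Euclidean distance between v_i and v_j.
X = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
W = squareform(pdist(X))      # symmetric, zero diagonal
```

Here W[0, 1] = 5 (the 3-4-5 triangle) and W[0, 2] = 10, so larger weights indicate greater dissimilarity, consistent with the distance-based edge weights defined above.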

r-regular graph decomposition

It follows from the arguments described in Section 3 that the essential requirements for the underlying graph of a data set S include: (1) all vertices should have the same degree; (2) vertices in a neighborhood should be connected; and (3) connected subgraphs should be mutually disconnected when they are well separated. In this section, we present a method for constructing an underlying graph satisfying these three requirements. The main task involves a decomposition of a complete graph into
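The paper's linear-programming formulation is not reproduced in this excerpt. As an illustration only, here is a minimal LP relaxation under the assumption that the decomposition keeps a minimum-total-weight edge set subject to every vertex having degree exactly r (equivalently, maximizing the total weight of eliminated edges, as stated in the conclusions); the function name and the use of a fractional relaxation are assumptions of this sketch, not the authors' formulation.

```python
import itertools
import numpy as np
from scipy.optimize import linprog
from scipy.spatial.distance import pdist, squareform

def r_regular_lp(X, r):
    """LP relaxation of r-regular subgraph selection on the complete
    graph: one variable x_e in [0, 1] per edge, degree-r equality
    constraint at every vertex, minimizing total kept edge weight
    (i.e., maximizing the weight of eliminated edges)."""
    n = X.shape[0]
    W = squareform(pdist(X))
    edges = list(itertools.combinations(range(n), 2))
    c = np.array([W[i, j] for i, j in edges])    # kept-weight objective
    A_eq = np.zeros((n, len(edges)))             # degree constraints
    for e, (i, j) in enumerate(edges):
        A_eq[i, e] = 1.0
        A_eq[j, e] = 1.0
    b_eq = np.full(n, float(r))
    res = linprog(c, A_eq=A_eq, b_eq=b_eq, bounds=(0.0, 1.0))
    return edges, res.x                          # fractional edge choices
```

For four points at the corners of a unit square with r = 2, the relaxation keeps the four unit-length sides (total weight 4) and eliminates the two longer diagonals, producing a 2-regular cycle, which matches the intuition that the decomposition discards the longest edges first.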

r-regular graph clustering

We present the r-regular graph clustering algorithm, which consists of two parts: (1) the r-regular graph decomposition and (2) a refinement step in which inconsistent edges are eliminated and noise clusters are merged by maximizing the normalized cluster utility. Before we describe the detailed clustering algorithm, we introduce two user-specified parameters that resolve the ambiguities of noise clusters and control the level of resolution.

Definition 7

A noise cluster is a connected

Numerical experiments

We evaluated clustering performance in terms of classification accuracy using several labeled data sets (labels are hidden from the clustering algorithms). Experiments were done with two synthetic data sets with background noise, and one synthetic data set without noise. We also used six UCI data sets [2], including iris, Wisconsin original breast cancer (WBC), Wisconsin diagnostic breast cancer (WDBC), ionosphere, and handwritten digit data (Table 1). In Table 1, n is the
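Scoring a clustering against hidden labels requires mapping clusters to classes before computing accuracy, since cluster indices are arbitrary. The excerpt does not say how the authors performed this mapping; a standard choice, sketched below as an assumption, is a maximum-weight bipartite matching between clusters and classes (the Hungarian algorithm).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(true_labels, cluster_labels):
    """Classification accuracy of a clustering: match each cluster to a
    class by maximum-weight bipartite matching on the contingency
    table, then score the fraction of correctly mapped points."""
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    classes = np.unique(true_labels)
    clusters = np.unique(cluster_labels)
    # C[i, j] = number of points in cluster i whose true class is j
    C = np.array([[np.sum((cluster_labels == c) & (true_labels == t))
                   for t in classes] for c in clusters])
    row, col = linear_sum_assignment(-C)   # negate to maximize matches
    return C[row, col].sum() / true_labels.size
```

The matching handles the case, relevant to this paper, where the number of detected clusters differs from the number of classes: unmatched clusters simply contribute no correct points.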

Conclusions

We have presented a novel graph-based clustering algorithm composed of the r-regular graph decomposition followed by further optimization in the framework of the maximum normalized cluster utility. Inspired by perceptual grouping, the r-regular graph decomposition determined a disjoint union of r-regular graphs in such a way that the sum of the weights of the edges eliminated during the decomposition was maximized. The r-regular graph decomposition captured the proximity between data

Acknowledgments

J.K. Kim was supported by a Microsoft Research Asia fellowship. This work was supported by the National Core Research Center for Systems Bio-Dynamics and the KOSEF Basic Research Program (Grant R01-2006-000-11142-0).

About the Author—JONG KYOUNG KIM received the B.S. degree in chemistry, the B.S. degree in computer science in 2004, and the M.S. degree in computer science in 2006, from Pohang University of Science and Technology, Pohang, Korea. He is studying for the Ph.D. degree in computer science at the same university. His research interests include statistical machine learning and bioinformatics.

References (28)

  • E. Hartuv et al.

    A clustering algorithm based on graph connectivity

    Information Processing Letters

    (2000)
  • R. Urquhart

    Graph theoretical clustering based on limited neighborhood sets

    Pattern Recognition

    (1982)
  • N. Ahuja

    Dot pattern processing using Voronoi neighborhoods

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (1982)
  • C.L. Blake, C.J. Merz, UCI repository of machine learning databases,...
  • U. Brandes et al.

    Experiments on graph clustering algorithms

  • J.-S. Cherng et al.

    A hypergraph based clustering algorithm for spatial data sets

  • B.S. Everitt

    Cluster Analysis

    (1974)
  • P. Foggia et al.

    Assessing the performance of a graph-based clustering algorithm

  • K.C. Gowda et al.

    Agglomerative clustering using the concept of mutual nearest neighborhood

    Pattern Recognition

    (1978)
  • A.K. Jain et al.

    Algorithms for Clustering Data

    (1988)
  • A.K. Jain et al.

    Statistical pattern recognition: a review

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2000)
  • A.K. Jain et al.

    Data clustering: a review

    ACM Computing Surveys

    (1999)
  • R.A. Jarvis et al.

    Clustering using a similarity measure based on shared near neighbors

    IEEE Transactions on Computers

    (1973)
  • R. Kannan et al.

    On clustering: good, bad and spectral

    Journal of the ACM

    (2004)

About the Author—SEUNGJIN CHOI received the B.S. and M.S. degrees in electrical engineering from Seoul National University, Korea, in 1987 and 1989, respectively, and the Ph.D. degree in electrical engineering from the University of Notre Dame, Indiana, in 1996. He was a Visiting Assistant Professor in the Department of Electrical Engineering at University of Notre Dame, Indiana, during the Fall semester of 1996. He was with the Laboratory for Artificial Brain Systems, RIKEN, Japan, in 1997 and was an Assistant Professor in the School of Electrical and Electronics Engineering, Chungbuk National University from 1997 to 2000. He is currently an Associate Professor of Computer Science at Pohang University of Science and Technology, Korea. His primary research interests include statistical machine learning, probabilistic graphical models, Bayesian learning, kernel machines, manifold learning, independent component analysis, and pattern recognition.
