Mining massive datasets by an unsupervised parallel clustering on a GRID: Novel algorithms and case study

https://doi.org/10.1016/j.future.2011.01.002

Abstract

This paper proposes three novel parallel clustering algorithms based on Kohonen’s SOM, aiming to preserve the topology of the original dataset for a meaningful visualization of the results and for discovering associations between features of the dataset by topological operations over the clusters. In all these algorithms the data to be clustered are subdivided among the nodes of a GRID. In the first two algorithms each node executes an on-line SOM, whereas in the third algorithm the nodes execute a quasi-batch SOM called MANTRA. The algorithms differ in how the weights computed by the slave nodes are recombined by a master to launch the next epoch of the SOM in the nodes. A proof outline demonstrates the convergence of the proposed parallel SOMs and indicates how to select the learning rate so as to outperform both the sequential SOM and the parallel SOMs available in the literature. A bioinformatics case study illustrates that our parallel SOM obtains meaningful clusters in massive data mining applications at a fraction of the time needed by the sequential SOM, and that the resulting classification supports fruitful knowledge extraction from massive datasets.

Highlights

► A novel parallel clustering method is proposed for accelerating the Self Organizing Map (SOM) and preserving the topology of the original dataset.
► A proof outline demonstrates the convergence of the proposed parallel SOM.
► Heuristics are provided to maximize the speed-up and to minimize the topological error.
► Simulations show that the parallel SOM outperforms the currently available methods for parallelizing the SOM.
► A case study in bioinformatics illustrates how the method assists in discovering associations between data features from the topological properties of the clusters.

Introduction

The main concern of clustering algorithms is to group large amounts of data, in a reasonable time, into meaningful classes that may be disjoint, overlapping, or organized in some hierarchical fashion [1]. A “meaningful class” is one characterized by maximum similarity among its items and minimum similarity between its items and those belonging to other classes; the similarity metric may vary with the application domain. Cluster analysis has been used in numerous fields, ranging from machine learning, artificial intelligence and pattern recognition to web mining, spatial database analysis, text mining and image segmentation [2]. Recently, clustering algorithms have been used extensively in life sciences such as genomics and proteomics (e.g., [3], [4]).

All the mentioned application fields have in common the exponentially increasing number of records to be analyzed, a high number of features per record, and a need for interactivity that allows users to visualize and analyze the dataset from different perspectives.

Currently, if one only needs to cluster a dataset efficiently, many scalable clustering techniques are available, such as BIRCH [5] or bEMADS [6], which can process massive datasets efficiently even on a normal PC. Alternatively, to save time, one could resort to parallelizing the clustering technique most suitable for the problem at hand. For example, if one is interested in clustering large high-density data, a parallelization of the original DBSCAN [7] or of its recent improved version [8] could be used.

However, the need is often to cluster a massive dataset while preserving its original topology. This allows an efficient visualization of the clusters and the extraction, by topological operations, of cluster properties that can be referred back to the original dataset. In this case a parallelization of either the Self Organizing feature Maps (SOM) by Kohonen et al. [9], Kaski et al. [10] and Oja et al. [11], or the Generative Topographic Mapping (GTM) by Bishop and Svensen [12] would be appropriate, as pointed out in the study on topographic maps and data visualization carried out by [13], where a thorough comparison between the SOM, GTM based methods and some variants of the K-Means is presented. Another method able to cluster a dataset while preserving its topology is the Growing Neural Gas (GNG), which can also match the temporal distribution of the original dataset [14], but it is too computationally demanding.

Although the GTM and its extensions, as well as the extended K-Means, show a very low topographic error,1 the SOM is significantly faster [15] and is thus a suitable target of parallelization efforts for dealing with massive datasets.

The Parallel SOMs (PSOMs) proposed in the literature fall into two classes: (1) parallelizations of the original SOM, known as the on-line SOM, mainly implemented on ad-hoc hardware (e.g., [16]), and (2) parallelizations of an approximation of the original SOM, the so-called batch SOM, implemented on loosely coupled distributed computing systems, which preserves the original dataset topology less faithfully [17].

The paper presents a novel approach to the efficient parallelization of the on-line SOM on a loosely coupled distributed computing system. This approach can be used not only to improve the on-line SOM processing time, but also to improve the performance of other SOM based techniques, namely: (1) the ESOM, an Extended version of the SOM [18], [19], which better preserves the topology of the original dataset and can solve complex optimization problems; (2) the Growing SOM (GSOM), which achieves the same topological accuracy as GTM methods and has been widely studied in this decade (e.g., [20], [21], [22]); and (3) the Relative Density SOM (ReDSOM), recently proposed by [23] to model high density datasets with variable topology, which may achieve the same performance as the GNG.

The proposed PSOMs will be compared with the original SOM, but not with techniques that do not preserve the dataset’s topology (e.g., DBSCAN), nor with techniques that preserve the topology but are much slower than the SOM (e.g., GTM). Moreover, we will show that the moderate differences between the classifications obtained by the original SOM and by our PSOM have little influence on the associations between features that one can derive from the obtained clusters by topological operations. This makes the method particularly suitable for data mining purposes.

In particular, the SOM algorithm to be parallelized is briefly described in Section 2. Section 3 reviews the literature on parallel SOMs to point out the limits of the existing algorithms. A novel parallel clustering method that overcomes these limits is proposed, in three variants, in Section 4, where its suitability for parallelizing other SOM based techniques characterized by a very low topographic error (i.e., GSOM and ReDSOM) is also pointed out. In all these variants the dataset is partitioned among the nodes of a GRID, or of any loosely coupled computing system. Section 5 outlines the proof of convergence of the proposed clustering method and gives some measures of its topological properties. The time performance is analyzed in Section 6 to demonstrate that all three variants are scalable and converge, with a precision of about 85%–90%, to the same clusters obtainable by the sequential SOM. Section 6 also presents a knowledge discovery methodology based on the proposed PSOMs that is able to discover about 85% of the associations one can extract with the same methodology based on the sequential SOM. Finally, in Section 7, we first describe how the GRID environment is used to execute the proposed SOMs, and then discuss a bioinformatics case study to demonstrate the effectiveness of the PSOMs in discovering feature associations in a realistic case.

Section snippets

Kohonen SOM and cluster analysis

A SOM consists of a neural net with two layers: an input layer, which receives the values of the input vectors (one vector at a time), and an output layer of nout neurons, used to identify the class to which the current input vector belongs. The two layers are connected by a synaptic weight matrix, usually denoted by W. The class of an input vector X belonging to a set X, with components Xi (for i = 1 to nin), is given by the output neuron most activated by it, i.e., Maxj [Yj = Σi Wji Xi], for j = 1 to nout.
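As a concrete illustration, the winner-selection rule just stated can be sketched in a few lines of plain Python (a minimal sketch: the weight matrix and input values below are illustrative, not taken from the paper):

```python
# Sketch of the SOM class assignment: the class of an input vector is the
# output neuron j maximizing the activation Y_j = sum_i W_ji * X_i.

def winning_neuron(W, x):
    """Return the index of the output neuron most activated by input x."""
    activations = [sum(w_ji * x_i for w_ji, x_i in zip(row, x)) for row in W]
    return max(range(len(activations)), key=lambda j: activations[j])

# Illustrative example: 3 output neurons (rows of W), 2-dimensional inputs.
W = [[0.9, 0.1],   # weights of output neuron 0
     [0.1, 0.9],   # weights of output neuron 1
     [0.5, 0.5]]   # weights of output neuron 2
x = [1.0, 0.0]
print(winning_neuron(W, x))  # neuron 0 has the highest activation (0.9)
```

During training, the winning neuron and its neighbourhood have their weights moved towards the input, which is what makes the final map topology-preserving.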

The SOM

Parallel SOMs

The algorithms available in the literature to parallelize the SOM can be grouped into two main categories: (a) network partitioning and (b) data partitioning methods.

The network partitioning approach, proposed in [26], subdivides the computational effort among different Computing Elements (CEs), by assigning to each CE a portion of the output nodes. Fig. 2(a) shows the case in which each CE deals with only one output node. In this case, for any input, each CEj receives all the input components, but
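The network-partitioning idea can be sketched as follows: each CE evaluates only its own slice of the output neurons against the full input vector, and a reduction across the CEs' local winners yields the global winner. This is a minimal illustrative sketch, not the implementation of [26]; the CE split and the values are invented for the example.

```python
# Network partitioning sketch: each CE owns a contiguous slice of output
# neurons and proposes its locally best neuron; the global winner is found
# by comparing the CEs' candidates (here a simple max over tuples).

def local_winner(W_slice, x, offset):
    """A CE's step: return (activation, global index) of its best neuron."""
    best = max(range(len(W_slice)),
               key=lambda j: sum(w * xi for w, xi in zip(W_slice[j], x)))
    activation = sum(w * xi for w, xi in zip(W_slice[best], x))
    return activation, offset + best

# Four output neurons split across two CEs (two neurons each).
W = [[0.2, 0.8], [0.6, 0.4], [0.9, 0.1], [0.3, 0.7]]
x = [1.0, 0.0]
candidates = [local_winner(W[0:2], x, 0),   # CE 0 owns neurons 0-1
              local_winner(W[2:4], x, 2)]   # CE 1 owns neurons 2-3
print(max(candidates)[1])  # global winner: neuron 2 (activation 0.9)
```

Note the communication pattern this implies: every input vector must reach every CE, which is precisely the overhead the data-partitioning approach avoids.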

A novel parallel clustering method: the Weight Recombining (WR) PSOM

To parallelize Kohonen’s algorithm we propose an approach based on data partitioning and Weight Recombining (WR), executed in a master–slave computing environment, which greatly decreases the communication overhead. Fig. 3 shows the architecture of the master and the computation flow of the WR PSOM. In particular, the input dataset is subdivided by the dividing unit into NP blocks, which are delivered to the NP slaves. Each slave has the task of clustering the received block of data.
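The partition-and-recombine flow can be sketched as below. This is only a schematic illustration: plain element-wise averaging stands in for the recombination step, since the paper's three variants differ precisely in how the master recombines the slave weights.

```python
# WR PSOM flow sketch: the master splits the dataset into NP blocks, each
# slave runs an SOM epoch on its block, and the master recombines the
# slaves' weight matrices before launching the next epoch.

def partition(data, n_blocks):
    """Master's dividing unit: split the dataset into roughly equal blocks."""
    size = (len(data) + n_blocks - 1) // n_blocks
    return [data[i:i + size] for i in range(0, len(data), size)]

def recombine(weight_matrices):
    """Master's recombining step, here a simple element-wise average of the
    slaves' weight matrices (an illustrative stand-in for the WR variants)."""
    n = len(weight_matrices)
    rows, cols = len(weight_matrices[0]), len(weight_matrices[0][0])
    return [[sum(W[r][c] for W in weight_matrices) / n for c in range(cols)]
            for r in range(rows)]

blocks = partition(list(range(10)), 3)
print([len(b) for b in blocks])                      # [4, 4, 2]
print(recombine([[[1.0, 2.0]], [[3.0, 4.0]]]))       # [[2.0, 3.0]]
```

Only the (small) weight matrices travel between master and slaves at epoch boundaries, rather than every input vector, which is where the communication savings over network partitioning come from.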

Convergence and topological properties of the WR PSOM

Let Kk, for k = 1 to NP, be a set of NP mono-dimensional SOM networks having the same number of nf or nd inputs and nc output neurons, each dedicated to classifying a dataset Dk into nc classes. The main aim of this section is to demonstrate that the weight matrix computed by the master using the parallel algorithms proposed in the previous section converges towards a weight matrix W = [V1 … VNP] very close to the weight matrix WS = [VS1 … VSNP] computed by an on-line sequential SOM to classify the dataset D

Performance evaluation

The parallel algorithms presented in the previous section were conceived for use in a GRID infrastructure. Their performance has been evaluated first by simulating the GRID on a sequential machine, before testing them in practice. In particular, this section first points out the precision achievable by the WR PSOMs according to the above definitions of precision and distortion, then the processing time to carry out one epoch and the number of epochs needed to reach the final clustering by both

Case study

In this case study we automatically search for associations between genes and diseases using the scientific abstracts freely available in PubMed. These associations are extracted from the clusters obtained by the WR WM PSOM and are then compared with those derived from the clusters obtained by a sequential SOM.

In detail, we started by searching PubMed for all the abstracts containing the phrase “genetic disease” and selecting a collection of 110,000 abstracts. Then we performed a

Concluding remarks

In this paper we have proposed a novel parallel SOM based on Weight Recombination (WR) and discussed three variants of the method. The WR PSOM aims at preserving the dataset topology to support a satisfactory visualization of the results and to discover relevant associations between the data features by using topological operations over the clusters obtained. The WR PSOMs presented use a “horizontal” data partitioning scheme where each computational element of a GRID performs an on-line SOM or

Alberto Faro received the Laurea in Nuclear Engineering from Politecnico of Milan in 1971. He is full Professor of Artificial Intelligence at the University of Catania. His research interests include knowledge discovery in distributed and parallel systems, signal and image processing, and bioinformatics.

References (44)

  • C. Bishop et al., Developments of the generative topographic mapping, Neurocomputing (1998)
  • V. Fiolet et al., A clustering method to distribute a database on a GRID, Future Generation Computer Systems (2007)
  • P. Rousseeuw, Silhouettes: a graphical aid to the interpretation and validation of cluster analysis, Journal of Computational and Applied Mathematics (1987)
  • Y. Zhao et al., Data clustering in life science, Molecular Biotechnology (2005)
  • R. Xu et al., Survey of clustering algorithms, IEEE Transactions on Neural Networks (2005)
  • J.T.L. Wang, Data Mining in Bioinformatics (2005)
  • J. Augen, Bioinformatics in the Post-Genomic Era (2005)
  • T. Zhang et al., BIRCH: a new data clustering algorithm and its applications, Data Mining and Knowledge Discovery (1997)
  • H. Jin et al., Scalable model-based clustering for large databases based on data summarization, IEEE Transactions on Pattern Analysis and Machine Intelligence (2005)
  • D. Arlia, M. Coppola, Experiments in parallel clustering with DBSCAN, in: Proc. EuroPar 2001, ...
  • J. Tan et al., An improved clustering algorithm based on density distribution function, Computer and Information Science (2010)
  • T. Kohonen, Self Organizing Maps (1995)
  • S. Kaski et al., Bibliography of self organizing map (SOM) papers: 1981–1997, Neural Computing Surveys (1998)
  • M. Oja et al., Bibliography of self organizing map (SOM) papers: 1998–2001 addendum, Neural Computing Surveys (2003)
  • M. Pena et al., Topology-preserving mappings for data visualisation
  • I.J. Sledge et al., Growing neural gas for temporal clustering
  • E. Pampalk, Limitations of the SOM and the GTM, 2001, ...
  • C. Garcia, M. Prieto, A. Pascual-Montano, A speculative parallel algorithm for self organizing maps, in: Proceedings of ...
  • M. Cottrell et al., Advantages and drawbacks of the batch Kohonen algorithm
  • H. Jin et al., Expanding self-organizing map for data visualization and cluster analysis, Information Sciences (2004)
  • M. Morchren, A. Ultsch, ESOM Maps, Technical Report 45, Dept. CS, University of Marburg, Germany, ...
  • D. Alahakoon et al., Dynamic self-organizing maps with controlled growth for knowledge discovery, IEEE Transactions on Neural Networks (2000)

    Daniela Giordano received the Laurea in Electronic Engineering (1990) from the University of Catania, where she is associate Professor of Information Systems, and the Ph.D. in Educational Technology (1998) from Concordia University, Montreal. Her research deals with knowledge-based systems, computer vision and learning environments.

    Francesco Maiorana received the Laurea in Electronic Engineering from the University of Catania in 1990, and the M.S. in Computer Science from New York University in 1993. He is a researcher in bioinformatics at the ICT-E1 Project of the Catania Municipality. His research interests include data mining, image processing and grid computing.
