GCHL: A grid-clustering algorithm for high-dimensional very large spatial data bases

https://doi.org/10.1016/j.patrec.2004.09.052

Abstract

Spatial clustering, which groups similar spatial objects into classes, is an important component of spatial data mining [Han and Kamber, Data Mining: Concepts and Techniques, 2000]. Because of its many applications in various areas, spatial clustering has been a highly active topic in data mining research, and many scalable clustering methods have been developed recently. These spatial clustering methods can be classified into four categories: partitioning methods, hierarchical methods, density-based methods and grid-based methods.

Clustering large data sets of high dimensionality has always been a serious challenge for clustering algorithms. Many recently developed clustering algorithms address either data sets with a very large number of records or data sets with a very high number of dimensions.

The new clustering method GCHL (a Grid-Clustering algorithm for High-dimensional very Large spatial databases) combines a novel density-grid based clustering with an axis-parallel partitioning strategy to identify areas of high density in the input data space. The algorithm works equally well in the feature space of any data set. The method operates on a limited memory buffer and requires at most a single scan through the data. We demonstrate the high quality of the obtained clustering solutions, their capability of discovering concave/deeper and convex/higher regions, their robustness to outliers and noise, and the excellent scalability of GCHL.

Introduction

Data mining, or knowledge discovery in databases (KDD), is the technique of analyzing data to discover previously unknown information. The goal is to reveal regularities and relationships that are non-trivial. This is accomplished through an analysis of the patterns that form in the data. Various algorithms have been developed to perform this analysis, but many of these techniques are not scalable to very large databases.

Spatial data mining differs from regular data mining in ways that parallel the differences between nonspatial data and spatial data. The attributes of a spatial object stored in a database may be affected by the attributes of the spatial neighbors of that object. In addition, spatial location, and implicit information about the location of an object, may be exactly the information that can be extracted through spatial data mining (Fayad and Usama, 1996).

Clustering is one of the most useful tasks in the data mining process for discovering groups and identifying interesting distributions and patterns in the underlying data. The main concern in the clustering process is therefore to reveal the organization of patterns into “sensible” groups, which allows us to discover similarities and differences, as well as to derive useful inferences about them (Gutha et al., 1999).

Recently, a number of different clustering algorithms have been suggested in data mining (Agrawal et al., 1998, Ester et al., 1996, Guha et al., 1998, Karypis et al., 1999, Ng and Han, 1994, Wang et al., 1999, Zhang et al., 1996, Estivill-Castro et al., 2000). They differ in their capabilities, applicability and computational requirements. No particular clustering method has been shown to be superior to all its competitors in all aspects. Typically, the problem is that clusters identified with one method cannot be detected by other methods. Even a set of points declared noise by one method might be identified as a cluster by another. We believe that this is due to two reasons. First, methods aim to be applicable in very general metric spaces. Second, methods adhere strongly to one clustering philosophy, based on a hierarchical approach, a density approach or a nearest neighbor approach. The proposed clustering algorithm, named GCHL, is a hybrid of different methods. Its salient features are as follows:

  • 1. GCHL is efficient and scalable when handling a large number of data objects with limited memory.

  • 2. It is able to identify irregular shapes, including those with concave sections and nested shapes.

  • 3. Its clustering mechanism is insensitive to noise.

  • 4. The GCHL algorithm is not sensitive to the order of input. That is, the clustering results are independent of data order.

  • 5. The algorithm handles data with a large number of dimensions, that is, high dimensionality.

Over time, a number of clustering algorithms have been developed. Some of these are evolutionary, enhancing previously developed work; others are revolutionary, introducing new concepts and methods. In this paper we improve on previous work and introduce a new algorithm.

Related works

Clustering of very large high dimensional data sets is an important problem. A number of different clustering algorithms are applicable to very large data sets, and a few address high dimensional data.

Clustering algorithms can be divided into partitioning, hierarchical, density-based and grid-based algorithms. Some of these algorithms are described briefly here.

PAM (Partitioning around Medoids) (Kaufman and Rousseeuw, 1990) uses k-clustering on medoids to identify clusters. It

Comparing the effectiveness

The clustering algorithms presented in Section 2 are useful to the geographical and medical communities and can be categorized into four branches: partitioning-based, hierarchical-based, density-based and grid-based.

Partitioning methods such as k-means and k-medoids use a technique called iterative reallocation to improve the clustering quality from an initial solution. As such methods tend to find clusters that are of spherical shape and similar in size, they are more
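Iterative reallocation, as used by k-means, can be illustrated with a minimal sketch. This is a generic textbook k-means under stated assumptions, not code from any of the cited papers: the function names, the squared Euclidean distance and the caller-supplied initial centres are all choices made for this illustration.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points: Sequence[Point]) -> Point:
    """Coordinate-wise mean of a non-empty sequence of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points: Sequence[Point],
            initial_centres: Sequence[Point],
            max_iterations: int = 100) -> Tuple[List[int], List[Point]]:
    """Improve an initial solution by repeatedly reallocating points.

    Each pass assigns every point to its nearest centre, then moves each
    centre to the mean of its members. The loop stops when no point
    changes cluster or when max_iterations is reached.
    """
    centres = list(initial_centres)
    assignment: List[int] = []
    for _ in range(max_iterations):
        new_assignment = [
            min(range(len(centres)),
                key=lambda j: squared_distance(p, centres[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no point was reallocated, so the solution is stable
        assignment = new_assignment
        for j in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = mean(members)
    return assignment, centres
```

Because every point is attached to its nearest centre, the resulting clusters are bounded by straight (hyperplane) boundaries around the centres, which is one way to see why such methods favour compact, roughly spherical groups.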

GCHL: the algorithm

This section presents some necessary definitions and the GCHL algorithm.

The salient features of the GCHL

  • 1. GCHL scans the data only once. If the amount of data is large, it divides the input data into parts and analyzes them part by part, reading in the next part only after finishing with the current one. It can therefore handle a large amount of data, and with this partitioning technique it can manage the clustering process with a limited amount of memory.

  • 2. Most partitioning methods cluster objects based on the distance between objects. Such methods can find only spherical-shaped clusters and encounter
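The part-by-part single scan described in item 1 can be sketched as follows. This is only an illustration under stated assumptions, not the procedure from the paper: the names `read_in_parts`, `cell_of` and `count_cells`, the fixed `part_size` buffer and the uniform `cell_width` are all hypothetical.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List, Sequence, Tuple

Point = Sequence[float]
Cell = Tuple[int, ...]


def read_in_parts(points: Iterable[Point], part_size: int) -> Iterator[List[Point]]:
    """Yield successive parts of at most part_size points.

    Only one part is held in memory at a time (the assumed buffer).
    """
    iterator = iter(points)
    while True:
        part = list(islice(iterator, part_size))
        if not part:
            return
        yield part


def cell_of(point: Point, cell_width: float) -> Cell:
    """Integer index of the axis-parallel grid cell that contains the point."""
    return tuple(int(x // cell_width) for x in point)


def count_cells(points: Iterable[Point], part_size: int, cell_width: float) -> Counter:
    """Scan the data once, part by part, and count the points in each cell."""
    counts: Counter = Counter()
    for part in read_in_parts(points, part_size):
        counts.update(cell_of(p, cell_width) for p in part)
    return counts
```

In this sketch the memory in use is one part plus the counts of the non-empty cells, so it grows with the number of occupied cells rather than with the number of points. The per-cell counts are sums, which makes them independent of the order in which points arrive.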

Complexity

The total time complexity of the GCHL algorithm is between O(N · ρ · d) and O(ρ · d · N · log N), where the input data set contains N d-dimensional data points fitted into ρ blocks.

Empirical results

GCHL, like DBSCAN and other density-based methods, can discover clusters of arbitrary shapes. Those methods, however, mostly use index-based techniques, whose efficiency breaks down when the number of dimensions is high. GCHL clustering instead uses a grid data structure. It quantizes the space into a finite number of cubic cells and performs all of the clustering operations on those cells.
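A minimal sketch of clustering on a grid of cubic cells is given below. It shows a generic density-grid scheme, not the GCHL procedure itself: the density threshold `min_points`, the rule that dense cells sharing a face belong to the same cluster, and all function names are assumptions made for this illustration.

```python
from collections import Counter, deque
from typing import Dict, Iterable, Iterator, Sequence, Tuple

Cell = Tuple[int, ...]


def quantize(points: Iterable[Sequence[float]], cell_width: float) -> Counter:
    """Count how many points fall into each cubic cell of side cell_width."""
    return Counter(tuple(int(x // cell_width) for x in p) for p in points)


def face_neighbours(cell: Cell) -> Iterator[Cell]:
    """Cells that differ from `cell` by one step along exactly one axis."""
    for axis in range(len(cell)):
        for step in (-1, 1):
            yield cell[:axis] + (cell[axis] + step,) + cell[axis + 1:]


def cluster_dense_cells(counts: Counter, min_points: int) -> Dict[Cell, int]:
    """Label connected groups of dense cells with cluster ids 0, 1, 2, ...

    Cells holding fewer than min_points points are treated as noise and
    receive no label.
    """
    dense = {cell for cell, n in counts.items() if n >= min_points}
    labels: Dict[Cell, int] = {}
    next_label = 0
    for start in sorted(dense):  # sorted only to make the labels deterministic
        if start in labels:
            continue
        labels[start] = next_label
        queue = deque([start])
        while queue:  # breadth-first search over adjacent dense cells
            cell = queue.popleft()
            for neighbour in face_neighbours(cell):
                if neighbour in dense and neighbour not in labels:
                    labels[neighbour] = next_label
                    queue.append(neighbour)
        next_label += 1
    return labels
```

All work after quantization is done on cells rather than on points, and each dense cell inspects only its 2d face neighbours, so no spatial index over the points is needed. Chains of adjacent dense cells can follow non-convex shapes, which is how a grid scheme of this kind can recover arbitrarily shaped clusters.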

This section illustrates the general behavior of GCHL and the correctness of its solutions. The first series of tests were carried

Conclusion

In this paper, we propose a new clustering algorithm, GCHL (a Grid-Clustering algorithm for High-dimensional very Large spatial databases), capable of efficiently and effectively clustering large high dimensional data sets. It relies on a novel active sampling approach and uses a grid axis-parallel partitioning scheme to identify the dense regions in the input data space. GCHL has good accuracy and scalability, is robust to noise, automatically detects the number of clusters in the data, and can

References (24)

  • Garai, G., et al., 2004. A novel genetic algorithm for automatic clustering. Pattern Recognition Lett.
  • Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P., 1998. Automatic subspace clustering of high dimensional data for...
  • Aho, A.V., et al., 1974. The design and analysis of computer algorithms.
  • Alevizos, P., Boutsinas, B., Tasoulis, D., Vrahatis, M.N., 2002. Improving the K-windows clustering algorithm. In:...
  • Blanken, H., Ijbema, A., Meek, P., Van Den Akker, B., 1990. The generalized grid file: Description and performance...
  • Ester, M., Kriegel, M.-P., Sander, J., Xu, X., 1996. A density-based algorithm for discovering clusters in large...
  • Estivill-Castro, V., Lee, I., 2000. AMOEBA: Hierarchical clustering based on spatial proximity using Delaunay diagram....
  • Fayad et al., 1996. Advances in Knowledge Discovery in Databases.
  • Goil, S., Nagesh, H., Choudhary, A., 1999. MAFIA: Efficient and scalable subspace clustering for very large data sets....
  • Guha, S., Rastogi, R., Shim, K., 1998. CURE: An efficient clustering algorithm for large databases. In: Proc. ACM...
  • Gutha, S., Rastogi, R., Shim, K., 1999. ROCK: A robust clustering algorithm for categorical attributes. In: Proc. IEEE...
  • Han, J., et al., 2000. Data Mining: Concepts and Techniques.