Using Self-Similarity to Cluster Large Data Sets

Barbará, Daniel; Chen, Ping

doi:10.1023/A:1022493416690

Published: April 2003

Data Mining and Knowledge Discovery, Volume 7, pages 123–152 (2003)

Abstract

Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, to find clusters of arbitrary shape, or to deal effectively with the presence of noise. In this paper, we present a new clustering algorithm based on self-similarity properties of the data sets. Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale used to look at them, many data sets exhibit self-similarity over a range of scales. Self-similarity can be measured using the fractal dimension. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC effectively deals with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
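
The placement rule described in the abstract can be illustrated with a small sketch. The abstract does not say which fractal dimension FC uses or how it is estimated, so everything below is an assumption made for illustration: the dimension is estimated by box counting on points already scaled to the unit hypercube, the grid scales are arbitrary powers of two, and the function names `box_counting_dimension` and `assign_point` are hypothetical. The sketch also recomputes each estimate from scratch, which is simpler but far less efficient than a one-scan incremental method would need to be.

```python
import math


def box_counting_dimension(points, scales=(1, 2, 4, 8, 16, 32)):
    """Estimate the box-counting dimension of points lying in [0, 1]^d.

    At each scale s the unit cube is cut into s cells per axis and the
    occupied cells are counted. The estimate is the least-squares slope
    of log(occupied cells) against log(s). Assumes `points` is non-empty
    and at least two scales are given.
    """
    log_s, log_n = [], []
    for s in scales:
        # Map every point to the integer index of the grid cell holding it;
        # min(...) keeps a coordinate of exactly 1.0 inside the last cell.
        occupied = {tuple(min(int(c * s), s - 1) for c in p) for p in points}
        log_s.append(math.log(s))
        log_n.append(math.log(len(occupied)))
    mean_s = sum(log_s) / len(log_s)
    mean_n = sum(log_n) / len(log_n)
    sxx = sum((x - mean_s) ** 2 for x in log_s)
    sxy = sum((x - mean_s) * (y - mean_n) for x, y in zip(log_s, log_n))
    return sxy / sxx


def assign_point(clusters, point, scales=(1, 2, 4, 8, 16, 32)):
    """Add `point` to the cluster whose estimated dimension changes least.

    `clusters` is a non-empty list of non-empty lists of coordinate tuples
    (an assumed, already initialized clustering). Returns the index of the
    chosen cluster and the absolute change in its estimated dimension.
    """
    best_index, best_change = None, math.inf
    for index, members in enumerate(clusters):
        before = box_counting_dimension(members, scales)
        after = box_counting_dimension(members + [point], scales)
        change = abs(after - before)
        if change < best_change:
            best_index, best_change = index, change
    clusters[best_index].append(point)
    return best_index, best_change
```

Called with two initial clusters, for example the points of a horizontal segment and the points of a vertical segment, `assign_point` places a new point lying on the horizontal segment into the first cluster, because adding it occupies no new grid cells there and so leaves that cluster's estimated dimension unchanged.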

Author information

Authors and Affiliations

ISE Department, George Mason University, Fairfax, MSN 4A4, Virginia, 22030, USA
Daniel Barbará
Computer and Mathematical Science Department, University of Houston-Downtown, One Main Street, Houston, TX, 77002, USA
Ping Chen

Cite this article

Barbará, D., Chen, P. Using Self-Similarity to Cluster Large Data Sets. Data Mining and Knowledge Discovery 7, 123–152 (2003). https://doi.org/10.1023/A:1022493416690
