Abstract
Clustering is a widely used knowledge discovery technique. It helps uncover structures in data that were not previously known. The clustering of large data sets has received a lot of attention in recent years; however, clustering is still a challenging task, since many published algorithms fail to scale well with the size of the data set and the number of dimensions that describe the points, to find clusters of arbitrary shape, or to deal effectively with the presence of noise. In this paper, we present a new clustering algorithm based on the self-similarity properties of data sets. Self-similarity is the property of being invariant with respect to the scale used to look at the data set. While fractals are self-similar at every scale, many data sets exhibit self-similarity only over a range of scales. Self-similarity can be measured using the fractal dimension. The new algorithm, which we call Fractal Clustering (FC), places points incrementally in the cluster for which the change in the fractal dimension after adding the point is the least. This is a very natural way of clustering points, since points in the same cluster have a great degree of self-similarity among them (and much less self-similarity with respect to points in other clusters). FC requires one scan of the data, is suspendable at will, providing the best answer possible at that point, and is incremental. We show via experiments that FC deals effectively with large data sets, high dimensionality, and noise, and is capable of recognizing clusters of arbitrary shape.
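The placement rule described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the fractal dimension is estimated by box counting on points scaled to the unit hypercube, and it recomputes the estimate from scratch for every candidate cluster rather than maintaining box counts incrementally. The function names `box_counting_dimension` and `assign_point`, the grid resolutions in `scales`, and the two toy clusters are all hypothetical choices made for illustration.

```python
import math


def box_counting_dimension(points, scales=(2, 4, 8, 16)):
    """Estimate the box-counting dimension of points in the unit hypercube.

    For each grid resolution k (k cells per axis) count the occupied cells
    N(k), then take the least-squares slope of log N(k) against log k.
    """
    xs, ys = [], []
    for k in scales:
        # Map every point to the grid cell that contains it.
        occupied = {tuple(min(int(c * k), k - 1) for c in p) for p in points}
        xs.append(math.log(k))
        ys.append(math.log(len(occupied)))
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


def assign_point(clusters, point):
    """Return the index of the cluster whose estimated dimension changes least."""
    best_index, best_change = None, float("inf")
    for i, members in enumerate(clusters):
        before = box_counting_dimension(members)
        after = box_counting_dimension(members + [point])
        change = abs(after - before)
        if change < best_change:
            best_index, best_change = i, change
    return best_index


# Toy data (assumed): one horizontal and one vertical line segment.
horizontal = [((i + 0.5) / 256, 0.1) for i in range(256)]
vertical = [(0.9, (i + 0.5) / 256) for i in range(256)]

# A new point lying on the horizontal segment.
chosen = assign_point([horizontal, vertical], (0.3, 0.1))
print("assigned to cluster", chosen)
```

In this toy setting the new point falls into grid cells the horizontal segment already occupies, so that cluster's estimate is unchanged, while adding it to the vertical segment occupies a new cell at every resolution and shifts the estimate. A single-scan version would keep the per-resolution cell counts for each cluster and update them as points arrive instead of recomputing them.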
Cite this article
Barbará, D., Chen, P. Using Self-Similarity to Cluster Large Data Sets. Data Mining and Knowledge Discovery 7, 123–152 (2003). https://doi.org/10.1023/A:1022493416690
DOI: https://doi.org/10.1023/A:1022493416690