Abstract
Finding clusters in large datasets is a challenge in many fields of science and technology. Many clustering methods have been developed over the years. However, most existing methods require multiple scans of the data to converge, which makes them impractical for cluster analysis of large datasets. Data summarization can be used as a pre-processing step to speed up classical clustering methods on such datasets. In this paper, we propose a data summarization scheme based on tolerance rough set theory, termed rough bubble. Rough bubble uses the leaders clustering method to collect sufficient statistics of the dataset, which can then be used to cluster it. We show that the proposed scheme outperforms the recently introduced data bubble summarization scheme when the agglomerative hierarchical (single-link) clustering method is applied to it. We also introduce a technique to reduce the number of distance computations required by the leaders clustering method. Experiments with synthetic and real-world datasets show the effectiveness of our methods for large datasets.
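To make the summarization idea concrete, the following is a minimal sketch of generic single-scan leaders clustering that accumulates per-leader statistics. It is not the rough bubble scheme itself: the function name `leaders_summary`, the threshold parameter `tau`, the Euclidean distance, and the choice of statistics (count, linear sum, sum of squared norms) are all assumptions made for illustration, since the abstract does not specify them.

```python
import math


def leaders_summary(points, tau):
    """Scan the data once, assigning each point to the first leader within tau.

    Hypothetical sketch: the statistics kept per leader (n, ls, ss) are an
    assumed choice, not necessarily those used by rough bubble.
    """
    summaries = []  # one dict per leader
    for p in points:
        for s in summaries:
            # The point joins the first leader found within distance tau.
            if math.dist(p, s["leader"]) <= tau:
                s["n"] += 1
                s["ls"] = [a + b for a, b in zip(s["ls"], p)]
                s["ss"] += sum(x * x for x in p)
                break
        else:
            # No leader is close enough, so the point becomes a new leader.
            summaries.append({
                "leader": tuple(p),
                "n": 1,
                "ls": list(p),
                "ss": sum(x * x for x in p),
            })
    return summaries
```

A later clustering step could then operate on the list of summaries instead of the raw points, which is what makes a single scan of the data sufficient in this sketch. Note that the result depends on the order in which points are scanned.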
Copyright information
© 2011 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Patra, B.K., Nandi, S. (2011). Tolerance Rough Set Theory Based Data Summarization for Clustering Large Datasets. In: Peters, J.F., et al. Transactions on Rough Sets XIV. Lecture Notes in Computer Science, vol 6600. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21563-6_8
DOI: https://doi.org/10.1007/978-3-642-21563-6_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21562-9
Online ISBN: 978-3-642-21563-6
eBook Packages: Computer Science, Computer Science (R0)