Tolerance Rough Set Theory Based Data Summarization for Clustering Large Datasets

Patra, Bidyut Kr.; Nandi, Sukumar

doi:10.1007/978-3-642-21563-6_8

Part of the book series: Lecture Notes in Computer Science (TRS, volume 6600)

Abstract

Finding clusters in large datasets is a challenge in many fields of science and technology. Many clustering methods have been developed successfully over the years. However, most existing clustering methods need multiple scans of the data to converge, so they cannot be applied to cluster analysis of large datasets. Data summarization can be used as a pre-processing step to speed up classical clustering methods on large datasets. In this paper, we propose a data summarization scheme based on tolerance rough set theory, termed rough bubble. Rough bubble uses the leaders clustering method to collect sufficient statistics of the dataset, which can then be used to cluster the dataset. We show that the proposed summarization scheme outperforms the recently introduced data bubble summarization scheme when the agglomerative hierarchical (single-link) clustering method is applied to it. We also introduce a technique to reduce the number of distance computations required by the leaders clustering method. Experiments conducted on synthetic and real-world datasets show the effectiveness of our methods for large datasets.
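
As a rough illustration of the summarization idea, the sketch below runs the classical leaders method in a single scan and accumulates per-leader statistics. It is not the authors' rough bubble implementation: the abstract does not say which statistics rough bubble stores or how the tolerance relation is used, so the function name `leaders_summarize`, the threshold `tau`, the Euclidean distance, and the choice of statistics (count, linear sum, sum of squares) are all assumptions made for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Point = Sequence[float]


def euclidean(a: Point, b: Point) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def leaders_summarize(
    data: Sequence[Point], tau: float
) -> Tuple[List[Tuple[float, ...]], List[Dict]]:
    """Single scan of the data (hypothetical helper, not from the paper).

    Each point joins the first leader found within distance tau;
    if no such leader exists, the point becomes a new leader.
    For every leader we keep assumed sufficient statistics:
    n (count), ls (per-dimension linear sum), ss (sum of squared norms).
    """
    leaders: List[Tuple[float, ...]] = []
    stats: List[Dict] = []
    for p in data:
        for i, leader in enumerate(leaders):
            if euclidean(p, leader) <= tau:
                s = stats[i]
                s["n"] += 1
                s["ls"] = [a + b for a, b in zip(s["ls"], p)]
                s["ss"] += sum(x * x for x in p)
                break
        else:
            # No leader within tau: p starts a new summary unit.
            leaders.append(tuple(p))
            stats.append({"n": 1, "ls": list(p), "ss": sum(x * x for x in p)})
    return leaders, stats


if __name__ == "__main__":
    points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.2, 10.0), (0.0, 0.4)]
    leaders, stats = leaders_summarize(points, tau=1.0)
    for leader, s in zip(leaders, stats):
        centroid = [v / s["n"] for v in s["ls"]]
        print(leader, s["n"], centroid)
```

A clustering method such as single-link can then operate on the much smaller set of leaders and their statistics instead of the full dataset. Note that the result of the leaders method depends on the scan order of the data and on the choice of `tau`.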

Author information

Authors and Affiliations

Department of Computer Science & Engineering, Indian Institute of Technology Guwahati, Guwahati, Assam, 781039, India
Bidyut Kr. Patra & Sukumar Nandi

Editor information

Editors and Affiliations

University of Manitoba, R3T 5V6, Winnipeg, MB, Canada
James F. Peters
University of Warsaw, 02-097, Warsaw, Poland
Andrzej Skowron & Dominik Slezak
Kyushu Institute of Technology, 804, Tobata, Kitakyushu, Japan
Hiroshi Sakai
Jadavpur University, Kolkata, Indian Statistical Institute, Kolkata, India
Mihir Kumar Chakraborty
Cairo University, Orman, Giza, Egypt
Aboul Ella Hassanien
Zhangzhou Normal University, Fujian, China
William Zhu

Cite this paper

Patra, B.K., Nandi, S. (2011). Tolerance Rough Set Theory Based Data Summarization for Clustering Large Datasets. In: Peters, J.F., et al. Transactions on Rough Sets XIV. Lecture Notes in Computer Science, vol 6600. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21563-6_8

DOI: https://doi.org/10.1007/978-3-642-21563-6_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-21562-9
Online ISBN: 978-3-642-21563-6
eBook Packages: Computer Science, Computer Science (R0)
