Tolerance Rough Set Theory Based Data Summarization for Clustering Large Datasets

  • Conference paper
Transactions on Rough Sets XIV

Part of the book series: Lecture Notes in Computer Science (TRS, volume 6600)

Abstract

Finding clusters in large datasets is a challenge in many fields of science and technology. Many clustering methods have been developed successfully over the years. However, most existing clustering methods need multiple scans of the data to converge. Therefore, these methods cannot be applied to cluster analysis of large datasets. Data summarization can be used as a pre-processing step to speed up classical clustering methods on large datasets. In this paper, we propose a data summarization scheme based on tolerance rough set theory, termed rough bubble. Rough bubble uses the leaders clustering method to collect sufficient statistics of the dataset, which can then be used to cluster the dataset. We show that the proposed summarization scheme outperforms the recently introduced data bubble summarization scheme when the agglomerative hierarchical clustering (single-link) method is applied to it. We also introduce a technique to reduce the number of distance computations required by the leaders clustering method. Experiments conducted with synthetic and real-world datasets show the effectiveness of our methods for large datasets.
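
As an illustration of the summarization step mentioned in the abstract, the following is a minimal sketch of the classical single-scan leaders clustering method, extended to accumulate per-leader statistics. It is not the rough bubble scheme itself, whose tolerance rough set details are not given here. The function name `leaders_summarize`, the threshold parameter `tau`, the Euclidean distance, and the choice of statistics (count, linear sum, sum of squared norms) are all assumptions made for this example.

```python
import math


def leaders_summarize(points, tau):
    """Single-scan leaders clustering with per-leader statistics.

    Assumption: each point joins the first leader found within distance
    tau; otherwise it becomes a new leader. The statistics kept here
    (n, linear_sum, square_sum) are illustrative, not the paper's own.
    """
    summaries = []
    for x in points:
        for s in summaries:
            if math.dist(x, s["leader"]) <= tau:
                # Fold the point into the existing leader's statistics.
                s["n"] += 1
                s["linear_sum"] = [a + b for a, b in zip(s["linear_sum"], x)]
                s["square_sum"] += sum(v * v for v in x)
                break
        else:
            # No leader is close enough: x starts a new group.
            summaries.append({
                "leader": tuple(x),
                "n": 1,
                "linear_sum": list(x),
                "square_sum": sum(v * v for v in x),
            })
    return summaries


if __name__ == "__main__":
    data = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.0, 10.5), (0.0, 0.4)]
    for s in leaders_summarize(data, tau=1.0):
        print(s["leader"], s["n"], s["linear_sum"], s["square_sum"])
```

Because the dataset is read only once, the cost is one pass over the points times the number of leaders. A hierarchical method such as single-link can then work on the much smaller set of summaries instead of the raw data. The result depends on the scan order and on `tau`, which is a known property of leaders clustering.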

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Patra, B.K., Nandi, S. (2011). Tolerance Rough Set Theory Based Data Summarization for Clustering Large Datasets. In: Peters, J.F., et al. Transactions on Rough Sets XIV. Lecture Notes in Computer Science, vol 6600. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-21563-6_8

  • DOI: https://doi.org/10.1007/978-3-642-21563-6_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-21562-9

  • Online ISBN: 978-3-642-21563-6

  • eBook Packages: Computer Science, Computer Science (R0)
