DOI: 10.1145/3460569.3460590
research-article

Locality Sensitive Hashing Based Clustering for Large Scale Documents

Published: 31 August 2021

ABSTRACT

Data volumes keep growing rapidly, which makes large-scale processing an important issue in document clustering, a task valued for its ability to organize large numbers of documents into a few meaningful and consistent clusters. In this study, a dataset of 390 English textbooks with a total size of 7.61 GB is used for the clustering task. Locality sensitive hashing and k-shingles are used to obtain high-quality clusters, which are then evaluated using cluster validity indices. According to the experimental results, high-quality clusters were obtained, with a Silhouette score of 0.88 and a Davies–Bouldin score of 0.79.
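
The abstract names k-shingles and locality sensitive hashing but gives no implementation details, so the following is only a minimal sketch of how the two are commonly combined. It assumes character-level shingles, MinHash signatures, and a banding scheme. None of these choices, nor the parameter values (k = 5, 64 hash functions, 16 bands) or the function names, are taken from the paper.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of character k-shingles of a whitespace-normalised text."""
    text = " ".join(text.lower().split())
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def _hash(shingle, seed):
    """Deterministic 64-bit hash of a shingle; the seed selects the hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=64):
    """MinHash signature: the minimum hash value under each seeded hash function."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=16):
    """Return pairs of document ids whose signatures agree on at least one band.

    `signatures` maps a document id to its MinHash signature. Signature
    positions beyond bands * rows are ignored if the length is not divisible.
    """
    num_hashes = len(next(iter(signatures.values())))
    rows = num_hashes // bands
    candidates = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            candidates.update(combinations(sorted(ids), 2))
    return candidates


def jaccard(a, b):
    """Exact Jaccard similarity of two shingle sets, used to verify candidates."""
    return len(a & b) / len(a | b)


# Hypothetical toy corpus (not the 390-textbook dataset from the study).
docs = {
    "a": "locality sensitive hashing groups similar documents into buckets",
    "b": "locality sensitive hashing groups similar documents into buckets",
    "c": "the silhouette index measures how well separated clusters are",
}
shingle_sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
signatures = {doc_id: minhash_signature(s) for doc_id, s in shingle_sets.items()}
pairs = lsh_candidate_pairs(signatures)
```

Documents that share a bucket in any band become candidate pairs, so only those pairs need an exact similarity check. How candidates are then merged into clusters is a separate design choice that the abstract does not specify.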

          Published in

          ICMAI '21: Proceedings of the 2021 6th International Conference on Mathematics and Artificial Intelligence
          March 2021
          142 pages
          ISBN:9781450389464
          DOI:10.1145/3460569

          Copyright © 2021 ACM

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Qualifiers

          • research-article
          • Research
          • Refereed limited