Locality Sensitive Hashing Based Clustering for Large Scale Documents

Authors:
Kevser Özdem, Gazi University, Turkey
M. Ali Akcayol, Gazi University, Turkey

ICMAI '21: Proceedings of the 2021 6th International Conference on Mathematics and Artificial Intelligence
March 2021
Pages 137–142
https://doi.org/10.1145/3460569.3460590

Published: 31 August 2021

ABSTRACT

The volume of data continues to grow rapidly. Large-scale processing has therefore become an important issue in document clustering, because clustering can organize large numbers of documents into a few meaningful and consistent clusters. In this study, a dataset of 390 English textbooks with a total size of 7.61 GB was used for the clustering task. Locality sensitive hashing and k-shingles were used to obtain high-quality clusters, and the clusters were evaluated using cluster validity indices. According to the experimental results, the clusters achieved a Silhouette score of 0.88 and a Davies–Bouldin score of 0.79.
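
The abstract names k-shingles and locality sensitive hashing but does not give implementation details, so the following is only a minimal sketch of how the two are commonly combined, not the authors' pipeline. It assumes character-level shingles, MinHash signatures, and the banding form of LSH; the function names and the values `k=5`, `num_hashes=64`, and `bands=16` are illustrative choices that do not come from the paper.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of character k-shingles of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    if len(text) < k:
        return {text}
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity of two sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def _hash(shingle, seed):
    """Stable 64-bit hash of a shingle; each seed acts as a separate hash function."""
    digest = hashlib.blake2b(
        shingle.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def minhash_signature(shingle_set, num_hashes=64):
    """MinHash signature: the minimum hash value of the set under each hash function."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=16):
    """Split each signature into bands; documents that agree on every row
    of at least one band land in the same bucket and become a candidate pair."""
    rows = len(next(iter(signatures.values()))) // bands
    buckets = defaultdict(set)
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets[key].add(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    # Hypothetical toy documents, standing in for the textbook corpus.
    docs = {
        "a": "Locality sensitive hashing groups similar documents together.",
        "b": "Locality sensitive hashing groups  similar documents together.",
        "c": "An unrelated sentence about cooking pasta at home.",
    }
    sets = {doc_id: shingles(text) for doc_id, text in docs.items()}
    sigs = {doc_id: minhash_signature(s) for doc_id, s in sets.items()}
    print(lsh_candidate_pairs(sigs))
```

Candidate pairs produced this way would normally be verified with an exact or estimated Jaccard similarity before being merged into clusters. More rows per band makes the filter stricter, while more bands makes it more permissive.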

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
        1. Cluster analysis
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Index terms have been assigned to the content through auto-classification.


Published in

ICMAI '21: Proceedings of the 2021 6th International Conference on Mathematics and Artificial Intelligence
March 2021
142 pages
ISBN: 9781450389464
DOI: 10.1145/3460569

Copyright © 2021 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
k-shingles
large-scale document clustering
locality sensitive hashing
Qualifiers
- research-article
- Research
- Refereed limited