ABSTRACT
Data volumes continue to grow rapidly, which makes large-scale processing an important issue in document clustering, a task valued for its ability to organize large numbers of documents into a few meaningful and consistent clusters. In this study, a dataset of 390 English textbooks with a total size of 7.61 GB was used for the clustering task. Locality-sensitive hashing and k-shingling were used to obtain high-quality clusters, which were then evaluated using cluster validity indices. In the experiments, the resulting clusters achieved a Silhouette score of 0.88 and a Davies–Bouldin score of 0.79.
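The abstract names k-shingles and locality-sensitive hashing but does not describe the pipeline. The following is a minimal sketch of one common way to combine them, assuming character k-shingles, MinHash signatures, and banded LSH. The function names and the parameters `k`, `num_hashes`, and `bands` are illustrative choices, not the authors' configuration.

```python
import hashlib
from collections import defaultdict
from itertools import combinations


def shingles(text, k=5):
    """Return the set of character k-shingles of a whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}


def jaccard(a, b):
    """Exact Jaccard similarity between two shingle sets."""
    return len(a & b) / len(a | b)


def _hash(shingle, seed):
    # Seeded 64-bit hash; each seed acts as one hash function (assumed scheme).
    data = f"{seed}:{shingle}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(shingle_set, num_hashes=64):
    """MinHash signature: the minimum hash value under each seeded function."""
    return [min(_hash(s, seed) for s in shingle_set) for seed in range(num_hashes)]


def lsh_candidate_pairs(signatures, bands=16):
    """Split signatures into bands; documents sharing any band become candidates."""
    rows = len(next(iter(signatures.values()))) // bands
    pairs = set()
    for b in range(bands):
        buckets = defaultdict(list)
        for doc_id, sig in signatures.items():
            buckets[tuple(sig[b * rows:(b + 1) * rows])].append(doc_id)
        for ids in buckets.values():
            pairs.update(combinations(sorted(ids), 2))
    return pairs


if __name__ == "__main__":
    # Toy documents, made up for illustration only.
    docs = {
        "a": "locality sensitive hashing groups similar documents into buckets",
        "b": "locality sensitive hashing groups similar documents into the same buckets",
        "c": "the silhouette index measures how compact and separated clusters are",
    }
    sets = {d: shingles(t) for d, t in docs.items()}
    sigs = {d: minhash_signature(s) for d, s in sets.items()}
    for x, y in sorted(lsh_candidate_pairs(sigs)):
        print(x, y, round(jaccard(sets[x], sets[y]), 2))
```

Only the candidate pairs returned by the banding step need an exact similarity check, which is what makes this kind of approach attractive for multi-gigabyte corpora. Using more bands with fewer rows each raises recall at the cost of more false candidates.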
Index Terms
- Locality Sensitive Hashing Based Clustering for Large Scale Documents