Text mining: identification of similarity of text documents using hybrid similarity model

Prasad, K. M. Shiva

doi:10.1007/s42044-022-00127-4

Research
Published: 03 December 2022

Volume 6, pages 123–135, (2023)
Iran Journal of Computer Science

K. M. Shiva Prasad¹

Abstract

The volume of data accessible on the internet has increased dramatically. This growth will only accelerate as more devices that produce data exhaust are connected to the network. Part of these data consists of documents from various sources. As data from digital sources increase, it becomes difficult to identify the relevant information that is most needed for further use. The goal of this research is to present a hybrid similarity algorithm that uses text summarization techniques to identify papers that are similar in terms of both semantic and contextual similarity. Some of these methods aim to quantify the corpus’s polysemy quotient using deep learning with numerous layers and prebuilt Natural Language Processing (NLP) models to determine document similarity. In comparison with other conventional algorithms, the experimental results of our model showed an accuracy of 76.25%.
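The abstract does not spell out how the hybrid score is computed, so the following is only a minimal sketch of the general idea of blending two similarity measures into one score. It assumes a weighted average of TF-IDF cosine similarity and Jaccard token overlap; the function names, the `alpha` weight, and the choice of these two measures are illustrative assumptions, not the author's model, and the sketch does not cover the summarization or deep learning components mentioned above.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (a dict) per tokenized document."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF so terms present in every document keep a small weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def jaccard(a, b):
    """Share of distinct tokens that two token lists have in common."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def hybrid_similarity(docs, i, j, alpha=0.5):
    """Blend TF-IDF cosine and Jaccard overlap for documents i and j.

    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    tokens = [tokenize(d) for d in docs]
    vectors = tfidf_vectors(tokens)
    lexical = cosine(vectors[i], vectors[j])
    overlap = jaccard(tokens[i], tokens[j])
    return alpha * lexical + (1 - alpha) * overlap


docs = [
    "Text summarization of news articles",
    "Summarization of text documents",
    "Stock prices fell sharply today",
]
related = hybrid_similarity(docs, 0, 1)
unrelated = hybrid_similarity(docs, 0, 2)
```

In this toy corpus the first two documents share vocabulary and the third shares none with the first, so `related` is larger than `unrelated`. A purely lexical blend like this cannot capture polysemy or synonyms; a semantic component such as embedding-based similarity would be needed for that.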

Author information

Authors and Affiliations

RYMEC, Ballari, Affiliated to VTU, Belagavi, Karnataka, India
K. M. Shiva Prasad

Contributions

The entire contribution to this research paper was made by Dr. SPKM. The work addresses the identification of relevant information in large datasets. It presents adaptive text summarization and the identification of relevant text documents, based on keywords specified by the user, using a hybrid similarity model. The tasks were completed with a hybrid similarity algorithm that uses text summarization techniques to identify publications that are comparable in terms of semantic and contextual similarity. To measure similarity between texts and to attempt to assign a quantifiable number to the corpus's polysemy quotient, several of these strategies use deep learning with numerous layers and prebuilt NLP models. The experiment was evaluated against various traditional algorithms, and our model provided the best accuracy in comparison with them. Future work should take polysemy and synonyms into account, which can help users identify the most relevant documents for their requirements.

Corresponding author

Correspondence to K. M. Shiva Prasad.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Prasad, K.M.S. Text mining: identification of similarity of text documents using hybrid similarity model. Iran J Comput Sci 6, 123–135 (2023). https://doi.org/10.1007/s42044-022-00127-4

Received: 26 September 2022
Accepted: 21 November 2022
Published: 03 December 2022
Issue Date: June 2023
DOI: https://doi.org/10.1007/s42044-022-00127-4
