Abstract
The volume of data accessible on the internet has increased dramatically, and this growth will only accelerate as more data-generating devices are connected to the network. Part of this data consists of documents from various sources. As the data from these digital sources grows, it becomes harder to identify the relevant information that is needed for further use. The goal of this research is to present a hybrid similarity algorithm that uses text summarization techniques to identify papers that are similar in terms of both semantic and contextual similarity. Some of these techniques use multi-layer deep learning and prebuilt Natural Language Processing (NLP) models to determine document similarity and to attempt to quantify the corpus's polysemy quotient. In experiments comparing our model with conventional algorithms, it achieved an accuracy of 76.25%.
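To make the idea of a summarize-then-compare pipeline concrete, the following is a minimal sketch and not the model evaluated in the article. It assumes a simple frequency-based extractive summarizer and blends two lexical measures (cosine similarity over term counts and Jaccard overlap) as stand-ins for the semantic and contextual components; the deep learning and prebuilt NLP models mentioned above are not reproduced. All function names and the `alpha` weight are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def summarize(text, max_sentences=2):
    """Frequency-based extractive summary (an assumed, simplified summarizer)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))

    def score(sentence):
        # Average in-document frequency of the words in this sentence.
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Keep the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two term-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(tokens_a, tokens_b):
    """Share of distinct terms that the two token lists have in common."""
    a, b = set(tokens_a), set(tokens_b)
    return len(a & b) / len(a | b) if (a | b) else 0.0


def hybrid_similarity(doc_a, doc_b, alpha=0.5, max_sentences=2):
    """Summarize both documents, then blend the two scores.

    alpha is a hypothetical mixing weight, not a value from the article.
    """
    tokens_a = tokenize(summarize(doc_a, max_sentences))
    tokens_b = tokenize(summarize(doc_b, max_sentences))
    return (alpha * cosine_similarity(tokens_a, tokens_b)
            + (1 - alpha) * jaccard_similarity(tokens_a, tokens_b))


if __name__ == "__main__":
    doc_1 = "Text summarization shortens documents. Summarization keeps key text."
    doc_2 = "Summarization of text keeps the key ideas. Documents become shorter."
    print(round(hybrid_similarity(doc_1, doc_2), 3))
```

The score stays in the range 0 to 1 because both component measures do and `alpha` is a convex weight. Replacing either component with an embedding-based measure would bring the sketch closer to the semantic comparison the abstract describes.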
Author information
Contributions
Dr. SPKM carried out all of the work in this research paper, which addresses the identification of relevant information in large datasets. The work presents adaptive text summarization and the identification of relevant text documents, based on keywords specified by the user, using a hybrid similarity model. The tasks were completed using a hybrid similarity algorithm that uses text summarization techniques to identify publications that are comparable in terms of semantic and contextual similarity. To measure similarity between texts and to attempt to assign a quantifiable number to the corpus's polysemy quotient, several of these strategies use multi-layer deep learning and prebuilt NLP models. The experiment was evaluated against various traditional algorithms, and our model provided the best accuracy in comparison with them. Future work should consider polysemy and synonyms, which can help users identify the documents most relevant to their requirements.
Ethics declarations
Conflict of interest
The authors declare no competing interests.
About this article
Cite this article
Prasad, K.M.S. Text mining: identification of similarity of text documents using hybrid similarity model. Iran J Comput Sci 6, 123–135 (2023). https://doi.org/10.1007/s42044-022-00127-4
DOI: https://doi.org/10.1007/s42044-022-00127-4