Text mining: identification of similarity of text documents using hybrid similarity model

  • Research
Iran Journal of Computer Science

Abstract

The volume of data accessible on the internet has increased dramatically, and it is expected to keep growing exponentially as more data-generating devices are connected to the network. Part of this data consists of documents from various sources. As data from digital sources increases, it becomes difficult to identify the relevant information that is needed for further use. The goal of this research is to present a hybrid similarity algorithm that uses text summarization techniques to identify documents that are similar in terms of both semantic and contextual similarity. Some of these methods aim to quantify the corpus’s polysemy quotient using deep learning with numerous layers and prebuilt Natural Language Processing (NLP) models to determine document similarity. In comparison with other conventional algorithms, the experimental results of our model showed an accuracy of 76.25%.
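The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of summarizing first and then combining two similarity scores. Everything in it is an assumption made for illustration: the frequency-based extractive summarizer, the small hand-written `SYNONYMS` table standing in for a prebuilt NLP model, the weight `alpha`, and all function names. It is not the article's method and does not reproduce its reported accuracy.

```python
import math
import re
from collections import Counter

# Hypothetical synonym table standing in for a real semantic model.
# The abstract mentions prebuilt NLP models; none is used in this sketch.
SYNONYMS = {"car": "vehicle", "automobile": "vehicle", "quick": "fast", "rapid": "fast"}

STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on", "it", "was"}


def tokenize(text):
    """Return lowercase word tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(text, k=2):
    """Extractive summary: keep the k sentences with the highest summed term frequency."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenize(text))
    ranked = sorted(sentences, key=lambda s: sum(freq[w] for w in tokenize(s)), reverse=True)
    top = set(ranked[:k])
    # Emit the selected sentences in their original order.
    return " ".join(s for s in sentences if s in top)


def cosine(a, b):
    """Cosine similarity between two Counter term vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def hybrid_similarity(doc_a, doc_b, alpha=0.5, k=2):
    """Weighted mix of a lexical score and a synonym-normalized score on the summaries."""
    tok_a = tokenize(summarize(doc_a, k))
    tok_b = tokenize(summarize(doc_b, k))
    lexical = cosine(Counter(tok_a), Counter(tok_b))
    concept = cosine(Counter(SYNONYMS.get(t, t) for t in tok_a),
                     Counter(SYNONYMS.get(t, t) for t in tok_b))
    return alpha * lexical + (1 - alpha) * concept
```

With `alpha=1.0` the score reduces to plain term overlap, so two sentences that share no surface words score zero; lowering `alpha` lets the synonym-normalized component credit texts that express the same concepts in different words. A real system would replace the synonym table with learned embeddings or a similar model.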

Author information

Contributions

Dr. SPKM carried out the overall work in this research paper, which addresses the identification of relevant information in large datasets. The work presents adaptive text summarization and the identification of relevant text documents, based on keywords specified by the user, using a hybrid similarity model. The hybrid similarity algorithm uses text summarization techniques to identify publications that are comparable in terms of semantic and contextual similarity. To measure similarity between texts and to assign a quantifiable number to the corpus's polysemy quotient, several of these strategies use deep learning with numerous layers and prebuilt NLP models. The experiment was evaluated against various traditional algorithms, and the model provided the best accuracy among them. Future work should take polysemy and synonyms into account, which can help users identify the most relevant documents for their requirements.

Corresponding author

Correspondence to K. M. Shiva Prasad.

Ethics declarations

Conflict of interest

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Prasad, K.M.S. Text mining: identification of similarity of text documents using hybrid similarity model. Iran J Comput Sci 6, 123–135 (2023). https://doi.org/10.1007/s42044-022-00127-4

  • DOI: https://doi.org/10.1007/s42044-022-00127-4
