Document vector embedding based extractive text summarization system for Hindi and English text

Applied Intelligence

Abstract

Nowadays, several automatic text summarization (ATS) methods have been proposed for resource-rich languages such as English and Chinese. However, resource-limited languages like Hindi have received very little attention from researchers, and the lack of resources still makes ATS for Hindi a challenging and open problem. Capturing semantic features and hidden relationships among text units are the two main characteristics of an informative summary. In the current work, we propose an ATS model based on the document vector method to explore the semantic relations existing in the document. Moreover, we suggest two algorithms, one for sentence ranking and one for summary generation, based on three main characteristics (redundancy, diversity, and compression rate) to create a clear and coherent summary. The proposed model is language-independent apart from some language-specific preprocessing. We evaluate the model on datasets in two languages: literary novels in Hindi and DUC 2007 news articles in English. We use the ROUGE metric to measure the quality of the generated summaries and compare the proposed model against four baseline methods: TextRank, LexRank, Latent Semantic Analysis (LSA), and the model of Mudasir et al. For very short summaries at 5% and 15% compression rates, the overall macro-average F-score of our model (18.5% for Hindi, 26% for English) is higher than that of the baseline approaches. For very long summaries at a 50% compression rate, our model achieves the highest macro-average values among all compared methods: 18% for the Hindi novels and 25% for the English news articles. From the result analysis, we conclude that the proposed model outperforms all the baselines and generates diverse, minimally redundant, semantically rich, and compressed text summaries.
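The abstract does not spell out the two algorithms, so the following is only a generic sketch of embedding-based extractive summarization, not the authors' method. It assumes plain term-count vectors as a stand-in for learned document vectors, ranks sentences by cosine similarity to the document centroid, skips sentences that are too similar to ones already chosen, and interprets the compression rate as the fraction of sentences to keep. The names `tokenize`, `embed`, `summarize`, and `redundancy_threshold` are hypothetical.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()।"  # includes the Devanagari danda used in Hindi text


def tokenize(sentence):
    """Whitespace tokenizer; avoids splitting Hindi words at vowel signs."""
    tokens = (t.strip(PUNCT) for t in sentence.lower().split())
    return [t for t in tokens if t]


def embed(sentence, vocab):
    """Stand-in sentence vector (assumption): term counts over the vocabulary."""
    counts = Counter(tokenize(sentence))
    return [float(counts[w]) for w in vocab]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def summarize(sentences, compression_rate=0.5, redundancy_threshold=0.8):
    """Pick sentences close to the document centroid, skipping near-duplicates."""
    vocab = sorted({w for s in sentences for w in tokenize(s)})
    vectors = [embed(s, vocab) for s in sentences]
    if not vectors:
        return []
    # Document centroid: the mean of all sentence vectors.
    centroid = [sum(col) / len(vectors) for col in zip(*vectors)]
    # Rank sentence indices by similarity to the centroid, best first.
    ranked = sorted(range(len(sentences)),
                    key=lambda i: cosine(vectors[i], centroid),
                    reverse=True)
    budget = max(1, round(compression_rate * len(sentences)))
    chosen = []
    for i in ranked:
        if len(chosen) == budget:
            break
        # Diversity check: reject sentences too similar to any chosen one.
        if all(cosine(vectors[i], vectors[j]) < redundancy_threshold
               for j in chosen):
            chosen.append(i)
    # Restore original document order for readability.
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    doc = [
        "the cat sat on the mat",
        "the cat sat on the mat",
        "dogs bark loudly at night",
        "the cat likes the mat",
    ]
    print(summarize(doc, compression_rate=0.5, redundancy_threshold=0.9))
```

In a real system the `embed` step would be replaced by a trained sentence or document embedding model, and the threshold and compression rate would be tuned per dataset.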

Author information

Corresponding author

Correspondence to Ruby Rani.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Rani, R., Lobiyal, D.K. Document vector embedding based extractive text summarization system for Hindi and English text. Appl Intell 52, 9353–9372 (2022). https://doi.org/10.1007/s10489-021-02871-9

  • DOI: https://doi.org/10.1007/s10489-021-02871-9
