Abstract
Several automatic text summarization (ATS) methods have been proposed for resource-rich languages such as English and Chinese. However, resource-limited languages like Hindi have received very little attention from researchers, and the lack of resources keeps ATS for Hindi a challenging and open problem. An informative summary depends on two things: capturing semantic features and capturing the hidden relationships among text units. In the current work, we propose an ATS model based on the document vector method to explore the semantic relations existing in the document. Moreover, we suggest two algorithms, one for sentence ranking and one for summary generation, that rely on three main characteristics (redundancy, diversity, and compression rate) to create a clear and coherent summary. The proposed model is language-independent apart from some language-specific preprocessing. We evaluate the model on datasets in two languages: literary novels in Hindi and DUC 2007 news articles in English. We apply the ROUGE metric to measure the quality of the generated summaries and compare the proposed model against four baseline methods: TextRank, LexRank, Latent Semantic Analysis (LSA), and the model of Mudasir et al. For very short summaries, at compression rates of 5% and 15%, the overall macro-average F-score of our model (18.5% for Hindi, 26% for English) is higher than that of the baseline approaches. For very long summaries, at a 50% compression rate, our model has the highest macro-average values among all the compared methods: 18% for the Hindi novels and 25% for the English news articles. The result analysis indicates that the proposed model outperforms all the baselines and generates diverse, minimally redundant, semantically rich, and compressed text summaries.
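The interplay of sentence ranking, redundancy control, and compression rate described above can be illustrated with a minimal sketch. It is not the authors' algorithm: plain bag-of-words counts stand in for learned document vector embeddings, and the names `summarize`, `compression_rate`, and `redundancy_threshold`, the whitespace tokenizer, and the default values are all assumptions made for illustration. Here the compression rate is taken to be the fraction of sentences kept.

```python
import math
from collections import Counter

PUNCT = ".,;:!?\"'()"


def tokenize(text):
    """Simplistic whitespace tokenizer (an assumption, not the paper's preprocessing)."""
    words = (w.strip(PUNCT) for w in text.lower().split())
    return [w for w in words if w]


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def summarize(sentences, compression_rate=0.5, redundancy_threshold=0.7):
    """Rank sentences against a document vector, then pick non-redundant ones.

    Bag-of-words counts are a stand-in for learned embeddings; both
    parameters are hypothetical and only illustrate the idea.
    """
    vectors = [Counter(tokenize(s)) for s in sentences]

    # Document vector: here simply the sum of all sentence vectors.
    doc_vector = Counter()
    for v in vectors:
        doc_vector.update(v)

    # Number of sentences allowed by the compression rate (at least one).
    budget = max(1, int(compression_rate * len(sentences)))

    # Sentence ranking: most similar to the document vector first.
    ranked = sorted(
        range(len(sentences)),
        key=lambda i: cosine(vectors[i], doc_vector),
        reverse=True,
    )

    # Summary generation: skip sentences too similar to ones already chosen.
    chosen = []
    for i in ranked:
        if len(chosen) == budget:
            break
        if all(cosine(vectors[i], vectors[j]) < redundancy_threshold for j in chosen):
            chosen.append(i)

    # Restore the original document order for readability.
    return [sentences[i] for i in sorted(chosen)]


if __name__ == "__main__":
    doc = [
        "the cat sat on the mat",
        "the cat sat on the mat today",
        "dogs bark loudly at night",
        "the mat is red",
    ]
    print(summarize(doc, compression_rate=0.5))
```

In this toy document the second sentence ranks highly but is skipped as a near-duplicate of the first, which is the kind of redundancy filtering the abstract refers to. A real system would replace the count vectors with trained embeddings and tune the threshold on held-out data.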
References
Nenkova A, Maskey S, Liu Y (2011) Automatic summarization. In: Proceedings of the 49th annual meeting of the association for computational linguistics: Tutorial abstracts of ACL 2011, p. 3
Luhn HP (1958) The automatic creation of literature abstracts. IBM J Res Dev 2(2):159–165
Evans DA, Klavans JL, McKeown K (2004) Columbia newsblaster: Multilingual news summarization on the web. In: Demonstration Papers at HLT-NAACL 2004, pp. 1–4
Shi Z et al. (2007) Question answering summarization of multiple biomedical documents. In: Conference of the Canadian Society for Computational Studies of Intelligence, pp. 284–295
Ganesan K, Zhai C, Han J (2010) Opinosis: a graph based approach to abstractive summarization of highly redundant opinions
Ku L-W, Liang Y-T, Chen H-H (2006) Opinion extraction, summarization and tracking in news and blog corpora. In: Proceedings of AAAI, pp. 100–107
Wu Z et al (2017) A topic modeling based approach to novel document automatic summarization. Expert Syst Appl 84:12–23
Ceylan H (2011) Investigating the extractive summarization of literary novels. University of North Texas
Gambhir M, Gupta V (2017) Recent automatic text summarization techniques: a survey. Artif Intell Rev 47(1):1–66
Aggarwal CC (2018) Text Summarization. In: Machine Learning for Text pp. 361–380. Springer
Nallapati R, Zhou B, Gulcehre C, Xiang B (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023
Oufaida H, Blache P, Nouali O (2015) Using Distributed Word Representations and mRMR Discriminant Analysis for Multilingual Text Summarization. In: International Conference on Applications of Natural Language to Information Systems, pp. 51–63
Waheeb SA, Khan NA, Chen B, Shang X (2020) Multidocument Arabic text summarization based on clustering and Word2Vec to reduce redundancy. Information 11(2):59
Radev DR et al. (2004) MEAD-a platform for multidocument multilingual text summarization
Kaljahi R, Foster J, Roturier J (2014) Semantic role labelling with minimal resources: Experiments with French. In: Proceedings of the Third Joint Conference on Lexical and Computational Semantics (*SEM 2014), pp. 87–92
Kabadjov M, Atkinson M, Steinberger J, Steinberger R, Van Der Goot E (2010) NewsGist: a multilingual statistical news summarizer. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pp. 591–594
Gong Y, Liu X (2001) Generic text summarization using relevance measure and latent semantic analysis. In: Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, pp. 19–25
Mihalcea R, Tarau P (2004) TextRank: bringing order into text. In: Proceedings of the 2004 conference on empirical methods in natural language processing
Zhong S, Liu Y, Li B, Long J (2015) Query-oriented unsupervised multi-document summarization via deep learning model. Expert Syst Appl 42(21):8146–8155
Kågebäck M, Mogren O, Tahmasebi N, Dubhashi D (2014) Extractive summarization using continuous vector space models. In: Proceedings of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC), pp. 31–39
Rush AM, Chopra S, Weston J (2015) A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685
Chopra S, Auli M, Rush AM (2016) Abstractive sentence summarization with attentive recurrent neural networks. In: Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 93–98
Gu J, Lu Z, Li H, Li VOK (2016) Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393
Paulus R, Xiong C, Socher R (2017) A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304
Ma S, Sun X, Li W, Li S, Li W, Ren X (2018) Query and output: Generating words by querying distributed word representations for paraphrase generation. arXiv preprint arXiv:1803.01465
Dong Y (2018) A survey on neural network-based summarization methods. arXiv preprint arXiv:1804.04589
Rani R, Lobiyal DK (2021) A weighted word embedding based approach for extractive text summarization. Expert Syst Appl 186:115867
Jain A, Bhatia D, Thakur MK (2017) Extractive text summarization using word vector embedding. In: 2017 International Conference on Machine Learning and Data Science (MLDS), pp. 51–55
Mohd M, Jan R, Shah M (2020) Text document summarization using word embedding. Expert Syst Appl 143:112958
Hailu TT, Yu J, Fantaye TG (2020) A framework for word embedding based automatic text summarization and evaluation. Information 11(2):78
Franciscus N, Wang J, Stantic B (2019) Mining summary of short text with centroid similarity distance. In: International Conference on Advanced Data Mining and Applications, pp. 447–461
Rani R, Lobiyal DK (2021) An extractive text summarization approach using tagged-LDA based topic modeling. Multimed Tools Appl 80(3):3275–3305
Liu C-Y, Chen M-S, Tseng C-Y (2015) Incrests: towards real-time incremental short text summarization on comment streams from social network services. IEEE Trans Knowl Data Eng 27(11):2986–3000
Ma T, Wang H, Zhao Y, Tian Y, Al-Nabhan N (n.d.) Topic-based automatic summarization algorithm for Chinese short text
Mihalcea R, Ceylan H (2007) Explorations in automatic book summarization. In: Proceedings of the 2007 joint conference on empirical methods in natural language processing and computational natural language learning (EMNLP-CoNLL)
Ceylan H, Mihalcea R (2009) The decomposition of human-written book summaries. In: International Conference on Intelligent Text Processing and Computational Linguistics, pp. 582–593
Kazantseva A, Szpakowicz S (2010) Summarizing short stories. Comput Linguist 36(1):71–109
Bamman D, Smith NA (2013) New alignment methods for discriminative book summarization. arXiv preprint arXiv:1305.1319
Le Q, Mikolov T (2014) Distributed representations of sentences and documents. In: International conference on machine learning, pp. 1188–1196
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
Blogwriter (n.d.) Munshi Premchand’s Stories. [Online]. Available: http://premchand.kahaani.org/. Accessed 29 Mar 2019
Rani R, Lobiyal DK (2018) Automatic construction of generic stop words list for Hindi text. In: Procedia Computer Science Elsevier Journal, pp. 1–7
Rani R, Lobiyal DK (2020) Performance evaluation of text-mining models with Hindi Stopwords lists. J King Saud Univ Inf Sci
Rani R, Lobiyal DK (2018) Social choice theory based domain specific Hindi stop words list construction and its application in text mining. In: International Conference on Intelligent Human Computer Interaction, pp. 123–135
Wikipedia (2019) Premchand. [Online]. Available: https://en.wikipedia.org/wiki/Premchand. Accessed 29 Mar 2019
Voorhees E, Graff D (2008) AQUAINT-2 information-retrieval text: research collection. Linguistic Data Consortium
Hartigan JA, Wong MA (1979) Algorithm AS 136: a k-means clustering algorithm. J R Stat Soc Ser C (Appl Stat) 28(1):100–108
Kulkarni AR, Apte MSS (2002) An automatic text summarization using feature terms for relevance measure
Ferreira R et al (2013) Assessing sentence scoring techniques for extractive text summarization. Expert Syst Appl 40(14):5755–5764
Bhat IK, Mohd M, Hashmy R (2018) Sumitup: A hybrid single-document text summarizer. In: Soft computing: Theories and applications, pp. 619–634. Springer
Mohd M et al. (2016) Sumdoc: a unified approach for automatic text summarization. In: Proceedings of fifth international conference on soft computing for problem solving, pp. 333–343
Edmundson HP, Wyllys RE (1961) Automatic abstracting and indexing—survey and recommendations. Commun ACM 4(5):226–234
McCreadie R, Macdonald C, Ounis I (2018) Automatic ground truth expansion for timeline evaluation. In: The 41st international ACM SIGIR conference on research & development in information retrieval, pp. 685–694
Zechner K (1996) Fast generation of abstracts from general domain text corpora by extracting relevant sentences. In: Proceedings of the 16th conference on Computational linguistics vol. 2, pp. 986–989
Radev DR (2000) Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In: Proc ACL/NAACL Workshop on Summarization, Seattle, WA, 2000
Aguilar J, Salazar C, Velasco H, Monsalve-Pulido J, Montoya E (2020) Comparison and evaluation of different methods for the feature extraction from educational contents. Computation 8(2):30
Pakhira MK (2014) A linear time-complexity k-means algorithm using cluster shifting. In: 2014 International Conference on Computational Intelligence and Communication Networks, pp. 1047–1051
Brainy (n.d.) Brainy questions. [Online]. Available: https://brainly.in/subject/hindi. Accessed 27 Mar 2019
Ozsoy MG, Alpaslan FN, Cicekli I (2011) Text summarization using latent semantic analysis. J Inf Sci 37(4):405–417
Erkan G, Radev DR (2004) LexRank: graph-based lexical centrality as salience in text summarization. J Artif Intell Res 22:457–479
Lin C-Y (2004) ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Rani, R., Lobiyal, D.K. Document vector embedding based extractive text summarization system for Hindi and English text. Appl Intell 52, 9353–9372 (2022). https://doi.org/10.1007/s10489-021-02871-9
DOI: https://doi.org/10.1007/s10489-021-02871-9