Candidate sentence selection for extractive text summarization
Introduction
Today, a tremendous amount of data is available on the Internet. With its proliferation, it has become difficult to efficiently gather the main information from this mass of data. Regarding text documents, it is a complex and exhausting process for human beings to gather and digest the primary information from a huge number of resources in a reasonable time. Fortunately, these processes have been performed automatically by information retrieval methods for decades. However, the growth in the quantity of information causes performance issues such as insufficient solutions and unwieldy applications of information retrieval tasks. The use of high-end machines may reduce the losses caused by these issues, but at a higher cost. As a more suitable alternative, dimension reduction can be applied to the raw data to handle these issues and accelerate these tasks. In the domain of text processing, automatic text summarization is an effective and highly interpretable form of dimension reduction.
Text summarization is the process of generating a brief version of a single document or a set of documents. Automatic summarization of text documents is a challenging problem because it is vital that the resulting summaries cover the basic information of the source document(s) as much as possible. In the literature, this problem has been studied under two principal strategies: abstraction and extraction. Abstractive text summarization imitates the summarization process of humans: people summarize documents by gathering the salient information and reorganizing it in idiosyncratic sentences. Imitating such a process allows more natural, human-like summaries to be generated. In abstractive text summarization, the critical concepts in the source document(s) are determined first, and these concepts are paraphrased by natural language processing tools with regard to the grammatical rules and constraints of the corresponding natural language. This process is highly challenging because of its language dependency and the semantic restrictions on paraphrasing, but it has still been successfully applied on relatively short documents for simple tasks such as title/headline/keyword generation (Lopyrev, 2015, Nallapati, Zhou, dos Santos, Gulcehre, Xiang, 2016, Nasar, Jaffry, Malik, 2019), sentence compression (Knight, Marcu, 2002, Miao, Cao, Li, Guan, 2020, Zajic, Dorr, Lin, Schwartz, 2007), and sentence fusion (Krahmer, Marsi, & van Pelt, 2008). Larger documents (or document sets), on the other hand, have mostly been summarized by the extractive strategy. Here, the salient text units are determined, and the most salient ones are included in the summary according to a compression ratio. In the typical setting, these text units correspond to the sentences in the document(s).
Reflecting the salient sentences into the summary makes this kind of summarization more readable and plausible, since the sentences in system summaries are grammatically correct and semantically proper human-written texts.
Several different approaches have been applied to extractive text summarization. Frequency-based term weighting approaches were among the preliminary studies in this area (Balabantaray, Sahoo, Sahoo, Swain, 2012, García-Hernández, Ledeneva, 2009, Ledeneva, Gelbukh, García-Hernández, 2008). Subsequently, latent semantic analysis (Gong, Liu, 2001, Hachey, Murray, Reitter, et al., 2005, Steinberger, Ježek, 2009), hidden Markov models (Conroy, O'Leary, 2001), and graph-based unsupervised approaches (Aliguliyev, 2006, Fang, Mu, Deng, Wu, 2017, Mihalcea, Tarau, 2004, Wan, 2010) gathered attention. More recently, the task has been cast as an optimization problem, in which the summary sentences that maximize the evaluation metrics are selected for the model summaries. Here, the general sentence-selection approach is to pick the most relevant and least redundant sentences while avoiding conveying similar information as much as possible (Alguliev, Aliguliyev, Hajirahimova, 2012, Alguliev, Aliguliyev, Hajirahimova, Mehdiyev, 2011a, Alguliev, Aliguliyev, Isazade, 2013, Alguliev, Aliguliyev, Mehdiyev, 2011b, Alguliyev, Aliguliyev, Isazade, 2015, Alguliyev, Aliguliyev, Isazade, Abdi, Idris, 2019).
In state-of-the-art approaches, extractive text summarization is related to the classification problem of machine learning. This kind of extractive method aims to assign a salience degree to the sentences in the document and to select the most salient ones. Since it does not involve sentence construction or paraphrasing, it has the advantage of language independence. For this reason, extractive summarization has been gathering more attention in the literature (Allahyari, Pouriyeh, Assefi, Safaei, Trippe, Gutierrez, & Kochut, Nenkova, McKeown, 2012).
In simplified terms, extractive text summarization consists of a sentence scoring step and a sentence selection step. In other words, a salience degree (or summary-worthiness) is determined for each sentence, the sentences are ranked by salience, and the k most salient sentences are selected as summary-worthy. These steps intrinsically associate the problem with classification in machine learning, since determining salience is a learning problem that can ideally be handled by supervised learning.
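The two-step pipeline described above can be sketched in a few lines of Python; the scores here are placeholders standing in for the output of whatever scoring model is used:

```python
def select_summary(sentences, scores, k):
    """Rank sentences by salience and keep the top-k, in document order."""
    # Indices sorted from most to least salient.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Restore document order so the extract stays readable.
    chosen = sorted(ranked[:k])
    return [sentences[i] for i in chosen]

# Hypothetical document and salience scores (illustrative values only).
doc = ["Old sentence one.", "Key finding here.", "Filler text.", "Another key point."]
scores = [0.2, 0.9, 0.1, 0.7]
print(select_summary(doc, scores, 2))  # → ['Key finding here.', 'Another key point.']
```

Whatever model produces the scores, the selection step itself stays this simple; the compression ratio determines k.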
In the literature, sentence scoring for extractive text summarization has been handled by syntactic or semantic approaches. In the syntactic approach, predetermined hand-crafted features are computed for each sentence and considered during scoring (Fattah, Ren, 2009, Ferreira, Lins, Freitas, Cavalcanti, Lima, Simske, Favaro, Others, 2013, Goularte, Nassar, Fileto, Saggion, 2019, Meena, Gopalani, 2014, Mutlu, Sezer, Akcayol, 2019, Mutlu, Sezer, Akcayol, 2020, Oliveira, Ferreira, Lima, Lins, Freitas, Riss, Simske, 2016, Suanmali, Salim, Binwahlan, 2009, Wan, 2010, Wang, Li, Wang, Zheng, 2017). In the semantic approach, on the other hand, the meaning of words/phrases and their semantic relations are taken into account (Chen, Liu, Chen, Wang, 2017, Cheng, Lapata, 2016, Denil, Demiraj, De Freitas, 2015, Mohamed, Oussalah, 2019, Narayan, Cohen, Lapata, 2018, Ren, Chen, Ren, Wei, Nie, Ma, De Rijke, 2018, Yin, Pei, 2015, Zhang, Lapata, Wei, Zhou, 2018). Humans, however, take into account both the meaning and semantic relations of sentences and some structural properties of the text during summarization. To the best of our knowledge, the use of an enhanced feature space that comprehensively handles these two feature types has not gathered remarkable attention yet.
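To make the syntactic approach concrete, the sketch below computes a few commonly used hand-crafted features per sentence: normalized position, relative length, and average term frequency. The exact feature set and normalizations are illustrative, not those of any particular cited study:

```python
from collections import Counter

def syntactic_features(sentences):
    """Toy hand-crafted (syntactic) features for each sentence."""
    all_words = [w.lower() for s in sentences for w in s.split()]
    tf = Counter(all_words)                          # document-level term frequencies
    max_len = max(len(s.split()) for s in sentences)
    feats = []
    for i, s in enumerate(sentences):
        words = [w.lower() for w in s.split()]
        feats.append({
            # Earlier sentences score higher, a common position heuristic.
            "position": 1.0 - i / max(len(sentences) - 1, 1),
            "rel_length": len(words) / max_len,
            "avg_tf": sum(tf[w] for w in words) / len(words),
        })
    return feats

sents = ["The method works.", "It works well in practice.", "We evaluate it."]
for f in syntactic_features(sents):
    print(f)
```

Such vectors, one per sentence, are what a syntactic sentence scorer consumes.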
As one of the few studies that considers syntactic and semantic features together, Cao et al. (2015) analyzed document-dependent and document-independent features for summarization. While the document-independent features corresponded to sentence embeddings, which carry the meaning of sentences, the document-dependent features were syntactic: the position of the sentence, the averaged term frequency of its words, and their averaged cluster frequency. The underlying sentence scoring method was a convolutional neural network, and experiments on a well-known benchmark dataset (DUC2002) showed that a ROUGE-1 (Recall-Oriented Understudy for Gisting Evaluation) recall of 0.366 could be reached.
In Nallapati, Zhai, and Zhou (2017), a Recurrent Neural Network based Sequence Model (SummaRuNNer) was proposed for extractive summarization. SummaRuNNer is a two-layer recurrent neural network (RNN) based sequence classifier in which the first layer operates at the word level within each sentence, and the second layer runs over sentences for classification purposes. A Bidirectional Gated Recurrent Unit (Bi-GRU) based RNN was used as the basic building block of the sequence classifier. While words and sentences were represented semantically by word2vec word embeddings, two positional features also played a contributing role in the model's decision making. The ROUGE-1 recall obtained by SummaRuNNer was 0.46 ± 0.8 on the DUC2002 dataset.
Very recently, Joshi, Fidalgo, Alegre, and Fernández-Robles (2019) presented a model for extractive text summarization based on deep auto-encoders. This model considers the sentence content relevance and sentence novelty relevance scores obtained from word and sentence embeddings, along with a syntactic sentence position relevance score that promotes the salience of the first few sentences in the document. The ROUGE-1 score of the resulting summaries on the DUC2002 corpus was 0.517. Although these studies showed promising capacity in selecting salient sentences from text, their syntactic feature spaces were minimal and mostly relied only on sentence position. However, many kinds of syntactic information can be extracted from text, and it is still unknown which of them affect summarization performance and to what extent. Therefore, these existing approaches cannot be considered mature solutions yet. Here, the more important point is not the sentence scoring method itself, but what information is passed to it. To that end, the input features corresponding to the importance of a sentence should be clearly determined and optimized before being used in sentence scoring.
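For reference, the ROUGE-1 recall reported in the studies above is the clipped unigram overlap between a system summary and a reference summary, divided by the reference length. A minimal sketch of the computation:

```python
from collections import Counter

def rouge1_recall(system, reference):
    """ROUGE-1 recall: clipped unigram overlap over reference length."""
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clip each overlap count by its frequency in the system summary.
    overlap = sum(min(c, sys_counts[w]) for w, c in ref_counts.items())
    return overlap / sum(ref_counts.values())

print(rouge1_recall("the cat sat", "the cat sat on the mat"))  # → 0.5
```

Real evaluations use the ROUGE toolkit with tokenization, stemming, and stop-word options, so this sketch only conveys the core of the metric.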
In this study, the syntactic and semantic features used in extractive text summarization were investigated in depth, and their individual and joint contributions to the summarization problem were analyzed through several experiments. Enhancing these two types of features, a comprehensive feature space with both syntactic and semantic summarization features was proposed, and it was shown that, compared to their individual use, the combined use of these two feature spaces contributes more both to selecting the most informative sentences and to preserving the main information of the source document.
In the automatic text summarization literature, there are few datasets created specifically for the summarization task. In the optimal case, a summarization dataset has to provide proper goal summaries generated by humans; this is a crucial requirement for both extraction and abstraction. Although plenty of text datasets exist in the literature, only a tiny portion of them contain goal summaries that were manually generated by humans. With this distinctive feature, the Document Understanding Conference (DUC) (DUC, 2007) corpus has been the most convenient benchmark dataset for this task and has been employed in most studies on extractive summarization.
As an alternative benchmark dataset to DUC, a summarization corpus was created and proposed in this study. The corpus was obtained from the proceedings of the International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR, 2018), which are publicly available for academic purposes in SIGIR (2018). In the proposed corpus, the introduction section of each proceeding was considered a source document and acquired from SIGIR (2018). Additionally, the abstracts, concepts, and keywords written by the original authors of each publication were obtained to serve as possible baselines for abstractive summarization or text classification tasks, respectively. Besides these predefined text blocks, the sentences in the introduction sections were manually labeled as summary-worthy or summary-unworthy by three human readers, who were asked to select a subset of sentences from each source document that reduces the original text to 33% of its size while preserving the basic information and overall coherence as much as possible. As a result of this labeling process, a candidate sentence list was created for each reader, yielding three candidate extractive sentence sets for each document. Blending these candidate sentence sets, two extracts (Ext∪ and Ext∩) were obtained by the union and intersection of the candidate sentence sets. The Ext∪ extracts contain the sentences selected as summary-worthy by at least one reader; the Ext∩ extracts, on the other hand, contain the sentences selected as summary-worthy by at least two human readers. Naturally, the Ext∪ extracts are larger than the Ext∩ extracts. To give an overall understanding of the proposed corpus, statistical information is first presented. Then, a ROUGE-based evaluation was performed from several perspectives to validate and verify the consistency of the human extracts (Ext∪ and Ext∩).
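The construction of the Ext∪ and Ext∩ extracts from the three readers' candidate sets can be sketched as a vote count over sentence indices; the index sets below are hypothetical:

```python
from collections import Counter

def build_extracts(reader_selections):
    """Combine per-reader candidate sets of sentence indices.

    Ext_union: selected by at least one reader.
    Ext_inter: selected by at least two readers (majority of three).
    """
    votes = Counter(i for sel in reader_selections for i in set(sel))
    ext_union = sorted(i for i, v in votes.items() if v >= 1)
    ext_inter = sorted(i for i, v in votes.items() if v >= 2)
    return ext_union, ext_inter

# Hypothetical selections by three readers over a 6-sentence document.
readers = [{0, 2, 5}, {0, 3, 5}, {1, 5}]
print(build_extracts(readers))  # → ([0, 1, 2, 3, 5], [0, 5])
```

Note that with three readers, "at least two" is a majority vote, so Ext∩ is always a subset of Ext∪.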
Using the new SIGIR 2018 corpus, syntactic and semantic features were extracted, and an extensive, enhanced feature space was proposed for determining the salience of sentences. Additionally, a Long Short-Term Memory (LSTM)-based Neural Network (LSTM-NN) was proposed to classify the sentences as summary-worthy or summary-unworthy. This model processes the semantic and syntactic features in separate LSTMs and combines the output vectors in a deeper layer; a two-layer fully connected neural network is then applied for the classification of sentences. The sentences labeled as summary-worthy by this model were considered the model summaries, and these summaries were evaluated with ROUGE metrics obtained from 5-fold cross-validation. The evaluation first measured the contribution of the individual and joint use of the syntactic and semantic feature spaces; it was observed that using the enhanced feature space significantly improves the ROUGE values. Secondly, SummaRuNNer (Nallapati et al., 2017) and BanditSum (Dong, Shen, Crawford, van Hoof, & Cheung, 2018) were implemented to compare the informativeness of the resulting summaries with those of state-of-the-art deep learning methods. The results showed that the LSTM-NN model fed by the enhanced feature space provided more informative summaries while including fewer sentences than SummaRuNNer, achieves comparable results with BanditSum, and outperforms this baseline on phrase-based assessments.
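A minimal forward pass conveying the fusion idea, concatenating the outputs of the two branches and passing them through a small fully connected classifier, is sketched below in plain Python. The vectors and weights are stand-ins, not the actual LSTM states or trained parameters of the proposed model:

```python
import math

def classify_sentence(sem_vec, syn_vec, w_hidden, b_hidden, w_out, b_out):
    """Toy fusion classifier: concatenate the two branch outputs (here plain
    feature vectors instead of LSTM states), apply one tanh hidden layer,
    then a sigmoid output unit giving P(summary-worthy)."""
    x = sem_vec + syn_vec  # concatenation of the semantic and syntactic branches
    hidden = [math.tanh(sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    z = sum(wi * hi for wi, hi in zip(w_out, hidden)) + b_out
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative call with tiny fixed weights (assumed, not learned).
p = classify_sentence(sem_vec=[0.3, 0.1], syn_vec=[0.8],
                      w_hidden=[[0.5, -0.2, 0.4], [0.1, 0.3, -0.5]],
                      b_hidden=[0.0, 0.1],
                      w_out=[0.7, -0.6], b_out=0.05)
print("summary-worthy" if p >= 0.5 else "summary-unworthy")
```

In the actual model the two branches are LSTMs trained end to end; the sketch only shows how the branch outputs meet in the fully connected stage.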
Section snippets
Objectives and contribution
In this study, extractive text summarization was handled with three objectives, concerning the summarization dataset, the feature space, and the methods used for assigning an importance degree to sentences.
There are plenty of text documents available for numerous information retrieval or text mining tasks. However, a dataset for summarization purposes needs to fulfill some key requirements. First of all, it should include human-written documents on a specific topic. However, most of the
A new benchmark dataset for text summarization
In the automatic text summarization literature, there are few datasets created specifically for the summarization task. As already mentioned in the introductory section, having manually generated extracts that contain the important sentences of the source document is a crucial need for extractive text summarization. However, only a few datasets fulfill this necessity. In this section, the most commonly used datasets are revisited with their advantages and weaknesses, and then the proposed dataset
Extractive text summarization on SIGIR 2018
In this study, the proposed SIGIR 2018 corpus was used for the extractive text summarization task. The problem was handled as sentence ranking and classification based on semantic features, syntactic features, and ensembled features.
Conclusion
In this study, extractive text summarization was revisited with respect to three research areas: the dataset used, the feature space, and the method applied for sentence selection, i.e., for obtaining the summary-worthiness of each sentence.
Dataset: A new English dataset containing the proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018) was proposed. The introduction section of each proceeding was considered as a source document since, in
References (73)
- et al., GenDocSum + MCLR: Generic document summarization based on maximum coverage and less redundancy, Expert Systems with Applications (2012)
- et al., MCMR: Maximum coverage and minimum redundant text summarization model, Expert Systems with Applications (2011)
- et al., Multiple documents summarization based on evolutionary optimization algorithm, Expert Systems with Applications (2013)
- et al., An unsupervised approach to generating generic summaries of documents, Applied Soft Computing (2015)
- et al., Sentence selection for generic document summarization using an adaptive differential evolution algorithm, Swarm and Evolutionary Computation (2011)
- et al., A survey on evaluation of summarization methods, Information Processing & Management (2019)
- et al., Word-sentence co-ranking for automatic extractive text summarization, Expert Systems with Applications (2017)
- et al., GA, MR, FFNN, PNN and GMM based models for automatic text summarization, Computer Speech and Language (2009)
- et al., Assessing sentence scoring techniques for extractive text summarization, Expert Systems with Applications (2013)
- et al., A text summarization method based on fuzzy rules and applicable to automated assessment, Expert Systems with Applications (2019)
- Opinion mining from online hotel reviews: A text summarization approach, Information Processing & Management
- SummCoder: An unsupervised framework for extractive text summarization based on deep auto-encoders, Expert Systems with Applications
- Summarization beyond sentence extraction: A probabilistic approach to sentence compression, Artificial Intelligence
- Multi document summarization based on news components using fuzzy cross-document relations, Applied Soft Computing
- Multi-modal product title compression, Information Processing & Management
- SRL-ESA-TextSum: A text summarization approach based on semantic role labeling and explicit semantic analysis, Information Processing & Management
- Multi-document extractive text summarization: A comparative assessment on features, Knowledge-Based Systems
- Textual keyword extraction and summarization: State-of-the-art, Information Processing & Management
- Assessing shallow sentence scoring techniques and combinations for single and multi-document summarization, Expert Systems with Applications
- Neural extractive text summarization with syntactic compression, Proceedings of the 2019 conference on empirical methods in natural language processing
- Neural latent extractive document summarization, 2018 conference on empirical methods in natural language processing
- A novel partitioning-based clustering method and generic document summarization, 2006 IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology workshops
- SIGIR 2018 proceedings, Annual international ACM SIGIR conference on research and development in information retrieval
- CoSum: Text summarization based on clustering and optimization, Expert Systems
- An approach for combining multiple weighting schemes and ranking methods in graph-based multi-document summarization, IEEE Access
- Text summarization using term weights, International Journal of Computer Applications
- Variations of the similarity function of TextRank for automated summarization, Argentine symposium on artificial intelligence
- Learning summary prior representation for extractive summarization, Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 2: Short papers)
- An information distillation framework for extractive summarization, IEEE/ACM Transactions on Audio, Speech, and Language Processing
- Neural summarization by extracting sentences and words, Annual meeting of the association for computational linguistics
- Text summarization via hidden Markov models, Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval
- Effective academic writing 3
- Extraction of salient sentences from labelled documents, Computing Research Repository