Candidate sentence selection for extractive text summarization

https://doi.org/10.1016/j.ipm.2020.102359

Highlights

  • A new benchmark dataset for studies on automatic text summarization, which contains both human-generated abstracts and extracts, was proposed.

  • The extractive summarization problem was revisited.

  • The syntactic and semantic feature spaces used in summarization were comprehensively investigated.

  • An ensembled feature space was introduced and used with a new long short-term memory-based neural network model (LSTM-NN).

  • Experimental results showed that the ensembled feature space remarkably improved over the single use of syntactic or semantic features, and the proposed LSTM-NN also outperformed state-of-the-art models for extractive summarization.

Abstract

Text summarization is the process of generating a brief version of a set of documents while preserving their fundamental information as much as possible. Although most text summarization research has focused on supervised learning solutions, few datasets were created specifically for summarization tasks, and most existing summarization datasets lack the human-generated goal summaries that are vital for both summary generation and evaluation. Therefore, this study presents a new dataset for abstractive and extractive summarization tasks. The dataset contains academic publications, the abstracts written by the authors, and extracts in two sizes, which were generated by human readers in this research. The resulting extracts were then evaluated to ensure the validity of the human extract production process. Moreover, the extractive summarization problem was reinvestigated on the proposed dataset. The main point taken into account here was the analysis of the feature vector used to generate more informative summaries. To that end, a comprehensive syntactic feature space was generated for the proposed dataset, and the impact of these features on the informativeness of the resulting summaries was investigated. In addition, the summarization capability of semantic features was examined using GloVe and word2vec embeddings. Finally, the use of an ensembled feature space, i.e., the joint use of syntactic and semantic features, was proposed in a long short-term memory-based neural network model. The model summaries were evaluated with ROUGE metrics, and the results showed that the proposed ensemble feature space remarkably improved over the single use of syntactic or semantic features. Additionally, the summaries of the proposed approach on the ensembled features prominently outperformed, or provided performance comparable to, the summaries obtained by state-of-the-art models for extractive summarization.

Introduction

Today, a tremendous amount of data is available on the Internet. With its proliferation, it becomes difficult to efficiently gather the main information from this massive amount of data. Regarding text documents, gathering and perceiving the primary information from a huge amount of resources in a reasonable time is a complex and exhausting process for human beings. Fortunately, these processes have been performed automatically by information retrieval methods for decades. However, the rise in the quantity of information causes performance issues such as insufficient solutions and unwieldy applications of information retrieval tasks. Using higher-performance machines may reduce the losses caused by these issues, but at a greater cost. As a more suitable alternative, dimension reduction can be applied to the raw data to handle these issues and accelerate these tasks. In the domain of text processing, automatic text summarization is a practical and highly interpretable form of dimension reduction.

Text summarization is the process of generating a brief version of a single document or a set of documents. Automatic summarization of text documents is a challenging problem because the resulting summaries must cover the basic information of the source document(s) as much as possible. In the literature, this problem has been studied according to two principal strategies: abstraction and extraction. Abstractive text summarization imitates the summarization process of humans: people summarize documents by gathering salient information and reorganizing it in their own sentences. Imitating such a process allows more natural summaries to be generated. In abstractive text summarization, the critical concepts in the source document(s) are determined first, and these concepts are then paraphrased by natural language processing tools according to the grammatical rules and constraints of the corresponding natural language. This process is highly challenging because of its language dependency and the semantic restrictions of paraphrasing, but it has still been applied successfully to relatively short documents for simple tasks such as title/headline/keyword generation (Lopyrev, 2015, Nallapati, Zhou, dos Santos, Gulcehre, Xiang, 2016, Nasar, Jaffry, Malik, 2019), sentence compression (Knight, Marcu, 2002, Miao, Cao, Li, Guan, 2020, Zajic, Dorr, Lin, Schwartz, 2007), and sentence fusion (Krahmer, Marsi, & van Pelt, 2008). Larger documents (or document sets), on the other hand, have mostly been summarized by the extractive strategy, in which the salient text units are determined and the most salient ones are included in the summary according to a compression ratio. In typical applications, these text units correspond to the sentences in the document(s).
Carrying the salient sentences into the summary makes this kind of summarization more readable and plausible, since the sentences in the system summaries are grammatically correct and semantically proper human-written text.

Extractive text summarization has been handled by several different approaches. Frequency-based term weighting approaches were among the earliest studies in this area (Balabantaray, Sahoo, Sahoo, Swain, 2012, García-Hernández, Ledeneva, 2009, Ledeneva, Gelbukh, García-Hernández, 2008). Subsequently, latent semantic analysis (Gong, Liu, 2001, Hachey, Murray, Reitter, et al., 2005, Steinberger, Ježek, 2009), hidden Markov models (Brdiczka, Chu, 2011, Conroy, O’Leary, 2001), and graph-based unsupervised approaches (Aliguliyev, 2006, Fang, Mu, Deng, Wu, 2017, Mihalcea, Tarau, 2004, Wan, 2010) gathered attention. More recently, summarization has been considered an optimization problem, in which the summary sentences that maximize the evaluation metrics are selected for the model summaries. Here, the general approach for sentence selection is to select the most relevant and least redundant sentences while avoiding conveying similar information as much as possible (Alguliev, Aliguliyev, Hajirahimova, 2012, Alguliev, Aliguliyev, Hajirahimova, Mehdiyev, 2011a, Alguliev, Aliguliyev, Isazade, 2013, Alguliev, Aliguliyev, Mehdiyev, 2011b, Alguliyev, Aliguliyev, Isazade, 2015, Alguliyev, Aliguliyev, Isazade, Abdi, Idris, 2019).

In state-of-the-art approaches, extractive text summarization is related to the classification problem of machine learning. This kind of extractive text summarization method aims to assign a salience degree to the sentences in the document and select the most salient ones. Since it does not include sentence construction and paraphrasing processes, it has the advantage of language independence. To that end, extractive summarization has been gathering more attention in the literature (Allahyari, Pouriyeh, Assefi, Safaei, Trippe, Gutierrez, Kochut, 2017, Nenkova, McKeown, 2012).

Simplified, extractive text summarization consists of sentence scoring and sentence selection steps. In other words, a salience degree (or summary-worthiness) is determined for each sentence, the sentences are ranked according to their salience degrees, and the most salient k sentences are selected as summary-worthy. These steps intrinsically associate the problem with classification problems in the machine learning field, since the determination of salience can be ideally handled by supervised learning.
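The two steps above can be sketched as follows. The word-frequency scorer here is only an illustrative stand-in (this paper learns salience with an LSTM-based classifier), and all function names are hypothetical:

```python
from collections import Counter

def score_sentences(sentences):
    """Toy salience scorer: a sentence scores the sum of its words'
    document-level frequencies, normalized by sentence length.
    (Illustrative stand-in only; not the paper's learned scorer.)"""
    words = [w.lower() for s in sentences for w in s.split()]
    freq = Counter(words)
    scores = []
    for s in sentences:
        tokens = [w.lower() for w in s.split()]
        scores.append(sum(freq[t] for t in tokens) / max(len(tokens), 1))
    return scores

def select_top_k(sentences, k):
    """Rank sentences by salience and keep the top k, restoring the
    original document order so the extract stays readable."""
    scores = score_sentences(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:k])  # back to document order
    return [sentences[i] for i in chosen]
```

Selecting top-k sentences by any such score, then re-sorting them into document order, is the generic skeleton that the supervised approaches discussed below plug their learned scores into.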

In the literature, sentence scoring for extractive text summarization has been handled by syntactic or semantic approaches. In the syntactic approach, predetermined hand-crafted features are acquired for each sentence and considered during sentence scoring (Fattah, Ren, 2009, Ferreira, Lins, Freitas, Cavalcanti, Lima, Simske, Favaro, Others, 2013, Goularte, Nassar, Fileto, Saggion, 2019, Meena, Gopalani, 2014, Mutlu, Sezer, Akcayol, 2019, Mutlu, Sezer, Akcayol, 2020, Oliveira, Ferreira, Lima, Lins, Freitas, Riss, Simske, 2016, Suanmali, Salim, Binwahlan, 2009, Wan, 2010, Wang, Li, Wang, Zheng, 2017). In the semantic approach, on the other hand, the meanings of words and phrases and their semantic relations are taken into account (Chen, Liu, Chen, Wang, 2017, Cheng, Lapata, 2016, Denil, Demiraj, De Freitas, 2015, Mohamed, Oussalah, 2019, Narayan, Cohen, Lapata, 2018, Ren, Chen, Ren, Wei, Nie, Ma, De Rijke, 2018, Yin, Pei, 2015, Zhang, Lapata, Wei, Zhou, 2018). Humans, however, consider both the meanings and semantic relations of sentences and some structural properties of the text during summarization. Yet, to the best of our knowledge, the use of an enhanced feature space that comprehensively handles these two feature types has not gathered remarkable attention so far.

As one of the few studies that consider syntactic and semantic features together, Cao et al. (2015) analyzed document-dependent and document-independent features for summarization. While the document-independent features corresponded to sentence embeddings that carry the meaning of sentences, the document-dependent features comprised syntactic features, namely the position of the sentence, the averaged term frequency values of its words, and the averaged cluster frequency values of its words. The underlying sentence scoring method was a convolutional neural network, and experiments on a well-known benchmark dataset (DUC2002) showed that a ROUGE-1 (Recall-Oriented Understudy for Gisting Evaluation) recall of 0.366 could be reached.
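ROUGE-1 recall, the metric quoted in these comparisons, is essentially the clipped unigram overlap between the system summary and the reference, divided by the number of unigrams in the reference. A minimal sketch (real ROUGE implementations add stemming and stopword-removal options):

```python
from collections import Counter

def rouge1_recall(system, reference):
    """Simplified ROUGE-1 recall: clipped unigram overlap between the
    system summary and the reference summary, divided by the number of
    unigrams in the reference. Whitespace tokenization; no stemming."""
    sys_counts = Counter(system.lower().split())
    ref_counts = Counter(reference.lower().split())
    overlap = sum(min(c, sys_counts[w]) for w, c in ref_counts.items())
    return overlap / max(sum(ref_counts.values()), 1)
```

For example, a system summary covering 2 of a reference's 6 unigram occurrences scores a ROUGE-1 recall of 2/6 ≈ 0.33.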

In Nallapati, Zhai, and Zhou (2017), a Recurrent Neural Network based Sequence Model (SummaRuNNer) was proposed for extractive summarization. SummaRuNNer is a two-layer recurrent neural network (RNN) based sequence classifier, where the first layer operates at the word level within each sentence and the following layer runs over the sentences for classification purposes. A Bidirectional Gated Recurrent Unit (Bi-GRU) based RNN was used as the basic building block of the sequence classifier. While word2vec word embeddings semantically represented the words and sentences, two positional features also played a contributing role in the model’s decision-making procedure. The ROUGE-1 recall value obtained from SummaRuNNer was 0.466 ± 0.008 on the DUC2002 dataset.

Very recently, Joshi, Fidalgo, Alegre, and Fernández-Robles (2019) presented a model for extractive text summarization based on deep auto-encoders. This model considered sentence content relevance and sentence novelty relevance scores obtained from word and sentence embeddings, as well as a syntactic sentence position relevance score that boosts the salience of the first few sentences in the document. The ROUGE-1 score of the resulting summaries on the DUC2002 corpus was 0.517. Although these studies presented a promising capacity for selecting the salient sentences from text, their syntactic feature spaces were minimal and mostly relied only on the positions of the sentences. However, various kinds of syntactic information can be extracted from text, and it is still unknown which of them affect the summarization performance, and to what extent. Therefore, these existing approaches cannot be considered mature solutions yet. Here, the more important point is not the sentence scoring method itself, but what information is passed to it. To that end, the input features that indicate the importance of a sentence should be clearly determined and optimized for use in sentence scoring.

In this study, the syntactic and semantic features used in extractive text summarization were deeply investigated, and their individual and joint contributions to the summarization problem were analyzed in several experiments. Enhancing these two types of features, a comprehensive feature space with both syntactic and semantic summarization features was proposed, and it was shown that, compared to their individual use, the combined use of these two feature spaces contributes more both to selecting the most informative sentences and to preserving the main information of the source document.

In the automatic text summarization literature, there are few datasets created specifically for the summarization task. Ideally, a summarization dataset must provide proper goal summaries generated by humans; this is a crucial requirement for both extraction and abstraction. Although plenty of text datasets exist in the literature, only a tiny portion of them contain manually generated goal summaries. With this distinctive feature, the Document Understanding Conference (DUC) dataset (DUC, 2007) has been the most convenient benchmark for this task and has been employed in most studies on extractive summarization.

To serve as an alternative benchmark to DUC, a summarization corpus was created and proposed in this study. The corpus was obtained from the proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval, which are publicly available for academic purposes (SIGIR, 2018). In the proposed corpus, the introduction section of each proceeding was considered a source document and acquired from SIGIR (2018). Additionally, the abstracts, concepts, and keywords written by the original authors of each publication were obtained as possible baselines for abstractive summarization and text classification tasks, respectively. Besides these already determined text blocks, the sentences in the introduction sections were manually labeled as summary-worthy or summary-unworthy by three human readers, who were asked to select a subset of sentences from the source document that reduces the original text to 33% of its size while preserving the basic information and the overall coherence as much as possible. As a result of this labeling process, a candidate sentence list was created for each reader, yielding three candidate extractive sentence sets per document. Blending these candidate sentence sets, two extracts (Ext_∪ and Ext_∩) were obtained by the union and intersection of the candidate sentence sets. The Ext_∪ extracts contain the sentences selected as summary-worthy by at least one reader; the Ext_∩ extracts, on the other hand, were obtained from the sentences selected as summary-worthy by at least two human readers. Naturally, the Ext_∪ extracts are larger than the Ext_∩ extracts. To give an overall understanding of the proposed corpus, statistical information is provided first. Then, a ROUGE-based evaluation was performed from several perspectives to validate and verify the consistency of the human extracts (Ext_∪ and Ext_∩).
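The construction of the two extracts from the three readers' candidate sets can be sketched as follows; the function name is hypothetical, and sentence indices stand in for the sentences themselves:

```python
from collections import Counter

def build_extracts(reader_selections):
    """Combine per-reader candidate sets (sets of sentence indices) into
    two extracts: ext_union keeps sentences chosen by at least one
    reader; ext_inter keeps sentences chosen by at least two readers."""
    votes = Counter(i for sel in reader_selections for i in set(sel))
    ext_union = sorted(i for i, v in votes.items() if v >= 1)
    ext_inter = sorted(i for i, v in votes.items() if v >= 2)
    return ext_union, ext_inter
```

For example, with readers selecting {0, 2, 5}, {2, 3}, and {2, 5, 7}, the union extract contains sentences 0, 2, 3, 5, and 7, while the stricter extract keeps only 2 and 5.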

Using the new SIGIR 2018 corpus, an enhanced feature space containing syntactic and semantic features was extracted and proposed for determining the salience of sentences. Additionally, a Long Short-Term Memory (LSTM)-based Neural Network (LSTM-NN) was proposed to classify sentences as summary-worthy or summary-unworthy. This model processes the semantic and syntactic features in separate LSTMs and combines the output vectors in a deeper layer; a two-layer fully connected neural network then classifies the sentences. The sentences labeled as summary-worthy by this model were considered the model summaries, and these summaries were evaluated by ROUGE metrics obtained from 5-fold cross-validation. The evaluation first measured the contribution of the individual and joint use of the syntactic and semantic feature spaces; here, it was observed that using the enhanced feature space significantly improves the ROUGE values. Secondly, SummaRuNNer (Nallapati et al., 2017) and BanditSum (Dong, Shen, Crawford, van Hoof, & Cheung, 2018) were implemented to compare the informativeness of the resulting summaries with those of state-of-the-art deep learning methods. The results showed that the LSTM-NN model fed by the enhanced feature space provided more informative summaries, while including fewer sentences, than SummaRuNNer; it achieved results comparable to BanditSum and outperformed this baseline in the phrase-based assessments.
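A minimal sketch of the kind of per-sentence feature vectors involved, assuming a hypothetical word-to-vector embedding lookup (e.g., loaded from GloVe or word2vec). The three syntactic features here are only examples (the paper's syntactic space is far larger), and the actual model feeds the two parts through separate LSTM branches rather than a flat concatenation:

```python
from collections import Counter

def syntactic_features(sentences, idx):
    """A few illustrative syntactic features for sentence `idx`:
    relative position, relative length, and normalized averaged term
    frequency. Examples only; not the paper's full feature space."""
    tokens = sentences[idx].lower().split()
    all_tokens = [w.lower() for s in sentences for w in s.split()]
    tf = Counter(all_tokens)
    max_len = max(len(s.split()) for s in sentences)
    return [
        1.0 - idx / max(len(sentences) - 1, 1),  # position: earlier = higher
        len(tokens) / max_len,                   # relative sentence length
        sum(tf[t] for t in tokens) / max(len(tokens), 1) / max(tf.values()),
    ]

def semantic_features(sentence, embeddings, dim):
    """Averaged word embedding for the sentence; `embeddings` is a
    hypothetical word -> vector lookup. Unknown words are skipped."""
    vecs = [embeddings[w] for w in sentence.lower().split() if w in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

def ensembled_features(sentences, idx, embeddings, dim):
    """Ensembled vector = syntactic features ++ semantic features, the
    joint representation the sentence classifier consumes."""
    return syntactic_features(sentences, idx) + \
           semantic_features(sentences[idx], embeddings, dim)
```

The classifier (here, the LSTM-NN) then maps each such vector to a summary-worthy/summary-unworthy decision.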

Section snippets

Objectives and contribution

In this study, extractive text summarization was handled with three objectives, concerning the summarization dataset, the feature space, and the methods used to assign an importance degree to sentences.

There are plenty of text documents available for numerous information retrieval or text mining tasks. However, a dataset for summarization purposes needs to fulfill some key requirements. First of all, it should include human-written documents on a specific topic. However, most of the

A new benchmark dataset for text summarization

In the automatic text summarization literature, there are few datasets created specifically for the summarization task. As already mentioned in the introductory section, having manually generated extracts that contain the important sentences of the source document is a crucial need for extractive text summarization. However, only a few datasets fulfill this necessity. In this section, the most commonly used datasets are revisited with their advantages and weaknesses, and then the proposed dataset

Extractive text summarization on SIGIR 2018

In this study, the proposed SIGIR 2018 corpus was used for the extractive text summarization task. The problem was handled as sentence ranking and classification based on semantic features, syntactic features, and ensembled features.

Conclusion

In this study, extractive text summarization was revisited with respect to three research areas: the dataset used, the feature space, and the method applied for sentence selection, i.e., for determining summary worthiness.

Dataset: A new English dataset containing the proceedings of the 41st International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2018) was proposed. The introduction section of each proceeding was considered a source document since, in

References (73)

  • Y.-H. Hu et al.

    Opinion mining from online hotel reviews a text summarization approach

    Information Processing & Management

    (2017)
  • A. Joshi et al.

    Summcoder: An unsupervised framework for extractive text summarization based on deep auto-encoders

    Expert Systems with Applications

    (2019)
  • K. Knight et al.

    Summarization beyond sentence extraction: A probabilistic approach to sentence compression

    Artificial Intelligence

    (2002)
  • Y.J. Kumar et al.

    Multi document summarization based on news components using fuzzy cross-document relations

    Applied Soft Computing

    (2014)
  • L. Miao et al.

    Multi-modal product title compression

    Information Processing & Management

    (2020)
  • M. Mohamed et al.

    Srl-esa-textsum: A text summarization approach based on semantic role labeling and explicit semantic analysis

    Information Processing & Management

    (2019)
  • B. Mutlu et al.

    Multi-document extractive text summarization: A comparative assessment on features

    Knowledge-Based Systems

    (2019)
  • Z. Nasar et al.

    Textual keyword extraction and summarization: State-of-the-art

    Information Processing & Management

    (2019)
  • H. Oliveira et al.

    Assessing shallow sentence scoring techniques and combinations for single and multi-document summarization

    Expert Systems with Applications

    (2016)
  • J. Xu et al.

    Neural extractive text summarization with syntactic compression

    Proceedings of the 2019 conference on empirical methods in natural language processing

    (2019)
  • X. Zhang et al.

    Neural latent extractive document summarization

    2018 conference on empirical methods in natural language processing

    (2018)
  • R.M. Aliguliyev

    A novel partitioning-based clustering method and generic document summarization

    2006 IEEE/WIC/ACM international conference on web intelligence and intelligent agent technology workshops

    (2006)
  • Annual international ACM SIGIR conference on research and development in information retrieval

    Sigir2018 proceedings

    (2018)
  • R.M. Alguliyev et al.

    Cosum: Text summarization based on clustering and optimization

    Expert Systems

    (2019)
  • Allahyari, M., Pouriyeh, S., Assefi, M., Safaei, S., Trippe, E. D., Gutierrez, J. B., & Kochut, K. (2017). Text...
  • A. Alzuhair et al.

    An approach for combining multiple weighting schemes and ranking methods in graph-based multi-document summarization

    IEEE Access

    (2019)
  • R.C. Balabantaray et al.

    Text summarization using term weights

    International Journal of Computer Applications

    (2012)
  • F. Barrios et al.

    Variations of the similarity function of textrank for automated summarization

    Argentine symposium on artificial intelligence

    (2016)
  • Brdiczka, O., & Chu, M. K. (2011). Measuring document similarity by inferring evolution of documents through reuse of...
  • Z. Cao et al.

    Learning summary prior representation for extractive summarization

    Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 2: Short papers)

    (2015)
  • K.-Y. Chen et al.

    An information distillation framework for extractive summarization

    IEEE/ACM Transactions on Audio, Speech, and Language Processing

    (2017)
  • J. Cheng et al.

    Neural summarization by extracting sentences and words

    Annual meeting of the association for computational linguistics

    (2016)
  • J.M. Conroy et al.

    Text summarization via hidden Markov models

    Proceedings of the 24th annual international ACM SIGIR conference on research and development in information retrieval

    (2001)
  • Document Understanding Conference (DUC) dataset, (2001–2007)....
  • J. Davis et al.

    Effective academic writing 3

    (2006)
  • M. Denil et al.

    Extraction of salient sentences from labelled documents

    Computing Research Repository

    (2015)