Abstract

The text summarization task aims to generate succinct sentences that summarise what an article tries to express. Based on pretrained language models, combining extractive and abstractive summarization approaches has been widely adopted in text summarization tasks. It has been proven to be effective in many existing pieces of research using extract-then-abstract algorithms. However, this method suffers from semantic information loss throughout the extraction process, resulting in incomprehensive sentences being generated during the abstract phase. Besides, current research on text summarization emphasizes only word-level comprehension while paying little attention to understanding the level of the sentence. To tackle this problem, in this paper, we propose the SentMask component. Taking into account that the semantics of sentences that are filtered out during the extraction process is also worth considering, the paper designs a sentence-aware mask attention mechanism in the process of generating a text summary. By applying the extractive approach, the paper first selects the most essential sentences to construct the initial summary phrases. This information leads the model to modify the weights of the attention mechanism, which provides supervision for the generative model to ensure that it focuses on the sentences that convey important semantics while not ignoring others. The final summary is constructed based on the key information provided. The experimental results demonstrate that our model achieves higher ROUGE and BLEU scores compared to other baseline models on two benchmark datasets.

1. Introduction

With the rapid increase in the number of articles and papers, we find ourselves drowning in a sea of documents. The time-consuming and energy-draining reading process can be avoided by creating a concise abstract of a text that transmits its main concept to the reader. However, summarizing articles automatically is difficult, as it requires models to rewrite a long article into a concise and fluent version while preserving the essential information. In the area of automatic text summarization, extractive and abstractive methods are the two primary paradigms. To produce a summary, the extractive [1] techniques select the salient phrases or sentences exactly from the original source, whereas the abstractive [2] techniques generate new phrases and sentences from scratch. However, because relevant information is spread throughout all sentences rather than contained in a few, extractive models suffer from a lack of semantics and cohesiveness in summary sentences, as well as redundancy in certain summary sentences. On the other hand, abstractive summarization models suffer from the slow encoding of long documents and the unreliability of the generated summaries.

Recently, some researchers have tried to combine these two methods in an extract-then-abstract way [3, 4]. The work [3] proposes a hybrid framework, HYSUM, for text summarization, which maintains salient content by switching between rewriting and copying sentences according to the degree of redundancy. The work [4] provides a hybrid abstractive-extractive method, which scans a document, produces prominent textual fragments that highlight its main ideas, and selects the important sentences by calculating the BERTScore. These models design a two-stage pipeline that first picks out salient sentences from a source document and then rewrites the extracted sentences into a complete summary. However, most research using the extract-then-abstract framework generates summaries based solely on the extracted sentences, which reduces robustness. In many cases, significant content might be filtered out by the extraction model, causing severe information loss in the generation process.

Furthermore, articles are difficult to comprehend and generalise because of their rigorous grammatical statements. To maintain the consistency of professional grammatical definitions and logic within the original sentences, it is vital to preserve sentence-level information and semantics in summaries, which has also been ignored in previous works.

To overcome both of these issues while combining the benefits of both paradigms, in this paper, we propose SentMask, a novel sentence-aware mask attention-guided two-stage text summarization component, which adaptively reduces the attention weight of filtered sentences by training neural networks. Taking Figure 1 as an example, existing methods generate the summary according only to the sentences selected by the extractor. However, the filtered-out sentences also contain information that should not be lost, such as “adverse events.” Thus, the paper utilizes these sentences by reducing rather than deleting their attention weights.

An extractive summary extracts important sentences to form a summary that captures the full text. During the extraction process, the model fully considers the semantic information between sentences. A generative summary instead generates words in order, forms sentences, and then forms a summary that condenses the entire article. During the generation process, the semantic information between words is fully considered by the model, but the emphasis on the semantic information between sentences is weakened. To make full use of the semantic information between both words and sentences, we employ an extractor to extract the initial summary and an abstractor to produce the final summary. Therefore, our model takes both word-level and sentence-level information into account during text generation. Unlike other works that select important words and thereby fragment the semantics of the whole, the paper uses an extractor to select essential information at the sentence level, faithfully preserving the semantics of the whole sentence. In this way, our model can avoid syntactic and incoherence errors in summary sentences and ensure that the generated phrases are flexible and stable. To better leverage the results of the extractor algorithm and preserve the necessary global information, the paper proposes a sentence-aware mask attention mechanism in our model.

The paper evaluates the efficacy of our semisupervised and supervised SentMask models, respectively. The semisupervised SentMask model consists of the TextRank algorithm [5] and a sequence-to-sequence model (Seq2Seq) [6], while the supervised SentMask model consists of the MemSum algorithm [7] and the BART [8] model. The paper leverages the extractor algorithm to extract important sentences for summarization. Based on its results, the paper then masks the other sentences by reducing rather than deleting their attention weights. The noise reduction capability of our model stems from reducing the weight of information in trivial sentences, which, to some extent, relatively increases the weight of important information.

The following are our primary contributions:
(1) The paper proposes a brand-new two-stage hybrid abstractive and extractive summary method. While acquiring the information of the salient sentences produced by the extractor, our abstractor also extracts knowledge in a specific way from the nonsalient sentences. Our method is implemented in semisupervised and supervised versions, which include unsupervised and supervised extractors, respectively.
(2) The paper proposes a sentence mask module, a sentence-aware mask attention mechanism, and a mask-aware copy mechanism. The sentence mask module transforms a sample input into a mask matrix. The sentence-aware mask attention mechanism reduces the nonsalient sentences’ attention weight rather than discarding their information. The mask-aware copy mechanism copies only words from salient sentences, since there could be noise throughout the article.
(3) The paper extensively evaluates SentMask on two benchmark datasets. The results of the experimental evaluation show that SentMask outperforms the current state-of-the-art in these evaluations.

2.1. Traditional Summarization

Several traditional summarization approaches for automatic summary generation have been advanced over the years, incorporating a variety of statistical-based [9], topic-based [10], graph-based [5], and semantic-based [11] techniques. For instance, the work [9] brings improvements by involving sentence position, sentence length, and keyword-sentence features. The work [10] proposes a term frequency-inverse document frequency algorithm, which measures the importance of keywords based on their frequency of occurrence and uses it to score each sentence; the abstract is then extracted from the highest-scoring sentences. Biased TextRank [5] is a method for capturing semantic closeness between graph nodes and a target text that relies on document representation models and similarity measurements. Latent semantic analysis [11] is an unsupervised technique that encodes text semantics based on the observed cooccurrence of words.

Traditional unsupervised text summarization models do not require any training data and generate the summary by accessing only the target documents. However, these traditional methodologies do the summarization task using manual design features, which shows poor generalization ability for new data.

2.2. Neural Networks Summarization

The two most common types of study are extractive summarization and abstractive summarization. Extractive summarization methods commonly construct an encoder-decoder architecture, with a graph attention network [12] as the encoder and autoregressive [13] or nonautoregressive [14] decoders. The work [7] proposes a multistep extractive summariser based on a reinforcement-learning-based Markov decision process, which considers information from the current extraction history.

In recent years, pretraining has been used in several varieties of transformer architecture in various ways, including encoder-only pretraining models like XLNet [15], decoder-only pretraining models like GPT [16], and encoder-decoder pretraining models like T5 [17] and BART [8]. For instance, the work [18] distills large pretrained sequence-to-sequence transformer models into smaller ones for faster inference and with the least amount of performance loss.

Two-stage document summarizing systems have been developed in recent studies. The first stage of this framework usually involves extracting some segments of the original text, and the second stage involves selecting or modifying these segments. There are various extract-then-abstract summarization methods such as extract-then-rewrite and extract-then-compress. In extract-then-rewrite models, the method [19] employs a coarse-to-fine approach inspired by humans, extracting all relevant sentences first and then decoding them simultaneously. The work [20] introduces a novel training signal that employs reinforcement learning to directly maximise summary-level ROUGE scores. In extract-then-compress models, the model [21] selects phrases from the document, identifies plausible compressions based on constituent parses, and rates those compressions using a neural network model to construct the final summary. The work [22] proposes a method for learning to select sentence singletons and pairs, which would subsequently be employed by an abstractive summariser to build a sentence-by-sentence summary, with singletons compressed and pairs fused.

Previous research using the extract-then-abstract framework generates summaries based solely on the extracted sentences, which loses semantic information in the filtered sentences, causing a severe information loss. To that end, the paper designs a sentence-aware mask attention-guided two-stage text summarization component, which captures the gist of the text.

3. Materials and Methods

In this section, the paper introduces our sentence-aware extract-then-abstract summarization framework in detail, as illustrated in Figure 2. It consists of four components: (1) An extractor, an importance-aware content selection component that utilizes the TextRank [5] or MemSum [7] algorithm to extract and organize salient sentences. (2) An abstractor, a Seq2Seq- [6] or BART-based [8] abstract generation component with a sentence-aware mask attention mechanism that compresses and rephrases both the extracted sentences and the original article into a succinct summary. (3) The sentence-aware mask attention mechanism, a modified version of the attention weight mechanism that masks the nonsalient sentences. (4) The mask-aware copy mechanism, a modified version of the copy mechanism that copies words from the salient sentences rather than the whole article. The paper describes these components in detail as follows.

3.1. Extractor

First, we split the article into sentences. Let $D$ denote the original article, which consists of a sequence of sentences $D = \{s_1, s_2, \ldots, s_n\}$. Each sentence $s_i$ consists of a sequence of words $s_i = \{w_{i,1}, w_{i,2}, \ldots, w_{i,|s_i|}\}$.

These sentences are constructed as a directed graph represented by a sentence similarity matrix with the TextRank algorithm, or input to a multistep episodic Markov decision process with historical awareness using the MemSum algorithm. After the extractor algorithm runs, a score is calculated for each sentence, which represents the “importance” of that sentence. The sentences are sorted in descending order of score, and the $k$ sentences with the highest scores are chosen as the draft that serves as the input of the abstractor to form the final summary.
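As a rough illustration of this extraction step, the sketch below scores sentences with a bag-of-words cosine similarity graph and a power-iteration PageRank, then keeps the $k$ highest-scoring sentences in document order. It is a minimal approximation of a TextRank-style extractor, not the exact implementation used in the paper; the function names and the similarity measure are illustrative assumptions.

```python
import numpy as np
from collections import Counter

def sentence_similarity(a, b):
    """Cosine similarity between two sentences given as lists of tokens."""
    ca, cb = Counter(a), Counter(b)
    common = set(ca) & set(cb)
    num = sum(ca[w] * cb[w] for w in common)
    den = np.sqrt(sum(v * v for v in ca.values())) * np.sqrt(sum(v * v for v in cb.values()))
    return num / den if den > 0 else 0.0

def textrank_extract(sentences, k, damping=0.85, iters=50):
    """Score sentences with PageRank over a similarity graph and return the
    indices of the k highest-scoring sentences (kept in document order)."""
    n = len(sentences)
    sim = np.array([[sentence_similarity(si, sj) if i != j else 0.0
                     for j, sj in enumerate(sentences)]
                    for i, si in enumerate(sentences)])
    # Row-normalize the similarity matrix into a transition matrix;
    # rows with no outgoing similarity fall back to a uniform distribution.
    row_sums = sim.sum(axis=1, keepdims=True)
    trans = np.divide(sim, row_sums, out=np.full_like(sim, 1.0 / n), where=row_sums > 0)
    scores = np.full(n, 1.0 / n)
    for _ in range(iters):
        scores = (1 - damping) / n + damping * trans.T @ scores
    top_k = sorted(np.argsort(-scores)[:k])  # keep the original sentence order
    return top_k, scores
```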

Let $E$ denote the initial sentences extracted by the extractor algorithm, which belong to the sentences in $D$: $E = \{e_1, e_2, \ldots, e_k\}$, where $e_j \in D$ and $k \le n$. The paper redescribes $E$ as $\{s_{j_1}, s_{j_2}, \ldots, s_{j_k}\}$, where $j_1 < j_2 < \cdots < j_k$ are the indices of the selected sentences in $D$.

So far, the paper has been working at the sentence level. The extractor helps us preserve whole-sentence semantics. The paper then converts this information to the word level, since the Seq2Seq and BART models take word-level information into account.

The paper utilizes a sentence mask module to transform a sample input into a mask matrix. The transformation of the input of the SentMask model is shown in Figure 3.

A mask value $m_i$ indicates whether the word $w_i$ is in the selected sentences. The mask vector is $M = \{m_1, m_2, \ldots, m_L\}$, where $m_i$ is defined as follows: $m_i = 1$ if $w_i$ belongs to a sentence in $E$, and $m_i = 0$ otherwise. $M$ will be the essential component for performing the sentence-aware mask attention mechanism, as it conveys information about how important each word is. To make this clear, the paper reformulates the article as a word sequence $D = \{w_1, w_2, \ldots, w_L\}$ and its mask as $M = \{m_1, m_2, \ldots, m_L\}$, where $L$ is the total number of words.
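A minimal sketch of the sentence mask module could look as follows, assuming sentences are already tokenized into word lists; the helper name and the toy example are illustrative only.

```python
def build_word_mask(sentences, selected_idx):
    """Flatten the article into a word sequence and build a 0/1 mask,
    where m_i = 1 if word w_i belongs to an extracted sentence."""
    words, mask = [], []
    selected = set(selected_idx)
    for i, sent in enumerate(sentences):
        words.extend(sent)
        mask.extend([1 if i in selected else 0] * len(sent))
    return words, mask

# Toy usage: sentence 0 was selected by the extractor, sentence 1 was not.
sents = [["acupuncture", "is", "effective"], ["adverse", "events", "were", "rare"]]
words, mask = build_word_mask(sents, selected_idx=[0])
# words -> 7 tokens, mask -> [1, 1, 1, 0, 0, 0, 0]
```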

3.2. Abstractor

After obtaining the initial salient textual fragments representing the source article’s key points with the extractor, the paper generates the summary with the assistance of these extracted sentences.

The paper uses a pretrained word representation to map each token to a vector. Then, the paper utilizes an abstractor to encode and decode the whole article $D = \{w_1, w_2, \ldots, w_L\}$. The decoder is initialized with the encoder’s last hidden state. In Seq2Seq, our encoder and decoder are GRU-based. $h_i$ is the encoder’s hidden state for word $w_i$, and $d_t$ is the decoder’s hidden state at time step $t$. The context vector is $c_t = \sum_{i} \alpha_{t,i} h_i$, where $\alpha_{t,i}$ is the attention weight defined in Section 3.3.

In BART, our encoder and decoder follow the transformer architecture. $h_i$ is the hidden state of the encoder, and $d_t$ is the hidden state of the decoder at time step $t$, computed from $d_{t-1}$ and $y_{t-1}$, where $y_{t-1}$ is the word generated in the last step.
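For concreteness, a bare-bones PyTorch skeleton of the GRU-based abstractor (encoder states $h_i$, decoder states $d_t$, decoder initialized from the encoder’s last hidden state) might look as follows; layer sizes are illustrative, and the sentence-aware attention of Section 3.3 would be applied on top of these states.

```python
import torch
import torch.nn as nn

class GRUAbstractor(nn.Module):
    """Skeleton of the Seq2Seq abstractor: a GRU encoder over the article and a
    GRU decoder initialized with the encoder's last hidden state."""
    def __init__(self, vocab_size, emb_dim=128, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.encoder = nn.GRU(emb_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hid_dim, batch_first=True)

    def forward(self, src_ids, tgt_ids):
        enc_states, enc_last = self.encoder(self.embed(src_ids))     # h_i for every source word
        dec_states, _ = self.decoder(self.embed(tgt_ids), enc_last)  # d_t for every target step
        return enc_states, dec_states
```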

The paper uses a sentence-aware attention mechanism in both of our abstractors. In addition, the paper utilizes a mask-aware copy mechanism in the Seq2Seq.

3.3. Sentence-Aware Mask Attention Mechanism

Based on the attention mechanism, the paper proposes a sentence-aware mask attention mechanism, which is employed in both semisupervised and supervised modes. $\alpha_{t,i}$ is the attention score obtained by our sentence-aware mask attention mechanism. It consists of two parts: standard word-level attention and sentence-aware masked attention at the sentence level. The word-level attention is calculated from the associated phrase attention. In the masked sentence attention, the paper forces the model to focus on the important sentences extracted by the extractor algorithm. By combining the two attention scores with a hyperparameter $\lambda$ as the weight, the paper can not only emphasize information from important sentences but also avoid losing semantics from other sentences. The attention score calculation process in Seq2Seq is shown as follows: $\alpha_{t,i} = \lambda\, a_{t,i} + (1 - \lambda)\, \hat{a}_{t,i}$, where $a_{t,i}$ is the standard word-level attention weight between decoder state $d_t$ and encoder state $h_i$, and $\hat{a}_{t,i}$ is the same attention restricted by the mask $M$ to words in the extracted sentences.

The attention score calculation process in BART is shown as follows: when $m_i = 1$, the corresponding entry of the attention mask takes the default value. When $m_i = 0$, the attention weight of the word is reduced, with two hyperparameters controlling how strongly the nonsalient sentences are down-weighted. The extension of the generation sources encourages the integrity of the sentence and increases the probability of correctness.
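The following sketch shows one plausible instantiation of this combination in PyTorch, assuming simple dot-product attention scores; the paper’s exact scoring function and mask parameterization may differ, so this is only an illustrative approximation.

```python
import torch
import torch.nn.functional as F

def sentence_aware_attention(dec_state, enc_states, word_mask, lam=0.9):
    """dec_state: (hid,) decoder state d_t; enc_states: (L, hid) encoder states h_i;
    word_mask: (L,) with 1 for words in extracted sentences (assumes at least one
    such word).  Returns the mixed attention weights and the context vector c_t."""
    scores = enc_states @ dec_state                           # dot-product scores e_{t,i}
    word_attn = F.softmax(scores, dim=-1)                     # standard word-level attention
    masked_scores = scores.masked_fill(word_mask == 0, float("-inf"))
    sent_attn = F.softmax(masked_scores, dim=-1)              # attention restricted to salient sentences
    attn = lam * word_attn + (1 - lam) * sent_attn            # weighted mix with hyperparameter lambda
    context = attn @ enc_states                               # context vector c_t
    return attn, context
```

With $\lambda > 0.5$, the original word-level attention contributes more than the masked-sentence term, which is consistent with the observation in Section 4.6 that the masked-sentence proportion should stay below that of the original attention.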

For the summary output, the final vocabulary distribution in BART at time step $t$ is $P_{vocab} = \mathrm{softmax}(W_o d_t)$, where $W_o$ is a dense layer, while the preliminary vocabulary distribution in Seq2Seq at time step $t$ is defined as $P_{vocab} = \mathrm{softmax}(W_v [d_t; c_t] + b_v)$, where $W_v$ and $b_v$ are trainable parameters.

3.4. Mask-Aware Copy Mechanism

The copy mechanism in Seq2Seq, following [23], uses the encoder’s representation of words to select a word from the inputs instead of choosing from the whole vocabulary. When dealing with important words, this technique can be more reliable than generating from the full vocabulary. Because the hidden state of a word is governed by its full context and lexical auxiliary features collectively, the model can consistently produce appropriate terms in the target vocabulary. The paper makes a modification to the original copy mechanism: it copies words only from important sentences, since there could be noise throughout the article. By limiting the scope, the model can more easily find the most probable word to generate. The generation probability $p_{gen}$ is calculated as follows: $p_{gen} = \sigma(w_c^{\top} c_t + w_d^{\top} d_t + w_x^{\top} x_t + b)$, where $w_c$, $w_d$, $w_x$, and $b$ are trainable parameters, $x_t$ is the decoder input at time step $t$, and $\sigma$ denotes the sigmoid function.

The final prediction is obtained by merging the copy probability and the output of the decoder.
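A sketch of this mask-aware copy step, in the spirit of the pointer-generator formulation [23] but with the copy distribution restricted to words from extracted sentences, could look as follows; the interface and tensor shapes are assumptions for illustration.

```python
import torch

def mask_aware_copy(p_vocab, attn, src_ids, word_mask, p_gen, vocab_size):
    """Merge the generator's vocabulary distribution with a copy distribution
    restricted to words from the extracted (non-masked) sentences.
    p_vocab: (V,); attn: (L,) attention weights; src_ids: (L,) long tensor of
    source token ids; word_mask: (L,); p_gen: scalar generation probability."""
    copy_attn = attn * word_mask.float()          # zero out words from nonsalient sentences
    if copy_attn.sum() > 0:
        copy_attn = copy_attn / copy_attn.sum()   # renormalize over the salient words
    copy_dist = torch.zeros(vocab_size)
    copy_dist.scatter_add_(0, src_ids, copy_attn) # accumulate attention mass per token id
    return p_gen * p_vocab + (1 - p_gen) * copy_dist
```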

In conclusion, our SentMask model extends the Seq2Seq and BART models, respectively, with an important-sentence-guided masked attention strategy that enables the model to leverage both word-level and sentence-level information for final sequence generation. By capturing more condensed semantics at the word level and keeping the original sentence grammar at the sentence level, our SentMask model improves the capacity to capture the gist of the input text, whether semisupervised or supervised.

4. Results and Discussion

4.1. Dataset

To comprehensively investigate our proposed model, we employ two benchmark datasets for evaluation that are common options in previous research: the Multi-Document Summarization of Medical Studies benchmark dataset (MS2) and the AESLC dataset. Both of them are open access; the MS2 dataset can be downloaded at https://paperswithcode.com/dataset/ms-2 and the AESLC dataset can be downloaded at https://github.com/ryanzhumich/AESLC. The statistical details of the two datasets are shown in Table 1. The following are brief summaries of these benchmark datasets.

4.1.1. MS2 [24]

The MS2 dataset is a scientific literature dataset with about 470k documents and 20k summaries. The paper removes contents that are excessively long or too short, and 20,434 papers are ultimately acquired as our corpus, with 16,112 documents for training, 2,277 for validation, and 2,045 for testing.

4.1.2. AESLC [25]

The AESLC dataset is derived from the Enron dataset, which contains 517,401 e-mail messages from the 150 user mailboxes of Enron Corporation staffers. After filtering and deduplication, the paper obtains the final AESLC dataset.

4.2. Implementation and Evaluation Details

This method is suitable for any encoder-decoder model based on a neural network, including pretrained language models. In this paper, we implement our SentMask based on Seq2Seq and BART, respectively, which is sufficient to demonstrate the effectiveness of the method. The hyperparameter settings are examined in Sections 4.5 and 4.6. The paper uses PyTorch to implement our model.

To demonstrate the performance of the proposed SentMask model, the paper compares the SentMask model to many baselines with the same model size for a fair comparison, including the Lead3 algorithm, TextRank algorithm, GenCompareSum model [4], Seq2Seq model, Presumm model [26], Global Encoding model [27], Pointer-Generator model [23], Transformer [28], AESLC baseline [25], and BART [8].

There are some descriptions of the baselines as follows.

4.2.1. Lead3 Algorithm

The Lead3 algorithm simply takes the leading sentences of the document as the summary.

4.2.2. TextRank Algorithm

The TextRank algorithm determines each sentence’s score based on how similar the sentences are to one another and then selects the top K scoring sentences.

4.2.3. GenCompareSum Model [4]

The GenCompareSum model is a hybrid extractive method, which generates salient text fragments representing the document’s main points and selects the most important sentences in the document by scoring them against these fragments using BERTScore.

4.2.4. Seq2Seq Model

Seq2Seq is an encoder-decoder architecture, which consists of LSTM or GRU.

4.2.5. Presumm Model [26]

The Presumm model is based on the BERT model; it expresses the semantics of the document, obtains sentence representations, and improves the quality of the summary through fine-tuning.

4.2.6. Global Encoding Model [27]

The Global Encoding model is a Seq2Seq model, which employs a gated convolutional unit in the encoder for global encoding.

4.2.7. Pointer-Generator Model [23]

The pointer-generator is an encoder-decoder model that addresses the OOV problem with a pointer that lets the model copy tokens from the original context.

4.2.8. Transformer [28]

The Transformer is a simple network architecture based solely on attention mechanisms.

4.2.9. AESLC Baseline [25]

The AESLC baseline combines a multisentence extractor with a multisentence abstractor.

4.2.10. BART [8]

BART is a transformer-based model, which combines a bidirectional encoder with an autoregressive decoder and is pretrained with a number of denoising objectives.

To evaluate the quality of the experiments, the paper comprehensively assesses the summaries generated by these baseline models from both automatic and human evaluation perspectives. Automatic summary evaluation metrics, including ROUGE [29] and BLEU [30], are used to evaluate the quality of text summarization. In particular, the BLEU evaluation metric is an enhanced N-gram assessment metric, and its N-gram weights can be defined to conveniently fit models for different purposes and more accurately determine the consistency of the model.
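As an illustration (not the evaluation code used in the paper), ROUGE-L and BLEU with user-defined n-gram weights can be computed with the commonly used rouge-score and NLTK packages, assuming those packages are installed.

```python
from rouge_score import rouge_scorer
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "acupuncture is effective and safe for abdominal pain in acute pancreatitis"
candidate = "acupuncture is safe and effective for abdominal pain"

# ROUGE-L between a candidate summary and the reference summary.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

# BLEU with custom n-gram weights (here uniform over 1- to 4-grams).
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method1,
)
print(f"ROUGE-L: {rouge_l:.3f}, BLEU: {bleu:.3f}")
```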

4.3. Automated Evaluation

The experimental results on the MS2 and AESLC datasets are shown in Tables 2 and 3, respectively. The results show that the proposed SentMask model performs remarkably well on both text summarization datasets, demonstrating the effectiveness of our masked sentence attention mechanism.

Meanwhile, the improvements confirm that the multilayer neural network structure can capture further refined information from the original text, and that the added encoding information improves the model’s expressive capacity, enabling it to generate summaries with few grammatical errors.

4.4. Human Evaluation

To further assess the quality of the summaries produced by the SentMask model, the paper conducted a human evaluation using three typical indicators: informativeness, fluency, and faithfulness. The following are brief summaries of these human evaluation metrics.

4.4.1. Informativeness

The informativeness of the summary is determined by how accurately it summarises the material in the original article.

4.4.2. Faithfulness

Faithfulness evaluates how well the facts in the summary match those of the original article.

4.4.3. Fluency

The summary’s fluency is determined by how few serious grammatical faults it contains.

The paper hires five native English speakers and randomly chooses 300 documents from the MS2 and AESLC datasets to evaluate the summaries of these baseline models and the SentMask model on the three aspects. The score ranges from 1 (poor) to 5 (outstanding).

The findings in Table 4 demonstrate that, in terms of informativeness, fluency, and faithfulness, our SentMask model outperforms the other baseline models, which illustrates the value of the sentence-aware mask attention mechanism.

4.5. Ablation Study

To obtain a more scientifically accurate explanation, an ablation study is conducted by removing some components of our model to verify their contribution. The paper conducts the ablation study with the semisupervised model and a supervised model, respectively, on the MS2 dataset. The paper conducts several experiments and ablation tests as follows.

4.5.1. SentMask-T

It is our proposed semisupervised model. The sentences are first generated by the TextRank algorithm and then passed through the proposed SentMask neural network.

4.5.2. TextRank

TextRank is a graph-based ranking model for natural language processing, which finds the most relevant sentences in an article.

4.5.3. SentMask-C

It is our proposed supervised model. The MemSum algorithm generates the initial selected sentences, which are then passed through the proposed SentMask neural network.

4.5.4. MemSum

MemSum is a historical-aware multistep episodic Markov decision process algorithm.

To investigate how the hyperparameters affect the model’s performance, the paper tries different hyperparameter settings in our ablation study. An essential hyperparameter is $k$, the number of sentences with the highest scores extracted by the extractor algorithm.

The paper performs a set of experiments with different selections of $k$ to uncover its influence on the quality of the generated sentences. There are two ways to control $k$ in the extractor algorithm: one is to control the percentage $p$ of selected sentences, and the other is to set $k$ itself. The two settings are described as follows, with a small illustrative helper shown after them.

With the percentage setting in the extractor algorithm, the first $p$ of the sentences are selected as the subsequent input sentences and as the nonmasked sentences. In our experiments, the paper tries different values of $p$.

With the count setting in the extractor algorithm, the first $k$ sentences are selected as the subsequent input sentences and as the nonmasked sentences. The paper tries different values of $k$.
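The helper below illustrates the two selection settings described above; the function and argument names are illustrative assumptions rather than the paper’s implementation.

```python
import math

def num_selected(n_sentences, p=None, k=None):
    """Return how many top-scoring sentences to keep: either a fraction p of
    the document's sentences or an absolute count k (exactly one is given)."""
    if p is not None:
        return max(1, math.ceil(p * n_sentences))
    return min(k, n_sentences)

# e.g. a 20-sentence article with p = 0.5 keeps 10 sentences,
# while k = 3 keeps the 3 highest-scoring ones.
```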

For the semisupervised model, the ROUGE-L and BLEU scores of the ablation models with different values of $p$ are shown in Figure 4, and those with different values of $k$ are illustrated in Figure 5. For the supervised model, the ROUGE-L and BLEU scores of the ablation models with different values of $p$ are shown in Figure 6, and those with different values of $k$ are illustrated in Figure 7.

Overall, the ablation models, whether semisupervised or supervised, perform worse in terms of the ROUGE-L and BLEU scores, demonstrating the effectiveness of the sentence-aware masked attention mechanism in our SentMask model. Across these figures, as the hyperparameter varies, the trend of the results of the semisupervised SentMask model is more turbulent, while that of the supervised SentMask model is relatively stable. Thus, the performance of the semisupervised SentMask model is influenced significantly by this hyperparameter, while the supervised model is only slightly influenced. In addition, selecting the proper number of sentences is a crucial decision for our model. Comparatively speaking, it can be observed that the best setting is to select the first 50% of the sentences of the source articles, for both the semisupervised model and the supervised model.

4.6. Effect of the Hyper-Parameter $\lambda$

To demonstrate the robustness of our model with different parameters, the paper tries different values of $\lambda$ from 0.6 to 0.95 for the semisupervised model and the supervised model on the MS2 dataset. According to the results in Figure 8, the proposed SentMask performs well regardless of the value of $\lambda$; SentMask-T and SentMask-C each reach their best scores at different values of $\lambda$ on the MS2 dataset. Note that the model mainly carries out the task of generating text abstracts, so the proportion of attention contributed by the masked-sentence strategy should be less than that of the original attention mechanism.

4.7. Case Study

Table 5 shows an example of summaries generated by different models.

In this example, the original article provides verification of acupuncture’s efficacy and safety in relieving abdominal pain and distension associated with acute pancreatitis. The primary idea of the article is clearly acupuncture’s high efficacy and safety, and its research object is abdominal pain and distension in acute pancreatitis.

However, the baseline models generate inappropriate summaries to varying degrees. In detail, the summary of the Lead3 algorithm contains duplicate information that does not represent the true abstract of this article, such as “Methods and Analysis.”

The TextRank algorithm risks ranking redundant sentences highly and generates condensed sentences that are semantically similar, such as “safety of acupuncture,” which appears twice in the summary text.

The Seq2Seq model creates a summary that solely comprises information related to acupuncture, not the efficacy or safety of acupuncture. Furthermore, it makes the mistake of redundantly repeating the word “acupuncture.” The pointer network model generates an excessive number of words, emphasizing “acupuncture’s effect” rather than “its efficacy and safety”; meanwhile, the trial method does not need to be included in the abstract of the paper. According to the summary of the Global Encoding model, “orthostatic hypotension and cardiovascular” is a component of the full text, but not the main information. The main objective of the summary given by the Presumm model is “home-based ventilation in intensive care,” which is inconsistent with the source article. The summary generated by the BART model focuses on “pancreatitis” rather than “efficacy,” which is inappropriate.

Compared with these baseline models, the summary of our model is more coherent and semantically relevant to the source text. Our model focuses on information about the efficacy and safety of acupuncture rather than on acupuncture itself and points out in its generated summary that this is a systematic review and meta-analysis. Meanwhile, all the words generated by our model are target words of the standard dataset, maintaining a high degree of conciseness.

Therefore, our model can better consider grammatical word-level and sentence-level appearances simultaneously by masking the sentences to guide the generator. This indicates that the masked sentence attention in our model is able to capture substantial semantics and minimize noise from the source article by introducing an original sentence pointer.

5. Conclusions

In this paper, we propose SentMask, a novel extract-then-abstract method for text summarization. By utilizing the sentence-aware mask attention mechanism, our method avoids the information loss caused by the extraction model. Besides, the paper utilizes a sentence-level extractor, which can preserve sentence-level semantics during generation. Experimental results on both the semisupervised model and the supervised model demonstrate that our model can generate comprehensive summaries without suffering information loss.

In terms of our future work, the paper will attempt to extend our solution in several directions. One possible direction is to take into account the varied connections among the words and sentences in articles. The paper will explore using the similarity of phrases, especially critical phrases, to further investigate semantic relationships.

Data Availability

The data used to support the findings of this study are included in the article [24, 25]. The MS2 and AESLC datasets can be derived from the websites https://paperswithcode.com/dataset/ms-2 and https://github.com/ryanzhumich/AESLC.

Conflicts of Interest

The authors declare that they have no conflicts of interest in this article.