Align-then-abstract representation learning for low-resource summarization



Introduction
Abstractive summarization is a natural language processing task that generates a short version of a document while preserving its salient details [1]. Training modern transformer-based solutions [2][3][4][5] typically requires massive labeled data, which poses obstacles in realistic low-resource summarization (LRS) scenarios characterized by just a few dozen labeled training instances [6,7,43]. LRS in the real world is particularly challenging for long documents [8,9], because producing ground-truth summaries of lengthy texts is expensive, time-consuming, and may demand domain experts. Long-input processing is a core challenge in summarizing extended articles and requires a strategy to address the memory complexity of transformer-based models, which is quadratic in the input size [10]. Existing methods generally rely on input truncation [11,12] or salient content selection [8]. Nevertheless, restricting the input given to the model can hinder proper document representation learning, the quality of which may, in turn, affect summarization accuracy, especially in low-resource conditions. To date, few efforts have been made to resolve this issue. This work bridges the gap by reconciling the inherent tension between two highly dependent problems, i.e., long-input processing and representation learning, in one coherent and synergistic model trained for LRS. Our model, align-then-abstract representation learning (ATHENA), jointly trains a text segmentation module with an abstractive summarizer through a novel alignment loss. Notably, we (i) address the computational issues of summarizing lengthy documents by segmenting them into small content-wise chunks, which are synthesized with fewer memory requirements, (ii) give summarization models entire inputs without truncating any information a priori or selecting a subset of sentences, and (iii) cope with the scarcity of labeled instances by implicitly augmenting the training data through text segmentation.
To assess the effectiveness and generality of our solution, we conduct comprehensive experiments on multiple public long document summarization datasets from different domains. ATHENA establishes new state-of-the-art performance in LRS on all corpora, significantly outperforming previous works.
Our main contributions are as follows. To the best of our knowledge, our paper pioneers the exploration of document representation learning for low-resource summarization by presenting a novel align-then-abstract learning model (ATHENA) in which text segmentation is jointly learned to yield better summaries. Our proposed approach addresses multiple issues related to long-input processing: it (i) reduces GPU memory usage, (ii) avoids input truncation, and (iii) works with few labeled training instances.
ATHENA achieves new state-of-the-art results in low-resource summarization on various well-known datasets.

Related work
Fine-tuning pre-trained models for downstream tasks is a standard strategy, but it is often ineffective when only dozens of labeled training instances are available. Language models are usually pre-trained with self-supervised learning techniques on large amounts of unlabeled data [2,13,14,15]. Consequently, downstream-specific pre-training strategies [16] have been introduced to create low-resource-oriented models, e.g., PEGASUS [3]. Nonetheless, with few training samples, pre-trained language models still struggle to adapt to new data from diverse domains [17]. For this reason, several approaches have been proposed to tackle the limited availability of labeled instances. Prompt-based methods [18,19] tune continuous prompts to adapt quickly to new tasks with few examples. Other works [20,21] applied synthetic data augmentation, enhancing summarization accuracy in low-resource conditions but experimenting only on short texts of at most 400 and 200 tokens, respectively. Conversely, to mimic the real-world LRS scenario over long documents, Bajaj et al. [8] proposed an extract-then-abstract approach that provides exclusively salient sentences to a pre-trained model. Despite its effectiveness, this solution involves a two-stage training in which the summarization model does not process all the information in the long input. Finally, Chen and Shuai [22] introduced a meta-transfer learning technique that augments the training data with multiple similar corpora.
Unlike prior contributions, we experiment with multiple public long document summarization datasets from different domains using a base model without synthetic data augmentation.

Background
Problem Definition We define the problem of long document summarization with the following setup. The input is a lengthy text $X = \{x_1, \ldots, x_{|X|}\}$ coupled with its corresponding summary $Y = \{y_1, \ldots, y_{|Y|}\}$, where each $x_i \in X$ and $y_i \in Y$ is a token. The standard training algorithm adopts the cross-entropy loss, which requires the model to predict the next token $y_i$ of the target summary given $X$ and the previous target tokens $y_{1:i-1}$, as follows:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{|Y|} \log p_{\theta}(y_i \mid y_{1:i-1}, X)$$

where $\theta$ indicates the model parameters and $p_{\theta}$ is the predicted probability over the vocabulary.
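For concreteness, the sketch below (a hypothetical helper, not the authors' code) shows how this token-level negative log-likelihood is typically computed in PyTorch, assuming the decoder has already produced per-token vocabulary logits.

import torch
import torch.nn.functional as F

def summarization_nll(logits: torch.Tensor, target_ids: torch.Tensor, pad_id: int = 1) -> torch.Tensor:
    """Token-level cross-entropy: -sum_i log p(y_i | y_{1:i-1}, X; theta).

    logits:     (batch, seq_len, vocab) decoder scores over the vocabulary
    target_ids: (batch, seq_len) gold summary tokens, padded with pad_id
    """
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits.reshape(-1, vocab_size),   # flatten batch and time dimensions
        target_ids.reshape(-1),
        ignore_index=pad_id,              # padding positions do not contribute (pad_id=1 for BART)
        reduction="mean",
    )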
Model Architectures The number of input tokens can be very large (e.g., >10,000). Therefore, X cannot be processed at its full size with current quadratic transformer-based models and commodity hardware. For this reason, several model architectures have been proposed to handle long-input processing: (i) Efficient sequence-to-sequence models read more input tokens thanks to sparse attention mechanisms with linear complexity in the input size [11,12,23,24,25]. (ii) Extract-then-abstract models lessen the input size by supplying just a subset of the source to the summarization model [8,26,27]. (iii) Segment-then-abstract models divide the input into sections, which are independently summarized and concatenated to produce the final summary [9,28,29,30,42].

Method
In this section, we describe our align-then-abstract representation learning model (ATHENA) in detail. In a nutshell, ATHENA comprises two key collaborating modules tackling long-input processing (Section 4.1) and document representation learning (Section 4.2). The two components work jointly to learn the text segmentation that yields the best summaries. Our model differs from existing ones thanks to an end-to-end learning solution in which segmentation and summarization cooperate to generate the synthesis. An overview of our proposed architecture and existing ones is shown in Fig. 1.

Long-input processing
In long document summarization, the source length, in terms of the number of tokens, may exceed the limit that summarization models can consume (e.g., BART [2] truncates inputs longer than 1024 tokens). Nevertheless, we argue that the whole document can contribute information to the final summary. To this aim, our ATHENA model is trained to synthesize a long input by segmenting it into small coherent chunks, learning the best document segmentation end-to-end, then summarizing the chunks and concatenating the chunk-level summaries to produce the final prediction. Consequently, our model can read the entire document without truncating any information or relying on a subset of snippets.
Source Segmentation To segment a long input into small chunks, we leverage the SE3 algorithm [30]. This unsupervised method uses a BERT-based model to semantically represent the sentences and create the chunks based on their meaning.2 Unlike SE3, which employs a frozen BERT, we follow the same algorithm but introduce a novel loss (Section 4.2) to train the segmenter end-to-end to uncover the document segmentation that best improves the summarization accuracy. Overall, ATHENA segments a long input X into n non-overlapping chunks $\{\tilde{X}_1, \ldots, \tilde{X}_n\}$, each with a number of tokens $\leq M$, corresponding to the maximum input size summarization models can process (e.g., M = 1024 for BART).
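The sketch below illustrates the chunking idea only; it is not the exact SE3 procedure (see [30] for that). It embeds sentences with an off-the-shelf sentence encoder and greedily closes a chunk when the token budget M would be exceeded or when the next sentence drifts away from the running chunk centroid. The encoder name and the similarity threshold are illustrative assumptions.

import numpy as np
from sentence_transformers import SentenceTransformer

def segment_document(sentences, sentence_token_lengths, max_tokens=1024, sim_threshold=0.4):
    """Greedy semantic segmentation under a token budget (simplified sketch)."""
    encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
    embeddings = encoder.encode(sentences, normalize_embeddings=True)

    chunks, current, centroid, used_tokens = [], [], None, 0
    for sentence, emb, length in zip(sentences, embeddings, sentence_token_lengths):
        drifted = centroid is not None and float(np.dot(emb, centroid)) < sim_threshold
        if current and (used_tokens + length > max_tokens or drifted):
            chunks.append(" ".join(current))      # close the current chunk
            current, centroid, used_tokens = [], None, 0
        current.append(sentence)
        used_tokens += length
        # keep a unit-norm centroid so the dot product above stays a cosine similarity
        centroid = emb if centroid is None else (centroid + emb) / np.linalg.norm(centroid + emb)
    if current:
        chunks.append(" ".join(current))
    return chunks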
Target Alignment We align each sentence $y_j \in Y$ to the chunk that can best summarize it, yielding new highly correlated instances $(\tilde{X}, \tilde{Y})$. More precisely, each sentence $y_j$ is assigned to the chunk $\tilde{X}_i$ that maximizes

$$R_1^p\big(\tilde{Y}_i \oplus y_j,\; \tilde{X}_i\big)$$

where $R_1^p$ stands for the ROUGE-1 precision metric [31] and $\oplus$ indicates the concatenation with the previous target sentences $\tilde{Y}_i$ already assigned to $\tilde{X}_i$. Note that this matching algorithm does not guarantee $\tilde{Y}_i \neq \emptyset$, which is noticeable when the number of chunks exceeds the number of summary sentences. Yet, we aim to create highly correlated chunk-target pairs $(\tilde{X}, \tilde{Y})$ for the summarization model (see Section 5.8).
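The greedy assignment can be sketched with the rouge_score package as follows; chunk texts and summary sentences are plain strings, and chunks left without sentences simply keep an empty target. Function and variable names are illustrative.

from rouge_score import rouge_scorer

def align_targets(chunks, summary_sentences):
    """Assign each summary sentence to the chunk maximizing the ROUGE-1
    precision of (already assigned sentences + new sentence) against the chunk."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
    assigned = [[] for _ in chunks]              # target sentences per chunk
    for sentence in summary_sentences:
        best_idx, best_precision = 0, -1.0
        for i, chunk in enumerate(chunks):
            candidate = " ".join(assigned[i] + [sentence])
            # score(reference, candidate): precision = share of candidate unigrams covered by the chunk
            precision = scorer.score(chunk, candidate)["rouge1"].precision
            if precision > best_precision:
                best_idx, best_precision = i, precision
        assigned[best_idx].append(sentence)
    return assigned                              # some entries may stay empty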

Document representation learning
Our method is trained end-to-end to maximize the conditional probability of generating $\tilde{Y}_i$ from $\tilde{X}_i$, where $(\tilde{X}, \tilde{Y})$ is the set of chunk-target pairs produced by the segmentation and alignment modules. To jointly train the segmenter, we propose an alignment loss that teaches the model to segment $(X, Y)$ into better-aligned pairs $(\tilde{X}, \tilde{Y})$, learning the best document representation.
2 Please refer to the original paper [30] for details.
Alignment Loss The segmenter is trained to maximize the alignment between each chunk-target pair $(\tilde{X}_i, \tilde{Y}_i)$ in terms of semantic content coverage, encouraging the model to locate the text segmentation that best improves the summarization. In detail, the alignment loss $\mathcal{L}^{g}_{\text{align}}$ computes the cosine similarity between $(\tilde{X}^e, \tilde{Y}^e)$, where $e$ denotes the embedding representation of each $(\tilde{X}_i, \tilde{Y}_i)$. Specifically, to obtain $\tilde{X}^e$, we generate the sentence embeddings with the segmenter ($g$ denotes its parameters) by computing the mean pooling operation over the token embeddings of each sentence [32]. Afterward, we calculate the mean over the output vectors to obtain a single embedding for each chunk and target. With this loss, the model learns to maximize the alignment between the chunk-target pairs, thus learning to better segment and represent the documents. More precisely, the weights updated during this learning are the same ones used for segmenting the document sentences into chunks. Hence, we train the model to segment the document better by maximizing the alignment between the chunk-target pairs created after the segmentation. The alignment loss is the following:

$$\mathcal{L}^{g}_{\text{align}} = -\frac{1}{n}\sum_{i=1}^{n} \cos\big(\tilde{X}^{e}_{i}, \tilde{Y}^{e}_{i}\big)$$

Summarization Loss The summarization module takes as input the chunk-target pairs and is trained to generate the next output token for each target by minimizing the negative log-likelihood:

$$\mathcal{L}^{c}_{\text{gen}} = -\sum_{t=1}^{|\tilde{Y}|} \log p_{c}\big(y_t \mid y_{1:t-1}, \tilde{X}\big)$$

where $c$ are the parameters of the summarization module, $\tilde{X}$ is the input chunk, and $y_{1:t}$ are the tokens from position 1 to $t$ of the target $\tilde{Y}$. Note that, for the training process, we take only the chunk-target pairs $(\tilde{X}_i, \tilde{Y}_i)$ with $\tilde{Y}_i \neq \emptyset$; in contrast, we consider all the chunks at inference time.
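A minimal sketch of the alignment loss is given below, assuming the segmenter has already produced one embedding per sentence for each chunk and its aligned target (the summarization loss is the standard NLL already sketched in the Background section). The 1 - cos form used here is an assumption; for training purposes it is equivalent to maximizing the cosine similarity.

import torch
import torch.nn.functional as F

def alignment_loss(chunk_sentence_embs, target_sentence_embs):
    """chunk_sentence_embs / target_sentence_embs: lists of tensors, each of
    shape (num_sentences, hidden), one entry per aligned chunk-target pair."""
    losses = []
    for chunk_e, target_e in zip(chunk_sentence_embs, target_sentence_embs):
        x_e = chunk_e.mean(dim=0)     # single embedding for the chunk
        y_e = target_e.mean(dim=0)    # single embedding for the aligned target
        cosine = F.cosine_similarity(x_e, y_e, dim=0)
        losses.append(1.0 - cosine)   # maximize alignment = minimize 1 - cos
    return torch.stack(losses).mean()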

Training objective
The overall training objective of our solution is the following:

$$\mathcal{L} = \mathcal{L}^{g}_{\text{align}} + \mathcal{L}^{c}_{\text{gen}}$$

The whole model is trained end-to-end with an align-then-abstract approach to segment the source into content-wise chunks and summarize each of them. Concretely, the segmenter is optimized with the alignment loss, whereas the summarizer is optimized with the summarization loss (Alg. 1). Moreover, we perform an update step of the model weights with a dynamic mini-batch gradient descent whose size equals the number of chunks per instance, formally defined as follows:

$$\theta_{k+1} = \theta_{k} - \eta\,\frac{1}{n}\sum_{j=1}^{n} \nabla_{\theta}\,\mathcal{L}\big(\tilde{X}_j, \tilde{Y}_j; \theta_k\big)$$

where $\theta = g + c$ are the parameters of the full model, $n$ is the number of chunks of the $i$-th document, $\eta$ is the learning rate, and $k \in \{1, \ldots, N\}$ indexes the update steps, with $N$ equal to the number of training samples. In this way, the gradients are (i) computed for each document chunk, implicitly augmenting the training instances, (ii) averaged per document, and (iii) descended after each instance.
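The per-instance update can be sketched as a plain gradient-accumulation loop; joint_loss is a hypothetical callable returning L_align + L_gen for one chunk-target pair, and optimizer is any torch.optim optimizer.

def train_one_instance(joint_loss, optimizer, chunk_target_pairs):
    """Dynamic mini-batch step: one chunk = one gradient contribution,
    averaged per document, with a single optimizer step per instance."""
    optimizer.zero_grad()
    n = len(chunk_target_pairs)                  # mini-batch size = number of chunks
    for chunk, target in chunk_target_pairs:
        loss = joint_loss(chunk, target) / n     # average the gradients per document
        loss.backward()                          # gradients accumulate across chunks
    optimizer.step()                             # descend once per training instance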
Fig. 2 illustrates our proposed solution. Technically, while the input is a single training instance, the summarizer reads more labeled samples (3 in the example) produced by the alignment module. At inference time, the final summary is obtained by concatenating the predicted chunk-level summaries. Note that the chunks without an assigned target are not considered during training.

Setup
To assess the performance of our solution, we simulate a low-resource scenario with labeled data scarcity, adopting the identical experimental setup of previous comparative works on low-resource summarization [3,22] for fair comparison and reproducibility. Specifically, we train our model with the first 10 and 100 training instances of each dataset.

Datasets
We consider the following well-known public long document summarization datasets from different domains and text sizes as evaluation benchmarks. Key measurements are reported in Table 1.

Fig. 1. The established model architectures for long document summarization. Gray parts indicate the learning modules. The ''align-then-abstract'' is our proposed architecture.

Baselines
We compare ATHENA with cutting-edge baselines. BART [2] is the state-of-the-art denoising sequence-to-sequence pre-trained model for various text generation tasks. Since ATHENA is built upon BART, this comparison can reveal whether the segmentation approach is meaningful in long-input processing. We report the results of BART-base.
PEGASUS [3] is a transformer-based model with summarization-specific pre-training that helps it adapt quickly with few labeled data. As there is no public checkpoint of the base version, we include the results of PEGASUS-large. MTL-ABS [22] is a meta-transfer learning approach for LRS that copes with data scarcity by augmenting the training data with multiple similar corpora.
LED [12] is a state-of-the-art efficient transformer built upon BART with a self-attention mechanism that scales linearly in the input size, allowing the model to process long sequences. We employ LED-base.
LONGT5 [36] is a powerful pre-trained model with sparse attention built upon T5. We use LONGT5-base.

Evaluation metrics
Although human evaluation is deemed the gold standard for estimating model accuracy, it is prohibitively expensive, and recent research has even shown some shortcomings [37]. For these reasons, we embrace automatic evaluation metrics, assessing the inferred summaries from different perspectives.
Lexical Overlap We use ROUGE-{1,2,L} F1 scores against reference summaries, reporting R-1 and R-2 for informativeness and R-L for fluency. Additionally, inspired by [38], we also compute r to derive an aggregated judgment $\in [0, 1]$ (the higher, the better) that penalizes generations with heterogeneous results across the ROUGE dimensions.
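As a reference, the reported ROUGE F1 scores can be reproduced with the rouge_score package; the aggregation into r follows [38] and is not shown here. The example strings are illustrative.

from rouge_score import rouge_scorer

scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

reference = "The bill requires the agency to publish an annual report."
candidate = "The agency must publish a report every year under the bill."

scores = scorer.score(reference, candidate)
r1, r2, rl = (scores[key].fmeasure for key in ("rouge1", "rouge2", "rougeL"))
print(f"R-1: {r1:.3f}  R-2: {r2:.3f}  R-L: {rl:.3f}")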
Semantic Similarity We report BERTScore F1 (BS) [39], which computes the contextual similarity between a candidate and its reference summary. We use this metric to assess the model performance in the ablation studies.

Implementation details
Pre-trained Models The summarizer is initialized with BART-base [2] weights, whereas the segmenter is initialized with a pre-trained sentence embedding model based on BERT-small [13]. Technically, the segmenter is based on a siamese network and has already been fine-tuned using a contrastive learning objective for the semantic textual similarity task.

Table 1
The long document summarization datasets used as testbeds. Statistics include corpus size, number of source texts per instance, number of total words in source and target texts, and source-target coverage, density, and compression ratio of words [33]. Except for the number of samples, all reported values are averaged across all instances.

Training We train on all datasets with mixed precision for 20 epochs, saving the model that performs best on the validation set. We apply gradient checkpointing to save memory, use the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.99$, and set the learning rate to 3e-5. We consider a chunk size between 256 and 1024 tokens for text segmentation. We eventually set the seed to 42 for reproducibility.
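The reported configuration translates roughly into the following setup, a sketch with the public BART-base checkpoint; the segmenter and the custom training loop are omitted.

import torch
from transformers import BartForConditionalGeneration

torch.manual_seed(42)                                        # fixed seed for reproducibility

model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.gradient_checkpointing_enable()                        # trade compute for GPU memory

optimizer = torch.optim.Adam(model.parameters(), lr=3e-5, betas=(0.9, 0.99))
scaler = torch.cuda.amp.GradScaler()                         # mixed-precision training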
Inference We set the beam width to 5 and use the following summary size ranges (min-max), based on experiments and the statistics reported in Table 1: BILLSUM (100-300), PUBMED (100-300), GOVREPORT (500-1000). Finally, we use an n-gram repetition penalty of 5 for GOVREPORT and 3 for the other datasets.
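Decoding-wise, these settings correspond to a standard beam-search call like the sketch below; we assume the n-gram penalty maps to no_repeat_ngram_size, and the BillSum/PubMed bounds are shown as defaults.

from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def summarize_chunk(chunk_text, min_len=100, max_len=300, ngram_block=3):
    """Beam-search decoding of a single chunk with the reported settings."""
    inputs = tokenizer(chunk_text, return_tensors="pt", truncation=True, max_length=1024)
    output_ids = model.generate(
        **inputs,
        num_beams=5,                       # beam width
        min_length=min_len,                # per-dataset summary size bounds
        max_length=max_len,
        no_repeat_ngram_size=ngram_block,  # 5 for GovReport, 3 for the others
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)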
Hardware Each experiment is run on a single RTX 3090 GPU with 24 GB of memory with PyTorch [40] in a workstation with 64 GB of RAM and an Intel® Core™ i9-10900X CPU @ 3.70 GHz.

Results and discussion
The evaluation results are reported in Table 2. ATHENA achieves new state-of-the-art ROUGE scores in low-resource conditions by a wide margin on BILLSUM, PUBMED, and GOVREPORT.
Capacity to summarize long inputs The results suggest the effectiveness of the align-then-abstract approach for long document summarization in low-resource conditions. Indeed, segmenting a long input into small content-wise chunks allows existing models to summarize very lengthy documents by processing small chunks, extending the input size the model can handle. In this way, the summarization phase reads all document details, handling the long input at its full length and avoiding prior input truncation or processing only a subset of pre-selected sentences.
Capacity to adapt to data scarcity We show the high capability of our model to synthesize long sequences in a low-resource data scenario. During the training phase, the segmenter creates small, highly correlated chunk-target pairs and feeds the summarization module many high-quality training samples, which augment the number of training instances and sharpen the model's focus at learning time. Table 3 illustrates a qualitative example of chunk-target pair correlation in ATHENA, namely the training data and their labels.
Capacity to handle multiple domains Our model attains new state-of-the-art results on a comprehensive collection of datasets, indicating a high potential to adapt to different vocabularies and language styles.
Efficiency and effectiveness ATHENA is built upon small and base models, so it is memory-efficient and more practical to use in small- and medium-sized organizations that cannot afford high-budget GPUs (i.e., more than 12 GB of memory). Regardless, despite its small number of parameters, ATHENA achieves state-of-the-art results, proving its effectiveness and suggesting a conceivable additional gain with larger models and more GPU memory available.

Subset analysis
Unlike prior works, we believe that the selection of samples plays a vital role in the final results in low-resource regimes. For this reason, we conduct further experiments on multiple subsets of the training sets. Technically, we use the first 5 non-overlapping subsets with 10 and 100 instances within each subset to assess whether the performance of our proposed model remains stable or highly depends on the input data. Table 4 reports the high similarity of the results despite the different training subsets.

Segmentation analysis
We first study how the segmentation affects the number and the size of the chunks at inference time. Table 5 shows the mean number of chunks per corpus, which is remarkably high for the GOVREPORT dataset. We notice that the documents are segmented into chunks of about 400 tokens in length for all datasets, despite the distinct text sizes.

Complexity analysis
Our model ATHENA has quadratic memory growth w.r.t. the chunk size. Therefore, the space complexity of summarizing the entire input text is $O(L_c^2)$, where $L_c$ is the maximum chunk size (or the model's maximum input length). For instance, a 10,000-token document processed at full length would require attention over roughly $10{,}000^2 = 10^8$ positions, whereas ten 1,000-token chunks require at most $1{,}000^2 = 10^6$ positions at a time. Table 6 reports the running time and the GPU memory requirements of the models on all datasets, trained on 10 instances for 1 epoch. The longer training time required by ATHENA is due to its capability to read the entire long document chunk by chunk, allowing low memory occupation.

Ablation studies
We conduct ablation studies to investigate the effectiveness of the modules of our solution. In detail, we report the performance of ATHENA after removing the alignment loss and the generation loss. The results are summarized in Table 7.
We notice that excluding the generation loss (w/o $\mathcal{L}_{\text{gen}}$) leads to the most significant drop in performance. Nonetheless, we still achieve competitive results, demonstrating the excellent capability of our solution's architecture. Training the model without the alignment loss (w/o $\mathcal{L}_{\text{align}}$) also decreases the performance, showing the importance of creating highly correlated samples in low-resource conditions.

Comparison with SE3
The architecture of our solution is related to but differs significantly from SE3 [30]. The similarity lies only in using the segmentation algorithm to split a long input into multiple chunks. However, unlike our dynamic chunk creation through learning, the segmentation module in SE3 is frozen with pre-selected chunk sizes. Table 8 reports the results of SE3 on the BILLSUM dataset to further prove the importance of learning how to better segment a long input.

Conclusion
In this paper, we propose ATHENA, a novel approach for long document summarization in low-resource conditions, namely with just dozens of labeled training instances available, which is a real-world scenario. ATHENA is trained end-to-end with an align-then-abstract representation learning objective to better segment a long input so as to (i) create small content-wise chunks processable with fewer memory requirements, (ii) read the long text at its full size, and (iii) create highly correlated samples that augment the data with high-quality source-target instances. We demonstrate the effectiveness of our solution by benchmarking on three datasets from different domains, significantly outperforming the current state-of-the-art in low-resource summarization on all datasets.
For future work, we suggest investigating the following approaches to better model chunk creation: (i) memory-based operations [44], from unsupervised approaches for entity relationship acquisition [45,46] and class extraction [47,52] to avant-garde semantic parsing solutions such as event extraction [48]; (ii) retrieval-enhanced techniques [49]. Lastly, as proposed for communication networks [50,51], tracking and propagating knowledge refinements across sentences could be critical when tackling extended sequences.

Fig. 2
The illustration of ATHENA at training and inference time. The input is a long document and the output is a short summary.

Table 2
Low-resource summarization performance of ATHENA on all target datasets. † means that the results are from the original papers. The other models are fine-tuned with the same training details as ATHENA. Best scores on each dataset are bolded.

Table 3
Chunk-target alignment in ATHENA of a random sample from BILLSUM.

Table 4
ROUGE scores of ATHENA using different subsets of the training set.

Table 5
Analysis of the segmented chunks at inference time. All values are averaged.

Table 6
Running time and GPU memory requirement of models.

Table 7
Ablation studies to validate the effectiveness of the full method ($\mathcal{L} = \mathcal{L}_{\text{align}} + \mathcal{L}_{\text{gen}}$). Best results are bolded.

Table 8
Comparison with SE3 on the BILLSUM dataset (the results are taken from the original paper [30]). Best values are bolded.