Pre-trained language models with domain knowledge for biomedical extractive summarization

Biomedical text summarization is a critical task for comprehension of an ever-growing amount of biomedical literature. Pre-trained language models (PLMs) with transformer-based architectures have been shown to greatly improve performance in biomedical text mining tasks. However, existing methods for text summarization generally fine-tune PLMs on the target corpora directly and do not consider how fine-grained domain knowledge, such as the PICO elements used in evidence-based medicine, can help to identify the context needed for generating coherent summaries. To fill this gap, we propose KeBioSum, a novel knowledge infusion training framework, and experiment with a number of PLMs as bases for the task of extractive summarization of biomedical literature. We investigate generative and discriminative training techniques to fuse domain knowledge (i.e., PICO elements) into knowledge adapters and apply adapter fusion to efficiently inject the knowledge adapters into the basic PLMs for fine-tuning on the extractive summarization task. Experimental results on three biomedical literature datasets show that existing PLMs (BERT, RoBERTa, BioBERT, and PubMedBERT) are improved by incorporating the KeBioSum knowledge adapters, and our model outperforms the strong baselines.


Introduction
The quantity of, and frequency by which, biomedical literature is being produced makes it challenging for clinicians and other domain experts to consume the information they require to stay up-to-date in their area of expertise [1]. Biomedical text mining [2] has addressed these information needs by developing methods for information retrieval and information extraction tailored to biomedicine. Automatic text summarization [3] is an important text mining application that aims to condense key information within documents into shorter and more easily consumable texts. Existing summarization methods can be classified into two categories: extractive summarization and abstractive summarization [4]. The former method extracts the most important sentences from documents and concatenates them into a summary, while the latter method generates new sentences based on the information within the longer documents. However, abstractive summarization methods have been shown to struggle to generate factually consistent summaries [5,6], and therefore extractive summarization methods are often considered more suitable for practical applications where factual consistency is important, e.g., in the biomedical domain.
Inspired by the success of pre-trained language models (PLMs) for natural language processing (NLP) tasks in the general domain [7,8], PLMs have also more recently been used to improve performance on extractive summarization tasks [9]. Extractive summarization [10] is generally formulated as a binary classification task, in which the model aims to predict whether or not each sentence should be included in the summary. Existing summarization methods generally use PLMs to encode the documents and directly fine-tune the PLMs on given corpora for the summarization task. However, direct fine-tuning methods are unable to fully capture medical knowledge for tasks in the biomedical domain [11], and thus there remains a semantic gap between these methods and biomedical texts. The domain knowledge that is fundamental for extractive summarization in the biomedical area, such as biomedical entities and their correlations, is ignored by PLMs. For PLMs that are pre-trained to capture the semantic relationships between masked tokens and their contexts, biomedical entities and concepts are treated as ordinary tokens. This limits the ability of existing PLM-based methods, since they fail to capture the correlations between domain-specific tokens that can help a method distinguish between salient and non-salient sentences [12]. Without the guidance of domain knowledge, these methods may wrongly select redundant sentences with high-frequency ordinary words into the summary, rather than informative sentences containing domain-specific tokens. It is necessary to leverage medical knowledge and identify key medical concepts to fully understand biomedical texts. The PICO framework (comprising four elements: Population, Intervention, Comparison, and Outcome) [13] is a structured medical knowledge representation, which is widely used for formulating search queries in literature database searches to improve screening in evidence-based medicine [11]. The domain knowledge embedded in PICO labels can help identify the core biomedical concepts in sentences and capture the semantic relationships between sentences, which is beneficial for generating a more coherent summary in the biomedical domain. However, there are no existing efforts incorporating the PICO framework into PLM-based methods for improving the performance of biomedical extractive summarization.
To address the above issues, we aim to explore the integration of medical knowledge into PLMs for biomedical extractive summarization. We propose a novel knowledge-enriched pre-trained language model based method called KeBioSum. KeBioSum can efficiently inject medical knowledge into PLMs using distantly-labelled PICO annotations via the knowledge adapter, a lightweight training framework based on auxiliary tasks that predict the PICO elements and their labels in sentences. Specifically, we design a novel training framework, based on the adapter, using both generative and discriminative training to fully exploit the domain knowledge in PICO elements and annotations. For each input sentence, our method masks PICO elements with a higher probability than normal tokens. Given the sentence with masked domain-specific tokens, a generator is applied to reconstruct all missing tokens in the input sentence, and a discriminator is adopted to predict the PICO element labels for every token in the input sentence. This encourages our method to explicitly identify the PICO elements along with their annotations and explore the domain knowledge through the training process. Inspired by previous adapter-based methods, the trained adapters, which fully capture the domain knowledge, are infused into the PLM. The PLM is further used as an encoder fine-tuned for extractive summarization. With an adapter fusion strategy, the PLMs in our method can focus on the domain-specific tokens guided by the knowledge adapters during fine-tuning and therefore better model the relationships between the medical concepts within the tokens and sentences of the document. This helps to understand the global meaning of the document and select salient sentences containing the most important arguments and evidence in the literature as the abstract. In summary, the main contributions of our work are as follows: 1. A novel model KeBioSum, which efficiently incorporates medical evidence knowledge into PLMs for fine-tuning on biomedical extractive summarization tasks. To the best of our knowledge, this is the first work incorporating the domain knowledge PICO into pre-trained language models for biomedical extractive summarization.

Pre-trained language models for biomedical extractive summarization
PLMs for text summarization in the general domain are a well-researched area, in which many efficient methods such as BERTSum [14] have been proposed. Existing research in the biomedical domain has focused on using PLMs to encode the input documents and fine-tuning them for extractive summarization. Du et al. [15] proposed BioBERTSum, which used a domain-aware PLM as the encoder and fine-tuned it on the biomedical extractive summarization task. Kanwal et al. [16] proposed a method to fine-tune BERT on the International Classification of Diseases (ICD-9) labelled MIMIC-III discharge notes for the extractive summarization of electronic health records (EHRs). Kieuvongngam et al. [17] proposed a fine-tuned BERT encoder and a GPT-2 decoder for both extractive and abstractive summarization of the COVID-19 literature. Moradi et al. [18] proposed an unsupervised extractive summarization method in the biomedical domain, using a hierarchical clustering algorithm to group contextual embeddings of sentences produced by a BERT encoder and selecting the most informative sentences from each group to generate the final summary. Padmakumar et al. [19] also proposed an unsupervised extractive summarization model, which used the GPT-2 model to encode sentences and pointwise mutual information (PMI) to calculate the semantic similarity between sentences and documents; the proposed method showed better performance than previous similarity-based models on a medical journal dataset. However, previous efforts such as BioBERT [20], BlueBERT [21], ClinicalBERT [22], and SciBERT [23] generally pre-trained the PLMs on biomedical corpora, such as PubMed and de-identified clinical notes from MIMIC-III. Although these methods can capture the domain knowledge embedded in the contextual information of PLMs, they still treat terms in the biomedical area as normal words and fail to fully leverage these terms to understand the global meaning of the biomedical literature, which limits their ability to generate coherent abstracts for extractive summarization.

Enriching pre-trained language models with biomedical knowledge
Some previous studies have considered using external domain knowledge to enrich PLMs for various text mining tasks. He et al. [12] augmented BERT-based models (i.e., BERT, BioBERT, SciBERT, ClinicalBERT, BlueBERT, and ALBERT [24]) with disease information to improve tasks such as question answering, inference, and disease named entity recognition (NER). Hao et al. [22] proposed to infuse UMLS [25] information into PLMs in the pre-training stage, which improved the performance of two PLMs, BERT and ALBERT, on two downstream tasks: medical natural language inference (NLI) and named entity recognition (NER). Liu et al. [26] leveraged the UMLS knowledge base to improve entity linking, where a self-alignment pre-training framework was proposed to learn biomedical entity representations. Michalopoulos et al. [27] proposed to use domain knowledge from the UMLS Metathesaurus in the pre-training strategy of BERT to achieve knowledge-enriched contextual representations, which outperformed existing domain-specific PLMs including BioBERT and Bio_ClinicalBERT [28] on the NER and NLI tasks. Meng et al. [29] proposed to partition the knowledge graph into subgraphs and infuse them into a series of PLMs including BioBERT, SciBERT, and PubMedBERT [30]. Wallace et al. [5] proposed a multi-document neural abstractive summarization model for trial reports and improved the Bidirectional and Auto-Regressive Transformers (BART) model [31] by demarcating tokens containing PICO elements. As reported in these studies, external domain knowledge is critical for exploiting the semantic information in biomedical texts, as it can strengthen the discriminative power of the generated representations and improve the performance of downstream tasks. However, to the best of our knowledge, our method is the first study exploring the inclusion of fine-grained biomedical knowledge with PLMs for biomedical extractive summarization.

Methods
We start from the formulation. Let us assume $d$ is a biomedical document in the given corpus $D$, including $n$ sentences $\{s_1, \ldots, s_n\}$. The extractive summarization task aims to extract $m$ informative sentences ($m \ll n$) from the document to formulate the final summary $S$. It can be considered as a sentence classification problem that predicts the label $y_i \in \{0, 1\}$ of each sentence $s_i$ ($i \in \{1, \ldots, n\}$), in which $y_i = 1$ means the sentence $s_i$ should be included in the summary. PLMs, which have proven effective in capturing the contextual information embedded in text, have recently been widely used to improve the performance of this task. Yet, as mentioned previously, they are limited in their understanding of the global meaning of the document in the biomedical domain, due to the lack of domain knowledge. It remains a challenge to incorporate domain knowledge into PLMs for extractive summarization in the biomedical domain. To address it, there are two key problems: (1) how do we identify the medical evidence knowledge in the biomedical documents? (2) how do we inject the identified knowledge into the PLMs to help generate informative summaries without greatly increasing the complexity and consumption of the model? In this paper, we propose a novel framework based on PLMs called KeBioSum, in which the medical knowledge is identified by PICO annotation and injected into the PLMs via lightweight knowledge adapters to efficiently improve extractive summarization in the biomedical area.
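To make the formulation concrete, the following minimal Python sketch shows how per-sentence scores (e.g., the predicted probabilities $y_i$) translate into an $m$-sentence extractive summary; the function names are illustrative and not part of the paper.

```python
# A minimal sketch of the extractive-summarization formulation above:
# score each sentence, keep the m highest-scoring ones in document order.
# Names (score_sentence, select_summary) are illustrative, not from the paper.
from typing import Callable, List

def select_summary(sentences: List[str],
                   score_sentence: Callable[[str], float],
                   m: int = 3) -> List[str]:
    ranked = sorted(range(len(sentences)),
                    key=lambda i: score_sentence(sentences[i]),
                    reverse=True)[:m]
    return [sentences[i] for i in sorted(ranked)]
```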
As shown in Fig. 1, the medical knowledge from PICO is first captured by identifying the PICO elements in the sentences of the document, utilizing the domain-specific pre-trained language model SciBERT fine-tuned on the annotated PICO dataset EBM-NLP [32]. After detection, the generated PICO label sequences of the sentences are utilized to pre-train lightweight transformers as the PICO-element-aware knowledge adapters, based on both generative and discriminative auxiliary tasks. This allows our method to explicitly transfer the domain knowledge from PICO to the PLMs, by infusing the knowledge adapters into the encoder of the PLMs with a designed adapter fusion strategy. As illustrated in Fig. 2, the major difference between our method and existing methods is that we can fully leverage the domain knowledge to enrich the contextual representations from the PLMs via adapter fusion with pre-trained knowledge adapters. Our method can yield knowledge-aware contextual representations for sentences in the biomedical literature, which are further fed into the inter-sentence transformer to propagate information between sentences at the document level. Therefore, the final generated representations, as the input of the classifier, can provide adequate support for formulating the classification boundary between salient sentences containing key medical concepts and terms and redundant sentences containing high-frequency normal words. Since the parameters in the pre-trained PLMs are fixed and only the parameters in the lightweight adapters are updated during the fine-tuning process, our method can introduce external knowledge and yield superior performance via the proposed knowledge adapters with limited complexity and consumption. In the following subsections, we introduce each part of our proposed model in detail.

PICO detection
PICO [13] is a well-known framework for representing clinical knowledge in the evidence-based medicine (EBM) area, in which medical evidence can be represented by four different types of PICO elements, i.e., Population (P), Intervention (I), Comparison (C), and Outcome (O). It has been reported in other biomedical tasks, such as biomedical evidence summarization [5] and clinical question answering [34], that the PICO elements (see Fig. 3 for an example) embed domain-specific knowledge that can help capture the key concepts and salient sentences. Yet to date, no efforts have been made to integrate the PICO elements and labels into PLMs for the extractive summarization task. To address this, the first step of our method is to identify the PICO elements of each sentence in the biomedical documents. We propose to first train a sequence labelling model based on the domain-specific language model SciBERT [23] with the publicly available dataset EBM-NLP [32], as shown in Fig. 4. EBM-NLP contains 5000 PubMed abstracts on clinical trials, which are annotated with the I-P, I-I, and I-O elements following PICO. Fig. 5 presents a simple example of a labelled sentence in the EBM-NLP dataset. Note that Intervention and Comparison are collapsed into a single category I-I in the EBM-NLP dataset; therefore, there are generally three elements, I-P, I-I, and I-O, annotated in the dataset and used in our experiments. Following previous methods [23,30], we split the dataset into a training set with 4300 documents, a validation set with 500 documents, and a test set with 200 documents. We use SciBERT to encode the input training documents of EBM-NLP and feed the final representation of each token from SciBERT to the output classification layer with the softmax activation function. We optimize the model with the cross-entropy loss and select the best model on the validation dataset. For a given biomedical corpus D, the label of each token in each sentence is then predicted by this best SciBERT-based prediction model.
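As a rough illustration of this detection step, the sketch below uses the Hugging Face token-classification API with the public SciBERT checkpoint; the label ordering and the helper name are assumptions, and the classification head would still need to be fine-tuned on EBM-NLP with cross-entropy as described above.

```python
# A hedged sketch of SciBERT-based PICO sequence labelling (cf. Fig. 4).
# The label order and predict_pico_labels helper are illustrative.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

LABELS = ["O", "I-P", "I-I", "I-O"]  # non-PICO, Population, Intervention, Outcome
tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
model = AutoModelForTokenClassification.from_pretrained(
    "allenai/scibert_scivocab_uncased", num_labels=len(LABELS))
# ... fine-tune `model` on EBM-NLP with cross-entropy loss before using it ...

def predict_pico_labels(sentence: str):
    enc = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        logits = model(**enc).logits                 # (1, seq_len, num_labels)
    pred_ids = logits.argmax(dim=-1)[0].tolist()
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return list(zip(tokens, [LABELS[i] for i in pred_ids]))
```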

Knowledge guided training
After detecting the PICO elements of all documents in the given corpus, the next challenge is how to inject the domain knowledge embedded in them into PLMs for extractive summarization, without greatly increasing complexity and consumption. To this end, we first propose a knowledge-guided lightweight training framework named the knowledge adapter, built on adapters, which uses both generative and discriminative auxiliary tasks to explicitly model the domain knowledge from PICO elements. Adapters [35] are neural modules with a small number of trainable parameters added between layers of the transformers in PLMs. Adapter-based tuning is considered a lightweight fine-tuning strategy compared with whole-language-model fine-tuning, in which only the few trainable parameters in the adapter are trained while the original parameters of the PLMs are fixed during fine-tuning. This makes it flexible for transferring and combining multiple knowledge sources by training knowledge-specific adapters [36,37]. To fully leverage the medical evidence knowledge, inspired by ELECTRA [38], we propose to train both generative and discriminative adapters, based on the detected PICO elements for generation and the medical evidence for prediction, which can exploit different aspects of knowledge injection while making the two training objectives complementary to each other [38,39]. In our method, the generative adapter is designed to generate all the domain-specific tokens that were previously masked from the input sentence; it is trained to distinguish the PICO elements from other normal words in the sentence. The discriminative adapter is proposed to predict the PICO element type of each token in the input sentence; it is trained to distinguish between different types of medical evidence. Unlike existing methods that inject domain knowledge into language models with auxiliary training objectives, we train multiple knowledge-specific adapters independently to learn the corresponding representations, which avoids confusion between different knowledge sources and potential loss of information [36]. Note that ''generative'' and ''discriminative'' here differ from the common concepts in machine learning; following previous methods, we use these terms to describe two training strategies for pre-trained language models. Next, we present the details of the generative and discriminative adapters.
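For reference, a standard Houlsby-style adapter [35] is a small bottleneck network with a residual connection inserted after transformer sub-layers; the sketch below is a generic illustration (hidden and bottleneck sizes are assumptions), not the exact module used in adapter-hub.

```python
# A minimal Houlsby-style adapter module: down-project, non-linearity, up-project,
# plus a residual connection; only these weights are trained while the PLM is frozen.
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, hidden_size: int = 768, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)
        self.act = nn.GELU()

    def forward(self, h):
        # the residual keeps the frozen PLM representation intact
        return h + self.up(self.act(self.down(h)))
```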

Generative adapter
The generative adapter is trained to reconstruct all tokens of an input sentence in which knowledge-sensitive tokens have been masked, as shown in Fig. 6. Assume $x = \{x_1, \ldots, x_t\}$ is the original sequence of tokens of sentence $s_i$ in document $d$, and $m = \{m_1, \ldots, m_t\}$ is a binary masking vector, in which $m_j = 1$ ($j \in \{1, \ldots, t\}$) indicates that the original token is replaced with the masked token ''[MASK]''. To leverage the domain knowledge from PICO, we mask domain-specific tokens whenever they are predicted to be PICO elements according to the labels from the PICO element detection, and we additionally mask 15% of the tokens that are not predicted to be PICO elements at random. Formally, with the masking vector $m^k$ derived from the detected medical evidence knowledge $k$, the masked input sequence of tokens is $\tilde{x} = \mathrm{Replace}(x, m^k, \texttt{[MASK]})$, which is fed into the adapter network to obtain the contextualized vector of each token, $h = \mathrm{Adapter}_g(\tilde{x}; \theta_g)$, where $\theta_g$ is the parameter set of the adapter. Similar to Houlsby et al. [35], the generative adapter consists of a multi-headed attention sub-layer, two feed-forward sub-layers, and two adapter modules, which can fully capture the contextual correlations between the domain-specific tokens and exploit the domain knowledge embedded in the PICO elements. The contextualized vector of each token is used to generate the prediction logits $z_j = \mathrm{ReLU}(W h_j + b)$ with a feed-forward layer, where $W$ and $b$ are the weight matrix and bias. The generative loss is the negative log-likelihood of predicting each masked token against the ground truth token, $\mathcal{L}_{gen} = -\sum_{j=1}^{t} m_j \log p(x_j \mid \tilde{x})$, where $p(x_j \mid \tilde{x}) = \mathrm{softmax}(z_j)$ is computed over the vocabulary of size $V$. In contrast to the random masking strategy of BERT in masked language training, we prioritize masking the tokens belonging to PICO elements, which guides the model to memorize PICO knowledge.
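The masking policy described above can be sketched as follows; the helper name and the assumption that non-PICO tokens carry the label ''O'' are illustrative.

```python
# A sketch of PICO-aware masking for the generative adapter: tokens predicted as
# PICO elements are always masked, other tokens are masked with probability 0.15.
import random
from typing import List

MASK = "[MASK]"

def mask_for_generator(tokens: List[str], pico_labels: List[str],
                       p_normal: float = 0.15) -> List[str]:
    masked = []
    for tok, lab in zip(tokens, pico_labels):
        if lab != "O":                       # predicted PICO element: always mask
            masked.append(MASK)
        elif random.random() < p_normal:     # normal token: mask at random
            masked.append(MASK)
        else:
            masked.append(tok)
    return masked
```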

Discriminative adapter
To further utilize the fine-grained domain knowledge in the PICO annotations, a discriminative adapter is proposed to predict the PICO labels of the tokens, as shown in Fig. 7. To predict the PICO labels of the input sequence, the adapter takes the input sequence $x$ and yields a contextualized representation $u_j$ for each token, $u = \mathrm{Adapter}_d(x; \theta_d)$, where $\theta_d$ is the parameter set of the adapter. Each $u_j$ is then transformed via a linear layer with the softmax function to generate the label probability distribution for the corresponding token. To capture the fine-grained label information, the adapter maximizes the probability of the expected category for each token in the input sentence, i.e., it minimizes $\mathcal{L}_{dis} = -\sum_{j=1}^{t} \log p(y_j = \hat{y}_j \mid x)$, where $\hat{y}_j \in \{P, I, O, N\}$ is the ground truth category of token $x_j$.
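This objective is ordinary token-level classification with cross-entropy over the four categories; the sketch below is a generic illustration with assumed tensor shapes and helper names.

```python
# A sketch of the discriminative objective: classify each token over {P, I, O, N}
# and train with cross-entropy against the detected PICO labels.
import torch.nn as nn

NUM_LABELS = 4                                   # P, I, O, N
classifier = nn.Linear(768, NUM_LABELS)          # applied on adapter outputs u_j
loss_fn = nn.CrossEntropyLoss()

def discriminative_loss(u, gold_labels):
    """u: (batch, seq_len, 768) adapter outputs; gold_labels: (batch, seq_len) long tensor."""
    logits = classifier(u)                       # (batch, seq_len, NUM_LABELS)
    return loss_fn(logits.view(-1, NUM_LABELS), gold_labels.view(-1))
```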

Adapter infusion
Given the pre-trained generative and discriminative knowledge adapters, which capture both the label information and the label category information from the detected PICO elements and annotations, we now present how to efficiently infuse the representations from the two adapters into the PLMs to generate knowledge-enriched sentence representations. Following the previous method [40], we design a knowledge adapter fusion strategy similar to the self-attention in the transformer model [41]. For each transformer layer $l$ in the PLMs, we introduce an attention mechanism with a set of fusion parameters: a query matrix $Q_l$, a key matrix $K_l$, and a value matrix $V_l$. The output of the transformer layer $h_l$ is adopted as the query, and the output of each adapter is used as the key and value respectively. In addition to the generative and discriminative adapters, we also introduce an additional fine-tuned adapter to avoid forgetting the contextual information from the PLMs. Assuming $\{h_l^g, h_l^d, h_l^f\}$ are the outputs of the generative, discriminative, and fine-tuned adapters added to the $l$-th transformer layer, we infuse these representations with attention weights $\alpha_l^z = \mathrm{softmax}_z\big((Q_l h_l)^{\top} K_l h_l^z\big)$ and fused output $h_l^{fuse} = \sum_{z \in \{g,d,f\}} \alpha_l^z\, V_l h_l^z$. As a result, our method can combine the knowledge from the different representations of the PLMs and the three knowledge-aware adapters and yield knowledge-aware contextual representations for each sentence.
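The fusion step is essentially attention in which the layer output queries the adapter outputs; below is a minimal PyTorch sketch under that reading, with assumed shapes, rather than the exact adapter-hub AdapterFusion implementation.

```python
# A minimal sketch of adapter fusion: h_layer (the transformer-layer output) attends
# over the outputs of the generative, discriminative, and fine-tuned adapters.
import torch
import torch.nn as nn

class AdapterFusion(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.query = nn.Linear(hidden_size, hidden_size)
        self.key = nn.Linear(hidden_size, hidden_size)
        self.value = nn.Linear(hidden_size, hidden_size)

    def forward(self, h_layer, adapter_outputs):
        """h_layer: (batch, seq, hid); adapter_outputs: list of tensors with the same shape."""
        stacked = torch.stack(adapter_outputs, dim=2)    # (batch, seq, n_adapters, hid)
        q = self.query(h_layer).unsqueeze(2)             # (batch, seq, 1, hid)
        k, v = self.key(stacked), self.value(stacked)
        scores = (q * k).sum(-1)                         # (batch, seq, n_adapters)
        weights = scores.softmax(dim=-1).unsqueeze(-1)   # attention over adapters
        return (weights * v).sum(dim=2)                  # fused representation
```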

Fine-tuning for extractive summarization
After infusing these adapters into the PLMs to inject the domain knowledge, we fine-tune the whole model for extractive summarization. Since the input document consists of multiple sentences, we insert a [CLS] token at the start of each sentence and a [SEP] token at the end of each sentence. In line with previous methods, we adopt three types of embeddings: token embeddings, which convert tokens into vector representations; position embeddings, which represent the positional information of each token; and segment embeddings, which reflect the segmentation of multiple sentences. We sum these three embeddings for each token to yield its representation, which is then fed into the PLMs with several bidirectional transformer layers and our pre-trained generative and discriminative adapters. We take the outputs of the [CLS] tokens from the adapter fusion layer as the sentence representations. The sentence representations are then fed into the inter-sentence transformer layers to yield the final sentence representations $u = \{u_1, \ldots, u_n\}$, incorporating document-level semantic information between sentences. The final generated representations are then passed to the classification layer, $y_{di} = \mathrm{sigmoid}(W_c u_i + b_c)$, where sigmoid is the sigmoid function, $W_c$ is the weight matrix, and $b_c$ is the bias. Following previous methods, we utilize the binary cross-entropy loss to guide learning, which encourages the model to predict a label $y_{di}$ for each sentence that matches the ground truth label $\hat{y}_{di}$ in every input document $d$ of the corpus $D$: $\mathcal{L} = -\sum_{d \in D}\sum_{i=1}^{n} \big[\hat{y}_{di} \log y_{di} + (1-\hat{y}_{di}) \log(1-y_{di})\big]$. As opposed to existing methods, which are forced to update all parameters of the PLMs while fine-tuning on the extractive summarization task, our method updates only a limited set of parameters in the adapters, the infusion layers, and the additional transformer layers, and keeps the large set of pre-trained parameters in the PLMs fixed.
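A compact sketch of this fine-tuning head is given below: [CLS]-based sentence vectors are scored with a sigmoid classifier and trained with binary cross-entropy. Class and variable names are illustrative.

```python
# A sketch of the extractive-summarization head: score each sentence representation
# with a linear + sigmoid layer and train with binary cross-entropy.
import torch
import torch.nn as nn

class SentenceClassifier(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.linear = nn.Linear(hidden_size, 1)
        self.loss_fn = nn.BCEWithLogitsLoss()

    def forward(self, sent_reprs, labels=None):
        """sent_reprs: (batch, n_sents, hid); labels: (batch, n_sents) with values in {0, 1}."""
        logits = self.linear(sent_reprs).squeeze(-1)     # (batch, n_sents)
        probs = torch.sigmoid(logits)                    # selection probabilities y_di
        loss = self.loss_fn(logits, labels.float()) if labels is not None else None
        return probs, loss
```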

Datasets
To evaluate the effectiveness of our model, we conducted experiments on three biomedical literature datasets: CORD-19 [42], PubMed [43], and S2ORC [44]. CORD-19 is an open dataset of scientific papers on COVID-19, updated weekly. We use the version of the dataset released on 2020-06-28; after removing null and duplicate articles, it includes 57,037 documents. We randomly select 75% of the documents as the training set, 15% as the validation set, and 10% as the test set, and use the abstracts of the documents as the gold summaries. S2ORC is a publicly released dataset that includes scientific papers from domains such as biology, medicine, and computer science. We sample and use a random subset of articles from the biomedical domain. For S2ORC, we also use the abstract of each article as its gold summary and use the same train/validation/test split ratios as for CORD-19. PubMed is a commonly used dataset for the task of text summarization, containing scientific literature from biomedicine. The study [43] that released the original PubMed-Long dataset used the whole document as the input and the abstract of each document as its gold summary. Recent research adapted this dataset to use only the introduction of the document as the input [45], and we call this version PubMed-Short. We conducted experiments on both versions, using the train/validation/test split from the original paper. The statistics of these datasets are shown in Table 1. For CORD-19, PubMed-Long, and S2ORC, we extract three sentences to formulate the final summary, while for PubMed-Short we extract six sentences, to enable a direct comparison with the existing method [45]. Moreover, we also evaluate our method on the heart disease dataset [46] for evidence summarization. We use the same train/validation/test split as that study [46] and, for a fair comparison, extract eight sentences to formulate the summary following previous methods.

Implementation details
We implemented our model with PyTorch, Huggingface [47], and the adapter-hub [37] framework. Following BERTSum, we use Stanford CoreNLP [48] to split sentences. All documents are truncated to 512 tokens. For PICO detection with SciBERT [23], we used the scibert-scivocab-uncased model. We trained the SciBERT-based sequence labelling model on EBM-NLP with a learning rate of 0.001, a gradient accumulation batch size of 32, 75 training epochs, and a dropout of 0.1. We investigated the BERT [8], RoBERTa [49], BioBERT [20], and PubMedBERT [30] models, implemented in Huggingface, as the PLMs in our experiments. We used adapter-hub to implement both the generative and discriminative adapters. In training the generative adapter, we set the learning rate to 1e-4, warm-up steps to 500, training epochs to 12, weight decay to 0.001, and batch size to 24, and used perplexity as the metric to select the best generative adapter model. In training the discriminative adapter, we set the learning rate to 5e-5, warm-up steps to 500, training epochs to 12, weight decay to 0.001, and batch size to 24, and used the F1-score as the evaluation metric. We ran 30,000 steps to train our whole method and saved a model checkpoint every 1000 steps. We selected the best checkpoint according to the cross-entropy loss on the validation set and report the results on the test set. To train the model, we used the greedy search algorithm [10] to select the oracle summary of each document by maximizing the ROUGE-2 score against the gold summary. We set the learning rate to 2e-3, warm-up steps to 1000, and dropout to 0.4, and used two-layer transformers to compute the sentence representations. For evaluating the quality of the generated summaries, we used the ROUGE [50] metric, calculated with the pyrouge package (https://github.com/andersjo/pyrouge.git).
Specifically, we report the unigram and bigram F1 (ROUGE-1 and ROUGE-2) between the generated summary and the gold summary, and the longest common subsequence (ROUGE-L) to evaluate fluency. Moreover, we also use the recently proposed BERTScore (BS) [51] metric, which evaluates the semantic similarity between the generated summary and the gold summary using contextual embeddings, calculated with its released implementation.
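The greedy oracle-selection step mentioned above can be sketched as follows; the simple bigram F1 used here is only an illustrative stand-in for the official ROUGE implementation, and the helper names are assumptions.

```python
# A hedged sketch of greedy oracle selection [10]: keep adding the sentence that most
# improves ROUGE-2 against the gold abstract, up to a fixed number of sentences.
from typing import List

def bigrams(text: str):
    toks = text.lower().split()
    return set(zip(toks, toks[1:]))

def rouge2_f1(candidate: str, reference: str) -> float:
    c, r = bigrams(candidate), bigrams(reference)
    if not c or not r:
        return 0.0
    overlap = len(c & r)
    if overlap == 0:
        return 0.0
    prec, rec = overlap / len(c), overlap / len(r)
    return 2 * prec * rec / (prec + rec)

def greedy_oracle(sentences: List[str], abstract: str, max_sents: int = 3) -> List[int]:
    selected, best = [], 0.0
    while len(selected) < max_sents:
        gains = [(rouge2_f1(" ".join(sentences[j] for j in sorted(selected + [i])), abstract), i)
                 for i in range(len(sentences)) if i not in selected]
        if not gains:
            break
        score, idx = max(gains)
        if score <= best:
            break
        selected.append(idx)
        best = score
    return sorted(selected)
```

For BERTScore, the released bert-score package can be used roughly as follows (the example strings are invented):

```python
# A small usage sketch of BERTScore [51] for summary evaluation.
from bert_score import score

candidates = ["the trial reported reduced mortality in the treatment group"]
references = ["mortality was lower for patients receiving the intervention"]
P, R, F1 = score(candidates, references, lang="en", verbose=False)
print(f"BERTScore F1: {F1.mean().item():.4f}")
```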

Baselines
We compare our method with the following baselines:
1. LEAD: this method selects the first three sentences of the document as the final summary.
2. ORACLE: a method which greedily selects the sentences that maximize the ROUGE scores; this is the upper bound for an extractive summary.
3. MATCH-ORACLE: the upper bound of the metrics as given in the MatchSum paper [45].
4. BERTSum [14]: a strong extractive summarization baseline, which directly fine-tunes the BERT model on the extractive summarization task.
5. PubMedBERTSum [14]: the BERTSum method extended to use PubMedBERT as the encoder.
6. MatchSum [45]: a state-of-the-art extractive summarization method.
7. Aceso [46]: a PICO-guided evidence summarization method.

Results analysis

Description of tables
For all language models, we use their base versions. We report the results of LEAD, ORACLE, BERTSum, and our method in Tables 2 and 3 on CORD-19, PubMed-Long, and S2ORC, and present the performance of LEAD, ORACLE, MATCH-ORACLE, BERTSum, MatchSum, and our method in Table 4 on PubMed-Short. The evaluation results for LEAD, ORACLE, and MATCH-ORACLE are taken from a previous paper [45] and shown in the first block of each table. The second block presents the results of the strong baseline BERTSum, which directly fine-tunes the BERT model on the extractive summarization task; we present its results with both BERT (BERTSum) and PubMedBERT (PubMedBERTSum) as the encoder. The final block shows the results of our method based on BERT, RoBERTa, BioBERT, and PubMedBERT respectively. Moreover, we compare the performance of our method under different settings: ''-finetune'' is the model with the fine-tune adapter, ''-gen'' the model with the generative adapter, ''-dis'' the model with the discriminative adapter, ''-all'' the model with all adapters, and ''-all-full'' the model with all adapters and all parameters updated during fine-tuning. Note that in Table 4 there is an additional block after BERTSum which reports the results of MatchSum.

Analysis of adapters
As shown in Tables 2-4, our method with PubMedBERT yields the best performance on all four datasets among all methods and is close to the upper bound of ORACLE, demonstrating the effectiveness of our method. We hypothesize that the reason for the increased performance is the ability of our model to exploit medical evidence, using the PICO elements and annotations with knowledge adapters, to distinguish salient sentences from non-salient sentences. This superior performance is also highlighted by the direct comparison of the ROUGE metrics, on all four datasets, between our BERT- and PubMedBERT-based methods and the strong baselines BERTSum and PubMedBERTSum: the former show superior performance to the latter on all four datasets. Methods based on PLMs, such as BERTSum, MatchSum, and our method, all present a significant improvement when compared with the LEAD method, showing the advantage of using PLMs in this task. However, previous PLM-based methods fail to leverage the external domain knowledge from the PICO elements and annotations, which can benefit the formulation of the summary. Our method using knowledge adapters improves the performance of all PLMs (BERT, RoBERTa, BioBERT, and PubMedBERT), as can be seen by comparing our methods (-all), where both discriminative and generative adapters are used, with the direct fine-tuning methods (-finetune). This illustrates that our method can fully capture the domain knowledge in the PICO annotations and enrich the contextual representations from the PLMs via the knowledge adapters and adapter infusion.
We also show that both the discriminative (-dis) and generative (-gen) knowledge adapters can capture the corresponding PICO-relevant domain knowledge and improve performance when injected into the PLMs compared with the fine-tuned method. The generative knowledge adapter can tell the difference between PICO elements and non-domain-specific words, while the discriminative knowledge adapter can further capture the category information of fine-grained PICO types. Moreover, they are complementary for extractive summarization in the biomedical domain, since our methods with all adapters (-all) outperform those with only one of the discriminative and generative adapters on all four datasets.

Analysis of different language models
Here we further discuss the effect of the different PLMs on our method. From Tables 2-4, we observe that our framework based on the PubMedBERT model achieves the best performance on all datasets, and the models based on PLMs pre-trained on domain datasets, such as PubMedBERT and BioBERT, show better performance than models pre-trained on general corpora, such as BERT and RoBERTa. This demonstrates the importance of pre-training PLMs on large-scale biomedical texts, which allows them to capture domain knowledge during the pre-training process. For general PLMs such as BERT and RoBERTa, there is still a semantic gap between them and biomedical texts. However, our proposed method can help to efficiently fill the knowledge gap between general PLMs and biomedical extractive summarization, where the injected knowledge adapters provide essential medical evidence to enrich the contextual representations of the PLMs regardless of whether the PLMs are pre-trained on domain literature or not. As shown in the results, our models based on BERT and RoBERTa with all knowledge adapters outperform those based on BioBERT and PubMedBERT without any knowledge adapters. This demonstrates that our adapters can provide better guidance, from the medical evidence embedded in PICO, than large-scale pre-training on domain literature, even though the former requires far less computation than the latter.
We also observe that the improvement from infusing the knowledge adapters varies with different PLMs. For the BERT-based model, we can see a significant improvement in performance after knowledge infusion: compared to the BERT-finetune model, the BERT-all model outperforms it by nearly 1%, 1.5%, and 3% on all datasets. For RoBERTa, BioBERT, and PubMedBERT, the improvement after incorporating adapters is more limited in most cases. For example, on the CORD-19 dataset, the RoBERTa-all, BioBERT-all, and PubMedBERT-all models outperform RoBERTa-finetune, BioBERT-finetune, and PubMedBERT-finetune by only around 0.5%, 0.8%, and 1% respectively. This is likely because these three PLMs are strengthened by pre-training on biomedical literature or on a much larger general corpus. However, our method still provides external domain knowledge that can enrich the contextual representations generated from these PLMs and increase performance on the extractive summarization task.

Light-weight fine-tuning
In this section, we further compare our method with existing PLM-based methods regarding the trade-off between model complexity and model performance. We present the average time taken for one training step, in seconds, and the parameter sizes of different models in Table 5. The table shows that our models, including PubMedBERT-gen, PubMedBERT-dis, and PubMedBERT-all, are quicker to train and have fewer trainable parameters than BERTSum and PubMedBERTSum, which fine-tune all parameters in their methods. The knowledge adapters in our method allow us not only to leverage the external knowledge but also to yield competitive performance by updating a small set of parameters in the appended adapter layers, transformers, and classifier. This greatly reduces the complexity and consumption of the method, although it also compromises performance to some extent because of the limited model space.
To further demonstrate this trade-off, Tables 2-4 also present the results of our method based on PubMedBERT named PubMedBERT-all-full, which has the same architecture as PubMedBERT-all but updates all parameters. This model can be seen to outperform PubMedBERT-all, which only updates the adapter weights and keeps the base PLM weights fixed. Compared with BERTSum and MatchSum, which fine-tune all parameters, our model with adapters PubMedBERT-all is a lightweight framework that only updates the parameters in the added adapter layers, transformers, and classifier. The lightweight fine-tuning framework can reduce the training time to some extent, but may also weaken the performance since it only updates a small number of parameters. Therefore, to make a fair comparison, we also show the results of the PubMedBERT-based BERTSum, called PubMedBERTSum, in these three tables. We can see that the full fine-tuning model PubMedBERTSum outperforms the lightweight fine-tuning model PubMedBERT-finetune on all datasets, illustrating that lightweight fine-tuning based on the adapter harms performance due to the limited parameter updates. Compared with the full fine-tuning model PubMedBERTSum, our full fine-tuning model PubMedBERT-all-full, which incorporates all adapters, performs better and achieves the best performance among all models on all datasets. This again demonstrates the importance of domain knowledge and the effectiveness of our knowledge infusion framework.

Case study
Another question to be investigated is how different types of PICO elements influence biomedical extractive summarization. To verify this, we conducted experiments on PubMed-Short with our PubMedBERT-based models, considering one of the three PICO elements at a time. We also introduce different adapter settings to interpret how our proposed knowledge adapters acquire the domain knowledge from the PICO elements. As shown in Table 6, in both adapter settings, models incorporating Outcome (i.e., PubMedBERT-gen-O, PubMedBERT-dis-O, PubMedBERT-all-O) perform slightly better than the models incorporating Intervention and Population. This indicates that Outcome may provide more semantic information than the other PICO categories in this dataset. This may be because Outcome entities account for a larger percentage than the other PICO categories: as shown in Table 7, which reports the number of different PICO elements on the PubMed-Short dataset, Outcome entities have the highest percentage in PubMed-Short texts. The number of different PICO elements is thus positively correlated with their influence on summary generation performance, as shown in Table 6. A similar situation can also be observed in Table 8, which presents an example of the summary extracted by our model, consisting of six sentences. We show the tags of the PICO elements of these sentences, where words belonging to PICO elements are indicated in blue. We mark sentences with the same colour in both the extracted summary of our method and the gold summary if they are highly similar. As reflected in the ROUGE metric, our method generates a summary highly similar to the reference summary: all six sentences extracted by our method have corresponding sentences in the gold summary, which contains the key information of the document. We can also observe that most of the sentences in the summary contain PICO elements that embed the core concepts of the sentence, such as the Outcome ''purulent fluid''.

Evidence summarization
In Table 9, we first show the PICO detection results of different methods: Bi-LSTM-CRF [32], Aceso [46], SciBERT, and the SciBERT+UMLS [26] method used in our model. We can see that the SciBERT-based methods and Aceso significantly outperform Bi-LSTM-CRF. Compared with the SciBERT method, Aceso yields better performance; this is because Aceso incorporates external knowledge from UMLS [25], while SciBERT is pre-trained on scientific papers in the general domain. SciBERT+UMLS [26] achieves the best performance, due to incorporating domain knowledge through self-alignment pre-training on the UMLS knowledge base.
In Table 10, we further show the evidence summarization results of different methods on the heart disease dataset. Different from biomedical literature datasets such as CORD-19 that use abstracts as the gold summary, this dataset uses 15 human-generated evidence summaries as the ground truth, each containing 10-15 selected sentences with PICO knowledge. Abst is a simple baseline that uses the abstract as the summary. RobotReviewer is an evidence summarization tool based on multiple machine learning and natural language processing methods. Aceso is the PICO-guided method based on knowledge embedding. Our method PubMedBERT-all yields better performance than the strong baseline Aceso. Both PubMedBERT-all and Aceso use the PICO framework, and Aceso further incorporates the external knowledge base UMLS via knowledge embedding. The superior performance of PubMedBERT-all can be attributed to PubMedBERT's pre-training on large-scale domain-specific texts.

Discussion
We present a novel framework called KeBioSum for incorporating fine-grained medical knowledge (i.e., PICO elements) into pre-trained language models for biomedical extractive summarization. KeBioSum relies on a lightweight framework called the knowledge adapter, which is trained to memorize the key tokens of sentences, namely the PICO elements and their types in the PICO representation. We evaluate its effectiveness on three biomedical literature datasets, achieving promising performance. The results demonstrate the ability of our model to incorporate PICO elements and improve a series of pre-trained language models from both the general domain and the biomedical domain.
Besides PICO elements, our framework can flexibly incorporate other knowledge, such as biomedical knowledge graphs, since the knowledge adapter is a plug-and-play module for pre-trained language models. It does not break the original structure of the pre-trained language models and can inject different knowledge into the contextual representations of PLMs via independent adapters. Moreover, it is also promising to enhance PLMs from both the general domain and the biomedical domain with our framework for other biomedical tasks, such as summarization of electronic health records (EHRs) and disease prediction.
Moreover, transformer-based models impose a token limit on the input document length due to the expensive nature of their attention computations. Therefore, in line with prior research exploring summarization methods using PLMs, we take only the first 512 tokens of the document and apply our summarization models to only the sentences contained within this span, which is broadly equivalent to taking the introduction of a biomedical article. This inevitably results in losing useful information from other sections of the documents. In future research, we will explore strategies for overcoming this limitation, which would allow us to summarize full documents, for example by using Longformer [53] as the encoder.

Conclusion
In this paper, we propose KeBioSum, a novel medical evidence knowledge-enhanced PLM framework for biomedical extractive summarization. We designed generative and discriminative training tasks with an adapter-based lightweight fine-tuning framework to effectively infuse medical evidence knowledge into the language model, and analysed the effectiveness of including medical evidence knowledge in improving a series of pre-trained language models for biomedical extractive summarization. Experimental results on three biomedical literature datasets showed that our proposed model outperforms strong baselines on the biomedical extractive summarization task and that PLMs can be enhanced by the inclusion of fine-grained domain knowledge. In future work, we will explore other sources of domain knowledge for enhancing language models in biomedical summarization tasks, such as the incorporation of knowledge from UMLS. We will also extend our methods to abstractive summarization of the biomedical literature.

Fig. 1 .
Fig. 1. The overall framework of our model.

Fig. 4 .
Fig. 4. The model architecture of the SciBERT-based PICO element detection model.

Fig. 5 .
Fig. 5. The labelling format of a sentence in the EBM-NLP dataset. The labels ''I-O'', ''I-I'', and ''I-P'' identify tokens belonging to the outcome, intervention, and population respectively, while the label ''O'' identifies tokens not belonging to any PICO element.

Fig. 7 .

Fig. 7. The discriminative adapter aims to classify the different types of PICO medical evidence {I-P, I-I, I-O, O}, where I-P, I-I, and I-O represent Population (P), Intervention (I, into which Comparison is collapsed), and Outcome (O), and the additional label O marks tokens that are not labelled as a PICO element. We adopt the same structure for the discriminative adapter as for the generative adapter.

Table 1
Statistics of the datasets. Ext denotes the number of sentences extracted in the final summary.

Table 2
Rouge F1 results of different models on the CORD-19 and PubMed-Long datasets. ''-finetune'' represents the KeBioSum model with the finetune adapter. ''-gen'' represents the KeBioSum model with the generative adapter. ''-dis'' represents the KeBioSum model with the discriminative adapter. ''-all'' represents the KeBioSum model with all adapters. * represents the cases where a model outperformed the strong baseline of BERTSum significantly (p < 0.05). † represents the cases where an ''-all'' model outperforms the models with only one adapter significantly (p < 0.05).

Table 3
Rouge F1 results of different models on the S2ORC dataset.

Table 4
Rouge F1 results of different models on the PubMed-Short dataset. Some results are taken from the MatchSum paper [45].

Table 5
The time taken by different models to run one training step on the CORD-19 dataset (on one NVIDIA TITAN RTX), and the parameter size of different models.

Table 6
Rouge F1 results of our model based on PubMedBERT with different PICO elements on the PubMed-Short dataset.

Table 7
Number of different PICO elements on the PubMed-Short dataset.