Utilizing passage‐level relevance and kernel pooling for enhancing BERT‐based document reranking

The pre‐trained language model (PLM) based on the Transformer encoder, namely BERT, has achieved state‐of‐the‐art results in the field of Information Retrieval. Existing BERT‐based ranking models divide documents into passages and aggregate passage‐level relevance to rank the document list. However, the common score aggregation strategies cannot capture important semantic information such as document structure, and they have not been studied extensively. In this article, we propose a novel kernel‐based score pooling scheme that captures document‐level relevance by aggregating passage‐level relevance. In particular, we propose and study several representative kernel pooling functions and several document ranking strategies based on passage‐level relevance. Our proposed framework, KnBERT, naturally incorporates kernel functions at the passage level into the BERT‐based re‐ranking method, providing a promising avenue for building universal retrieve‐then‐rerank information retrieval systems. Experiments conducted on the two widely used TREC Robust04 and GOV2 test datasets show that KnBERT achieves significant improvements over other BERT‐based ranking approaches in terms of the MAP, P@20, and NDCG@20 metrics, with no extra, or even less, computation.


Motivation
The success in applying these semantic-based language models may lead to another rapid rise in search engines. In practice, information retrieval can be summarized as retrieving relevant documents or chunks quickly and accurately from a large amount of text, in response to queries consisting of only a few words submitted by users. 6 Existing information retrieval frameworks that use PLMs include two components: the retrieval phase and the rerank phase. 7 This "retrieval-then-rerank" multistage retrieval pipeline performs well in a diversity of downstream NLP tasks, such as Question Answering, Recommendation Systems, 8 and Information Retrieval. 7,9 Both the recall and rerank phases jointly influence the results.
Considering time and memory costs during retrieval, using an unsupervised model such as BM25 + RM3 or DPH + KL for the first-stage retrieval is a cost-effective approach. BM25 + RM3 is an unsupervised ranking model using pseudo relevance feedback signals. 10 Derived from the divergence-from-randomness framework, DPH is an unsupervised retrieval model; DPH + KL uses Kullback-Leibler (KL) divergence and the Rocchio model to expand the original query, and then uses DPH to rank documents. 11,12 A pre-trained model like BERT is more often adopted to further improve the retrieval results in the rerank stage, due to its ability to model the interaction between query and document. The effectiveness of BERT is mainly attributed to the complex architecture of the Transformer encoder 13 and its ability to compute deeply contextualized semantic interactions over the input sequence. However, a challenge must be addressed when applying BERT to document retrieval. BERT uses the Transformer structure, in which the self-attention mechanism is a core component. In self-attention, each position computes attention weights relative to all other positions in the sequence, so the cost of this computation is quadratic in the input sequence length: as the input grows longer, the computational cost grows quadratically. Moreover, Devlin et al. 1 mention that during pretraining, BERT was trained with a sequence length of 128 tokens for 90% of the steps and with a sequence length of 512 tokens for the remaining 10% of the steps to learn the positional embeddings. If we use this checkpoint, the input length is therefore limited to 512 tokens, and computational efficiency also cannot be ignored when the input sequence is long. Consequently, the input sequence is limited to at most 512 tokens. Most candidate documents in a retrieval system are longer than 512 tokens, so document relevance scores cannot be directly calculated by BERT. To apply BERT to document retrieval tasks, the most popular and effective approach today is to decompose documents into natural passages or independent sentences, aggregate passage-level or sentence-level relevance, and use the aggregated score to represent the relevance between the query and the whole document. 14,15 The success of this approach reveals that BERT was designed for processing short spans of text. 16 Some Transformer variants, for example, Longformer 17 and BigBird, 18 can handle thousands of tokens in a single inference. However, given the response-time requirements of ad-hoc retrieval, their latency prevents them from being used in real cases.
A multitude of previous work has tried to prove that passage-level relevance indeed impacts document-level relevance, and has illustrated that considering fine-grained relevance signals enhances performance. Because a document is composed of multiple passages with various features, which play different roles in relevance assessment, Liu and Croft 19 took the highest passage relevance score as the document score. Most existing BERT-based models, such as Birch, 15 consider the most relevant (or top-k most relevant) passage or sentence scores to be a good proxy for document-level relevance.
However, the common practice of judging overall document relevance by single separated chunks may lead to mismatches. These common passage-level relevance aggregation approaches cannot capture important semantic information such as document structure. For example, with these common aggregation methods, a document whose passages are out of order may obtain the same score as a document with a natural passage structure. Therefore, these approaches may not work well in certain real cases, because the assumption that several passages can replace the whole document may lead to a contrary judgment. Liu and Croft 19 and Wu et al. 20 built datasets and collected relevance judgments from human annotators to study how humans judge relevance. They discovered that more relevant documents usually contain a higher percentage of relevant passages. Kong et al. 21 proposed the aggregate relevance (AR) principle, which assumes that the proportion of relevant passages in a document determines the document's relevance score. Li et al. 22 conducted human eye-tracking research and concluded that the passages at the beginning of a document attract more attention and have a significant impact on full document-level relevance. In most candidate documents, the beginning or end of the document contains the main ideas and the author's point of view, so the passages or sentences there are usually more relevant to the topic. Additionally, experiments suggest that passage-level relevance labels play an important role in predicting the relevance score of documents.
23,24 To capture such document-level relevance information, in this work we provide a kernel-based architecture for aggregating passage-level scores into document scores on top of a BERT-based ranker, together with a reward mechanism that enhances retrieval performance by integrating passage-level relevance distribution information. Because of response-time requirements, we choose to apply kernel functions to a simple score aggregation approach instead of additionally training a CNN- or Transformer-based classifier. Specifically, a series of novel BERT-based re-ranking methods (KnBERT1, KnBERT2, KnBERT3, and KnBERT4) are proposed, in which different kernel functions are utilized to weight passage positions. After analyzing the extracted sequences of passage-level relevance scores, the documents that conform to a given passage distribution are rewarded.

Research objectives
Proposing a multi-stage framework that integrates passage-level relevance into BERT for ad-hoc document re-ranking is the main purpose of this work. Our specific objectives can be summarized as follows:
1. To study the influence of combining passage-level relevance with a pre-trained language model, namely BERT, in the proposed document re-ranking module, and how the passage distribution affects the results.
2. To study the impact of the kernel function parameters and smoothing factors on the effectiveness of the proposed framework.
3. To boost the effectiveness of the multi-stage retrieval framework. To measure the improvement, several metrics are considered, including precision at position 20 (P@20), normalized discounted cumulative gain at position 20 (NDCG@20), and the mean average precision at positions 100 and 1000 (MAP@100, MAP@1000).

Contributions
Our main contributions can be summarized as follows:
1. A novel position-aware BERT-based re-ranking model named KnBERT, incorporating both a kernel-based passage-level score aggregation strategy and a reward mechanism enhancement module.
2. To better weight passage positions and aggregate passage-level signals, we present a kernel-based passage-level relevance weighting formula that uses a linear combination to balance the importance of two components. The first component is a simple passage score aggregation strategy (in this article, the max passage score computed by BERT). The second component uses kernel functions to weight passage positions.
3. We explore further strategies based on the passage-level relevance distribution to reward documents that conform to specific passage distributions. Experiments show that these strategies can advance the BERT-based ranking model on document retrieval tasks.
We experiment on two widely used TREC test collections, Robust04 and GOV2, and the evaluation demonstrates that our framework, KnBERT, boosts the NDCG@20 and MAP metrics significantly compared with a state-of-the-art BERT-based ranking model that was pre-trained on MS MARCO before fine-tuning.
The remainder of the article is organized as follows. In Section 2, a brief overview of related work is given. Next, in Section 3, the proposed re-ranking framework, which combines overall document relevance and semantic passage distribution information at the passage level, is presented. The baselines and experimental settings are introduced in Section 4. The metric results are shown in Section 5. In Section 6, we discuss the parameters and further analyze the effectiveness of our proposed framework. Finally, the conclusion of this work and our expectations for future research are provided in Section 7. Our code is available at: https://github.com/panminiii/KnBERT.

BERT-based ranking models
The contextualized pre-trained language model BERT is a superior technique for various natural language processing and IR tasks, including document ranking. 9 MacAvaney et al. 25 integrated the interaction representations of BERT into an existing neural model, namely CEDR, which utilizes the word embeddings of BERT's last layer to build a similarity matrix that is fed into an existing ranking model to enhance performance. Yilmaz et al. 16 leverage larger datasets and more training samples by transferring models across domains and aggregating sentence-level evidence to rank the document list. Huang et al. 26 leverage the intermediate layers of BERT to acquire semantic information across various levels and devise features of multiple granularities for the final relationship classification. Nogueira and Cho 14 first adopted BERT for passage re-ranking tasks using BERT's [CLS] vector, converting the ranking task into a text classification problem. MS MARCO 27 and TREC CAR 28 were used in the training phase, and the model demonstrated a significant improvement over unsupervised baselines and existing shallow rankers. This simple use of BERT is very effective for retrieval tasks, especially in the re-rank phase. Many researchers have tried to incorporate pseudo-relevance feedback into the BERT model, aiming to mitigate the representation gap between queries and documents.
Pseudo-relevance feedback is an effective technique for traditional query expansion. 29,30 Early studies have shown that using traditional query expansion methods to improve the performance of BERT-based re-rankers is not effective. To address the vocabulary mismatch between query and document, Zheng et al. 31 proposed a novel query expansion model, BERT-QE. The model uses an unsupervised chunk selection method to expand the original query in three stages and utilizes the most relevant text chunks to re-evaluate the document relevance scores. Evaluation results on the Robust04 and GOV2 test sets show that BERT-QE significantly outperforms BERT-Large, while the additional computational cost is relatively small, thus providing great advantages in terms of computational efficiency. Transformer-based re-rankers can also benefit from the additional context provided by PRF. To address the problem of a query being a poor description of the information need, Yu et al. 32 proposed PGT, a pseudo relevance feedback framework using a graph-based Transformer. This framework treats feedback documents as additional context and utilizes sparse attention to reduce computation. Compared to BERT-based re-rankers, PGT can use more feedback documents to improve retrieval accuracy and computational efficiency. Considering that the ranking context is crucial for the performance of learning to rank, Chen et al. 33 fully explored the application of Co-BERT in BERT-based re-ranking. Co-BERT is an end-to-end BERT-based ranking model. It incorporates ranking contexts by jointly modeling the interactions between the query and multiple documents in the same ranking, and uses pseudo-relevance feedback to adjust the relevance weights. Experimental results on three standard TREC test sets (Robust04, GOV2, and ClueWeb09-B) show that the Co-BERT model offers substantial performance improvements. Pan et al. 34 and Wang et al.
35 proposed two new pseudo-relevance feedback approaches that combine relevance matching and sentence-level semantic matching.
In Table 1, we present an overview of pertinent studies, delineating their respective deficiencies and proposing potential solutions for addressing the limitations inherent in the existing models.

Passage-level relevance
To overcome the limitation on input sequence length when using BERT, researchers have studied using score aggregation to obtain an overall document relevance score. We briefly summarize the utilization of passage relevance scores from BERT for document re-ranking. Callan 23 realized

TABLE 1  The shortcomings and potential solutions of the existing models.

Yilmaz et al. 16
  Contribution: Proposing sentence-level evidence aggregation for document ranking to address the length constraints encountered in BERT processing.
  Shortcoming: Solely focusing on high-scoring sentences may overlook other potentially relevant information.
  Solution: Analyzing other segments of the document or employing more sophisticated ranking methods.

Nogueira and Cho 14
  Contribution: Adapting BERT for query-based passage re-ranking to improve the average response time in retrieval.
  Shortcoming: For MS MARCO, the imposition of an upper limit on irrelevant passages may result in overfitting during model training.
  Solution: On the MS MARCO dataset, mitigating the risk of overfitting by constraining the upper limit on irrelevant passages.

Zheng et al. 31
  Contribution: Introducing the novel query expansion model BERT-QE, utilizing BERT for unsupervised chunk selection and re-evaluating document relevance scores.
  Shortcoming: The BERT-QE model has yet to find the optimal balance between efficiency and effectiveness.
  Solution: Attempting end-to-end training of a context-based query expansion model to enhance overall performance.

Yu et al. 32
  Contribution: Presenting PGT, a graph-based Transformer pseudo-relevance feedback method designed to augment non-PRF Transformer re-rankers.
  Shortcoming: Although PGT enhances non-PRF BERT re-rankers, it comes with relatively high computational complexity.
  Solution: Enriching the information in PGT graph nodes as a way of reducing computational complexity.

Chen et al. 33
  Contribution: Introducing Co-BERT, an end-to-end BERT-based ranking model with lightweight PRF-based calibration for improved query retrieval performance.
  Shortcoming: Ordering the training data by the initial ranking may lead to a performance decline in information retrieval.
  Solution: Training with randomly shuffled batches to enhance model performance in information retrieval.

Pan et al. 34
  Contribution: Leveraging BERT to incorporate sentence-level semantic information into PRF for query expansion.
  Shortcoming: There may be an excessive reliance on the performance of the BERT model.
  Solution: Substituting BERT with alternative pre-trained language models to enhance retrieval effectiveness and robustness.
the importance of the relevance scores of passages within the document and extracted passages with passage-based and window-based approaches. Fan et al. 36 experimented with aggregated passage representations of relevance and found that they perform well in context. Dai and Callan 37 and Li et al. 38 explored BERT-based approaches to demonstrate the importance of sentence-level or passage-level relevance. Dai and Callan 39 proposed three strategies to capture passage-level relevance for ad-hoc re-ranking: the first passage (BERT-FirstP), the most relevant passage (BERT-MaxP), and all passages (BERT-SumP). Specifically, they divided each document with a sliding window into passages of equal length and used a BERT model to predict the relevance between the query and each individual passage. Song et al. 40 and Dai and Callan 41 mapped contextualized embeddings learned by BERT to estimate term weights. Wu et al. 20 explored a semantic passage-level score aggregation based on passage distribution for document ranking and studied how passage-level relevance labels affect the relevance of the full document. They found that more relevant documents usually contain a higher percentage of relevant passages. PARADE (Li et al. 24) explored aggregating passage representations and then generating document scores with neural pooling. IDCM 42 attempts to choose the most relevant chunks within a document before applying a pre-trained language model, to reduce time and memory costs. Bi et al. 43 utilized fine-grained semantic matching to explain ranking results for online shopping. Pan, Li, et al. 44 proposed SE-BERT, which utilizes a pre-trained generative language model to summarize candidate passages and concatenate the summaries into a new input sequence.
Attempts to utilize fine-grained semantic information in place of overall document relevance have achieved some success. Although the methods mentioned above, which also study passage relevance for IR tasks, achieve fine results, there is still room for improvement. Traditional pooling methods include max pooling, mean pooling, and first pooling, which use the best passage score, the average of all passage scores, and the first passage score, respectively, as the score for the entire document. 39 Pooling with the max, mean, or first passage is too simple to capture long-distance dependency information between passages. Therefore, we try to model passage-level semantic relationships with kernel pooling.
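For reference, the three traditional pooling strategies can be sketched in a few lines of Python (a minimal illustration; the passage scores here are arbitrary stand-ins for BERT outputs):

```python
# Traditional score aggregation: a document's score is derived from its
# passage-level scores by max, mean, or first-passage pooling
# (cf. BERT-MaxP / BERT-SumP / BERT-FirstP).
def max_pooling(scores):
    return max(scores)

def mean_pooling(scores):
    return sum(scores) / len(scores)

def first_pooling(scores):
    return scores[0]

passage_scores = [0.25, 0.75, 0.5, 0.5]  # hypothetical passage scores
print(max_pooling(passage_scores))       # 0.75
print(mean_pooling(passage_scores))      # 0.5
print(first_pooling(passage_scores))     # 0.25
```

None of these functions looks at where a passage sits in the document, which is exactly the gap the kernel pooling below is meant to fill.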

Kernel-based features in IR
Kernel functions are widely used in IR to estimate similarity, term proximity, and term co-occurrence. Based on the robustness of kernel functions, we consider them an effective approach for measuring passage-level relevance. In many studies, 45,46 proximity-based kernel functions have been applied to model the influence between terms. de Kretser and Moffat 47 utilized four contribution functions (i.e., Triangle, Cosine, Circle, and Arc functions) to model the local similarity between each query term and the other positions. Later, the concept of query term occurrence was proposed by Zhao et al., 46,48 who introduced several kernel functions to estimate the relationship between query terms that appear together. In particular, Zhao et al. 46 discussed further kernel functions, such as the Gaussian, Triangle, Cosine, Circle, and Quartic kernels, which share the same properties: symmetry, non-negativity, monotonicity, continuity, and identity. The Gaussian kernel is widely used in machine learning, such as in statistics and support vector machines, while the Triangle, Circle, and Cosine kernels originate from basic geometric shapes and are used to estimate proximity-based density distributions for positional language models. Pan et al. 49 proposed that a higher co-occurrence value may indicate a relevance match between two query terms; they captured the term proximity between query terms and expansion terms by adopting kernel functions that satisfy the properties above.

OUR PROPOSED METHOD
We describe KnBERT in this section, including a kernel-based approach for aggregating passage-level relevance signals. Given a query q and a document d, to estimate the degree to which document d fits query q, we propose a multi-stage ranking framework that generates a relevance score S(q, d), as follows. Specifically, we combine the passage-level relevance score aggregation strategy with an effective reward mechanism that can boost the ranking quality significantly. We then describe in detail how kernel functions work in our proposed model. The model architecture is shown in Figure 1.

Overview
As mentioned in Section 1, BERT cannot directly process most candidate documents, which are longer than 512 tokens, due to its fixed sequence length limitation. When applying BERT to ad-hoc document re-ranking, the goal is to re-rank a document list produced by a first-stage recall model such as BM25 + RM3 or DPH + KL. Traditionally, document-level relevance has been assessed by aggregating passage scores. Akin to prior work, 31 we decompose a document into passages with a sliding window of 100 tokens, where two neighboring passages overlap by 50 tokens. This process can be formally expressed as d = {P_1, P_2, ..., P_n}, where n is the number of passages. In our methods, the size of the sliding window determines the smoothness of the kernel functions when pooling the scores. Although longer sliding windows can capture longer-distance dependencies in the document, they do not lead to better results. A BERT checkpoint fine-tuned on MS MARCO 27 is used to estimate the relevance between the query and each individual passage.
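The sliding-window decomposition above can be sketched as follows (a minimal illustration; a real implementation would operate on WordPiece tokens rather than plain strings):

```python
def split_into_passages(tokens, window=100, stride=50):
    """Split a tokenized document d into overlapping passages
    {P_1, ..., P_n}: a 100-token window moved in 50-token steps makes
    each pair of neighboring passages overlap by 50 tokens."""
    passages = []
    for start in range(0, max(len(tokens) - stride, 1), stride):
        passages.append(tokens[start:start + window])
    return passages

doc = [f"tok{i}" for i in range(230)]
passages = split_into_passages(doc)
# 230 tokens -> 4 passages; each neighboring pair shares a 50-token overlap.
```

Documents shorter than one window simply yield a single passage, so the aggregation code downstream never has to special-case short documents.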
As in Nogueira and Cho, 14 the query q and passage P_i are first concatenated into the sequence [CLS] Query [SEP] Passage [SEP] and encoded by multiple BERT layers (e.g., 12 layers when using BERT-Base). The output embedding of the [CLS] token in the last layer encodes the interaction between the query and the passage. A relevance score is generated by passing the [CLS] embedding through an MLP layer, denoted as in Equation (1):

S_P_i = MLP(BERT_[CLS](q, P_i)), S_P = max_i S_P_i. (1)

As discussed, referring to full-document relevance signals is important, and considering only the highest passage score is one-sided. The proposed KnBERT therefore includes two components: a contextualized position decay weight with a reward mechanism, and an enhancement module using query expansion. In particular, after computing S_P and feeding it into the model for training, the interaction between the query and each passage is encoded by the fine-tuned BERT, and relevance scores are generated by a mapping layer. These scores provide the original passage-level relevance and generate the position decay weights. We formulate strategies to reward documents that contain sequential positive passages. These interactions can be seen as transit stations that provide full-document relevance to the model in another way. In conclusion, putting the two components together, instead of independently extracting the highest passage score, the proposed KnBERT is formulated as KnBERT(q, d; str), where str denotes the strategy used in the reward mechanism. During inference, given a query q, we score each document d as in Equation (2): S(q, d) = KnBERT(q, d; str). (2)

Kernel-based weights at the passage level
In this work, we believe that local passage-level relevance determines overall document-level relevance in certain cases. In previous works, passage-level relevance is aggregated into a document relevance score with a score aggregation approach (e.g., max pooling, sum pooling, or average pooling). 39 These common passage-level relevance aggregation approaches cannot capture important semantic information such as document structure, and may therefore not work well in certain real cases, because the assumption that several passages can replace the whole document may lead to a contrary judgment. Unlike the common score aggregation approaches, our proposed method generates an overall document relevance signal by aggregating position-based local relevance with kernel functions. During document-level relevance judgments, many factors, such as passage position or passage length, may influence the weight of a passage. Li et al. 22 found that people pay more attention to the passages at the beginning of a document. We assume that a passage that precedes other parts of the document has a greater influence in determining full-document relevance, and that the position of a passage within a document influences the importance of that passage.
Previous studies 49 provide an alternative kernel-based method to count term frequency or co-occurrence in a document. Because of response-time requirements, we choose to apply kernel functions to a simple score aggregation approach instead of additionally training a CNN- or Transformer-based classifier. This article proposes a novel method that utilizes various kernel functions to extract and weight the position of each passage in a candidate document and combines them with the maximum passage score (MaxP score). Both overall document relevance and deep contextualized information are acquired in the process. We believe that the overall document relevance signal can be captured by allocating weights to the passages based on their positions in the document and then aggregating them. The Gaussian, Circle, and Cosine kernels are all widely used kernel functions that share properties such as continuity and symmetry. 50 In addition, inspired by Pan et al., 49 we utilize different kernel functions to weight passage positions, formulated as follows:

S(q, d) = λ * S_P + (1 − λ) * Σ_{i=1..n} K(i, n) * BERT(q, P_i), (3)

where BERT(q, P_i) denotes the relevance score between the query and the i-th passage, K(i, n) denotes the weight of the i-th passage among all n passages, and S_P represents the highest passage score computed in Equation (1). λ is a tuning parameter controlling the relative contributions of the two parts; it ranges from 0 to 1 with a step size of 0.1, and we obtained the best fixed value of λ through five-fold cross-validation.
The positional weight modeling first captures the relative position of a passage within the document and then models the positional weights using three different kernel functions, giving higher weights to passages closer to the beginning and end of the document. In this article, the weight is denoted by K(i, n). To test the differences in effectiveness among these kernel functions, we also provide a simple decline function for comparison. The calculations are shown in Equations (4)-(7):

a. Decline function (KnBERT1): K(i, n) = 1 − (i − 1)/n, (4)
b. Gaussian kernel (KnBERT2): K(i, n) = exp(−(i − u)^2 / (2σ^2)), (5)
c. Circle kernel (KnBERT3): K(i, n) = sqrt(1 − ((i − u)/σ)^2), (6)
d. Cosine kernel (KnBERT4): K(i, n) = (1/2)(1 + cos(π(i − u)/σ)), (7)

where u is a parameter that controls the phase shift to ensure that the weights of the first and the last passage are equal, and σ is a dynamic parameter that controls the scale of the kernel functions. For different documents and datasets, the optimal value of σ differs. Therefore, unlike common static σ settings such as {1, 10, 100}, we set σ to a value ranging from 0.1u to u with a step of 0.1u to adapt to different documents.
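To make the pooling concrete, the sketch below implements the combination in Equation (3) together with position-weight functions in the spirit of Equations (4)-(7). The kernel bodies follow the standard Gaussian/Circle/Cosine shapes from the positional language model literature, and the choices u = (n + 1)/2 and σ = u are illustrative assumptions, not the authors' exact tuned formulas:

```python
import math

def decline(i, n, u=None, sigma=None):       # Equation (4), KnBERT1
    return 1.0 - (i - 1) / n

def gaussian(i, n, u, sigma):                # Equation (5), KnBERT2
    return math.exp(-((i - u) ** 2) / (2 * sigma ** 2))

def circle(i, n, u, sigma):                  # Equation (6), KnBERT3
    x = abs(i - u) / sigma
    return math.sqrt(1.0 - x * x) if x <= 1.0 else 0.0

def cosine(i, n, u, sigma):                  # Equation (7), KnBERT4
    x = abs(i - u) / sigma
    return 0.5 * (1.0 + math.cos(math.pi * x)) if x <= 1.0 else 0.0

def kernel_score(passage_scores, kernel, lam):
    """Equation (3): S = lam * S_P + (1 - lam) * sum_i K(i, n) * s_i."""
    n = len(passage_scores)
    u = (n + 1) / 2      # phase shift: first and last passage weighted equally
    sigma = u            # scale; the article tunes sigma over [0.1u, u]
    s_p = max(passage_scores)                # S_P from Equation (1)
    weighted = sum(kernel(i, n, u, sigma) * s
                   for i, s in enumerate(passage_scores, start=1))
    return lam * s_p + (1 - lam) * weighted
```

With lam = 1 the score degenerates to plain max pooling, which is the comparison point the article starts from; lowering lam mixes in the position-weighted passage scores.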

Reward mechanism based on passage distribution
In this section, we first analyze the relationship between document-level relevance and the passage distribution. Previous work 20 constructed a news dataset and collected relevance judgments from well-trained staff. The conclusion is that more relevant documents contain a larger number of highly relevant passages: the larger the number, the more relevant the document. Sequential information captured by continuous relevant passages within a document may, in theory, also affect document relevance judgments, since specific sub-sequences may provide useful information for document-level relevance.
Based on the two conclusions above, we borrow part of the reward mechanism from Reinforcement Learning and propose a reward mechanism that enhances retrieval performance by integrating passage-level relevance distribution information. Specifically, we develop two strategies. To capture sequential relevance information, we provide strategy 1, Continuous Relevance (CR), which rewards documents that contain a continuous strongly relevant sub-sequence. As a document with more relevant passages is itself more relevant, we provide strategy 2, High Percentage Relevance (HPR), which rewards documents that have a higher percentage of strongly relevant passages.
Three hyperparameters are key to the reward mechanism above. The first is a dynamic threshold that defines strongly relevant passages. We found that setting this threshold to a fixed value is unwise, because we want to select passages by a percentage, such as the top 5%, rather than by a fixed value that may lead to mismatches. In our framework, a threshold of k means that the top k% of passages whose relevance scores exceed the threshold are treated as strong positive samples, with k ∈ [0, 100]. Because datasets differ, the threshold should differ as well, although in real cases it is more likely to be a fixed value set based on experience. In this article, our reward mechanism only strengthens the weights of positive samples; unselected samples are not penalized. We discuss the reward mechanism in detail in Section 5. The number of continuous positive passages is another hyperparameter. Both the threshold and the continuity requirement are proportional to the number of documents that are rewarded. When the distribution of a document satisfies a strategy, a bonus score is added for CR and an extra weight is multiplied for HPR. The extra weight depends on the percentage of positive samples, and the bonus score is the last hyperparameter. S_b is also dynamic: S_b = 10 means that the bonus score equals the score of the passage at the 10% position of the relevance ranking list; that is, if a document satisfies CR, we consider it more relevant than the other 90% of candidate documents. N_pos represents the number of positive passages in the document and N_all represents the total number of passages. The procedure is shown in Equations (8) and (9):

a. Continuous Relevance (sequential relevance information): S'(q, d) = S(q, d) + S_b, (8)
b. High Percentage Relevance (percentage of strongly relevant passages): S'(q, d) = S(q, d) * (1 + N_pos / N_all). (9)
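One way to read the two strategies in code (the run length, top-k% threshold, and bonus value below are illustrative assumptions rather than the tuned values from the article):

```python
def strong_threshold(all_scores, k):
    """Top-k% rule: passages scoring at or above this value across the
    candidate list are treated as strong positive samples."""
    ranked = sorted(all_scores, reverse=True)
    cut = max(1, round(len(ranked) * k / 100))
    return ranked[cut - 1]

def continuous_relevance(score, doc_scores, threshold, run_length, bonus):
    """Strategy 1 (CR): add the bonus S_b when the document contains
    `run_length` consecutive strong relevant passages."""
    run = 0
    for s in doc_scores:
        run = run + 1 if s >= threshold else 0
        if run >= run_length:
            return score + bonus
    return score

def high_percentage_relevance(score, doc_scores, threshold):
    """Strategy 2 (HPR): scale the score by 1 + N_pos / N_all, so documents
    with a larger fraction of strong passages are rewarded more."""
    n_pos = sum(1 for s in doc_scores if s >= threshold)
    return score * (1.0 + n_pos / len(doc_scores))
```

As in the article, only positive evidence is rewarded: a document with no qualifying run or no strong passages keeps its original score (HPR degenerates to a multiplier of 1).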

Interpolation with the initial ranking
The final relevance score S_final is computed as in Equation (10):

S_final = γ * S_ini + (1 − γ) * S(q, d), (10)

where S_ini is the score from the initial ranking (e.g., DPH + KL) and γ is a constant that controls the relative contributions of the BERT score and the initial ranking; it ranges from 0 to 1 with a step size of 0.1, and we obtained the best fixed value of γ through five-fold cross-validation. The two reward strategies can also be used simultaneously, in which case the final score S_B^r is obtained by multiplying S_B by the weights of the two strategies. Before interpolation, both scores should be normalized. The first-stage initial ranking contains rich statistical information (e.g., term frequency and proximity), which is essential for relevance judgments. The unsupervised ranking model and the pre-trained language model each have access to semantic information that the other cannot capture. The relevance matching score obtained from the initial ranking is combined with the semantic matching score obtained in the second round by KnBERT, which merges passage-level kernel functions into a BERT-based re-ranking method and aggregates passage-level relevance; this compensates for the shortcomings of the initial ranking and is more effective than using relevance matching or semantic matching alone. 35,51,52 Therefore, the first-stage retriever and the re-ranker should be complementary rather than conflicting. The performance of the re-ranker depends on the average precision of the first-stage results. In this work, we find that the initial ranking performance of DPH + KL on Robust04 and GOV2 is stronger than that of BM25 + RM3. Meanwhile, since the baseline models we compare against also use DPH + KL in the first stage, we ultimately choose DPH + KL as the first-stage retriever to facilitate comparison. In other cases, however, BM25 + RM3 may be more convenient, as its results can easily be obtained from Anserini.
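The interpolation step can be sketched as follows; the min-max normalization is an assumption on our part, since the article only states that both scores are normalized before mixing:

```python
def min_max(scores):
    """Normalize a score list to [0, 1]; constant lists map to 0."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) if hi > lo else 0.0 for s in scores]

def interpolate(initial_scores, knbert_scores, gamma):
    """Blend of first-stage and re-ranker scores over one candidate list:
    S_final = gamma * S_ini + (1 - gamma) * S(q, d), after normalizing
    both score sets so the two scales are comparable."""
    ini = min_max(initial_scores)
    knb = min_max(knbert_scores)
    return [gamma * a + (1 - gamma) * b for a, b in zip(ini, knb)]
```

Normalizing per candidate list matters because DPH + KL scores and BERT logits live on entirely different scales; without it, one component would dominate regardless of gamma.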
53By using the representative kernel pooling functions and two reward mechanisms mentioned above, the KnBERT framework can naturally merge passage level kernel functions into BERT-based re-ranking method.
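As an illustration of the interpolation step above, the sketch below applies min-max normalization to both score lists and combines them with a weight α; the function names and the normalization choice are ours, not the original implementation.

```python
def minmax_norm(scores):
    """Min-max normalize a list of scores to the range [0, 1]."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def interpolate(bert_scores, init_scores, alpha):
    """S_final = alpha * S_B + (1 - alpha) * S_ini, after normalization."""
    b = minmax_norm(bert_scores)
    i = minmax_norm(init_scores)
    return [alpha * sb + (1 - alpha) * si for sb, si in zip(b, i)]
```

In practice, α would be swept from 0 to 1 in steps of 0.1 on the validation folds and the best fixed value applied to the test fold.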

EXPERIMENTAL SETTINGS
Section 4.1 introduces the datasets and evaluation metrics, Section 4.2 describes the baselines, and Section 4.3 covers training and inference details.

Data sets and evaluation metrics
Akin to Zheng et al., 31 we experiment with the standard Robust04 54 and GOV2 55 text retrieval collections. Robust04 is a collection of 528,155 newswire documents. GOV2 consists of 25,205,179 documents crawled from US government websites, used in the TREC Terabyte 2004-06 tracks. For each topic, only a few title keywords are used for both retrieval and re-ranking, to match realistic queries. We employ 250 topics for Robust04 (301-450, 601-700) and 150 TREC keyword queries for GOV2. In this work, all BERT-based re-ranking models, including the proposed models and the baseline re-rankers, are interpolated with the first-stage ranking scores in the same way.

To compare fairly with widely recognized baseline models, we use the same evaluation metrics, such as P@20 and NDCG@20. In many important applications like Web retrieval, users typically browse only the first page or the first three pages of results, so accuracy needs to be measured within a fixed, limited cutoff such as 20 retrieved documents. We therefore use P@20 to measure the proportion of relevant documents among the first 20 returned, NDCG@20 to account for the graded relevance of the first 20 documents, and MAP@100 and MAP@1000 to measure the mean average precision over the first 100 and 1000 returned documents, respectively. Because comparing raw MAP or P@20 values between two models cannot establish whether a difference is statistically significant, we conduct statistical tests on each pair of evaluation results using the Wilcoxon signed-rank test 56 (p < 0.05).
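For reference, the following minimal sketch shows one common way to compute P@k and NDCG@k from a ranked list and graded judgments; the exact gain formulation used by trec_eval may differ slightly, so this is an illustration rather than the official scorer.

```python
import math


def precision_at_k(ranked_ids, relevant, k=20):
    """Fraction of the top-k documents that appear in the relevant set."""
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant) / k


def ndcg_at_k(ranked_ids, gains, k=20):
    """NDCG@k with linear gains; `gains` maps doc id -> graded relevance."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0
```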

Initial ranking and baselines
We compare KnBERT with the following representative models, including traditional and neural methods.

DPH + KL
DPH + KL is the first-stage initial ranking model in our framework, used to generate the top-1000 documents. DPH is a traditional retrieval model that ranks documents under the divergence-from-randomness framework. After expanding queries with Rocchio's pseudo-relevance feedback based on Kullback-Leibler divergence, the final document list is obtained. We adopt the implementation from the Terrier toolkit 57 and report its performance.

BM25 + RM3
BM25 + RM3 is another strong baseline that includes pseudo-relevance feedback, with RM3 adopted for query expansion. The experimental settings of the retrieval model follow Yilmaz et al., 16 and we employ the default implementation from Anserini. 53

BERT-Base
BERT-Base is the vanilla BERT-Base model fine-tuned on MS MARCO, 27 which achieves strong performance on document retrieval; its superior performance comes from the model's strong transfer-learning ability. Following Yilmaz et al., 16 a BERT-Base checkpoint trained on MS MARCO is further fine-tuned on the target datasets. We use the score of the passage with the maximal relevance score, the mean score of all passages, and the score of the first passage to represent the relevance of an individual document.

Co-BERT
Co-BERT 33 is a recent contextualized BERT-based model that captures the local ranking context among candidate documents as well as query-specific information. Unlike other BERT-based models, Co-BERT is trained end-to-end. The embeddings of the top-m documents and the interaction representation of each query-document pair are stacked into a sequence to capture query-specific characteristics, which then passes through a listwise scorer to produce the final result. The results from Chen et al. 33 are included directly.

PARADE-Avg
PARADE-Avg is another BERT-based model that exploits full-document relevance signals by aggregating passage representations. 24 We use the Avg variant, which is similar to our position-based approach. PARADE-Avg assumes each passage contributes differently depending on its position, and sums the passage relevance representations, combined with document-length normalization, to obtain the final weights. The results from the original article are included directly.

BERT-QE
BERT-QE 31 is a recently proposed BERT-based re-ranking model that adopts query expansion. We use two different BERT variants to expand the queries (namely BERT-QE-BMS): BERT-Medium for the second phase and BERT-Small for the third phase. For the rest, we follow the configuration of the original article, 31 using a sliding window of 100 words with an overlap of 50 words to generate the MaxP scores; the maximum length of the concatenated token sequence is set to 384.

Data preprocess
We use the top-1000 documents from the initial ranking for both training and inference, instead of all candidate documents. Akin to Zheng et al., 31 for KnBERT and the BERT-based baselines, each document is decomposed with sliding windows of 100 terms and a stride of 50 terms. As mentioned in Section 3.1, the "best" passage is chosen by a BERT model fine-tuned on the target dataset, and the chosen most-relevant passages are used as substitutes for the full documents.
For training, each query and its composed passage are concatenated, and the sequence is padded to a maximum length of 384.
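The sliding-window decomposition described above can be sketched as follows; this is a simplified illustration with a hypothetical function name, assuming the document has already been tokenized into a list of terms.

```python
def split_passages(doc_tokens, window=100, stride=50):
    """Decompose a tokenized document into overlapping passages.

    Each passage holds up to `window` terms; consecutive passages
    overlap by `window - stride` terms.
    """
    passages = []
    start = 0
    while start < len(doc_tokens):
        passages.append(doc_tokens[start:start + window])
        if start + window >= len(doc_tokens):
            break  # the last window already covers the document tail
        start += stride
    return passages
```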

Training details
Similar to Nogueira and Cho 14 and Yilmaz et al., 16 a BERT model is first trained on the MS MARCO passage ranking dataset and then fine-tuned on the target dataset (e.g., GOV2). The checkpoint is first used to select the top relevant passages, which are then concatenated with the query into query-passage pairs for fine-tuning. We use the cross-entropy loss of Equation (11), L = −Σ_{i∈I_pos} log(p_i) − Σ_{i∈I_neg} log(1 − p_i), where I_pos and I_neg represent the sets of documents that are relevant or non-relevant to the query, and p_i is the probability that d_i is a relevant document. As in previous work, we use only the most relevant passages instead of all passages, which leads to a shorter training time with comparable performance. Training of KnBERT and the baselines was performed on a single NVIDIA Tesla V100-32G. We train BERT for 2 epochs with a batch size of 32, using the Adam optimizer and the learning-rate schedule from Nogueira and Cho. 14 The initial learning rate is set to 1e-6.
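A minimal sketch of this pointwise cross-entropy loss, written directly from the definitions of I_pos, I_neg, and p_i (in practice the loss would be computed over BERT logits in a deep-learning framework):

```python
import math


def cross_entropy_loss(pos_probs, neg_probs, eps=1e-12):
    """L = -sum(log p_i, relevant) - sum(log(1 - p_j), non-relevant).

    `pos_probs` / `neg_probs` are the predicted relevance probabilities
    for relevant and non-relevant documents; `eps` guards log(0).
    """
    loss = -sum(math.log(max(p, eps)) for p in pos_probs)
    loss += -sum(math.log(max(1.0 - p, eps)) for p in neg_probs)
    return loss
```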

Inference details
In the proposed KnBERT, the input sequence at inference is similar to the training phase: the query and passage are concatenated, and the sequence is padded to 384 tokens. For Robust04, document titles, which carry overall document semantics, are prepended to the passage during inference. The hyperparameter u in the kernel functions is set to half the number of passages within the document. For the reward mechanism, a positive-sample threshold β = 5 and a continuous-passage count γ = 2 form the best combination for both Robust04 and GOV2. With these parameters, a bonus score S_b = 30 is suitable for Robust04, and S_b = 50 performs best on GOV2.

Cross-validation
We employ a basic five-fold cross-validation setting, with queries split equally into five partitions. The division on Robust04 follows the settings from prior work. 39 For evaluation, we choose the best hyperparameters on the validation set and apply them to the test set to report the evaluation results.

RESULTS
The results of our proposed KnBERT are reported in this section, compared against other advanced models and baselines. The re-ranking effectiveness of KnBERT on the commonly used Robust04 and GOV2 collections is summarized on both shallow pools (P@20 and NDCG@20) and deep pools (MAP@1K and MAP@100) in Tables 2 and 3. The experiments include two state-of-the-art unsupervised ranking models, that is, BM25 + RM3 and DPH + KL, for reference. We report two relative comparisons in terms of percentage in brackets. DPH + KL is the first-stage retriever, so comparisons against it directly represent the gain of our proposed re-ranker. BERT-Base is the vanilla BERT-Base model fine-tuned on MS MARCO. The two values in brackets represent the improvement of our method relative to the first stage and relative to a complete two-stage BERT-based retrieval system.

Comparison with unsupervised and BERT-based models
Considering the three kernel-based approaches, KnBERT3 is more effective than KnBERT2 and KnBERT4 on the MAP metrics across both datasets. On P@20, KnBERT2 outperforms the others on GOV2. Kernel-based methods consistently perform better than the simple decline function, although KnBERT1 sometimes obtains a better NDCG@20 result on GOV2. All our proposed models consistently outperform the unsupervised baselines and recent BERT-based re-rankers, confirming the effectiveness of our kernel-based passage-score aggregation approaches.
Comparisons with the unsupervised baseline DPH + KL show that our methods significantly outperform the first-stage recall phase on all metrics for both collections. BM25 + RM3 is a widely used unsupervised ranking model. The conclusion from these comparisons is that the two-stage retrieve-then-rerank system always beats a single retrieval model by a large margin. The results demonstrate the potential and robustness of the re-ranker, even when the computational cost is taken into account.
The proposed KnBERT comfortably outperforms the pre-trained BERT-Base model from MS MARCO. As mentioned in Section 4.3, KnBERT is initialized from the model pre-trained on MS MARCO. As can be seen in Tables 2 and 3, with the kernel-based method, KnBERT outperforms the BERT-MaxP models on Robust04 and GOV2 by a wide margin.
Specifically, on the shallow pool, KnBERT improves upon BERT-MaxP in NDCG@20 by 11.0% on Robust04 and 12.1% on GOV2. On MAP@1000, KnBERT improves upon BERT-Base by 7.9% on Robust04 and 4.8% on GOV2. It is worth noting that KnBERT1 always outperforms BERT-MaxP; for example, in terms of NDCG@20, KnBERT1 underperforms KnBERT3 by 1.8% but outperforms BERT-MaxP by 9.1%. These results confirm the effectiveness of the proposed kernel-based model architecture relative to the vanilla BERT-Base model, especially on the shallow pool, and assure us that most of the improvement comes from the novel kernel-based position weight.

Relative to other recent advanced BERT-based re-rankers that use full-document relevance signals or pseudo-relevance feedback, the proposed KnBERT is still at an advantage. According to Tables 2 and 3, KnBERT outperforms several fine-tuned BERT-based re-rankers that rerank feedback documents from the same unsupervised model, DPH + KL, as we employed; all four variants of KnBERT outperform these models. The core idea of PARADE-Avg matches that of KnBERT1, which does not utilize kernel-based functions; the difference is that our framework adds a normalization term while aggregating the passage scores and then combines the position-based score with the MaxP score. Sometimes simple score aggregation works better than the representation-aggregation strategy. In addition, the comparison with BERT-QE, a recent BERT-based PRF model, proves the effectiveness of our framework. In our experiments, we employ BERT-QE with BERT-Base for all three phases and keep the other settings identical for fair comparison. For example, the best variant of KnBERT, which has a lower computational cost than BERT-QE, achieves a 4.0% mean average precision gain on Robust04.

Employing reward mechanism in KnBERT
The reward mechanism is an effective way to enhance the performance of the framework. We describe it separately because our main contribution is the kernel-based position weight. The enhancement, although subtle, shows that the mechanism works on both TREC collections. We report the improvement of the reward mechanism on one variant of KnBERT to demonstrate its effectiveness; when the reward components are included, KnBERT achieves stronger performance. The results are shown in Table 4.
As shown in Table 4, each of the two strategies enhances performance on its own. CR focuses on sequential passage-level relevance signals, while HPR rewards documents with a higher percentage of strongly relevant passages. Specifically, in terms of MAP@1K and MAP@100, KnBERT1 with HPR improves over BERT-MaxP by 5% and 8.2%, respectively. Its P@20 is close to that of KnBERT1 using CR and HPR simultaneously, both improving over BERT-MaxP by 8%. On NDCG@20 alone, KnBERT1 is 12.1% higher than BERT-MaxP. These results confirm that combining the two proposed reward mechanisms makes KnBERT1 more effective to a certain extent. It is worth noting that our proposed method consistently outperforms the baseline BERT-MaxP under the CR, HPR, and CR + HPR reward mechanisms.
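To make the two mechanisms concrete, the sketch below shows one plausible reading of CR and HPR; the HPR ratio and the shape of the bonus are illustrative assumptions, and the exact formulation is the one defined in Section 3, not this code.

```python
def reward_bonus(passage_scores, beta=5.0, gamma=2, bonus=30.0,
                 use_cr=True, use_hpr=True, hpr_ratio=0.5):
    """Hypothetical sketch of the CR / HPR reward mechanisms.

    CR: add a bonus when at least `gamma` consecutive passages score
    above the positive threshold `beta`.
    HPR: add a bonus when the fraction of passages above `beta`
    exceeds `hpr_ratio`.
    """
    positive = [s > beta for s in passage_scores]
    reward = 0.0
    if use_cr:
        run = best = 0
        for p in positive:
            run = run + 1 if p else 0
            best = max(best, run)
        if best >= gamma:
            reward += bonus
    if use_hpr and positive and sum(positive) / len(positive) > hpr_ratio:
        reward += bonus
    return reward
```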

Effectiveness of score aggregation
Although previous works on score aggregation and representation aggregation have confirmed the effectiveness of these approaches, a comprehensive comparison between them has been missing. In this section, we compare KnBERT with these methods in terms of effectiveness and time complexity. We use max pooling, sum pooling, first pooling, 39 and k-max pooling 16 as concrete score pooling systems, and max pooling, sum pooling, average pooling, and Transformer pooling 24 for the latent vectors from D∕C chunks in PARADE-Transformer. For score pooling systems, the time complexities are the same; the only difference is how the pooling method weights passages at different positions. As shown in Table 5, we record three metrics for all the score pooling systems mentioned above. The kernel pooling system outperforms the other score aggregation strategies at the same time complexity, as well as most representation-aggregation methods that require extra computation. Compared with PARADE-Transformer, our method is indeed slightly inferior in accuracy; however, considering the balance between time cost and accuracy, our method is significantly better than most pooling systems and obtains cost-effective results.
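The score pooling baselines compared above each reduce a document's passage scores to a single value; a minimal sketch of the four variants:

```python
def max_pool(scores):
    """MaxP: the best passage represents the document."""
    return max(scores)


def first_pool(scores):
    """FirstP: the lead passage represents the document."""
    return scores[0]


def sum_pool(scores):
    """SumP: all passages contribute equally."""
    return sum(scores)


def k_max_pool(scores, k=3):
    """Average of the top-k passage scores."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)
```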
While kernel-based score aggregation is effective across the two widely used datasets, unpredictable situations may appear in real cases. We hypothesize that highly focused queries may cause a certain decline in the effectiveness of complex score aggregation such as KnBERT, since such queries may yield fewer highly relevant passages per document than usual. Wu et al. 20 studied the distributions of a series of datasets and concluded that the query type does affect the number of highly relevant passages per document, and that the representation-aggregation strategy performs better on datasets with a more complex fine-grained relevance distribution. Judging a document in a dataset with only 1-2 relevant passages per document is simple, because the relevance of the overall document can be sufficiently represented by a single highly relevant passage; MS MARCO is a good example of such a dataset.
We test this hypothesis using the joint distribution of passage-level and document-level relevance judgments on GOV2, where passage-level judgments are available. We use the sentence-level relevance judgments available in WebAP, 58,59 which cover 82 queries. For GOV2, 38% of the documents have only a single passage with a relevance label, while 40% contain 3 or more relevant passages; by our conclusion, the number of such documents is proportional to the effectiveness of score aggregation with complex structure.
To further analyze the results, Figure 2 shows the joint distribution of passage-level relevance and overall document relevance in our framework. Document relevance is divided into three levels by the TREC judgment file: (0) irrelevant, (1) relevant, and (2) strongly relevant. Passage-level relevance is likewise divided according to the relevance score computed by BERT. We find that the percentage of strongly relevant passages within a document increases sharply as document-level relevance increases in both collections, while the proportion of irrelevant passages changes only smoothly across document relevance levels, with minor differences. Since we only employ positive samples in Section 3.3, the conclusion is that developing strategies based on the number of irrelevant passages is unwise. Even though simple score aggregation performs better when the number of relevant passages per document is low, complex score aggregation like ours still holds an advantage. We combine both the best passage and kernel-based position semantic information during inference, balancing the robustness and effectiveness of our proposed framework.

Re-ranking effectiveness versus efficiency
While BERT-based re-rankers effectively produce ranked lists containing deep contextualized interaction information, which complements the semantic information missing from traditional term-frequency models, their computational cost is unaffordable in real cases. Because a retrieval system must rank documents and return them within a few seconds after the user issues a query, the model is sensitive to efficiency. We consider using smaller BERT variants to improve the efficiency of KnBERT. To provide guidance for deploying a retrieval system in practice, we study the re-ranking effectiveness of KnBERT with different sizes of BERT, using the pre-trained BERT models provided by Li et al. 24 Several hyperparameters, such as the number of hidden layers and attention heads, are varied in this experiment, while the other settings of our framework are fixed. All inference timing was performed on a single Nvidia RTX 3090. The performance of the varying sizes on Robust04 is shown in Table 6 (KnBERT2's effectiveness using BERT models of varying sizes on Robust04 title queries). The number of parameters and the inference time per document are also given, to help choose a suitable BERT size that balances effectiveness and efficiency in different applications.

Effect of the normalization
As described in Section 3.2, we employ a simple function to normalize the kernel-based passage-position weights. The function is designed around the number of passages to balance documents of different lengths. However, the effect of normalization differs between the two datasets.
The hyperparameters used on both datasets are the same; the effect of applying normalization is shown in Table 7. As can be seen from the table, normalization has a positive effect on all metrics on Robust04. However, the same function does not work on GOV2, where a certain degree of reduction appears in the MAP@1000, P@20, and NDCG@20 metrics. The only factor related to the normalization function that differs between the two collections is the average document length; the phenomenon may be caused by the average document length on GOV2 being about three times that of Robust04. We infer that longer documents in GOV2 tend to be more relevant, yet they are affected more strongly when the relevance scores of all documents are reduced. Hence, document length should be a parameter when judging the relevance of a document, and we will explore more normalization functions to eliminate this effect.

Hyper-parameter tuning
In this section, we study the impacts of several hyperparameters in the kernel functions and the reward mechanism introduced in Section 3: the dynamic kernel parameter σ, the threshold of positive samples β, the number of continuous positive passages γ, and the bonus score S_b. We tune these hyperparameters to study their influence on the retrieval framework; the impact of σ is reported in Figure 3, and the impacts of the reward-mechanism hyperparameters are reported in Table 8. The sensitivity of σ may affect the robustness of the KnBERT methods. In our proposed method, σ is based on another parameter u, which is controlled by the number of passages within a document and set to (i + 1)∕2, where i is the number of passages the document contains. We vary σ from 0.1u to u with a step of 0.1u to adapt to different documents.
As seen from Figure 3, the value of σ can greatly impact model performance. Our proposed KnBERT achieves the best MAP@1000 on both collections when using the circle kernel function with σ set to u∕2. On the shallow pool, a larger σ of 0.7u performs better on both datasets. Since σ determines the scale of the kernel functions, we conclude that the kernel scale must be chosen appropriately, and the optimal value varies across settings; our approach of setting dynamic hyperparameter values initially mitigates this sensitivity. Overall, for a new dataset or a real case about which little is known, we recommend setting the kernel parameter σ to (i + 1)∕4 on the deep pool and 7(i + 1)∕20 on the shallow pool.
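As an illustration of kernel-based position weighting, the sketch below uses a Gaussian-style kernel with the scale tied to the passage count via u = (n + 1)∕2, as in our setup; the specific kernel shape is an assumption for illustration, not the exact circle kernel defined in Section 3.

```python
import math


def gaussian_position_weights(n_passages, sigma_scale=0.5):
    """Hypothetical kernel-based position weights for one document.

    u = (n + 1) / 2 as in the paper; sigma = sigma_scale * u.
    Weights decay with passage position and are normalized to sum to 1.
    """
    u = (n_passages + 1) / 2
    sigma = sigma_scale * u
    w = [math.exp(-(pos ** 2) / (2 * sigma ** 2))
         for pos in range(n_passages)]
    total = sum(w)
    return [x / total for x in w]
```

Sweeping `sigma_scale` from 0.1 to 1.0 in steps of 0.1 mirrors the σ range explored in Figure 3.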
Table 8 shows that the configuration of all three hyperparameters influences performance. Different settings of β do not lead to large differences in the inference effectiveness of KnBERT, though with β = 5, KnBERT achieves a higher MAP than with other settings. The number of continuous positive passages γ also affects performance slightly. These two parameters jointly limit the range of documents that are rewarded. The model is robust to the choice of S_b: when the constraints imposed by β and γ are strict, S_b should be small for good performance; conversely, when the constraints are loose, S_b should be large.

CONCLUSIONS AND FUTURE WORK
Existing BERT-based re-ranking methods consider only the most relevant passage to rank the list and ignore the rest of the document, a policy that may lose contextual information from the entire document. The main difference between the proposed kernel-based re-ranking framework and other models is that we utilize kernel functions to capture document-level relevance by aggregating passage-level relevance. Our proposed framework, KnBERT, naturally incorporates kernel functions from the passage level into the BERT-based re-ranking method, which provides direction for building universal retrieve-then-rerank methods for Information Retrieval.
In this work, we evaluate document relevance from two components, differing from previous passage-level ranking methods. We apply passage-level kernel functions to BERT-based re-ranking models and capture the full semantic information of documents by incorporating kernel-based passage-position weights into the re-ranking framework. Experiments on two TREC datasets show that our framework is effective and outperforms BERT-based re-rankers and baseline models on both collections in terms of MAP@1000, MAP@100, P@20, and NDCG@20, and that its effectiveness is comparable to previously proposed robust models. We further discuss and analyze the application scenarios of score aggregation and the settings of the hyperparameters. Through the application of kernel functions, KnBERT better captures the semantic correlations between passages and documents; this semantic advantage enables KnBERT to comprehend document content more comprehensively and accurately, and is a key factor in why it surpasses other models in ranking and retrieval effectiveness.
Our work focuses on incorporating passage-level distribution features to improve the performance of a multi-stage retrieval system. The experiments were confined to the TREC Robust04 and GOV2 datasets, so the generalizability of our model to other datasets is not assured and needs to be examined in future work. Nevertheless, we envision practical applications of the proposed framework in real-world scenarios. First, we intend to explore the incorporation of alternative kernel functions into our methodology. Second, we aim to investigate the compatibility of our model with other Transformer architectures or frameworks to ascertain its potential for yielding favorable outcomes. 2,3 Furthermore, we plan to expand our evaluation to additional datasets and extend the applications of the proposed framework to diverse practical domains, such as biomedical and chemical IR research, [60][61][62][63][64] recommendation systems, 65,66 and question answering systems. 67 In an environment marked by explosive growth in data volume, our framework may enhance the internal information-processing efficiency of enterprises, providing users with more accurate and efficient search services.

F I G U R E 1
Model architecture of KnBERT. First, documents from the first-stage ranking are decomposed into passages. The score of each individual passage is computed by a fine-tuned BERT. The positions of passages are featured by kernel functions to capture positional semantic information. The final score of an individual document is the interpolation of the position weight and the score of the most relevant passage.

F I G U R E 2
Joint distribution of passage-level and document-level relevance on Robust04 and GOV2. The number in the color blocks represents the percentage of passages contained in documents of different relevance levels within the range of the given passage ranking.

TA B L E 2
Effectiveness of KnBERT relative to baselines and BERT-based re-rankers on Robust04. Note: The comparisons are mainly relative to the best-performing BERT-based model fine-tuned on MS MARCO, denoted as BERT-MaxP. The best result for each metric is shown in bold. The relative gain/loss in percentage (in brackets) relative to DPH + KL and BERT-MaxP is reported. "*" and "+" mean statistically significant improvements over DPH + KL and BERT-MaxP.

TA B L E 3
Effectiveness of KnBERT relative to baseline models on GOV2.

TA B L E 4
KnBERT1's effectiveness using the reward mechanism on GOV2 title queries. Note: The best result for each metric is shown in bold. "*" means statistically significant improvements over BERT-MaxP.

TA B L E 5
Effectiveness of KnBERT relative to other score aggregation and representation aggregation approaches.

TA B L E 7
KnBERT1's effectiveness using normalization on Robust04 and GOV2 title queries.

TA B L E 8
Note: Three hyperparameters are studied: the threshold of positive samples β, the number of continuous positive passages γ, and the bonus score S_b.

F I G U R E 3
Sensitivity of three variants of the proposed KnBERT to the kernel parameter σ on the two TREC collections, Robust04 and GOV2.