Boosted Dense Retriever

We propose DrBoost, a dense retrieval ensemble inspired by boosting. DrBoost is trained in stages: each component model is learned sequentially and specialized by focusing only on retrieval mistakes made by the current ensemble. The final representation is the concatenation of the output vectors of all the component models, making it a drop-in replacement for standard dense retrievers at test time. DrBoost enjoys several advantages compared to standard dense retrieval models. It produces representations which are 4x more compact, while delivering comparable retrieval results. It also performs surprisingly well under approximate search with coarse quantization, reducing latency and bandwidth needs by another 4x. In practice, this can make the difference between serving indices from disk versus from memory, paving the way for much cheaper deployments.


Introduction
Identifying a small number of relevant documents from a large corpus for a given query, information retrieval is not only an important task in and of itself, but also plays a vital role in supporting a variety of knowledge-intensive NLP tasks (Lewis et al., 2020; Petroni et al., 2021), such as open-domain Question Answering (ODQA; Voorhees and Tice, 2000; Chen et al., 2017) and Fact Checking (Thorne et al., 2018). While traditional retrieval methods, such as TF-IDF and BM25 (Robertson, 2008), are built on sparse representations of queries and documents, dense retrieval approaches have recently shown superior performance on a range of retrieval and related large-scale ranking tasks (Guu et al., 2020; Karpukhin et al., 2020; Reimers and Gurevych, 2019; Hofstätter et al., 2021b). Dense retrieval involves embedding queries and documents as low-dimensional, continuous vectors, such that query and document embeddings are similar when the document is relevant to the query. The embedding function leverages the representational power of pretrained language models and is further fine-tuned using any available training query-document pairs. Document representations are computed offline and stored in an index, allowing dense retrieval to scale to millions of documents, with query embeddings computed on the fly.
When deploying dense retrievers in real-world settings, however, there are two practical concerns: the size of the index and retrieval latency. The index size is largely determined by the number of documents in the collection and the embedding dimension. While we cannot generally control the former, reducing the embedding size is an attractive way to shrink the index. To lower latency, Approximate Nearest-Neighbor (ANN) or Maximum Inner Product Search (MIPS) techniques are usually required in practice. This implies that it is far more important for retrieval models to perform well under approximate search than under exact search. Developing a dense retrieval model that produces more compact embeddings and is more amenable to approximate search is thus the focus of this research.
In this paper, we propose DrBoost, an ensemble method for learning a dense retriever, inspired by boosting (Schapire, 1990; Freund and Schapire, 1997). DrBoost builds compact representations incrementally at training time. It consists of multiple component dense retrieval models ("weak learners" in boosting terminology), where each component is a BERT-based bi-encoder producing vector embeddings of the query and document. These component embeddings have much lower dimension (e.g., 32 vs. 768) than those of regular BERT encoders. The final relevance function is a linear combination of the inner products of the embeddings produced by each weak learner. It can be computed efficiently by concatenating the vectors from each component and performing a single MIPS search, which makes DrBoost a drop-in replacement for standard dense retrievers at test time. Component models are trained and added to the ensemble sequentially. Each model is trained as a reranker over negative examples sampled by the current ensemble, and can thus be seen as specializing in the retrieval mistakes made so far. For example, early components focus on high-level topical information, whereas later components can capture finer-grained tail phenomena. Through this mechanism, individual components are disentangled and redundancy is minimized, leading to more compact representations.
There are a few notable differences between training DrBoost and training existing dense retrieval models. Although iterative training using negatives sampled by models from previous rounds has been proposed before (Xiong et al., 2020; Qu et al., 2021; Oguz et al., 2021; Sachan et al., 2021, inter alia), existing methods keep only the final model. In contrast, the iteratively trained weak learners in DrBoost are preserved and added to the ensemble. The construction of the embedding also differs: DrBoost can be viewed as slowly "growing" the overall dense vector representations, lending some structure to otherwise delocalized representations, whereas existing retrieval models encode queries and documents in one step.
More importantly, DrBoost enjoys several advantages in real-world settings. Because each weak learner in DrBoost produces very low-dimensional embeddings to avoid overfitting (32-dim in our experiments), many components can be added whilst the index stays small. Our experiments demonstrate that DrBoost produces very compact embeddings overall, achieving accuracy on par with a comparable non-boosting baseline using 4-5x smaller vectors, and strongly outperforming a dimensionally-matched variant. Probing DrBoost's embeddings using a novel technique, we also show that they recover more topical information from Wikipedia than a dimensionally-matched baseline's.
Empirically, DrBoost performs superbly under approximate fast MIPS. With a k-means inverted file index (IVF), a simple and widely used approach, especially as a coarse quantizer in hierarchical indices and web-scale settings (Jégou et al., 2011; Johnson et al., 2019; Matsui et al., 2018), DrBoost greatly outperforms the baseline DPR model (Karpukhin et al., 2020) by 3-10 points. Alternatively, it can reduce bandwidth and latency requirements by 4-64x while retaining accuracy. In principle, this allows the approximate index to be served on disk rather than in expensive and limited RAM (which is typically 25x faster), making it feasible to deploy dense retrieval systems more cheaply and at much larger scale. We also show that DrBoost's index is amenable to compression: it can be compressed to 800MB, 2.5x smaller than a recent state-of-the-art efficient retriever, whilst being more accurate (Yamada et al., 2021).

Dense Retrieval
Dense retrieval involves learning a scalable relevance function h(q, c) which takes high values for passages c that are relevant to question q, and low values otherwise. In the popular dense bi-encoder framework, h(q, c) is implemented as the dot product between dense vector representations of the question and passage, produced by a pair of neural network encoders, E_Q and E_C:

h(q, c) = \mathbf{q}^\top \mathbf{c}, \quad \text{where } \mathbf{q} = E_Q(q), \; \mathbf{c} = E_C(c).  (1)

At inference time, retrieval from a large corpus C = {c_1, ..., c_|C|} is accomplished by solving the following MIPS problem:

c^* = \arg\max_{c \in \mathcal{C}} \mathbf{q}^\top \mathbf{c}.

In standard settings, we assume access to a set of m gold question-passage pairs D = {(q_i, c_i^+)}_{i=1}^m. It is most common to learn models by training them to score gold pairs higher than sampled negatives. Negatives can be obtained in a variety of ways, e.g. by sampling at random from the corpus C, or by using some kind of importance sampling function on retrieval results (see §2.1). When augmented with n negatives per gold question-passage pair, we have training data of the form

D_train = {(q_i, c_i^+, c_{i,1}^-, ..., c_{i,n}^-)}_{i=1}^m,

which we use to train a model, e.g. with a ranking or margin objective, or, in our case, by minimizing the negative log-likelihood (NLL) of positive pairs.
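As a concrete illustration, the scoring function and NLL objective above can be written in a few lines of NumPy. This is a toy sketch: the real encoders E_Q and E_C are BERT models, so here we work directly with already-embedded vectors.

```python
import numpy as np

def retrieve_topk(q_vec, corpus_vecs, k):
    """Exhaustive MIPS: rank all passage vectors by inner product with the query."""
    scores = corpus_vecs @ q_vec
    return np.argsort(-scores)[:k]

def nll(q_vec, c_pos, c_negs):
    """NLL of the gold passage under a softmax over inner-product scores
    against the sampled negatives (the training objective above)."""
    scores = np.array([q_vec @ c_pos] + [q_vec @ c for c in c_negs])
    m = scores.max()                          # stabilized log-sum-exp
    log_z = m + np.log(np.exp(scores - m).sum())
    return log_z - scores[0]
```

When the gold passage scores well above the negatives, the loss approaches zero; scoring the positive and a single negative equally gives log 2.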

Iterated Negatives for Dense Retrieval
The choice of negatives is an important factor in what behaviour dense retrievers learn. Simply using randomly-sampled negatives has empirically been shown to give poor results, because such negatives are too easy for the model to discriminate from the gold passage. Thus, in dense retrieval, it is common to mix in some hard negatives along with random negatives; these are designed to be more challenging to distinguish from gold passages (Karpukhin et al., 2020). Hard negatives are usually collected by retrieving passages related to a question with an untrained retriever, such as BM25, and filtering out any unintentional golds. This ensures the hard negatives are at least topically relevant.
Recently, it has become common practice to run a number of rounds of dense retrieval training to bootstrap hard negatives (Xiong et al., 2020; Qu et al., 2021; Oguz et al., 2021; Sachan et al., 2021, inter alia). Here, we first train a dense retriever following the method described above, and then use this retriever to produce a new set of hard negatives. The retriever is then discarded, and a new one is trained from scratch using the new, "harder" negatives. This process can be repeated until performance ceases to improve. This approach, which we refer to as dense retrieval with iteratively-sampled negatives, is listed in Algorithm 1.
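The iterative-negatives loop can be sketched as below. The callables `train_fn`, `mine_fn`, and `eval_fn` are hypothetical stand-ins for bi-encoder training, hard-negative mining, and development-set evaluation; Algorithm 1 in the paper fills in the details.

```python
def iterative_negative_training(train_fn, mine_fn, eval_fn, init_retriever, max_rounds=5):
    """train_fn(negatives) -> retriever; mine_fn(retriever) -> negatives;
    eval_fn(retriever) -> dev accuracy. Each round discards the previous model
    and retrains from scratch on the newly mined, harder negatives."""
    best, best_acc = init_retriever, eval_fn(init_retriever)
    for _ in range(max_rounds):
        negatives = mine_fn(best)          # retrieve with current model, drop golds
        candidate = train_fn(negatives)    # train a fresh retriever on them
        acc = eval_fn(candidate)
        if acc <= best_acc:                # stop once dev performance saturates
            return best
        best, best_acc = candidate, acc
    return best
```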

Boosting
Boosting is a loose family of training algorithms for machine learning problems, based on the principle of gradually ensembling "weak learners" into a strong learner. Boosting can be described by the following high-level formalism (Schapire, 2007). For a task with a training set {(x_1, y_1), ..., (x_m, y_m)}, where (x_i, y_i) ∈ X × Y, we want to learn a function h : X → Y such that h(x_i) = ŷ_i ≈ y_i. This is achieved using an iterative procedure over R steps:

• For round r, construct an importance distribution D_r over the training data, concentrated where the error of the current model h is high.

• Learn a "weak learner" h_r to minimize the error \epsilon_r = \sum_i D_r(i) \, L(h_r(x_i), y_i), for some loss function L measuring the discrepancy between predictions and true values.

• Combine h and h_r to form a new, stronger overall model, e.g. by linear combination h_new = α h_r + β h. The iteration can then be repeated.

The initial importance distribution D_0 is usually assumed to be uniform, and the initial model h_0 a constant function. Note how each additional model added to h is specifically designed to solve instances that h currently struggles with.
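To make the formalism concrete, here is a minimal boosting loop for one-dimensional regression with threshold stumps as weak learners. This is an illustrative example, not the paper's retrieval setting: each round fits the current residual, i.e. precisely the instances the ensemble still gets wrong, and h_0 is the constant zero function.

```python
import numpy as np

def fit_stump(x, residual):
    """Weak learner: the threshold split minimizing squared error on the residual."""
    best = (np.inf, None, 0.0, 0.0)
    for t in np.unique(x):
        left, right = residual[x <= t], residual[x > t]
        lv = left.mean() if left.size else 0.0
        rv = right.mean() if right.size else 0.0
        err = ((left - lv) ** 2).sum() + ((right - rv) ** 2).sum()
        if err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]

def boost(x, y, rounds=10):
    pred = np.zeros_like(y, dtype=float)      # h_0: constant function
    stumps = []
    for _ in range(rounds):
        t, lv, rv = fit_stump(x, y - pred)    # weak learner targets current errors
        pred += np.where(x <= t, lv, rv)      # linear combination (alpha = beta = 1)
        stumps.append((t, lv, rv))
    return pred, stumps
```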

Boosted Dense Retrieval: DrBoost
We note similarities between the boosting formulation and dense retrieval with iteratively-sampled negatives. We can adapt a boosting-inspired approach to dense retrieval with minimal changes, as shown in Algorithm 1. Algorithmically, the only difference (lines 10-13) is that with iterative negatives, the model h after r rounds is replaced by the new model h_r, whereas in the boosting case, we combine h_r and h.
In this paper, we view the boosted "weak learner" models h_r as rerankers over the retrieval distribution of the current model h. That is, when training dense boosted retrievers, we train using only hard negatives, with no random or in-batch negatives. Each new model is thus directly trained to fix the retrieval mistakes the current ensemble makes, strengthening the connection to boosting: the construction of negatives serves as the mechanism defining the importance distribution.
Each model h_r is implemented as a bi-encoder, as in Equation (1). We combine models as linear combinations:

h(q, c) = \sum_{r=1}^{R} \alpha_r h_r(q, c) = \sum_{r=1}^{R} \alpha_r \, \mathbf{q}_r^\top \mathbf{c}_r.

The coefficients could be learnt from development data, or simply all set to 1, which we find to be empirically effective. The overall model after R rounds can then be written as

h(q, c) = [\alpha_1 \mathbf{q}_1; \ldots; \alpha_R \mathbf{q}_R]^\top [\mathbf{c}_1; \ldots; \mathbf{c}_R],

where [...] indicates vector concatenation. Thus h is fully decomposable and can be computed as a single inner product; as such, it is a drop-in replacement for standard MIPS dense retrievers at test time. The vectors produced by each boosted dense retriever component can be considered subvectors of overall dense representations q and c that are "grown" one round at a time, imparting some structure to the overall vectors.
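The decomposability claim is easy to verify numerically: with all α_r = 1, the sum of component inner products equals a single inner product over concatenated vectors. A toy check with random subvectors:

```python
import numpy as np

rng = np.random.default_rng(0)
R, d = 4, 32                                   # 4 weak learners, 32-dim subvectors
q_parts = [rng.normal(size=d) for _ in range(R)]
c_parts = [rng.normal(size=d) for _ in range(R)]

# Linear combination of component inner products (all alpha_r = 1) ...
score_sum = sum(q @ c for q, c in zip(q_parts, c_parts))

# ... equals one inner product over the concatenated vectors, so the
# ensemble is served with a single standard MIPS call at test time.
score_cat = np.concatenate(q_parts) @ np.concatenate(c_parts)
assert np.isclose(score_sum, score_cat)
```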
One downside of the boosting approach is that we must maintain R encoders for both passages and questions. Since passages are embedded offline, this creates no additional computational burden on the passage side at test time. On the query side, however, boosted dense retrieval requires R forward passes to compute the full representation of a question q, one for each subvector q_r. This is expensive, and will result in high-latency search. While this step is fully parallelizable, it is still undesirable. We can remedy this for low-latency, low-resource settings by distilling the question encoders of h into a single encoder that produces the overall question representation q directly. Given the training dataset D_train of gold question-passage pairs and a model h we want to distill, we first compute the overall representations q and c for all pairs using h as distillation targets, then train a new question encoder E^dist_Q with parameters φ by minimizing the objective

\mathcal{L}(\phi) = \sum_{(q_i, c_i^+) \in \mathcal{D}_{\text{train}}} \left\| E^{\text{dist}}_Q(q_i; \phi) - \mathbf{q}_i \right\|_2^2.

Experiments
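A minimal sketch of the distillation objective above, using a linear map as a stand-in for the distilled BERT question encoder. The targets are the ensemble's concatenated question vectors; all sizes are toy assumptions. With a linear encoder, minimizing the L2 objective is ordinary least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out = 200, 16, 8      # toy sizes; real targets are R concatenated subvectors

X = rng.normal(size=(n, d_in))                  # stand-in for question features
Q_target = X @ rng.normal(size=(d_in, d_out))   # ensemble's concatenated question vectors

# Distillation: minimize sum_i ||E_dist(q_i) - q_i||^2 over the encoder's
# parameters. For a linear encoder W, the optimum is the least-squares fit.
W, *_ = np.linalg.lstsq(X, Q_target, rcond=None)
loss = ((X @ W - Q_target) ** 2).sum()
```

In this synthetic setup the targets are exactly linear in the inputs, so the distillation loss is driven to (numerically) zero; the real distilled encoder only approximates the ensemble, which is why recall drops slightly in the experiments.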

Datasets
We train models to perform retrieval for the following two tasks.

Natural Questions (NQ) We evaluate retrieval for downstream ODQA using the widely-used NQ-open retrieval task (Kwiatkowski et al., 2019).
This requires retrieving Wikipedia passages which contain answers to questions mined from Google search logs. Gold passages are annotated by crowdworkers as containing answers to the given questions. The retrieval corpus consists of 21M passages, each 100 words in length. We use the preprocessed corpus and gold pairs prepared by Karpukhin et al. (2020), and report recall-at-K (R@K) for K ∈ {20, 100}.
MSMARCO We evaluate in a web-text setting using the widely-used passage retrieval task from MSMARCO (Bajaj et al., 2016). Queries consist of user search queries from Bing, with human-annotated gold relevant documents. The corpus consists of 8.8M passages, and we use the preprocessed corpus, gold training and dev pairs, and data splits from Oguz et al. (2021). We follow common practice and report Mean Reciprocal Rank at 10 (MRR@10) on the public development set.

Tasks
In this section, we describe the experiments we perform and the motivations behind them.
Exact Retrieval We are interested in whether the boosting approach yields superior performance for exhaustive (exact) retrieval. Here, no quantization or approximation is applied to MIPS, which results in large indices and slow retrieval, but represents the upper bound of accuracy. This is the setting most commonly reported in the literature.
Approximate MIPS: IVF Exact retrieval does not evaluate how a model performs in practically-relevant settings, so we also evaluate in two approximate MIPS settings. First, we consider approximate MIPS with an Inverted File Index (IVF; Sivic and Zisserman, 2003). IVF works by first clustering the document embeddings offline using K-means (Lloyd, 1982). At test time, for a given query vector, rather than computing an inner product for each document in the index, we compute inner products to the K centroids, then visit the n_probes highest-scoring clusters and compute inner products only for the documents in those clusters. This technique speeds up search significantly, at the expense of some accuracy. Increasing the number of centroids K increases speed at the expense of accuracy, as does decreasing n_probes. A model is preferable if retrieval accuracy remains high with very fast search, i.e. low n_probes and high K.
In our experiments we fit K = 65536 clusters and sweep over values of n_probes from 2^0 to 2^15. Other methods such as HNSW (Malkov and Yashunin, 2020) are also available for fast search, but they are generally more complex and can increase index sizes significantly. IVF is a particularly popular approach due to its simplicity, and serves as the first coarse quantizer in hierarchical indexing (Johnson et al., 2019), since it is straightforward to shard the clusters and build further search indices within each cluster.
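A self-contained NumPy sketch of IVF search. The experiments above use a production ANN library at much larger scale; this toy version only illustrates the probe mechanism: cluster once offline, then at query time score centroids first and scan only the probed clusters.

```python
import numpy as np

def kmeans(x, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns centroids and cluster assignments."""
    rng = np.random.default_rng(seed)
    cents = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        assign = ((x[:, None, :] - cents[None, :, :]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            members = x[assign == j]
            if len(members):
                cents[j] = members.mean(0)
    assign = ((x[:, None, :] - cents[None, :, :]) ** 2).sum(-1).argmin(1)
    return cents, assign

def ivf_search(q, x, cents, assign, n_probes, topk=5):
    """Visit only the n_probes clusters whose centroids are closest to the
    query, and compute inner products for just those documents."""
    probe = np.argsort(((cents - q) ** 2).sum(1))[:n_probes]
    cand = np.flatnonzero(np.isin(assign, probe))
    order = np.argsort(-(x[cand] @ q))
    return cand[order[:topk]]
```

Setting n_probes equal to the number of clusters recovers exhaustive search; lowering it trades accuracy for speed, which is exactly the sweep reported in the tables.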
Approximate MIPS: PQ Whilst IVF increases search speed, it does not reduce the size of the index, which may matter for scalability, latency, and memory-bandwidth reasons. To investigate whether embeddings are amenable to compression, we experiment with Product Quantization (PQ; Jégou et al., 2011). PQ is a lossy quantization method that works by 1) splitting vectors into subvectors, 2) clustering each subvector space, and 3) representing vectors as a collection of cluster assignment codes. For further details, the reader is referred to Jégou et al. (2011). We apply PQ with 4-dimensional subvectors and 256 clusters per subspace, giving a compression factor of 16x over uncompressed float32.
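A toy NumPy sketch of PQ with the settings above: 4-dimensional subvectors and 256 codes per subspace mean each 4-float (16-byte) subvector is replaced by one uint8 code, hence the 16x compression. The tiny k-means inside is a stand-in for a proper PQ trainer.

```python
import numpy as np

def pq_train_encode(x, sub_dim=4, n_codes=256, iters=5, seed=0):
    """Split d-dim vectors into d/sub_dim subvectors, k-means each subspace,
    and store one uint8 code per subvector."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    m = d // sub_dim
    books, codes = [], np.empty((n, m), dtype=np.uint8)
    for s in range(m):
        sub = x[:, s * sub_dim:(s + 1) * sub_dim]
        cents = sub[rng.choice(n, size=n_codes, replace=False)].copy()
        for _ in range(iters):                       # a few Lloyd iterations
            a = ((sub[:, None] - cents[None]) ** 2).sum(-1).argmin(1)
            for j in range(n_codes):
                if (a == j).any():
                    cents[j] = sub[a == j].mean(0)
        codes[:, s] = ((sub[:, None] - cents[None]) ** 2).sum(-1).argmin(1)
        books.append(cents)
    return books, codes

def pq_decode(books, codes):
    """Reconstruct approximate vectors from codebooks and codes."""
    return np.concatenate([books[s][codes[:, s]] for s in range(codes.shape[1])], axis=1)
```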
Generalization Tests In addition to in-domain evaluation, we perform two generalization tests to determine whether the boosting approach is superior to iteratively-sampled negatives in out-of-distribution settings. We evaluate MSMARCO-trained models for zero-shot generalization using the BEIR (Thakur et al., 2021) datasets that have binary relevance labels, namely SciFact, FiQA, Quora, and ArguAna. This tests how well models generalize to new textual domains and different query surface forms. We also evaluate NQ-trained models on EntityQuestions (Sciavolino et al., 2021), a dataset of simple entity-centric questions recently shown to challenge dense retrievers. This dataset uses the same Wikipedia index as NQ, and primarily tests robustness and generalization to new entities at test time.

Models
We compare a model trained with iteratively-sampled negatives to an analogous model trained with boosting, which we call DrBoost. Many dense retrieval training algorithms would be suitable both for iteratively-sampled negatives and for boosting with DrBoost; broadly speaking, any dense retriever that utilizes negative sampling could be trained in step 9 of Algorithm 1. We choose Dense Passage Retriever (DPR; Karpukhin et al., 2020) with iteratively-sampled negatives due to its comparative simplicity and popularity.

Iteratively-sampled negatives baseline: DPR
DPR follows the dense retrieval paradigm outlined in Section 2. It is trained with a combination of in-batch negatives, where gold passages for one question are treated as negatives for the other questions in the batch (which efficiently simulates random negatives), and hard negatives, sampled initially from BM25 and subsequently from the previous round's model, as in Algorithm 1. We broadly follow the DPR training setup of Oguz et al. (2021). We train BERT-base DPR models with the standard 768 dimensions, as well as models matching the final dimension of DrBoost. We use parameter sharing for the bi-encoders, and layer-norm after the linear projection. Models are trained to minimize the NLL of positives, and the number of training rounds is decided using development data, as in Algorithm 1, with BM25 as the initial retriever h_0.

DrBoost Implementation
For our DrBoost version of DPR, we keep as many experimental settings the same as possible, with two exceptions required for adapting dense retrieval to boosting. First, each component "weak learner" has a low embedding dimension. This avoids overfitting, ensures each model is not too powerful, and keeps the final index a manageable size. We report results using 32-dim models (c.f. the standard 768 dim), but note that training is stable with dimensions as low as 8. Second, as motivated in Section 2.3, we train each weak learner using only hard negatives, and no in-batch negatives; in effect, this choice of negatives means each model is essentially trained as a reranker. DrBoost models are fit following Algorithm 1, and we stop adding models when development-set performance stops improving. The initial retriever h_0 for DrBoost is a constant function, so the initial negatives for DrBoost are sampled at random from the corpus, unlike DPR, which collects initial hard negatives from BM25.
DrBoost α Coefficients DrBoost combines weak learners as a linear combination. We experimented with learning the α coefficients on development data, but this did not significantly improve results over simply setting them all to 1, so for the sake of simplicity and efficiency we report DrBoost numbers with all α = 1.0. Empirically, we find the embedding magnitudes of DrBoost's component models to be similar, so their inner products do not differ drastically and no single component dominates the others.

DrBoost Distillation
We experiment with distilling DrBoost ensembles into a single model for latency-sensitive applications, using the L2 loss from the end of Section 2.3. We distill into a single BERT-base query encoder, and perform early stopping and model selection using the development L2 loss.

Exact Retrieval
Exact retrieval results for MSMARCO and NaturalQuestions are shown in the "Exact Search" column of Table 1. We find that our DrBoost version of DPR reaches peak accuracy after 5 or 6 rounds when using 32-dim weak learners (see Section 4.1.1), leading to an overall test-time index of 160/192 dimensions. In terms of exact search, DrBoost outperforms the iteratively-sampled-negatives DPR baseline on MSMARCO by 2.2%, and trails it by only 0.3% on NQ R@100, despite having a total dimension 4-5× smaller. It also strongly outperforms a dimensionally-matched DPR, by 3% on MSMARCO and 1% on NQ R@100, demonstrating DrBoost's ability to learn high-quality, compact embeddings. We also quote recent state-of-the-art results, which generally achieve stronger exact-search numbers (Zhang et al., 2021). Our emphasis, however, is on comparing iteratively-sampled negatives to boosting, and we note that state-of-the-art approaches generally use larger models and more complex training strategies than the "inner loop" BERT-base DPR we report here. Such strategies could also be incorporated into DrBoost if higher accuracy were desired, as DrBoost is largely agnostic to the training algorithm used.

Number of Rounds
The performance of DPR and DrBoost on MSMARCO for different numbers of rounds is shown in Table 2. We find that all models saturate at about 4 or 5 rounds. Note that DrBoost does not need more iterations to train, even though it does not use BM25 negatives in the first round. On NQ, adding a 6th model slightly improves DrBoost's precision, at the expense of recall (see Table 1). While iterative training is expensive, we find that subsequent rounds are much cheaper than the first: in our experiments the first round takes ~20K steps to converge, with additional DrBoost rounds converging after about 3K steps.
Bagging Dense Retrieval We also trained a simple ensemble of six 32-dim DPR models for NQ, which we compare to our 6×32-dim DrBoost. This experiment investigates whether the improvement over DPR is just a simple ensembling effect, or whether it is due to boosting and the resulting specialization. The DPR ensemble performs poorly, scoring 74.5 R@20 (not shown in tables), 6.8% below the equivalent DrBoost, confirming that the boosting formulation is important, not simply having several ensembled dense retrievers.

Approximate MIPS
Table 1 shows how DPR and DrBoost behave under IVF MIPS search, also shown graphically in Figure 1. We find that DrBoost dramatically outperforms DPR under IVF search, indicating that much faster search is possible with DrBoost. High-dimensional embeddings suffer under IVF due to the curse of dimensionality, so compact embeddings are important. Using 8 search probes, DrBoost outperforms DPR by 10.5% on MSMARCO and 6.3% on NQ R@100. The dimensionally-matched DPR is stronger, but still trails DrBoost by about 4% at 8 probes. The strongest exact-search model is thus not necessarily the best in practical approximate MIPS settings. For example, if we can tolerate a 10% relative drop from the best-performing system's exact-search accuracy, DrBoost requires 16 (4) probes on MSMARCO (NQ) to reach the required accuracy, whereas DPR requires 1024 (16), meaning DrBoost can be operated approximately 64× (4×) faster.
The distilled DrBoost is also shown for NQ in Table 1. Its precision (low R@K values) is essentially unaffected (exact search drops by 0.1% for R@20), but recall drops slightly (-0.7% R@100). Interestingly, the distilled DrBoost performs even better under IVF search, improving over DrBoost by ~1% at low numbers of probes. Crucially, whilst the distilled DrBoost is only slightly better than the 192-dim DPR under exact search, it is 4-5% stronger under IVF with 8 probes (alternatively, 8× faster at equivalent accuracy).
Fast retrieval is important, but we may also require small indices for edge devices or for scalability reasons. We have already established that DrBoost produces high-quality compact embeddings; Product Quantization can shrink them even further. Table 3 shows that DrBoost's NQ index can be compressed from 13.5GB to 840MB with less than a 1% drop in performance. We compare to BPR (Yamada et al., 2021), a method specifically designed to learn small indices via binary vectors. DrBoost's PQ index is 2.4× smaller than the BPR index reported by Yamada et al. (2021), whilst being 2.4% more accurate (R@20). A more aggressive quantization leads to a 420MB index, 4.8× smaller than BPR, whilst being only 1.2% less accurate.

Analysis
We conduct qualitative and quantitative analysis to better understand DrBoost's behavior.

Qualitative Analysis
Since each round's model is learned on the errors of the previous rounds, we expect each learner to "specialize" and learn complementary representations. To see whether this holds qualitatively, we inspect the passages retrieved by each round's retriever in isolation. Indeed, we find that each 32-dim subvector tackles the query from a different angle. For instance, for the query "who got the first nobel prize in physics?", the first subvector captures general topical similarity based on keywords, retrieving passages related to the "Nobel Prize". The second focuses mostly on the first paragraphs of Wikipedia articles about prominent historical figures, presumably because these are highly likely to contain answers in general; and the third retrieves from the pages of famous scientists and inventors. The combined DrBoost model favors passages in the intersection of these sets. Examples can be seen in Table 10 in the Appendix.

In-distribution generalization
Boosting algorithms are remarkably resistant to overfitting, even when the combined classifier has sufficient capacity to achieve zero training error. In their landmark paper, Bartlett et al. (1998) show that this desirable generalization property results from the training margins increasing with each iteration of boosting. We empirically show the same to be true for DrBoost. For a given query embedding, dense retrieval acts as a linear classifier, where the gold passage is the positive and all other passages are negatives (Eq. 1). We adapt the classical definition of margin for linear classifiers to dense retrieval by defining a top-k margin as follows:

\text{margin}_k(q_i) = \frac{1}{\mu_c} \Big( \mathbf{q}_i^\top \mathbf{c}_i^+ - \max^{\{k\}}_{c \in \mathcal{C} \setminus \{c_i^+\}} \mathbf{q}_i^\top \mathbf{c} \Big),  (2)

where μ_c is the average norm of the passage embeddings and the operator max^{k} returns the k-th maximum element of the set. For a fixed q_i and k = 1, this definition is identical to the classical margin. Figure 2 plots the 50th, 75th, and 90th percentiles of the top-20 margin for DrBoost on the NQ training set. We clearly see that margins indeed increase at each step, especially for cases the model is confident in (high margin). We hypothesize this property to be the main reason for the strong in-distribution generalization of DrBoost that we observed, and potentially also for the surprisingly strong IVF results, since wide margins should intuitively make clustering easier as well.
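The top-k margin of Eq. 2 can be computed directly from a query embedding and the passage-embedding matrix. A short sketch (the function name `topk_margin` is our own):

```python
import numpy as np

def topk_margin(q_vec, pos_idx, passages, k=20):
    """Top-k margin: gold score minus the k-th best negative score,
    normalized by the average passage-embedding norm (mu_c in Eq. 2)."""
    mu = np.linalg.norm(passages, axis=1).mean()
    scores = passages @ q_vec
    negatives = np.delete(scores, pos_idx)
    kth_best = np.sort(negatives)[-k]      # k-th maximum negative score
    return (scores[pos_idx] - kth_best) / mu
```

With k = 1 this reduces to the classical margin of a linear classifier; plotting its percentiles over training rounds reproduces the analysis behind Figure 2.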

Cross-domain generalization
It has been observed in previous work (Thakur et al., 2021) that dense retrievers still largely lag behind sparse retrievers in terms of generalization. We are interested in whether our method is beneficial for out-of-domain transfer as well. We show results for zero-shot transfer on a subset of the BEIR benchmark in Table 4, and on the EntityQuestions dataset in Table 5. While DrBoost improves slightly over the dimension-matched baseline on EntityQuestions, where the passage corpus stays the same, it produces worse results on the BEIR datasets. We conclude that boosting is not especially useful for cross-domain transfer, and should be combined with other methods if this is a concern. We leave this for future work.

Representation Probing
One hypothesis we formulate for the stronger performance of DrBoost over DPR is that the former better captures the topical information of passages and questions. To test this, we collected topics for all Wikipedia articles in Natural Questions using the strategy of Johnson et al. (2021) and associated them with both passages and questions. We then probed both DPR and DrBoost representations with an SVM classifier (Steinwart and Christmann, 2008), using 5-fold cross-validation over 500 instances and 8 different seeds. The results (Figure 3) confirm our hypothesis: topic-classification accuracy is higher with DrBoost representations than with DPR representations of the same dimension (i.e., 192), for both questions and passages.
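The probing setup can be sketched with scikit-learn (assuming its availability; the function name, fold count, and seed handling are illustrative of the procedure, not the paper's exact code):

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def probe_topics(embeddings, topic_labels, seeds=(0,), folds=5):
    """Average k-fold accuracy of an SVM topic classifier trained on frozen
    embeddings; higher accuracy means more linearly-recoverable topical info."""
    scores = []
    for seed in seeds:
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(embeddings))   # reshuffle folds per seed
        clf = SVC(kernel="linear")
        scores.append(cross_val_score(clf, embeddings[order],
                                      topic_labels[order], cv=folds).mean())
    return float(np.mean(scores))
```

Running this once on DPR embeddings and once on dimension-matched DrBoost embeddings, with the same labels, gives the comparison plotted in Figure 3.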

Related Work
Boosting for retrieval Boosting has been studied in machine learning for over three decades (Kearns and Valiant, 1989; Schapire, 1990). Models such as AdaBoost (Freund and Schapire, 1997) and GBMs (Friedman, 2001) became popular approaches to classification problems, with implementations such as XGBoost still popular today (Chen and Guestrin, 2016). Many boosting approaches have been proposed for retrieval and learning-to-rank (LTR) problems, typically employing decision trees, such as AdaRank (Xu and Li, 2007), RankBoost (Freund et al., 2003), and LambdaMART (Wu et al., 2009). Apart from speed and accuracy, boosting is attractive due to promising theoretical properties such as convergence and generalization (Bartlett et al., 1998; Freund et al., 2003; Mohri et al., 2012). Boosted decision trees have recently been shown to remain competitive on LTR tasks (Qin et al., 2021), but in recent years boosting has generally received less attention, as (pretrained) neural models came to dominate much of the literature. However, modern neural models and boosting techniques need not be exclusive, and a small amount of work has explored boosting in the context of modern pretrained neural models (Huang et al., 2020; Qin et al., 2021). Our work follows this line of thinking, identifying dimensionally-constrained bi-encoders as good candidates for neural weak learners, and adopting a simple boosting approach that allows for simple and efficient MIPS at test time.
Dense Retrieval Sparse, term-based retrievers such as BM25 (Robertson and Zaragoza, 2009) dominated retrieval until recently. Dense, MIPS-based retrieval using bi-encoder architectures and contrastive training with gold pairs (Yih et al., 2011) has recently been shown to be effective in several settings (Lee et al., 2019; Karpukhin et al., 2020; Reimers and Gurevych, 2019; Hofstätter et al., 2021b); see Yates et al. (2021) for a survey. The success of dense retrieval has led to many recent papers proposing schemes to improve dense retriever training, by innovating on how negatives are sampled (Xiong et al., 2020; Qu et al., 2021; Zhan et al., 2021c; Lin et al., 2021, inter alia) and/or proposing pretraining objectives (Oguz et al., 2021; Guu et al., 2020; Chang et al., 2020; Sachan et al., 2021; Gao and Callan, 2021). Our work also innovates on how dense retrievers are trained, but is arguably orthogonal to most of these training innovations, since they could still be employed when training each component weak learner.
Distillation We leverage a simple distillation technique to make DrBoost more efficient at test time. Distillation for dense retrievers is an active area, and more complex schemes exist which could improve results further (Izacard and Grave, 2021; Qu et al., 2021; Yang and Seo, 2020; Lin et al., 2021; Hofstätter et al., 2021a; Barkan et al., 2020; Gao et al., 2020).
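One simple way to realize such a distillation is to train a single student encoder to regress onto the ensemble's concatenated embedding. The sketch below expresses this as a plain MSE objective; this is an assumed formulation for illustration, not necessarily the exact recipe used here:

```python
import numpy as np

def distill_loss(student_emb, teacher_subvecs):
    """Match the student's single embedding to the concatenation of the
    ensemble's subvectors with a plain MSE -- a simple way to collapse
    several weak learners into one encoder at test time."""
    target = np.concatenate(teacher_subvecs, axis=-1)
    return float(np.mean((student_emb - target) ** 2))
```

Because the target is just the concatenated representation, a student trained this way can be dropped into the same MIPS index that served the ensemble.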
Multi-vector Retrievers Several approaches represent passages with multiple vectors. Humeau et al. (2020) represent queries with multiple vectors, but retrieval is comparatively slow as relevance cannot be calculated with a single MIPS call. ME-BERT (Luan et al., 2021) indexes a fixed number of vectors for each passage, and ColBERT (Khattab and Zaharia, 2020) indexes a vector for every word. Both can perform retrieval with a single MIPS call (although ColBERT requires reranking), but produce very large indices, which, in turn, slows down search. DrBoost can also be seen as a multi-vector approach, with each weak learner producing a vector. However, each vector is small, and we index concatenated vectors, rather than indexing each vector independently, leading to small indices and fast search. That said, adapting DrBoost-style training to these settings would be feasible. SPAR (Chen et al., 2021) is a two-vector method: one vector comes from a standard dense retriever, and the other from a more lexically-oriented model. SPAR uses a similar test-time MIPS retrieval strategy to ours, and SPAR's lexical embeddings could be trivially added to DrBoost as an additional subvector.
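The reason indexing concatenated vectors keeps search to a single MIPS call is an elementary identity: the inner product of concatenated vectors equals the sum of the per-subvector inner products. A small numpy check (the subvector count and dimension are arbitrary toy choices):

```python
import numpy as np

rng = np.random.default_rng(0)
q_sub = [rng.standard_normal(32) for _ in range(4)]  # 4 weak learners, 32-d each
p_sub = [rng.standard_normal(32) for _ in range(4)]

# Summing the per-learner inner products...
ensemble_score = sum(float(q @ p) for q, p in zip(q_sub, p_sub))
# ...equals one inner product over the 128-d concatenated vectors,
# so the whole ensemble is served by a single MIPS index.
concat_score = float(np.concatenate(q_sub) @ np.concatenate(p_sub))
```

This is what makes the ensemble a drop-in replacement for a standard single-vector dense retriever.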
Efficient retrievers There have been a number of recent efforts to build more efficient retrieval and question answering systems (Min et al., 2021). Izacard et al. (2020) and Yang and Seo (2021) experiment with post-hoc compression and lower-dimensional embeddings, Lewis et al. (2021) index and retrieve question-answer pairs, and Yamada et al. (2021) propose BPR, which approximates MIPS using binary vectors. There is also a line of work learning embeddings specifically suited for approximate search (Yu et al., 2018; Zhan et al., 2021a,b). Generative retrievers (De Cao et al., 2021) can also be very efficient. DrBoost also employs lower-dimensional embeddings and off-the-shelf post-hoc compression for its smallest index, producing smaller indices than BPR, whilst also being more accurate.
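The approximate-search setting referenced throughout (IVF with coarse quantization) can be sketched in a few lines: cluster the passage vectors offline, then at query time score only the passages in the few clusters whose centroids best match the query. A toy numpy version with precomputed centroids and assignments (a real deployment would use a library such as FAISS rather than this sketch):

```python
import numpy as np

def ivf_search(query, centroids, assignments, passages, nprobe=1, k=10):
    """Coarse-quantized (IVF) search: score the query against cluster
    centroids, visit only the nprobe best clusters, and rank just the
    passages assigned to them -- trading some accuracy for speed."""
    probe = np.argsort(-(centroids @ query))[:nprobe]        # clusters to visit
    cand = np.flatnonzero(np.isin(assignments, probe))       # their passages
    scores = passages[cand] @ query
    return cand[np.argsort(-scores)[:k]]
```

Increasing `nprobe` visits more clusters, recovering accuracy at the cost of latency; Figure 1's trade-off curve is exactly this knob.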

Discussion
In this work we have explored boosting in the context of dense retrieval, inspired by the similarity between iteratively sampling negatives and boosting. We find that our simple boosting approach, DrBoost, performs largely on par with a 768-dimensional DPR baseline, but produces more compact vectors and is more amenable to approximate search. We note that DrBoost requires maintaining more neural models at test time, which may put a greater demand on GPU resources. However, the models can be run in parallel if latency is a concern, and, if needed, they can be distilled into a single model with little drop in accuracy. We hope that future work will build on boosting approaches for dense retrieval, including adding adaptive weights and investigating alternative losses and sampling techniques. We also suggest that evaluation in dense retrieval should be more holistic than exact retrieval accuracy alone: we demonstrate that models with quite similar exact retrieval accuracy can perform very differently under practically-important approximate search settings.

Figure 1: Search accuracy vs. the number of clusters visited in IVF search (proportional to latency). Accuracy drops as search speed increases, but the drop-off for DrBoost is much slower than for DPR.

Figure 2: Quantiles of the top-20 margin on the NQ training set, for each iteration of DrBoost.

Figure 3: Topic classification accuracy when probing DrBoost and DPR representations with an SVM.

Table 1: Summary of results on the MSMARCO development set and the NaturalQuestions test set. "Exact" indicates exact MIPS results; "IVF" indicates IVF MIPS search with 65K centroids, with the number of search probes (proportional to search speed) indicated. *Dimension-matched DPR is 160 dims for MSMARCO and 192 for NQ.

Table 2: Ablations over the number of rounds for DPR with iterative negatives and for DrBoost on MSMARCO.

Table 4: A row is copied from Thakur et al. (2021), and the numbers represent the best model for each dataset.

Table 5: EntityQuestions results.

Table 7: IVF indexing results on NQ. The metric is Recall@100. The number of clusters used for IVF training was 65,536.