1 Introduction

In the past, lists of results obtained by information retrieval (IR) systems were ranked by probabilistic models, such as the BM25 measure (Robertson and Zaragoza 2009), based on a small number of attributes (the frequency of query terms in the document, in the collection, etc.). The parameters of these models were usually tuned by trial and error. As the number of useful features increased, these manually crafted models became increasingly laborious to configure. Alternatively, one can use as many (possibly redundant) attributes as possible, and employ machine learning (ML) techniques to induce a ranking model. This approach alleviates the human effort needed to design the ranking function and also provides a natural way to directly optimize the retrieval performance for any particular application and evaluation metric. As a result, learning to rank has gained considerable academic interest in the past decade.

ML-based ranking systems are traditionally classified into three categories. In the simplest pointwise approach, the instances are first assigned a relevance score using classical regression or classification techniques and then ranked by posterior scores obtained using the trained model (Li et al. 2007). In the pairwise approach, the order of pairs of instances is treated as a binary label and learned by a classification method (Freund et al. 2003). Finally, in the most complex listwise approach, the fully ranked lists are learned by a tailor-made learning method which seeks to optimize a ranking-specific evaluation metric during the learning process (Valizadegan et al. 2009).

In web page ranking or subset ranking (Cossock and Zhang 2008) the training data is given in the form of (query, document, relevance label) triplets. The relevance label of a training instance indicates the usefulness of the document to its corresponding query, and the ranking for a particular query is usually evaluated via the (normalized) Discounted Cumulative Gain ((N)DCG) or the Expected Reciprocal Rank (ERR) (Chapelle et al. 2009) measure. It is rather difficult to extend classical learning methods to directly optimize these evaluation metrics. Nevertheless, since the DCG can be bounded by the zero-one-loss (Li et al. 2007), traditional classification error can be regarded as a surrogate function for DCG.

Calibrating the output of a classifier is crucial in applications with quality measures different from the zero-one-error (Niculescu-Mizil and Caruana 2005). Our approach is based on the calibration of a multi-class classifier learned by AdaBoost.MH (Freund and Schapire 1997) or a multi-class Support Vector Machine (Crammer and Singer 2001) (SVM). The class labels are assumed to be random variables and the goal is to estimate the probability distribution of the class labels given a feature vector representing the (query, document) pair. A key novelty in our approach is that, instead of using a single calibration technique, we apply several methods to estimate the same probability distribution and we combine these estimates in a final step. Our approach is motivated by the Bayesian paradigm and the Minimum Description Length principle (Rissanen 1983), both of which suggest that it is usually more efficient to mix different conditional distributions according to a prior than to select one “optimal” distribution.

We use both regression-based calibration (RBC) and class-probability-based calibration (CPC) to transform the output scores of multi-class classifiers into relevance label estimates. In RBC the real-valued scores are obtained by regressing the relevance grades against the output score vector, whereas in CPC the posterior probability distribution is used to approximate the so-called Bayes-scoring function (Cossock and Zhang 2008), which is shown to optimize the expected DCG in a probabilistic setup.

The proper choice of the weighting of the set of conditional distributions obtained by the calibrated classifiers is an important decision in practice. In this paper, we use an exponential scheme based on the quality of the rankings implied by the conditional distributions (via their corresponding conditional ranking functions) which is theoretically better-founded than the uniformly weighted aggregation used by McRank (Li et al. 2007).

Figure 1 offers a structural overview of our system. It is based on a set of standard techniques of (i) multiclass classification, (ii) output score calibration, and (iii) an exponentially weighted forecaster that is used to combine the various hypotheses. The computationally expensive first two steps belong to the simplest, pointwise category of learning-to-rank models, whereas the final mixing step optimizes a listwise objective function.

Fig. 1 The schematic overview of our approach. On the first level, a multi-class method (AdaBoost.MH) is trained using different hyperparameter settings. Then we calibrate the multi-class models in several ways to obtain diverse scoring functions. In the last step, we aggregate the scoring functions using an exponential weighting.

Most of the learning-to-rank methods had been tested on (and tuned to) the relatively small LETOR data sets, published by Microsoft. Recently, two larger benchmark data sets have been published by Yahoo and Microsoft. In addition to the mandatory comparison of our approach with the state of the art of learning to rank, we carried out rigorous and exhaustive experiments to compare the methods with each other on these new sets. To our knowledge, this is the first large-scale study of this kind. Reproducibility was an important goal, so we give all the algorithmic details necessary to repeat the experiments. For the same reason, we only tested methods that can be implemented easily (without ambiguity) or for which an open source implementation is available. Our most important finding is that pointwise methods are more competitive on large data sets than had been previously thought, and that they scale better as the data sets grow.

The paper is organized as follows. In Sect. 2, we review some approaches that are the most similar to ours. In Sect. 3, we define the ranking problem formally and introduce our notation. Section 4 contains theoretical results that motivated the calibration techniques described in Sect. 5. We explain the final ensemble step in Sect. 6. Section 7 then contains our experimental results. In Sect. 8 we draw some pertinent conclusions and briefly suggest some ideas for future study.

2 Related work

Among the plethora of ranking algorithms, our approach is probably the closest to the McRank algorithm (Li et al. 2007). Both methods use a multi-class classification algorithm at their core (McRank uses gradient boosting, whereas we apply AdaBoost.MH and multi-class SVM). The chief novelties of our approach are that we use decision product base classifiers besides the popular decision trees and that we apply several different calibration approaches. Both elements add more diversity to our models, which we exploit by using a final meta-ensemble technique. In addition, McRank’s implementation is inefficient in the sense that the number of decision trees trained in each boosting iteration is as large as the number of different classes in the data set.

Even though McRank is not regarded as a state-of-the-art method itself, its importance is unquestionable. It can be viewed as a milestone that proved the raison d'être of classification-based learning-to-rank methods, and it drew the attention of learning-to-rank researchers to classification-based ranking algorithms. The most remarkable method motivated by McRank is LambdaMart (Wu et al. 2010), which adapts the MART algorithm to the subset ranking problem. The winning entry of Track 1 in the Yahoo! Learning-to-Rank Challenge (Chapelle et al. 2011) was largely based on this method.

In the Yahoo! challenge (Chapelle et al. 2011), the general conclusion was that listwise and pairwise methods achieved the best scores in general, but tailor-made pointwise approaches also achieved very competitive results. In particular, the approach presented here is based on the system we devised when we participated in the Yahoo! Learning-to-Rank Challenge (Busa-Fekete et al. 2011a). A preliminary version of this study appeared in a conference paper (Busa-Fekete et al. 2011b), but the results we present here are more general. The main contributions of this paper compared to Busa-Fekete et al. (2011a, 2011b) are that (1) we evaluate our approach on all publicly available benchmark data sets and investigate several issues experimentally; (2) we present a novel calibration approach, namely the sigmoid-based class probability calibration (CPC), which is theoretically better grounded than regression-based calibration; (3) we rigorously and exhaustively compare state-of-the-art open source rankers with a special emphasis on reproducibility; and (4) we test multi-class SVM as an alternative to AdaBoost.MH. As a theoretical motivation, we also provide an upper bound on the difference between the DCG value of the Bayes optimal scoring function and the DCG value achieved by its estimate using CPC in terms of the Kullback-Leibler divergence.

In a recent article (Kotlowski et al. 2011), it was shown that the bipartite ranking problem can be cast as a binary classification problem and that the rank regret of a classifier can be upper bounded by its regret for exponential loss and logistic loss. This result explains why many classifiers optimizing the exponential or logistic losses also perform well as rankers. For example, in Cortes and Mohri (2005) the authors showed experimentally that AdaBoost works well in the bipartite ranking problem.

3 Definition of the ranking problem

In this section we formally define the learning-to-rank problem and introduce the notation that will be used in the rest of this paper. First, let us assume that we are given a set of query objects \(\mathbf{D}= \{\mathcal{D}^{(1)},\ldots,\mathcal{D}^{(M)}\}\). Each query object \(\mathcal{D}^{(k)}\) consists of a set of \(n^{(k)}\) pairs

$$\mathcal{D}^{(k)} = \bigl\{ \bigl(\mathbf{x}^{(k)}_1, \ell^{(k)}_1 \bigr), \ldots, \bigl(\mathbf{x}^{(k)}_{n^{(k)}}, \ell^{(k)}_{n^{(k)}} \bigr) \bigr\}. $$

The real-valued feature vectors \(\mathbf{x}^{(k)}_{i}\) represent the kth query and the ith document received as a potential hit for the query. The label index \(\ell^{(k)}_{i}\) of the query-document pair \(\mathbf{x}^{(k)}_{i}\) is an integer between 1 and K. We assume that we are given a set of numerical relevance grades

$$\mathcal{Z}= \{z_1,\ldots,z_K\}. $$

The relevance grade \(z^{(k)}_{i} = z_{\ell^{(k)}_{i}}\) expresses the relevance of the ith document to the kth query on a numerical scale. A popular choice for the numerical relevance grades is

$$ z_\ell= 2^{\ell-1} - 1 $$
(1)

for all \(\ell=1,\ldots,K\).

The goal of the ranker is to output a permutation \(\mathbf{j}^{(k)} = (j_{1}, \ldots, j_{n^{(k)}})\) over the integers \((1,\ldots,n^{(k)})\) for each query object \(\mathcal{D}^{(k)}\). A widely used empirical measure of the quality of the permutation \(\mathbf{j}^{(k)}\) is the Discounted Cumulative Gain (DCG)

$$ \textrm{DCG} \bigl(\mathbf{j}^{(k)},\mathcal{D}^{(k)} \bigr) = \sum_{i=1}^{n^{(k)}} c_i z_{j_i}^{(k)}, $$
(2)

where \(c_i\) is the discount factor of the ith document in the permutation. The most common discount factor applied is

$$ c_i = \frac{1}{\log(1+i)}. $$
(3)

The rationale behind this formula is that a user is happier when he/she finds relevant documents early in the permutation. To normalize the DCG between 0 and 1, (2) is usually divided by the DCG score of the best permutation (NDCG). It is also common practice to truncate the sum (2) at \(n_{\max}\), defining the \(\textrm{DCG}_{n_{\max}}\) and \(\textrm{NDCG}_{n_{\max}}\) scores. The reason for this is that a user rarely browses beyond the first page of search results, which contains the first \(n_{\max}\) hits.
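For concreteness, the following minimal sketch (the function names and the use of the natural logarithm in the discount are our own conventions, not prescriptions of the paper) shows how the grades (1), the DCG (2) with the discount (3), and the NDCG normalization and truncation fit together.

```python
import math

def dcg(grades, n_max=None):
    """Discounted cumulative gain (2) with the discount factor (3).

    `grades` lists the relevance grades z in the order given by the
    permutation j; truncating at n_max gives DCG_{n_max}.
    """
    if n_max is not None:
        grades = grades[:n_max]
    return sum(z / math.log(1 + i) for i, z in enumerate(grades, start=1))

def ndcg(grades, n_max=None):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(grades, reverse=True), n_max)
    return dcg(grades, n_max) / ideal if ideal > 0 else 0.0

# Label indices 1..K mapped to grades by (1): z_l = 2^(l-1) - 1.
labels = [3, 1, 4, 2]                      # one query with K = 4
grades = [2 ** (l - 1) - 1 for l in labels]
print(ndcg(grades, n_max=10))
```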

We will treat the label index \(\ell^{(k)}_{i}\) as a random variable with a conditional discrete probability distribution

$$p^{*} \bigl(\ell\vert\mathbf{x}^{(k)}_i \bigr) = P \bigl(\ell^{(k)}_i = \ell| \mathbf{x} ^{(k)}_i \bigr) $$

over the label indices for document i of query k. The Bayes-scoring function

$$v^{*} \bigl(\mathbf{x}^{(k)}_i \bigr) = \mathbb{E}\bigl\{z \vert\mathbf{x}^{(k)}_i\bigr\}= \sum_{\ell=1}^K z_\ell p^{*} \bigl(\ell\vert\mathbf{x}^{(k)}_i\bigr) $$

is the conditional expectation of the relevance grade given the (query, document) pair \(\mathbf{x}^{(k)}_{i}\). Then the expected DCG for any permutation \(\mathbf{j}^{(k)}\) is

$$\textrm{EDCG} \bigl(\mathbf{j}^{(k)},\mathcal{D}^{(k)} \bigr) = \sum_{i=1}^{n^{(k)}} c_i {\mathbb{E} \bigl\{z \vert\mathbf{x}^{(k)}_{j_i}}\bigr\} = \sum _{i=1}^{n^{(k)}} c_i v^{*} \bigl( \mathbf{x}^{(k)}_{j_i} \bigr). $$

Let the Bayes optimal permutation \({\mathbf{j}^{(k)}}^{*} =({j_{1}^{(k)}}^{*},\ldots,{j_{n^{(k)}}^{(k)}}^{*})\) over \(\mathcal{D}^{(k)}\) be the one which maximizes the expected DCG; that is,

$${\mathbf{j}^{(k)}}^* = \mathop{\mathrm{arg\,max}}\limits_{\mathbf{j}^{(k)}} \textrm{EDCG} \bigl( \mathbf{j} ^{(k)},\mathcal{D} ^{(k)} \bigr). $$

According to Theorem 1 stated in Cossock and Zhang (2008), \({\mathbf{j}^{(k)}}^{*}\) has the property that if \(c_i > c_{i^{\prime}}\), then for the Bayes-scoring function we have \(v^{*}(\mathbf{x}_{{j_{i}^{(k)}}^{*}}) > v^{*}(\mathbf{x}_{{j_{i^{\prime}}^{(k)}}^{*}})\). This means that the optimal \({\mathbf{j}^{(k)}}^{*}\) can be easily obtained from the Bayes-optimal scoring function \(v^{*}\) by ordering the query-document pairs \(\mathbf{x}^{(k)}_{i}\) according to \(v^{*}(\mathbf{x}^{(k)}_{i})\). This result justifies the pointwise approach that estimates \(v^{*}\) in a regression setup, since having a regressor function that approximates \(v^{*}\) well, one can readily obtain the Bayes optimal permutation. In this paper, we will also use the pointwise approach in a discrete density estimation setup: our goal is to estimate \(p^{*}(\ell\vert\mathbf{x}^{(k)}_{j})\) by \(p^{\mathcal {A}}(\ell\vert \mathbf{x}^{(k)}_{j})\), where the label \(\mathcal{A}\) will refer to the method that generates the probability estimates. For the scoring function generated by \(p^{\mathcal{A} }\), we will use the notation

$$ v^{\mathcal{A}}\bigl(\mathbf{x}^{(k)}_i \bigr) = \sum_{\ell=1}^K z_\ell p^{\mathcal{A}} \bigl(\ell\vert\mathbf {x}^{(k)}_i \bigr). $$
(4)

In the pointwise approach, the scoring function v induces the permutation \(\mathbf{j}^{v}\) for which

$$ v \bigl(\mathbf{x}^{(k)}_{j_1^v} \bigr) \ge\cdots\ge v \bigl(\mathbf{x} ^{(k)}_{j_{n^{(k)}}^v} \bigr). $$
(5)

In Sect. 4 we will show that the excess EDCG with respect to the optimal EDCG can be upper bounded by the L q distance between \(p^{\mathcal{A}}(\ell\vert\mathbf{x}^{(k)}_{j})\) and \(p^{*}(\ell \vert \mathbf{x}^{(k)}_{j})\), motivating the density-estimation-based calibration.
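For illustration, a small sketch (the array layout and names are ours) of computing a scoring function of the form (4) from estimated class probabilities and the permutation (5) it induces:

```python
import numpy as np

K = 5
z = 2.0 ** np.arange(K) - 1.0          # grades (1) for labels 1..K: 0, 1, 3, 7, 15

def scoring_function(prob):
    """v(x_i) = sum_l z_l * p(l | x_i), see (4); prob has shape (n_docs, K)."""
    return prob @ z

def induced_permutation(scores):
    """Document indices sorted by decreasing score, see (5)."""
    return np.argsort(-scores)

prob = np.array([[0.7, 0.2, 0.1, 0.0, 0.0],
                 [0.1, 0.2, 0.3, 0.3, 0.1],
                 [0.0, 0.1, 0.2, 0.3, 0.4]])
scores = scoring_function(prob)
print(induced_permutation(scores))     # most relevant document first
```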

4 Bounds for the excess of EDCG

The main goal of calibrating a multi-class classifier is to get more accurate class conditional probability estimates. Even if a classifier has a good classification performance, its probability estimates can be very poor, as pointed out in Mease et al. (2007). To motivate our approach whose backbone is calibrated multi-class classification, we will show that if the class conditional probability distribution is estimated well in the sense that the Kullback-Leibler divergence between the original and the estimated distribution is small, then we can obtain a close-to-optimal expected DCG in our probabilistic setup.

The results are similar in spirit to the bounds derived in Cossock and Zhang (2008), where the excess of the DCG is bounded in terms of the L p error of a regressor. Our bounds motivate the multi-class classification setup and the class-probability-based calibration techniques (Sect. 5.2), whereas the results of Cossock and Zhang (2008) motivate the regression setup and the regression-based pointwise calibration approach (Sect. 5.4).

4.1 Excess of EDCG in terms of \(L_q\)

In (4), we described a way to obtain a scoring function v based on the estimate of the probability distribution of relevance grades. Based on the estimated scoring function v, ranking the set of query-document pairs \(\mathcal{D}\) is straightforward by using (5).

The following proposition gives an upper bound for the difference between the expected DCG values of the Bayes optimal scoring function and its estimate in terms of the quality of the relevance probability estimate.

Proposition 1

Let p,q∈[1,∞] and 1/p + 1/q=1. Then

$$\begin{aligned} &\textrm{EDCG}\bigl(\mathbf{j}^*,\mathcal{D} \bigr) - \textrm{EDCG}\bigl(\mathbf {j}^v,\mathcal{D}\bigr) \\&\quad{}\leq \Biggl( \sum_{i=1}^n \sum _{\ell=1}^K \bigl\vert (c_{\widetilde{j}^{v}_i} - c_{\widetilde{j}^{*}_i}) z_\ell \bigr\vert ^p \Biggr)^{\frac{1}{p}} \Biggl( \sum_{i=1}^n \sum_{\ell=1}^K \bigl| p(\ell\vert \mathbf{x}_i) - p^*(\ell\vert\mathbf{x}_i) \bigr|^q \Biggr)^{\frac{1}{q}}, \end{aligned}$$
(6)

where \(\widetilde{j}^{v}_{i}\) and \(\widetilde{j}^{*}_{i}\) are the inverse permutations of \(j^{v}_{i}\) and \(j^{*}_{i}\). The relation between \(p(\ell\vert\mathbf{x})\) and \(v(\mathbf{x})\) is defined in (4).

Proof

Following the lines of Theorem 2 stated in Cossock and Zhang (2008),

$$\begin{aligned} \textrm{EDCG}\bigl(\mathbf{j}^v,\mathcal{D}\bigr) = & \sum _{i=1}^n c_{i} v^{*}( \mathbf{x} _{j^v_i}) \\= & \sum_{i=1}^n c_{i} v( \mathbf{x}_{j^v_i}) + \sum_{i=1}^n c_i \bigl(v^{*}(\mathbf{x}_{j^v_i}) - v(\mathbf {x}_{j^v_i}) \bigr) \\\geq& \sum_{i=1}^n c_i v( \mathbf{x}_{j^{*}_i}) + \sum_{i=1}^n c_i \bigl(v^{*}(\mathbf{x}_{j^v_i}) - v(\mathbf {x}_{j^v_i}) \bigr) \\= & \sum_{i=1}^n c_i v^{*}(\mathbf{x}_{j^{*}_i}) + \sum_{i=1}^n c_i \bigl(v^{*}(\mathbf{x}_{j^v_i}) - v(\mathbf {x}_{j^v_i}) \bigr) + \sum_{i=1}^n c_i \bigl(v(\mathbf{x}_{j^*_i}) - v^{*}(\mathbf {x}_{j^*_i}) \bigr) \\= & \textrm{EDCG}\bigl(\mathbf{j}^{*},\mathcal{D}\bigr) + \sum _{i=1}^n c_i \bigl(v^{*}(\mathbf{x}_{j^v_i}) - v(\mathbf {x}_{j^v_i}) \bigr) + \sum_{i=1}^n c_i \bigl(v(\mathbf{x}_{j^*_i}) - v^{*}(\mathbf {x}_{j^*_i}) \bigr). \\ \end{aligned}$$
(7)

In (7), \(\sum_{i=1}^{n} c_{i} v(\mathbf{x}_{j^{v}_{i}}) \geq \sum_{i=1}^{n} c_{i} v(\mathbf{x}_{j^{*}_{i}})\), because \(\mathbf{j}^{v}\) is an optimal permutation for the scoring function v, and so any other ordering of the indices cannot increase the DCG value. Then, for the permutations \(j^{v}_{i}\) and \(j^{*}_{i}\) and for their respective inverse permutations \(\widetilde{j}^{v}_{i}\) and \(\widetilde{j}^{*}_{i}\), we have

$$\begin{aligned} &\textrm{EDCG}\bigl(\mathbf{j}^*, \mathcal{D}\bigr) - \textrm{EDCG}\bigl( \mathbf {j}^v,\mathcal {D}\bigr) \\&\quad{}\leq \sum_{i=1}^n (c_{\widetilde{j}^v_i} - c_{\widetilde{j}^{*}_i}) \bigl(v(\mathbf{x}_{i}) - v^*( \mathbf{x}_{i}) \bigr) \\&\quad{}=\sum_{i=1}^n \sum _{\ell=1}^K (c_{\widetilde{j}^v_i} - c_{\widetilde {j}^{*}_i}) z_\ell \bigl(p(\ell\vert\mathbf{x}_i ) - p^*(\ell\vert \mathbf {x}_i ) \bigr) \\&\quad{}\leq \sum_{i=1}^n \sum _{l=1}^K \bigl| \bigl( (c_{\widetilde{j}^v_i} - c_{\widetilde{j}^{*}_i}) z_l \bigr) \bigl( p( \ell| \mathbf{x}_i ) - p^* ( \ell| \mathbf{x}_i ) \bigr) \bigr| \\&\quad{}\leq \Biggl( \sum_{i=1}^n \sum _{l=1}^K \bigl| (c_{\widetilde{j}^v_i} - c_{\widetilde{j}^{*}_i}) z_l \bigr|^p \Biggr)^{\frac{1}{p}} \Biggl( \sum_{i=1}^n \sum _{l=1}^K \bigl| p( \ell| \mathbf{x}_i ) - p^* ( \ell| \mathbf{x}_i ) \bigr|^q \Biggr)^{\frac{1}{q}}, \end{aligned}$$
(8)

where (8) follows from Hölder’s inequality. □

Corollary 2

$$\mathrm{EDCG} \bigl(\mathbf{j}^*,\mathcal{D}\bigr) - \mathrm{EDCG} \bigl(\mathbf {j}^v,\mathcal{D}\bigr) \leq C_1 \cdot \Biggl( \sum_{i=1}^n \sum _{\ell=1}^K \bigl\vert p(\ell \vert \mathbf{x}_i ) - p^*(\ell \vert\mathbf {x}_i ) \bigr\vert ^q \Biggr)^{\frac{1}{q}}, $$

where

$$ C_1 = \max_{\mathbf{j},\mathbf{j}^\prime} \Biggl( \sum _{i=1}^n \sum_{\ell=1}^K \bigl\vert (c_{j_i} - c_{j^\prime_i}) z_\ell\bigr\vert ^p \Biggr)^{\frac{1}{p}}, $$
(9)

and the maximum is taken over arbitrary permutations \(\mathbf{j}\) and \(\mathbf{j}^{\prime}\) over \(1,\ldots,n\).

Corollary 2 shows that as the distance between the “exact” and the estimated conditional distributions over the relevance labels tends to 0, the difference in the expected DCG values also tends to 0.

4.2 Bounds for the excess of the expected DCG in terms of relative entropy

So far, we have shown that if the estimated class probabilities are close to the conditional discrete probability distribution over the label indices in terms of an \(L_q\) norm, then the Bayes-scoring function will be estimated well. In particular, we have shown that the \(L_q\) norm gives an upper bound on the difference of the EDCG values of the best ranking and the predicted ranking. We will now show that the relative entropy also gives an upper bound on the loss of the EDCG value for predicted rankings. That is, if the relative entropy of the estimated conditional distribution with respect to the true class conditional probabilities is small, then a close-to-optimal ranking is obtained. This finding motivates some of our particular calibration techniques that are related to entropy minimization (Sect. 5).

To simplify the notation, in this section we will denote the class probability vectors by \(\mathbf{p}_{i} = (p_{i,1}, \ldots, p_{i,K})\) and \(\mathbf{p}^{*}_{i} = (p^{*}_{i,1}, \ldots, p^{*}_{i,K})\), where \(p_{i,\ell} = p(\ell\vert\mathbf{x}_{i})\) and \(p^{*}_{i,\ell} = p^{*} (\ell\vert\mathbf{x}_{i})\).

Proposition 3

Assume that all elements of \(\mathbf{p}_{i}\) and \(\mathbf{p}^{*}_{i}\) are positive. For all \(0<\epsilon\le\frac{1}{2}\) there exists a δ>0 such that if \(\|\mathbf{p}_{i}-\mathbf{p}^{*}_{i}\|_{2}^{2} < \delta\) then

$$ \mathrm{EDCG}\bigl(\mathbf{j}^*,\mathcal{D}\bigr) - \mathrm{EDCG}\bigl(\mathbf {j}^v,\mathcal {D}\bigr) \leq C_2 \sum_{i=1}^n D_{\mathrm{KL}} \bigl( \mathbf{p}^{*}_{i} \| \mathbf{p}_{i} \bigr), $$
(10)

where \(D_{\mathrm{KL}}(\cdot\,\|\,\cdot)\) is the Kullback-Leibler divergence between multinomial distributions of one trial with parameters \(\mathbf{p}_{i} = (p(1 \vert\mathbf{x}_{i}), \ldots, p(K \vert\mathbf{x}_{i}))\) and \(\mathbf{p}^{*}_{i} = (p^{*}(1 \vert\mathbf{x}_{i}), \ldots, p^{*}(K \vert\mathbf {x}_{i}))\), and the constant \(C_2\) is

$$C_2 = C_1 \max_{1\le i \le n} \frac{1 + 2\epsilon}{\frac{1}{2} \min_{1\le \ell\le K } \{ 1/p_{i,\ell} \} }, $$

where \(C_1\) is defined in (9).

Proof

First note that the relevance labels come from a multinomial distribution with parameter \(\mathbf{p}^{*}_{i}\) in our setup. We know (e.g., Gruenwald 2007, p. 120) that

$$\begin{aligned} &D_{\mathrm{KL}} \bigl( \mathbf{p}^{*}_{i} \| \mathbf{p}_{i} \bigr) \\&\quad{}=\frac{1}{2} \bigl(\mathbf{p} _{i}- \mathbf{p}^{*}_{i}\bigr) \mathcal{I} \bigl(f( \zeta_{1},\ldots,\zeta_{K} ; 1, \mathbf {p}_{i}) \bigr) \bigl(\mathbf{p}_{i}-\mathbf{p}^{*}_{i} \bigr)^{\intercal}+ o\bigl(\bigl\|\mathbf {p}_{i}-\mathbf{p}^{*}_{i} \bigr\|_2^2\bigr), \end{aligned}$$
(11)

where \(\mathcal{I}(f(\zeta_{1},\ldots,\zeta_{K} ; 1, \mathbf{p}_{i}))\) is the Fisher Information Matrix (FIM) of the multinomial probability distribution \(f(\zeta_{1},\ldots,\zeta_{K} ;1,\mathbf{p}_{i})\) assuming one trial. In our special case, the FIM is diagonal with elements \(1/p_{i,\ell}\) in the diagonal; thus we have

$$\frac{1}{2} \bigl(\mathbf{p}_{i}-\mathbf{p}^{*}_{i} \bigr) \mathcal{I} \bigl(f(\zeta _{1},\ldots ,\zeta_{K} ; 1, \mathbf{p}_{i}) \bigr) \bigl(\mathbf{p}_{i}-\mathbf {p}^{*}_{i}\bigr)^{\intercal}\le\frac{1}{2} \max _\ell\{ 1/p_{i,\ell} \} \bigl\|\mathbf{p}_{i}- \mathbf{p}^{*}_{i}\bigr\|_2^2 $$

and

$$\frac{1}{2} \bigl(\mathbf{p}_{i}-\mathbf{p}^{*}_{i} \bigr) \mathcal{I} \bigl(f(\zeta _{1},\ldots ,\zeta_{K} ; 1, \mathbf{p}_{i}) \bigr) \bigl(\mathbf{p}_{i}-\mathbf {p}^{*}_{i}\bigr)^{\intercal}\ge\frac{1}{2} \min _\ell\{ 1/p_{i,\ell} \} \bigl\|\mathbf{p}_{i}- \mathbf{p}^{*}_{i}\bigr\|_2^2. $$

For all ϵ>0 there exists a δ>0, such that if \(\|\mathbf{p}_{i}-\mathbf{p}^{*}_{i}\|_{2}^{2} < \delta\), then \(\bigl|\frac{o(\|\mathbf{p}_{i}-\mathbf{p}^{*}_{i}\|_{2}^{2})}{\|\mathbf {p}_{i}-\mathbf{p}^{*}_{i}\|_{2}^{2}} \bigr| < \epsilon\cdot\frac{1}{2} \min_{\ell}\{ 1/p_{i,\ell} \}\). Now, using this constant and the lower bound of the FIM matrix, we can rewrite (11) as

$$(1-\epsilon ) \frac{1}{2} \min_\ell\{ 1/p_{i,\ell} \} \bigl\|\mathbf{p}_{i}-\mathbf{p}^{*}_{i} \bigr\|_2^2 \le D_{\mathrm{KL}} \bigl( \mathbf{p}^{*}_{i} \|\mathbf{p} _{i} \bigr). $$

Using the fact that 1/(1−ϵ)≤1+2ϵ for \(0 < \epsilon\le\frac {1}{2}\), we have

$$\frac{1}{2} \min_\ell\{ 1/p_{i,\ell} \} \bigl\|\mathbf{p}_{i}-\mathbf {p}^{*}_{i} \bigr\|_2^2 \le(1 + 2\epsilon) D_{\mathrm{KL}} \bigl( \mathbf{p}^{*}_{i} \| \mathbf{p}_{i} \bigr). $$

It then follows that

$$\bigl\|\mathbf{p}_{i}-\mathbf{p}^{*}_{i}\bigr\|_2^2 \le\frac{1 + 2\epsilon }{\frac{1}{2} \min_\ell\{ 1/p_{i,\ell} \} } D_{\mathrm{KL}} \bigl( \mathbf{p}^{*}_{i} \| \mathbf{p} _{i} \bigr). $$

We obtain the inequality of the proposition using Corollary 2 with q=2. □

The relevance labels are typically represented as integers and so they constitute an ordered set. It is arguably a promising approach to exploit this ordering when learning a ranking function. This is possible both in a classification setup (via the optimization of a customized loss that differentiates errors proportionally to the severity of the misclassification) and naturally in an ordinal regression setting (Chu and Keerthi 2005; Aiolli and Sperduti 2010).

Our theoretical results justify using a classification-based method which produces good posterior probability estimates in terms of the KL divergence or the \(L_p\) norm. At the same time, a similar result also exists for the regression-based setup, where the excess of DCG is bounded in terms of the \(L_p\) error of the regressor (Cossock and Zhang 2008), and efficient regression-based ranking algorithms (Wu et al. 2010) also exist.

Intuitively, the main difference between the methods based on classification and regression is that in the latter case, the loss function is strictly monotonic. AdaBoost.MH provides a convenient way to use a strictly monotone loss function by applying an appropriate initial weighting that encodes the loss function itself. We do not explore this promising research direction in this paper; however, the instance weighting used here (Appendix A) also emphasizes accuracy on highly relevant classes and therefore has a somewhat similar effect. Another possibility, originally used in the McRank algorithm, is to convert the K-class ordinal regression problem into K−1 binary classification problems with the goal of obtaining posterior probability estimates of \(p(\ell_i > \ell)\) for \(\ell=1,\ldots,K-1\). We found that our multi-class setup worked sufficiently well for obtaining state-of-the-art experimental results; nevertheless, trying this approach and working out the corresponding calibration methodology is definitely an interesting avenue to explore, especially if the goal is to further diversify the final ranking ensemble.

5 Calibration

We shall assume throughout the paper that a multi-class classification algorithm provides vector-valued multi-class discriminant functions of the form \(\mathbf{f}:\mathcal{X} \rightarrow{\mathbb{R}}^{K}\), where \(\mathcal{X}\) is the input space (in our case the space of query-document pairs represented by a real-valued vector) and K is the number of classes (relevance levels). Elements of these vector-valued discriminant functions will be denoted by \(\mathbf{f}(\mathbf{x}) = \big (f_{1}(\mathbf{x}), \ldots, f_{K}(\mathbf{x})\big)\). The goal of multi-class classification is to identify the correct class (or classes in the case of multi-label classification). For completeness, we will provide details on training multi-class AdaBoost.MH and multi-class SVM (see Appendix A and Appendix B, respectively).

In general, multi-class large-margin classification algorithms force discrimination by pulling the scores \(f_\ell(\mathbf{x})\) away from zero. This means that direct (linear) conversion into class probabilities usually does not produce good estimates (Mease et al. 2007). This phenomenon is particularly pronounced in the case of AdaBoost due to the exponential loss, which increases sharply with negative margins (Niculescu-Mizil and Caruana 2005). At the same time, the score vector usually represents the order of the probabilities rather well, so a simple nonlinear function can transform the scores into good probability estimates. The process of learning this nonlinear function from held-out data is called calibration (Platt 2000). In this section, we will describe several calibration techniques, some of them inspired by classical techniques tuned for squared error and cross-entropy (Niculescu-Mizil and Caruana 2005; Wu et al. 2004), and some of them motivated directly by the NDCG measure.

5.1 Obtaining posterior probabilities: the naive estimator

In classical multi-class classification the elements of f(x) are treated as posterior scores corresponding to the labels, and the predicted label is

$$\widehat{\ell}(\mathbf{x}) = \mathop{\mathrm{arg\,max}}\limits_{\ell=1,\ldots,K} f_\ell (\mathbf{x}), $$

where \(f_\ell(\mathbf{x})\) is the \(\ell\)th element of \(\mathbf{f}(\mathbf{x})\). The scores \(f_\ell(\mathbf{x})\) are usually not all positive and they do not sum to 1. Hence, if we need to estimate the posterior class probabilities \(p(\ell\vert\mathbf{x})\), we have to transform and normalize the scores. The naive calibration of a classifier f consists of a linear rescaling followed by a normalization. First, we make the scores positive by applying the transformation

$${\mathbf{f}^\prime}(\mathbf{x}) = 1 + \frac{\mathbf{f}(\mathbf{x})}{R}, $$

where

$$ R=\max_{\mathbf{x}\in\mathcal{X},1\le\ell\le K} \bigl\vert {f_\ell (\mathbf{x})} \bigr\vert. $$
(12)

Then we normalize the shifted scores to obtain

$$p^{\mathrm{naive}}(\ell\vert\mathbf{x}) = \frac{f^{\prime}_{\ell}(\mathbf{x})}{\sum_{\ell^{\prime}=1}^K f^{\prime}_{\ell^{\prime}}(\mathbf{x})}. $$

In the case of AdaBoost.MH, the classifier has the form

$$\mathbf{f}(\mathbf{x})=\sum_{t=1}^{T}\alpha^{(t)}\mathbf {v}^{(t)}\varphi^{(t)}(\mathbf{x}), $$

where both the elements of the vote vector \(\mathbf{v}^{(t)}\) and the scalar classifier \(\varphi^{(t)}(\mathbf{x})\) are ±1-valued, so R can be replaced by \(\sum_{t=1}^{T}\alpha^{(t)}\). In the case of multi-class SVMs there is no such “natural” upper bound, so we fall back on the explicit maximization (12).
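A minimal sketch of the naive estimator, assuming the raw scores are given as a matrix; here R is computed on the given matrix, which only approximates the maximization (12) over the whole input space, and all names are illustrative.

```python
import numpy as np

def naive_calibration(F):
    """F: (n_instances, K) matrix of raw scores f_l(x)."""
    R = np.abs(F).max()                # stands in for the maximization (12)
    F_shifted = 1.0 + F / R            # every entry now lies in [0, 2]
    return F_shifted / F_shifted.sum(axis=1, keepdims=True)

F = np.array([[ 2.3, -0.4, -1.1],
              [-0.2,  0.9, -2.0]])
P = naive_calibration(F)
print(P.sum(axis=1))                   # each row sums to 1
```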

5.2 Class-probability-based calibration (CPC) using a sigmoidal function

The common solution (Platt 2000) for transforming the scores into probability estimates is to apply a sigmoidal function

$$s_{\theta}(f) = \frac{1}{1+\exp(-a(f-b))}, $$

where the parameters θ=(a,b) are to be tuned. The probability estimates are then of the form

$$ p^{s_\theta}(\ell\vert\mathbf{x}) = \frac{s_{\theta}(f_{\ell }(\mathbf{x}))}{\sum_{\ell^{\prime}=1}^K s_{\theta}(f_{\ell^{\prime }}(\mathbf{x}))}. $$
(13)

The parameters of the sigmoid function can be tuned by minimizing a so-called target calibration function (TCF) \(L^{\mathcal{A}}(\theta ,\mathbf {f})\), where θ is the set of parameters to be tuned, f is the score vector, and the upper index \(\mathcal{A}\) refers to the type of the particular TCF. \(L^{\mathcal {A}}(\theta ,\mathbf{f} )\) is naturally a function of the validation data set as well (which is not necessarily the same as the training set), but here we will omit this dependence to simplify the notation.

Given a TCF \(L^{\mathcal{A}}\) and a multi-class classifier f, our goal is to find the optimal calibration parameters

$$\theta^{\mathcal{A},\mathbf{f}}= \mathop{\mathrm{arg\,min}}\limits_{\theta}L^{\mathcal {A}}(\theta,\mathbf{f}). $$

The output of this calibration step is a probability distribution \(p^{\mathcal{A},\mathbf{f}}(\ell\vert\mathbf{x})\) and a corresponding Bayes-scoring function \(v^{\mathcal{A},\mathbf{f}}(\mathbf{x})\) defined in (4). From now on, we will refer to this scheme as a class-probability-based calibration (CPC).
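A minimal sketch of CPC, interpreting the log-sigmoid TCF of Sect. 5.3 as the negative log-likelihood on the calibration set and fitting θ=(a,b) with a generic derivative-free optimizer; the optimizer and all names are illustrative choices, not the exact procedure used in the experiments.

```python
import numpy as np
from scipy.optimize import minimize

def cpc_probabilities(theta, F):
    """Apply the sigmoid elementwise and normalize across the K labels, see (13)."""
    a, b = theta
    s = 1.0 / (1.0 + np.exp(-a * (F - b)))        # (n_instances, K)
    return s / s.sum(axis=1, keepdims=True)

def log_sigmoid_tcf(theta, F, labels):
    """Negative log-likelihood of the observed label indices (0-based here)."""
    P = cpc_probabilities(theta, F)
    return -np.log(P[np.arange(len(labels)), labels] + 1e-12).sum()

def calibrate(F_calib, labels_calib):
    res = minimize(log_sigmoid_tcf, x0=np.array([1.0, 0.0]),
                   args=(F_calib, labels_calib), method="Nelder-Mead")
    return res.x                                   # fitted theta = (a, b)

F = np.random.randn(100, 5)                        # raw multi-class scores
labels = np.random.randint(0, 5, size=100)         # calibration-set labels
theta = calibrate(F, labels)
print(cpc_probabilities(theta, F[:3]))
```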

When several calibration functions \(\mathcal{A}\) are used on several scores f (generated by, for example, multi-class classifiers with different hyperparameters), the result is an ensemble of probability distributions \(p^{\mathcal{A},\mathbf{f}}(\ell\vert\mathbf{x})\) indexed by \(\mathcal{A}\) and f. To obtain a single combined score, we mix the ensemble using a linear combination

$$p^{\mathrm{mix}}(\ell\vert\mathbf{x}) = \frac{\sum_{\mathcal{A},\mathbf{f}} \pi(\mathcal{A},\mathbf{f})\, p^{\mathcal{A},\mathbf{f}}(\ell\vert\mathbf{x})}{\sum_{\mathcal{A},\mathbf{f}} \pi(\mathcal{A},\mathbf{f})}, $$

where \(\pi(\mathcal{A},\mathbf{f})\) is an appropriately chosen weight. Then we obtain a combined Bayes-scoring function by noticing that (4) is linear in the class probabilities, so

$$ v^{\mathrm{mix}}(\mathbf{x}) = \sum_{\ell=1}^K z_\ell\, p^{\mathrm{mix}}(\ell\vert\mathbf{x}) = \frac{\sum_{\mathcal{A},\mathbf{f}} \pi(\mathcal{A},\mathbf{f})\, v^{\mathcal{A},\mathbf{f}}(\mathbf{x})}{\sum_{\mathcal{A},\mathbf{f}} \pi(\mathcal{A},\mathbf{f})}. $$
(14)

The proper selection of π(⋅,⋅) can further increase the quality of the estimation. In Sect. 6, we will describe a simple setup borrowed from the theory of experts.

5.3 Target calibration functions

In the simplest case, the TCF can be the negative log-likelihood of the observed labels on the calibration set,

$$ L^{\textsc{ls}}(\theta,\mathbf{f}) = - \sum_{k=1}^M \sum_{i=1}^{n^{(k)}} \log p^{s_\theta} \bigl(\ell^{(k)}_{i} \vert \mathbf{x}^{(k)}_{i} \bigr). $$
(15)

We refer to this function as the log-sigmoid TCF. The motivation for using the log-sigmoid TCF is that the resulting probability distribution minimizes the relative entropy between the Bayes optimal probability distribution \(p^{*}\) and \(p^{\textsc{ls},\mathbf{f}}\). According to Proposition 3, a small relative entropy implies that the expected DCG score of the resulting ranking is close to the optimal expected DCG score.

In practice, distributions being less (or more) uniform over the labels might work better. This degree of freedom can be expressed by introducing the entropy weighted version of the log-sigmoid TCF

(16)

where

$$H_{M} (p_{1},\ldots,p_{K} )= - \sum _{\ell=1}^K p_{\ell} \log p_{\ell}, $$

and C is a hyperparameter.

We also use TCFs based on the expected loss

$$ L^{\mathrm{EL}}(\theta) = \sum_{k=1}^M \sum_{i=1}^{n^{(k)}} \sum _{\ell=1}^K \mathcal{L} \bigl(\ell,\ell^{(k)}_{i} \bigr) p^{s_\theta} \bigl(\ell \vert \mathbf{x}^{(k)}_{i} \bigr), $$
(17)

where \(\mathcal{L}(\ell, \ell^{(k)}_{i})\) represents the loss if is predicted instead of the correct label \(\ell^{(k)}_{i}\). We use the standard squared loss \(\mathcal{L}(\ell,\ell')=(\ell- \ell^{\prime})^{2}\).
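As an illustration, the following sketch evaluates the expected-loss TCF (17) with the squared loss for a pooled set of documents; the vectorized layout and the names are ours.

```python
import numpy as np

def expected_loss_tcf(probs, labels, K):
    """Sum over instances of sum_l (l - true_label)^2 * p(l | x), see (17)."""
    ells = np.arange(1, K + 1)                       # label indices 1..K
    loss_matrix = (ells[:, None] - ells[None, :]) ** 2
    # loss_matrix[:, labels - 1].T has shape (n_instances, K)
    return np.sum(probs * loss_matrix[:, labels - 1].T)

probs = np.array([[0.6, 0.3, 0.1],
                  [0.1, 0.2, 0.7]])                  # p^{s_theta}(l | x_i)
labels = np.array([1, 3])                            # true label indices (1-based)
print(expected_loss_tcf(probs, labels, K=3))         # 0.7 + 0.6 = 1.3
```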

If the label indices have some structure (that is, they are ordinal as in our case), it is also possible to first compute the expected label

$$\overline{\ell}^{(k)}_{i} = \sum_{\ell=1}^K \ell p^{s_\theta} \bigl(\ell\vert\mathbf{x}^{(k)}_{i} \bigr) $$

and then compute the expected label loss TCF

$$ L^{\textsc{ell}}(\theta) = \sum_{k=1}^M \sum_{i=1}^{n^{(k)}} \mathcal{L} \bigl(\overline{\ell}^{(k)}_{i}, \ell^{(k)}_{i} \bigr). $$
(18)

Note that \(\mathcal{L}(\cdot,\cdot)\) might need to be redefined for \(L^{\textsc{ell}}\), since the weighted average of the label indices is generally not a label index at all.

Finally, we can apply the idea of SmoothGrad (Chapelle and Wu 2010) to obtain a TCF. In SmoothGrad a smooth surrogate function is used to optimize the NDCG metric. In particular, using the normalized soft indicator (or similarity function)

$$h_{\theta,\sigma} \bigl(\mathbf{x}^{(k)}_i,\mathbf {x}^{(k)}_{i^\prime} \bigr) = \frac{\exp (-\frac{1}{\sigma} (v^{s_{\theta}} (\mathbf{x} ^{(k)}_{i} ) - v^{s_{\theta}} (\mathbf{x}^{(k)}_{i^\prime} ) )^{2} )}{ \sum^{n^{(k)}}_{i^{\prime\prime}=1} \exp (-\frac{1}{\sigma} (v^{s_{\theta}} (\mathbf{x} ^{(k)}_{i} ) - v^{s_{\theta}} (\mathbf{x}^{(k)}_{i^{\prime\prime}} ) )^{2} )}, $$

the soft NDCG TCF can be written as

(19)

where \(\mathbf{j}=(j_{1},\ldots,j_{n^{(k)}})\) is the permutation defined by the scoring function \(v^{s_{\theta}}(\cdot)\) in (5). The parameter σ controls the smoothness of \(L^{\textsc{sndcg}}_{\sigma}\); that is, the higher the value of σ, the smoother the function, but also the larger the difference between the NDCG value and the value of the surrogate function. If σ→0 then \(L^{\textsc{sndcg}}_{\sigma}\) tends to the NDCG value but, at the same time, optimizing the surrogate function becomes harder.

5.4 Regression-based pointwise calibration (RBC)

The CPC calibration can be naturally replaced by a regression technique in which the relevance grades are predicted explicitly instead of estimating the discrete conditional probability distribution \(p^{*}(\ell\vert\mathbf{x})\). We will denote a regression function by \(g_{\theta} : \mathbb{R}^{K} \rightarrow\mathbb {R}\), where \(\theta\in\mathbb{R}^{p}\) is the vector that parametrizes the regression function. A pointwise estimate of the relevance grade can be obtained using

$$ \widehat{z}(\mathbf{x}) = g_{\theta} \bigl(\mathbf{f}( \mathbf {x}) \bigr), $$
(20)

where \(\mathbf{f}(\mathbf{x}) \in\mathbb{R}^{K}\) is the score vector output by the multi-class classifier. Optimizing the NDCG score with respect to the regression parameters θ is computationally difficult. The common solution to this problem is to use a surrogate function. Finding good surrogate functions for hard-to-optimize IR metrics is an open research problem (Ravikumar et al. 2011; Chapelle and Wu 2010; Cossock and Zhang 2008). The simplest choice for the surrogate function is the square loss, for which the regression model can be fitted in a standard \(L_2\) setup by minimizing

$$L_{2}(\theta,\mathbf{f}) = \sum_{k=1}^M \sum_{i=1}^{n^{(k)}} \bigl(z_i^{(k)} - g_{\theta} \bigl(\mathbf{f}\bigl(\mathbf{x}^{(k)}_{i} \bigr) \bigr) \bigr)^{2}. $$

The problem with this choice is that the square loss is not NDCG consistent (Ravikumar et al. 2011). In spite of this, it turns out that if the relevance grades are rescaled querywise by the DCG scores, that is,

$$ L^{\prime}_{2}(\theta,\mathbf{f}) = \sum _{k=1}^M \sum_{i=1}^{n^{(k)}} \biggl(\frac{z_i^{(k)}}{ \textrm{DCG} ({\mathbf {j}^{(k)}}^*,\mathcal{D} ^{(k)} )} - g_{\theta} \bigl(\mathbf{f}\bigl( \mathbf{x}^{(k)}_{i}\bigr) \bigr) \biggr)^{2}, $$
(21)

then the objective function becomes NDCG-consistent. Since \(\textrm{DCG}\big({\mathbf{j}^{(k)}}^{*},\mathcal{D}^{(k)}\big)\) is unknown for a new query, it is impossible to predict the relevance grades \(z_{i}^{(k)}\) on an absolute scale. Nevertheless, the ordering of the documents within a query is not changed by this constant scaling, so the predictions \(g_{\theta}(\mathbf{f} (\mathbf{x}^{(k)}_{i}))\) can be used for ranking scores without knowing \(\textrm{DCG}({\mathbf{j}^{(k)}}^{*},\mathcal{D}^{(k)})\).

In our experiments we applied four different regression methods; namely, logistic regression, linear regression, neural network regression, and polynomial regression of degree between 2 and 4, inclusive.
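A minimal sketch of RBC with the query-wise rescaled targets of (21), using a plain linear least-squares regressor as g_theta; the per-query data layout and the solver are illustrative, and the discount convention follows (3).

```python
import numpy as np

def ideal_dcg(grades, n_max=10):
    top = np.sort(grades)[::-1][:n_max]
    return np.sum(top / np.log(1 + np.arange(1, len(top) + 1)))

def fit_rbc(scores_per_query, grades_per_query):
    X, y = [], []
    for F, z in zip(scores_per_query, grades_per_query):
        norm = ideal_dcg(z)
        if norm > 0:                       # skip queries with no relevant documents
            X.append(F)
            y.append(z / norm)             # query-wise rescaled targets, see (21)
    X, y = np.vstack(X), np.concatenate(y)
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta                           # ranking scores: F_new @ theta

# Two toy queries, score vectors of dimension K = 3.
F1, z1 = np.random.randn(4, 3), np.array([0.0, 1.0, 3.0, 0.0])
F2, z2 = np.random.randn(3, 3), np.array([7.0, 0.0, 1.0])
theta = fit_rbc([F1, F2], [z1, z2])
print(F1 @ theta)                          # scores used to rank query 1
```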

6 Ensemble of the calibrated models

Selecting the best hyperparameters for a multi-class learning algorithm and the best calibration function is normally done in a validation step. This procedure would “throw away” most of the diverse information represented by the different predictors. Instead of selecting the best, we use all the relevance predictions \(v^{\mathcal{A},\mathbf{f}}(\mathbf{x},S)\) obtained by different multi-class classifiers f using different TCFs \(\mathcal{A}\). Each relevance prediction can be used as a scoring function to rank the set of documents \(\mathbf{x}^{(k)}_{i}\) according to (5). Up until now, our method is an almost purely pointwise approach. To fine-tune the algorithm and to make use of the diversity of our models, we combine the scoring functions \(v^{\mathcal {A},\mathbf {f}}(\mathbf{x},S)\) using an exponentially weighted forecaster (Cesa-Bianchi and Lugosi 2006).

The weights of the models are tuned on the NDCG score, giving a slight listwise touch to the final step of our approach. Formally, the final scoring function is obtained by using the weights

$$\pi(\mathcal{A},\mathbf{f})=\exp\bigl(c\omega^{\mathcal{A},\mathbf{f}}\bigr), $$

where \(\omega^{\mathcal{A},\mathbf{f}}\) is the \(\textrm{NDCG}_{10}\) score of the ranking obtained by using \(v^{\mathcal{A},\mathbf{f}}(\mathbf{x})\). Plugging it into (14), the combined scoring function is

$$ v^{\mathrm{ensemble}}(\mathbf{x}) = \frac{\sum_{\mathcal{A},\mathbf{f}} \exp \bigl(c\,\omega^{\mathcal{A},\mathbf{f}} \bigr)\, v^{\mathcal{A},\mathbf{f}}(\mathbf{x})}{\sum_{\mathcal{A},\mathbf{f}} \exp \bigl(c\,\omega^{\mathcal{A},\mathbf{f}} \bigr)}. $$
(22)

The parameter c is also tuned on the held-out validation set. It controls the dependence of the weights on the \(\textrm{NDCG}_{10}\) values. A large c means that we focus only on the good models, whereas a c close to zero represents a near-uniform weighting.
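A minimal sketch of the exponentially weighted combination, assuming each model's scores and its validation \(\textrm{NDCG}_{10}\) are available; normalizing the weights is a convention of this sketch and does not change the induced ranking.

```python
import numpy as np

def ensemble_scores(model_scores, validation_ndcg10, c):
    """model_scores: (n_models, n_docs); validation_ndcg10: (n_models,)."""
    w = np.exp(c * np.asarray(validation_ndcg10))
    w = w / w.sum()                        # positive rescaling; ranking unchanged
    return w @ np.asarray(model_scores)

scores = [[0.2, 0.9, 0.4],                 # model 1 scores for three documents
          [0.3, 0.7, 0.6]]                 # model 2
ndcg10 = [0.78, 0.74]                      # validation NDCG_10 of the two models
print(ensemble_scores(scores, ndcg10, c=100.0))   # large c concentrates on model 1
```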

Our rationale for using this particular mixing scheme is as follows. First, it is simple and computationally efficient to tune, which is important when we have a large number of models. Our basic setup is to train the models in parallel using computationally cheap pointwise objectives, and combine these fixed models linearly to improve a computationally expensive listwise objective. Complex dynamic weighting schemes where model combination and model training are intertwined (such as boosting) are not suitable in this setup. Among linear schemes, we were looking for a technique that subsumes classical winner-takes-all validation (choosing the best model) and simple equal-weight voting, and which could be tuned with a single hyperparameter between these two extremes. Exponential weighting is arguably the simplest of such schemes. Simplicity is also crucial to prevent model-mixing from overfitting.

Our choice was also inspired by theoretical guarantees over the cumulative regret of a mixture of experts on individual (model-less) sequences (Cesa-Bianchi and Lugosi 2006), without, of course, claiming that these theoretical results apply directly to our setup. That said, we do not assert that it is the best possible model-mixing scheme. It may be that more sophisticated techniques would allow us to further improve the results. Our point is that model-mixing is important, and even a simple method can be significantly better than the classical winner-takes-all validation. This result seems to agree with the general consensus drawn from the experiences on recent large-scale learning challenges (Bennett and Lanning 2007; Dror et al. 2009; Chapelle et al. 2011).

6.1 Guidelines for building the models

According to the results of our experiments, there does not seem to be any statistical cost to including as many models as possible; that is, we know of no instance where deleting models before mixing improved the results, mainly because bad models were discarded anyway by the weighting scheme. At the same time, increasing the number of models without limit does not make much sense, but the problem is computational rather than statistical.

Within our computational limits, our objective was to have the largest possible number of diverse models. Most of the computational time was spent on training the boosting models, so the first step was to “cover” the hyperparameter space (number of tree leaves, number of product terms, number of iterations, the regularization coefficient in SVM) quasi-uniformly in a region derived from our previous experiences and from some preliminary experiments. Compared to training the models, calibration took almost no time, which explains why we added as many and as diverse calibration techniques as possible. In Busa-Fekete et al. (2011a) we conducted an empirical analysis to discover what the main source of diversity was, and we tried some other techniques to increase diversity that we did not apply in this paper in order to keep the method simple. All we could conclude was that any reasonable “perturbation” of the problem or of the models helps to improve the overall result.

7 Experiments

In this section, we will present our experimental results. In Sect. 7.1, we will briefly discuss some of the issues related to the available evaluation tools. Section 7.2 describes the benchmark data sets used in the experiments. In Sect. 7.3, we summarize the state-of-the-art techniques we compare with our approach and with each other. To assure full reproducibility of our experiments, we also provide details of the experimental setup here. Section 7.4 contains the results of the comparative experiments. We also discuss the general conclusions that can be drawn from the results of our experiments. In Sect. 7.5, we investigate how the performance of the algorithm depends on the size of the data and on the quality of the training relevance grades. In Sect. 7.6, we assess the effect of query-wise normalization of relevance grades proposed by Ravikumar et al. (2011). In Sect. 7.7, we examine the diversity of the models that we mix in the ensemble step in a qualitative manner. Lastly, we will summarize and discuss the official results in the Yahoo! Learning-to-Rank Challenge (Chapelle et al. 2011) in Sect. 7.8.

7.1 Evaluation tools

Here, we will briefly describe and compare the various tools available for computing NDCG scores. The definition (2) is unambiguous; nevertheless, the tools can differ in the definition of the discount factor \(c_i\) (3). More importantly, there may be important differences in the way the DCG score is normalized either when there are no relevant documents for a query (\(z_i=0\) for all i), or when the number of documents is less than the truncation level \(n_{\max}\). Even though this seems to be a technical subtlety, it turns out that the confusion arising from using the different tools can significantly alter the numerical scores and in some cases may even change the relative ordering of the algorithms on the data sets.

We compared six evaluation tools to compute the NDCG scores:

  1. The LETOR 3.0 script implemented in Perl

  2. The LETOR 4.0 script implemented in Perl

  3. The MS script implemented in Perl

  4. The Yahoo script implemented in Python

  5. The RankLib package implemented in Java

  6. The TREC evaluation tool v8.1 implemented in C

The evaluation tools can be divided into three groups. The tools of the first group compute \(\textrm{DCG}_{n_{\mathrm{max}}}\) according to the definition (2) described in Sect. 3. The LETOR 3.0, RankLib, and TREC tools belong to this group. All of these tools assign a zero score to a query if it is empty, that is, if \(z_i=0\) for all i, which means that there are no relevant documents. The TREC tool makes use of the labels of documents given in the input file as relevance grades by default. From this point of view, this is the most flexible implementation, since arbitrary relevance grades can be defined. For example, in the case of the MQ2008 dataset, the labels 0, 1, and 2 should simply be replaced by 0, 1, and 3, respectively, to have the commonly used exponential grades as given in (1).

The second group is comprised of the Yahoo tool alone. It also computes the \(\textrm{DCG}_{n_{\mathrm{max}}}\) according to the definition (2), but it assigns 1.0 to empty queries. This is a minor difference that generates an additive bias between the \(\textrm{NDCG}_{n_{\mathrm{max}}}\) computed by the Yahoo tool and by the three tools of the first group.

The third group consists of the LETOR 4.0 and MS tools. Except for a small technical difference (the LETOR 4.0 tool can be applied for up to three relevance labels, whereas the MS tool can handle up to five relevance labels), they compute the same score. Like the RankLib and LETOR 3.0 tools, they assign a zero to a query where the ideal \(\textrm{DCG}_{n_{\mathrm{max}}}\) is zero. Their rather strange feature is that they also assign a zero \(\textrm{DCG}_{n_{\mathrm{max}}}\) score to a query with fewer than \(n_{\max}\) documents in it, even if these documents are highly relevant. So, formally, they compute the \(\textrm{DCG}_{n_{\mathrm{max}}}\) score as

$$ \textrm{DCG}_{n_{\mathrm{max}}}\big(\mathbf{j}^{(k)},\mathcal {D}^{(k)}\big) = \begin{cases} \sum_{i=1}^{n_{\mathrm{max}}} c_i z_{j_i}^{(k)} & \hbox{if $n_{\mathrm{max}}\le n^{(k)}$} \\0 & \hbox{otherwise}. \end{cases} $$
(23)

This truncation not only distorts the test score, but it can also alter the training of algorithms that depend directly on the NDCG score. Indeed, for example, in AdaRank (Xu and Li 2007), which optimizes the \(\textrm{NDCG}_{10}\) evaluation metric, a query containing fewer than 10 documents does not influence the computation of the coefficient of the weak ranker at all, and the weights of such queries converge to zero over the successive boosting iterations.

To illustrate the effect of these differences, we compared the \(\textrm{NDCG}_{n_{\mathrm{max}}}\) scores obtained by the Yahoo tool (Figs. 2(a) and 2(c)) with the Letor 4.0 tool (Figs. 2(b) and 2(d)) on the Letor 4.0 data sets (MQ2007 and MQ2008) using five state-of-the-art rankers. First, note the striking absolute differences, especially for larger \(n_{\max}\), for which the effect of the truncation (23) is bigger. Worse, on the MQ2007 data set even the order of the methods is altered: the RankNet method is put at a serious disadvantage by the incorrectly implemented evaluation tool. This latter finding was the main reason why we included this technical section in the paper.

Fig. 2 The empirical \(\textrm{NDCG}_{n_{\mathrm{max}}}\) scores on the Letor 4.0 test sets. The left panels ((a) and (c)) depict the \(\textrm{NDCG}_{n_{\mathrm{max}}}\) scores computed by the Yahoo tool, and the right panels ((b) and (d)) show the results when we use the official LETOR 4.0 tool. Apart from the strikingly large absolute differences, the incorrectly implemented LETOR 4.0 tool also alters the order of the methods.

From now on, all reported NDCG scores will be computed using the Yahoo tool (with a score of 1.0 for “empty” queries). Note that we also modified the code of RankLib to follow this convention. Although our main evaluation metric is the \(\textrm{NDCG}_{10}\) score, we will also report the Expected Reciprocal Rank (ERR) scores (Chapelle et al. 2009)

$$ \textrm{ERR} \bigl(\mathbf{j}^{(k)},\mathcal{D}^{(k)} \bigr) = \sum_{i=1}^{n^{(k)}} \frac{1}{i} \frac{z_{j_i}^{(k)}}{Z} \prod_{i^{\prime}=1}^{i-1} \biggl( 1 - \frac{z_{j_{i^{\prime }}}^{(k)}}{Z} \biggr), $$
(24)

where Z is the maximal relevance grade (that is, \(Z=2^{K-1}-1\) in our case).
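A minimal sketch of (24); the grades are assumed to be listed in ranked order and Z is the maximal grade.

```python
def err(grades, Z):
    """ERR (24): `grades` in ranked order, Z the maximal relevance grade."""
    score, p_not_stopped = 0.0, 1.0
    for i, z in enumerate(grades, start=1):
        r = z / Z                          # probability that the user stops at rank i
        score += p_not_stopped * r / i
        p_not_stopped *= 1.0 - r
    return score

print(err([7, 0, 3, 1], Z=7))              # K = 4 with the grades (1)
```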

In summary, we propose the following guidelines for future research studies, with the aim of making numerical results comparable across studies.

  1. If possible, apply the Yahoo tool.

  2. Always specify which tool is used to compute the numerical results.

  3. If a new tool for computing the \(\textrm{NDCG}_{n_{\mathrm{max}}}\) is implemented, always specify the default value for empty queries, and avoid the bug (23) when \(n_{\max}>n^{(k)}\).

7.2 Data sets

We evaluated the ranking methods on the commonly used benchmark data sets summarized in Table 1. We were only interested in data sets with more than two relevance levels, firstly because calibration for binary relevance labels does not make much sense, and secondly because the difference between various learning algorithms can be more pronounced in the multi-label case. Note also that the general consensus in the IR community is that graded relevance labels are superior to the binary setup when large document collections are involved (Järvelin and Kekäläinen 2002; Kekäläinen and Järvelin 2002; Sakai 2007).

Table 1 The benchmark data sets used in the experiments

The features were normalized querywise in the LETOR data sets (OHSUMED, MQ2007, MQ2008), so we did not preprocess them. In the case of the Yahoo and MS data sets, we augmented the feature sets. Besides the original features, we added querywise standardized features, which means that we rescaled the feature values for each query separately so as to have zero mean and a standard deviation of one. The idea behind querywise normalization is that features used in learning-to-rank tasks often represent some kind of count. Some of these quantities are not comparable in an absolute way: for example, the number of times a query term occurs in a document is not comparable between common query terms (e.g., “dog”) and rare query terms (e.g., “AdaBoost”).
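A minimal sketch of the query-wise standardization used to augment the feature sets; the handling of features that are constant within a query is our own choice.

```python
import numpy as np

def augment_querywise(X, query_ids):
    """Append query-wise standardized copies of the features to X."""
    X_std = np.empty_like(X, dtype=float)
    for q in np.unique(query_ids):
        rows = query_ids == q
        mu, sigma = X[rows].mean(axis=0), X[rows].std(axis=0)
        sigma[sigma == 0] = 1.0            # features constant within a query
        X_std[rows] = (X[rows] - mu) / sigma
    return np.hstack([X, X_std])

X = np.array([[10.0, 1.0], [20.0, 1.0], [30.0, 3.0]])
print(augment_querywise(X, np.array([0, 0, 0])).shape)   # (3, 4)
```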

We used the official train/valid/test cut for each data set. We divided the official training sets by a random 80–20 % split into training and calibration sets. The latter was used to adjust the parameters of the different calibration methods and to tune the hyperparameter c of the exponential weighting scheme.

7.3 Methods and experimental setup

We compared our algorithms with five state-of-the-art ranking methods and with a ranker that uses the single best feature (described below). Here, we will briefly summarize them.

  1.

    BestFeature: As a baseline, we will report the performance of the ranker based on the single best feature. Each feature can be used as a ranker function since values of a given feature determine a ranking on an individual query (for example, in the early years of IR, the BM25 feature alone was used as a ranking score). In the training phase, the ranker chooses the feature that achieves the highest performance in terms of the evaluation metric of interest on the training data. For a test query, the ranker then simply returns the values of this single feature as a score. Since the feature values are given in our experiments, there is no hyperparameter to be validated for this method. We will refer to this simple approach as BF.{ERR,NDCG}, according to the evaluation metric used.

  2.

    AdaRank (Xu and Li 2007) is a listwise boosting approach that seeks to optimize an arbitrary listwise IR metric, such as the Mean Average Precision (MAP), ERR, or NDCG. Inspired by AdaBoost, it uses a stepwise greedy optimization technique to maximize the chosen IR metric. In every boosting iteration, AdaRank re-weights the queries based on their scores obtained by the evaluation metric: it up-weights queries with lower scores and down-weights high-scoring queries. The weak learner is chosen by optimizing the listwise evaluation metric of interest, which is usually hard to optimize except for very simple weak classifiers. This may be viewed as a handicap of this method. Following the original implementation of AdaRank, we used the best-feature ranker (BF) described above as the base ranker, taking into account the weighting of the queries. The only hyperparameter of AdaRank is the number of boosting iterations, which we optimized by using early stopping on the validation set. We refer to this method as AdaRank.{MAP,ERR,NDCG}, depending on which evaluation metric is used.

  3.

    RankNet (Burges et al. 2005) is a neural-net-based method which employs a loss based on pairwise cross entropy as its objective function. The neural net with one output node is trained to optimize directly the differentiable probabilistic pairwise loss instead of the common squared loss. We validated the number of hidden layers ranging from 1 to 3 and the number of neurons in the hidden layers ranging from 10 to 500. For the number of training epochs we applied early stopping.

  4.

    RankBoost (Freund et al. 2003) is a pairwise boosting approach. The objective function is the rank loss (as opposed to AdaBoost, which optimizes the exponential loss). In each boosting iteration the weak classifier is chosen by maximizing the weighted rank loss. For the weak learner, we used decision stumps and a variant of the single decision stump described in Freund et al. (2003), which is able to optimize the rank loss in an efficient way.

  5.

    RankSVM (Joachims 2006) is a pairwise method based on SVM, which formulates the ranking task as a binary classification problem. We used a linear kernel because the optimization using non-linear kernels cannot be carried out in a reasonable time. The tolerance level of the optimization was set to 0.001 and the regularization parameter was validated in the interval \([10^{-6},10^{4}]\) with a logarithmically increasing step size.

  6.

    CoordinateAscent (CA) (Metzler and Croft 2007) is a linear listwise model, where the scores of the query-document pairs are calculated as weighted combinations of the feature values. The weights are tuned by a coordinate ascent optimization method, where the objective function is an arbitrary evaluation metric given by the user. The coordinate ascent optimization method itself has two hyperparameters to be tuned: the number of restarts R from random initial weights, and the number of iterations T taken after each restart. We used R=30 and T=100. We did not validate these hyperparameters, but using the validation set we evaluated every model obtained from the restarts of the optimization process and kept the one with the highest performance score.

In our approach, we applied two multi-class learning methods whose outputs were then used in a calibration procedure: AdaBoost.MH (Schapire and Singer 1999) and the multi-class Support Vector Machine (MC-SVM) (Crammer and Singer 2001). To train AdaBoost.MH, we used our open source implementation (Benbouzid et al. 2012b) available at http://multiboost.org. We employed decision trees with 8 and 64 leaves and decision products with 3 and 10 terms. We calibrated and mixed all the trained boosted models. The training was performed on the EGI grid, which allowed us to train the models in parallel, and thus it took less than one day to obtain all the strong classifiers. Further details on training AdaBoost.MH can be found in Appendix A.

To train the MC-SVM (Crammer and Singer 2001), we used a linear kernel as the training time for non-linear kernel functions was prohibitively long. The tolerance of the optimization was set to 0.001. We trained MC-SVM using different trade-off parameters ranging from \(10^{-6}\) to 10. Further details on training MC-SVM can be found in Appendix B.

We only tuned the number of iterations T for AdaBoost.MH, the trade-off parameter C for SVM, and the base parameter c in the exponential weighting scheme (22) on the validation set. In the exponential weighting combination (22) we set the weights using the \(\textrm{NDCG}_{10}\) performance scores of the calibrated models, and c and T were selected based on the performance score of the combined scoring function \(v^{\mathrm{ensemble}}(\cdot)\) in terms of \(\textrm{NDCG}_{10}\). The hyperparameter optimization was performed using a simple grid search, where c ranged from 0 (corresponding to a uniform weighting) to 200 and T from 10 to 10000. On the larger Yahoo and MS data sets, the optimal numbers of boosting iterations were between 8000 and 10000 and about 2000, respectively. Interestingly, on the LETOR data sets the optimal T was much lower: for LETOR 3.0 the best number of iterations is T=100 and for both LETOR 4.0 data sets it is T=50. The best base parameter c is larger than 100 for all data sets. This value is relatively high considering that it is used in the exponent, but the performances of the best models were relatively close to each other, so the weight distribution of these good models was not far from uniform. We used fixed parameters C=2 in the TCF \(L_{C}^{\textsc{ewls}}\) (16), and σ=0.01 in \(L_{\sigma}^{\textsc{sndcg}}\) (19).

To demonstrate the efficiency of our ensemble scheme described in Sect. 6, we will also present the test performance scores of the single best calibrated multi-class classifier. In this case, the scores for a single trained and calibrated model can be obtained using (4). The best ranker is then chosen based on its performance score on the validation set. We will refer to this ranker based on a single calibrated multi-class classifier as {AB,SVM}+OneBest, depending on which multi-class training method was used.

7.4 Comparative results

Tables 2 and 3 list the NDCG and ERR scores, respectively, for the different methods. The results reveal some clear general trends. First, the exponentially weighted ensembles of calibrated AdaBoost.MH classifiers (AB+EXP) and of calibrated multi-class SVM classifiers (SVM+EXP) outperform all of the baseline methods in terms of both evaluation metrics of interest, with the exception of the MQ2007 data set. On this single data set the winner is RankSVM, which achieves an excellent NDCG 10 score (its ERR score, however, is lower than that of our methods).

Table 2 The NDCG 10 values of the methods compared on the benchmark data sets. The best scores for each data set and those within one sigma of them are marked in bold. Here, we used the standard deviation of the average NDCG 10. Pointwise, pairwise, and listwise approaches are marked with red (♣), green (♠), and blue (♢) backgrounds, respectively
Table 3 The ERR of the methods compared on benchmark data sets. The best scores for each data set and those within one sigma are marked in bold. We used the standard deviation of the average ERR. Pointwise, pairwise, and listwise approaches have red (♣), green (♠), and blue (♢) backgrounds, respectively

Second, it is interesting that the single best calibrated AdaBoost.MH model outperforms RankBoost and, similarly, that the single calibrated MC-SVM outperforms RankSVM in many cases. RankBoost can be thought of as the ranking counterpart of AdaBoost.MH: RankBoost optimizes the rank loss in a stepwise fashion, similarly to AdaBoost, which optimizes the exponential loss. RankSVM can be viewed as the ranking counterpart of the calibrated MC-SVM, since the core of both algorithms is a quadratic optimization, albeit with different loss functions. In other words, using an appropriate calibration function, classical multi-class classifiers can achieve state-of-the-art ranking performance. Note also that in RankBoost it is not easy to design base learners that can optimize the weighted rank loss, so we could only use an updated version of the decision stump introduced in the original RankBoost paper (Freund et al. 2003). In contrast, in AdaBoost.MH we were able to use a larger variety of standard base learners.

We also report the scores achieved when using only the class-probability-calibrated models (denoted by EXP.CPC) or only the regression-based-calibrated models (denoted by EXP.RBC) in the exponentially weighted ensemble, for both multi-class classifiers applied here. Both calibrations achieve similar scores, except for the CPC multi-class SVM, which gives below-par performance scores on MQ2007 and OHSUMED. The reason for this is that these data sets contain many queries where all of the feature values are zero; in this case, an SVM-based method assigns zero scores (unlike the AdaBoost.MH-based methods). SVM-based techniques also perform quite poorly on the MS data sets because the regularization penalty C cannot be set to its optimal value due to the excessive running time with large C.

The final ensemble step described in Sect. 6 has a significant impact when applied to the classifiers produced by AdaBoost.MH. In the case of MC-SVM the improvement is marginal, but the exponentially weighted ensemble of linear models is computationally efficient both to train and to evaluate, so there is no reason not to use the full ensemble.

The listwise algorithms (AdaRank, RankNet and CA) perform well on the LETOR data sets (OHSUMED, MQ2007, MQ2008).Footnote 14 However, pointwise and pairwise approaches significantly outperform the computationally expensive listwise algorithms on the Yahoo and MS data sets and slightly outperform them on the LETOR data sets. The only exception is the CA algorithm, which achieves state-of-the-art performance even on the large-scale learning-to-rank benchmark data sets.

The methods using SVM engines are extremely slow on large-scale data sets. For example, running RankSVM and MC-SVM takes more than three weeks on a single fold of the MS data sets if the regularization coefficient C is larger than 10^{-4}. Running them with smaller coefficients, though, produces suboptimal results, which explains why we could not obtain state-of-the-art performance scores with these methods on the MS data sets.

7.5 Learning curves

In this set of experiments we investigated how the performance scores of the methods depend on the size of the training set and on the quality of the queries. In the first experiment, we randomly divided the Yahoo 2 training set into ten equal parts and trained the rankers on 10 %, 20 %, …, 100 % of the available data. The Yahoo 2 data set was well suited to this experiment because, on the one hand, its test set consists of almost 3800 queries, so small differences in NDCG scores could still be significant, and on the other hand, its training set of about 1300 queries is not prohibitively large, so all of the methods could be trained in a reasonable time.

Figure 3(a) shows the results. Although it is hard to draw general conclusions from a single experiment, it seems, first, that the listwise methods (AdaRank, RankNet and CA) are more competitive on small data sets and that their learning curves “flatten” as more data is added, whereas pointwise and pairwise methods improve steadily as the data size grows. Second, among our two classification-based techniques, the final ensemble step (Sect. 6) helps AdaBoost.MH over the full range of data sizes, whereas it helps MC-SVM only for smaller sets. Lastly, among the two pairwise techniques, RankBoost seems to work better on small data, whereas the NDCG score of RankSVM grows faster, which is somewhat unfortunate since we know that RankSVM will be hard to optimize for larger data sets.

Fig. 3 The dependence of NDCG 10 scores on training data size. The queries were added gradually to the Yahoo 2 training data in 10 % portions. In the top panel the queries were ordered randomly, whereas in the left and right panels they were added in decreasing and increasing order, respectively, according to their NDCG 10 scores obtained by using the BestFeature ranker

In a variant of this experiment we investigated how noise affects the performance of the methods. First, using the BestFeature ranker, we computed the NDCG score for each query. A low score on a query indicates that the features cannot capture the relevant, nontrivial semantics between the query and its documents; this is one of the main sources of noise in a learning-to-rank task (besides the label noise coming from the disagreement of the annotators). We then trained the rankers on 10 %, 20 %, …, 100 % of the available data as in the previous experiment, but this time we added the queries in decreasing or increasing order of their NDCG scores. Figures 3(b) and 3(c) show the results. As expected, the learning curves are significantly flatter when the bad queries are kept until the end. Besides the same general trends observed in the previous experiment, it is interesting to see that the listwise methods seem to work well only when trained on good queries, whereas pairwise and pointwise techniques improve even when bad queries are added to the training set. It is also noticeable that the best calibrated AdaBoost.MH model performs very well on the best 10 % to 30 % of the queries, but the ensemble step helps significantly once noisy queries start to accumulate.
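As an illustration of how the queries can be ordered, the sketch below scores each query's documents with a single feature (in the style of the BestFeature ranker), computes the per-query NDCG 10 with scikit-learn's ndcg_score, and sorts the query ids accordingly; the dictionary-based data layout is an assumption made for the example.

```python
import numpy as np
from sklearn.metrics import ndcg_score

def order_queries_by_difficulty(queries, feature_idx, k=10, decreasing=True):
    """Rank each query's documents by one feature (a BestFeature-style ranker),
    compute the per-query NDCG@k, and return the query ids sorted by it.
    `queries` is assumed to map query id -> (X, y) with the document feature
    matrix X and the relevance labels y."""
    per_query = {}
    for qid, (X, y) in queries.items():
        scores = X[:, feature_idx]                       # single-feature ranker
        per_query[qid] = ndcg_score([y], [scores], k=k)  # per-query NDCG@k
    return sorted(per_query, key=per_query.get, reverse=decreasing), per_query

# Hypothetical usage with two toy queries of eight documents each.
rng = np.random.default_rng(0)
toy = {q: (rng.normal(size=(8, 5)), rng.integers(0, 5, 8)) for q in ("q1", "q2")}
ordered_qids, ndcg_per_query = order_queries_by_difficulty(toy, feature_idx=0)
```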

7.6 Normalizing the relevance grades

We carried out experiments to assess the effect of the query-wise normalization of relevance grades proposed in Ravikumar et al. (2011). In the regression-based calibration setup, we calibrated the trained multi-class classifiers in two ways: (1) using the relevance grades as target values according to (20), and (2) using the relevance grades normalized query-wise by the ideal DCG 10 score of each query according to (21).

We applied AdaBoost.MH to obtain multi-class classifiers using the same set of base classifiers as described in Sect. 7.3, and we ran it for 10000 iterations on the Yahoo 1 and Yahoo 2 data sets. We then calibrated the classifiers using linear regression, logit regression, neural network regression, and polynomial regression of degree 2. The NDCG 10 scores in Fig. 4 indicate that the normalization did indeed improve the individual rankers.

Fig. 4 The scatterplot of NDCG 10 scores computed by using the relevance grades normalized by the ideal DCG 10 query-wise according to (21) (vertical axis) versus NDCG 10 scores computed by just using the original relevance grades (r = 2^{y−1} − 1) in the RBC calibration according to (20) (horizontal axis). The rectangles show the scores of the calibrated models

The intuitive rationale behind relevance normalization is that dividing the relevance grades by the ideal DCG score balances the contribution of the individual queries to the loss being optimized (in this case the \(L_2\) loss). In particular, the normalization downweights queries with high ideal DCG scores. These queries can be thought of as “easy” ones with a large number of relevant documents, so downweighting them means that the calibration focuses on the harder queries, which may further improve the generalization performance.
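The sketch below illustrates this step: the grades are mapped to gains (assuming the coding r = 2^{y−1} − 1 used in the caption of Fig. 4), divided query-wise by the ideal DCG 10, and then used as regression targets for a simple linear RBC calibrator. The label coding, the data layout, and the choice of linear regression are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def ideal_dcg(relevances, k=10):
    """Ideal DCG@k with the assumed gain r = 2**(y - 1) - 1."""
    gains = np.sort(2.0 ** (np.asarray(relevances) - 1) - 1)[::-1][:k]
    discounts = 1.0 / np.log2(np.arange(2, gains.size + 2))
    return float(gains @ discounts)

def normalized_targets(y, qid, k=10):
    """Divide each document's gain by the ideal DCG@k of its query, cf. (21)."""
    y, qid = np.asarray(y), np.asarray(qid)
    r = 2.0 ** (y - 1) - 1
    t = np.empty_like(r)
    for q in np.unique(qid):
        mask = qid == q
        t[mask] = r[mask] / max(ideal_dcg(y[mask], k), 1e-12)
    return t

# Hypothetical RBC calibration: regress the normalized grades against the
# multi-class output score vectors V (one row per document).
rng = np.random.default_rng(0)
V = rng.normal(size=(100, 5))          # stand-in for classifier score vectors
y = rng.integers(1, 6, 100)            # relevance labels assumed coded 1..5
qid = rng.integers(0, 10, 100)         # query id of each document
calibrator = LinearRegression().fit(V, normalized_targets(y, qid))
doc_scores = calibrator.predict(V)     # real-valued ranking scores
```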

7.7 The diversity of class-probability-based calibration outputs

To investigate how diverse the score values of the different class-probability-calibrated models are, we compared the scores obtained by the five CPC methods described in Sect. 5.2 using the t-test, which yielded five fold-wise p-values for each CPC pair. We then applied Fisher’s method to obtain a single overall p-value per pair, assuming that these five p-values came from independent statistical tests. Here, we used the output of boosted trees with the number of tree leaves set to 8.
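For reference, the sketch below combines fold-wise t-test p-values with Fisher's method (scipy.stats.combine_pvalues with method='fisher' performs the same aggregation); the synthetic fold data stand in for the score vectors of two calibrated models.

```python
import numpy as np
from scipy.stats import chi2, ttest_ind

def fisher_combined_pvalue(pvalues):
    """Fisher's method: under independence, -2 * sum(log p_i) follows a
    chi-square distribution with 2k degrees of freedom."""
    p = np.asarray(pvalues, dtype=float)
    statistic = -2.0 * np.log(p).sum()
    return chi2.sf(statistic, df=2 * p.size)

# Hypothetical usage: one p-value per fold comparing the scores of two
# calibrated models (synthetic data here), combined into one overall p-value.
rng = np.random.default_rng(0)
fold_pvalues = [ttest_ind(rng.normal(size=50), rng.normal(0.1, 1.0, 50)).pvalue
                for _ in range(5)]
overall_p = fisher_combined_pvalue(fold_pvalues)
```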

The results in Fig. 5 indicate that, for a subset of TCFs, the estimated probability distributions were quite close to each other. Although the TCFs are rather different, they appear to approximate a similar distribution with only small differences. We believe that one reason for the observed efficiency of the proposed method is that these small differences within the cluster are due to estimation noise, so mixing the estimates reduces the noise level.

Fig. 5 The p-values for the different calibrations obtained via Fisher’s method on the fold-wise p-values of the t-test on LETOR 4.0/MQ2007. The class-probability-based calibration was calculated as described in Sect. 5.2

7.8 Yahoo! Learning-to-Rank Challenge

The Yahoo 1 and Yahoo 2 data sets were the official data sets of the Yahoo! Learning-To-Rank Challenge.Footnote 15 This open challenge attracted over 300 teams from both industry and academia. The data sets used in the challenge can be considered the first freely available large-scale learning-to-rank data sets, allowing more reliable benchmarking than the earlier tests based on the LETOR sets. The challenge revealed some important findings. (Table 4 shows the final scores achieved on the test set.) First, ensemble methods achieved the best scores: almost without exception, the disseminated algorithms devised by the top teams were based on ensemble techniques. Second, the general consensus (coming mainly from benchmarks on the LETOR data sets (Cao et al. 2007; Valizadegan et al. 2009)) that pairwise and listwise techniques outperform pointwise approaches seems to be refuted. The majority of the best teams used pointwise approaches, and these were competitive with the pairwise and listwise techniques. This finding agrees with our experimental results on the benchmark data sets: some of the pairwise and listwise methods were on par with our approach on the LETOR data sets, but on the larger Yahoo 1−2 and MS 1, our pointwise technique performed significantly better than the pairwise and listwise methods.Footnote 16

Table 4 The ERR and NDCG 10 scores of the official Yahoo! Learning-to-Rank Challenge entries. The entries are listed in order of their ERR scores. Scores within one sigma of the winning score are in bold. Teams with red (♣), green (♠), and blue (♢) backgrounds used pointwise, pairwise, and listwise approaches, respectively; we have no information about the uncolored teams. The name of our team was LAL. For comparison, we also include the results of AB+EXP and RankSVM from Tables 2 and 3 and the results of the Gradient Boosted Decision Tree algorithm (Friedman 2002) taken from Chapelle and Chang (2011)

We also participated in the Yahoo! Learning-To-Rank Challenge with an earlier version of our approach (Busa-Fekete et al. 2011a). The algorithm we used there most closely resembles AB+EXP.RBC. The main difference is that we further diversified our ensemble by merging several relevance grades to create five different multi-class classification problems. This created further diversity in the ensemble and pushed our entry into the top 6. Label grouping could be included in any classification-based ensemble technique, but we decided not to use it in the benchmark tests of this paper to keep the method simple and computationally less expensive.

8 Conclusions and future work

In this paper, we described a generic technique for learning to rank. The method consists of three steps: (1) training several multi-class classifiers to predict the relevance labels, (2) calibrating the output score vectors to predict either the posterior probabilities of the relevance labels or the real-valued relevance grades, and (3) combining the resulting models using a simple exponential weighting scheme. The advantages of the method are its conceptual simplicity, its practical performance, and its computational efficiency. We also presented a theoretical analysis in which we examined the link between the Kullback-Leibler divergence and the expected DCG, showing that a better estimate of the conditional probability distribution of the relevance labels in terms of Kullback-Leibler divergence results in a higher expected DCG score in our probabilistic setup.

In our experiments we showed that our essentially pointwise approach is competitive with more complex methods, including RankSVM, AdaRank, RankBoost, CA, and RankNet, on most of the available large learning-to-rank benchmark data sets. In a comparison of multi-class classifiers, we found that AdaBoost.MH is better suited to this task than MC-SVM. The main bottleneck of MC-SVM is that it slows down for larger trade-off parameters on large-scale data sets, making it difficult to reach its optimal performance. To alleviate this problem, we plan in a further study to test the algorithm described in Hazan and Kale (2011), where MC-SVM is trained in an online bandit setup.

We also investigated how the performance of different algorithms evolves as the size of the data grows. We found that pointwise and pairwise techniques in general, and our approach in particular, scale better for large data sets.

The results of our paper, along with the findings of the Yahoo! Learning-To-Rank Challenge, underscore the strong performance of ensemble rankers. Their applicability in practice is mainly limited by the fact that many rankers have to be evaluated at test time, and it is well known that evaluation time is crucial in a real-world learning-to-rank application. This motivates the development of a framework in which a controller selects the rankers to be evaluated based on the characteristics of individual queries (Cambazoglu et al. 2010). Our future goal here is to model this problem as a Markov decision process and to solve it using standard reinforcement learning techniques (Dulac-Arnold et al. 2011; Benbouzid et al. 2011, 2012a).