Rank-order-correlation-based feature vector context transformation for learning to rank for information retrieval

As a crucial task in information retrieval, ranking defines the preferential order among the retrieved documents for a given query. Supervised learning methods have recently been devoted to automatically learning ranking models, incorporating various models into one effective model. This paper proposes a novel supervised learning method, in which instances are represented as bags of contexts of features, instead of bags of features. The method applies rank-order correlations to measure the correlation relationships between features. The feature vectors of instances, i.e., the 1st-order raw feature vectors, are then mapped into the feature correlation space via projection to derive the context-level feature vectors, i.e., the 2nd-order context feature vectors. As for ranking model learning, Ranking SVM is employed with the 2nd-order context feature vectors as the input. The proposed method is evaluated using the LETOR benchmark datasets and is found to perform well with competitive results. The results suggest that the learning method benefits from the rank-order-correlation-based feature vector context transformation.


INTRODUCTION
In information retrieval (IR), ranking is a crucial task that defines the preferential order among the retrieved documents for a given query. Traditional IR has adopted empirical ranking models, such as the Boolean model, the vector space model, and the probabilistic model, which are designed in unsupervised manners [1]. In practice, the aforementioned models usually suffer high costs for parameter tuning, and sometimes overfitting occurs, especially when the models are carefully tuned to fit particular needs [22]. Nowadays, as many IR results are increasingly accompanied by relevance judgments, e.g., query and click-through logs collected in search engines, supervised learning methods, referred to as learning to rank (LTR) methods, have been devoted to automatically learning ranking models (e.g., [5,6,12,14,18,45,53]). In general, supervised learning allows the automatic tuning of parameters and the incorporation of various models into a single effective one.

Figure 1 presents the general paradigm of learning to rank for IR. The learning process consists of two phases, namely, training and test. First, the training phase is introduced. Given a query collection, Q = {q_1, q_2, ..., q_|Q|}, and a document set, D = {d_1, d_2, ..., d_|D|}, a training instance is a query-document pair, (q_i, d_j) ∈ Q × D, upon which a relevance judgment indicating the relationship between q_i and d_j is assigned by a labeler. The relevance judgment can be (1) a class label, e.g., relevant or non-relevant; (2) a rating, e.g., a 3-star rating scaling from 0 to 2 for non-relevant, possibly relevant, and definitely relevant; (3) an order, e.g., k, meaning that d_j is ranked at the k-th position in the ordering of documents when q_i is considered; or (4) a score, e.g., sim(q_i, d_j), specifying the degree of relevance between q_i and d_j.
For each instance, (q_i, d_j), a feature extractor produces a vector of features that describes the match between q_i and d_j. Such features can be classical IR models (e.g., term frequency, inverse document frequency, and Okapi BM25 [32]) or newly developed models (e.g., HostRank [51], Feature Propagation [29,36], and Topical PageRank [26]). The inputs to the learning algorithm comprise training instances, their feature vectors, and the corresponding relevance judgments. The output is a ranking model, f, where f(q_i, d_j) is supposed to give the "true" relevance judgment for q_i and d_j. The learning algorithm attempts to learn a ranking model such that a performance measure, e.g., classification accuracy, error rate, or Mean Average Precision (MAP) [1], with respect to the output relevance judgments can be optimized. In the test phase, the ranking model is applied to judge the relevance between each document d_i in D and a new query q.
The feature vector model, a.k.a. the bag-of-features model, is widely used for representing instances; each dimension in the feature vector model corresponds to a ranking (or retrieval) model. The assumptions behind the model include [2]: (1) the independence relationship between features (i.e., each feature is a priori independent from the others); (2) the flatness of the feature values (i.e., no hierarchy among the values); and (3) the certainty of the observations (i.e., only one value for each feature). However, empirical observations have found that features, i.e., the ranking models in this study, are not always independent. For example, TF-IDF [1] and Okapi BM25 [32] are considered somewhat correlated since both are designed based on term frequency and inverse document frequency. In such cases, the feature vector model neglects the correlations between features and treats the features as independent coordinate axes.
This paper proposes to model instances as bags of contexts of features, instead of bags of features. The contexts are extracted from the feature correlation space that is built with rank-order correlations to capture the correlation relationships between features. The feature vectors of instances (hereafter called the 1st-order raw feature vectors) are mapped into the feature correlation space via projection for deriving the context-level feature vectors (hereafter called the 2nd-order context feature vectors). The 2nd-order context feature vectors inherently take into account the correlation relationships between features and are believed to be capable of overcoming the limitations of the 1st-order raw feature vectors. A new learning method, which extends Figure 1 by incorporating the 2nd-order context feature vectors and the state-of-the-art learning algorithm, Ranking SVM [18], is also developed.
The rest of this paper is structured as follows. Section 2 presents a brief review of the literature. Section 3 describes the technical details of the proposed method. Section 4 provides the experimental results, and Section 5 concludes this paper and points out possible directions for further research.

RELATED WORK
Previous studies of learning to rank fall into three categories [22]: (1) the point-wise approach; (2) the pair-wise approach; and (3) the list-wise approach.
In the point-wise approach, each training instance is associated with a class or rating. The learning process finds a model that maps instances into classes or ratings close to their true values. The point-wise approach can be further divided into three subcategories, namely, regression-based (e.g., [11]), classification-based (e.g., [25] and McRank [21]), and ordinal regression-based (e.g., [16], [37], and Pranking [12]). A typical example is Pranking [12], which trains a perceptron model to directly maintain a totally-ordered set via projections. Another is McRank [21], which defines a 5-star rating and casts the ranking problem as multiple classification, in accordance with the observation that perfect classifications lead to perfect DCG (Discounted Cumulative Gain) [17] scores. The classification model is learned via gradient boosting.
The pair-wise approach takes pairs of objects and their relative preferences as training instances and learns to classify each object pair as correctly ranked or incorrectly ranked. Most existing methods are pair-wise approaches, e.g., Ranking SVM [18], RankBoost [14], and RankNet [5]. Ranking SVM employs support vector machines (SVM) to classify object pairs in consideration of large-margin rank boundaries. Both RankBoost and QBrank [57] conduct boosting to find a combined ranking that minimizes the number of mis-ordered pairs of objects. RankNet defines cross entropy as a probabilistic cost function on object pairs and uses a neural network to optimize the cost function. LambdaRank [4] also employs a neural network but uses gradients based on NDCG (Normalized Discounted Cumulative Gain) [17] scores smoothed by the RankNet loss (also see LambdaMART [3], the boosted-tree version of LambdaRank). FRank [43] adopts Fidelity to measure the loss of ranking and uses a generalized additive model to minimize the Fidelity loss. Semi-RankSVM [27] extends Ranking SVM with a graph-based regularized algorithm to learn a ranking function that minimizes the least-squares ranking loss.
Finally, the list-wise approach uses a list of ranked objects as training instances and learns to predict the list of objects. There are two sub-categories, respectively based on the direct optimization of IR evaluation measures (e.g., SoftRank [42], AdaRank [50], and SVMmap [56]) and the minimization of list-wise ranking losses (e.g., ListNet [6] and ListMLE [48]). Examples are briefly described as follows. AdaRank [50], a learning algorithm within the framework of boosting, repeatedly constructs "weak rankers" and finally combines the weak rankers linearly to make ranking predictions. SVMmap [56] is an SVM-based learning algorithm that efficiently finds a globally optimal solution to a straightforward relaxation of MAP [1]. ListNet [6] introduces a probabilistic list-wise loss function and adopts a neural network and gradient descent to train a list prediction model. In ListMLE [48], the likelihood loss is employed as the surrogate for the IR evaluation measures.
More examples are described as follows. [25] treats IR as binary classification of relevance and explores the applicability of discriminative classifiers to solve the problem. [18] takes pairs of documents and their relative preferences derived from click-through data (i.e., the log of links that users click on in the presented ranking) as training instances and applies Ranking SVM for learning better retrieval functions. [7] modifies the "Hinge Loss" function in Ranking SVM to consider two essential factors for IR: (1) to have high accuracy on the top-ranked documents, and (2) to avoid training a model biased towards queries with many relevant documents. [49] uses Ranking SVM to address definition search, where the retrieved definitional excerpts of a term are ranked according to their likelihood of being good definitions. [54] extends SVM selective sampling techniques in classification for learning to rank. [24] proposes a multiple nested ranker approach to re-rank the top-scoring documents of the result list, in which RankNet is applied to learn a new ranking at each iteration. RankCosine [30] uses the cosine similarity between the ranking list and the ground truth as a query-level loss function. RV-SVM [55] develops the 1-norm Ranking SVM, which is based on a 1-norm objective function, for faster training using far fewer support vectors than the standard Ranking SVM.

Figure 1: The general paradigm of learning to rank for IR [52].
Other research directions, which are receiving increasing attention in recent years, include [8]: online learning to rank, for quickly learning the best re-ranking of the top positions of the original ranked list based on real-time user click feedback (e.g., [9,10,34,35]); large-scale learning to rank, which leverages both learning theory and computational theory for ranking when facing large-scale training data (e.g., [31,41,47]); learning to rank for diversity, which optimizes not only for relevancy but also for diversity (i.e., for minimum redundancy) by taking into account document similarity and ranking context (e.g., [33,38]); and robust learning to rank, which optimizes the tradeoffs between model effectiveness and robustness for real-world retrieval scenarios (e.g., [20,46]).

THE PROPOSED METHOD
Figure 2 gives an overview of the proposed method, which is essentially an extension of Figure 1. Newly added modules are marked in gray. Feature Correlation stores the correlation relationships between features in a matrix, in which the relationships are measured by rank-order correlation coefficients. Vector Transformation considers the feature correlation matrix as an intermediary context space for transformation and derives the 2nd-order context feature vectors by projecting the 1st-order raw feature vectors into the space. Last, Learning Algorithm: Ranking SVM takes the 2nd-order context feature vectors as the input to train a linear binary classifier by Ranking SVM [18] for judging, as regards a particular query, the binary ordering relations between documents. The details of the proposed method are elaborated in the following subsections. The symbols used are denoted as follows:
• Q: the query set. Q = {q_1, ..., q_m}, |Q| = m;
• D: the document set. D = {d_1, ..., d_n}, |D| = n;
• f: the ranking model f(·). f(q, d) indicates the relevance judgment for document d with respect to query q.

Relevance labeling
The relevance labeling annotates instances with proper relevance judgments, which play the role of answers (or observations) that guide the learning algorithm to train an effective ranking model. The labeling scheme in this paper is an n-star rating with different levels of relevance. In terms of the 3-star rating, for example, the relevance judgments are quantified as 0 for not relevant, 1 for possibly relevant, and 2 for definitely relevant.

Feature vector extraction and context transformation
The three following steps are carried out in the process: (1) 1st-order raw feature vector extraction; (2) feature correlation extraction; and (3) 2nd-order context feature vector transformation.
(1) 1st-order raw feature vector extraction. With the feature set F comprising a |F|-dimensional vector space, each query-document pair (q_i, d_{i,j}) is represented as a vector of numerical features:

d_{i,j} = < f_val(f_1, q_i, d_{i,j}), f_val(f_2, q_i, d_{i,j}), ..., f_val(f_{|F|}, q_i, d_{i,j}) >,  (1)

where f_k is the k-th feature in F. The function f_val is a feature extraction function, defined as:

f_val(f_k, q_i, d_{i,j}) = the numerical value of feature f_k computed for the pair (q_i, d_{i,j}).  (2)

Here, d_{i,j} is named the "1st-order raw feature vector," to distinguish it from the "2nd-order context feature vector," as will be explained later.
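The extraction step above can be sketched as follows. The two feature functions are hypothetical stand-ins for the paper's ranking models (e.g., TF, IDF, BM25); only the overall shape of Eq. (1) is taken from the text.

```python
import math

# Hypothetical feature extractors (stand-ins for ranking models such as TF
# or document-length features); each maps a (query, document) pair to a number.
def tf_feature(query, doc):
    terms = query.lower().split()
    words = doc.lower().split()
    return sum(words.count(t) for t in terms)

def dl_feature(query, doc):
    return math.log(1 + len(doc.split()))

FEATURES = [tf_feature, dl_feature]  # the feature set F

def raw_feature_vector(query, doc):
    """1st-order raw feature vector: one f_val per feature in F (Eq. (1))."""
    return [f(query, doc) for f in FEATURES]

vec = raw_feature_vector("information retrieval",
                         "retrieval of information in IR systems")
print(vec)
```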
(2) Feature correlation extraction. For query q_i, a document sequence Ř_{i,k} can be established by ordering the documents in D(q_i), the retrieved document set of q_i, in accordance with their feature values, calculated by Eq. (2), when feature f_k is considered:

Ř_{i,k} = r_{i,k}(D(q_i)),  (3)

where r_{i,k} is an order function that maps each document in D(q_i) to an appropriate position in Ř_{i,k}. For instance, r_{i,k}(d_{i,p}) < r_{i,k}(d_{i,t}) is satisfied if and only if d_{i,p} and d_{i,t} exist in D(q_i) and f_val(f_k, q_i, d_{i,p}) > f_val(f_k, q_i, d_{i,t}).

Given two such sequences, it becomes possible to assess their correlation using rank-order correlation coefficients, such as Pearson's r [28], Spearman's rho (ρ) [40], and Kendall's tau (τ) [19]. Here, the correlations between sequences are referred to as the correlations between features with respect to the given query. To get an overall feature correlation in consideration of all queries in Q, a simple strategy is utilized to compute the correlation between f_i and f_j, i.e., c_{i,j}, based on the rule of macro-correlation:

c_{i,j} = (1 / |Q|) × Σ_{t=1..|Q|} F_RCorr(Ř_{t,i}(D(q_t)), Ř_{t,j}(D(q_t))),  (4)

where F_RCorr is a rank-order correlation coefficient function that measures the rank-order correlation between the two document sequences Ř_{t,i}(D(q_t)) and Ř_{t,j}(D(q_t)):

F_RCorr(Ř_{t,i}(D(q_t)), Ř_{t,j}(D(q_t))) = RCorr(Ř_{t,i}(D(q_t)), Ř_{t,j}(D(q_t))),  (5)

where RCorr is one of the rank-order correlation coefficients listed in Table 1. Finally, a feature correlation matrix, C, is generated:

C = [c_{i,j}]_{|F|×|F|}.  (6)

It is recalled that, by Eq. (4), c_{i,j} = c_{j,i} holds. The matrix C is evidently symmetric, so that

C = C^T.  (7)
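The macro-correlation of Eq. (4) can be sketched as below with Kendall's tau as the coefficient. The tau implementation is a minimal version without tie handling, and the per-query feature values are toy numbers, not data from the paper.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau between two feature-score lists over the same documents
    (minimal version: assumes no ties)."""
    concordant = discordant = 0
    for a, b in combinations(range(len(x)), 2):
        s = (x[a] - x[b]) * (y[a] - y[b])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs

# Values of two features over the documents of each query
# (toy numbers; one inner list per query).
feature_i = [[0.9, 0.4, 0.1], [0.8, 0.2, 0.5]]
feature_j = [[0.7, 0.5, 0.2], [0.1, 0.3, 0.9]]

# Macro-correlation (Eq. (4)): average the per-query coefficients.
per_query = [kendall_tau(fi, fj) for fi, fj in zip(feature_i, feature_j)]
c_ij = sum(per_query) / len(per_query)
print(per_query, c_ij)
```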
Notably, the row and column vectors of C satisfy

C_i = < c_{i,1}, ..., c_{i,k} >,  (8)

where C_i is the i-th row vector of C, and

Ĉ_i = < ĉ_{i,1}, ..., ĉ_{i,k} >,  (9)

where Ĉ_i is the i-th column vector of C. The row vector (or the column vector) provides a mathematical formalization for f_i. Table 1 lists the rank-order correlation coefficients used for F_RCorr. For example, in terms of Kendall's tau (τ), values of the coefficient range from −1 (i.e., 100% negative association, or perfect inversion) to +1 (i.e., 100% positive association, or perfect agreement). A value of 0 indicates the absence of association. Considering that correlation coefficients may have different ranges of values (e.g., Kendall's tau (τ) versus Kendall tau distance), a direct comparison between them may lead to insignificant contrast. Therefore, each coefficient value is further normalized into a common range of [0, 1] for fair comparison, where 0 represents no association and +1 represents perfect agreement.

computer systems science & engineering — J.-Y. YEH

Table 1: The rank-order correlation coefficients used for F_RCorr.
In consideration of one rank-order correlation coefficient used for F_RCorr, there are in total |Q| feature correlation values for f_i and f_j, according to Eq. (5). In the implementation, the distribution of the |Q| feature correlation values is found not to be concentrated, which might cause distortion of c_{i,j}. Thus, the Box-and-Whisker Plot [44], a.k.a. Boxplot, is employed to filter out potential outliers. The outlier detection process is detailed as follows. First, all feature correlation values are arranged in sequence from low to high. Then, the values of the 25th and 75th percentiles are defined as Q1 and Q3, respectively. The interquartile range, IQR = Q3 − Q1, is computed. Finally, c_{i,j} is calculated as the mean of the feature correlation values in the range [Q1 − 1.5 × IQR, Q3 + 1.5 × IQR]. Note that feature correlation values outside the range are treated as outliers and are discarded during the computation of Eq. (4).
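The Boxplot filtering step can be sketched as follows. The paper does not specify its percentile convention, so linear interpolation between order statistics is assumed here; the input values are illustrative only.

```python
def iqr_filtered_mean(values):
    """Mean after Boxplot (IQR) outlier filtering, as used for c_{i,j}."""
    s = sorted(values)

    def percentile(p):
        # Linear interpolation between order statistics (one common convention).
        k = (len(s) - 1) * p
        lo, hi = int(k), min(int(k) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (k - lo)

    q1, q3 = percentile(0.25), percentile(0.75)
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    kept = [v for v in values if lo <= v <= hi]  # drop outliers
    return sum(kept) / len(kept)

# Per-query correlation values for one feature pair; 0.99 is an outlier.
vals = [0.30, 0.32, 0.35, 0.31, 0.33, 0.99]
m = iqr_filtered_mean(vals)
print(m)
```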
The proposed 2nd-order context transformation method, named Latent Semantic Analysis Based, applies latent semantic analysis (LSA) [13,23] to project d_{i,j} into a space with latent semantic dimensions that is derived from the feature correlation matrix C. The transformation process consists of three steps, in sequence: singular value decomposition, dimensionality reduction, and folding-in. First, singular value decomposition (SVD) is performed on the feature correlation matrix C. The SVD of C is defined as C = U S V^T, where U is a k × k column-orthonormal matrix with the left singular vectors in its columns; S is a diagonal matrix with the singular values (s_1, ..., s_k) sorted in descending order on its diagonal and zeros elsewhere; and V is a k × k orthonormal matrix with the right singular vectors in its columns. Supposing that the rank of C is p, S satisfies s_1 ≥ s_2 ≥ ... ≥ s_p > s_{p+1} = ... = s_k = 0. Dimensionality reduction follows, keeping only the z (z < p) largest singular values of S to obtain a z × z diagonal matrix S_z. Note that C_z = U_z S_z V_z^T is then a rank-z approximation of C, i.e., C_z ≈ C, in which S_z represents the latent semantic structure derived from C. Finally, folding-in folds d_{i,j} into the latent semantic space to obtain d^(2)_{i,j}:

d^(2)_{i,j} = d_{i,j}^T U_z S_z^{-1}.  (10)

It is worth noting that, by dimensionality reduction, related features are mapped onto the same dimensions of the reduced space and unrelated features are mapped onto different dimensions. This operation reflects a grouping of features into z linearly-independent base vectors, i.e., the contexts of features in this study. It can be said that the dimensions of the reduced space correspond to the axes of greatest variation [23]. Thus, folding d_{i,j} into the latent semantic space means representing d_{i,j} by these base vectors.
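The three steps above can be sketched with NumPy as follows. The 2×2 correlation matrix and the raw vector are toy values, and the folding-in formula follows the standard LSA convention (d^T U_z S_z^{-1}), which is assumed rather than spelled out in the original text.

```python
import numpy as np

def lsa_transform(C, z, d):
    """Fold a 1st-order raw feature vector d into the rank-z latent space
    of the feature correlation matrix C (a sketch of Eq. (10))."""
    U, s, Vt = np.linalg.svd(C)      # C = U S V^T
    U_z, s_z = U[:, :z], s[:z]       # keep the z largest singular values
    return d @ U_z / s_z             # d^(2) = d^T U_z S_z^{-1}

# Toy 2x2 feature correlation matrix (two positively correlated features).
C = np.array([[1.0, 0.8],
              [0.8, 1.0]])
d = np.array([0.9, 0.7])             # a 1st-order raw feature vector
d2 = lsa_transform(C, z=1, d=d)
print(d2)
```

Because SVD leaves the sign of each singular vector arbitrary, only the magnitude of the resulting coordinates is meaningful in this toy case.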

Ranking model learning and ranking prediction
To learn the ranking model, Ranking SVM [18] is employed, since previous studies have already demonstrated its feasibility and effectiveness. (The learning algorithm inside the proposed method is not limited to Ranking SVM; other learning algorithms that represent query-document pairs as vectors of features can be integrated. This issue is left for future work.) The 2nd-order context feature vectors of query-document pairs are viewed as instances. Pairs of instances and their relative preferences are input into Ranking SVM. The algorithm targets the binary ordering relations between documents with respect to a query and tries to learn a model that minimizes the number of discordant pairs based on the observed parts of the target ranking. In the implementation, SVMrank, a toolkit for efficiently training Ranking SVMs, is used; it is available at http://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html. Note that SVMrank learns an unbiased classification rule using a linear kernel. The output ranking model is a linear binary classifier, which is capable of determining whether a pair of documents is in concordant order. As for ranking prediction, given a new query q and its retrieved document set D(q), the output of the ranking model comprises the binary ordering relations between documents, according to which the final ordering of documents in D(q) can be established.
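The pairwise idea behind Ranking SVM can be illustrated with a minimal sketch: learn a linear scoring function w such that w·x_a > w·x_b whenever document a should be ranked above b. A simple perceptron update stands in for the SVM's large-margin optimization here; this is not the actual SVMrank solver, and the training pairs are toy vectors.

```python
def train_pairwise(pairs, dim, epochs=50, lr=0.1):
    """pairs: list of (x_better, x_worse) feature-vector pairs."""
    w = [0.0] * dim
    for _ in range(epochs):
        for xa, xb in pairs:
            margin = sum(wi * (a - b) for wi, a, b in zip(w, xa, xb))
            if margin <= 0:  # mis-ordered pair: nudge w toward the difference
                for i in range(dim):
                    w[i] += lr * (xa[i] - xb[i])
    return w

def rank(w, docs):
    """Order documents by descending score w·x."""
    return sorted(docs, key=lambda x: -sum(wi * xi for wi, xi in zip(w, x)))

# Toy 2nd-order context feature vectors; the first of each pair is preferred.
pairs = [([0.9, 0.1], [0.2, 0.8]), ([0.7, 0.3], [0.1, 0.5])]
w = train_pairwise(pairs, dim=2)
ranked = rank(w, [[0.2, 0.8], [0.9, 0.1], [0.7, 0.3]])
print(ranked)
```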

EXPERIMENTS
This section describes the datasets and evaluation measures, and reports the preliminary experimental results.

The LETOR benchmark datasets
The LETOR 3.0 and LETOR 4.0 benchmark datasets, available at http://research.microsoft.com/en-us/um/beijing/projects/letor/default.aspx, are used to evaluate the effectiveness of the proposed method. The LETOR datasets are created as query-document pairs, each containing a feature vector and its corresponding relevance judgment. 5-fold partitions are provided for cross-validation. In each fold, three subsets are used for learning, one subset for validation, and the remaining one for testing. The proposed method is tested on the TD2003, TD2004, and MQ2008 datasets, the statistics of which are listed in Table 2. Table 3 provides an illustration of the sample data, with each row standing for a query-document pair.

Evaluation measures
The standard P@n, NDCG@n, and MAP measures are used in the evaluation.
(1) Precision at position n (P@n) [1]. For a given query, the precision of the top n results of the ranking list is defined as:

P@n = (number of relevant documents in the top n results) / n.  (11)

Note that, when computing P@n, a document with a relevance judgment of either definitely or possibly relevant is regarded as relevant to the given query. The mean P@n is reported by averaging the P@n values of all queries.
(2) Mean Average Precision (MAP) [1]. For a given query, its average precision, AP, is computed as:

AP = ( Σ_{n=1..N} P@n × rel(n) ) / (number of relevant documents),  (12)

where N is the number of retrieved documents and rel(n) is either 1 or 0, indicating whether the n-th document is relevant to the query or not. MAP is obtained as the mean of the average precision over a set of queries.
(3) Normalized Discounted Cumulative Gain (NDCG) [17]. For a query, the NDCG of its ranking list at position n is calculated by:

NDCG@n = Z_n × Σ_{j=1..n} (2^{r(j)} − 1) / log(1 + j),  (13)

where r(j) is the rating of the j-th document in the list, and the normalization constant Z_n is set so that the perfect list receives an NDCG of 1. The rating r(j) is set to the relevance judgment, i.e., 2 when the j-th document is definitely relevant to the query, 1 when the j-th document is possibly relevant, and 0 when the j-th document is irrelevant. The mean NDCG@n is reported by averaging the NDCG@n values of all queries.
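The three measures can be sketched as below for a single query's ranked list of ratings. The base-2 logarithm in the NDCG discount is one common convention and is assumed here; the rating list is a toy example.

```python
import math

def precision_at_n(rels, n):
    """P@n: fraction of the top-n results that are relevant (rating > 0)."""
    return sum(1 for r in rels[:n] if r > 0) / n

def average_precision(rels):
    """AP: mean of P@n taken at each relevant position (Eq. (12))."""
    num_rel = sum(1 for r in rels if r > 0)
    hits, score = 0, 0.0
    for n, r in enumerate(rels, start=1):
        if r > 0:
            hits += 1
            score += hits / n
    return score / num_rel if num_rel else 0.0

def ndcg_at_n(rels, n):
    """NDCG@n with the (2^r - 1) / log2(1 + j) gain/discount form."""
    def dcg(rs):
        return sum((2 ** r - 1) / math.log2(1 + j)
                   for j, r in enumerate(rs, start=1))
    ideal = dcg(sorted(rels, reverse=True)[:n])  # the perfect list (Z_n)
    return dcg(rels[:n]) / ideal if ideal > 0 else 0.0

# Ratings of a ranked list: 2 = definitely, 1 = possibly, 0 = not relevant.
rels = [2, 0, 1, 0, 1]
p3, ap, nd = precision_at_n(rels, 3), average_precision(rels), ndcg_at_n(rels, 5)
print(p3, ap, nd)
```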

Experimental settings
In the experiments, 5-fold cross-validation is conducted and the average score is reported. For each fold, the training set is first used to learn a ranking model. The feature correlation matrix (see Eq. (6)) is also built using the training set. The validation set is used for tuning model parameters, and the ranking model is then applied on the testing set. The standard LETOR evaluation tools are used in order to avoid differences in the evaluation results caused by different implementations of the evaluation measures. Two types of parameters need to be determined, namely, the z value for dimensionality reduction in the latent-semantic-analysis-based 2nd-order context transformation method, and the SVMrank learning-specific options. For the z value, a naïve method that sets z with a reduction ratio is adopted. For example, supposing that the rank of the feature correlation matrix C is p, z is set to 0.2 × p when a reduction ratio of 20% is considered. The possible ratio takes the values 10%, 20%, ..., 90%. For each fold, the ratio with the best MAP performance on the validation set is selected, and its performance on the testing set is reported. The SVMrank learning-specific options are set with "−c <C> −e 0.001 −1 1," where <C>, the trade-off between training error and margin, takes the values 0.00001, 0.00002, 0.00005, 0.0001, 0.0002, 0.0005, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, and 10. Similarly, for each fold, the <C> value with the best MAP performance on the validation set is selected and its performance on the testing set is reported.
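The per-fold parameter selection described above can be sketched as follows: evaluate every (reduction ratio, C) pair on the validation set and report the best pair's testing score. The `train_and_eval` callable is a hypothetical stand-in for the real train/evaluate loop, and the toy evaluator is purely illustrative.

```python
RATIOS = [r / 10 for r in range(1, 10)]            # 10%, 20%, ..., 90%
CS = [1e-5, 2e-5, 5e-5, 1e-4, 2e-4, 5e-4, 1e-3, 2e-3, 5e-3,
      1e-2, 2e-2, 5e-2, 0.1, 0.2, 0.5, 1, 2, 5, 10]

def select_parameters(train_and_eval):
    """train_and_eval(ratio, c, split) -> MAP on the given split.
    Pick the (ratio, C) pair with the best validation MAP (9 x 19 = 171
    candidate models per fold), then report its testing-set score."""
    best = max(((ratio, c) for ratio in RATIOS for c in CS),
               key=lambda p: train_and_eval(p[0], p[1], "validation"))
    return best, train_and_eval(best[0], best[1], "testing")

# Toy evaluator peaking at ratio=0.5, C=0.1 (stand-in for real training).
def toy_eval(ratio, c, split):
    return 1.0 - abs(ratio - 0.5) - abs(c - 0.1)

best_params, test_map = select_parameters(toy_eval)
print(best_params, test_map)
```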

Results
The evaluation results are given in Tables 4-12. The baseline, RankSVM-Struct [18], takes the 1st-order raw feature vectors as the input. Various models of the proposed method are examined and indicated in the form {G, K1, P, SD, SR, K2}-L, where the former part denotes the rank-order correlation coefficient used for F_RCorr (see Table 1) and the latter part specifies the proposed 2nd-order context transformation method, i.e., Latent Semantic Analysis Based (see Section 3.2). For example, the model K1-L employs Kendall's tau (τ) for measuring the feature correlations and Latent Semantic Analysis Based for performing the 2nd-order context transformation. The values in parentheses give the relative improvements of the proposed method compared with RankSVM-Struct. Last, in each column, the best performance is given in bold.

Tables 4-6 list the results on TD2003. It can be seen that the proposed models significantly outperform RankSVM-Struct in terms of P@1, P@3, NDCG@1, and NDCG@3. As for P@5, P@10, NDCG@5, and NDCG@10, some of the proposed models generate worse results than RankSVM-Struct; for instance, G-L has decreases of 8.7% and 2.22% regarding P@5 and NDCG@5, respectively. When considering MAP, Table 6 shows that the proposed models are superior to RankSVM-Struct, except that G-L and P-L have slight decreases of 0.26% and 0.22%, respectively. The maximum and minimum improvements are 6.3% (for SD-L) and 2.8% (for SR-L). The average MAP of the proposed models is 0.2785, an increase of 2.65% compared with RankSVM-Struct. In Table 6, the MAP scores of several representative baselines, including ListNet [6], AdaRank [50] (in two versions, namely, AdaRank-NDCG and AdaRank-MAP), and RankBoost [14], are also provided. Evidently, the proposed method performs well with competitive results.
The best model, SD-L, for example, outperforms ListNet, AdaRank-MAP, AdaRank-NDCG, and RankBoost with increases of 4.76%, 26.33%, 21.79%, and 26.82%, respectively.

Tables 7-9 list the results on TD2004. For all measures, P-L is observed to have significant improvements over RankSVM-Struct. Both SR-L and K2-L have satisfying improvements when P@1, P@3, NDCG@1, and NDCG@3 are considered. Other models, namely, G-L, K1-L, and SD-L, have worse results than RankSVM-Struct. As for MAP, Table 9 indicates that the proposed models are superior to RankSVM-Struct. The maximum and minimum improvements are 6.01% (for P-L) and 2.41% (for K1-L), respectively. The average MAP of the proposed models is 0.2286, an increase of 4.11% compared to RankSVM-Struct. Again, the proposed method is found to be competitive with the other baselines. For instance, the best model, P-L, outperforms ListNet, AdaRank-MAP, and AdaRank-NDCG with increases of 4.35%, 6.35%, and 20.25%, respectively. Unfortunately, none of the proposed models could beat RankBoost.
Tables 10-12 list the results on MQ2008. The proposed models are observed to have better performance than RankSVM-Struct as regards P@1 and NDCG@1. Some statistics are as follows. The improvements for K1-L are 4.47% for P@1 and 4.08% for NDCG@1. As for P@3, P@5, P@10, NDCG@3, NDCG@5, and NDCG@10, the proposed models do not work as well as expected. Some models obtain slight improvements, while others perform worse. Regarding MAP, Table 12 suggests that although the proposed models outperform RankSVM-Struct, the improvements are not significant, except for K2-L, which has an increase of 1.21%. The average MAP of the proposed models is 0.4725, an increase of 0.62% compared to RankSVM-Struct. However, none of the proposed models perform better than the other baselines.

Table 13 lists the upper-bound results of the proposed models on MQ2008. The results are obtained by the following steps. First, ranking models are trained with all possible combinations of parameters. For each fold, 171 (9 × 19 = 171; 9 values for z and 19 values for <C> in Section 4.3) ranking models are produced. Second, all the models are evaluated using the testing set and the best model is picked. (Since the picked model has the best parameters, directly optimized using the testing set, the results in Table 13 are regarded as upper-bound results.) Finally, the scores of the best model are reported. Table 13 shows that, with proper parameters, the proposed models can perform better than RankSVM-Struct with significant improvements. It is conjectured that the testing sets and the validation sets in MQ2008 have diverse properties. In such a case, the use of the validation set fails to select a good model for the testing set, which might explain why the proposed method leads to insignificant improvements compared to RankSVM-Struct when it is evaluated on MQ2008 (see Table 12).

vol 33 no 1 January 2018

Overall, the proposed method behaves differently on different datasets. With regard to MAP, the proposed method is found to be superior to RankSVM-Struct, with significant improvements on TD2003 and TD2004 and a slight improvement on MQ2008 (see Table 6, Table 9, and Table 12). The results suggest that the learning method benefits from the rank-order-correlation-based feature vector context transformation, which attempts to represent instances as vectors of contexts that take into account the correlation relationships between features. The proposed method is also observed to have good performance in terms of P@1 and NDCG@1, implying that the ranking model trained by the proposed method tends to rank the relevant document in the first place of the retrieved list. Furthermore, according to MAP, the proposed models can be ranked as SD-L > K1-L > K2-L > SR-L > P-L > G-L for TD2003, P-L > K2-L > SR-L > SD-L > G-L > K1-L for TD2004, and K2-L > K1-L > SR-L > SD-L > G-L > P-L for MQ2008. A final ranking of the proposed models can then be built as K2-L > SD-L > K1-L = SR-L > P-L > G-L, according to the average rank-order of every model.
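The final-ranking construction above can be sketched as follows: average each model's rank position across the three per-dataset orderings, then sort by the mean (the orderings themselves are taken from the results reported above).

```python
# Per-dataset model orderings, best first, as reported in the Results section.
rankings = {
    "TD2003": ["SD-L", "K1-L", "K2-L", "SR-L", "P-L", "G-L"],
    "TD2004": ["P-L", "K2-L", "SR-L", "SD-L", "G-L", "K1-L"],
    "MQ2008": ["K2-L", "K1-L", "SR-L", "SD-L", "G-L", "P-L"],
}

models = rankings["TD2003"]
# Average rank position of each model (1 = best) over the three datasets.
avg_rank = {m: sum(r.index(m) + 1 for r in rankings.values()) / len(rankings)
            for m in models}
final = sorted(models, key=lambda m: avg_rank[m])
print(final, avg_rank)
```

Running this reproduces the paper's final ordering, with K1-L and SR-L tied at the same average rank.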

CONCLUSION AND FUTURE WORK
In this paper, a novel supervised learning method for learning to rank is developed. The method models instances as bags of contexts of features, instead of as bags of features (i.e., vectors of features), which most supervised learning methods adopt. It applies rank-order correlations to measure the correlation relationships between features. The feature vectors of instances, i.e., the 1st-order raw feature vectors, are then mapped into the feature correlation space via projection to derive the context-level feature vectors, i.e., the 2nd-order context feature vectors. The following six rank-order correlation coefficients are considered for feature correlation extraction (see Table 1): Goodman and Kruskal's Gamma (G), Kendall's tau (τ), Pearson's r, Somers' d, Spearman's rho (ρ), and Kendall tau distance. One 2nd-order context transformation method is proposed, i.e., Latent Semantic Analysis Based (see Section 3.2), which produces the 2nd-order context feature vector by directly folding the 1st-order raw feature vector into the latent semantic space of the feature correlation matrix. In terms of ranking model learning, Ranking SVM is employed with the 2nd-order context feature vectors as the input. The proposed method is evaluated using the LETOR benchmark datasets and is found to perform well with competitive results. The results suggest that the learning method benefits from the rank-order-correlation-based feature vector context transformation.
Future work will continue to investigate the effectiveness of the proposed method by introducing other rank-order correlation coefficients for feature correlation extraction and different projection techniques for the 2nd-order context transformation. Another interesting objective is to test other learning algorithms by incorporating the 2nd-order context feature vectors. Lastly, considering that a row vector (or a column vector) of the feature correlation matrix provides a context-level mathematical formalization for feature f_i, it would be beneficial to design feature selection or feature clustering techniques for detecting redundant features based on the context formalizations of features.