Incremental Sparse Bayesian Ordinal Regression

Ordinal Regression (OR) aims to model the ordering information between different data categories, which is a crucial topic in multi-label learning. An important class of approaches to OR models the problem as a linear combination of basis functions that map features to a high-dimensional non-linear space. However, most basis function-based algorithms are time-consuming. We propose an incremental sparse Bayesian approach to OR tasks and introduce an algorithm to sequentially learn the relevant basis functions in the ordinal scenario. Our method, called Incremental Sparse Bayesian Ordinal Regression (ISBOR), automatically optimizes the hyper-parameters via the type-II maximum likelihood method. By exploiting fast marginal likelihood optimization, ISBOR avoids inverting large matrices, which is the main bottleneck in applying basis function-based algorithms to OR tasks on large-scale datasets. We show that ISBOR can make accurate predictions with parsimonious basis functions while offering automatic estimates of the prediction uncertainty. Extensive experiments on synthetic and real-world datasets demonstrate the efficiency and effectiveness of ISBOR compared to other basis function-based OR approaches.


Introduction
The task of modeling ordinal data has attracted attention in various areas, including computer vision [1,2], information retrieval [3], recommender systems [4] and machine learning [5,6,7,8,9]. Because of the explicit or implicit relationship between labels, simple regression or multi-classification algorithms may fail to find optimal decision boundaries, which motivates the development of dedicated methods.
Generally, OR algorithms can be classified into three categories: naive approaches, ordinal binary decompositions, and threshold models [5]. For naive approaches, OR tasks are simplified into traditional multi-classification or regression tasks, omitting ordering information, and solved by simple machine learning algorithms, e.g., Support Vector Machine (SVM) Regression [10]. For ordinal binary decomposition, the ordinal labels are decomposed into several binary pairs, which are then modeled by a single or multiple classifiers. For the threshold models, the OR problem is addressed by training a threshold model, which models the hidden score function and an implicit set of thresholds that derive the ordinal paradigm. Among these three categories, the third, threshold models, is the most popular way to model OR problems [5]. Thus, in this paper, we focus on threshold models.
Since data may lie in a low-dimensional space where they are not distinguishable by a linear combination of the features, basis functions are widely used in all three types of OR algorithm. Basis functions can map features to highly non-linear spaces where the data become distinguishable by a linear combination of basis functions [11]. We call algorithms of this kind basis function-based algorithms. Most current basis function-based OR algorithms do not scale well, as they are batch methods and require access to the full training dataset.
To address this scalability problem, we propose Incremental Sparse Bayesian Ordinal Regression (ISBOR), which utilizes an incremental Bayesian approach to learning. We impose a zero-mean Gaussian prior over the function parameters and utilize the ordinal likelihood [12], which is regarded as a probit function of OR, to model the ordinal relationship between categories. Then we apply the Laplace method [13] to derive a Maximum a Posteriori (MAP) estimate of the unknown parameters over the dataset. In order to derive a full Bayesian solution, we derive a type-II maximum likelihood optimization [14], in which ISBOR automatically optimizes the thresholds that determine the decision boundaries of the ordered categories as hyper-parameters. Finally, to accelerate training, we follow the idea of fast marginal likelihood learning [15] and derive an incremental training strategy for ISBOR.
With this paper, we make an important step towards efficient ordinal regression based on basis functions. In particular, the main contributions are as follows:
• We propose a basis function-based sequential sparse Bayesian treatment for ordinal regression, ISBOR, which scales well with the number of training samples.
• We provide an experimental evaluation of ISBOR's performance against existing basis function-based OR algorithms in terms of efficacy, efficiency and sparseness.
The remainder of the paper is organized as follows. Section 2 revisits the related work. Section 3 presents ISBOR. Section 4 details the hyper-parameter optimization of ISBOR. We report on the experimental results in Section 5. The paper is concluded in Section 6.

Related Work
In this paper, we focus on so-called basis function-based approaches to ordinal regression, which bring non-linear patterns to the linear decision functions and are well studied in machine learning. Three types of basis function-based approaches are widely used for the OR task: SVMs [11], Gaussian Processes (GP) [16] and Sparse Bayesian Learning (SBL) [14]. SVM approaches convert the learning process to a convex optimization problem for which there are efficient algorithms, e.g., SMO [17], to find global minima. However, SVM is not equipped with a probabilistic interpretation, as a result of which it is hard to use expert or prior knowledge and to make probabilistic predictions with SVM. GP [16] and SBL are Bayesian methods, which take expert knowledge as prior information and interpret the prediction with the posterior distribution. In order to conduct Bayesian inference and model selection, most of them require one to compute the inverse of the basis function matrix, which leads to $O(N^3)$ computational complexity, where $N$ is the number of training samples.
In the following, we describe some of these algorithms to provide context for our work. The SVM-based Support Vector Ordinal Regression (SVOR) approach [18] is an accurate OR algorithm [5]. SVOR is optimized using a sequential minimal optimization strategy, which brings the upper bound down to $O(N^2 \log N)$. Solving SVOR in the dual problem boils down to optimizing with L2-regularization, which leads to a slightly sparse solution.
Incremental Support Vector Machine for Ordinal Regression (ISVOR) [19] addresses the problem of basis function-based batch algorithms for OR. It decomposes the OR problem into ordinal binary classification and simultaneously builds decision boundaries with linear computational complexity. However, ISVOR suffers from stability problems, and it doubles the problem size because of its binary decomposition approach. The main difference between the proposed ISBOR and SVM-based methods is that ISBOR can use prior knowledge and make probabilistic predictions.
Gaussian Process Ordinal Regression (GPOR) [12] is the first GP algorithm that has been proposed for the OR task. GPOR employs a GP prior on the latent functions, and uses an ordinal likelihood, which is a generalization of the probit function, to estimate the distribution of ordinal data conditional on the model. To conduct model adaptation, GPOR applies two Bayesian inference techniques: Laplace approximation [13] and expectation propagation [20]. Since these approximate Bayesian inference methods require one to compute the inverse of an $N \times N$ matrix, the computational complexity of GPOR is $O(N^3)$. The main differences between GPOR and ISBOR are twofold: 1. ISBOR is a sparse method, as a result of which the prediction is only based on the relevant samples. In contrast, GPOR makes predictions based on the whole training data. 2. ISBOR is an incremental learning algorithm, while GPOR is a batch algorithm: during training, GPOR needs to compute the inverse of a matrix of size $N \times N$, while ISBOR only computes the inverse of a matrix of size $M \times M$, where $M \ll N$ is the number of relevant samples.
Based on GPOR, various OR algorithms have been proposed [21,22,23,24]. However, they are all batch algorithms. In contrast, the proposed method, ISBOR, is an incremental learning algorithm and avoids computing the inverse of an $N \times N$ matrix.
Based on SBL, Sparse Bayesian Ordinal Regression (SBOR) [25] builds a probabilistic solution to the OR problem. Here, "sparse" means that SBOR utilizes a sparseness assumption that enables it to make predictions based on a few relevant samples with an $O(M^3)$ computational bound, where $M$ is the number of relevant samples. However, SBOR is still a batch algorithm and requires one to handle matrix inversion on the full dataset during the initial iterations. Other basis function-based batch OR algorithms include Kernel Discriminant for Ordinal Regression (KDOR) [26].
In summary, ISBOR differs from the above algorithms in the following ways. Instead of operating in batch, ISBOR sequentially chooses relevant samples in an incremental way. Because of the sparsity assumption, during sequential training ISBOR only selects a small portion of the training data, with linear computational complexity in each iteration. Moreover, instead of designing ordinal partitions like ISVOR, ISBOR directly learns the implicit thresholds and the score function, which is a more natural way to reveal ordinal relations.

Incremental Sparse Bayesian Ordinal Regression
We start this section by defining the notation used in the paper. The training set is $D = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N}$, where $\mathbf{x}_n \in \mathbb{R}^d$ is the feature vector, $y_n \in \{1, 2, \ldots, r\}$ is the corresponding category, and $r$ is the number of categories. We use normal-face letters to denote scalars and boldface letters to denote vectors and matrices.
We present ISBOR in four steps: model specification, likelihood definition, prior assumption and maximum a posteriori estimation.

Model specification
As a threshold OR model [5], ISBOR chooses a linear combination of basis functions as the score function, $f(\mathbf{x}_n; \mathbf{w})$, which maps a sample from the $d$-dimensional feature space to a real number:
$$f(\mathbf{x}_n; \mathbf{w}) = \sum_{i=1}^{N} w_i\, \phi(\mathbf{x}_n, \mathbf{x}_i), \tag{1}$$
where $\mathbf{w} \in \mathbb{R}^N$ denotes the parameter vector and $\phi(\cdot, \cdot)$ is the basis function, e.g., the Gaussian Radial Basis Function (RBF), Eq. (2). After mapping, ISBOR exploits a set of thresholds, $[b_0, \ldots, b_r]$, to determine the intervals of the different categories. In order to represent the ordering information, these thresholds are chosen as a set of ascending numbers, i.e., $b_{i+1} > b_i$, and work with a set of positive auxiliary numbers, $[\Delta_2, \ldots, \Delta_{r-1}]$.
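The score function and the threshold parameterization can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation. Two points are assumptions: the RBF is written as exp(-||x - c||^2 / theta^2), since the exact width convention of Eq. (2) is not recoverable here, and the interior thresholds are built as b_i = b_1 + Delta_2 + ... + Delta_i, following the parameterization used in GPOR [12]. The function names are hypothetical.

```python
import numpy as np

def rbf_design_matrix(X, centers, theta):
    """Phi[n, i] = exp(-||x_n - c_i||^2 / theta^2).

    The exact width convention is an assumption made for this sketch.
    """
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / theta ** 2)

def thresholds_from_increments(b1, deltas):
    """Build [b_0, ..., b_r] from b_1 and positive increments Delta_2..Delta_{r-1}.

    Assumes b_i = b_1 + sum_{j<=i} Delta_j, with b_0 = -inf and b_r = +inf.
    """
    deltas = np.asarray(deltas, dtype=float)
    if np.any(deltas <= 0):
        raise ValueError("all increments must be positive")
    inner = b1 + np.concatenate(([0.0], np.cumsum(deltas)))  # b_1 .. b_{r-1}
    return np.concatenate(([-np.inf], inner, [np.inf]))

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 10.0, size=(6, 2))
Phi = rbf_design_matrix(X, X, theta=2.0)      # every sample is a candidate basis
w = rng.normal(size=6)
scores = Phi @ w                              # f(x_n; w) for all n

b = thresholds_from_increments(-1.0, [0.5, 1.5])   # r = 4 categories
# Category i is assigned when b_{i-1} < f <= b_i.
labels = np.searchsorted(b[1:-1], scores, side="left") + 1
```

Parameterizing the thresholds through positive increments keeps them ordered automatically, so an unconstrained optimizer can update b_1 and the increments without ever violating b_{i+1} > b_i.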

Ordinal likelihood
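To model ordinal data, we take the ordinal likelihood proposed in GPOR [12]. The likelihood is the joint distribution of the samples conditional on the model parameters; with the i.i.d. assumption, it is computed as:
$$p(Y \mid X, \mathbf{w}, \sigma) = \prod_{n=1}^{N} p(y_n \mid \mathbf{x}_n, \mathbf{w}, \sigma),$$
where $Y = \{y_n\}_{n=1}^{N}$ and $X = \{\mathbf{x}_n\}_{n=1}^{N}$. Following the standard probabilistic assumption [12], we assume that the outputs of the score function are contaminated with random Gaussian noise: $\hat{y}_n = f(\mathbf{x}_n) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \sigma^2)$ and $\sigma$ is the standard deviation of the noise distribution, which is learned by model selection (Section 4.2). In this way, the score function is linked to the probabilistic output $p(y_n \mid \mathbf{x}_n, \mathbf{w}, \sigma)$. The ideal likelihood over a sample is computed as follows:
$$p_{\mathrm{ideal}}(y_n \mid f(\mathbf{x}_n)) = \begin{cases} 1 & \text{if } b_{y_n - 1} < f(\mathbf{x}_n) \le b_{y_n}, \\ 0 & \text{otherwise}, \end{cases} \tag{3}$$
where the thresholds $[b_0, \ldots, b_r]$ divide the real line into $r$ ordinal intervals. Thus, with these intervals, the ideal likelihood maps the real-valued output $f(\mathbf{x})$ to ordinal categories. However, because of the uniform distribution, Eq. (3) is not differentiable, and hence we cannot implement Bayesian inference. To tackle this issue, we integrate out the noise term and obtain a differentiable likelihood as follows:
$$p(y_n \mid \mathbf{x}_n, \mathbf{w}, \sigma) = \psi(z_{n,1}) - \psi(z_{n,2}), \tag{4}$$
where
$$z_{n,1} = \frac{b_{y_n} - f(\mathbf{x}_n)}{\sigma}, \qquad z_{n,2} = \frac{b_{y_n - 1} - f(\mathbf{x}_n)}{\sigma},$$
and $\psi(z)$ is the Gaussian cumulative distribution function. Based on Eq. (4), maximum likelihood estimation is equivalent to maximizing the area under the standard Gaussian distribution between $z_{n,2}$ and $z_{n,1}$, which is differentiable.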
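As a small numerical illustration of the differentiable likelihood, the sketch below evaluates psi(z_1) - psi(z_2) for every category of a single score. It assumes the GPOR-style convention z_1 = (b_y - f)/sigma and z_2 = (b_{y-1} - f)/sigma with b_0 = -inf and b_r = +inf; the helper names and the example thresholds are hypothetical.

```python
import math

def normal_cdf(z):
    """Standard Gaussian CDF psi(z); math.erf handles +/- infinity."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ordinal_likelihood(f, y, b, sigma):
    """p(y | f) = psi(z1) - psi(z2) for a 1-based category y."""
    z1 = (b[y] - f) / sigma
    z2 = (b[y - 1] - f) / sigma
    return normal_cdf(z1) - normal_cdf(z2)

# Hypothetical thresholds for r = 4 categories.
b = [-math.inf, -1.0, 0.0, 1.5, math.inf]
r = len(b) - 1

# Probability of each category for a score f = 0.3 and noise level sigma = 1.
probs = [ordinal_likelihood(0.3, y, b, 1.0) for y in range(1, r + 1)]
```

Because the intervals tile the real line, the category probabilities telescope and sum to one, and the category whose interval contains the score (here the third) receives the largest mass.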

Prior assumption
For large-scale datasets, if we directly learn the parameters by maximum likelihood estimation, we may easily encounter severe over-fitting. To avoid this, we add an additional constraint on the parameters: the regularization term. In Bayesian learning, we achieve this by introducing a zero-mean Gaussian prior for $\mathbf{w}$:
$$p(w_n \mid \alpha_n) = \mathcal{N}(w_n \mid 0, \alpha_n^{-1}).$$
Assuming that the parameters are mutually independent, the prior over the parameters is computed as:
$$p(\mathbf{w} \mid \boldsymbol{\alpha}) = \prod_{n=1}^{N} \mathcal{N}(w_n \mid 0, \alpha_n^{-1}), \tag{5}$$
where $\boldsymbol{\alpha} = [\alpha_1, \ldots, \alpha_N]$ and $\alpha_n$, the inverse of the variance, serves as the regularization term. If the value of $\alpha_n$ is large, the posterior of $w_n$ will be mainly constrained by the prior and $w_n$ will be bound to a small neighborhood of 0. To complete the definition of the sparse prior, we define a set of flat Gamma hyper-priors over $\boldsymbol{\alpha}$, which together with the Gaussian priors result in a Student's-t prior and work as $L_1$ regularization [14].

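Maximum a posteriori
Having defined the prior and the likelihood, ISBOR proceeds by computing the posterior over all training data, based on Bayes' rule:
$$p(\mathbf{w} \mid D, \boldsymbol{\eta}) = \frac{p(Y \mid X, \mathbf{w}, \sigma)\, p(\mathbf{w} \mid \boldsymbol{\alpha})}{p(D \mid \boldsymbol{\eta})},$$
where $D$ is the training data set, $p(\mathbf{w} \mid \boldsymbol{\alpha})$ defined in Eq. (5) is the prior, $p(Y \mid X, \mathbf{w}, \sigma)$ defined in Eq. (4) is the likelihood, and the denominator $p(D \mid \boldsymbol{\eta}) = \int p(Y \mid X, \mathbf{w}, \sigma)\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w}$ is the marginal likelihood, which we use for model selection and hyper-parameter optimization in the next section. To simplify our notation, we collect all the hyper-parameters, including the noise level $\sigma$, the thresholds and $\boldsymbol{\alpha}$, into $\boldsymbol{\eta}$.
We prefer the $\mathbf{w}^*$ with the highest posterior probability, and formulate the MAP point estimate as $\mathbf{w}^* = \arg\max_{\mathbf{w}} p(\mathbf{w} \mid D)$. However, we cannot integrate $\mathbf{w}$ out in the marginal likelihood analytically, so we work with the log-posterior:
$$\ln p(\mathbf{w} \mid D) = \ln p(Y \mid X, \mathbf{w}, \sigma) + \ln p(\mathbf{w} \mid \boldsymbol{\alpha}) + \mathrm{const}$$
$$\phantom{\ln p(\mathbf{w} \mid D)} = \sum_{n=1}^{N} \ln\big(\psi(z_{n,1}) - \psi(z_{n,2})\big) - \frac{1}{2} \mathbf{w}^T \mathbf{A} \mathbf{w} + \mathrm{const},$$
where $\mathbf{A}$ is a diagonal matrix with diagonal elements $[\alpha_1, \ldots, \alpha_N]$ and const is a term independent of $\mathbf{w}$. The first part of the last line, from the likelihood, works as the loss term; the second part, from the prior, acts as the regularization term.
Next, the Newton-Raphson method [27] is applied to compute the MAP estimate. First, we compute the first and second order derivatives of the first term (the log-likelihood part), $L = \ln p(Y \mid X, \mathbf{w}, \sigma)$:
$$\frac{\partial L}{\partial \mathbf{w}} = \boldsymbol{\Phi}^T \boldsymbol{\delta}, \qquad \frac{\partial^2 L}{\partial \mathbf{w}\, \partial \mathbf{w}^T} = -\boldsymbol{\Phi}^T \mathbf{H} \boldsymbol{\Phi},$$
where $\boldsymbol{\Phi}$ is the basis function matrix with entries $\Phi_{ni} = \phi(\mathbf{x}_n, \mathbf{x}_i)$, $\boldsymbol{\delta} = [\delta_1, \ldots, \delta_N]^T$ with $\delta_n = \partial \ln p(y_n \mid \mathbf{x}_n, \mathbf{w}, \sigma) / \partial f(\mathbf{x}_n; \mathbf{w})$, and $\mathbf{H}$ is the diagonal matrix with $H_{nn} = -\partial^2 \ln p(y_n \mid \mathbf{x}_n, \mathbf{w}, \sigma) / \partial f(\mathbf{x}_n; \mathbf{w})^2$. Then, combining Eq. (7), Eq. (8) and Eq. (9), we obtain the derivatives of the log-posterior as
$$\frac{\partial \ln p(\mathbf{w} \mid D)}{\partial \mathbf{w}} = \boldsymbol{\Phi}^T \boldsymbol{\delta} - \mathbf{A} \mathbf{w}, \qquad \frac{\partial^2 \ln p(\mathbf{w} \mid D)}{\partial \mathbf{w}\, \partial \mathbf{w}^T} = -\big(\boldsymbol{\Phi}^T \mathbf{H} \boldsymbol{\Phi} + \mathbf{A}\big).$$
Note that $\boldsymbol{\Phi}^T \mathbf{H} \boldsymbol{\Phi}$ is a quadratic form and $\mathbf{A}$ is a diagonal matrix with positive diagonal elements, so $\boldsymbol{\Phi}^T \mathbf{H} \boldsymbol{\Phi} + \mathbf{A}$ is a positive definite matrix, which implies that MAP estimation is a concave programming problem with a global maximum.
Having found the MAP point $\mathbf{w}^*$, we use the Laplace method to approximate the posterior distribution by a Gaussian distribution $\mathcal{N}(\mathbf{w} \mid \mathbf{w}^*, \boldsymbol{\Sigma})$, where $\mathbf{w}^*$ and $\boldsymbol{\Sigma}$ are the mean and covariance, computed as follows:
$$\boldsymbol{\Sigma} = \big(\boldsymbol{\Phi}^T \mathbf{H} \boldsymbol{\Phi} + \mathbf{A}\big)^{-1}, \tag{10}$$
$$\mathbf{w}^* = \boldsymbol{\Sigma} \boldsymbol{\Phi}^T \mathbf{H} \mathbf{t}, \tag{11}$$
where $\mathbf{t} = \mathbf{H}^{-1} \boldsymbol{\delta} + \boldsymbol{\Phi} \mathbf{w}^*$. Using a local Gaussian at the MAP point to represent the posterior distribution over the weights is often considered a weakness of the Bayesian treatment, especially for complex models. However, as pointed out by Tipping [14], a log-concave posterior implies a much better accuracy and no heavier sparsity than L1-regularization. As discussed above, the posterior of ISBOR is log-concave. We report the plots of the log-posterior as well as its first and second order derivatives in Figure 1 and see that $\partial L / \partial w$ is monotonically decreasing w.r.t. $w$, while $\partial^2 L / \partial w^2$ is always smaller than 0. So the MAP estimation here is essentially a log-concave optimization problem, which implies that the Laplace approximation in ISBOR enjoys the same accuracy and sparsity properties as in the Relevance Vector Machine (RVM) [14].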
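The Newton-Raphson search for the MAP point and the resulting Laplace covariance can be sketched as follows. This is a hedged illustration on toy data, not the authors' code: the closed forms used for delta and H are the derivatives of ln(psi(z_1) - psi(z_2)) with respect to the score under the GPOR-style convention for z_1 and z_2, the step-halving safeguard is an addition for numerical robustness, and all function names and toy settings are hypothetical.

```python
import math
import numpy as np

_cdf = np.vectorize(lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

def _pdf(z):
    return np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)

def grad_and_curvature(f, y, b, sigma):
    """delta_n = d ln p_n / d f_n and H_nn = -d^2 ln p_n / d f_n^2."""
    z1, z2 = (b[y] - f) / sigma, (b[y - 1] - f) / sigma
    p = np.maximum(_cdf(z1) - _cdf(z2), 1e-300)
    n1, n2 = _pdf(z1), _pdf(z2)
    zn1 = np.where(np.isfinite(z1), z1, 0.0) * n1   # z * N(z) -> 0 at +/- inf
    zn2 = np.where(np.isfinite(z2), z2, 0.0) * n2
    delta = (n2 - n1) / (sigma * p)
    h = ((n1 - n2) ** 2 / p ** 2 + (zn1 - zn2) / p) / sigma ** 2
    return delta, h

def log_posterior(w, Phi, y, b, sigma, alpha):
    f = Phi @ w
    p = _cdf((b[y] - f) / sigma) - _cdf((b[y - 1] - f) / sigma)
    return np.sum(np.log(np.maximum(p, 1e-300))) - 0.5 * np.sum(alpha * w ** 2)

def map_estimate(Phi, y, b, sigma, alpha, n_iter=100, tol=1e-10):
    """Newton-Raphson ascent on the log-posterior, then Laplace covariance."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        delta, h = grad_and_curvature(Phi @ w, y, b, sigma)
        grad = Phi.T @ delta - alpha * w
        neg_hess = Phi.T @ (h[:, None] * Phi) + np.diag(alpha)  # positive definite
        step = np.linalg.solve(neg_hess, grad)
        scale, current = 1.0, log_posterior(w, Phi, y, b, sigma, alpha)
        # Safeguard (not from the paper): halve the step if the objective drops.
        while scale > 1e-4 and log_posterior(
                w + scale * step, Phi, y, b, sigma, alpha) < current - 1e-9:
            scale *= 0.5
        w = w + scale * step
        if np.max(np.abs(step)) < tol:
            break
    delta, h = grad_and_curvature(Phi @ w, y, b, sigma)
    Sigma = np.linalg.inv(Phi.T @ (h[:, None] * Phi) + np.diag(alpha))
    return w, Sigma, delta, h

# Toy 1-D problem with three ordered categories and five RBF basis functions.
rng = np.random.default_rng(0)
x = np.linspace(-3.0, 3.0, 40)
centers = np.linspace(-3.0, 3.0, 5)
Phi = np.exp(-(x[:, None] - centers[None, :]) ** 2)
y = np.digitize(x + 0.3 * rng.normal(size=40), [-1.0, 1.0]) + 1   # labels 1..3
b = np.array([-np.inf, -0.5, 0.5, np.inf])
alpha = np.ones(5)

w_map, Sigma, delta, h = map_estimate(Phi, y, b, 1.0, alpha)
```

At convergence the gradient Phi^T delta - A w vanishes, which is exactly the condition under which w* = Sigma Phi^T H t holds with t = H^{-1} delta + Phi w*.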

Marginal likelihood
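In a fully Bayesian framework, the hyper-parameters are optimized by maximizing the posterior of the hyper-parameters, $p(\boldsymbol{\eta} \mid D) \propto p(D \mid \boldsymbol{\eta})\, p(\boldsymbol{\eta})$, where $\boldsymbol{\eta}$ contains all hyper-parameters. As we assume a non-informative Gamma hyper-prior, the optimization is equivalent to maximizing the marginal likelihood $p(D \mid \boldsymbol{\eta})$, which is computed as
$$p(D \mid \boldsymbol{\eta}) = \int p(Y \mid X, \mathbf{w}, \sigma)\, p(\mathbf{w} \mid \boldsymbol{\alpha})\, d\mathbf{w}.$$
As there is no closed form for this equation, we again apply the Laplace approximation. Expanding around $\mathbf{w}^*$, the $2\pi$ factors of the Gaussian prior and of the Gaussian integral cancel, and we get the following approximation:
$$\ln p(D \mid \boldsymbol{\eta}) \approx \sum_{n=1}^{N} \ln\big(\psi(z^*_{n,1}) - \psi(z^*_{n,2})\big) - \frac{1}{2} \mathbf{w}^{*T} \mathbf{A} \mathbf{w}^* + \frac{1}{2} \ln|\mathbf{A}| + \frac{1}{2} \ln|\boldsymbol{\Sigma}|, \tag{12}$$
where $z^*_{n,1}$ and $z^*_{n,2}$ are evaluated at $\mathbf{w}^*$. In the rest of this section, we deal with the log-marginal likelihood, and maximize Eq. (12) with respect to each hyper-parameter.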

Threshold and noise hyper-parameters
For the threshold hyper-parameters, we only need to determine $r - 1$ values: $b_1$ and $[\Delta_2, \ldots, \Delta_{r-1}]$. Since we cannot compute these analytically, we exploit gradient ascent to iteratively choose these parameters. The derivatives of the log-marginal likelihood, Eq. (12), with respect to $b_1$ and $\Delta_i$ are given in Eq. (13) and Eq. (14). Based on these two equations, we use gradient ascent to search for proper thresholds.
For the noise term $\sigma$, setting the derivative of the log-marginal likelihood with respect to $\sigma$ to zero yields an update rule for the noise term that depends on $\mathbf{t} = \mathbf{H}^{-1} \boldsymbol{\delta} + \boldsymbol{\Phi} \mathbf{w}^*$.

Fast marginal learning
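We compute the contribution of the sparsity hyper-parameter $\boldsymbol{\alpha}$ to the marginal likelihood as follows:
$$\ln p(D \mid \boldsymbol{\alpha}) = -\frac{1}{2} \left( N \ln 2\pi + \ln|\mathbf{C}| + \mathbf{t}^T \mathbf{C}^{-1} \mathbf{t} \right), \tag{16}$$
where we compute $\mathbf{C}$ as follows:
$$\mathbf{C} = \mathbf{H}^{-1} + \boldsymbol{\Phi} \mathbf{A}^{-1} \boldsymbol{\Phi}^T.$$
Since evaluating Eq. (16) requires inverting the $N \times N$ matrix $\mathbf{C}$, it is impractical to maximize it for large-scale training sets. Fortunately, Tipping and Faul [15] proposed a sequential way to maximize the marginal likelihood. We take this strategy and optimize $\boldsymbol{\alpha}$ as follows:
• First, we use the established matrix determinant and inverse identities [28] to compute the determinant and inverse of $\mathbf{C}$ as follows:
$$|\mathbf{C}| = |\mathbf{C}_{/j}| \left( 1 + \alpha_j^{-1} \boldsymbol{\phi}_j^T \mathbf{C}_{/j}^{-1} \boldsymbol{\phi}_j \right), \qquad \mathbf{C}^{-1} = \mathbf{C}_{/j}^{-1} - \frac{\mathbf{C}_{/j}^{-1} \boldsymbol{\phi}_j \boldsymbol{\phi}_j^T \mathbf{C}_{/j}^{-1}}{\alpha_j + \boldsymbol{\phi}_j^T \mathbf{C}_{/j}^{-1} \boldsymbol{\phi}_j}, \tag{18}$$
where $\boldsymbol{\phi}_j$ is the $j$-th column of $\boldsymbol{\Phi}$ and $\mathbf{C}_{/j}$ denotes $\mathbf{C}$ without the contribution of the $j$-th sample.
• Second, we define two auxiliary variables:
$$s_j = \boldsymbol{\phi}_j^T \mathbf{C}_{/j}^{-1} \boldsymbol{\phi}_j, \qquad q_j = \boldsymbol{\phi}_j^T \mathbf{C}_{/j}^{-1} \mathbf{t}. \tag{19}$$
Combining Eq. (16), (18) and (19), we isolate the contribution of sample $j$ to the marginal likelihood as follows:
$$\ln p(D \mid \boldsymbol{\alpha}) = \ln p(D \mid \boldsymbol{\alpha}_{/j}) + \frac{1}{2} \left[ \ln \alpha_j - \ln(\alpha_j + s_j) + \frac{q_j^2}{\alpha_j + s_j} \right].$$
For simplicity, we denote the $\alpha_j$-dependent term by $g(\alpha_j) = \ln p(D \mid \alpha_j)$.
• However, we still need to compute the inverse of $\mathbf{C}_{/j}$ in Eq. (19). To speed up the computation, we define the following auxiliary variables:
$$S_j = \boldsymbol{\phi}_j^T \mathbf{H} \boldsymbol{\phi}_j - \boldsymbol{\phi}_j^T \mathbf{H} \boldsymbol{\Phi} \boldsymbol{\Sigma} \boldsymbol{\Phi}^T \mathbf{H} \boldsymbol{\phi}_j, \qquad Q_j = \boldsymbol{\phi}_j^T \mathbf{H} \mathbf{t} - \boldsymbol{\phi}_j^T \mathbf{H} \boldsymbol{\Phi} \boldsymbol{\Sigma} \boldsymbol{\Phi}^T \mathbf{H} \mathbf{t},$$
where $\boldsymbol{\Sigma} \in \mathbb{R}^{M \times M}$ is the covariance of the posterior distribution (Eq. (10)) and $\boldsymbol{\Phi}$ here contains only the $M$ basis functions currently in the model. Then, we can compute $s_j = \frac{\alpha_j S_j}{\alpha_j - S_j}$ and $q_j = \frac{\alpha_j Q_j}{\alpha_j - S_j}$.
• Finally, setting $\partial g(\alpha_j) / \partial \alpha_j = 0$, we get the closed-form solution for $\alpha_j$:
$$\alpha_j = \frac{s_j^2}{q_j^2 - s_j}. \tag{21}$$
Since $\alpha_j \ge 0$, the denominator of Eq. (21), denoted as $f_j = q_j^2 - s_j$, must be positive, which works as an important criterion for determining the relevant samples.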
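The steps above can be checked numerically with a short sketch. It assumes the per-sample term of the log-marginal likelihood has the form used by Tipping and Faul [15], g(alpha) = 0.5 [ln alpha - ln(alpha + s) + q^2 / (alpha + s)], and verifies on a grid that the closed-form alpha_j is its maximizer. Function names and the numeric values of s and q are hypothetical.

```python
import numpy as np

def sparsity_factors(S, Q, alpha):
    """Recover s_j and q_j from S_j, Q_j and the current alpha_j.

    For a sample outside the model (alpha_j = inf) they coincide with S_j, Q_j.
    """
    if np.isinf(alpha):
        return S, Q
    return alpha * S / (alpha - S), alpha * Q / (alpha - S)

def g(alpha, s, q):
    """alpha_j-dependent part of the log-marginal likelihood."""
    return 0.5 * (np.log(alpha) - np.log(alpha + s) + q ** 2 / (alpha + s))

def optimal_alpha(s, q):
    """Closed-form maximizer; infinite when f_j = q^2 - s is not positive."""
    f = q ** 2 - s
    return s ** 2 / f if f > 0 else np.inf

s_j, q_j = sparsity_factors(S=0.5, Q=0.25, alpha=1.0)   # -> s_j = 1.0, q_j = 0.5

a_star = optimal_alpha(2.0, 3.0)            # f_j = 7 > 0, so alpha_j = 4 / 7
grid = np.logspace(-4, 4, 2001)
best_on_grid = g(grid, 2.0, 3.0).max()      # grid search for comparison

a_irrelevant = optimal_alpha(2.0, 1.0)      # f_j = -1 <= 0: sample is pruned
```

When f_j is not positive, g increases monotonically in alpha_j, so the optimum is alpha_j = infinity and the corresponding basis function drops out of the model.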

ISBOR
We summarize the pseudo-code of ISBOR in Algorithm 1. We provide brief comments on three ingredients. First, we initialize ISBOR (line 4) by randomly picking a sample from each category as the initial relevant samples. Based on these $r$ samples, we initialize $Q$, $S$ and $f$. Second, on line 6, we compute the delta marginal likelihood for the samples not yet considered. Third, in the call to Estimate() (line 13), we update $\mathbf{w}$ based on Eq. (11), update $\boldsymbol{\alpha}$ based on Eq. (21), update ml based on Eq. (12), and use gradient search to update the thresholds $\mathbf{b}$ based on Eq. (13) and Eq. (14).
Algorithm 1: ISBOR (only the closing steps of the listing are recoverable): w, α, b, ml = Estimate(w, α, b, Φ, φ, σ); if abs(ml − ml_old) < minDelta then …; ml_old = ml; end for
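Since most of Algorithm 1 is not reproduced here, the following sketch only illustrates two ingredients that the text does pin down: the sign of f_j as the relevance criterion and the stopping rule on the change in marginal likelihood. The add / re-estimate / delete decision follows the scheme of Tipping and Faul [15] and is an assumption about how ISBOR uses f_j; the function names, the `min_delta` default and the toy `estimate` callable are hypothetical.

```python
import math

def choose_action(f_j, in_model):
    """Decide what to do with candidate j from f_j = q_j^2 - s_j.

    Assumed to follow Tipping and Faul's fast marginal likelihood scheme.
    """
    if f_j > 0:
        return "re-estimate" if in_model else "add"
    return "delete" if in_model else "skip"

def run_until_converged(estimate, min_delta=1e-6, max_iter=100):
    """Call estimate() until the marginal likelihood changes by < min_delta."""
    ml_old = -math.inf
    for iteration in range(1, max_iter + 1):
        ml = estimate()
        if abs(ml - ml_old) < min_delta:
            return ml, iteration
        ml_old = ml
    return ml_old, max_iter

# Toy stand-in for Estimate(): a marginal likelihood that converges to 1.
state = {"k": 0}

def toy_estimate():
    state["k"] += 1
    return 1.0 - 0.5 ** state["k"]

final_ml, n_iterations = run_until_converged(toy_estimate)
actions = [choose_action(0.3, False), choose_action(0.3, True),
           choose_action(-0.1, True), choose_action(-0.1, False)]
```

In a full implementation, `estimate` would wrap the updates of w, alpha, the thresholds and the marginal likelihood described for Estimate() above.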

Computational analysis
The maximization rule for the marginal likelihood is based on the MAP estimate which, in Eq. (10), requires the inversion of a matrix with $O(M^3)$ computational complexity and $O(M^2)$ memory. However, because we maximize the marginal likelihood constructively, $M \ll N$: first, we choose one sample from each category to initialize the algorithm; second, we benefit from sparse learning, as $M$ remains small (around a few dozen in our experiments). In this case, matrix inversion is not the main computational bottleneck in each iteration.
Although we apply an incremental strategy to train ISBOR, we have to compute the basis function matrix in the initialization step, which has $O(N^2)$ computational complexity and $O(N^2)$ memory. Combining these two parts, the total computational complexity of ISBOR is $O(N^2 + M^3)$ and the memory complexity is $O(N^2)$. However, we should mention that the basis function matrix can be computed in a pre-training session, so the computational complexity is essentially $O(N + M^3)$. For comparison, we report the computational and space complexity of SBOR and other state-of-the-art methods in Table 1. We see that ISBOR has the best computational complexity, and thus ISBOR is more efficient than the others, at least theoretically.
As computing the posterior covariance requires the inverse of the Hessian matrix, $(\mathbf{A} + \boldsymbol{\Phi}^T \mathbf{H} \boldsymbol{\Phi})^{-1}$, numerical singularity cannot be ruled out. Theoretically, $\mathbf{H}$ and $\mathbf{A}$ are diagonal matrices with positive elements and $\boldsymbol{\Phi}^T \mathbf{H} \boldsymbol{\Phi}$ is a quadratic form, so the matrix is positive definite. However, singularity problems still arise in practice, especially when some $\alpha$ are extremely large. In order to avoid ill-conditioning, we manually prune training samples with large $\alpha$ values.
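The effect of pruning on conditioning can be illustrated with a small sketch. The cut-off `alpha_max` is a hypothetical value chosen for the example (the text does not state the threshold used), and the Cholesky-based inversion is one common way to invert a positive definite matrix rather than a description of the authors' implementation.

```python
import numpy as np

def prune_and_invert(Phi, h, alpha, alpha_max=1e8):
    """Drop basis functions with alpha >= alpha_max, then invert A + Phi^T H Phi.

    alpha_max is a hypothetical threshold used only for this illustration.
    """
    keep = alpha < alpha_max
    Phi_k = Phi[:, keep]
    precision = Phi_k.T @ (h[:, None] * Phi_k) + np.diag(alpha[keep])
    chol = np.linalg.cholesky(precision)            # fails if not positive definite
    chol_inv = np.linalg.solve(chol, np.eye(chol.shape[0]))
    return keep, chol_inv.T @ chol_inv, precision   # Sigma = L^{-T} L^{-1}

rng = np.random.default_rng(0)
Phi = rng.uniform(size=(30, 5))
h = np.full(30, 0.5)                                # positive diagonal of H
alpha = np.array([1.0, 2.0, 1e12, 0.5, 1e12])       # two effectively irrelevant bases

keep, Sigma, precision = prune_and_invert(Phi, h, alpha)
full_precision = Phi.T @ (h[:, None] * Phi) + np.diag(alpha)
cond_full = np.linalg.cond(full_precision)
cond_pruned = np.linalg.cond(precision)
```

Removing the two columns with huge alpha shrinks the spread of the eigenvalues by many orders of magnitude, which is the practical reason for pruning before inversion.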

Sparsity analysis
The simple Gaussian prior, working as an L2-regularization in the posterior model, leads to a non-sparse MAP estimate. However, with the Gamma hyper-prior, the real prior over $\mathbf{w}$ follows a Student's t distribution, which is considered a sparse prior with a sharp peak at 0 [14, Section 5.1]. During inference, we do not integrate out $\boldsymbol{\alpha}$, which implies that $\boldsymbol{\alpha}$ directly controls sparsity, which in turn means that for irrelevant vectors the corresponding $\alpha$ should be large. However, the learned $\alpha$ in the sequential model are relatively small: we only add to the model potentially relevant samples, whose $\alpha$ are essentially small. There is no reason to learn the $\alpha$ of samples excluded from the model, which have large values.

Experimental Evaluation
Our experimental evaluation aims at addressing the following three research questions.
RQ1 Efficacy: Is the generalization performance of the proposed algorithm, ISBOR, comparable to other baselines?
RQ2 Efficiency: Does fast marginal analysis reduce ISBOR's computational complexity compared to baselines?
RQ3 Sparseness: Can ISBOR achieve competitive predictions based only on a small subset of the training set?

Experimental design
The research questions listed above lead us to two experimental designs. The first involves a synthetic dataset to give us an understanding of the efficacy, efficiency and sparsity. The second is on benchmark datasets, i.e., seven widely used ordinal datasets, to extensively evaluate the performance of ISBOR.

Datasets
Synthetic dataset. To create a synthetic dataset we follow the data-generating strategy in [29]. First, 21,000 two-dimensional points are sampled within the square area $[0, 10] \times [0, 10]$ under a uniform distribution. Second, each point is assigned a score by the function $f(\mathbf{x}) = 10(x_1 - 0.5)(x_2 - 0.5) + \epsilon$, where $\epsilon \sim \mathcal{N}(0, 0.5^2)$ acts as Gaussian random noise. Finally, we choose six thresholds $\{b_0, \ldots, b_5\} = \{-\infty, -60, -9, 15, 60, +\infty\}$, and each point is assigned a category by computing:
$$y = i \quad \text{such that} \quad b_{i-1} < f(\mathbf{x}) \le b_i.$$
In this manner, we generate a five-category dataset; the numbers of data points assigned to each category are 4431, 4535, 3949, 3780 and 4305, respectively. We choose 10 different training set sizes: 1000, 2000, . . ., 10000, and use the rest of the data as test sets. For each training set size, we randomly generate 30 different partitions. Then, the experiments are conducted on all 30 partitions.
Benchmark datasets. We also compare ISBOR with five algorithms on seven benchmark datasets. The details of the benchmark datasets are summarized in Table 2. Each benchmark dataset is randomly split into 20 partitions.
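The synthetic recipe above can be sketched as follows. One detail is an assumption: with the offsets printed in the score function (0.5), scores on the [0, 10] x [0, 10] square cannot fall below roughly -47.5, so the first category (scores at most -60) would be empty. The sketch therefore defaults to offsets of 5.0, which yields class proportions close to the reported counts; this is a guess at the intended function, not a confirmed correction. The function name and the seed are hypothetical.

```python
import numpy as np

def make_synthetic(n=21000, offset=5.0, noise_std=0.5, seed=0):
    """Sample points uniformly on [0, 10]^2, score them, and bin into 5 categories.

    offset=5.0 is an assumption; the printed value 0.5 leaves category 1 empty.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 10.0, size=(n, 2))
    score = 10.0 * (X[:, 0] - offset) * (X[:, 1] - offset)
    score = score + rng.normal(0.0, noise_std, size=n)
    inner = np.array([-60.0, -9.0, 15.0, 60.0])         # finite thresholds b_1..b_4
    y = np.searchsorted(inner, score, side="left") + 1  # b_{i-1} < f <= b_i
    return X, y

X, y = make_synthetic()
counts = np.bincount(y, minlength=6)[1:]                 # points per category 1..5

# Same recipe with the offsets exactly as printed in the text.
_, y_printed = make_synthetic(offset=0.5)
counts_printed = np.bincount(y_printed, minlength=6)[1:]

# One random train/test partition with 1000 training points.
perm = np.random.default_rng(1).permutation(len(y))
train_idx, test_idx = perm[:1000], perm[1000:]
```

Repeating the last two lines with different seeds and training sizes would produce the 30 partitions per size described above.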

Metrics
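We use Mean Absolute Error (MAE) to measure the efficacy:
$$\mathrm{MAE} = \frac{1}{N} \sum_{n=1}^{N} |\hat{y}_n - y_n|,$$
where $\hat{y}_n$ is the predicted category and the sum runs over the $N$ evaluated samples. As for efficiency, we choose running time (in seconds) as the measurement.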
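For concreteness, MAE on ordinal labels can be computed as in the sketch below; the function name and the example labels are made up for illustration.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average absolute distance between predicted and true categories."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_pred - y_true)))

# Two of five predictions are off, by one and by two categories.
mae = mean_absolute_error([1, 2, 3, 4, 5], [1, 3, 3, 2, 5])   # (0+1+0+2+0)/5
```

Unlike the zero-one error, MAE penalizes a prediction that is two categories away twice as much as one that is a single category away, which is why it suits ordinal problems.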

Methods used for comparison
We choose KDOR, GPOR, SVOR, ISVOR and SBOR, discussed in the related work section, as baselines. We use the ORCA package [5] (in MATLAB) for KDOR. The authors of SVOR and GPOR provide a publicly available implementation in C. We use a MATLAB implementation of ISVOR shared by the authors. SBOR and ISBOR are implemented in MATLAB.

Settings and parameters
We choose the Gaussian RBF in Eq. (2) as the basis function for each algorithm. We initialize ISBOR by setting $\alpha = 10^{-3}$ and $\sigma = 1$. We select the kernel width via 5-fold cross-validation on the training set within the values of $\theta \in \{10^{-2}, 10^{-1}, \ldots, 10\}$. GPOR automatically learns the hyper-parameters and does not require any pre-selection process. For the other methods, we follow the model selection process in [5] and use a nested 5-fold cross-validation on the training set to search for the best hyper-parameters. Specifically, we choose $\theta \in \{10^{-3}, 10^{-2}, \ldots, 10^{3}\}$ for every algorithm. The additional regularization parameters of SVOR and ISVOR are chosen within the values of $c \in \{10^{-1}, \ldots, 10^{3}\}$. For KDOR, we choose the regularization parameter within the range of $c \in \{0.1, 1, 10\}$, since the regularization parameter of KDOR has a different interpretation from the one in SVM. Additionally, KDOR requires another singularity-avoiding parameter, which is chosen in the range of $u \in \{10^{-6}, 10^{-5}, \ldots, 10^{-1}\}$.
Cross-validation is conducted using MAE. That is, once the hyper-parameters with the lowest MAE are obtained, we apply them to the whole training set and then validate them on the test sets.

Efficacy
We begin by addressing RQ1 concerning efficacy. We first consider the results on the synthetic dataset. Figure 2(a) shows the performance in terms of MAE on the synthetic dataset. From the figure, we see that, other than ISVOR, all the algorithms work well on the synthetic dataset in terms of efficacy. Specifically, ISBOR and SVOR are the two best performing algorithms. When the data sizes are larger than 5000, SVOR outperforms ISBOR, but the gaps are small. Next, we turn to the benchmark datasets. The MAE scores are presented in Table 3 (top half). The results are averaged over 20 partitions.
To determine the significance of observed differences, we use the Wilcoxon test [30,31] and compare the efficacy of each pair of algorithms. Since we compare 6 algorithms, there are 30 comparisons for each dataset in total. We choose the significance level $\alpha = 0.1$ and take the number of comparisons into account, obtaining the corrected significance level $\alpha = 0.1/30 \approx 0.0033$. For each algorithm, we record the number of statistically significant wins, losses (or failures in finishing the training on time) and draws. The Wilcoxon test results are reported in Table 4.
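The comparison count and the corrected significance level can be reproduced with a few lines. The sketch assumes that the 30 comparisons are the ordered (winner, loser) pairs among the six algorithms, which is the reading consistent with 6 x 5 = 30; the p-values in the example are invented purely to show how the corrected level is applied.

```python
from itertools import permutations

algorithms = ["KDOR", "GPOR", "SVOR", "ISVOR", "SBOR", "ISBOR"]

# Ordered pairs (a, b) with a != b: 6 * 5 = 30 comparisons per dataset.
pairs = list(permutations(algorithms, 2))
alpha_level = 0.1
alpha_corrected = alpha_level / len(pairs)   # Bonferroni-style correction

def is_significant(p_value, level=alpha_corrected):
    """A difference counts as a win or loss only below the corrected level."""
    return p_value < level

# Hypothetical p-values, for illustration only.
example = {0.0004: is_significant(0.0004), 0.02: is_significant(0.02)}
```

With the corrected level of about 0.0033, a p-value of 0.02 that would pass an uncorrected 0.1 threshold is recorded as a draw rather than a win.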
Based on the top half of Table 3 and Table 4, we find that SVOR is the best performing ordinal regression algorithm in terms of MAE. Specifically, SVOR wins 24 times out of 35 pairwise comparisons. To sum up, these results answer RQ1 as follows: although SVOR has the best generalization performance, ISBOR outperforms the other baselines and is comparable to SVOR.

Efficiency
We turn to RQ2. We report the running time of the competing algorithms on the synthetic dataset with different data scales in Figure 2(b). Generally, the implementations in C run much faster than those in pure MATLAB. To suppress this effect, we compare the running times on a logarithmic scale. We omit the results of GPOR from the plot because, after running for 24 hours, GPOR failed to complete any run on any partition.
Considering Figure 2(b), when it comes to efficiency, ISBOR is faster than all algorithms except for SVOR, which is implemented in C. Compared to SBOR, which can be regarded as the offline version of ISBOR, the gap between ISBOR and SBOR grows as the size of the data increases. On data of size 10000, ISBOR is about 10 times faster than SBOR. These results demonstrate that incremental learning together with the sparseness assumption accelerates the training of ISBOR. In summary, Figure 2 shows that ISBOR can be an efficient ordinal regression algorithm while preserving a prediction accuracy comparable to SVOR.
From the bottom part of Table 3, we notice that on the small datasets, ISBOR does not show any advantage in running time. However, on the large datasets, ISBOR outperforms the baselines. Specifically, we see a trend: the larger the dataset, the bigger the gap between ISBOR and the batch algorithms. This trend provides an answer to RQ2: the incremental setting makes ISBOR a faster OR algorithm.

Sparseness
Finally, we address RQ3. Since GPOR and KDOR make predictions based on all training samples, in Table 5 we only report the number of support or relevant samples of SVOR, ISVOR, SBOR and ISBOR so as to answer the sparseness question (RQ3). Analyzing Table 5, we notice that the sparse Bayes based SBOR and ISBOR employ much smaller numbers of training samples to make predictions than the SVM-based SVOR and ISVOR. Among the seven benchmark datasets, ISBOR wins 5 times and SBOR wins 2 times, which supports our claim that ISBOR is a parsimonious ordinal regression algorithm and can make effective predictions based on a small subset of the training set. This finding answers RQ3 on sparseness.

Conclusion
We have presented a novel incremental ordinal regression algorithm within an efficient sparse Bayesian learning framework. Instead of processing the whole training set in one go, the proposed algorithm can incrementally learn from representations of training samples and has linear computational complexity in the training data size. Our empirical results show that Incremental Sparse Bayesian Ordinal Regression (ISBOR) is comparable or superior to state-of-the-art OR algorithms based on basis functions in terms of efficacy, efficiency and sparseness.
We hope that this work paves the way for research into large-scale ordinal regression. We believe that the design of ISBOR can be improved in multiple directions. From a Bayesian viewpoint, a more elegant way to optimize the hyper-parameters would be to maximize $p(\boldsymbol{\eta} \mid D)$ rather than $p(D \mid \boldsymbol{\eta})$ with additional hyper-assumptions. This is achievable via other approximate inference methods like variational Bayes and expectation propagation [32, Chapter 10]. From an application view, we can equip ISBOR with other sparse Bayesian architectures and adapt it to other problems like semi-supervised learning [33,8,23] and feature selection [34,35]. From a ranking viewpoint, higher positions are more important. So far, ISBOR ignores pair-wise preferences and considers each position equally important, which amounts to a point-wise approach. Another promising future direction, therefore, is to take pair-wise position information into account and apply ISBOR to ranking problems.

Figure 1: The ordinal posterior and its first and second derivatives.

Figure 2: MAE and running time of OR algorithms on the synthetic dataset.

Table 1: Computational and space complexity of ordinal regression algorithms. N and M represent the number of training samples and the number of relevant and/or support samples, respectively.

Table 3: Benchmark results: MAE and running time. Standard deviations (of MAE) are indicated in brackets. Failure to complete all runs in 24 hours is indicated with '-'; best results are marked in boldface, second best in italics.

Table 4: Wilcoxon tests for the MAE results obtained using the benchmark datasets and reported in Table 3. ISBOR, the second best performing algorithm, wins 17 comparisons. Because of the time limitation, GPOR fails to complete the experiments on 5 datasets and performs worse. The remaining algorithms perform similarly to one another and win 11 times.

Table 5: Relevant and support samples used on the benchmark datasets. Best results are marked in boldface, second best in italics.