1 Introduction

Quantifying the properties of interest is an integral part of many domains, e.g., assessing the condition of a patient [27], estimating the risk of an investment [1], or predicting the binding affinity of a ligand [4] when developing new drugs. Various measuring technologies and sensors are devised to quantify such properties of interest, which are in turn used to inform decisions and take appropriate actions. However, the properties of interest are often not easy to obtain, either because they are difficult to measure directly or because they are completely unobservable. This is usually the case when the properties are conceptual, i.e. latent constructs, such as health, satisfaction, and even intelligence. Under these circumstances, other measurable characteristics, considered related to and informative of the true target, are observed and used as surrogate variables. In clinical settings, variables like temperature, blood pressure and various biomarkers measured from tissues are commonly tracked and considered when determining the health of a patient.

Typically, heuristic rules are devised to map these surrogate variables into the desired score. The process of deciding on these heuristic rules (or scoring functions) is usually long and tedious. For example, disease severity scores that are needed in clinical practice for patient diagnostics require years of effort and consensus of the medical community before the scoring functions can become part of the protocols. Fortunately, developments in machine learning and increasing amounts of collected data allow an alternative and complementary way of engineering scoring functions, by extracting rules automatically from the data.

Algorithms for learning scoring functions from data were previously proposed, mainly in the medical domain, with the objective of learning disease severity scores [11, 12, 21, 28, 31]. Initial approaches posed the problem as traditional supervised learning tasks of classification [21, 28] and regression [31]. However, classification and regression approaches require the scores to be available up front, which limits their applicability to problems where the score, or a good surrogate of it, is already accessible. The approach in [11, 12] introduces the appealing idea that there is a more convenient alternative form of supervised information from which to learn the scoring function: ranked pairs are much easier to obtain than direct score estimates, and, moreover, learning from pairs of ranked examples may result in more reliable and robust scoring functions.

In this work, we extend the suggested ranking-based approach [11] for score learning to multitask settings. The effort is motivated by applications in which there are multiple related tasks, each with a limited amount of data. Related tasks commonly share underlying regularities which can be learned more accurately by modeling all tasks together. For example, in education, scores on different subjects (e.g. Math and English) depend on the same characteristics of a particular student and a particular school. In the medical domain, disease severity scores for related illnesses (e.g. various respiratory viral infections) are expected to share common underlying biological mechanisms.

Consequently, we propose a novel multitask formulation for learning scoring functions from pairwise comparisons, which enforces structural regularities on the joint parameter space using matrix norm regularizations. As an additional contribution, we develop an optimization algorithm in the form of an alternating minimization scheme based on a proximal gradient method. We evaluated the proposed approach on a synthetic dataset and two real-world applications. The objective of the first application is learning exam scores of elementary school pupils, while the objective of the second is learning the tolerance to respiratory viral infections in humans. The results showed increased prediction accuracy of the proposed approach over learning each task individually.

2 Related Work

Early efforts to learn scoring functions depended on complete supervised information (e.g. classification and regression tasks). In the classification setting, where discrete class labels are provided, classification methods were used to estimate the probability of a sample belonging to a certain class; these probabilities were then used as a scoring function. For example, the method in [28] uses a sparsity-inducing \(L_1\) norm in combination with a classical logistic loss function to learn a disease severity scoring function for assessing the abnormality of the skull in craniosynostosis cases.

Another similar approach is to learn the scoring function in a regression manner from a continuous outcome. In [31], Alzheimer’s disease severity, as measured by cognitive scores, is modeled as a (temporal) multi-task regression problem using the fused sparse group lasso approach. That approach was concerned primarily with the progression of the disease; hence, the multi-task problem was formulated by considering each time step as a separate task. In contrast, we are interested in multiple score mappings from a single time-point set of measurements. There is also work on multitask learning to rank in the context of web search results ranking [6], where the ranking function is learned with gradient boosted trees from ranking scores provided by human experts.

The problem with such completely supervised methods is the necessity of providing direct values of the scores for training, which renders these approaches less useful in settings where the characteristics of interest are latent and not directly accessible. However, rather than giving direct estimates of the score, the easier task is often comparing two samples and asserting whether one has a higher score than the other. Ranking SVM [18] was the first approach that recognized the benefits of learning from ordered pairs of samples. This method was applied to learn an improved relevance function for document retrieval from click-through data. The main insight was that clicked links are more relevant to the search than non-clicked ones, and that such data is much more abundant than user-provided rankings. Recently, the ranking SVM-based method was adopted for sepsis severity score learning [11] and extended to temporal applications by introducing a term that ensures gradual score change over consecutive time points.

Multitask learning is based on the idea that generalization (predictive performance) can be increased by accounting for the intrinsic relationships among multiple tasks. The multitask approach is found particularly effective when the number of samples per task is small. To the best of our knowledge, there are no published multitask formulations for ranking-based scoring functions, that is, for methods that learn from pairwise comparisons. The closest approaches are the previously mentioned multitask regression-based models for Alzheimer’s disease progression [31] and search results ranking [6]. Other multitask regression methods learn the structure among the tasks using norm regularization [30], or utilize a fixed relatedness structure [23] obtained from domain knowledge [25] or learned from statistical correlation [24]. However, since they are not directly proposed for ranking-based learning of scoring functions, we do not consider or compare against them in this work.

The main problem in multi-task learning is finding the most appropriate assumption on how the tasks are related and incorporating that assumption into the model. Typically, in linear models, such structural assumptions are imposed on the joint parameter matrix, where rows correspond to features and columns to different tasks. Kernel methods assume that all tasks are related and similar [13], while some methods enforce tasks to be grouped into clusters [16]. For example, the “dirty model” [17] encourages block-structured row-sparsity in the joint parameter matrix with the \(\Vert . \Vert _{1,\infty }\) norm, and element-wise sparsity with \(\Vert .\Vert _{1,1}\). The robust approach [14] selects sparse rows of features for related tasks with \(\Vert .\Vert _{2,1}\) and dense columns for outlier tasks with \(\Vert .\Vert _{1,2}\), in order to discern between related and unrelated tasks. Other approaches assume some shared common set of features [3] or a shared common subspace [2, 9]. The approach proposed in [10] attempts to learn such a relatedness subspace with the trace (nuclear) norm \(\Vert .\Vert _*\) by encouraging the parameter matrix to have low rank, and finds outlier tasks with an additional sparse group norm \(\Vert .\Vert _{1,2}\).

In this work, we use a regularization composed of the trace norm [10] and a grouped Lasso penalty [3] to jointly learn multiple ranking-based scoring tasks from temporal data.

3 Model

Let us assume that we have N samples (examples), where each sample i is represented as \(X_i\in \mathbb {R}^d\), and where \(X_{ij}\) is the value (measurement) of feature \(j\in \{1,2,\ldots ,d\}\) for sample \(i\in \{1,2,\ldots ,N\}\). Let us assume that \(y_i\in \mathbb {R}\) represents the property of interest (outcome variable) for sample i. A scoring function \(score:\mathbb {R}^d \rightarrow \mathbb {R}\) is then a mapping \(X_i \mapsto y_i^\prime \) that provides a close estimate \(y_i^\prime \) of the true score \(y_i\).

However, in many cases the values of the true scoring function are difficult to obtain. In such situations, it is easier to assess the ranking between the scores of two samples p and q, i.e. to assert that one has a higher perceived score than the other: \(score(X_p)>score(X_q)\). Therefore, a set of such ordered pairs can be used to find a projection in the space of measured features that preserves the orderings as well as possible, and that can serve as a scoring function.

Moreover, measurements collected on multiple occasions over time might belong to the same subject; in this case, the measurements at each time step are treated as separate samples. We assume that the outcome variable changes gradually (smoothly) over time for the same subject, e.g. the disease severity score changes smoothly over consecutive time points for the same patient. This assumption helps improve the quality of the scoring function. We assume that \(X_p\) represents the feature vector for sample p (which could be one particular subject at one particular time point).

In this work, we constrain such functional mapping score to the linear case, where the score estimate is computed as a weighted sum of the measured characteristics: \(score(X)=w^TX\). Therefore, the problem of learning the scoring function becomes finding the appropriate weight (or parameter) vector \(w\in \mathbb {R}^d\).

3.1 Single Task Model Formulation

Maximizing the number of correctly ordered training pairs can be performed using the soft max-margin framework expressed in a Hinge loss form (1), as suggested in [18].

$$\begin{aligned} max(0,1-(X_p-X_q)w) \end{aligned}$$
(1)

If sample p should have higher score compared to sample q, the formulation (1) will favor the weighted difference \((X_p-X_q)w\) that is positive and greater than 1, thus even achieving some margin in the score difference.

The \(L_2\) norm on the weight vector, \(\Vert w \Vert ^2\), is introduced to regularize the magnitude of the weights, turning the problem into a simultaneous maximization of correct ordering and of the normalized margin.

Gradual (smooth) change of the scoring function over time can be obtained by penalizing large changes of the score over short time intervals (e.g. for two samples \(X_{i+1}^s, X_i^s\) of the same subject s). In [12] this effect is achieved by using the temporal smoothness term:

$$\begin{aligned} \Bigg (\frac{(X_{i+1}^s-X_i^s)w}{(t^s_{i+1}-t^s_i)}\Bigg )^2, \end{aligned}$$
(2)

which essentially ensures that the squared magnitude of the difference, normalized by the time interval length, is kept low.

Therefore, as the single-task formulation of ranking-based scoring function learning, we adopted the Linear Disease Severity Score Learning formulation [11], which combines the attractive properties of ranking SVM [18] with the temporal smoothness term (2) that enforces gradual change of the scoring function over time:

$$\begin{aligned} \begin{aligned} \hat{w}=\underset{w}{\mathrm {argmin}} \frac{1}{2} \Vert w\Vert _2^2 + c \sum _{\{p,q\} \in O} max(0,1-(X_p-X_q)w) \\ +\ b \sum _{\{i,i+1\}_s \in S} \Bigg (\frac{(X_{i+1}^s-X_i^s)w}{(t^s_{i+1}-t^s_i)}\Bigg )^2 \end{aligned} \end{aligned}$$
(3)

Every measurement (row) vector \(X_i\), \(i\in \{1,2,\ldots ,N\}\), has an associated time stamp t, while \(\hat{w} \in \mathbb {R}^d\) denotes the solution of objective (3).

The set O is composed of ordered pairs \(\{p,q\}\), where p has a higher rank than q (p is perceived to have a higher score than q), corresponding to the measurement vectors \(X_p\) and \(X_q\), respectively. The sum of the Hinge loss terms over all pairs from O serves to reduce the number of incorrectly ordered pairs.

The set of all consecutive measurement pairs across all subjects is denoted S, and the sum of the Temporal smoothness terms in Eq. (3) penalizes high rates of change in score values between consecutive time steps \(t_i\) and \(t_{i+1}\) for each subject s. The scalar constants c and b are hyperparameters that determine the cost of the respective loss terms, the Hinge loss and the Temporal loss.

We aggregate the differences of measurements in the Hinge loss term into a single data matrix \(D_{k \times d}\), where k is the number of pairs in the comparison set O. Similarly, we write the measurement and temporal difference ratios in the Temporal loss term as a matrix \(R_{l \times d}\), where l is the number of pairs in the consecutive measurements set S. We aggregate the \(L_2\) norm and temporal smoothness terms (they essentially weight the square of the optimization parameters) into a single weighted quadratic term \(\frac{1}{2}w^T Qw\), where Q is a constant square matrix defined in Eq. (4):

$$\begin{aligned} Q=I+2bR^TR, \end{aligned}$$
(4)

I being the d-dimensional identity matrix.
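For concreteness, a minimal numpy sketch of how D, R and Q from Eq. (4) could be assembled from raw data; the pair and subject index lists are assumed inputs for the illustration, not part of the original formulation.

```python
import numpy as np

def build_matrices(X, t, ordered_pairs, consecutive_pairs, b):
    """Assemble the pair-difference matrix D, the temporal-ratio matrix R,
    and the quadratic term Q = I + 2*b*R^T R from Eq. (4).

    X                 : (N, d) array of measurements
    t                 : (N,) array of time stamps
    ordered_pairs     : list of (p, q) with score(X_p) > score(X_q)
    consecutive_pairs : list of (i, j) index pairs of consecutive samples of one subject
    b                 : temporal smoothness hyperparameter
    """
    d = X.shape[1]
    # Rows of D are differences X_p - X_q for each supervised ordered pair.
    D = np.array([X[p] - X[q] for p, q in ordered_pairs])
    # Rows of R are consecutive differences normalized by the time interval.
    R = np.array([(X[j] - X[i]) / (t[j] - t[i]) for i, j in consecutive_pairs])
    Q = np.eye(d) + 2.0 * b * R.T @ R
    return D, R, Q
```

With these matrices, objective (5) below reduces, for a candidate weight vector w, to `0.5 * w @ Q @ w + c * np.maximum(0, 1 - D @ w).sum()`.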

The formulation (3) can now be rewritten more concisely as (5):

$$\begin{aligned} \hat{w}=\underset{w}{\mathrm {argmin}} \frac{1}{2} w^T Q w + c \sum _{i} max(0,1-D^iw) \end{aligned}$$
(5)

3.2 Multitask Formulation

As mentioned before, when the amount of data for training the scoring function of a single task (5) is limited, it is beneficial to exploit the relatedness among multiple similar tasks by learning them together, as illustrated in Fig. 1.

Fig. 1. Illustration of joint training of multiple ranking-based score learning tasks. Three distinct tasks are depicted, where the measured data, in combination with supervision in the form of ordered pairs, are jointly optimized to obtain the scoring function parameters, represented as a parameter matrix. The parameter matrix is typically regularized to encode structural assumptions regarding task relatedness.

For m different tasks, individual parameter vectors \(w_i\) are aligned into a matrix \(W_{d \times m}\), and a joint objective is obtained as a superposition of individual losses (Eq. (5)) over the multiple tasks \(i \in \{1,2,...,m\}\):

$$\begin{aligned} \underset{W}{\mathrm {argmin}}\sum _{i=1}^{m} \left( \frac{1}{2} W_i^T Q_i W_i + c \sum _{j} max(0,1-D^j_i W_i)\right) \end{aligned}$$
(6)

Instead of the non-smooth Hinge loss \(L(a) = max(0,a)\) in Eq. (6), we work with a differentiable approximation in the form of the Huber loss [11]:

$$\begin{aligned} L_{h}(a)={\left\{ \begin{array}{ll} 0 &{} \text {, if}\,\,a\, <\, -h\\ \frac{(a+h)^2}{4h} &{} \text {, if}\,\, |a|\, \le \,\, h\\ a &{} \text {, if}\,\, a\,\, >\,\, h. \end{array}\right. } \end{aligned}$$
(7)

where the approximation threshold h can be chosen arbitrarily small.
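Written as code, Eq. (7) and its derivative (needed later when forming the gradient of \(\mathcal {L}_1\)) could look as follows; this is a direct transcription, with the threshold h as a small constant:

```python
import numpy as np

def huber_hinge(a, h=0.01):
    """Smoothed hinge approximation of max(0, a) from Eq. (7)."""
    a = np.asarray(a, dtype=float)
    out = np.where(a > h, a, 0.0)                      # linear part for a > h, zero for a < -h
    mid = np.abs(a) <= h                               # quadratic part for |a| <= h
    return np.where(mid, (a + h) ** 2 / (4.0 * h), out)

def huber_hinge_grad(a, h=0.01):
    """Derivative of the smoothed hinge: 0 for a < -h, (a+h)/(2h) for |a| <= h, 1 for a > h."""
    a = np.asarray(a, dtype=float)
    grad = np.where(a > h, 1.0, 0.0)
    mid = np.abs(a) <= h
    return np.where(mid, (a + h) / (2.0 * h), grad)
```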

Further, we regularize the objective in Eq. (6) with a joint norm on the parameter matrix, \(\Vert W\Vert _{p,q} = \big (\sum _{i}\big (\sum _{j}|W_{ij}|^p\big )^{\frac{q}{p}}\big )^{\frac{1}{q}}\). For \(p=2\) and \(q=1\), this penalty is known as the group Lasso on the rows of W, which forces sparsity in the parameter weights corresponding to certain features [3]. Additionally, we introduce the trace norm \(L_*\) in order to capture the low-rank component, in other words, the parameter weight pattern shared among all the tasks. To accommodate such a setup, which will be further clarified in the Optimization section, the parameter matrix W is split into two matrices A and B, with \(W=A+B\).

The Multitask Ranking Based Scoring Function Learning (MultiRBSFL) objective is now given in Eq. (8). It takes as input two matrices per task i obtained from the data, \(Q^i_{d \times d}\) and \(D^i_{k \times d}\), and the hyperparameters b, c, \(\lambda _1\) and \(\lambda _2\), weighting the influence of the Temporal loss, Huber loss, trace norm and sparse group norm, respectively.

$$\begin{aligned} \underset{W=A+B}{\text {argmin}}&\; \mathcal {L}_1+\lambda _1 \Vert A\Vert _* + \lambda _2 \Vert B\Vert _{2,1} \end{aligned}$$
(8)

where

$$\begin{aligned} \mathcal {L}_1&= \frac{1}{m}\sum _{i=1}^{m}\left( \frac{1}{2} (A^i+B^i)^T Q^i (A^i+B^i) + c \sum _{j=1}^{k} L_h(1-D^i_j(A^i+B^i)) \right) \end{aligned}$$
(9)

\(A^i\) and \(B^i\) are column vectors in \(\mathbb {R}^{d \times 1}\), and \(D_j^i\) is a row vector in \(\mathbb {R}^{1 \times d}\).
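To make the smooth part concrete, the following is a hedged sketch of evaluating \(\mathcal {L}_1\) from Eq. (9) and its gradient with respect to A (with B held fixed). It reuses the `huber_hinge` helpers sketched above, and the per-task matrices \(Q^i\), \(D^i\) are those defined in Sect. 3.1; the gradient with respect to B is identical by symmetry, since \(\mathcal {L}_1\) depends only on \(W=A+B\).

```python
import numpy as np

def loss_L1_and_grad_A(A, B, Q_list, D_list, c, h=0.01):
    """Evaluate L1 from Eq. (9) and its gradient w.r.t. A (B fixed).

    A, B    : (d, m) parameter matrices with W = A + B
    Q_list  : list of m (d, d) matrices Q_i
    D_list  : list of m (k_i, d) pair-difference matrices D_i
    """
    m = len(Q_list)
    loss, grad = 0.0, np.zeros_like(A)
    for i in range(m):
        w = A[:, i] + B[:, i]
        margins = 1.0 - D_list[i] @ w
        loss += 0.5 * w @ Q_list[i] @ w + c * huber_hinge(margins, h).sum()
        # Gradient of the quadratic term is Q_i w (Q_i is symmetric); the Huber
        # term contributes -c * D_i^T L_h'(margins) by the chain rule.
        grad[:, i] = Q_list[i] @ w - c * D_list[i].T @ huber_hinge_grad(margins, h)
    return loss / m, grad / m
```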

4 Optimization

The objective (8) is composed of smooth and non-smooth terms. Although the regularization terms are separable in A and B, the loss term \(\mathcal {L}_1\) is not. Therefore, we solve the problem using an alternating minimization scheme: in each iteration, we fix A and minimize (8) with respect to B, and then fix B and minimize (8) with respect to A. The problem thus decomposes into the two optimizations below, each of which is solved as explained in the next subsection.

Fix A

$$\begin{aligned} \underset{B}{\text {argmin}}&\; \mathcal {L}_1 + \lambda _2 \left\| B\right\| _{2,1} \end{aligned}$$
(10)

Fix B

$$\begin{aligned} \underset{A}{\text {argmin}}&\; \mathcal {L}_1 +\lambda _1 \left\| A\right\| _* \end{aligned}$$
(11)

In general, problems (10) and (11) can both be written as:

$$\begin{aligned} \underset{\varvec{\varTheta }}{\text {argmin}}&\; \mathcal {L}_1 + \gamma \left\| \varvec{\varTheta }\right\| _p, \end{aligned}$$
(12)

where \(\varvec{\varTheta }= \{A,B\}\) and \(p=\{*,\{2,1\}\}\).

The optimization (12) is convex. The expression \(\mathcal {L}_1\) is smooth and the regularization term (either the group lasso or the trace norm) is non-smooth. Therefore, we solve (12) using proximal methods.

4.1 Proximal Algorithm

We solve (12) using the proximal gradient method [20].

$$\begin{aligned} \varvec{\varTheta }^{k+1}&:= \mathbf{prox }_{\lambda \left\| \varvec{\varTheta }\right\| _p}(\varvec{\varTheta }^k - \lambda \nabla \mathcal {L}_1(\varvec{\varTheta }^k)) \nonumber \\&= \underset{\varvec{\varTheta }}{\text {argmin}} \left( \left\| \varvec{\varTheta }\right\| _p + \frac{1}{2\lambda } \left\| \varvec{\varTheta }- (\varvec{\varTheta }^k - \lambda \nabla \mathcal {L}_1(\varvec{\varTheta }^k))\right\| _2^2 \right) , \end{aligned}$$
(13)

where \(\mathbf{prox }_{\lambda \left\| \varvec{\varTheta }\right\| _p}\) is the proximal operator of the scaled function \(\lambda \left\| \varvec{\varTheta }\right\| _p\), \(\lambda \in (0,1/L] \) is a constant step size, and L is a Lipschitz constant of \(\nabla \mathcal {L}_1\). The proximal step in (13) can be computed analytically, and the proximal operator associated with each norm can be obtained as in [5].

Trace norm. Let us assume that \(M=U \varSigma V \) is the singular value decomposition of M, where \(\varSigma \) is a diagonal matrix whose entries \(\sigma _i\) are the singular values of the matrix M. The proximal operator of the trace norm is defined as [8]:

$$\begin{aligned} \mathbf{prox }_{\lambda \left\| .\right\| _* } (M) = U \mathbf{diag }(\mathbf{prox }_{\lambda \left\| .\right\| _1}(\sigma (M))) V \end{aligned}$$

i.e., the proximal operator of \(\left\| .\right\| _*\) can be calculated by carrying out a singular value decomposition of M and evaluating the proximal operator of the corresponding absolutely symmetric function at the singular values \(\sigma (M)\). Therefore,

$$\begin{aligned} \mathbf{prox }_{\lambda \left\| .\right\| _* } (M)&= U \mathbf{diag }(\overline{\sigma }_1,\overline{\sigma }_2,\ldots ,\overline{\sigma }_n) V, \end{aligned}$$
(14)

where:

$$\begin{aligned} \overline{\sigma }_i = {\left\{ \begin{array}{ll} \sigma _i - \lambda &{} \sigma _i \ge \lambda \\ 0 &{} -\lambda \le \sigma _i \le \lambda \\ \sigma _i + \lambda &{} \sigma _i \le -\lambda \end{array}\right. } \end{aligned}$$

Equation (14) is sometimes called the singular value thresholding operator.
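In code, the singular value thresholding operator of Eq. (14) takes only a few lines; since singular values are non-negative, only the upper branch of the soft-thresholding is active. A minimal numpy sketch:

```python
import numpy as np

def prox_trace_norm(M, lam):
    """Proximal operator of lam * ||.||_*: singular value thresholding, Eq. (14)."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    sigma_thr = np.maximum(sigma - lam, 0.0)   # soft-threshold the singular values
    return U @ np.diag(sigma_thr) @ Vt
```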

Group lasso norm. The proximal operator associated with the group lasso norm is defined as:

$$\begin{aligned} \left[ \mathbf{prox }_{\lambda \left\| .\right\| _{2,1} } (u)\right] _g = {\left\{ \begin{array}{ll} (1 - \frac{\lambda }{\left\| u_g\right\| _2}) u_g &{} \left\| u_g\right\| _2 > \lambda \\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
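The corresponding row-wise block soft-thresholding, again as a minimal sketch with the rows of the parameter matrix as the groups:

```python
import numpy as np

def prox_group_lasso(M, lam):
    """Proximal operator of lam * ||.||_{2,1}: block soft-thresholding of rows."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)              # ||u_g||_2 per row
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)  # rows with norm <= lam vanish
    return scale * M
```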

4.2 Step Size

In order to find an adaptive step size \(\lambda ^k\) in each iteration k, we employ the backtracking line search algorithm [7], which requires computing an upper bound for \(\mathcal {L}_1\). Since \(\mathcal {L}_1\) is convex and smooth, and \(\nabla \mathcal {L}_1\) is L-Lipschitz continuous, it follows that:

$$\begin{aligned} \mathcal {L}_1(\varvec{\varTheta }) \le \underbrace{\mathcal {L}_1(\varvec{\varTheta }^k) + \nabla \mathcal {L}_1(\varvec{\varTheta }^k)^T(\varvec{\varTheta }-\varvec{\varTheta }^k) + \frac{L}{2}\left\| \varvec{\varTheta }-\varvec{\varTheta }^k\right\| _2^2}_{\widehat{\mathcal {L}_1}_{\frac{1}{L}}(\varvec{\varTheta }, \varvec{\varTheta }^k)} \end{aligned}$$
(15)

By utilizing (15), it can be shown that the optimization (13) is equivalent to [20]:

$$\begin{aligned} \varvec{\varTheta }^{k+1}&:= \underset{\varvec{\varTheta }}{\text {argmin}}\; \widehat{\mathcal {L}_1}_{\lambda ^k}(\varvec{\varTheta }, \varvec{\varTheta }^k) + \left\| \varvec{\varTheta }\right\| _p \end{aligned}$$
(16)

where \(\lambda ^k=\frac{1}{L}\). So at each iteration, the function \(\mathcal {L}_1\) is linearized around the current point and the problem (16) is solved. The final fast proximal gradient method with backtracking is shown in Algorithm 1.

Algorithm 1. Fast proximal gradient method with backtracking.
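The pseudocode figure is not reproduced here. As an illustration only, the following sketch shows a plain (non-accelerated) proximal gradient loop with backtracking consistent with Eqs. (13), (15) and (16); the "fast" variant referenced above would add a momentum (extrapolation) step. The `loss_and_grad` and `prox` callables are assumed to be supplied by the caller (e.g. the helpers sketched earlier).

```python
import numpy as np

def prox_gradient(theta0, loss_and_grad, prox, gamma, lam0=1.0, eta=0.5,
                  max_iter=200, tol=1e-6):
    """Proximal gradient method with backtracking (cf. Eqs. (13), (15), (16)).

    theta0        : initial parameter matrix (A or B)
    loss_and_grad : callable returning (L1(theta), grad L1(theta))
    prox          : callable prox(M, t) for the non-smooth term scaled by t
    gamma         : regularization weight (lambda_1 or lambda_2)
    """
    theta, lam = theta0.copy(), lam0
    for _ in range(max_iter):
        f, g = loss_and_grad(theta)
        # Backtracking: shrink the step until the quadratic upper bound (15) holds.
        while True:
            theta_new = prox(theta - lam * g, lam * gamma)
            diff = theta_new - theta
            f_new, _ = loss_and_grad(theta_new)
            if f_new <= f + np.sum(g * diff) + np.sum(diff ** 2) / (2.0 * lam):
                break
            lam *= eta
        if np.linalg.norm(theta_new - theta) <= tol:
            return theta_new
        theta = theta_new
    return theta
```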

The final alternating minimization algorithm is shown in Algorithm 2.

Algorithm 2. Alternating minimization for the MultiRBSFL objective (8).
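Again as an illustration rather than the exact pseudocode, the outer alternating loop of Sect. 4 can be written as a short driver built on the sketches above; the function names and the fixed number of outer iterations are assumptions made for the example.

```python
import numpy as np

def multi_rbsfl(d, m, loss_and_grad_A, loss_and_grad_B, lam1, lam2, n_outer=20):
    """Alternating minimization for objective (8).

    loss_and_grad_A(A, B) / loss_and_grad_B(B, A) return L1 and its gradient
    with respect to the first argument, with the second held fixed.
    """
    A, B = np.zeros((d, m)), np.zeros((d, m))
    for _ in range(n_outer):
        # Fix A, solve (10): L1 + lam2 * ||B||_{2,1} via proximal gradient.
        B = prox_gradient(B, lambda T: loss_and_grad_B(T, A), prox_group_lasso, lam2)
        # Fix B, solve (11): L1 + lam1 * ||A||_* via proximal gradient.
        A = prox_gradient(A, lambda T: loss_and_grad_A(T, B), prox_trace_norm, lam1)
    return A + B  # W = A + B is the learned joint parameter matrix
```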

5 Empirical Evaluation

The proposed approach for multitask learning of ranking-based scoring functions is tested on one synthetic and two real-world datasets. We compared our MultiRBSFL approach against the following baseline approaches:

  1. \(L_2\) - independently learning scoring functions for each task (objective (3));

  2. \(L_1\) - independently learning sparse (\(L_1\) regularized) scoring functions for each task;

  3. \(L_*\) - learning multiple scoring functions by imposing low-rank regularization on their joint parameter matrix (\(L_*\) regularized objective (6));

  4. \(L_{2,1}\) - joint objective (6), regularized by the mixed \(\Vert .\Vert _{2,1}\) norm.

We denote our MultiRBSFL approach, which uses the composite low-rank and mixed-norm regularized joint objective (8), as \(L_*+L_{2,1}\), for consistency with the naming of the alternative approaches.

We measured predictive performance in terms of accuracy, defined as the fraction of correctly ordered test pairs. As the pairwise ranking relation is antisymmetric, it is sufficient to use only the positive training instances (i.e. pairs in which the first sample has the larger score). Test pairs are generated exclusively from examples not contained in the training set. The accuracy values reported in this study are obtained via 5-fold cross-validation.
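For reference, the metric itself is straightforward to compute; a minimal sketch assuming a learned weight vector `w_hat` for one task and a list of test pairs (p, q) in which p is ranked above q:

```python
import numpy as np

def pairwise_accuracy(X_test, test_pairs, w_hat):
    """Fraction of test pairs (p, q), with p ranked above q, whose estimated
    scores preserve the ordering: w_hat^T x_p > w_hat^T x_q."""
    scores = X_test @ w_hat
    correct = sum(scores[p] > scores[q] for p, q in test_pairs)
    return correct / len(test_pairs)
```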

5.1 Experiments on Synthetic Data

In this setting, a Gaussian process model with an exponential kernel was used to generate the temporal data. We compiled 250 processes to mimic \(d=250\) measured variables (features) per subject. Each process was used to generate a time series with 10 time points (10 samples). We followed the same principle to generate 10 different multivariate time series (subjects) for training and 10 subjects for testing, resulting in 100 training samples \(X_{100 \times 250}^{train}\) and 100 test samples \(X_{100 \times 250}^{test}\).

Four different tasks were created by randomly generating the weight matrix \(W_{250 \times 4}\) with only 5 nonzero rows, which corresponds to the \(L_{2,1}\) assumption (row sparsity). This row-wise sparse matrix was then superimposed with a dense rank-1 matrix, generated as the outer product of two random vectors, which matches the \(L_*\) trace norm part of the objective. True underlying scores on the four tasks, for each of the 250-dimensional samples (one time point of one subject), were calculated as the weighted sum of the feature values, \(XW\). A zero-mean random noise vector was subsequently added to the input data X to model measurement noise.
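A hedged sketch of this generation procedure; the kernel length-scale, noise level and the scaling of the rank-1 component are illustrative choices not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, n_subj, n_tasks, n_sparse = 250, 10, 10, 4, 5

def gp_series(T, length_scale=2.0):
    """One time series of length T from a GP with an exponential kernel."""
    t = np.arange(T)
    K = np.exp(-np.abs(t[:, None] - t[None, :]) / length_scale)
    return rng.multivariate_normal(np.zeros(T), K)

# Each subject: d feature trajectories of length T -> one (T, d) block of samples.
X_train = np.vstack([np.column_stack([gp_series(T) for _ in range(d)])
                     for _ in range(n_subj)])

# Row-sparse component (5 nonzero rows) plus a dense rank-1 component.
W = np.zeros((d, n_tasks))
W[rng.choice(d, n_sparse, replace=False)] = rng.normal(size=(n_sparse, n_tasks))
W += 0.1 * np.outer(rng.normal(size=d), rng.normal(size=n_tasks))

Y = X_train @ W                                         # true scores per task
X_train += rng.normal(scale=0.1, size=X_train.shape)    # measurement noise
```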

A training set is then obtained by making pairs out of samples whose scores are sufficiently different (in our case we set the threshold to 1). Pairs of examples were generated independently for each task based on their scores, totaling 14,187 pairs for all four tasks jointly. Test set pairs were generated in the same fashion, but with a smaller threshold, and consisted of 19,390 pairs. Training pairs were used to learn the weight matrix \(\hat{W}\), which was used to estimate the test scores from the test samples. The obtained estimates were used to infer the relative order of the test pairs. The accuracy (percentage of correctly ordered pairs) is reported in Table 1. It is no surprise that the proposed \(L_*+L_{2,1}\) approach achieves the highest accuracy on all four tasks, as the underlying assumptions were explicitly built into the synthetic example.

Table 1. Comparison of accuracy indicators (fraction of correctly ordered pairs) for alternative score learning methods on the synthetic data of four related tasks.

5.2 School Exam Score

Intelligence, as well as the capacity for understanding and using mathematics or languages, are all examples of properties that are latent, yet important and often evaluated (estimated). We tested the multitask score learning framework on data from an elementary school study [19], which contains longitudinal data on performance in Math and the English language for pupils in 50 inner London schools. In total there are scores for 3,236 exams (Math and English each), taken by 1,402 students over three consecutive school years. The goal is to rank the students’ performance on the Math and English tests based on the known score from the Raven's ability test and additional information such as demographics, social status, gender, class and school type. Distributions of scores for the two tasks are given in Fig. 2.

Fig. 2. Distributions of test scores for the Math and English tasks, respectively.

According to the results in Table 2, our \(L_*+L_{2,1}\) approach achieved the best predictive performance on both tasks.

Table 2. Comparison of accuracy indicators (fraction of correctly ordered pairs) for alternative score learning methods on the task of learning the performance on Math and English tests.

5.3 Tolerance to Infections Score

Tolerance is a host behavior that arises from interactions with a pathogen and describes the ability of the host to preserve fitness despite the presence of a large pathogen load. It is therefore defined as the change in host fitness (health) with respect to changes in pathogen load [22]. However, tolerance remains understudied and, despite the need, no established scoring function exists.

We analyzed three publicly available datasets that allow characterization of tolerance behavior in humans. The data come from human viral challenge studies [29] in which human volunteers were infected with H3N2 influenza, rhinovirus (HRV) and respiratory syncytial virus (RSV), respectively. For all subjects in each dataset, symptoms were recorded twice a day and quantified by the modified Jackson Score [15]. Thereafter, subjects were classified into “symptomatic” and “asymptomatic” groups based on the modified Jackson Score values. In addition, temporal viral load measurements are available for 28 “symptomatic” subjects, given in Table 3. Gene expression measurements (for 12,023 genes) were collected temporally, starting at a baseline (24 hours prior to inoculation with the virus) and measured at certain time points following the experimental procedure described in detail in [29], making a total of 16, 14 and 21 time-point measurements for the H3N2, HRV and RSV datasets, respectively. Table 3 shows the viral shedding and symptom scores for subjects who developed clinically relevant symptoms in the H3N2, HRV and RSV datasets.

Temporal measurements of symptoms (a proxy for fitness) and viral (pathogen) load for each subject were used to derive tolerance scores according to the definition given in [22]. In particular, the tolerance score for each subject was calculated by dividing the maximum viral load by the maximum symptom severity observed for that subject (Table 3). Gene expression measurements were used as explanatory variables in our ranking task.

Table 3. Tolerance scores (Ratio) derived by dividing maximum viral load (Max V) with maximum severity score (Max S).

The biological rationale behind the task relatedness is that the three pathogens are viruses that cause similar respiratory symptoms (runny nose, fever, cough) and are quantified by the same Jackson score, suggesting that some shared genetic mechanisms might be responsible for the disease manifestations. Consequently, we sought to learn the tolerance scoring functions jointly.

The tolerance scores were used to compile a set of ranked pairs, and the objective was to learn the scoring functions for tolerance to the H3N2, HRV and RSV viruses (three tasks) from high-dimensional gene expression data. Since optimizing over 12,023 dimensions is computationally expensive, we reduced the dimensionality of the data to the 100 genes most informative according to their correlation with the target. The results of learning the scoring functions with the different approaches are summarized in Table 4.
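The dimensionality reduction step can be implemented as a simple univariate filter; a minimal sketch of selecting the genes most correlated (in absolute value) with the target, where the exact correlation measure and tie handling are our assumptions:

```python
import numpy as np

def top_k_by_correlation(X, y, k=100):
    """Select the k features most correlated (in absolute value) with the target y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()) + 1e-12)
    idx = np.argsort(-np.abs(corr))[:k]      # indices of the k most informative features
    return X[:, idx], idx
```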

Table 4. Comparison of accuracy indicators (fraction of correctly ordered pairs) for alternative score learning methods on the tolerance to three viruses learning task.

The results in Table 4 show that the HRV task is the most difficult one in the described formulation. Although some alternative approaches achieved better accuracy on two of the tasks, the proposed approach achieved the best generalization trade-off, as indicated by the highest average (overall) accuracy.

6 Discussion and Conclusions

We proposed a method that jointly learns multiple scoring functions from sets of ranked examples. The approach utilizes a composite regularization, consisting of the trace norm and a row-wise grouped Lasso penalty, to impose structural regularity among the model parameters of different tasks. We also provide an optimization algorithm, based on alternating minimization and proximal gradient techniques, for solving the proposed convex MultiRBSFL objective.

The presented empirical evaluations on one synthetic and two real-world datasets suggest the benefits of the multitask approach for learning related ranking-based scoring functions. According to the results, the model with only \(L_*\) performs worse than the one with \(L_{2,1}\), probably because sparsity in features is a more dominant pattern in the data than the low-rank component. However, utilizing both \(L_*\) and \(L_{2,1}\) in the same model turned out to be most beneficial for the studied applications.

The proposed proximal gradient algorithm with alternating minimization for optimization of the multitask objective proved valuable for applications with low to moderate dimensionality of the feature space. However, as contemporary applications have an ever-increasing number of measured variables, more efficient and scalable optimization approaches will be required. One potential way to accelerate the proximal gradient algorithm is to adopt the approach proposed in [26].