1 Introduction

Quantifying the properties of interest is an integral part of many domains, e.g., assessing the condition of a patient [27], estimating the risk of an investment [1], or predicting the binding affinity of a ligand [4] when developing new drugs. Various measuring technologies and sensors are devised to quantify such properties of interest, which are in turn used to inform decisions and take appropriate actions. However, the properties of interest are often not easy to obtain, either because they are difficult to measure directly or because they are completely unobservable. This is usually the case when the properties are conceptual, i.e. latent constructs, such as health, satisfaction, and even intelligence. Under these circumstances, other measurable characteristics, considered related to and informative of the true target, are observed and used as surrogate variables. In clinical settings, variables like temperature, blood pressure and various biomarkers measured from tissues are commonly tracked and considered when determining the health of a patient.

Typically, heuristic rules are devised to map these surrogate variables into the desired score. The process of deciding on these heuristic rules (or scoring functions) is usually long and tedious. For example, disease severity scores that are needed in clinical practice for patient diagnostics require years of effort and consensus of the medical community before the scoring functions can become part of the protocols. Fortunately, developments in machine learning and increasing amounts of collected data allow an alternative and complementary way of engineering scoring functions, by extracting rules automatically from the data.

Algorithms for learning scoring functions from data were previously proposed, mainly in the medical domain, with the objective of learning disease severity scores [11, 12, 21, 28, 31]. Initial approaches posed the problem as traditional supervised learning tasks of classification [21, 28] and regression [31]. However, classification and regression approaches require the scores to be available up front, which limits their applicability to problems where the score, or a good surrogate of it, is already accessible. The approach in [11, 12] introduces the appealing idea that there is a more convenient alternative form of supervised information from which to learn the scoring function: ranked pairs are much easier to obtain than direct score estimates, and, moreover, learning from pairs of ranked examples may result in more reliable and robust scoring functions.

In this work, we extend the suggested ranking-based approach [11] for score learning to multitask settings. The effort is motivated by applications in which there are multiple related tasks, each with a limited amount of data. Related tasks commonly share underlying regularities which can be learned more accurately by modeling all tasks together. For example, in education, scores on different subjects (e.g. Math and English) depend on the same characteristics of a particular student and a particular school. In the medical domain, disease severity scores for related illnesses (e.g. various respiratory viral infections) are expected to share common underlying biological mechanisms.

Consequently, we propose a novel multitask formulation for learning scoring functions from pairwise comparisons, which enforces structural regularities on the joint parameter space using matrix norm regularizations. As an additional contribution, we develop an optimization algorithm in the form of an alternating minimization scheme based on a proximal gradient method. We evaluated the proposed approach on a synthetic dataset and two real-world applications. The objective of the first application is learning exam scores of elementary school pupils, while the objective of the second is learning the tolerance to respiratory viral infections in humans. The results showed increased prediction accuracy of the proposed approach over learning each task individually.

2 Related Work

Early efforts to learn scoring functions depended on complete supervised information (e.g. classification and regression tasks). In the classification setting, where discrete class labels are provided, classification methods were used to estimate the probability of a sample belonging to a certain class; these probabilities were then used as a scoring function. For example, the method in [28] uses a sparsity-inducing \(L_1\) norm in combination with a classical logistic loss function to learn a disease severity scoring function for assessing the abnormality of the skull in craniosynostosis cases.

Another similar approach is to learn the scoring function in a regression manner from a continuous outcome. In [31], Alzheimer’s disease severity, as measured by cognitive scores, is modeled as a (temporal) multi-task regression problem using the fused sparse group lasso approach. That approach was concerned primarily with the progression of the disease; hence, the multi-task problem was formulated by considering each time step as a separate task. In contrast, we are interested in multiple score mappings from a single time-point set of measurements. There is also work on multitask learning to rank in the context of web search results ranking [6], where the ranking function is learned with gradient boosted trees from ranking scores provided by human experts.

The problem with such completely supervised methods is the necessity of providing direct values of the scores for training, which renders these approaches less useful in settings where the characteristics of interest are latent and not directly accessible. However, rather than giving direct estimates of the score, the easier task is often comparing two samples and asserting whether one has a higher score than the other. Ranking SVM [18] was the first approach that recognized the benefits of learning from ordered pairs of samples. This method was applied to learn an improved relevance function for document retrieval from click-through data. The main insight was that clicked links are more relevant to the search than non-clicked ones, and that such data is much more abundant than user-provided rankings. Recently, the ranking SVM-based method was adopted for sepsis severity score learning [11] and extended to temporal applications by introducing a term that ensures gradual score change over consecutive time points.

Multitask learning is based on the idea that generalization (predictive performance) can be increased by accounting for the intrinsic relationships among multiple tasks. The multitask approach is found particularly effective when the number of samples per task is small. To the best of our knowledge, there are no published multitask formulations for ranking-based scoring functions, that is, for methods that learn from pairwise comparisons. The closest approaches are the previously mentioned multitask regression-based models for Alzheimer’s disease progression [31] and search results ranking [6]. Other multitask regression methods learn the structure among the tasks using norm regularization [30], or utilize a fixed relatedness structure [23] obtained from domain knowledge [25] or learned from statistical correlation [24]. However, since they are not directly proposed for ranking-based learning of scoring functions, we do not consider or compare against them in this work.

The main problem in multi-task learning is finding the most appropriate assumption on how the tasks are related and incorporating that assumption into the model. Typically, in linear models, such structural assumptions are imposed on the joint parameter matrix, where rows correspond to features and columns to different tasks. Kernel methods assume that all tasks are related and similar [13], while some methods enforce tasks to be grouped into clusters [16]. For example, the “dirty model” [17] encourages block-structured row-sparsity in the joint parameter matrix with the \(\Vert . \Vert _{1,\infty }\) norm, and element-wise sparsity with \(\Vert .\Vert _{1,1}\). The robust approach [14] selects sparse rows of features for related tasks with \(\Vert .\Vert _{2,1}\) and dense columns for outlier tasks with \(\Vert .\Vert _{1,2}\), in order to discern between related and unrelated tasks. Other approaches assume some shared common set of features [3] or a shared common subspace [2, 9]. The approach proposed in [10] attempts to learn such a relatedness subspace with the trace (nuclear) norm \(\Vert .\Vert _*\) by encouraging the parameter matrix to have low rank, and finds outlier tasks with an additional sparse group norm \(\Vert .\Vert _{1,2}\).

In this work, we use a regularization composed of the trace norm [10] and a grouped Lasso penalty [3] to jointly learn multiple ranking-based scoring tasks from temporal data.

3 Model

Let us assume that we have N samples (examples), where each sample i is represented as \(X_i\in \mathbb {R}^d\), and where \(X_{ij}\) is the value (measurement) of feature \(j\in \{1,2,\ldots ,d\}\) for sample \(i\in \{1,2,\ldots ,N\}\). Let us assume that \(y_i\in \mathbb {R}\) represents the property of interest (outcome variable) for sample i. A scoring function \(score:\mathbb {R}^d \rightarrow \mathbb {R}\) is then a mapping \(X_i \mapsto y_i^\prime \) that provides a close estimate \(y_i^\prime \) of the true score \(y_i\).

However, in many cases the values of the true scoring function are difficult to obtain. In such situations, it is easier to assess the ranking between the scores of two samples p and q, i.e. to assert that one has a higher perceived score than the other: \(score(X_p)>score(X_q)\). Therefore, a set of such ordered pairs can be used to find a projection in the space of measured features that preserves the orderings as well as possible, and that can serve as a scoring function.

Moreover, measurements collected on multiple occasions over time might belong to the same subject; in this case, the measurements at each time step are treated as separate samples. We assume that the outcome variable changes gradually (smoothly) over time for the same subject, e.g. the disease severity score changes smoothly over consecutive time points for the same patient. This assumption helps improve the quality of the scoring function. We assume that \(X_p\) represents the feature vector for sample p (which could be one particular subject at one particular time point).

In this work, we constrain such functional mapping score to the linear case, where the score estimate is computed as a weighted sum of the measured characteristics: \(score(X)=w^TX\). Therefore, the problem of learning the scoring function becomes finding the appropriate weight (or parameter) vector \(w\in \mathbb {R}^d\).

3.1 Single Task Model Formulation

Maximizing the number of correctly ordered training pairs can be performed using the soft max-margin framework expressed in a Hinge loss form (1), as suggested in [18].

$$\begin{aligned} max(0,1-(X_p-X_q)w) \end{aligned}$$
(1)

If sample p should have higher score compared to sample q, the formulation (1) will favor the weighted difference \((X_p-X_q)w\) that is positive and greater than 1, thus even achieving some margin in the score difference.

The \(L_2\) norm on the weight vector, \(\Vert w \Vert ^2\), is introduced to regularize the magnitude of the weights, turning the problem into a simultaneous maximization of correct ordering and of the normalized margin.

Gradual (smooth) change of the scoring function over time can be obtained by penalizing large changes of the score over short time intervals (e.g. for two samples \(X_{i+1}^s, X_i^s\) of the same subject s). In [12] this effect is achieved by using the temporal smoothness term:

$$\begin{aligned} \Bigg (\frac{(X_{i+1}^s-X_i^s)w}{(t^s_{i+1}-t^s_i)}\Bigg )^2, \end{aligned}$$
(2)

which essentially ensures that the squared magnitude of the difference, normalized by the time interval length, is kept low.

Therefore, as the single-task formulation of ranking-based scoring function learning, we adopted the Linear Disease Severity Score Learning formulation [11], which combines the attractive properties of ranking SVM [18] with the temporal smoothness term (2) that enforces gradual change of the scoring function over time:

$$\begin{aligned} \begin{aligned} \hat{w}=\underset{w}{\mathrm {argmin}} \frac{1}{2} \Vert w\Vert _2^2 + c \sum _{\{p,q\} \in O} max(0,1-(X_p-X_q)w) \\ +\ b \sum _{\{i,i+1\}_s \in S} \Bigg (\frac{(X_{i+1}^s-X_i^s)w}{(t^s_{i+1}-t^s_i)}\Bigg )^2 \end{aligned} \end{aligned}$$
(3)

Every measurement (row) vector \(X_i\), \(i\in \{1,2,\ldots ,N\}\), has an associated time stamp t, while \(\hat{w} \in \mathbb {R}^d\) denotes the solution of objective (3).

The set O is composed of ordered pairs \(\{p,q\}\), where p has a higher rank than q (p is perceived to have a higher score than q), corresponding to the measurement vectors \(X_p\) and \(X_q\), respectively. The sum of the Hinge loss terms over all pairs from O serves to reduce the number of incorrectly ordered pairs.

The set of all consecutive measurement pairs across all subjects is denoted S, and the sum of the Temporal smoothness terms in Eq. (3) penalizes high rates of change in score values between consecutive time steps \(t_i\) and \(t_{i+1}\) for each subject s. The scalar constants c and b are hyperparameters that determine the cost of the respective loss terms, the Hinge loss and the Temporal loss.

We aggregate the differences of measurements in the Hinge loss term into a single data matrix \(D_{k \times d}\), where k is the number of pairs in the comparison set O. Similarly, we write the measurement and temporal difference ratios in the Temporal loss term as a matrix \(R_{l \times d}\), where l is the number of pairs in the consecutive measurements set S. We aggregate the \(L_2\) norm and temporal smoothness terms (they essentially weight the square of the optimization parameters) into a single weighted quadratic term \(\frac{1}{2}w^T Qw\), where Q is a constant square matrix defined in Eq. (4):

$$\begin{aligned} Q=I+2bR^TR, \end{aligned}$$
(4)

I being the d-dimensional identity matrix.
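For concreteness, a minimal numpy sketch of how D, R and Q from Eq. (4) could be assembled from raw data; the pair and subject index lists are assumed inputs for the illustration, not part of the original formulation.

```python
import numpy as np

def build_matrices(X, t, ordered_pairs, consecutive_pairs, b):
    """Assemble the pair-difference matrix D, the temporal-ratio matrix R,
    and the quadratic term Q = I + 2*b*R^T R from Eq. (4).

    X                 : (N, d) array of measurements
    t                 : (N,) array of time stamps
    ordered_pairs     : list of (p, q) with score(X_p) > score(X_q)
    consecutive_pairs : list of (i, j) index pairs of consecutive samples of one subject
    b                 : temporal smoothness hyperparameter
    """
    d = X.shape[1]
    # Rows of D are differences X_p - X_q for each supervised ordered pair.
    D = np.array([X[p] - X[q] for p, q in ordered_pairs])
    # Rows of R are consecutive differences normalized by the time interval.
    R = np.array([(X[j] - X[i]) / (t[j] - t[i]) for i, j in consecutive_pairs])
    Q = np.eye(d) + 2.0 * b * R.T @ R
    return D, R, Q
```

With these matrices, objective (5) below reduces, for a candidate weight vector w, to `0.5 * w @ Q @ w + c * np.maximum(0, 1 - D @ w).sum()`.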

The formulation (3) can now be rewritten more concisely as (5):

$$\begin{aligned} \hat{w}=\underset{w}{\mathrm {argmin}} \frac{1}{2} w^T Q w + c \sum _{i} max(0,1-D^iw) \end{aligned}$$
(5)

3.2 Multitask Formulation

As mentioned before, when the amount of data for training the scoring function of a single task (5) is limited, it is beneficial to exploit the relatedness among multiple similar tasks by learning them together, as illustrated in Fig. 1.

Fig. 1. Illustration of joint training of multiple ranking-based score learning tasks. Three distinct tasks are depicted, where the measured data, in combination with supervision in the form of ordered pairs, are jointly optimized to obtain the scoring function parameters, represented as a parameter matrix. The parameter matrix is typically regularized to encode structural assumptions regarding task relatedness.

For m different tasks, individual parameter vectors \(w_i\) are aligned into a matrix \(W_{d \times m}\), and a joint objective is obtained as a superposition of individual losses (Eq. (5)) over the multiple tasks \(i \in \{1,2,...,m\}\):

$$\begin{aligned} \underset{W}{\mathrm {argmin}}\sum _{i=1}^{m} \left( \frac{1}{2} W_i^T Q_i W_i + c \sum _{j} max(0,1-D^j_i W_i)\right) \end{aligned}$$
(6)

Instead of the non-smooth Hinge loss \(L(a) = max(0,a)\) in Eq. (6), we work with a differentiable approximation in the form of the Huber loss [11]:

$$\begin{aligned} L_{h}(a)={\left\{ \begin{array}{ll} 0 &{} \text {, if}\,\,a\, <\, -h\\ \frac{(a+h)^2}{4h} &{} \text {, if}\,\, |a|\, \le \,\, h\\ a &{} \text {, if}\,\, a\,\, >\,\, h. \end{array}\right. } \end{aligned}$$
(7)

where the approximation threshold h can be chosen arbitrarily small.
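Written as code, Eq. (7) and its derivative (needed later when forming the gradient of \(\mathcal {L}_1\)) could look as follows; this is a direct transcription, with the threshold h as a small constant:

```python
import numpy as np

def huber_hinge(a, h=0.01):
    """Smoothed hinge approximation of max(0, a) from Eq. (7)."""
    a = np.asarray(a, dtype=float)
    out = np.where(a > h, a, 0.0)                      # linear part for a > h, zero for a < -h
    mid = np.abs(a) <= h                               # quadratic part for |a| <= h
    return np.where(mid, (a + h) ** 2 / (4.0 * h), out)

def huber_hinge_grad(a, h=0.01):
    """Derivative of the smoothed hinge: 0 for a < -h, (a+h)/(2h) for |a| <= h, 1 for a > h."""
    a = np.asarray(a, dtype=float)
    grad = np.where(a > h, 1.0, 0.0)
    mid = np.abs(a) <= h
    return np.where(mid, (a + h) / (2.0 * h), grad)
```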

Further, we regularize the objective in Eq. (6) with a joint norm on the parameter matrix, \(\Vert W\Vert _{p,q} = \big (\sum _{i}\big (\sum _{j}|W_{ij}|^p\big )^{\frac{q}{p}}\big )^{\frac{1}{q}}\). For \(p=2\) and \(q=1\), this penalty is known as the group Lasso on the rows of W, which forces sparsity in the parameter weights corresponding to certain features [3]. Additionally, we introduce the trace norm \(L_*\) in order to capture the low-rank component, in other words, the parameter weight pattern shared among all the tasks. To accommodate such a setup, which will be further clarified in the Optimization section, the parameter matrix W is split into two matrices A and B, with \(W=A+B\).

The Multitask Ranking Based Scoring Function Learning (MultiRBSFL) objective is now given in Eq. (8). It takes as input two matrices per task i obtained from the data, \(Q^i_{d \times d}\) and \(D^i_{k \times d}\), and the hyperparameters b, c, \(\lambda _1\) and \(\lambda _2\), weighting the influence of the Temporal loss, Huber loss, trace norm and sparse group norm, respectively.

$$\begin{aligned} \underset{W=A+B}{\text {argmin}}&\; \mathcal {L}_1+\lambda _1 \Vert A\Vert _* + \lambda _2 \Vert B\Vert _{2,1} \end{aligned}$$
(8)

where

$$\begin{aligned} \mathcal {L}_1&= \frac{1}{m}\sum _{i=1}^{m}\left( \frac{1}{2} (A^i+B^i)^T Q^i (A^i+B^i) + c \sum _{j=1}^{k} L_h(1-D^i_j(A^i+B^i)) \right) \end{aligned}$$
(9)

\(A^i\) and \(B^i\) are column vectors in \(\mathbb {R}^{d \times 1}\), and \(D_j^i\) is a row vector in \(\mathbb {R}^{1 \times d}\).
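To make the smooth part concrete, the following is a hedged sketch of evaluating \(\mathcal {L}_1\) from Eq. (9) and its gradient with respect to A (with B held fixed). It reuses the `huber_hinge` helpers sketched above, and the per-task matrices \(Q^i\), \(D^i\) are those defined in Sect. 3.1; the gradient with respect to B is identical by symmetry, since \(\mathcal {L}_1\) depends only on \(W=A+B\).

```python
import numpy as np

def loss_L1_and_grad_A(A, B, Q_list, D_list, c, h=0.01):
    """Evaluate L1 from Eq. (9) and its gradient w.r.t. A (B fixed).

    A, B    : (d, m) parameter matrices with W = A + B
    Q_list  : list of m (d, d) matrices Q_i
    D_list  : list of m (k_i, d) pair-difference matrices D_i
    """
    m = len(Q_list)
    loss, grad = 0.0, np.zeros_like(A)
    for i in range(m):
        w = A[:, i] + B[:, i]
        margins = 1.0 - D_list[i] @ w
        loss += 0.5 * w @ Q_list[i] @ w + c * huber_hinge(margins, h).sum()
        # Gradient of the quadratic term is Q_i w (Q_i is symmetric); the Huber
        # term contributes -c * D_i^T L_h'(margins) by the chain rule.
        grad[:, i] = Q_list[i] @ w - c * D_list[i].T @ huber_hinge_grad(margins, h)
    return loss / m, grad / m
```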

4 Optimization

The objective (8) is composed of smooth and non-smooth terms. Although the regularization terms are separable in A and B, the loss term \(\mathcal {L}_1\) is not. Therefore, we solve the problem using an alternating minimization scheme: in each iteration, we fix A and minimize (8) with respect to B, and then fix B and minimize (8) with respect to A. The problem thus decomposes into the two optimizations below, each of which is solved as explained in the next subsection.

Fix A

$$\begin{aligned} \underset{B}{\text {argmin}}&\; \mathcal {L}_1 + \lambda _2 \left\| B\right\| _{2,1} \end{aligned}$$
(10)

Fix B

$$\begin{aligned} \underset{A}{\text {argmin}}&\; \mathcal {L}_1 +\lambda _1 \left\| A\right\| _* \end{aligned}$$
(11)

In general, problems (10) and (11) can both be written as:

$$\begin{aligned} \underset{\varvec{\varTheta }}{\text {argmin}}&\; \mathcal {L}_1 + \gamma \left\| \varvec{\varTheta }\right\| _p, \end{aligned}$$
(12)

where \(\varvec{\varTheta }= \{A,B\}\) and \(p=\{*,\{2,1\}\}\).

The optimization (12) is convex. The expression \(\mathcal {L}_1\) is smooth and the regularization term (either the group lasso or the trace norm) is non-smooth. Therefore, we solve (12) using proximal methods.

4.1 Proximal Algorithm

We solve (12) using the proximal gradient method [20].

$$\begin{aligned} \varvec{\varTheta }^{k+1}&:= \mathbf{prox }_{\lambda \left\| \varvec{\varTheta }\right\| _p}(\varvec{\varTheta }^k - \lambda \nabla \mathcal {L}_1(\varvec{\varTheta }^k)) \nonumber \\&= \underset{\varvec{\varTheta }}{\text {argmin}} \left( \left\| \varvec{\varTheta }\right\| _p + \frac{1}{2\lambda } \left\| \varvec{\varTheta }- (\varvec{\varTheta }^k - \lambda \nabla \mathcal {L}_1(\varvec{\varTheta }^k))\right\| _2^2 \right) , \end{aligned}$$
(13)

where \(\mathbf{prox }_{\lambda \left\| \varvec{\varTheta }\right\| _p}\) is the proximal operator of the scaled function \(\lambda \left\| \varvec{\varTheta }\right\| _p\), \(\lambda \in (0,1/L] \) is a constant step size, and L is a Lipschitz constant of \(\nabla \mathcal {L}_1\). The proximal step in (13) can be computed analytically, and the proximal operator associated with each norm can be obtained as in [5].

Trace norm. Let us assume that \(M=U \varSigma V \) is the singular value decomposition of M, where \(\varSigma \) is a diagonal matrix whose entries \(\sigma _i\) are the singular values of the matrix M. The proximal operator of the trace norm is defined as [8]:

$$\begin{aligned} \mathbf{prox }_{\lambda \left\| .\right\| _* } (M) = U \mathbf{diag }(\mathbf{prox }_{\lambda \left\| .\right\| _1}(\sigma (M))) V \end{aligned}$$

i.e., the proximal operator of \(\left\| .\right\| _*\) can be calculated by carrying out a singular value decomposition of M and evaluating the proximal operator of the corresponding absolutely symmetric function at the singular values \(\sigma (M)\). Therefore,

$$\begin{aligned} \mathbf{prox }_{\lambda \left\| .\right\| _* } (M)&= U \mathbf{diag }(\overline{\sigma }_1,\overline{\sigma }_2,\ldots ,\overline{\sigma }_n) V, \end{aligned}$$
(14)

where:

$$\begin{aligned} \overline{\sigma }_i = {\left\{ \begin{array}{ll} \sigma _i - \lambda &{} \sigma _i \ge \lambda \\ 0 &{} -\lambda \le \sigma _i \le \lambda \\ \sigma _i + \lambda &{} \sigma _i \le -\lambda \end{array}\right. } \end{aligned}$$

Equation (14) is sometimes called the singular value thresholding operator.
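In code, the singular value thresholding operator of Eq. (14) takes only a few lines; since singular values are non-negative, only the upper branch of the soft-thresholding is active. A minimal numpy sketch:

```python
import numpy as np

def prox_trace_norm(M, lam):
    """Proximal operator of lam * ||.||_*: singular value thresholding, Eq. (14)."""
    U, sigma, Vt = np.linalg.svd(M, full_matrices=False)
    sigma_thr = np.maximum(sigma - lam, 0.0)   # soft-threshold the singular values
    return U @ np.diag(sigma_thr) @ Vt
```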

Group lasso norm. The proximal operator associated with the group lasso norm is defined as:

$$\begin{aligned} \left[ \mathbf{prox }_{\lambda \left\| .\right\| _{2,1} } (u)\right] _g = {\left\{ \begin{array}{ll} (1 - \frac{\lambda }{\left\| u_g\right\| _2}) u_g &{} \left\| u_g\right\| _2 > \lambda \\ 0 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
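The corresponding row-wise block soft-thresholding, again as a minimal sketch with the rows of the parameter matrix as the groups:

```python
import numpy as np

def prox_group_lasso(M, lam):
    """Proximal operator of lam * ||.||_{2,1}: block soft-thresholding of rows."""
    norms = np.linalg.norm(M, axis=1, keepdims=True)              # ||u_g||_2 per row
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)  # rows with norm <= lam vanish
    return scale * M
```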

4.2 Step Size

In order to find an adaptive step size \(\lambda ^k\) in each iteration k, we employ the backtracking line search algorithm [7], which requires computing an upper bound for \(\mathcal {L}_1\). Since \(\mathcal {L}_1\) is convex and smooth, and \(\nabla \mathcal {L}_1\) is L-Lipschitz continuous, it follows that:

$$\begin{aligned} \mathcal {L}_1(\varvec{\varTheta }) \le \underbrace{\mathcal {L}_1(\varvec{\varTheta }^k) + \nabla \mathcal {L}_1(\varvec{\varTheta }^k)^T(\varvec{\varTheta }-\varvec{\varTheta }^k) + \frac{L}{2}\left\| \varvec{\varTheta }-\varvec{\varTheta }^k\right\| _2^2}_{\widehat{\mathcal {L}_1}_{\frac{1}{L}}(\varvec{\varTheta }, \varvec{\varTheta }^k)} \end{aligned}$$
(15)

By utilizing (15), it can be shown that the optimization (13) is equivalent to [20]:

$$\begin{aligned} \varvec{\varTheta }^{k+1}&:= \underset{\varvec{\varTheta }}{\text {argmin}}\; \widehat{\mathcal {L}_1}_{\lambda ^k}(\varvec{\varTheta }, \varvec{\varTheta }^k) + \left\| \varvec{\varTheta }\right\| _p \end{aligned}$$
(16)

where \(\lambda ^k=\frac{1}{L}\). So at each iteration, the function \(\mathcal {L}_1\) is linearized around the current point and the problem (16) is solved. The final fast proximal gradient method with backtracking is shown in Algorithm 1.

Algorithm 1. Fast proximal gradient method with backtracking.
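The pseudocode figure is not reproduced here. As an illustration only, the following sketch shows a plain (non-accelerated) proximal gradient loop with backtracking consistent with Eqs. (13), (15) and (16); the "fast" variant referenced above would add a momentum (extrapolation) step. The `loss_and_grad` and `prox` callables are assumed to be supplied by the caller (e.g. the helpers sketched earlier).

```python
import numpy as np

def prox_gradient(theta0, loss_and_grad, prox, gamma, lam0=1.0, eta=0.5,
                  max_iter=200, tol=1e-6):
    """Proximal gradient method with backtracking (cf. Eqs. (13), (15), (16)).

    theta0        : initial parameter matrix (A or B)
    loss_and_grad : callable returning (L1(theta), grad L1(theta))
    prox          : callable prox(M, t) for the non-smooth term scaled by t
    gamma         : regularization weight (lambda_1 or lambda_2)
    """
    theta, lam = theta0.copy(), lam0
    for _ in range(max_iter):
        f, g = loss_and_grad(theta)
        # Backtracking: shrink the step until the quadratic upper bound (15) holds.
        while True:
            theta_new = prox(theta - lam * g, lam * gamma)
            diff = theta_new - theta
            f_new, _ = loss_and_grad(theta_new)
            if f_new <= f + np.sum(g * diff) + np.sum(diff ** 2) / (2.0 * lam):
                break
            lam *= eta
        if np.linalg.norm(theta_new - theta) <= tol:
            return theta_new
        theta = theta_new
    return theta
```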

The final alternating minimization algorithm is shown in Algorithm 2.

Algorithm 2. Alternating minimization for the MultiRBSFL objective (8).
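Again as an illustration rather than the exact pseudocode, the outer alternating loop of Sect. 4 can be written as a short driver built on the sketches above; the function names and the fixed number of outer iterations are assumptions made for the example.

```python
import numpy as np

def multi_rbsfl(d, m, loss_and_grad_A, loss_and_grad_B, lam1, lam2, n_outer=20):
    """Alternating minimization for objective (8).

    loss_and_grad_A(A, B) / loss_and_grad_B(B, A) return L1 and its gradient
    with respect to the first argument, with the second held fixed.
    """
    A, B = np.zeros((d, m)), np.zeros((d, m))
    for _ in range(n_outer):
        # Fix A, solve (10): L1 + lam2 * ||B||_{2,1} via proximal gradient.
        B = prox_gradient(B, lambda T: loss_and_grad_B(T, A), prox_group_lasso, lam2)
        # Fix B, solve (11): L1 + lam1 * ||A||_* via proximal gradient.
        A = prox_gradient(A, lambda T: loss_and_grad_A(T, B), prox_trace_norm, lam1)
    return A + B  # W = A + B is the learned joint parameter matrix
```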

5 Empirical Evaluation

The proposed approach for multitask learning of ranking-based scoring functions is tested on one synthetic and two real-world datasets. We compared our MultiRBSFL approach against the following baseline approaches:

  1. \(L_2\) - independently learning scoring functions for each task (objective (3));

  2. \(L_1\) - independently learning sparse (\(L_1\) regularized) scoring functions for each task;

  3. \(L_*\) - learning multiple scoring functions by imposing low-rank regularization on their joint parameter matrix (\(L_*\) regularized objective (6));

  4. \(L_{2,1}\) - joint objective (6), regularized by the mixed \(\Vert .\Vert _{2,1}\) norm.

We denote our MultiRBSFL approach, which uses the composite low-rank and mixed-norm regularized joint objective (8), as \(L_*+L_{2,1}\), for consistency with the naming of the alternative approaches.

We measured predictive performance in terms of accuracy, defined as the fraction of correctly ordered test pairs. As the pairwise ranking relation is antisymmetric, it is sufficient to use only the positive training instances (i.e. pairs in which the first sample has the larger score). Test pairs are generated exclusively from examples not contained in the training set. The accuracy values reported in this study are obtained via 5-fold cross-validation.
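For reference, the metric itself is straightforward to compute; a minimal sketch assuming a learned weight vector `w_hat` for one task and a list of test pairs (p, q) in which p is ranked above q:

```python
import numpy as np

def pairwise_accuracy(X_test, test_pairs, w_hat):
    """Fraction of test pairs (p, q), with p ranked above q, whose estimated
    scores preserve the ordering: w_hat^T x_p > w_hat^T x_q."""
    scores = X_test @ w_hat
    correct = sum(scores[p] > scores[q] for p, q in test_pairs)
    return correct / len(test_pairs)
```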

5.1 Experiments on Synthetic Data

In this setting, a Gaussian process model with an exponential kernel was used to generate the temporal data. We compiled 250 processes to mimic \(d=250\) measured variables (features) per subject. Each process was used to generate a time series with 10 time points (10 samples). We followed the same principle to generate 10 different multivariate time series (subjects) for training and 10 subjects for testing, resulting in 100 training samples \(X_{100 \times 250}^{train}\) and 100 test samples \(X_{100 \times 250}^{test}\).

Four different tasks were created by randomly generating the weight matrix \(W_{250 \times 4}\) with only 5 nonzero rows, which corresponds to the \(L_{2,1}\) assumption (row sparsity). This row-wise sparse matrix was then superimposed with a dense rank-1 matrix, generated as the outer product of two random vectors, which matches the \(L_*\) trace norm part of the objective. True underlying scores on the four tasks, for each of the 250-dimensional samples (one time point of one subject), were calculated as the weighted sum of the feature values, \(XW\). A zero-mean random noise vector was subsequently added to the input data X to model measurement noise.
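A hedged sketch of this generation procedure; the kernel length-scale, noise level and the scaling of the rank-1 component are illustrative choices not specified in the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T, n_subj, n_tasks, n_sparse = 250, 10, 10, 4, 5

def gp_series(T, length_scale=2.0):
    """One time series of length T from a GP with an exponential kernel."""
    t = np.arange(T)
    K = np.exp(-np.abs(t[:, None] - t[None, :]) / length_scale)
    return rng.multivariate_normal(np.zeros(T), K)

# Each subject: d feature trajectories of length T -> one (T, d) block of samples.
X_train = np.vstack([np.column_stack([gp_series(T) for _ in range(d)])
                     for _ in range(n_subj)])

# Row-sparse component (5 nonzero rows) plus a dense rank-1 component.
W = np.zeros((d, n_tasks))
W[rng.choice(d, n_sparse, replace=False)] = rng.normal(size=(n_sparse, n_tasks))
W += 0.1 * np.outer(rng.normal(size=d), rng.normal(size=n_tasks))

Y = X_train @ W                                         # true scores per task
X_train += rng.normal(scale=0.1, size=X_train.shape)    # measurement noise
```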

A training set is then obtained by making pairs out of samples whose scores are sufficiently different (in our case we set the threshold to 1). Pairs of examples were generated independently for each task based on their scores, totaling 14,187 pairs for all four tasks jointly. Test set pairs were generated in the same fashion, but with a smaller threshold, and consisted of 19,390 pairs. Training pairs were used to learn the weight matrix \(\hat{W}\), which was used to estimate the test scores from the test samples. The obtained estimates were used to infer the relative order of the test pairs. The accuracy (percentage of correctly ordered pairs) is reported in Table 1. It is no surprise that the proposed \(L_*+L_{2,1}\) approach achieves the highest accuracy on all four tasks, as the underlying assumptions were explicitly built into the synthetic example.

Table 1. Comparison of accuracy indicators (fraction of correctly ordered pairs) for alternative score learning methods on the synthetic data of four related tasks.

5.2 School Exam Score

Intelligence, as well as the capacity for understanding and using mathematics or languages, are all examples of properties that are latent, yet important and often evaluated (estimated). We tested the multitask score learning framework on data from an elementary school study [19], which contains longitudinal data on performance in Math and the English language for pupils in 50 inner London schools. In total there are scores for 3,236 exams (Math and English each), taken by 1,402 students over three consecutive school years. The goal is to rank the students’ performance on the Math and English tests based on the known score from the Raven's ability test and additional information such as demographics, social status, gender, class and school type. Distributions of scores for the two tasks are given in Fig. 2.

Fig. 2. Distributions of test scores for the Math and English tasks, respectively.

According to the results in Table 2, our \(L_*+L_{2,1}\) approach achieved the best predictive performance on both tasks.

Table 2. Comparison of accuracy indicators (fraction of correctly ordered pairs) for alternative score learning methods on the task of learning the performance on Math and English tests.

5.3 Tolerance to Infections Score

Tolerance is a host behavior that arises from interactions with a pathogen and describes the ability of the host to preserve fitness despite the presence of a large pathogen load. It is therefore defined as the change in host fitness (health) with respect to changes in pathogen load [22]. However, tolerance remains understudied and, despite the need, no established scoring function exists.

We analyzed three publicly available datasets that allow characterization of tolerance behavior in humans. The data come from human viral challenge studies [29] in which human volunteers were infected with H3N2 influenza, rhinovirus (HRV) and respiratory syncytial virus (RSV), respectively. For all subjects in each dataset, symptoms were recorded twice a day and quantified by the modified Jackson Score [15]. Thereafter, subjects were classified into “symptomatic” and “asymptomatic” groups based on the modified Jackson Score values. In addition, temporal viral load measurements are available for 28 “symptomatic” subjects, given in Table 3. Gene expression measurements (for 12,023 genes) were collected temporally, starting at a baseline (24 hours prior to inoculation with the virus) and measured at certain time points following the experimental procedure described in detail in [29], making a total of 16, 14 and 21 time-point measurements for the H3N2, HRV and RSV datasets, respectively. Table 3 shows the viral shedding and symptom scores for subjects who developed clinically relevant symptoms in the H3N2, HRV and RSV datasets.

Temporal measurements of symptoms (a proxy for fitness) and viral (pathogen) load for each subject were used to derive tolerance scores according to the definition given in [22]. In particular, the tolerance score for each subject was calculated by dividing the maximum viral load by the maximum symptom severity observed for that subject (Table 3). Gene expression measurements were used as explanatory variables in our ranking task.

Table 3. Tolerance scores (Ratio) derived by dividing maximum viral load (Max V) with maximum severity score (Max S).

The biological rationale behind the task relatedness is that the three pathogens are viruses that cause similar respiratory symptoms (runny nose, fever, cough) and are quantified by the same Jackson score, suggesting that some shared genetic mechanisms might be responsible for the disease manifestations. Consequently, we sought to learn the tolerance scoring functions jointly.

The tolerance scores were used to compile a set of ranked pairs, and the objective was to learn the scoring functions for tolerance to the H3N2, HRV and RSV viruses (three tasks) from high-dimensional gene expression data. Since optimizing over 12,023 dimensions is computationally expensive, we reduced the dimensionality of the data to the 100 genes most informative according to their correlation with the target. The results of learning the scoring functions with the different approaches are summarized in Table 4.
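The dimensionality reduction step can be implemented as a simple univariate filter; a minimal sketch of selecting the genes most correlated (in absolute value) with the target, where the exact correlation measure and tie handling are our assumptions:

```python
import numpy as np

def top_k_by_correlation(X, y, k=100):
    """Select the k features most correlated (in absolute value) with the target y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    corr = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()) + 1e-12)
    idx = np.argsort(-np.abs(corr))[:k]      # indices of the k most informative features
    return X[:, idx], idx
```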

Table 4. Comparison of accuracy indicators (fraction of correctly ordered pairs) for alternative score learning methods on the tolerance to three viruses learning task.

The results in Table 4 show that the HRV task is the most difficult one in the described formulation. Although some alternative approaches achieved better accuracy on two of the tasks, the proposed approach achieved the best generalization trade-off, as indicated by the highest average (overall) accuracy.

6 Discussion and Conclusions

We proposed a method that jointly learns multiple scoring functions from sets of ranked examples. The approach utilizes a composite regularization, consisting of the trace norm and a row-wise grouped Lasso penalty, to impose structural regularity among the model parameters of different tasks. We also provide an optimization algorithm, based on alternating minimization and proximal gradient techniques, for solving the proposed convex MultiRBSFL objective.

The presented empirical evaluations on one synthetic and two real-world datasets suggest the benefits of the multitask approach for learning related ranking-based scoring functions. According to the results, the model with only \(L_*\) performs worse than the one with \(L_{2,1}\), probably because sparsity in features is a more dominant pattern in the data than the low-rank component. However, utilizing both \(L_*\) and \(L_{2,1}\) in the same model turned out to be most beneficial for the studied applications.

The proposed proximal gradient algorithm with alternating minimization for optimization of the multitask objective proved valuable for applications with low to moderate dimensionality of the feature space. However, as contemporary applications have an ever-increasing number of measured variables, more efficient and scalable optimization approaches will be required. One potential way to accelerate the proximal gradient algorithm is to adopt the approach proposed in [26].