1 Introduction

In contrast to traditional supervised learning, multi-label classification aims to build classification models for objects that are assigned multiple labels simultaneously, which is a common learning paradigm in real-world tasks. It has attracted much attention over the past decades (Zhang and Zhou 2014a). To name a few examples: in image classification, a scene image is usually annotated with several tags (Boutell et al. 2004); in text categorization, a document may cover multiple topics (McCallum 1999; Schapire and Singer 2000); and in music information retrieval, a piece of music can convey various messages (Turnbull et al. 2008).

To solve multi-label tasks, a variety of methods have been proposed (Zhang and Zhou 2014a; Zhang et al. 2018), among which Rank-SVM (Elisseeff and Weston 2002) is one of the most eminent. It extends the idea of maximizing the minimum margin in the support vector machine (SVM) (Cortes and Vapnik 1995) to multi-label classification and achieves impressive performance. Specifically, the central idea of SVM is to search for a large-margin separator, i.e., to maximize the smallest distance from the instances to the classification boundary in a reproducing kernel Hilbert space (RKHS). Rank-SVM redefines the margin over relevant–irrelevant label pairs and adapts the margin-maximization strategy to multi-label data, where a set of classifiers is optimized simultaneously. Benefiting from kernel tricks and from modeling pairwise relations between labels, Rank-SVM can handle non-linear classification problems and achieve good generalization performance.

For the strategy of maximizing the minimum margin in SVMs, margin theory (Vapnik 1995) provides solid support for the generalization performance. It is noteworthy that there is also a long history of using margin theory to explain the good generalization of AdaBoost (Freund and Schapire 1997), which tends to be empirically resistant to over-fitting. Specifically, Schapire et al. (1998) first suggested margin theory to interpret the phenomenon that AdaBoost seems resistant to over-fitting; soon after, Breiman (1999) developed a boosting-style algorithm named Arc-gv, which maximizes the minimum margin but has poor generalization performance. Later, Reyzin and Schapire (2006) observed that although Arc-gv produces a larger minimum margin, its margin distribution is quite poor.

Recently, the margin theory for boosting has finally been defended (Gao and Zhou 2013), and it has been disclosed that the margin distribution, rather than a single margin, is more crucial to generalization performance. This suggests that there may still be considerable room for improvement in SVMs. Inspired by this finding, Zhang and Zhou (2014b, 2019) proposed a binary classification method that optimizes the margin distribution by characterizing it through its first- and second-order statistics, achieving better experimental results than SVMs. Later, Zhang and Zhou (2017, 2019) extended the definition of margin to multi-class classification and proposed the multi-class optimal margin distribution machine (mcODM), which consistently outperforms multi-class SVMs empirically. Beyond classic supervised learning, a series of works on various tasks has verified the better generalization performance of optimizing the margin distribution. For example, Zhou and Zhou (2016) extended the idea to exploit unlabeled data and handle unequal misclassification costs; Zhang and Zhou (2018) proposed the margin distribution machine for clustering; and Tan et al. (2019) accelerated the kernel methods and applied the idea to large-scale datasets.

Existing work has shown that optimizing the margin distribution yields superior generalization performance in most cases, but the question remains open for multi-label classification, because the margin distribution there is much more complicated and the tremendous number of variables makes the optimization more difficult. In this paper, we propose the first method that introduces margin distribution optimization into multi-label classification, named the multi-label optimal margin distribution machine (mlODM). Specifically, we formulate the idea of optimizing the margin distribution in multi-label learning and solve it efficiently by dual block coordinate descent. Extensive experiments on multiple multi-label evaluation metrics illustrate that mlODM outperforms SVM-style multi-label methods. Moreover, empirical studies show that it yields the best margin distribution and verify its fast convergence.

The rest of the paper is organized as follows. Some preliminaries are introduced in Sect. 2. In Sect. 3, we review Rank-SVM and reformulate it with our definitions to display the key idea of maximizing the minimum margin more clearly. Section 4 presents the formulation of our proposed method mlODM. In Sect. 5, we use a block coordinate descent algorithm to solve the dual of the objective. Section 6 reports our experimental studies and empirical observations. Related work is introduced in Sect. 7. Finally, Sect. 8 concludes with future work.

2 Preliminaries

Suppose \({\mathcal {X}} = {\mathbb {R}}^d\) denotes the d-dimensional instance space, and \({\mathcal {Y}} = \left\{ y_1, y_2, \ldots , y_q \right\} \) denotes the label space with q possible class labels. The task of multi-label learning is to learn a classifier \(h: {\mathcal {X}}\rightarrow 2^{{\mathcal {Y}}}\) from the multi-label training set \({\mathcal {S}} = \left\{ \left( {\varvec{x}}_i, Y_i \right) | 1 \le i \le m \right\} \). In most cases, instead of outputting a multi-label classifier, the learning system will produce a real-valued function of the form \(f:{\mathcal {X}}\times {\mathcal {Y}}\rightarrow {\mathbb {R}}\). For each multi-label example \(\left( {\varvec{x}}_i, Y_i \right) , {\varvec{x}}_i \in {\mathcal {X}} \) is a d-dimensional feature vector \(\left( x_{i1}, x_{i2}, \ldots , x_{id} \right) ^{\top } \) and \(Y_i \subset {\mathcal {Y}}\) is the set of labels associated with \({\varvec{x}}_i\). Besides, the complement of \(Y_i\), i.e., \({\bar{Y}}_i = {\mathcal {Y}}\backslash Y_i\), is referred to as a set of irrelevant labels of \({\varvec{x}}_i\).

Let \(\phi : {\mathcal {X}} \mapsto {\mathbb {H}}\) be a feature mapping associated with some positive definite kernel \(\kappa \). In the multi-label classification setting, the hypothesis \({\mathcal {W}} = \left\{ {\varvec{w}}_j\ |\ 1 \le j \le q \right\} \) is defined by q weight vectors \({\varvec{w}}_1, \ldots , {\varvec{w}}_q \in {\mathbb {H}}\), where each vector \({\varvec{w}}_y, y \in {\mathcal {Y}}\), defines a scoring function \({\varvec{x}} \mapsto {\varvec{w}}_y^{\top }\phi ({\varvec{x}})\), and the predicted labels of an instance \({\varvec{x}}\) are those receiving large scores. For systems that rank the values of \({\varvec{w}}_y^{\top }\phi ({\varvec{x}})\), the decision boundaries for \({\varvec{x}}\) are defined by the hyperplanes \({\varvec{w}}_k^{\top }\phi ({\varvec{x}})-{\varvec{w}}_l^{\top }\phi ({\varvec{x}})=0\) for each relevant–irrelevant label pair \((k,l)\in Y \times {\bar{Y}}\). Therefore, the margin of a labeled instance \(({\varvec{x}}_i,Y_i)\) can be defined as:

$$\begin{aligned} \gamma _h ({\varvec{x}}_i,y_k,y_l) = {\varvec{w}}_k^{\top }\phi ({\varvec{x}}_i) - {\varvec{w}}_l^{\top }\phi ({\varvec{x}}_i),\quad \forall (k,l)\in Y_i \times {\bar{Y}}_i, \end{aligned}$$
(1)

which is the difference between the scores of \({\varvec{x}}_i\) on a label pair. In addition, we define the ranking margin as:

$$\begin{aligned} \min _{(k,l)\in Y_i \times {\bar{Y}}_i} \ \frac{1}{\left\| {\varvec{w}}_k - {\varvec{w}}_l \right\| _{{\mathbb {H}}} }\gamma _h ({\varvec{x}}_i,y_k,y_l), \end{aligned}$$
(2)

which is the normalized margin, i.e., the minimum signed distance of \({\varvec{x}}_i\) to the decision boundaries under the norm \(\left\| \cdot \right\| _{{\mathbb {H}}}\). Thus the classifier h misclassifies \(({\varvec{x}}_i,Y_i)\) if and only if it produces a negative margin for this instance, i.e., there exists at least one label pair \((k,l)\in Y_i \times {\bar{Y}}_i\) such that \(\gamma _{h} ({\varvec{x}}_i,y_k,y_l) < 0\).

Based on the above definition of margin, the multi-label learning task is tackled by considering pairwise relations between labels, which corresponds to the ranking between relevant and irrelevant labels. Therefore, methods based on the ranking margin belong to the second-order strategies (Zhang and Zhou 2014a), which can achieve better generalization performance than first-order approaches.
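To make the definitions above concrete, here is a minimal Python sketch that computes the pairwise margins of Eq. (1) and the normalized ranking margin of Eq. (2) for a single example, assuming a linear feature map \(\phi ({\varvec{x}})={\varvec{x}}\); all names are illustrative and not taken from any released implementation.

```python
import numpy as np

def ranking_margin(W, x, relevant, irrelevant):
    """Pairwise margins (Eq. 1) and normalized ranking margin (Eq. 2) of one example.

    W          : (q, d) array, one weight vector w_j per label (phi(x) = x)
    x          : (d,) feature vector
    relevant   : indices of the labels in Y
    irrelevant : indices of the labels in the complement of Y
    """
    scores = W @ x                                    # w_j^T phi(x) for every label
    margins, normalized = {}, []
    for k in relevant:
        for l in irrelevant:
            gamma = scores[k] - scores[l]             # Eq. (1)
            margins[(k, l)] = gamma
            normalized.append(gamma / np.linalg.norm(W[k] - W[l]))
    # Eq. (2): the ranking margin is the minimum over all label pairs;
    # the example is misranked iff this value is negative
    return margins, min(normalized)

# toy usage
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 5))        # q = 4 labels, d = 5 features
x = rng.normal(size=5)
pair_margins, rho = ranking_margin(W, x, relevant=[0, 2], irrelevant=[1, 3])
print(rho)
```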

3 Review of Rank-SVM

Using the ranking margin of Eq. (2), Elisseeff and Weston (2002) first extended the key idea of margin maximization to multi-label classification and proposed Rank-SVM, which learns q base models to minimize the Ranking Loss while maximizing the ranking margin. We briefly reformulate the derivation as follows.

When the learning system is capable of properly ranking every relevant–irrelevant label pair for each training example, its margin on the whole training set S is naturally

$$\begin{aligned} \min _{\left( {\varvec{x}}_i, Y_i \right) \in S}\ \min _{(k,l)\in Y_i \times {\bar{Y}}_i} \ \frac{1}{\left\| {\varvec{w}}_k - {\varvec{w}}_l \right\| _{{\mathbb {H}}} }\gamma _h ({\varvec{x}}_i,y_k,y_l) \end{aligned}$$
(3)

In this ideal case, we can normalize the parameters to ensure that, for every \(\left( {\varvec{x}}_i, Y_i \right) \in S\),

$$\begin{aligned} \gamma _h ({\varvec{x}}_i,y_k,y_l) = {\varvec{w}}_k^{\top }\phi ({\varvec{x}}_i) - {\varvec{w}}_l^{\top }\phi ({\varvec{x}}_i) \ge 1 \end{aligned}$$
(4)

with equality attained by some instances. The problem of maximizing the ranking margin in Eq. (3) can then be expressed as:

$$\begin{aligned}&\max _{{\mathcal {W}}}\ \min _{\left( {\varvec{x}}_i, Y_i \right) \in S}\ \min _{(k,l)\in Y_i \times {\bar{Y}}_i} \ \frac{1}{\left\| {\varvec{w}}_k - {\varvec{w}}_l \right\| ^2_{{\mathbb {H}}} }\nonumber \\&\text {s.t.}\ \gamma _h ({\varvec{x}}_i,y_k,y_l) \ge 1, \quad \forall (k,l)\in Y_i \times {\bar{Y}}_i, \ i=1,\ldots ,m. \end{aligned}$$
(5)

Supposing there are sufficient training examples, so that every label pair \((k,l)\) appears as a relevant–irrelevant pair for some example, the objective in Eq. (5) becomes equivalent to \(\max _{{\mathcal {W}}}\ \min _{k,l} \ \frac{1}{\left\| {\varvec{w}}_k - {\varvec{w}}_l \right\| ^2_{{\mathbb {H}}} }\), and the optimization problem can be reformulated as:

$$\begin{aligned}&\min _{{\mathcal {W}}}\ \max _{k,l}\ \left\| {\varvec{w}}_k - {\varvec{w}}_l \right\| ^2_{{\mathbb {H}}}\nonumber \\&\text {s.t.}\ \gamma _h ({\varvec{x}}_i,y_k,y_l) \ge 1, \quad \forall (k,l)\in Y_i \times {\bar{Y}}_i, \ i=1,\ldots ,m. \end{aligned}$$
(6)

To avoid the difficulty brought by the max operator, Rank-SVM approximates the maximum with the sum operator and obtains \(\min _{{\mathcal {W}}}\ \sum _{k,l=1}^{q}\ \left\| {\varvec{w}}_k - {\varvec{w}}_l \right\| ^2_{{\mathbb {H}}}\). Since a common shift of all weight vectors does not change the ranking, the constraint \(\sum _{j=1}^{q} {\varvec{w}}_j =0\) can be added, under which \(\sum _{k,l=1}^{q} \left\| {\varvec{w}}_k - {\varvec{w}}_l \right\| ^2_{{\mathbb {H}}} = 2q\sum _{k=1}^{q} \left\| {\varvec{w}}_k \right\| ^2_{{\mathbb {H}}}\). The previous problem Eq. (6) is thus equivalent to:

$$\begin{aligned}&\min _{{\mathcal {W}}}\ \sum _{k=1}^{q}\ \left\| {\varvec{w}}_k\right\| ^2_{{\mathbb {H}}}\nonumber \\&\text {s.t.}\ \gamma _h ({\varvec{x}}_i,y_k,y_l) \ge 1, \quad \forall (k,l)\in Y_i \times {\bar{Y}}_i, \ i=1,\ldots ,m. \end{aligned}$$
(7)

To generalize the method to real-world scenarios where the constraints in Eq. (7) cannot all be satisfied, Rank-SVM introduces slack variables as in binary SVM and obtains the final optimization problem:

$$\begin{aligned}&\min _{{\mathcal {W}};\varXi }\ \sum _{k=1}^{q}\ \left\| {\varvec{w}}_k\right\| ^2_{{\mathbb {H}}} + C\sum _{i=1}^{m}\frac{1}{|Y_i||{\bar{Y}}_i|}\sum _{(k,l)\in Y_i \times {\bar{Y}}_i}\xi _{ikl} \nonumber \\&\text {s.t.}\ \gamma _h ({\varvec{x}}_i,y_k,y_l) \ge 1 - \xi _{ikl}, \nonumber \\&\quad \xi _{ikl} \ge 0,\quad \forall (k,l)\in Y_i \times {\bar{Y}}_i, \ i=1,\ldots ,m \end{aligned}$$
(8)

where \(\varXi = \left\{ \xi _{ikl}\ |\ 1 \le i \le m, (k,l)\in Y_i \times {\bar{Y}}_i \right\} \) is the set of slack variables. In this way, Rank-SVM maximizes the ranking margin while minimizing the Ranking Loss. Specifically, the first part of Eq. (8) corresponds to the ranking margin and the second part to the surrogate Ranking Loss in hinge form, and the two parts are balanced by the trade-off parameter C.
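As an illustration of Eq. (8), the following sketch evaluates the Rank-SVM objective for given weight vectors after eliminating the slacks, i.e., substituting \(\xi _{ikl}=\max (0, 1-\gamma _h ({\varvec{x}}_i,y_k,y_l))\); it assumes a linear feature map and is only a sketch for intuition, not the solver used by Elisseeff and Weston (2002).

```python
import numpy as np

def rank_svm_objective(W, X, Y_sets, C=1.0):
    """Eq. (8) with the slacks eliminated: xi_ikl = max(0, 1 - (score_k - score_l)).

    W      : (q, d) weight matrix (one row per label, linear feature map)
    X      : (m, d) training instances
    Y_sets : list of sets of relevant label indices, one per instance
    """
    q = W.shape[0]
    reg = float(np.sum(W ** 2))                          # sum_k ||w_k||^2
    loss = 0.0
    for x, Y in zip(X, Y_sets):
        Ybar = [l for l in range(q) if l not in Y]
        scores = W @ x
        hinge = [max(0.0, 1.0 - (scores[k] - scores[l])) for k in Y for l in Ybar]
        loss += sum(hinge) / (len(Y) * len(Ybar))        # 1/(|Y_i||Ybar_i|) weighting
    return reg + C * loss
```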

4 Formulation of proposed mlODM

Gao and Zhou (2013) proved that, to characterize the margin distribution, it is important to consider both the margin mean and the margin variance. Inspired by this result, Zhang and Zhou (2019) proposed the optimal margin distribution machine (ODM) for binary classification, which maximizes the margin mean while minimizing the margin variance. In this section, we introduce the optimization of the margin distribution into the multi-label setting and propose the formulation of the multi-label optimal margin distribution machine (mlODM), whose key idea is to maximize the ranking margin mean while minimizing the margin variance.

As in binary ODM, assuming that all the data in the training set S can be well ranked, we can normalize the weight vectors \({\varvec{w}}_j, j=1,\ldots ,q\), such that for every label pair \((k,l)\in Y_i \times {\bar{Y}}_i\), the mean of \(\gamma _h ({\varvec{x}}_i,y_k,y_l)\) is 1, i.e., \(\bar{\gamma _h}({\varvec{x}},y_k,y_l)=1\). Therefore, the distance from the mean point of label pair \((k,l)\in Y_i \times {\bar{Y}}_i\) to the decision boundary, under the norm \(\left\| \cdot \right\| _{{\mathbb {H}}}\), can be represented as \(\frac{1}{\left\| {\varvec{w}}_k - {\varvec{w}}_l \right\| _{{\mathbb {H}}} }\). The minimum distance between the mean points and the decision boundaries can then be represented as:

$$\begin{aligned}&\min _{(k,l)\in Y_i \times {\bar{Y}}_i} \, \frac{1}{\left\| {\varvec{w}}_k - {\varvec{w}}_l \right\| _{{\mathbb {H}}}} \nonumber \\&\text {s.t.}\, \bar{\gamma _h}({\varvec{x}},y_k,y_l)=1, \quad \forall (k,l)\in Y_i \times {\bar{Y}}_i, \end{aligned}$$
(9)

which is the minimum margin mean. Corresponding to maximizing the minimum margin in Rank-SVM, we maximize the margin mean on the whole dataset and obtain

$$\begin{aligned}&\max _{{\mathcal {W}}}\ \min _{\left( {\varvec{x}}_i, Y_i \right) \in S}\ \min _{(k,l)\in Y_i \times {\bar{Y}}_i} \ \frac{1}{\left\| {\varvec{w}}_k - {\varvec{w}}_l \right\| ^2_{{\mathbb {H}}} } \nonumber \\&\text {s.t.}\ \bar{\gamma _h} ({\varvec{x}},y_k,y_l) = 1, \quad \forall (k,l)\in Y_i \times {\bar{Y}}_i. \end{aligned}$$
(10)

Then we use the same technique to simplify the objective. Specifically, we suppose the problem is not ill-conditioned, approximate the maximum operator with the sum operator and add the constraint \(\sum _{j=1}^{q} {\varvec{w}}_j = 0\). Thereafter, the objective of maximizing the margin mean can be reformulated as:

$$\begin{aligned}&\min _{{\mathcal {W}}}\ \sum _{k=1}^{q}\ \left\| {\varvec{w}}_k\right\| ^2_{{\mathbb {H}}} \nonumber \\&\text {s.t.}\ \bar{\gamma _h} ({\varvec{x}},y_k,y_l) = 1, \quad \forall (k,l)\in Y_i \times {\bar{Y}}_i. \end{aligned}$$
(11)

Having fixed the margin mean, in order to optimize the margin distribution we still need to minimize the margin variance. As in binary ODM, the variance can be formulated through slack variables. Since the margin variance is computed over every label pair, we use the weighting scheme of the Ranking Loss to form a weighted average of the variance. The objective can then be represented as:

$$\begin{aligned}&\min _{{\mathcal {W}},\varXi ,\varLambda }\, \sum _{k=1}^{q}\ \left\| {\varvec{w}}_k\right\| ^2_{{\mathbb {H}}} + C\sum _{i=1}^{m}\frac{1}{|Y_i||{\bar{Y}}_i|}\sum _{(k,l)\in Y_i \times {\bar{Y}}_i}\left( \xi _{ikl}^2 + \epsilon _{ikl}^2 \right) \nonumber \\&\text {s.t.}\, \bar{\gamma _h} ({\varvec{x}},y_k,y_l) = 1, \nonumber \\&\quad \gamma _h({\varvec{x}}_i,y_k,y_l) \ge 1 - \xi _{ikl}, \nonumber \\&\quad \gamma _h({\varvec{x}}_i,y_k,y_l) \le 1 + \epsilon _{ikl}, \quad \forall (k,l)\in Y_i \times {\bar{Y}}_i,\ i=1,\ldots ,m \end{aligned}$$
(12)

where C is the trade-off parameter balancing the margin mean and the margin variance; \(\varXi = \left\{ \xi _{ikl}\ |\ 1 \le i \le m, (k,l)\in Y_i \times {\bar{Y}}_i \right\} \) and \(\varLambda = \left\{ \epsilon _{ikl}\ |\ 1 \le i \le m, (k,l)\in Y_i \times {\bar{Y}}_i \right\} \) are the sets of slack variables. Since the margin mean is fixed to 1, the second part of the objective is the weighted average of the margin variance. However, the above optimization problem is very difficult to solve because of the margin-mean constraint. Drawing on the idea of the insensitive loss in Support Vector Regression (Vapnik 1995), and in order to simplify the objective, we approximate the margin mean and variance by a \(\theta \)-insensitive margin loss. The previous problem can be recast as:

$$\begin{aligned}&\min _{{\mathcal {W}},\varXi ,\varLambda }\, \sum _{k=1}^{q}\ \left\| {\varvec{w}}_k\right\| ^2_{{\mathbb {H}}} + C\sum _{i=1}^{m}\frac{1}{|Y_i||{\bar{Y}}_i|}\sum _{(k,l)\in Y_i \times {\bar{Y}}_i}\left( \xi _{ikl}^2 + \epsilon _{ikl}^2 \right) \nonumber \\&\text {s.t.}\, \gamma _h({\varvec{x}}_i,y_k,y_l) \ge 1 - \theta - \xi _{ikl},\nonumber \\&\quad \gamma _h({\varvec{x}}_i,y_k,y_l) \le 1 + \theta + \epsilon _{ikl}, \quad \forall (k,l)\in Y_i \times {\bar{Y}}_i, \ i=1,\ldots ,m \end{aligned}$$
(13)

where \(\theta \in [0,1]\) is a hyperparameter controlling the degree of approximation. With the \(\theta \)-insensitive margin loss, the margin mean is confined to the interval \([1-\theta , 1+\theta ]\), while the variance is computed only from the points falling outside this interval. From another point of view, the first part of the objective is a regularization term that limits the model complexity and minimizes the structural risk, and the second part is an approximate weighted variance loss. Moreover, the parameter \(\theta \) also controls the number of support vectors, i.e., the sparsity of the solution.

For the instances outside the interval, it is obvious that those with \(\gamma _h({\varvec{x}}_i,y_k,y_l) < 1 - \theta \) are much easier to misclassify than those falling on the other side. Thus, like binary ODM, we set different weights for the losses of instances on the two sides. This leads to the final formulation of mlODM:

$$\begin{aligned}&\min _{{\mathcal {W}},\varXi ,\varLambda }\ \sum _{k=1}^{q} \left\| {\varvec{w}}_k\right\| _{{\mathbb {H}}}^2 + C\sum _{i=1}^{m}\frac{1}{|Y_i||{\bar{Y}}_i|}\sum _{(k,l)\in Y_i \times {\bar{Y}}_i} \left( \xi ^2_{ikl}+\mu \epsilon ^2_{ikl}\right) \nonumber \\&\text {s.t.} \ \gamma _h({\varvec{x}}_i,y_k,y_l) \ge 1 - \theta - \xi _{ikl}, \nonumber \\&\quad \gamma _h({\varvec{x}}_i,y_k,y_l) \le 1 + \theta + \epsilon _{ikl},\quad \forall (k,l)\in Y_i \times {\bar{Y}}_i,\ i=1,\ldots ,m \end{aligned}$$
(14)

where \(\mu \in (0,1]\) is the weight parameter. The optimization problem of mlODM is more difficult than that of Rank-SVM, because modeling the margin distribution is more complex than minimizing the hinge-form Ranking Loss.
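Analogously, the sketch below evaluates the mlODM objective of Eq. (14) with the slacks eliminated, i.e., \(\xi _{ikl}=\max (0,1-\theta -\gamma _h)\) and \(\epsilon _{ikl}=\max (0,\gamma _h-1-\theta )\), again assuming a linear feature map; it is meant only to make the roles of \(\theta \), \(\mu \) and C explicit and is not the optimization procedure used later.

```python
import numpy as np

def mlodm_objective(W, X, Y_sets, C=1.0, theta=0.5, mu=0.5):
    """Eq. (14) with the theta-insensitive squared losses substituted for the slacks."""
    q = W.shape[0]
    reg = float(np.sum(W ** 2))
    loss = 0.0
    for x, Y in zip(X, Y_sets):
        Ybar = [l for l in range(q) if l not in Y]
        scores = W @ x
        acc = 0.0
        for k in Y:
            for l in Ybar:
                gamma = scores[k] - scores[l]
                xi = max(0.0, 1.0 - theta - gamma)     # margin below 1 - theta
                eps = max(0.0, gamma - 1.0 - theta)    # margin above 1 + theta
                acc += xi ** 2 + mu * eps ** 2
        loss += acc / (len(Y) * len(Ybar))
    return reg + C * loss
```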

5 Optimization

The mlODM problem Eq. (14) is a non-differentiable quadratic programming problem; in this paper we solve its dual by the block coordinate descent (BCD) algorithm (Richtárik and Takáč 2014). For convenience of calculation, the original problem can be written as follows:

$$\begin{aligned}&\min _{{\mathcal {W}},\varXi ,\varLambda }\ \frac{1}{2}\sum _{k=1}^{q} \left\| {\varvec{w}}_k\right\| _{{\mathbb {H}}}^2 + \frac{C}{2}\sum _{i=1}^{m}\frac{1}{|Y_i||{\bar{Y}}_i|}\sum _{(k,l)\in Y_i \times {\bar{Y}}_i} \left( \xi ^2_{ikl}+\mu \epsilon ^2_{ikl}\right) \nonumber \\&\text {s.t.} \ {\varvec{w}}_k^{\top }\phi ({\varvec{x}}_i) - {\varvec{w}}_l^{\top }\phi ({\varvec{x}}_i) \ge 1 - \theta - \xi _{ikl}, \nonumber \\&\quad {\varvec{w}}_k^{\top }\phi ({\varvec{x}}_i) - {\varvec{w}}_l^{\top }\phi ({\varvec{x}}_i) \le 1 + \theta + \epsilon _{ikl},\quad \forall (k,l)\in Y_i \times {\bar{Y}}_i,\ i=1,\ldots ,m. \end{aligned}$$
(15)

5.1 Lagrangian dual problem

First we introduce the dual variables \(\alpha _{ikl} \ge 0,(k,l)\in Y_i \times {\bar{Y}}_i\), related to the first \(\sum _{i=1}^{m}|Y_i||{\bar{Y}}_i|\) constraints, and \(\beta _{ikl} \ge 0,(k,l)\in Y_i \times {\bar{Y}}_i\), related to the remaining \(\sum _{i=1}^{m}|Y_i||{\bar{Y}}_i|\) constraints. The Lagrangian function of Eq. (15) is:

$$\begin{aligned}&L({\varvec{w}}_k,\xi _{ikl},\epsilon _{ikl},\alpha _{ikl},\beta _{ikl}) \\&\quad =\ \frac{1}{2}\sum _{k=1}^{q} \left\| {\varvec{w}}_k\right\| _{{\mathbb {H}}}^2 + \frac{C}{2}\sum _{i=1}^{m}\frac{1}{|Y_i||{\bar{Y}}_i|}\sum _{(k,l)\in Y_i \times {\bar{Y}}_i} \left( \xi ^2_{ikl}+\mu \epsilon ^2_{ikl}\right) \\&\qquad -\sum _{i=1}^{m}\sum _{(k,l)\in Y_i \times {\bar{Y}}_i} \alpha _{ikl} \left( {\varvec{w}}_k^{\top }\phi ({\varvec{x}}_i) - {\varvec{w}}_l^{\top }\phi ({\varvec{x}}_i) - 1 + \theta + \xi _{ikl} \right) \\&\qquad +\sum _{i=1}^{m}\sum _{(k,l)\in Y_i \times {\bar{Y}}_i} \beta _{ikl} \left( {\varvec{w}}_k^{\top }\phi ({\varvec{x}}_i) - {\varvec{w}}_l^{\top }\phi ({\varvec{x}}_i) - 1 - \theta - \epsilon _{ikl} \right) \end{aligned}$$

By setting the partial derivatives with respect to \({\varvec{w}}_k\) to zero, we obtain:

$$\begin{aligned} {\varvec{w}}_k&= \sum _{i=1}^{m} \left( \sum _{(j,l)\in Y_i \times {\bar{Y}}_i} \left( \alpha _{ijl}-\beta _{ijl} \right) \cdot c^k_{ijl} \right) \phi ({\varvec{x}}_i) \end{aligned}$$
(16)

where \(c_{ijl}^k\) is defined as follows:

$$\begin{aligned} c^{k}_{ijl}=\left\{ \begin{array}{ll} 0 &{} \quad j \ne k\ \text {and}\ l \ne k \\ 1 &{} \quad j=k \\ -1 &{} \quad l=k. \end{array} \right. \end{aligned}$$

Note that \(c_{ijl}^k\) depends on k. Then setting \(\partial _{\xi _{ikl}}L = 0\) and \(\partial _{\epsilon _{ikl}}L = 0\) at the optimum yields:

$$\begin{aligned} \xi _{ikl}&= \frac{|Y_i||{\bar{Y}}_i|}{C} \alpha _{ikl},\qquad \epsilon _{ikl} = \frac{|Y_i||{\bar{Y}}_i|}{\mu C} \beta _{ikl} \end{aligned}$$
(17)

Substituting Eqs. (16) and (17) into the Lagrangian, simplifying the double counting, and rewriting the resulting maximization as a minimization, the dual of Eq. (15) can be expressed as:

$$\begin{aligned}&\min _{\alpha _{ikl}, \beta _{ikl}} \ \ \frac{1}{2} \sum _{k=1}^{q} \left\| {\varvec{w}}_k \right\| _{{\mathbb {H}}}^2 + \sum _{i=1}^{m} \sum _{(k,l)\in Y_i \times {\bar{Y}}_i} \left[ \alpha _{ikl}(\theta - 1) + \beta _{ikl} (\theta + 1) \right] \nonumber \\&\quad + \sum _{i=1}^{m} \frac{|Y_i||{\bar{Y}}_i|}{2C}\sum _{(k,l)\in Y_i \times {\bar{Y}}_i}\left( \frac{1}{\mu }\beta _{ikl}^2 + \alpha ^2_{ikl} \right) \nonumber \\&\text {s.t.} \ \, \alpha _{ikl} \ge 0, \ \beta _{ikl} \ge 0, \quad \forall (k,l)\in Y_i \times {\bar{Y}}_i, \ i=1,\ldots ,m. \end{aligned}$$
(18)

To keep the notation as simple as possible and the objective concise, we express it with both the dual variables and the weight vectors \({\varvec{w}}_k\). We now rewrite the objective in a block-vector representation of the dual variables for each instance. For the ith instance, we construct four column vectors \(\varvec{\alpha }_i\), \(\varvec{\beta }_i\), \({\varvec{c}}_i^k\) and \({\varvec{e}}_i\), all of dimension \(|Y_i||{\bar{Y}}_i|\), the number of relevant–irrelevant label pairs of the ith example. Specifically, for \(1\le i \le m\), the vectors are defined as follows:

$$\begin{aligned} \varvec{\alpha }_i&= \left[ \alpha _{ikl}\ |\ (k,l)\in Y_i \times {\bar{Y}}_i \right] ^\top , \nonumber \\ \varvec{\beta }_i&= \left[ \beta _{ikl}\ |\ (k,l)\in Y_i \times {\bar{Y}}_i \right] ^\top , \nonumber \\ {\varvec{c}}_i^k&= \left[ c_{ijl}^k\ |\ (j,l)\in Y_i \times {\bar{Y}}_i\right] ^\top , \nonumber \\ {\varvec{e}}_i&= \left[ 1,\ldots ,1 \right] ^\top \end{aligned}$$
(19)

where the entries of \({\varvec{c}}_i^k\) take values in \(\left\{ 0,+1,-1 \right\} \). Based on the above definitions, the optimal weight vectors can be rewritten as:

$$\begin{aligned} {\varvec{w}}_k = \sum _{i=1}^{m}\left( \left( \varvec{\alpha }_i - \varvec{\beta }_i \right) ^\top {\varvec{c}}_i^k \right) \phi ({\varvec{x}}_i) \end{aligned}$$
(20)

Using Eqs. (19) and (20), the dual problem Eq. (18) can be finally represented as:

$$\begin{aligned}&\min _{\varvec{\alpha },\varvec{\beta }}\ \ \frac{1}{2} \sum _{k=1}^{q} \sum _{i=1}^{m} \sum _{h=1}^{m} \left[ \left( \varvec{\alpha }_i - \varvec{\beta }_i \right) ^{\top }{\varvec{c}}_i^k \right] \left[ \left( \varvec{\alpha }_h - \varvec{\beta }_h \right) ^{\top }{\varvec{c}}_h^k \right] \phi ({\varvec{x}}_i)^{\top }\phi ({\varvec{x}}_h) \nonumber \\&\quad + \sum _{i=1}^{m}\frac{|Y_i||{\bar{Y}}_i|}{2C}\left( \frac{1}{\mu }\varvec{\beta }_i^{\top }\varvec{\beta }_i + \varvec{\alpha }_i^{\top }\varvec{\alpha }_i \right) +\sum _{i=1}^{m} \left[ (\theta -1)\varvec{\alpha }_i^{\top }{\varvec{e}}_i + (\theta + 1)\varvec{\beta }_i^{\top }{\varvec{e}}_i \right] \nonumber \\&\text {s.t.}\ \, \varvec{\alpha }_i \ge 0, \ \varvec{\beta }_i \ge 0, \ i = 1,\ldots ,m. \end{aligned}$$
(21)

The optimization problem involves \(2\sum _{i=1}^{m}|Y_i||{\bar{Y}}_i|\) variables, which is of order \(O\left( mq^2 \right) \) in the worst case, so an efficient optimization method is needed. Since the variables can be partitioned into m disjoint sets, with the ith set involving only \(\varvec{\alpha }_i\) and \(\varvec{\beta }_i\), it is natural to use the block coordinate descent (BCD) method (Richtárik and Takáč 2014) to decompose the problem into m sub-problems.

Denote the column vector \(\varvec{\zeta }=\left[ \varvec{\alpha }_1;\ldots ;\varvec{\alpha }_m;\varvec{\beta }_1;\ldots ;\varvec{\beta }_m \right] \), and let \(I_i\) be the 0–1 selection matrix satisfying \(I_i\varvec{\zeta } = \varvec{\alpha }_i\) for \(1\le i \le m\) and \(I_{i+m}\varvec{\zeta } = \varvec{\beta }_i\) for \(1\le i \le m\). The objective can then be reformulated as

$$\begin{aligned} \min _{\varvec{\zeta }}\ \,&\varvec{\zeta }^{\top } {\varvec{Q}} \varvec{\zeta } + \varvec{\zeta }^{\top } {\varvec{u}} + \varPsi (\varvec{\zeta }) \end{aligned}$$
(22)

where \({\varvec{Q}}=\frac{1}{2}\sum _{k=1}^{q}\sum _{i,h=1}^{m}\kappa ({\varvec{x}}_i,{\varvec{x}}_h)\left[ \left( I_i -I_{i+m} \right) ^{\top }{\varvec{c}}_i^{k} \right] \left[ \left( I_h -I_{h+m} \right) ^{\top }{\varvec{c}}_h^{k}\right] ^{\top } + \sum _{i=1}^{m} \frac{|Y_i||{\bar{Y}}_i|}{2C} \left( \frac{1}{\mu }I_{i+m}^{\top } I_{i+m} + I_{i}^{\top }I_{i} \right) \), \({\varvec{u}}\) stacks the linear coefficients \((\theta -1){\varvec{e}}_i\) and \((\theta +1){\varvec{e}}_i\) from Eq. (21), and \(\varPsi (\varvec{\zeta })\) equals 0 when \(\varvec{\zeta } \ge 0\) and \(+\infty \) otherwise. Since the first term of \({\varvec{Q}}\) is positive semi-definite and the second term is positive definite, the quadratic part of Eq. (22) is strongly convex and the problem satisfies the assumptions in Richtárik and Takáč (2014). Therefore, we can use the block coordinate descent algorithm to solve mlODM efficiently with a linear convergence rate.

Algorithm 1 below shows the details of the optimization procedure of mlODM by BCD.

[Algorithm 1: block coordinate descent for the dual of mlODM (pseudocode figure)]
[Algorithm 2: multiplicative updates for the sub-problem Eq. (24) (pseudocode figure)]
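Since the pseudocode figures are not reproduced here, the following Python sketch conveys the flavour of Algorithm 1: the per-instance blocks \((\varvec{\alpha }_i,\varvec{\beta }_i)\) are swept repeatedly, each block being updated by a sub-problem solver, with the shrinking heuristic of Sect. 5.2. Here `solve_subproblem` and `grad_block` are hypothetical callables standing in for the details given below; the code is an illustration rather than the authors' implementation.

```python
import numpy as np

def bcd_outer_loop(blocks, solve_subproblem, grad_block, max_iter=50, tol=1e-6):
    """Sweep the per-instance dual blocks eta_i = [alpha_i; beta_i] until convergence.

    blocks           : list of 1-D arrays, current values of the eta_i (all >= 0)
    solve_subproblem : callable(i, blocks) -> minimizer of the i-th sub-problem (Eq. 24)
    grad_block       : callable(i, blocks) -> gradient of F w.r.t. eta_i (for shrinking)
    """
    for _ in range(max_iter):
        change = 0.0
        for i in range(len(blocks)):
            # shrinking heuristic of Sect. 5.2: skip blocks that are already stationary
            if np.linalg.norm(grad_block(i, blocks)) < tol:
                continue
            new_eta = solve_subproblem(i, blocks)
            change = max(change, float(np.max(np.abs(new_eta - blocks[i]))))
            blocks[i] = new_eta
        if change < tol:              # no block moved appreciably: stop
            break
    return blocks
```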

5.2 Solving the sub-problem

For each sub-problem, we select the \(2|Y_i||{\bar{Y}}_i|\) variables \(\varvec{\alpha }_i\) and \(\varvec{\beta }_i\) corresponding to one instance to minimize over while keeping the other variables constant, and repeat this procedure until convergence.

Note that all variables are fixed except \(\varvec{\alpha }_i\) and \(\varvec{\beta }_i\). After removing the constants, we obtain the sub-problem as follows:

$$\begin{aligned}&\min _{\varvec{\alpha }_i, \varvec{\beta }_i}\ \, \frac{1}{2} \left( \varvec{\alpha }_i - \varvec{\beta }_i \right) ^{\top } {\varvec{A}}_i \left( \varvec{\alpha }_i - \varvec{\beta }_i \right) + M_i\left( \frac{1}{\mu }\varvec{\beta }_i^{\top }\varvec{\beta }_i + \varvec{\alpha }_i^{\top }\varvec{\alpha }_i \right) \nonumber \\&\qquad + {\varvec{b}}_i \left( \varvec{\alpha }_i - \varvec{\beta }_i \right) + \theta \left( \varvec{\alpha }_i + \varvec{\beta }_i \right) ^{\top }{\varvec{e}}_i - \left( \varvec{\alpha }_i - \varvec{\beta }_i \right) ^{\top }{\varvec{e}}_i \nonumber \\&\text {s.t.}\ \ \varvec{\alpha }_i \ge {\varvec{0}},\ \varvec{\beta }_i \ge {\varvec{0}} \end{aligned}$$
(23)

where \({\varvec{A}}_i = \left( \sum _{k=1}^{q} {\varvec{c}}_i^k {{\varvec{c}}_i^k}^{\top } \right) \kappa ({\varvec{x}}_i, {\varvec{x}}_i)\) is a matrix, \({\varvec{b}}_i = \sum _{k=1}^{q}\sum _{j \ne i}\left( \varvec{\alpha }_j - \varvec{\beta }_j \right) ^{\top }{\varvec{c}}_j^k\, {{\varvec{c}}_i^k}^{\top } \kappa ({\varvec{x}}_j, {\varvec{x}}_i)\) is a row vector, and \(M_i = \frac{|Y_i||{\bar{Y}}_i|}{2C}\) is a constant. \(\varvec{\alpha }_i \ge {\varvec{0}}\) means that each element of \(\varvec{\alpha }_i\) is nonnegative, and likewise for \(\varvec{\beta }_i\).

Let the column vector \(\varvec{\eta }_i = \left[ \varvec{\alpha }_i; \varvec{\beta }_i \right] \) for \(i=1,\ldots ,m\), let \({\varvec{I}}\) be the identity matrix of dimension \(|Y_i||{\bar{Y}}_i|\), and let \({\varvec{I}}_{\mu }\) be the diagonal matrix of dimension \(2|Y_i||{\bar{Y}}_i|\) whose first half of diagonal elements are 1 and whose second half are \(\frac{1}{\mu }\). The objective can be further represented as

$$\begin{aligned}&\min _{\varvec{\eta }_i} \ \ F(\varvec{\eta }_i)\triangleq \frac{1}{2}\varvec{\eta }_i^\top {\varvec{H}}_i \varvec{\eta }_i + {\varvec{v}}_i \varvec{\eta }_i \nonumber \\&\text {s.t.}\ \ \varvec{\eta }_i \ge {\varvec{0}} \end{aligned}$$
(24)

where \({\varvec{H}}_i = [{\varvec{I}},-{\varvec{I}}]^{\top } {\varvec{A}}_i [{\varvec{I}},-{\varvec{I}}] + 2 M_i {\varvec{I}}_{\mu }\) is a matrix and \({\varvec{v}}_i = {\varvec{b}}_i[{\varvec{I}},-{\varvec{I}}] + \theta {\varvec{e}}_i^\top [{\varvec{I}},{\varvec{I}}] - {\varvec{e}}_i^\top [{\varvec{I}},-{\varvec{I}}]\) is a row vector. It is easy to prove that the first part of \({\varvec{H}}_i\) is positive semi-definite and the second part \(2 M_i {\varvec{I}}_{\mu }\) is positive definite. Thus \({\varvec{H}}_i\) is positive definite, and the problem is strictly convex.
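For illustration, the following sketch assembles the sub-problem data \({\varvec{c}}_i^k\), \({\varvec{A}}_i\), \({\varvec{H}}_i\) and \({\varvec{v}}_i\) for one instance, following the definitions around Eqs. (23) and (24); \({\varvec{b}}_i\) is assumed to be given (it depends on the other blocks), and the function is a sketch under these assumptions rather than the authors' code.

```python
import numpy as np

def build_subproblem(relevant, irrelevant, q, kappa_ii, b_i, C, mu, theta):
    """Assemble H_i and v_i of Eq. (24) for a single instance.

    relevant, irrelevant : label indices forming the pairs (k, l) in Y_i x Ybar_i
    q        : total number of labels
    kappa_ii : kernel value kappa(x_i, x_i)
    b_i      : 1-D array, the row vector b_i of Eq. (23) (depends on the other blocks)
    """
    pairs = [(k, l) for k in relevant for l in irrelevant]
    n = len(pairs)                                   # |Y_i| * |Ybar_i|
    # row k of Ck is the vector c_i^k: +1 where k is the relevant label of the pair,
    # -1 where it is the irrelevant one, 0 otherwise
    Ck = np.zeros((q, n))
    for idx, (k, l) in enumerate(pairs):
        Ck[k, idx] = 1.0
        Ck[l, idx] = -1.0
    A_i = (Ck.T @ Ck) * kappa_ii                     # A_i = (sum_k c_i^k c_i^k^T) kappa(x_i, x_i)
    M_i = n / (2.0 * C)                              # M_i = |Y_i||Ybar_i| / (2C)
    I = np.eye(n)
    J = np.hstack([I, -I])                           # the matrix [I, -I]
    I_mu = np.diag(np.r_[np.ones(n), np.ones(n) / mu])
    H_i = J.T @ A_i @ J + 2.0 * M_i * I_mu           # positive definite
    e = np.ones(n)
    v_i = b_i @ J + theta * np.r_[e, e] - np.r_[e, -e]
    return H_i, v_i
```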

Through the above derivation, the sub-problem is finally reformulated as a convex nonnegative quadratic programming problem, which can be solved efficiently by a QP solver. To avoid the drawbacks of having to choose a learning rate and control the precision, we choose the Multiplicative Margin Maximization (\(\text {M}^3\)) method (Sha et al. 2002) to solve Eq. (24). The detailed procedure is shown in Algorithm 2. It is worth mentioning that the \(\text {M}^3\) algorithm reaches the minimum at a fixed point, i.e., when one of two conditions holds for each element \(\eta _j\) of the optimization variables: \((\text {1})\ \eta _j^{*}>0\) and \( \lambda _j = 1 \), or \((\text {2})\ \eta _j^{*}=0\). In the experiments, each variable \(\eta _j\) is initialized to 1, and the fixed-point criterion can be relaxed. In addition, we utilize a simple and effective heuristic shrinking strategy for further acceleration: since \(\nabla F(\varvec{\eta }_i)={\varvec{0}}\) indicates that the corresponding block \(\varvec{\eta }_i\) has reached its optimum, we can move to the next block without updating \(\varvec{\eta }_i\) when this condition holds.
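For reference, our reading of the \(\text {M}^3\) multiplicative update of Sha et al. (2002), applied to the nonnegative QP of Eq. (24), is sketched below: \({\varvec{H}}_i\) is split into its positive and negative parts and every coordinate is rescaled by a factor \(\lambda _j\), with the fixed-point criterion relaxed by a tolerance as described above. Treat it as illustrative; consult Sha et al. (2002) for the exact update and its convergence guarantees.

```python
import numpy as np

def m3_solve(H, v, max_iter=1000, tol=1e-6):
    """Multiplicative updates (after Sha et al. 2002) for
    min 0.5 * eta^T H eta + v^T eta  subject to  eta >= 0."""
    H_pos = np.maximum(H, 0.0)            # positive part of H
    H_neg = np.maximum(-H, 0.0)           # magnitudes of the negative entries
    eta = np.ones(H.shape[0])             # initialize every eta_j to 1, as in Sect. 5.2
    for _ in range(max_iter):
        a = H_pos @ eta
        c = H_neg @ eta
        # per-coordinate rescaling factor lambda_j; the small constant guards against 0/0
        lam = (-v + np.sqrt(v ** 2 + 4.0 * a * c)) / (2.0 * a + 1e-12)
        eta_new = eta * lam
        # relaxed fixed-point criterion: eta_j ~ 0 or lambda_j ~ 1 for every coordinate
        if np.all((eta_new < tol) | (np.abs(lam - 1.0) < tol)):
            return eta_new
        eta = eta_new
    return eta
```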

6 Empirical study

In this section, we empirically evaluate the effectiveness of our method on seven datasets. We first introduce the experimental settings in Sect. 6.1, including the datasets, the compared methods, the evaluation metrics, the threshold calibration and the hyperparameter settings. In Sect. 6.2, we compare the performance on four metrics and verify the superiority of mlODM. We analyze the convergence of mlODM on six datasets in Sect. 6.3 and compare the margin distributions of the methods by visualization in Sect. 6.4.

In general, we compare our method with six multi-label classification methods on seven classic datasets, and use four metrics to evaluate the performance of each method. We then analyze the characteristics of our method empirically. The details of the experiments are given below.

6.1 Experimental setup

The seven datasets (Emotions, Scene, Yeast, Birds, Genbase, Medical and Enron) are taken from the MULAN multi-label learning library (Tsoumakas et al. 2011b). They cover five diverse domains: audio, music, image, biology and text. The characteristics of all datasets are detailed in Table 1. \(\#\)Train, \(\#\)Test and \(\#\)Feature denote the numbers of training examples, test examples and features respectively; the number of labels is denoted by \(\#\)Labels, and the label cardinality and density (%) by LCard and LDen respectively. All features are normalized into the interval [0, 1].

Table 1 Characteristics of datasets in our experiments

6.1.1 Compared methods

In experiments, we compare our proposed method mlODM with six multi-label algorithms as follows.

  • BP-MLL (Zhang and Zhou 2006). BP-MLL is a neural network algorithm for multi-label classification, which employs a multi-label error function in the backpropagation algorithm.

  • Rank-SVM (Elisseeff and Weston 2002). Rank-SVM is a famous and classic margin-based multi-label classification method, which aims to maximize the minimum margin over label pairs. The objective is optimized by the Frank–Wolfe algorithm (Frank and Wolfe 1956), with each sub-problem being a linear programming problem.

  • ML-KNN (Zhang and Zhou 2007). The basic idea of this method is to adapt k-nearest neighbor techniques to multi-label data, where the maximum a posteriori (MAP) rule is utilized to make predictions by reasoning with the labeling information embodied in the neighbors.

  • Rank-SVMz (Xu 2012). By adding a zero label to Rank-SVM, Rank-SVMz obtains a special QP problem in which each class has an independent equality constraint, and it does not need to learn the linear threshold function by regression.

  • Rank-CVM (Xu 2013a). The key idea of Rank-CVM is to combine Rank-SVM with the binary core vector machine (CVM). The optimization is formulated as a QP problem with a unit-simplex constraint, as in CVM.

  • Rank-LSVM (Xu 2016). This recently proposed method generalizes the binary Lagrangian support vector machine (LSVM) to multi-label classification, resulting in a strictly convex quadratic programming problem with only non-negativity constraints.

The compared methods include three classic multi-label classification methods, BP-MLL, Rank-SVM and ML-KNN, which are implemented in MATLAB;Footnote 1 and three methods modified from Rank-SVM, namely Rank-SVMz, Rank-CVM and Rank-LSVM, all implemented in C/C++ in the MLC-SVM package.Footnote 2

6.1.2 Evaluation metrics

In contrast to single-label classification, performance evaluation in multi-label classification is more complicated, and a number of performance measures focusing on different aspects have been proposed (Schapire and Singer 2000; Tsoumakas et al. 2011a). Recently, Wu and Zhou (2017) provided a unified margin view of these measures, suggesting that it is more informative to evaluate with both measures optimized by label-wise effective predictors and measures optimized by instance-wise effective predictors. Inspired by these theoretical results, we select ranking loss, one-error and average precision for the first kind of measure and Hamming loss for the second. We recall the definitions of the metrics as follows; \(\uparrow (\downarrow )\) indicates that the larger (smaller) the value, the better the performance.

The ranking loss evaluates the fraction of reversely ordered label pairs, i.e., an irrelevant label is ranked higher than a relevant label.

$$\begin{aligned} rloss(\downarrow ) =\frac{1}{m}\sum _{i=1}^{m}\frac{1}{|Y_i||{\bar{Y}}_i|} \left| \left\{ (y_k,y_l)\ |\ f({\varvec{x}}_i,y_k)\le f({\varvec{x}}_i,y_l),\ (y_k,y_l)\in Y_i\times {\bar{Y}}_i \right\} \right| \end{aligned}$$

The one-error evaluates the fraction of examples whose top-ranked label is not in the relevant label set.

$$\begin{aligned} one\text {-}error(\downarrow )=\frac{1}{m} \sum _{i=1}^{m} \llbracket [\text {argmax}_{y\in {\mathcal {Y}}} f({\varvec{x}}_i,y) ]\notin Y_i \rrbracket \end{aligned}$$

where \(\llbracket \cdot \rrbracket \) equals 1 if \(\cdot \) is true and 0 otherwise.

The average precision evaluates the average fraction of relevant labels ranked higher than a particular label \(y\in Y_i\).

$$\begin{aligned} averprec(\uparrow )=\frac{1}{m}\sum _{i=1}^{m} \frac{1}{|Y_i|}\sum _{y\in Y_i} \frac{\left| \left\{ y'\ |\ rank_{f}({\varvec{x}}_i,y')\le rank_{f}({\varvec{x}}_i,y),\ y'\in Y_i \right\} \right| }{rank_{f}({\varvec{x}}_i,y)} \end{aligned}$$

The Hamming loss evaluates how many times an instance-label pair is misclassified.

$$\begin{aligned} Hamming\ loss(\downarrow )=\frac{1}{m} \sum _{i=1}^{m} \frac{1}{q}\left| h({\varvec{x}}_i)\varDelta Y_i \right| \end{aligned}$$

where \(\varDelta \) stands for the symmetric difference between two sets. All methods are evaluated on these four measures.
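The four measures can be computed directly from the real-valued score matrix, as in the sketch below, which follows the definitions above (ties are counted as reversed pairs in the ranking loss, edge cases are handled in the simplest way, and the names are illustrative).

```python
import numpy as np

def evaluate(scores, Y_sets, H_sets, q):
    """scores : (m, q) real-valued outputs f(x_i, y);
    Y_sets : true relevant-label sets; H_sets : predicted label sets."""
    m = scores.shape[0]
    rloss = one_err = avg_prec = hloss = 0.0
    for i in range(m):
        Y = set(Y_sets[i]); Ybar = set(range(q)) - Y
        f = scores[i]
        # ranking loss: fraction of relevant-irrelevant pairs that are reversely ordered
        rloss += sum(f[k] <= f[l] for k in Y for l in Ybar) / (len(Y) * len(Ybar))
        # one-error: the top-ranked label over all q labels is not relevant
        one_err += int(np.argmax(f) not in Y)
        # average precision: for each relevant y, fraction of relevant labels ranked above it
        rank = np.argsort(np.argsort(-f)) + 1          # rank_f(x_i, y), 1 = best
        avg_prec += np.mean([sum(rank[yp] <= rank[y] for yp in Y) / rank[y] for y in Y])
        # Hamming loss: size of the symmetric difference between predicted and true sets
        hloss += len(set(H_sets[i]) ^ Y) / q
    return rloss / m, one_err / m, avg_prec / m, hloss / m
```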

6.1.3 Settings of each method

For Rank-SVM, Rank-CVM, Rank-LSVM and our method mlODM, the threshold function is determined using the linear regression technique of Elisseeff and Weston (2002). Specifically, we train a linear model to predict the threshold: for each training example, the learning system produces the q-dimensional score vector \(\left( f_1({\varvec{x}}_i),\ldots ,f_q({\varvec{x}}_i) \right) \) as input, and the target value is the optimal threshold obtained by minimizing the Hamming loss on that example. The resulting linear regression threshold function then serves as the label-set predictor.
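A minimal sketch of this calibration step is given below: for each training example the threshold minimizing its Hamming loss is found by scanning candidate cut points between sorted scores, and a least-squares model then maps score vectors to thresholds; the helper names are hypothetical and the sketch is not the original implementation.

```python
import numpy as np

def fit_threshold(train_scores, Y_sets):
    """Fit the linear threshold function t(f) = f . w + b on the training score vectors."""
    m, q = train_scores.shape
    targets = np.empty(m)
    for i in range(m):
        f = train_scores[i]
        y = np.array([1 if j in Y_sets[i] else 0 for j in range(q)])
        s = np.sort(f)
        # candidate thresholds: midpoints between consecutive sorted scores, plus extremes
        cands = np.r_[s[0] - 1.0, (s[:-1] + s[1:]) / 2.0, s[-1] + 1.0]
        # the target is the threshold minimizing the Hamming loss on this example
        targets[i] = min(cands, key=lambda t: np.sum((f > t).astype(int) != y))
    A = np.c_[train_scores, np.ones(m)]                 # append a bias column
    coef, *_ = np.linalg.lstsq(A, targets, rcond=None)  # least-squares regression
    return coef

def predict_labels(scores, coef):
    t = np.c_[scores, np.ones(len(scores))] @ coef      # predicted threshold per instance
    return [np.flatnonzero(scores[i] > t[i]).tolist() for i in range(len(scores))]
```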

For Rank-SVM, Rank-CVM, Rank-SVMz, Rank-LSVM and mlODM, the RBF kernel is used in all experiments. For the first four methods, the hyperparameters, i.e., the RBF kernel scale factor \(\gamma \) and the regularization parameter C, are set as recommended in Xu (2012, 2016), tuned from \(\left\{ 2^{-10},2^{-9},\ldots ,2^2 \right\} \) and \(\left\{ 2^{-2},2^{-1},\ldots ,2^{10}\right\} \) respectively. For our method mlODM, C and \(\gamma \) are selected by 5-fold cross validation from the same ranges as for Rank-SVM. In addition, the trade-off parameter \(\mu \) and the approximation parameter \(\theta \) are selected from \(\left\{ 0.1,0.2,\ldots ,0.9 \right\} \). For ML-KNN, the number of nearest neighbors is 10. For BP-MLL, as recommended, the learning rate is fixed at 0.05, the number of hidden neurons is set to \(20\%\) of the number of input units, the number of training epochs is fixed to 100 and the regularization constant is set to 0.1. All randomized algorithms are repeated five times.
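The following sketch outlines the cross-validation grid described above, with `train_and_eval` a hypothetical callable that trains mlODM on one fold split and returns a validation loss; a full sweep of this grid is expensive, so in practice the search can be coarsened.

```python
import itertools
import numpy as np

def select_hyperparams(train_and_eval, X, Y_sets, n_folds=5, seed=0):
    """5-fold cross-validated grid search over the ranges given in Sect. 6.1.3.

    train_and_eval : callable(X_tr, Y_tr, X_va, Y_va, C, gamma, mu, theta)
                     -> validation loss (e.g., ranking loss); assumed to train mlODM
    """
    Cs     = [2.0 ** k for k in range(-2, 11)]     # {2^-2, ..., 2^10}
    gammas = [2.0 ** k for k in range(-10, 3)]     # {2^-10, ..., 2^2}
    mus    = thetas = [0.1 * k for k in range(1, 10)]
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    best, best_loss = None, np.inf
    for C, g, mu, th in itertools.product(Cs, gammas, mus, thetas):
        losses = []
        for f in range(n_folds):
            va = folds[f]
            tr = np.hstack([folds[j] for j in range(n_folds) if j != f])
            losses.append(train_and_eval(X[tr], [Y_sets[i] for i in tr],
                                         X[va], [Y_sets[i] for i in va], C, g, mu, th))
        if np.mean(losses) < best_loss:
            best, best_loss = (C, g, mu, th), float(np.mean(losses))
    return best
```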

6.2 Results and discussion

Table 2 shows the results for ranking loss, Hamming loss, one-error and average precision, where the best result on each dataset for each metric is bolded. From the experimental results, mlODM outperforms the other methods in all evaluation metrics on more than half of the datasets and obtains very competitive results on the others. Specifically, mlODM performs better than BP-MLL/ML-KNN/Rank-SVM/Rank-CVM/Rank-SVMz/Rank-LSVM on 24/28/19/20/25/18 of the 28 dataset–metric combinations (seven datasets, four metrics). On the other hand, mlODM exceeds the performance that the best-tuned Rank-SVM and Rank-LSVM can achieve on many classic datasets, such as Emotions, Scene and Birds, which verifies the better generalization performance of optimizing the margin distribution.

Table 2 Experimental results of seven methods on seven datasets in four measures
Fig. 1 Training process of mlODM on six datasets

We can give an intuitive explanation of the improved generalization performance of mlODM. Unlike Rank-SVM, which takes only the points nearest to the hyperplane into account, mlODM exploits the information of the data distribution by optimizing the margin distribution, while the approximation strategy of Sect. 4 keeps the problem efficiently solvable. By incorporating the information of the data distribution, the method becomes more robust and generalizes better. To see this, assume the data is unevenly distributed, which is common in the real world: SVM-style methods consider only the points near the decision boundary, which may be unrepresentative, whereas ODM-style methods aim to separate the representative parts on both sides of the decision boundary. It is therefore reasonable that ODM-style methods generalize better in most cases. When the points nearest to the decision boundary do characterize a good classifier, SVM-style methods can achieve generalization performance similar to mlODM.

Fig. 2 Margin distribution of mlODM, Rank-SVM, Rank-CVM, Rank-SVMz and Rank-LSVM

6.3 Training process

We visualize the training process of mlODM on six datasets in this subsection to verify the fast convergence mentioned in Sect. 5. Figure 1 shows the training process in terms of ranking loss on the training and test data of six datasets; all coordinate blocks are updated during each iteration. The figure illustrates that mlODM converges very fast on most datasets; in particular, the test loss converges within one iteration in all runs. Together with the analysis in Sect. 6.2, this experiment indicates that although mlODM utilizes the information of the data distribution, which seems more complicated than SVM-style methods, it can still be solved efficiently. The reason is that the dual problem of mlODM is strictly convex and satisfies the assumptions of Richtárik and Takáč (2014), which yields a linear convergence rate.

Fig. 3 Margin distribution of mlODM, Rank-SVM, Rank-CVM, Rank-SVMz and Rank-LSVM

6.4 Comparison of margin distribution

In this subsection, we empirically analyze the margin distributions of the margin-based methods, i.e., mlODM, Rank-SVM, Rank-CVM, Rank-SVMz and Rank-LSVM, as shown in Figs. 2 and 3. It is obvious that mlODM obtains a better margin distribution than the other four methods, i.e., its margins are more concentrated. The figures also illustrate that SVM-style methods often have poor margin distributions, for example on Medical, Birds and Genbase; the reason is that the points nearest to the classification hyperplane are not always representative. In general, without considering the distribution of the data, the generalization performance of SVM-style methods is not always promising. Since the choice of hyperparameters also affects the margin distribution, we use a uniform parameter setting in these experiments, namely \(C=1\) and \(\gamma =2^{-1}\).

6.5 Effect of hyperparameters

Our proposed mlODM method introduces two hyperparameters, \(\theta \) and \(\mu \), to improve sparsity and to trade off the penalties on the two sides respectively. Figure 4 presents the effect of the hyperparameters of mlODM on four metrics over Emotions. The figure shows that the loss varies smoothly with both hyperparameters, which makes them convenient to tune and the method credible. In particular, a small \(\mu \) and a large \(\theta \) are a good choice for Emotions.

Fig. 4 Effect of hyperparameters of mlODM on four metrics over Emotions

7 Related work

This work is related to two branches of studies. The first is SVM-style multi-label learning approaches. The support vector machine (SVM) (Cortes and Vapnik 1995) has been one of the most successful machine learning techniques over the past decades, with kernel methods providing a powerful and unified framework for nonlinear problems (Schölkopf and Smola 2001). Elisseeff and Weston (2002) first applied this framework to multi-label learning and proposed Rank-SVM, which has become one of the most famous multi-label learning methods. Like binary SVM, Rank-SVM can be expressed as minimizing the empirical ranking loss plus a regularization term that controls the model complexity. Tsochantaridis et al. (2005) later extended the framework to a general form of structured output classification, in which the ranking loss function can be replaced. For example, Guo and Schuurmans (2011) proposed the calibrated separation ranking loss, which uses a simpler dependence structure and obtains better generalization performance.

There is a large body of work improving Rank-SVM in efficiency or performance. Specifically, considering that the selected threshold may not be optimal because it is chosen separately from the training process, Jiang et al. (2008) proposed Calibrated-RankSVM. To accelerate the time-consuming training process, Xu (2012) proposed SVM-ML, which adds a zero label to detect relevant labels and simplifies the original form of Rank-SVM; Xu (2013b) used a random block coordinate descent method to solve the dual problem instead of the Frank–Wolfe algorithm. Both methods significantly reduce the computational cost and obtain competitive performance. In addition, there are a number of variants and applications of Rank-SVM. For example, Xu (2016) generalized the Lagrangian support vector machine (LSVM) to multi-label learning and proposed Rank-LSVM; Liu et al. (2015) proposed the rank-wavelet SVM (Rank-WSVM) for the classification of power quality complex disturbances.

The second branch of studies utilizes the margin distribution in classification tasks. Although the above framework has been successful and its performance is supported by margin theory (Vapnik 1995), all of the above methods are based on the large margin formulation. However, studies of margin theory for boosting (Schapire et al. 1998; Reyzin and Schapire 2006; Gao and Zhou 2013) finally disclosed that maximizing the minimum margin does not necessarily lead to better generalization performance, and that the margin distribution is more crucial. Inspired by this idea, Zhang and Zhou (2014b, 2019) proposed the large margin distribution machine (LDM) and its simplified version, the optimal margin distribution machine (ODM), for binary classification. Thereafter, a variety of methods based on the margin distribution have been proposed: Zhou and Zhou (2016) and Zhang and Zhou (2017, 2018) generalized ODM to class-imbalance learning, multi-class learning and unsupervised learning respectively. In weakly supervised learning (Zhou 2018), Zhang and Zhou (2018a) proposed the semi-supervised ODM (ssODM), which achieves significant improvement in performance over SVM-based methods. Lv et al. (2018) introduced the margin distribution into neural networks and proposed the optimal margin distribution network (mdNet), which outperforms the cross-entropy loss model.

However, for the more general learning paradigm in real-world tasks, i.e., multi-label learning, whether optimizing the margin distribution is still effective has remained unknown. By introducing this idea into multi-label classification for the first time, this paper proposes the multi-label optimal margin distribution machine (mlODM) and shows its superiority with extensive experiments.

8 Conclusion

In this paper, we propose a multi-label classification method named mlODM, which is the first to extend the idea of optimizing the margin distribution to multi-label learning. Based on the approximation of the margin mean and margin variance as in binary ODM, and the simplification technique of Rank-SVM, we present the formulation of mlODM in Sect. 4. We then exploit the structure of the optimization problem and solve it efficiently by block coordinate descent in Sect. 5. Empirically, extensive comparisons with classic methods on different measures verify the superiority of our method, and the visualization of the margin distribution and of the convergence behavior illustrates its characteristics. In the future, it will be interesting to solve the sub-problem in a more efficient way to accelerate the method, and to provide a theoretical analysis of the good performance of mlODM. Another interesting future issue is to incorporate the proposed method into the recently proposed abductive learning (Zhou 2019), a new paradigm that leverages both machine learning and logical reasoning, to enable it to handle multi-label concepts.