A Novel Semi-supervised Multi-label Twin Support Vector Machine

Multi-label learning is a supervised learning task in which each sample may belong to multiple labels simultaneously. This characteristic makes multi-label learning more complicated and more difficult than multi-class classification. The multi-label twin support vector machine (MLTSVM) [1], an effective multi-label learning algorithm based on the twin support vector machine (TSVM), has been widely studied because of its good classification performance. To achieve good generalization performance, the MLTSVM typically needs a large number of labelled samples. In practical engineering problems, obtaining all labels of all samples is time consuming and difficult, so one often has only a large number of partially labelled and unlabelled samples and a small number of fully labelled samples. The MLTSVM, however, can use only the expensive labelled samples and ignores the inexpensive partially labelled and unlabelled ones. To overcome this disadvantage, we propose a novel semi-supervised multi-label twin support vector machine, named SS-MLTSVM, which takes full advantage of the geometric information of the marginal distribution embedded in partially labelled and unlabelled samples by introducing a manifold regularization term into each sub-classifier, and uses the successive overrelaxation (SOR) method to speed up the solving process. Experimental results on several publicly available benchmark multi-label datasets show that, compared with the classical MLTSVM, the proposed SS-MLTSVM has better classification performance.


Introduction
Multi-label learning is a supervised learning task wherein each sample may belong to multiple different labels simultaneously. Many real-world applications employ multi-label learning, including text classification [2,3], image annotation [4], bioinformatics [5], and so on [6]. Because a sample can carry multiple labels simultaneously, multi-label learning is more complicated and more difficult than multi-class classification. At present, there are two kinds of methods for solving multi-label learning problems: problem transformation and algorithm adaptation. Problem transformation solves the multi-label learning problem by transforming it into one or more single-label problems, such as

MLTSVM
For the multi-label problem, we denote the training set as $T = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, where $x_i \in \mathbb{R}^n$ is a training sample, $y_i = \{y_{i1}, \ldots, y_{ip}\}$ is the label set of sample $x_i$ with $y_{iq} \in \{1, \ldots, K\}$ $(q = 1, \ldots, p)$, $m$ is the total number of training samples, and $K$ is the total number of labels.
The MLTSVM seeks $K$ hyperplanes
$$f_k(x) = w_k^{T} x + b_k = 0, \quad k = 1, \ldots, K.$$
We denote the samples belonging to the $k$th class by $A_k$ and the remaining samples by $B_k$. The original problem for label $k$ is
$$\min_{w_k, b_k, \xi_{B_k}} \ \frac{1}{2}\left\|A_k w_k + e_{A_k} b_k\right\|^2 + c_k\, e_{B_k}^{T} \xi_{B_k} + \frac{\lambda_k}{2}\left(\|w_k\|^2 + b_k^2\right)$$
$$\text{s.t.} \quad -\left(B_k w_k + e_{B_k} b_k\right) + \xi_{B_k} \ge e_{B_k}, \quad \xi_{B_k} \ge 0, \qquad (10)$$
where $c_k$ and $\lambda_k$ are the penalty parameters, $e_{A_k}$ and $e_{B_k}$ are all-ones vectors of the proper dimension, and $\xi_{B_k}$ is the slack variable.
By introducing Lagrange multipliers, the dual problem of (10) can be obtained as follows:
$$\max_{\alpha_{B_k}} \ e_{B_k}^{T} \alpha_{B_k} - \frac{1}{2}\, \alpha_{B_k}^{T} G \left(H^{T} H + \lambda_k I_k\right)^{-1} G^{T} \alpha_{B_k}$$
$$\text{s.t.} \quad 0 \le \alpha_{B_k} \le c_k, \qquad (11)$$
where $H = [A_k \ \ e_{A_k}]$, $G = [B_k \ \ e_{B_k}]$, $I_k$ is a diagonal matrix of the proper dimension, and $\alpha_{B_k}$ is the vector of Lagrange multipliers.
By solving the dual problem (11), we can obtain the hyperplane parameters
$$\begin{bmatrix} w_k \\ b_k \end{bmatrix} = -\left(H^{T} H + \lambda_k I_k\right)^{-1} G^{T} \alpha_{B_k}.$$
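As an illustrative sketch of this computation (function and variable names are hypothetical, and a simple projected-gradient ascent loop stands in for a proper QP solver), the dual (11) and the resulting hyperplane parameters for one label might be computed as follows:

```python
import numpy as np

def mltsvm_label_k(A_k, B_k, c_k, lam_k, n_iter=5000, lr=1e-3):
    """Sketch of solving the MLTSVM dual (11) for one label.
    A_k: samples of the kth class; B_k: the remaining samples;
    c_k, lam_k: the penalty parameters of problem (10)."""
    H = np.hstack([A_k, np.ones((A_k.shape[0], 1))])   # H = [A_k, e_{A_k}]
    G = np.hstack([B_k, np.ones((B_k.shape[0], 1))])   # G = [B_k, e_{B_k}]
    P = H.T @ H + lam_k * np.eye(H.shape[1])           # H^T H + lam_k I
    Q = G @ np.linalg.solve(P, G.T)                    # G (H^T H + lam_k I)^{-1} G^T
    e = np.ones(G.shape[0])
    alpha = np.zeros_like(e)
    for _ in range(n_iter):                            # projected gradient ascent on (11)
        alpha += lr * (e - Q @ alpha)
        alpha = np.clip(alpha, 0.0, c_k)               # box constraint 0 <= alpha <= c_k
    wb = -np.linalg.solve(P, G.T @ alpha)              # [w_k; b_k]
    return wb[:-1], wb[-1]
```

With a well-separated toy problem, samples in $B_k$ end up on the negative side of the hyperplane while samples of class $k$ lie close to it.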

ML-STSVM
Similar to the MLTSVM, the ML-STSVM also seeks $K$ hyperplanes, and the original problem for label $k$ is formulated analogously, where $c_{ki}$ $(i = 1, 2, 3)$ are the penalty parameters, $\xi_{B_k}$ is the slack variable, and $e_{A_k}$ and $e_{B_k}$ are all-ones vectors of the proper dimension. The dual problem of (14) is obtained in the same way, where $I_k$ is a diagonal matrix of the proper dimension and $\alpha_{B_k}$ is the vector of Lagrange multipliers.
By solving the dual problem (15), we can obtain the corresponding hyperplane parameters.

SS-MLTSVM
For the semi-supervised multi-label problem, we define the training set as $T = \{(x_1, y_1), \ldots, (x_m, y_m)\}$, where $x_i \in \mathbb{R}^n$ is a training sample and $y_i = \{y_{i1}, \ldots, y_{iK}\}$ is the label vector of sample $x_i$, with
$$y_{ik} = \begin{cases} +1, & \text{if } x_i \text{ belongs to the } k\text{th class}, \\ -1, & \text{if } x_i \text{ does not belong to the } k\text{th class}, \\ 0, & \text{uncertain}, \end{cases} \qquad 1 \le k \le K.$$

Manifold Regularization Framework
The manifold regularization framework [32], proposed by Belkin et al., can effectively solve semi-supervised learning problems. Its objective function can be expressed as
$$\min_{f} \ \sum_{i} V(x_i, y_i, f) + \gamma_A \|f\|_H^2 + \gamma_I \|f\|_M^2, \qquad (19)$$
where $f$ is the decision function to be solved, $V$ is the loss function on the labelled samples, the regularization term $\|f\|_H^2$ controls the complexity of the classifier, and the manifold regularization term $\|f\|_M^2$ reflects the intrinsic manifold structure of the data distribution.

Linear SS-MLTSVM
Similar to the MLTSVM, for each label, the SS-MLTSVM seeks a hyperplane
$$f_k(x) = w_k^{T} x + b_k = 0, \quad k = 1, \ldots, K.$$
For the $k$th label, we denote the samples that definitely belong to the $k$th class by $A_k$, i.e., $A_k = \{x_i \mid y_{ik} = +1\}$; the samples that definitely do not belong to the $k$th class by $B_k$, i.e., $B_k = \{x_i \mid y_{ik} = -1\}$; and the samples whose membership in the $k$th class is uncertain by $U_k$, i.e., $U_k = \{x_i \mid y_{ik} = 0\}$. To make full use of $U_k$, according to the manifold regularization framework, in our SS-MLTSVM the loss function $V$ is replaced by a squared loss on $A_k$ and a hinge loss on $B_k$, namely
$$V = \frac{1}{2}\left\|A_k w_k + e_{A_k} b_k\right\|^2 + c_{k1}\, e_{B_k}^{T} \xi_{B_k}.$$
The regularization term $\|f\|_H^2$ can be replaced by
$$\|f\|_H^2 = \frac{1}{2}\left(\|w_k\|^2 + b_k^2\right).$$
The manifold regularization term $\|f\|_M^2$ can be expressed as
$$\|f\|_M^2 = \left(M w_k + e b_k\right)^{T} L \left(M w_k + e b_k\right),$$
where $M$ is the matrix of all training samples, $L = D - W$, $W$ is defined as
$$W_{ij} = \begin{cases} \exp\left(-\|x_i - x_j\|^2 / 2\sigma^2\right), & \text{if } x_i \text{ and } x_j \text{ are } k \text{ nearest neighbours}, \\ 0, & \text{otherwise}, \end{cases}$$
and $D$ is the diagonal matrix with $D_{ii} = \sum_j W_{ij}$. For the $k$th label, the original problem in the linear SS-MLTSVM is
$$\min_{w_k, b_k, \xi_{B_k}} \ \frac{1}{2}\left\|A_k w_k + e_{A_k} b_k\right\|^2 + c_{k1}\, e_{B_k}^{T} \xi_{B_k} + \frac{c_{k2}}{2}\left(\|w_k\|^2 + b_k^2\right) + \frac{c_{k3}}{2}\left(M w_k + e b_k\right)^{T} L \left(M w_k + e b_k\right)$$
$$\text{s.t.} \quad -\left(B_k w_k + e_{B_k} b_k\right) + \xi_{B_k} \ge e_{B_k}, \quad \xi_{B_k} \ge 0, \qquad (26)$$
where $c_{ki}$ $(i = 1, 2, 3)$ are the penalty parameters, $\xi_{B_k}$ is the slack variable, $e_{A_k}$, $e_{B_k}$ and $e$ are all-ones vectors of the proper dimension, and $L$ is the Laplace matrix.
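The graph quantities $W$, $D$ and $L = D - W$ used by the manifold term can be sketched as follows (a minimal NumPy construction assuming heat-kernel weights on a symmetrized kNN graph; function name, `n_neighbors` and `sigma` are illustrative choices, and the paper's exact weighting may differ):

```python
import numpy as np

def graph_laplacian(X, n_neighbors=5, sigma=1.0):
    """Build the graph Laplacian L = D - W for the manifold regularization term.
    Heat-kernel weights between k-nearest-neighbour pairs; W is symmetrized."""
    m = X.shape[0]
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    W = np.zeros((m, m))
    for i in range(m):
        nn = np.argsort(d2[i])[1:n_neighbors + 1]          # skip the point itself
        W[i, nn] = np.exp(-d2[i, nn] / (2 * sigma ** 2))
    W = np.maximum(W, W.T)                                 # symmetrize the kNN graph
    D = np.diag(W.sum(axis=1))                             # degree matrix
    return D - W
```

By construction $L$ is symmetric, its rows sum to zero, and it is positive semi-definite, which is what makes the quadratic manifold term well behaved.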
The Lagrange function of (26) is
$$\mathcal{L} = \frac{1}{2}\left\|A_k w_k + e_{A_k} b_k\right\|^2 + c_{k1} e_{B_k}^{T}\xi_{B_k} + \frac{c_{k2}}{2}\left(\|w_k\|^2 + b_k^2\right) + \frac{c_{k3}}{2}\left(M w_k + e b_k\right)^{T} L \left(M w_k + e b_k\right) - \alpha_{B_k}^{T}\left(-\left(B_k w_k + e_{B_k} b_k\right) + \xi_{B_k} - e_{B_k}\right) - \beta_{B_k}^{T}\xi_{B_k},$$
where $\alpha_{B_k} \ge 0$ and $\beta_{B_k} \ge 0$ are Lagrange multipliers. Using the KKT conditions (28)-(30) and eliminating the primal variables, we obtain the dual problem of (26) as follows:
$$\max_{\alpha_{B_k}} \ e_{B_k}^{T}\alpha_{B_k} - \frac{1}{2}\,\alpha_{B_k}^{T} G \left(H^{T}H + c_{k2} I + c_{k3} J^{T} L J\right)^{-1} G^{T}\alpha_{B_k}$$
$$\text{s.t.} \quad 0 \le \alpha_{B_k} \le c_{k1}, \qquad (31)$$
where $H = [A_k \ \ e_{A_k}]$, $G = [B_k \ \ e_{B_k}]$ and $J = [M \ \ e]$. For the $k$th label, the hyperplane can then be obtained by solving the dual problem:
$$\begin{bmatrix} w_k \\ b_k \end{bmatrix} = -\left(H^{T}H + c_{k2} I + c_{k3} J^{T} L J\right)^{-1} G^{T}\alpha_{B_k}.$$

Nonlinear SS-MLTSVM
In this section, using kernel-generated surfaces, we extend the linear SS-MLTSVM to the nonlinear case. For each label, the nonlinear SS-MLTSVM seeks the hyperplane
$$f_k(x) = K\left(x^{T}, M^{T}\right) u_k + b_k = 0, \quad k = 1, \ldots, K,$$
where $K(\cdot, \cdot)$ is a kernel function and $M$ is the matrix of all training samples. Similar to the linear case, the regularization term $\|f\|_H^2$ and the manifold regularization term $\|f\|_M^2$ in (19) can be, respectively, expressed as
$$\|f\|_H^2 = \frac{1}{2}\left(\|u_k\|^2 + b_k^2\right), \qquad \|f\|_M^2 = \left(K\left(M, M^{T}\right) u_k + e b_k\right)^{T} L \left(K\left(M, M^{T}\right) u_k + e b_k\right).$$
The original problem (36) of the nonlinear SS-MLTSVM replaces $A_k$, $B_k$ and $M$ in (26) with their kernelized counterparts $K(A_k, M^{T})$, $K(B_k, M^{T})$ and $K(M, M^{T})$. Writing the Lagrange function of (36) and applying the KKT conditions (38)-(40), we obtain the dual problem (41) of (36) in the same form as (31), with $H = [K(A_k, M^{T}) \ \ e_{A_k}]$, $G = [K(B_k, M^{T}) \ \ e_{B_k}]$ and $J = [K(M, M^{T}) \ \ e]$. By solving the dual problem, the hyperplane of the $k$th label can be obtained as
$$\begin{bmatrix} u_k \\ b_k \end{bmatrix} = -\left(H^{T}H + c_{k2} I + c_{k3} J^{T} L J\right)^{-1} G^{T}\alpha_{B_k}.$$
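For concreteness, one common choice for $K(\cdot, \cdot)$ is the Gaussian (RBF) kernel; a minimal sketch of computing a kernel block such as $K(M, M^{T})$ follows (the function name and `gamma` parameter are illustrative, not from the paper):

```python
import numpy as np

def rbf_kernel(X, Z, gamma=1.0):
    """Gaussian kernel matrix K[i, j] = exp(-gamma * ||x_i - z_j||^2),
    one common choice for the kernel K(., .) in the nonlinear case."""
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-gamma * d2)
```

With `X` equal to `Z`, this yields the symmetric matrix $K(M, M^{T})$ with unit diagonal used in the manifold term above.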

Decision Function
In this subsection, we present the decision function of our SS-MLTSVM. For a new sample $x$, as mentioned above, if the sample is proximal enough to a hyperplane, it is assigned to the corresponding class. In other words, if the distance $d_k(x)$ between $x$ and the $k$th hyperplane is less than or equal to a given threshold $D_k$, $k = 1, \ldots, K$, then the sample $x$ is assigned to the $k$th class. To choose a proper $D_k$, we apply the thresholding strategy used in the MLTSVM, which is simple and effective.

Fast Solvers
In this subsection, we use SOR [33] to solve the dual problems (31) and (41) efficiently. For convenience, the dual problems (31) and (41) can be uniformly rewritten as
$$\min_{\alpha_k} \ \frac{1}{2}\,\alpha_k^{T} Q \alpha_k - e^{T}\alpha_k \quad \text{s.t.} \quad 0 \le \alpha_k \le c_{k1}. \qquad (44)$$
Algorithm 1 updates only one component of $\alpha$ in each iteration, which effectively reduces the complexity of the algorithm and speeds up the learning process.

Experiments
In this section, we present the classification results of backpropagation for multi-label learning (BPMLL) [34], ML-kNN, Rank-SVM, the MLTSVM and our SS-MLTSVM on the benchmark datasets. All the algorithms are implemented in MATLAB (R2017b), and the experimental environment is an Intel Core i3 processor with 4 GB of RAM. In the experiments, we use five common multi-label datasets: flags, birds, emotions, yeast and scene (see Tab. 1). To verify the classification performance of our SS-MLTSVM, we use 50% of each dataset as labelled samples and treat the remaining samples as unlabelled.
The parameters of the algorithms have an important impact on classification performance. We use 10-fold cross-validation to select the appropriate parameters for each algorithm. For BPMLL, the number of hidden neurons is set to 20% of the input dimension, and the number of training epochs is 100. For ML-kNN, the number of nearest neighbours is set to 5. For the Rank-SVM, the penalty parameter c is

Algorithm 1: SOR for optimization problem (44)

INPUT: penalty parameter $c_{k1}$, relaxation factor $t \in (0, 2)$, and matrix $Q$.

OUTPUT: the optimal solution $\alpha_k$ of (44).

Step 1: Initialize the iteration counter $i = 0$ and start from any $\alpha^{0}$.

Step 2: Decompose $Q = M + S + S^{T}$, where $M$ is a diagonal matrix and $S$ is a strictly lower triangular matrix, and update
$$\alpha^{i+1} = \left(\alpha^{i} - t M^{-1}\left(Q\alpha^{i} - e + S\left(\alpha^{i+1} - \alpha^{i}\right)\right)\right)_{\#},$$
where $(\cdot)_{\#}$ denotes projection onto $[0, c_{k1}]$; repeat until $\|\alpha^{i+1} - \alpha^{i}\|$ is sufficiently small.
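A minimal sketch of the projected SOR sweep of Algorithm 1 (hypothetical names; the componentwise update below is the standard relaxed Gauss-Seidel step with projection onto the box, which is equivalent to the matrix form of Step 2):

```python
import numpy as np

def sor_box_qp(Q, e, c, t=1.0, n_sweeps=200):
    """SOR sketch for min 0.5 a^T Q a - e^T a  s.t. 0 <= a <= c  (problem (44)).
    Sweeps through the components one at a time, using already-updated
    components (Gauss-Seidel) with relaxation factor t in (0, 2)."""
    m = Q.shape[0]
    a = np.zeros(m)
    for _ in range(n_sweeps):
        for i in range(m):
            g = Q[i] @ a - e[i]                            # gradient component at current a
            a[i] = np.clip(a[i] - t * g / Q[i, i], 0.0, c)  # relaxed step, projected to [0, c]
    return a
```

Because each pass touches one variable at a time and reuses fresh values immediately, no matrix factorization is needed, which is the source of the speed-up claimed for Algorithm 1.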

Evaluation Criteria
In the experiments, we use five popular metrics to evaluate the multi-label classifiers, which are Hamming loss, average precision, coverage, one_error and ranking loss. Next, we introduce these five evaluation metrics in detail.
We denote the total number of samples by $l$ and the total number of labels by $K$. $Y_i$ and $\bar{Y}_i$ represent the relevant and irrelevant label sets of sample $x_i$, respectively. The function $f(x, y)$ returns the confidence of $y$ being a right label of sample $x$, and the function $\mathrm{rank}(x, y, f)$ returns the rank of $f(x, y)$ when all $f(x, y')$, $y' \in \{y_1, \ldots, y_K\}$, are sorted in descending order.

Hamming Loss
This criterion measures the proportion of labels that are classified wrongly.

$$\text{Hamming loss (HL)} = \frac{1}{l}\sum_{i=1}^{l}\frac{1}{K}\left|h(x_i)\,\Delta\,Y_i\right|,$$
where $h(x_i)$ is the predicted label set of sample $x_i$ and $\Delta$ denotes the symmetric difference of two sets.
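Expressed over binary indicator matrices, Hamming loss is simply the fraction of wrongly predicted (sample, label) entries; a minimal sketch (hypothetical names):

```python
import numpy as np

def hamming_loss(Y_pred, Y_true):
    """Hamming loss: fraction of (sample, label) entries predicted wrongly.
    Y_pred, Y_true: {0, 1} arrays of shape (l, K)."""
    return float(np.mean(Y_pred != Y_true))
```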

Coverage
This criterion measures how many steps, on average, we need to go down the ranked label list to cover all true labels of a sample.

One_Error
This criterion measures the proportion of samples whose top-ranked label is not in the true label set:
$$\text{one\_error} = \frac{1}{l}\sum_{i=1}^{l}\left[\!\left[\arg\max_{y} f(x_i, y) \notin Y_i\right]\!\right],$$
where $[\![\cdot]\!]$ equals 1 if the predicate holds and 0 otherwise.
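A minimal sketch of one_error from a score matrix (hypothetical names; `F` holds the confidences $f(x_i, y)$):

```python
import numpy as np

def one_error(F, Y_true):
    """one_error: fraction of samples whose top-scored label is not relevant.
    F: real-valued scores of shape (l, K); Y_true: {0, 1} relevance matrix."""
    top = np.argmax(F, axis=1)                              # top-ranked label per sample
    return float(np.mean(Y_true[np.arange(len(F)), top] == 0))
```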

Ranking Loss
This criterion measures the proportion of label pairs that are ordered reversely, i.e., an irrelevant label ranked ahead of a relevant one.
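A minimal sketch of ranking loss as the fraction of reversed (relevant, irrelevant) score pairs (hypothetical names; samples with an empty relevant or irrelevant set are skipped, as the quantity is undefined for them):

```python
import numpy as np

def ranking_loss(F, Y):
    """Ranking loss: per sample, the fraction of (relevant, irrelevant) label
    pairs where the irrelevant label scores at least as high; averaged over samples.
    F: scores of shape (l, K); Y: {0, 1} relevance matrix."""
    losses = []
    for f, y in zip(F, Y):
        rel, irr = f[y == 1], f[y == 0]
        if len(rel) == 0 or len(irr) == 0:
            continue                                       # undefined for this sample
        bad = (rel[:, None] <= irr[None, :]).sum()         # reversed pairs
        losses.append(bad / (len(rel) * len(irr)))
    return float(np.mean(losses))
```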

Average Precision
This criterion measures, averaged over the relevant labels $y \in Y_i$ of each sample, the proportion of relevant labels ranked above $y$.

Results
We show the average precision, coverage, Hamming loss, one_error and ranking loss of each algorithm on the benchmark datasets in Tabs. 2-6. From Tabs. 2 and 3, we can observe that, for average precision and coverage, our SS-MLTSVM is superior to the other algorithms on every dataset, while for Hamming loss, one_error and ranking loss, no algorithm dominates all the others on all datasets. Therefore, for Hamming loss, one_error and ranking loss, we use the Friedman test to evaluate the algorithms statistically. The Friedman statistic is
$$\chi_F^2 = \frac{12N}{k(k+1)}\left[\sum_{j} R_j^2 - \frac{k(k+1)^2}{4}\right], \qquad R_j = \frac{1}{N}\sum_{i} r_i^{j},$$
where $r_i^{j}$ is the rank of the $j$th algorithm on the $i$th dataset. Because $\chi_F^2$ is undesirably conservative, we apply the better statistic
$$F_F = \frac{(N-1)\,\chi_F^2}{N(k-1) - \chi_F^2},$$
where $k$ is the number of algorithms and $N$ is the number of datasets.
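The two statistics can be sketched as follows (hypothetical names; `ranks` holds the per-dataset rank of each algorithm, with datasets as rows):

```python
import numpy as np

def friedman_stats(ranks):
    """Friedman chi-square and the derived F_F statistic.
    ranks: array of shape (N datasets, k algorithms) of per-dataset ranks."""
    N, k = ranks.shape
    R = ranks.mean(axis=0)                                  # average rank per algorithm
    chi2_F = 12 * N / (k * (k + 1)) * (np.sum(R ** 2) - k * (k + 1) ** 2 / 4)
    F_F = (N - 1) * chi2_F / (N * (k - 1) - chi2_F)
    return chi2_F, F_F
```

$F_F$ is then compared against the F distribution with $(k-1)$ and $(k-1)(N-1)$ degrees of freedom.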
We list the ranks of the different algorithms in terms of Hamming loss, one_error and ranking loss in Tabs. 7-9. From Tabs. 7-9, we can see that the average rank of our SS-MLTSVM is better than those of the other algorithms; thus, the SS-MLTSVM has better classification performance. From the above analysis, we can conclude that our SS-MLTSVM is superior to the other algorithms across all metrics.
We show the learning time of the different algorithms on the benchmark datasets in Tab. 10. From Tab. 10, we can observe that our SS-MLTSVM learns more slowly than the MLTSVM. This is mainly because the SS-MLTSVM adds a manifold regularization term that requires computing the Laplace matrix over all samples. Even so, our SS-MLTSVM still holds a clear speed advantage over the Rank-SVM and BPMLL.

Sensitivity Analysis
In this subsection, we investigate the effect of the size of unlabelled samples on the classification performance. In Figs

Conclusion
In this paper, a novel SS-MLTSVM is proposed to solve semi-supervised multi-label classification problems. By introducing a manifold regularization term into the MLTSVM, we construct a more reasonable classifier and use SOR to speed up learning. Theoretical analysis and experimental results show that, compared with existing multi-label classifiers, the SS-MLTSVM can take full advantage of the geometric information embedded in partially labelled and unlabelled samples and effectively solve semi-supervised multi-label classification problems. It should be pointed out that our SS-MLTSVM does not consider the correlation among labels; however, label correlation is very valuable for improving generalization performance. Therefore, more effective methods of exploiting label correlation should be addressed in the future.

Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.