Safe transductive support vector machine

Since semi-supervised learning can use fewer labelled samples to train a better model, semi-supervised methods are becoming popular in data mining. As an important algorithm of semi-supervised support vector machines (S³VM), the transductive support vector machine (TSVM) may sometimes produce worse models when trained on both labelled and unlabelled samples than when trained on labelled samples alone. To solve this problem, in this paper we propose a safe TSVM (STSVM) based on the infinitesimal annealing algorithm. In the training of TSVM, we adopt infinitesimal annealing and path-following technology to make the annealing step size approximately infinitesimal, balancing the trade-off between annealing step size and computation time. During the annealing process, CP-Step is called to update the TSVM model with pseudo-labelled samples. If the current sample lies on the boundary of the combinatorial optimisation problem, SJ-Step is called, and a safety condition determines whether the sample's label should be changed, so as to ensure that the TSVM model obtained after the change is better than the model obtained before it. The experimental results show that our STSVM algorithm can improve the accuracy of TSVM with a shorter running time, and is safer than the existing safe algorithms.


Introduction
In the real world, there are a large number of unlabelled samples for which labels are difficult to obtain, so they cannot be used directly for machine learning. For example, in the field of sports bioanalysis (Honglian et al., 2020; Huifeng et al., 2020), large numbers of samples are collected from the real world without any labels. Before these samples can be used for learning, sports biologists have to label them manually, which is a costly task. The same problem exists in other fields, such as cross-domain sentiment classification (Cao et al., 2021), human cognition (Thibodeau et al., 2020) and social analysis (Dai & Wang, 2021; Qiu et al., 2021). Therefore, semi-supervised learning (Chapelle et al., 2009; Zhu & Goldberg, 2009), which can use a large number of unlabelled samples together with a small number of labelled samples to train a better model, has become a key approach in pattern recognition and machine learning.

The supervised support vector machine (SVM) tries to find a separating hyperplane that maximises the margin between two classes, whereas S³VM additionally requires the hyperplane to pass through a low-density area of the data. The transductive support vector machine (TSVM) (Collobert et al., 2006) is the most famous representative of S³VM (Chapelle et al., 2008). Common semi-supervised methods for SVM are shown in Table 1.

Table 1. Common semi-supervised SVM methods.

Algorithm     Reference                                     Formulation     Safety    Time complexity
TSVM          Collobert et al. (2006); Wang et al. (2007)   Non-convex      Unsafe    O((l + 2u)^3)
S³VM-light    Chi and Bruzzone (2007); Joachims (1999)      Combinatorial   Unsafe    O((l + u)^3)
DA-S³VM       Sindhwani et al. (2006)                       Combinatorial   Unsafe    O((l + u)^3)
S³VM-path     Ogawa et al. (2013)                           Combinatorial   Unsafe    O(|M|(l + u)^2 + |M|^2(l + u))
S⁴VM          Li and Zhou (2019)                            Non-convex      Safe      O(|M|(l + 2u)^2 + |M|^2(l + 2u))
STSVM         Our algorithm                                 Combinatorial   Safe      O(|M|(l + u)^2 + |M|^2(l + u))
The core idea of TSVM is to find suitable labels y_i ∈ {+1, −1}, i ∈ [1, u], for the unlabelled samples U = {x_1, x_2, ..., x_u}, and to maximise the margin of the resulting hyperplane. TSVM uses a local search strategy to solve this problem iteratively: it first trains an SVM learner with the labelled samples L = {(x_1, y_1), (x_2, y_2), ..., (x_l, y_l)}, and then uses the learner to label the unlabelled samples, so that all samples carry labels. While retraining the SVM model on all these labelled samples, TSVM searches for error-prone samples and continuously adjusts their labels to ensure that they receive the correct ones. TSVM therefore mainly solves the following optimisation problem,

min_{w,b,y_U} (1/2)‖w‖² + C Σ_{i∈L} [1 − y_i f(x_i)]_+ + C* Σ_{i∈U} [1 − y_i f(x_i)]_+ ,   (1)

where [1 − z]_+ denotes the hinge loss function, and C and C* are the penalty coefficients of the labelled and unlabelled samples respectively. Since labelled data is more reliable than unlabelled data, we set C* ≤ C; in particular, when C* = 0, Equation (1) reduces to the objective function of a standard SVM. By solving this optimisation problem, we obtain the decision function of the SVM, f(x) = Σ_i α_i y_i K(x_i, x) + b. Since TSVM iteratively searches for a local optimal solution, there is a trade-off between accuracy and computation cost: training a better model requires more annealing steps, which increases the computation cost, while reducing the number of annealing steps to save training time sacrifices some precision. Therefore, Ogawa et al. (2013) proposed an infinitesimal annealing method to train S³VM (S³VM-path), which uses the path-following method (Hastie et al., 2004) to keep the step size of each annealing step approximately infinitesimal and thus resolve this trade-off.
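To make Equation (1) concrete, the following is a minimal sketch that evaluates the objective for a linear model; the function name and the NumPy representation are our own illustration, not code from the paper:

```python
import numpy as np

def tsvm_objective(w, b, X_l, y_l, X_u, y_u, C, C_star):
    """Evaluate Equation (1) for a linear model f(x) = w @ x + b.

    [1 - z]_+ is the hinge loss; C weights the labelled samples and
    C* (C_star) the pseudo-labelled ones, with C_star <= C.
    """
    hinge = lambda z: np.maximum(0.0, 1.0 - z)
    f = lambda X: X @ w + b
    return (0.5 * (w @ w)
            + C * hinge(y_l * f(X_l)).sum()
            + C_star * hinge(y_u * f(X_u)).sum())
```

Setting C_star = 0 in this sketch recovers the supervised SVM objective, matching the remark above.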
S³VM-path has good performance in many cases. However, since the labels of unlabelled samples are assigned arbitrarily, the S³VM-path model trained with labelled and unlabelled samples is sometimes inferior to the SVM model trained with only the labelled samples. This is an unsafe situation: safe means that the performance of a semi-supervised learning method is not significantly worse than that of an inductive method using only the small amount of labelled data. Regarding the safety of semi-supervised support vector machines, Li and Zhou (2019) proposed a safe semi-supervised support vector machine (S⁴VM). Specifically, for a given semi-supervised dataset, S⁴VM first finds candidate low-density partitions and then optimises the labels of the unlabelled data to maximise performance. However, S⁴VM spends a lot of time solving for low-density partitions, so it is not a very efficient method.
In order to handle the unsafe situation in TSVM training more efficiently, in this paper we propose a safe TSVM (STSVM) algorithm based on an infinitesimal annealing framework. For the training on unlabelled samples, we design two special steps. In the training process, we first use the labelled samples to train an SVM model that labels the unlabelled samples. As we continue to increase the influence of the unlabelled samples, we use path-following technology to continuously adjust the TSVM model until we find a pseudo-labelled sample whose label needs to change. At that point, we decide how to change the pseudo-label of this sample according to the safety judgment conditions, so as to ensure that the TSVM model trained after changing the label is better than both the supervised SVM and the model before the change. We also analyse the convergence and complexity of the STSVM algorithm theoretically. Experimental results show that our training method for the TSVM model is safer and takes less time than the existing safe training methods.
Contributions. The main contributions of this paper are summarised as follows.
(1) We propose a safe TSVM (STSVM) algorithm to train a safer and more efficient TSVM model. As far as we know, our STSVM algorithm is the first to introduce safety judgments for the TSVM model.
(2) We prove the finite convergence and analyse the time complexity of STSVM. The experimental results show that our algorithm trains TSVM more efficiently.
Organisation. The rest of the paper is organised as follows. In Section 2, we review the infinitesimal annealing algorithm and the S⁴VM algorithm. We propose our STSVM algorithm in Section 3. In Section 4, we present the finite convergence and time complexity analyses of STSVM. Section 5 shows the experimental results. Finally, we conclude the paper in Section 6.

Review of the S³VM-path and S⁴VM algorithms
In this section, we first briefly review the infinitesimal annealing algorithm, S³VM-path, and then introduce the S⁴VM algorithm.

S³VM-path algorithm
As a combinatorial optimisation problem, the basic idea of TSVM is to use an annealing method to continuously increase the influence of the unlabelled data on the model. However, in the annealing process there is a trade-off between running time and calculation accuracy: increasing the computational accuracy means increasing the running time. To balance this trade-off, the S³VM-path algorithm was proposed. The objective function Equation (1) is rewritten as Equation (2),

min_{y ∈ {+1,−1}^u} min_f J_{C*}(f, y), with J_{C*}(f, y) = (1/2)‖w‖² + C Σ_{i∈L} [1 − y_i f(x_i)]_+ + C* Σ_{i∈U} [1 − y_i f(x_i)]_+ ,   (2)

so that TSVM becomes a combinatorial search over the pseudo-labels y. The purpose of S³VM-path is to solve an optimal solution path

{f*_y(C*) : 0 ≤ C* ≤ C}, where f*_y(C*) = argmin_{f ∈ pol(y)} J_{C*}(f, y),   (3)

where pol(y) is a convex polyhedron containing, under the current pseudo-labels, all solutions satisfying f*(x_i) · y_i ≥ 0, i ∈ U; a point on the boundary indicates that there is a sample x_i satisfying f(x_i) = 0. The specific expression is shown as Equation (4),

pol(y) = {f | y_i f(x_i) ≥ 0, ∀ i ∈ U}.   (4)
Starting from C* = 0, S³VM-path alternates CP-Step and DJ-Step until C* = C. In CP-Step, the solution path of the local optimal solution f*_y under condition Equation (4) is traced. In DJ-Step, once the path reaches the boundary of the convex polyhedron, the polyhedron is adjusted to find a new local optimal solution.
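The membership test behind DJ-Step is easy to state in code; the following is a minimal sketch of the pol(y) check of Equation (4) (function and variable names are our own):

```python
import numpy as np

def pol_membership(f_u, y_u, tol=1e-8):
    """Check whether a solution lies in pol(y) of Equation (4), i.e.
    y_i * f(x_i) >= 0 for every unlabelled sample, and report the
    boundary samples with f(x_i) = 0 that trigger DJ-Step.

    f_u: decision values f(x_i) on the unlabelled samples;
    y_u: their current pseudo-labels. (Names are illustrative.)
    """
    inside = bool(np.all(y_u * f_u >= -tol))
    boundary = np.where(np.abs(f_u) <= tol)[0]
    return inside, boundary
```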

S⁴VM
The S³VM algorithm uses a small number of labelled samples and a large number of unlabelled samples for training. In semi-supervised learning, the labels of the unlabelled samples are uncertain, so the performance of S³VM may be inferior to that of SVM; this causes the unsafety of semi-supervised learning. During the training of S³VM, there may be more than one low-density area with a large margin, and if only one is considered, a great loss may occur. To solve this problem, S⁴VM, as an ensemble learning method, focuses on multiple possible low-density regions and trains multiple hyperplanes. The objective function of S⁴VM (Equation (5)) therefore combines the SVM objectives of T candidate hyperplanes with a penalty term describing the low-density partitions and a diversity term ensuring a certain difference between the different hyperplanes, where T is the number of candidate low-density partitions and M is the penalty coefficient.
The discriminant formula for the best dividing hyperplane can be represented as

ȳ = argmax_{y ∈ {ŷ_1, ..., ŷ_T}} min_{y*} {earn(y, y*, y_svm) − λ loss(y, y*, y_svm)},   (6)

where λ is the tolerance parameter and the unknown true assignment y* is assumed to lie among the candidate partitions. The earn function counts the unlabelled samples on which the labels y assigned by S³VM are better than the labels y_svm assigned by SVM, and the loss function counts those on which they are worse. The specific expressions of the two functions are

earn(y, y*, y_svm) = Σ_{i=1}^{u} I(y_i = y*_i) I(y_svm,i ≠ y*_i),   (7)
loss(y, y*, y_svm) = Σ_{i=1}^{u} I(y_i ≠ y*_i) I(y_svm,i = y*_i),   (8)

where I(a) = 1 when a is true and 0 otherwise. The idea of S⁴VM is to first use Equation (5) to obtain T partitioned hyperplanes and then find the optimal one according to Equation (6) (a code sketch of this selection follows the list below). Since Equation (5) is non-convex, there may be multiple local optimal solutions. There are two ways to approach the global optimal solution.
(1) Global simulated annealing algorithm: In each iteration of simulated annealing, a solution in the neighbourhood of the current solution is randomly selected to replace it with a certain probability. The magnitude of the probability is related to the degree of decline of the solution and the temperature parameter P: when P is large, the solution is replaced almost arbitrarily, while when P is small, the solution tends to be stable. To further accelerate the convergence of simulated annealing, S⁴VM uses a deterministic local search: fix y and obtain w, b through a standard SVM, then fix w, b and determine the values of y by local binary assignment. This process iterates until the algorithm converges, finally yielding T partitioned hyperplanes.
(2) Sampling method: This method randomly generates N hyperplanes (N > T), each solved by S³VM. Each hyperplane assigns predicted labels to the unlabelled samples, and the resulting assignments are clustered into T clusters. In each cluster, the hyperplane that minimises the objective function is selected as the best hyperplane. Thus T best hyperplanes are selected, from which the global optimal hyperplane of S³VM is found according to Equation (6).
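As promised above, here is a minimal sketch of Equations (6)-(8), assuming binary labels stored as ±1 NumPy arrays; the function names are ours, and letting each candidate partition play the role of the unknown y* is the S⁴VM worst-case assumption stated above:

```python
import numpy as np

def earn(y, y_star, y_svm):
    # Equation (7): samples where y is right and the SVM label is wrong.
    return np.sum((y == y_star) & (y_svm != y_star))

def loss(y, y_star, y_svm):
    # Equation (8): samples where y is wrong and the SVM label is right.
    return np.sum((y != y_star) & (y_svm == y_star))

def select_safest(candidates, y_svm, lam=1.0):
    """Equation (6): pick the candidate assignment with the best
    worst-case score, assuming the true assignment lies among the
    candidate partitions."""
    def worst_case(y):
        return min(earn(y, ys, y_svm) - lam * loss(y, ys, y_svm)
                   for ys in candidates)
    return max(candidates, key=worst_case)
```

With lam = 1.0 the selection never scores worse than the supervised SVM under any candidate truth, which is the sense in which S⁴VM is safe.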

STSVM
In this section, we first introduce the principle of STSVM and then describe the algorithm in detail, covering the overall STSVM procedure and its two main steps, CP-Step and SJ-Step.

Principle of STSVM
As discussed above, S⁴VM assumes that there are multiple low-density partitions in semi-supervised learning data, each with a large margin. Therefore, S⁴VM first finds candidate low-density partitions and then optimises the assignment of the labels of the unlabelled data to maximise performance.
Therefore, in the training process of STSVM, in order to assign the labels of unlabelled samples more accurately, we set the objective function of STSVM as shown in Equation (9):

J(ȳ, y*, y, y_svm) = max_{ȳ} {earn(ȳ, y*, y, y_svm) − λ loss(ȳ, y*, y, y_svm)}.   (9)
For the selected unlabelled samples, let y* be the true label vector, y_svm the label vector predicted by SVM, y the current label vector, and ȳ a candidate label vector. We can compute earn and loss from these quantities as follows.
earn is the number of unlabelled samples whose candidate labels ȳ are better than both y_svm and y, and loss is the number of unlabelled samples whose candidate labels ȳ are inferior to both y_svm and y. However, the real labels of the unlabelled samples are unknown, so we take the current STSVM prediction as a stand-in for the real labels. According to the S³VM-path algorithm, when a breakpoint is found we need to consider changing the labels of the corresponding samples. Here, we define the set S as follows.

Definition 3.1: The misclassified-samples set: S is defined as the set of all samples that are misclassified, that is, S = {i ∈ U | y_i f(x_i) = 0}. When the set S is not empty, there is a breakpoint.

In the process of finding breakpoints, our objective function is Equation (1). To facilitate the calculation, we convert Equation (1) into its dual form, as shown in Equation (12):

min_α (1/2) Σ_{i,j} α_i α_j y_i y_j K(x_i, x_j) − Σ_i α_i, subject to Σ_i y_i α_i = 0, 0 ≤ α_i ≤ C_i,   (12)

where the per-sample penalty is C_i = C for all i ∈ L and C_i = C* for all i ∈ U. According to the Lagrange multiplier method, Equation (12) can be transformed with a Lagrangian multiplier b for the equality constraint, and the following KKT conditions can be derived for g_i = y_i f(x_i) − 1:

g_i > 0 ⟹ α_i = 0,  g_i = 0 ⟹ 0 ≤ α_i ≤ C_i,  g_i < 0 ⟹ α_i = C_i.   (15)

According to Equation (15), we can divide the dataset into three sets: M, the set of support vectors on the margin; O, the set of samples correctly classified outside the margin; and E, the set of samples violating the margin. Figure 1 shows the original dataset; Figure 2 shows the partition of the dataset after using SVM to label the unlabelled samples.
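The KKT-based partition can be written compactly. The following sketch assumes the conventional margin function g_i = y_i f(x_i) − 1 used above; the function name and tolerance are ours:

```python
import numpy as np

def partition_sets(g, tol=1e-8):
    """Partition sample indices by the KKT conditions of Equation (15):
      M: g_i = 0, on the margin (support vectors, 0 <= alpha_i <= C_i);
      O: g_i > 0, correctly classified outside the margin (alpha_i = 0);
      E: g_i < 0, margin violators (alpha_i = C_i).
    """
    M = np.where(np.abs(g) <= tol)[0]
    O = np.where(g > tol)[0]
    E = np.where(g < -tol)[0]
    return M, O, E
```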

STSVM algorithm
We summarise the STSVM algorithm as Algorithm 1. When a breakpoint is found, S³VM-path directly flips the labels of all unlabelled samples on the boundary without considering safety issues. In response, STSVM adds safety guarantees. First, for all samples, including the positive and negative labelled samples and the unlabelled samples (as shown in Figure 1), STSVM uses the labelled samples for supervised training and obtains an SVM model (as shown in Figure 2). This model is then used to pseudo-label the unlabelled samples. In CP-Step, the penalty coefficient C* of the unlabelled samples is continuously increased from 0 to C. If a breakpoint is found, STSVM must determine whether the pseudo-labels of the corresponding unlabelled samples should be changed (as shown in Figure 3), so it exits CP-Step and enters SJ-Step. In SJ-Step, we first judge whether the labels of the unlabelled samples need to change, based on the safety principle applied to earn and loss, and then flip the labels that need to change (as shown in Figure 4).
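To show how these pieces fit together, here is a runnable but deliberately coarse sketch of Algorithm 1, with two simplifications that are ours and not part of the paper's method: CP-Step is coarsened to retraining on a grid of C* values instead of following the continuous solution path, and the earn/loss safety test of SJ-Step is replaced by a cruder stand-in that accepts a batch of label flips only if accuracy on the labelled samples does not drop:

```python
import numpy as np
from sklearn.svm import SVC

def stsvm_sketch(X_l, y_l, X_u, C=10.0, n_steps=8):
    """Structural sketch of Algorithm 1 (simplified, see lead-in)."""
    model = SVC(kernel='linear', C=C).fit(X_l, y_l)
    y_u = model.predict(X_u)                      # initial pseudo-labels
    X_all, n_l = np.vstack([X_l, X_u]), len(y_l)
    for c_star in np.linspace(C / n_steps, C, n_steps):
        # Per-sample penalties: C on labelled, current C* on unlabelled.
        weights = np.r_[np.full(n_l, C), np.full(len(y_u), c_star)]
        cand = SVC(kernel='linear', C=1.0).fit(
            X_all, np.r_[y_l, y_u], sample_weight=weights)
        y_cand = cand.predict(X_u)
        if np.any(y_cand != y_u):                 # a breakpoint occurred
            # Stand-in for SJ-Step's safety condition of Equation (9).
            if (cand.predict(X_l) == y_l).mean() >= \
               (model.predict(X_l) == y_l).mean():
                y_u, model = y_cand, cand         # flips judged safe
        else:
            model = cand
    return model, y_u
```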

CP-Step
The process of CP-Step is shown in Algorithm 2. In CP-Step, as C* is continuously increased over 0 ≤ C* ≤ C, the decision plane is constantly adjusted, and samples may fall on the decision plane again; we call these samples breakpoints. If there is a breakpoint, S = {i ∈ U | y_i f(x_i) = 0}, which means the corresponding sample is likely to be classified incorrectly during the increase of C*, and its pseudo-label may need to be flipped. However, for safety reasons, we need to determine whether this sample is really mislabelled and needs to be flipped.
First, we define the set N as in Definition 3.2. In CP-Step, to update α and b, we need to solve two problems: (1) computing the update direction Δα of each α with respect to η; and (2) finding the maximum value of η, η_max, while ensuring that at most one sample moves between the sets M and E/O.

Computation of the direction of α
Let ΔC denote the change of C over the set L ∪ U, and write the adjustment as ηΔC, where η ∈ [0, 1] is the parameter controlling the size of the adjustment. Then we have Δα_N = y_N ΔC. At the same time, we let β_b and β_M denote the directions of b and α_M, i.e. β_b = Δb/η and β_M = Δα_M/η, where β_b and β_M are obtained by solving the following linear system derived from (14) and (16).
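Since Definition 3.2's statement is not reproduced above, the following sketch assumes N is the set of bounded samples whose penalty C_i is being changed, and uses the standard incremental-SVM form of the linear system (Hastie et al., 2004) with unsigned duals, so that Δα_N = ΔC_N; the paper's signed convention Δα_N = y_N ΔC differs only in where the sign is absorbed:

```python
import numpy as np

def step_directions(Q, y, M, N, dC_N):
    """Solve for beta_b and beta_M: keep g_i = 0 on M and the equality
    constraint sum_i y_i * alpha_i = 0 while alpha_N moves by dC_N.

    Q[i, j] = y_i * y_j * K(x_i, x_j); M and N are index arrays.
    """
    A = np.block([[np.zeros((1, 1)), y[M][None, :]],
                  [y[M][:, None],    Q[np.ix_(M, M)]]])
    rhs = -np.r_[y[N] @ dC_N, Q[np.ix_(M, N)] @ dC_N]
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]   # beta_b, beta_M
```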

Computation of the maximal step size η_max
After calculating the direction of α, we can obtain the variation of g_i. During each iteration of the update, we must ensure that, except for the sample being adjusted, all other samples still satisfy the conditions of their current sets. Thus the maximum adjustment η_max can be obtained by solving the following system of linear inequalities, conditions (21)-(23).
We compute η_M from condition (21), the largest step before some sample moves from set M to E or O; η_O from condition (22), the largest step before some sample moves from set O to M; and η_E from condition (23), the largest step before some sample moves from set E to M. Finally, we obtain η_max = min{η_M, η_O, η_E}. At the end of the update, we set C = C + ηΔC, α = α + ηΔα and b = b + ηΔb according to the above parameters and obtain a new partition.
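Under the same conventions, the step-size computation reduces to a minimum over ratio tests; the following is a sketch, assuming gamma_i denotes the direction of g_i (dg_i = η · gamma_i) and that set names follow Section 3.1:

```python
import numpy as np

def eta_max(alpha_M, beta_M, C_M, g, gamma, O, E):
    """Largest step eta before some sample changes sets, per conditions
    (21)-(23): alpha_i on M must stay in [0, C_i], g_i must stay >= 0
    on O and <= 0 on E."""
    def hits(num, den):
        # Positive step sizes at which a moving quantity reaches a bound.
        with np.errstate(divide='ignore', invalid='ignore'):
            r = num / den
        return r[(den != 0) & (r > 0)]

    candidates = np.concatenate([
        hits(-alpha_M, beta_M),        # (21): alpha_i in M reaching 0
        hits(C_M - alpha_M, beta_M),   # (21): alpha_i in M reaching C_i
        hits(-g[O], gamma[O]),         # (22): g_i in O reaching 0
        hits(-g[E], gamma[E]),         # (23): g_i in E reaching 0
    ])
    return candidates.min() if candidates.size else np.inf
```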

Algorithm 3 SJ-Step
Input: the set S of samples that need to be flipped, the penalty coefficient C* for unlabelled samples, and the safety parameter λ
Output: α, b and y

SJ-Step
The process of SJ-Step is shown in Algorithm 3. In SJ-Step, we first calculate the values of earn and loss and determine, according to the objective function and these values, whether the samples need to be flipped. We then flip the samples that need flipping. After the adjustment, we recalculate the decision function based on the current labels, and repeat until no samples need to be flipped.
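A minimal sketch of the safety judgment at the heart of SJ-Step follows; y_star is a proxy for the unknown true labels (the paper uses the current STSVM prediction, Section 3.1), and all names are illustrative:

```python
import numpy as np

def sj_step(y, y_svm, y_star, S, lam):
    """Flip the labels of the breakpoint set S only if the safety
    condition earn - lam * loss > 0 of Equation (9) holds."""
    flipped = y.copy()
    flipped[S] = -flipped[S]
    # earn: flipped labels right where the SVM (and old) labels were wrong.
    e = np.sum((flipped[S] == y_star[S]) & (y_svm[S] != y_star[S]))
    # loss: flipped labels wrong where the SVM (and old) labels were right.
    l = np.sum((flipped[S] != y_star[S]) & (y_svm[S] == y_star[S]))
    if e - lam * l > 0:
        return flipped, True    # the flip is judged safe
    return y, False             # keep the current labels
```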

Convergence and time complexity analysis
In this section, we give the convergence and time complexity analysis of STSVM.

Convergence analysis
In order to prove that our algorithm can converge to the optimal solution within a limited number of iterations, we only need to prove that both CP-Step and SJ-Step can effectively converge.

Finite convergence of CP-Step
In order to prove that CP-Step can effectively converge, we first prove Theorem 4.1.

Theorem 4.1: During CP-Step, any sample from L ∪ U cannot migrate back and forth between the sets M, E and O in successive adjustment steps.

Proof: Similar to the proof of Theorem 2 in Gu et al. (2018), for any sample (x_t, y_t) with t ≥ 0, it is easy to obtain the following conclusions: (1) if a sample (x_t, y_t) is added to set M, then (x_t, y_t) will not be removed from M in the immediately following adjustment; and (2) if a sample (x_t, y_t) is removed from set M, then (x_t, y_t) will not be added to M in the immediately following adjustment. Through Theorem 4.1, we can get Corollary 4.1.

Corollary 4.1: For each adjustment of CP-Step, the maximum adjustment η_max is greater than zero.
Then, according to Theorem 4.1 and Corollary 4.1, we can get Theorem 4.2.

Theorem 4.2: CP-Step converges to a local minimum with respect to C* in a finite number of iterations.

Finite convergence of SJ-Step
In order to prove that SJ-Step can converge to the optimal solution, we need to prove Theorem 4.3.

Theorem 4.3: After flipping the labels of the samples in set S, the optimal solution is strictly better than the previous optimal solution.

Proof: First, according to the Lagrangian multiplier method and the KKT conditions, for all samples in set S we obtain inequality (24). Suppose that the solution after flipping is only as good as the current one; then inequality (25) holds, with the quantities defined in (26). Thus, according to (24) and (26), we obtain (27). Since (25) and (27) are contradictory, the hypothesis does not hold. That means the optimal solution obtained after the flipping is strictly better than the solution obtained before.
According to Theorem 4.3, we can get Theorem 4.4 as follows.

Theorem 4.4: During the process of SJ-Step, any sample from L ∪ U cannot migrate back and forth into S in successive adjustment steps.
Proof: From Theorem 4.3 we know that after flipping, the updated solution differs from the previous one. Let the solution before the update be f(x) and the solution after the update be f′(x); then there is a sample x_i such that f′(x_i) ≠ f(x_i), and thus Theorem 4.4 is proved.

Time complexity analysis
Our STSVM algorithm uses infinitesimal annealing technology, and its time complexity is determined by CP-Step and SJ-Step. Experiments show that the number of iterations of the outer loop grows linearly with the number of samples; therefore the outer loop, which continuously increases the penalty coefficient of the unlabelled samples to find the pseudo-labels that need to be changed, has time complexity O(l + u). Updating α in each iteration costs O(|M|² + |M|(l + u)), where |M| is the number of support vectors. The total time complexity of CP-Step is therefore O((l + u)|M|² + (l + u)²|M|). The time complexity of SJ-Step depends on the size of the set S and is negligible compared with that of CP-Step. Hence the time complexity of STSVM is O((l + u)|M|² + (l + u)²|M|).

Experiments
In this section, we first present the experimental setup and then provide the experimental results and discussions.

Experimental setup
Design of Experiments: To verify the effectiveness of our algorithm, we first examined its convergence by recording the number of breakpoints encountered during the training of STSVM. We then conducted parameter-selection experiments, reporting the accuracy of STSVM for different parameter values. To show the advantages of our algorithm, we compared the safety and running time of different algorithms: we compared the accuracy of each algorithm against that of SVM on different datasets to assess safety, and we also compared the performance of STSVM for different values of λ. The algorithms involved in the comparison were: (1) TSVM: a transductive semi-supervised SVM learning algorithm.
(2) S³VM-path: a TSVM algorithm based on the infinitesimal annealing algorithm.
(3) S⁴VM: a semi-supervised learning algorithm that finds the safest S³VM.

Implementation:
We implemented all the algorithms in MATLAB. For the kernel function, we used the linear kernel K(x_1, x_2) = x_1 · x_2 and the Gaussian kernel K(x_1, x_2) = exp(−k‖x_1 − x_2‖²) in all experiments. For our algorithm STSVM, in order to explore the impact of different values of λ on safety, we took λ = 0.1, λ = 0.5 and λ = 1. Following the suggestion in Ogawa et al. (2013), the value of C was chosen from {1, 10, 100, 1000}; from the parameter-selection experiment, we found that the S³VM models achieve the best performance when C = 10, so we set C = 10 in the comparison experiments. The value of C* was chosen from {C, C/2, C/4, C/8, C/16} for TSVM and S⁴VM; in S³VM-path and STSVM, C* was varied over the entire range [0, C].
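For reference, the two kernels used in all experiments can be written directly (the NumPy function names are ours, and the value of the width parameter k is left as an argument since it is not stated above):

```python
import numpy as np

def linear_kernel(x1, x2):
    # K(x1, x2) = x1 . x2
    return x1 @ x2

def gaussian_kernel(x1, x2, k=1.0):
    # K(x1, x2) = exp(-k * ||x1 - x2||^2); k is the width parameter.
    return np.exp(-k * np.sum((x1 - x2) ** 2))
```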
Experimental Data: Table 2 summarises the datasets used in the experiments. The USPS dataset contains labels 0-9; we set label 1 as the positive class and the rest as the negative class. The Mnist dataset contains labels 0-9; we set label 4 as the positive class and label 9 as the negative class. In each dataset, we randomly selected half of the data as the training set. We set the number of labelled samples to |L| = min{30, 10% of the training set size}, and the remaining samples were used as the unlabelled sample set. We used the training data to train the model and the test data to compare prediction accuracy. We used 5-fold cross-validation to optimise the hyperparameters; since the training data accounts for 50% of the original data, each fold occupies 10% of the original data. We ran the algorithm with C ∈ {1, 10, 100, 1000} and selected the appropriate parameter by five-fold cross-validation.

Table 3 shows the mean and standard deviation of the number of SJ-Step flips of STSVM on the different datasets. The results show that in the training of STSVM the number of flips, i.e. the number of breakpoints, is finite, which means that the STSVM algorithm terminates after a limited number of flips; this confirms that STSVM converges effectively and verifies the effectiveness of our algorithm.

Figure 5 shows the accuracy of STSVM with the Gaussian kernel on training datasets of different sizes, for different C and different λ. From Figure 5, we conclude that as C increases the accuracy of STSVM decreases, and that on different datasets the rate of decline varies with the value of λ. In most cases, the STSVM algorithm performs best when C = 10.

Figure 6 shows the computation cost of training TSVM with S³VM-path, S⁴VM, TSVM and STSVM. The results show that our proposed STSVM algorithm is much faster than the S⁴VM algorithm, and that STSVM is better than TSVM when many annealing steps are required, because the annealing method we use has a lower time complexity than the non-convex method. Since STSVM must make safety judgments, it sometimes spends more time than S³VM-path, but it obtains a safer model with better generalisation performance.

Figure 7 shows the classification accuracy of the 6 algorithms on 5 datasets using the Gaussian kernel. The accuracy of STSVM is higher than that of SVM, S³VM-path, S⁴VM, Star-SVM and TSVM, and the performance of our STSVM algorithm is very stable. This is because the STSVM algorithm adds safety judgments to the solution process, which ensures a higher accuracy rate than the algorithms without safety-condition judgments.
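The C-selection protocol above corresponds to a standard grid search; a minimal sketch follows, using synthetic stand-in data (the real experiments draw labelled samples from the benchmark datasets of Table 2):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Hypothetical stand-in data; the paper uses the labelled part of each
# benchmark dataset instead.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(60, 5))
y_train = np.sign(X_train[:, 0] + 0.1 * rng.normal(size=60)).astype(int)

# 5-fold cross-validation over C in {1, 10, 100, 1000}, as in the setup.
search = GridSearchCV(SVC(kernel='rbf'), {'C': [1, 10, 100, 1000]}, cv=5)
search.fit(X_train, y_train)
print(search.best_params_)   # the paper reports C = 10 as the best choice
```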

Conclusion
In order to solve the unsafe problem of TSVM in an efficient way, we proposed a new safe TSVM algorithm in this paper. Our STSVM algorithm uses path-following technology to add safety judgments that adjust pseudo-labels when training TSVM on unlabelled samples, so the STSVM algorithm remains safe while learning quickly. Moreover, we provided a finite convergence analysis and a time complexity analysis of STSVM. As far as we know, our algorithm is the first that guarantees the safety of TSVM. The experimental results show not only that our algorithm converges effectively, but also that it achieves higher accuracy and spends less time than other semi-supervised algorithms.

Disclosure statement
No potential conflict of interest was reported by the author(s).

Funding
This work was supported by the National Natural Science Foundation of China [grant number 61501229].