Dual-Training-Based Semi-Supervised Learning with Few Labels

The continual growth in the number of images poses a great challenge for data annotation. Therefore, improving the performance of image classification models with limited labeled data has become an important problem. To address this problem, we propose a simple and effective dual-training-based semi-supervised learning method for image classification. To enable the model to acquire more valuable information, we propose a dual training approach: the model is trained on differently augmented data at the same time, with soft labels and hard labels, respectively. In addition, we propose a simple and effective weight generation method that assigns a weight to each sample during training to guide the model training. To further improve the model performance, we employ a projection layer at the end of the network to guide the self-learning of the model by minimizing the distance between features extracted from different layers. Finally, we evaluate the proposed approach on three benchmark image classification datasets. The experimental results demonstrate the effectiveness of our proposed approach.


Introduction
Deep learning has advanced swiftly in recent years. Deep learning methods have brought great breakthroughs in various computer vision tasks, such as classification [1,2], detection [3,4] and segmentation [5,6]. However, this success is mainly attributed to huge labeled datasets, such as ImageNet [7], and annotating such large datasets is time-consuming and labor-intensive. Hence, how to enhance model performance with only a few labeled data and many unlabeled data has become a hot research topic. Labeled data are usually annotated manually, while unlabeled data are data without such annotations. For settings with few labeled data, researchers have proposed few-shot learning to improve model performance [8,9]. However, this kind of method ignores the benefits of unlabeled data. Thus, semi-supervised learning methods have been proposed to enhance model performance with few labeled data and many unlabeled data [10][11][12]. Recently, semi-supervised learning methods have developed rapidly and achieved surprising success in image classification with only a few labeled samples [13][14][15][16].
The use of large amounts of unlabeled data is the key to the success of semi-supervised learning. In early research on semi-supervised learning, Lee et al. [17] proposed a simple method to make effective use of unlabeled data. They first trained a model with the existing labeled data and then generated pseudo-labels for the unlabeled data from the model predictions. Afterward, the model was trained on the unlabeled data with their pseudo-labels. However, due to the limited number of labeled data, the effectiveness of the trained model cannot be guaranteed, and thus the accuracy of the generated pseudo-labels is poor, which significantly degrades the model performance. Subsequently, researchers attempted to utilize consistency regularization to train the model [18,19]. This kind of method assumes that the predictions of the model should be the same when it is given the same data under small perturbations. However, consistency regularization only helps the model learn from itself; it does not provide task-relevant information, which can only be learned from the small amount of available labeled data. As a result, the performance of models trained by such methods decreases significantly when the amount of labeled data decreases.
To address the above issues, FixMatch [16] combines the advantages of pseudo-labeling methods and consistency-based methods in a simple and effective semi-supervised learning method. Specifically, it converts the model predictions for samples augmented by weak data augmentation (e.g., random cropping and flipping) into one-hot labels. Then, the model is trained on the same data augmented by strong data augmentation (e.g., RandAugment [20], CTAugment [21]) with these one-hot labels. Furthermore, the method only selects data whose maximum predicted probability is high to train the model. The method achieved great success in semi-supervised image classification and thus aroused the enthusiasm of researchers for further study. Following the architecture of FixMatch, several methods were then proposed to improve the performance of the model [22,23]. CoMatch [22] proposes graph-based contrastive learning to impose constraints on features. SimMatch [23] further improves upon FixMatch by combining semantic similarity and instance similarity. However, the performance of these methods degrades sharply when the amount of labeled data is further reduced [24].
Yang et al. [24] investigated the case in which there are only a few labeled data. They proposed a semi-supervised learning method based on interpolated contrastive learning to improve the performance of the model when only two or three labels exist for each category. Hu et al. [25] proposed a patch-mixing contrastive regularization method to ensure that the feature representation is consistent with the task, thus improving semi-supervised classification performance with only a few labeled data. However, these methods are based on FixMatch, which only selects unlabeled data with high confidence for classification training and therefore does not fully utilize the available data.
Therefore, we propose a dual-training-based semi-supervised image classification method in this paper. We first add an extra projection layer at the end of the backbone, as shown in Figure 1. Although our method introduces an extra layer, the number of parameters in this layer is almost negligible, so it requires almost no additional GPU memory. Table 1 lists the number of parameters of different methods with WideResNet-28-2 as the backbone.

Method           FixMatch [16]   CoMatch [22]   SimMatch [23]   Ours
Parameters (M)   1.4676          1.4924         3.0013          1.4773

Furthermore, we propose a dual training approach that allows the model to learn various useful knowledge simultaneously. Specifically, for unlabeled data, we generate the pseudo-labels from the model predictions for weakly augmented data. After that, we train the model with the data augmented by two different strong data augmentations. The randomness of data augmentation helps the model learn various information simultaneously. In addition, to further increase the difference between the information learned from the two views, we use hard labels for one and soft labels for the other. Hard labels are the one-hot form of the model predictions. They are generated in the same way as in FixMatch, and we use the same threshold to select high-confidence data for training. Soft labels are the model predictions for weakly augmented data and can be seen as hard labels with label smoothing. We also use distribution alignment to further improve the soft labels.
To further improve the training of the model with soft-labeled data, we propose a simple and effective method for generating sample weights. Specifically, for each batch, we take the maximum predicted probability of each sample and normalize these values across the batch with the Softmax function to obtain the weight of each sample. We then multiply the weights by the number of samples in the batch so that their sum is the same as when every sample has a weight of 1.
Inspired by the effectiveness of self-supervised learning [26], we utilize the features extracted from different layers with data augmented by different augmentations to guide the self-learning of the model. By minimizing the cosine distance between these features, the model can learn more useful information by itself.
Finally, we evaluate our method on three benchmark datasets for image classification and verify that the proposed approach can effectively improve the performance of the model for image classification with only few labeled data.
The rest of the paper is organized as follows. The related works on semi-supervised learning and self-supervised learning are described in Section 2. The details of our proposed method are given in Section 3. The implementation details and the experimental results are described and discussed in Section 4. Finally, the conclusion is presented in Section 5.

Semi-Supervised Learning
Since annotating a huge amount of data is time-consuming and labor-intensive, how to fully utilize a small number of labeled data while maintaining model performance has become an urgent issue. Semi-supervised learning has shown its potential to solve this issue. Recently, the most popular semi-supervised learning methods include pseudo-labeling methods [17,[27][28][29] and consistency-based methods [18,19,30]. The purpose of pseudo-labeling methods is to generate pseudo-labels for unlabeled data and then treat them as labeled data to train the model. Lee et al. [17] first used the model predictions as pseudo-labels of unlabeled data. Rizve et al. [27] further selected the samples with high confidence to be labeled and introduced complementary labels for unlabeled data with low confidence. They then further selected samples to train the model using an uncertainty estimation method. Iscen et al. [28] constructed a nearest neighbor graph based on features to annotate unlabeled data. The main idea of consistency-based methods is consistency regularization. Tarvainen et al. [18] built a teacher model by exponential moving average and trained the model to produce the same predictions as the teacher model. MixMatch [19] combines labeled and unlabeled data by MixUp [31] and then minimizes the difference between the model predictions for data with various augmentations. Miyato et al. [30] perturbed the input data by virtual adversarial loss.
Recently, some pseudo-labeling methods have combined the advantages of consistency-based methods and achieved great success in semi-supervised image classification. FixMatch [16] utilizes weakly augmented data to generate pseudo-labels and trains the model on strongly augmented data. It also sets a threshold to select the samples used for training. Based on the architecture of FixMatch, CoMatch [22] learns the class probabilities and low-dimensional embeddings of the data. SimMatch [23] further takes semantic similarity and instance similarity into consideration and improves the model performance. Yang et al. [24] introduced interpolated contrastive learning. Hu et al. [25] proposed a patch-mixing contrastive-learning-based method for image classification and achieved impressive results with few labeled data.

Self-Supervised Learning
Self-supervised learning is an effective form of unsupervised learning. It is often used to pretrain a model by exploring the relationships between data or features without using labels. After self-supervised training, models often have better initial parameters for subsequent specific tasks, such as classification, detection and segmentation. SimCLR [32] constructs positive and negative sample pairs from data with different augmentations. Specifically, two differently augmented versions of the same datum are considered a positive sample pair, while different data compose negative sample pairs. The purpose of SimCLR is to reduce the distance between the features of positive sample pairs and expand the distance between the features of negative sample pairs. BYOL [33] trains the model only with positive sample pairs and abandons the negative sample pairs. SwAV [34] further introduces clustering and clusters the data by Sinkhorn-Knopp [35]. It trains the model by minimizing the difference between the clustering results of the same data under different data augmentations. SimSiam [26] proposes a simple self-supervised learning method based on a Siamese network, which removes the need for clustering and exponential moving averages and thus further simplifies self-supervised learning while preserving its effectiveness.

Our Approach
In this section, we describe the details of our proposed method. The training procedure of our method is shown in Figure 2. We first present the details of our proposed dual training strategy and then describe the generation of sample weights. Finally, we introduce the cosine-distance-loss-based self-supervised learning method. Algorithm 1 shows the procedure of our method.

Dual Training Strategy
FixMatch [16] found that training the model on strongly augmented data, with labels generated from the model predictions for weakly augmented data, can achieve great success in image classification. However, it sets a threshold to select samples with high confidence and ignores the other samples. In addition, the model converges slowly because of the high randomness of data augmentation, so FixMatch requires a large number of iterations to achieve satisfactory results. Therefore, we propose a novel dual training strategy to fully utilize all data and help the model learn more meaningful information.
Given a labeled dataset $\mathcal{X} = \{(x_i, y_i)\}_{i=1}^{N}$ and an unlabeled dataset $\mathcal{U} = \{u_j\}_{j=1}^{M}$, where $N$ and $M$ respectively represent the numbers of labeled and unlabeled data, $x_i$ and $y_i$ are the $i$-th labeled datum and its label, and $u_j$ is the $j$-th unlabeled datum. For the unlabeled dataset, we first augment the data with weak data augmentation and feed them into the model to obtain the predictions, which are used as the soft labels:

$$pl_j = h(g(f(u_j^{w}))),$$

where $u_j^{w}$ is the weakly augmented version of $u_j$, and $f(\cdot)$, $g(\cdot)$ and $h(\cdot)$, respectively, denote the encoder, the projection layer and the classifier. Inspired by the effectiveness of distribution alignment [21,36], we also apply distribution alignment to the soft labels to improve their accuracy:

$$\widetilde{pl}_j = \mathrm{Normalize}\left(pl_j / \bar{p}\right),$$

where $\bar{p}$ is computed as the moving average of the predictions for unlabeled data and $\mathrm{Normalize}(\cdot)$ rescales its argument to sum to one. Next, we generate hard labels from the soft labels:

$$\hat{y}_j = \mathrm{onehot}\left(\arg\max_{c} \widetilde{pl}_{j,c}\right),$$

where the $\mathrm{onehot}(\cdot)$ function yields one-hot labels from soft labels. Then, the threshold $\tau$ is set to select the data with high confidence:

$$m_j = \mathbb{1}\left(\max_{c} \widetilde{pl}_{j,c} > \tau\right),$$

where $\mathbb{1}(\cdot)$ is the mask function that chooses the samples whose maximum prediction is higher than the threshold $\tau$. After computing the soft and hard labels, we train the model on the unlabeled data with the standard cross-entropy loss $H(\cdot, \cdot)$:

$$\mathcal{L}_s = \frac{1}{B_u} \sum_{j=1}^{B_u} w_j \, H\left(\widetilde{pl}_j, \, h(g(f(u_j^{s_1})))\right), \qquad \mathcal{L}_h = \frac{1}{B_u} \sum_{j=1}^{B_u} m_j \, H\left(\hat{y}_j, \, h(g(f(u_j^{s_2})))\right),$$

where $B_u$ is the number of unlabeled samples in a batch, $u_j^{s_1}$ and $u_j^{s_2}$ are two differently strongly augmented versions of $u_j$, and $w_j$ is the weight of each sample, which is described in Section 3.2. In summary, the total loss of dual training can be written as

$$\mathcal{L}_{dual} = \lambda_s \mathcal{L}_s + \lambda_h \mathcal{L}_h,$$

where $\lambda_s$ and $\lambda_h$ are the weights of $\mathcal{L}_s$ and $\mathcal{L}_h$. With the dual training strategy, we train the model with all unlabeled data to fully utilize the existing data. Furthermore, we train the model at the same time on samples augmented by different strong augmentations, and the labels used for these samples are different (i.e., soft labels and hard labels). Therefore, the model can learn more meaningful information at once, which significantly improves the performance of the model.
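The dual training loss described above can be illustrated with a minimal, framework-free sketch. Everything named here is an assumption for illustration: the function names, the division-by-running-mean form of distribution alignment, which strong view is paired with which label type, and the default values of `tau`, `lambda_s` and `lambda_h`. The logits stand in for the outputs of the full network on the weak view and the two strong views.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one list of class logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distribution_align(probs, running_mean):
    """Assumed form of distribution alignment: divide by the running mean
    prediction and renormalize so the result sums to one."""
    scaled = [p / r for p, r in zip(probs, running_mean)]
    total = sum(scaled)
    return [s / total for s in scaled]

def cross_entropy(target, pred, eps=1e-12):
    """Cross-entropy between a target distribution and a predicted one."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, pred))

def dual_training_loss(weak_logits, strong1_logits, strong2_logits, running_mean,
                       sample_weights, tau=0.95, lambda_s=1.0, lambda_h=1.0):
    """Weighted soft-label loss on strong view 1 plus masked hard-label loss
    on strong view 2, both averaged over the unlabeled batch."""
    n = len(weak_logits)
    loss_soft = 0.0
    loss_hard = 0.0
    for j in range(n):
        soft = distribution_align(softmax(weak_logits[j]), running_mean)
        k = max(range(len(soft)), key=soft.__getitem__)       # predicted class
        hard = [1.0 if c == k else 0.0 for c in range(len(soft))]
        mask = 1.0 if soft[k] > tau else 0.0                  # confidence mask
        loss_soft += sample_weights[j] * cross_entropy(soft, softmax(strong1_logits[j]))
        loss_hard += mask * cross_entropy(hard, softmax(strong2_logits[j]))
    return lambda_s * loss_soft / n + lambda_h * loss_hard / n

# Toy batch: one confident and one uncertain sample, three classes.
weak = [[10.0, 0.0, 0.0], [0.1, 0.0, 0.0]]
strong1 = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
strong2 = [[3.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
uniform = [1.0 / 3, 1.0 / 3, 1.0 / 3]
print(dual_training_loss(weak, strong1, strong2, uniform, [1.0, 1.0]))
```

In this toy batch, the uncertain second sample is masked out of the hard-label term but still contributes to the soft-label term, which is the sense in which all unlabeled data are used.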

The Generation of Sample Weight
Each sample is weighted equally in the standard categorical cross-entropy loss. However, semi-supervised learning methods often label unlabeled data with a model trained on the original labeled data. As a result, the accuracy of the model predictions decreases as the amount of labeled data decreases. Hence, training the model with equally weighted samples can make the model learn more incorrect information. Giving different weights to the samples is a simple and effective way to overcome this disadvantage. For instance, MentorNet [37] constructs an additional model to learn the sample weights and trains the other model with the weighted samples. However, this method needs an additional model. Focal loss [38] adds a scaling factor to the standard categorical cross-entropy loss to control the weight of different categories. However, the focal loss was originally designed for object detection, and its weights are mainly for categories. In addition, the focal loss requires manually set hyperparameters. To address these problems, we propose a simple way to generate weights for different samples, which is displayed in Figure 3. We first apply distribution alignment to the predictions of the model to obtain the soft labels. Then, we select the maximum probability of each soft label and concatenate these values into a vector $v \in \mathbb{R}^{B \times 1}$, where $B$ is the batch size. Afterward, we use the Softmax function to normalize the vector:

$$\hat{v} = \mathrm{Softmax}(v).$$

The original sum of the sample weights is $B \times 1$, since each of the $B$ samples has a weight of 1, whereas the sum of the normalized vector is only 1. Using the normalized vector directly as the sample weight would therefore significantly reduce the information learned by the model from each image batch. Hence, we multiply the normalized vector by the batch size to keep the sum of the sample weights constant. Thus, the weight of the samples is given by

$$w = B \cdot \mathrm{Softmax}(v).$$

With the sample weight $w$, we can make the model focus more on the samples with high confidence while still learning from the other samples. Thus, it can make full use of all samples.
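The weight generation step above can be sketched in a few lines of plain Python. The function name `sample_weights` and the example probabilities are hypothetical; the computation follows the description in the text: a Softmax over the batch of maximum probabilities, rescaled by the batch size.

```python
import math

def sample_weights(max_probs):
    """Softmax over the per-sample maximum probabilities of one batch,
    multiplied by the batch size so that the weights sum to the batch size."""
    m = max(max_probs)
    exps = [math.exp(v - m) for v in max_probs]
    total = sum(exps)
    batch_size = len(max_probs)
    return [batch_size * e / total for e in exps]

# Hypothetical maximum probabilities for a batch of four unlabeled samples.
weights = sample_weights([0.99, 0.60, 0.35, 0.90])
print(weights)
print(sum(weights))  # equals the batch size up to floating-point rounding
```

Because every maximum probability lies between 0 and 1, the ratio between the largest and smallest weight in a batch stays below e (about 2.72), so low-confidence samples are down-weighted but never discarded.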

Self-Supervised Learning Based on Cosine Distance
SimSiam [26] designs a simple and effective self-supervised learning method based on the Siamese network. It can achieve competitive results without the large batch sizes required by prior works [32][33][34]. Its main idea is that the features extracted from differently augmented versions of the same sample should be as similar as possible. Inspired by the effectiveness of this kind of feature learning, we minimize a feature distance to encourage the model to learn from itself, as shown in Figure 4. Specifically, we regard the features extracted from weakly augmented data by the encoder as the basic features. The features extracted from strongly augmented data by the projection layer are regarded as the learnable features. We train the model by minimizing the cosine distance between the learnable features and the basic features. Thus, the loss of self-supervised learning is the cosine distance between the learnable features and the basic features.
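As a minimal sketch of such a loss, the snippet below uses the standard definition of cosine distance, one minus cosine similarity, averaged over a batch. The exact form used in the method (for example, whether gradients are blocked through the basic features, as in SimSiam) is not stated in the text, so the function names and the averaging are assumptions made for illustration.

```python
import math

def cosine_distance(a, b, eps=1e-12):
    """One minus the cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + eps)

def cosine_distance_loss(learnable_feats, basic_feats):
    """Mean cosine distance between paired learnable and basic features."""
    pairs = list(zip(learnable_feats, basic_feats))
    return sum(cosine_distance(z, t) for z, t in pairs) / len(pairs)

# Hypothetical 2-D features: the first pair is aligned, the second orthogonal.
learnable = [[1.0, 0.0], [0.0, 3.0]]
basic = [[2.0, 0.0], [4.0, 0.0]]
print(cosine_distance_loss(learnable, basic))
```

The loss depends only on the direction of the feature vectors, not their magnitude, so the learnable and basic features need not share the same scale.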

Experiments
This section describes the information of three classification datasets used in our experiments and presents the implementation details and experimental results to show the performance of the proposed method.

Datasets
We conducted the experiments on the Cifar10, Cifar100 [39] and SVHN [40] datasets for semi-supervised image classification. The information of each dataset is listed in Table 2. The Cifar10 and Cifar100 datasets contain 10 and 100 categories, respectively, and both have 50,000 images for training and 10,000 images for testing. The image size has been standardized to 32 × 32. Following the setting in ICL-SSL [24], for the Cifar10 dataset, we trained the model with three different numbers of labeled data, namely 20, 30 and 40. For the Cifar100 dataset, the numbers of labeled data are 200, 400 and 800, respectively. The SVHN dataset contains 99,289 images, each with a size of 32 × 32. The official training and test sets include 73,257 and 26,032 images, respectively. We tested our method on the SVHN dataset with 250, 500 and 1000 labeled data, respectively.
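To make these label budgets concrete, the sketch below draws a small labeled subset from a training set and treats the remaining samples as unlabeled. It assumes a class-balanced split (for example, 20 labels on Cifar10 means two per class) and excludes the labeled samples from the unlabeled pool; both choices, along with the function name and the toy labels, are illustrative assumptions rather than the protocol stated in the text.

```python
import random
from collections import defaultdict

def split_labeled_unlabeled(labels, num_labeled, seed=0):
    """Return (labeled_indices, unlabeled_indices) with an equal number of
    labeled samples per class, chosen at random with a fixed seed."""
    classes = sorted(set(labels))
    if num_labeled % len(classes) != 0:
        raise ValueError("num_labeled must be divisible by the number of classes")
    per_class = num_labeled // len(classes)

    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    rng = random.Random(seed)
    labeled = []
    for c in classes:
        labeled.extend(rng.sample(by_class[c], per_class))

    labeled_set = set(labeled)
    unlabeled = [i for i in range(len(labels)) if i not in labeled_set]
    return sorted(labeled), unlabeled

# Toy stand-in for a 10-class training set with 500 samples.
toy_labels = [i % 10 for i in range(500)]
labeled_idx, unlabeled_idx = split_labeled_unlabeled(toy_labels, num_labeled=20)
print(len(labeled_idx), len(unlabeled_idx))
```

With different seeds, the same function yields the different labeled subsets needed for repeated runs.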

Implementation Details
We used WideResNet-28-2 [41] as our backbone for all datasets. Random flipping and cropping were used as weak augmentation, and RandAugment [20] was used as strong augmentation. For a fair comparison, we followed the settings of ICL-SSL [24] and the patch-mixing method of Hu et al. [25]. For Cifar10, the batch size was set to 64 and the ratio of unlabeled data was 5. The SGD algorithm was employed for training with an initial learning rate of 0.03. The momentum was 0.9 and the weight decay was 0.0005. The three hyperparameters of our method were set to 5, 5 and 0.1, respectively. The number of epochs was 300. For Cifar100, we set the batch size to 16, and the ratio of unlabeled data was also 5. We employed the SGD algorithm with the same momentum and weight decay as for Cifar10. The initial learning rate was 0.005 and the total number of epochs was 300. The hyperparameters were set the same as for Cifar10. For the SVHN dataset, the three hyperparameters were set to 0.5, 5 and 0.1, respectively, and the other settings were the same as for Cifar10.
We trained our method on five runs with different random seeds and reported the mean accuracy and variance as the score of model performance, as the previous work did [25].
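The reporting protocol can be sketched with the standard library. The accuracies below are hypothetical placeholders, not results from our experiments, and the use of the population variance (rather than the sample variance) is an assumption, since the text does not specify which one is reported.

```python
import statistics

def summarize_runs(accuracies):
    """Return the mean and the population variance of per-seed accuracies."""
    return statistics.mean(accuracies), statistics.pvariance(accuracies)

# Hypothetical test accuracies (%) from five runs with different seeds.
run_accuracies = [92.1, 92.5, 92.9, 92.3, 92.7]
mean_acc, var_acc = summarize_runs(run_accuracies)
print(f"{mean_acc:.2f} ± {var_acc:.2f}")
```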
We present the comparison of our method with the other methods on Cifar10 and Cifar100 with different numbers of labeled data in Table 3. Our method outperformed all the compared methods for every number of labeled data. For instance, our method achieved an accuracy of 92.67% with only 20 labeled data on Cifar10, which is 0.72 and 3.94 percentage points higher than the patch-mixing method of Hu et al. [25] and ICL-SSL [24], respectively. For Cifar100, a more complex dataset with more categories, our method also achieved the best results, with a significant margin over the other compared methods. The results demonstrate that our method can achieve better results with fewer labels, which helps reduce the cost of annotation.
We further investigated the classification accuracy of our method with larger, more conventional numbers of labeled data on the SVHN dataset. Table 4 shows the results of the experiments conducted on the SVHN dataset with 250, 500 and 1000 labeled data. As shown in Table 4, our method still achieved the best results. For instance, our method outperformed the patch-mixing method of Hu et al. [25] by 0.26 percentage points on the SVHN dataset with 1000 labeled data, which demonstrates the effectiveness of our method with conventional numbers of labeled data.

Ablation Study
This subsection investigates the impact of the different components of our method. We first investigated the effectiveness of our proposed dual training strategy. The experiments were conducted on Cifar100 with 800 labeled data. Since the dual training strategy includes training with soft labels and hard labels, we separately evaluated the method with only soft labels and with only hard labels; the results are shown in Table 5. It should be noted that the dual training strategy trains the model on data from two different strong data augmentations, while training with a single kind of label only employs data augmented once. In addition, in this paper, training with hard labels means that we only use the high-confidence samples selected by the threshold. The method achieved an accuracy of 50.96% with only hard labels. The accuracy rose to 51.98% when the model was trained with soft labels instead of hard labels. However, both single-label settings remained well below the dual training strategy.
Next, we compared the performance of the model under different dual training strategies. These results are also presented in Table 5. The model trained by the dual training strategy with both hard and soft labels reached the highest accuracy of 54.34%, which is 0.55 and 1.97 percentage points higher than dual training with only hard labels or only soft labels, respectively. As such, a dual training strategy with both hard and soft labels is beneficial for improving model performance. During training, we found that using soft labels accelerates the convergence of the model, as can be seen in Figure 5. However, training only with soft labels increases the risk of the model learning incorrect knowledge, resulting in poor performance at the end of training. Using hard labels makes the model learn less information at the beginning of training but focuses the model on more confident samples, thus reducing the risk of learning incorrect knowledge. Hence, we combined hard labels and soft labels to train our model and achieved the highest accuracy.
In this study, we employed distribution alignment to further improve the accuracy of the pseudo-labels of unlabeled data. To investigate its influence, we tested the method with and without distribution alignment on Cifar100 with 800 labeled data; the results are listed in Table 6. As shown in the table, simply applying distribution alignment increased the accuracy from 51.79% to 54.34%, which demonstrates the importance of distribution alignment. In addition, we proposed a simple weight generation method to improve the training of the model. Table 7 shows that the model achieved an accuracy of 52.30% without sample weights, which is 2.04 percentage points lower than that of the method with sample weights. These results show that both distribution alignment and sample weights are beneficial in improving the performance of the model.
To improve the classification ability of the model and guide its self-training, we minimized the distance between features extracted from different layers with the cosine distance loss. We also conducted experiments on Cifar100 with 800 labeled data to investigate the influence of the cosine distance loss. Table 8 shows the results of our method with and without the cosine distance loss. Adding the cosine distance loss during training raised the accuracy from 54.04% to 54.34%, which demonstrates the usefulness of the cosine distance loss.
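As a quick arithmetic check on the ablation figures quoted above, the snippet below computes how many percentage points each component adds relative to the full method on Cifar100 with 800 labeled data. The accuracies are taken from the text; the dictionary keys are descriptive labels chosen here for illustration.

```python
# Accuracy (%) of the full method and of each ablated variant, as quoted in the text.
full_method = 54.34
ablations = {
    "hard labels only": 50.96,
    "soft labels only": 51.98,
    "without distribution alignment": 51.79,
    "without sample weight": 52.30,
    "without cosine distance loss": 54.04,
}

# Gain of the full method over each variant, in percentage points.
gains = {name: round(full_method - acc, 2) for name, acc in ablations.items()}
for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name}: +{gain:.2f}")
```

On these figures, switching from a single kind of label to dual training accounts for the largest gains, and the cosine distance loss for the smallest.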

Conclusions
In this paper, we proposed an effective semi-supervised learning method based on dual training for image classification. We improved the model performance with few labels without substantially increasing the number of model parameters. We proposed the dual training strategy, which combines the advantages of soft labels and hard labels, to help the model learn more useful information and fully utilize the existing data. To prompt the model to focus on the samples with high confidence without ignoring the rest of the samples, we proposed a simple weight generation method to guide the model training. Furthermore, we employed a feature-based cosine distance loss to improve the self-learning of the model and enhance the model performance. To evaluate the effectiveness of our proposed method, we conducted experiments on three image classification datasets and compared it with other methods. Experimental results demonstrate that our method works more effectively than the compared methods with few labels. In the future, we will further improve our method by replacing the cosine distance and applying stronger data processing.

Figure 1. The proposed model architecture. (a) Backbone; (b) our model. Different colors represent different features.

Figure 2. The training procedure of our method. Different colors represent different modules and features.

Figure 3. The generation of the sample weight. Different colors represent different dimensions of vectors.

Figure 4. Self-supervised learning based on cosine distance. Different colors represent different modules and features.

Figure 5. The test accuracy of different methods.

Table 1. The number of parameters for different methods with WideResNet-28-2 as the backbone.

Table 2. The information of three benchmark datasets.

Table 3. Comparison with state-of-the-art methods in test accuracy (%) on Cifar10 and Cifar100 datasets with different numbers of labeled data. Bold represents the best result.

Table 4. Comparison with state-of-the-art methods in test accuracy (%) on the SVHN dataset with 250, 500 and 1000 labeled data. Bold represents the best result.

Table 5. The effectiveness of the dual training strategy on Cifar100 with 800 labeled data. Bold represents the best result.

Table 6. The effectiveness of distribution alignment on Cifar100 with 800 labeled data. DA: distribution alignment. Bold represents the best result.

Table 7. The effectiveness of sample weight on Cifar100 with 800 labeled data. SW: sample weight. Bold represents the best result.

Table 8. The results of our method with and without cosine distance loss on Cifar100 with 800 labeled data. CDL: cosine distance loss. Bold represents the best result.