A Unified Contrastive Loss for Self-Training



Introduction
Semi-supervised learning benefits significantly from advances in unsupervised representation learning, particularly through self-supervised approaches, which excel at efficiently extracting information from unlabeled data. Among these approaches, contrastive learning [9,18,19,27] has been particularly effective in computer vision. Moreover, contrastive learning is not limited to unsupervised settings. The standard practice for training deep neural networks in a supervised setting has traditionally been to use cross-entropy (CE) as the primary loss function. Recently, [21] developed a supervised contrastive loss function, dubbed SupCon, which yields highly discriminative representations and comparable or even superior accuracy. It uses label information to create positive pairs, instead of relying on data augmentation to generate two different views of the same unlabeled sample. Specifically, positive instances (instances from the same class within a batch) are pulled closer together in the embedding space while being pushed away from negative instances (instances from other classes). Recent research suggests that the SupCon loss may increase robustness and be less sensitive to hyperparameter choices for data augmentation or optimizers [17,20,21].
However, while SupCon hinges on the presence of labeled training data, its unsupervised counterpart cannot leverage any label information. The primary objective of this study is to adapt the principles and advantages of SupCon to semi-supervised scenarios. By integrating both the supervised and unsupervised aspects of contrastive learning, we introduce a Semi-Supervised Contrastive (SSC) framework that uses a single loss $\mathcal{L}_{SSC}$. Our approach can be integrated into existing self-training techniques such as FixMatch [29], allowing for a seamless transition between unsupervised and supervised paradigms.
Unlike the CE loss trained with a softmax activation function, a contrastive loss does not directly provide the probability distribution needed to pseudo-label examples during self-training. To address this challenge, we introduce class prototypes and establish a theoretical equivalence between classical cross-entropy and supervised contrastive learning with these prototypes. The main contributions of this work are threefold:
- We propose a new framework for semi-supervised learning based on a Semi-Supervised Contrastive loss $\mathcal{L}_{SSC}$ that handles labeled, pseudo-labeled, and unconfident examples at the same time.
- We show how to integrate class prototypes and establish a theoretical bridge between cross-entropy and supervised contrastive learning with prototypes.
- We apply our loss to FixMatch, a simple existing framework, and show significant improvement on three datasets. We also investigate the properties of our loss function, highlighting its faster convergence rate, its adaptability to transfer learning, and its stability with respect to hyperparameters.
In the following, we begin by presenting the notations and background in Section 2. Next, in Section 3, we introduce our proposed approach. Then, we discuss the experiments carried out on three benchmarks in Section 4. Lastly, Section 5 presents our conclusions.

Notations and Background
Notations. We now introduce the necessary notation and then show how it connects to previous related work. We use matrix notation rather than vectors, which provides a more convenient framework for presenting our approach. In the semi-supervised context, the batch is divided into a matrix $X$ consisting of $B$ labeled examples with their associated label vector $y^x$, and another matrix $U$ containing $\mu B$ unlabeled examples, where the integer $\mu$ denotes the size ratio between $U$ and $X$. More specifically, we have $X = [x_1, \dots, x_B]^\top$, $y^x = [y^x_1, \dots, y^x_B]^\top$, and $U = [u_1, \dots, u_{\mu B}]^\top$. We denote by $f : X \cup U \to \mathcal{Z}$ an encoder that maps examples into the hidden space $\mathcal{Z} \subseteq \mathbb{R}^h$. Then, a projection head $p$ maps embeddings into a probability distribution over the $K$ classes of our classification problem. In the context of self-training, pseudo-labels refer to labels automatically assigned to unlabeled data from the model's highest-confidence prediction, defined as the argmax of the projection head $p$. The model is said to be confident in an unlabeled example if the maximum probability exceeds a threshold $\tau$.
Following most recent semi-supervised learning approaches based on consistency regularization, we employ both a weak data augmentation, denoted as $\alpha(\cdot)$, and a strong augmentation, denoted as $A(\cdot)$. During training, the encoder $f$ is used to compute three distinct embeddings:
- $Z^x = [z^x_1, \dots, z^x_B]^\top \in \mathbb{R}^{B \times d}$ is the supervised embedding generated from the labeled training data.
- $Z^u \in \mathbb{R}^{2\mu B \times d}$ denotes the embeddings produced by applying two stochastic strong augmentations to the unlabeled data. Using two augmentations ensures at least one positive pair for each example in the batch.
- $Z^w = [z^w_1, \dots, z^w_{\mu B}]^\top \in \mathbb{R}^{\mu B \times d}$ is the unsupervised embedding created by applying the weak data augmentation. This embedding is used to estimate a confidence score and to generate pseudo-labels in self-training approaches.
Finally, we define two sets of labels associated with unlabeled examples:
- $y^{u\uparrow} = [q^\top, q^\top]^\top$, where $q = \arg\max p(Z^w)$ are the pseudo-labels computed from the weakly augmented examples. They are associated with unlabeled data on which the model has high confidence, above a given threshold $\tau$.
- $y^{u\downarrow} = [i^\top, i^\top]^\top$, where $i = [1, 2, \dots, \mu B]^\top$ are the labels associated with unlabeled examples whose confidence is below the threshold $\tau$. This definition ensures that each unconfident example has exactly one positive example associated with it.
We now briefly present the contrastive and semi-supervised learning losses of related work using the previous notation, and then show how our approach is connected to these loss functions.

Supervised Contrastive Learning. In [21], labeled data is used to ensure that embeddings of samples with identical labels are pulled closer together, while embeddings of samples with different labels are pushed farther apart. This is achieved by using the supervised embeddings $Z^x$ along with their corresponding labels $y^x$.
The objective involves computing, for each embedding $Z^x_i$ (referred to as the anchor), its cosine similarity with all other embeddings $Z^x_p$ that share the same label (referred to as positive pairs). This similarity is then normalized by the sum of similarities across all pairs, following the principles of the classical InfoNCE loss [27]. More precisely, we have:

$$\mathcal{L}_{SupCon}(Z^x, y^x) = \sum_{i \in I} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(Z^x_i \cdot Z^x_p / T)}{\sum_{a \in I \setminus \{i\}} \exp(Z^x_i \cdot Z^x_a / T)}$$

where $I = \{1, \dots, B\}$ is the set of anchor indices, $P(i) = \{p \in I \setminus \{i\} : y^x_p = y^x_i\}$ is the set of positive examples associated with example $i$, and $T$ is a temperature hyperparameter. Note that the labels $y^x$ are used in the equation only to define the positive pairs $P$.
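The SupCon objective described above can be sketched in a few lines of dependency-free Python. This is a minimal illustration, not the authors' implementation: the function names, the default temperature `T=0.1`, the choice to sum (rather than average) over anchors, and the decision to skip anchors without any positive are all assumptions made for the example.

```python
import math

def normalize(v):
    """Scale v to unit L2 norm so that dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def supcon_loss(Z, y, T=0.1):
    """Sum over anchors of the SupCon term; anchors with no positive are skipped."""
    Z = [normalize(z) for z in Z]
    n = len(Z)
    loss = 0.0
    for i in range(n):
        # Positives: every other embedding that shares the anchor's label.
        positives = [p for p in range(n) if p != i and y[p] == y[i]]
        if not positives:
            continue
        # Temperature-scaled similarities between the anchor and all others.
        sims = {a: sum(u * v for u, v in zip(Z[i], Z[a])) / T
                for a in range(n) if a != i}
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        loss += -sum(sims[p] - log_denom for p in positives) / len(positives)
    return loss

# Toy batch: two tight clusters in 2D (illustrative values only).
Z = [[1.0, 0.0], [2.0, 0.1], [0.0, 1.0], [0.1, 3.0]]
well_separated = supcon_loss(Z, [0, 0, 1, 1])  # labels agree with geometry
mixed_up = supcon_loss(Z, [0, 1, 0, 1])        # labels contradict geometry
print(well_separated, mixed_up)
```

When the labels agree with the geometry of the embeddings the loss is close to zero, and it grows sharply when positives lie far from their anchor, which is the behavior the equation is designed to produce.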
Unsupervised Contrastive Learning. As in methods such as SimCLR [9] or MoCo [19], unsupervised contrastive learning relies on two strong augmentations of each instance in the unlabeled dataset $U$ and employs the InfoNCE loss [27].
Unlike SupCon, which uses explicit labels to identify positive and negative samples, self-supervised losses operate under an unsupervised paradigm where labels are not provided. Consequently, in self-supervised learning, every augmented sample has only one positive pair, which makes the loss a special case of the SupCon loss. Based on the previous definition of $y^{u\downarrow}$, the self-supervised InfoNCE loss can be expressed simply as:

$$\mathcal{L}_{InfoNCE}(Z^u) = \mathcal{L}_{SupCon}(Z^u, y^{u\downarrow}) \tag{2}$$

Interpreting the unsupervised contrastive loss as a specific instance of SupCon is central to the design of our unified loss.

Self-training [2]. Also commonly referred to as pseudo-labeling, self-training is a wrapper algorithm that is widely adopted in recent state-of-the-art semi-supervised learning approaches [4,5,7,8,10,24,29,38]. A classifier is first trained on the labeled training data; it then iteratively assigns pseudo-labels to unlabeled data and is retrained on the augmented training set. Some approaches propose to use self-training in an online manner [4,5,29].

More specifically, FixMatch [29] applies a CE loss $\mathcal{L}_x$ to the labeled examples $X$, whereas an extra unsupervised CE loss $\mathcal{L}_u$ is applied to unlabeled training examples with their associated pseudo-labels, only if the model confidence exceeds the threshold $\tau$:

$$\mathcal{L}_x = \frac{1}{B} \sum_{b=1}^{B} H\big(y^x_b,\, p(f(\alpha(x_b)))\big), \qquad \mathcal{L}_u = \frac{1}{\mu B} \sum_{b=1}^{\mu B} \mathbb{1}\big(\max p(z^w_b) \geq \tau\big)\, H\big(q_b,\, p(f(A(u_b)))\big)$$

where $H$ denotes the cross-entropy and $q_b = \arg\max p(z^w_b)$ is the pseudo-label of the $b$-th unlabeled example. This unsupervised loss is based on the consistency regularization principle, which encourages the model to become invariant to perturbations of the input, such as strong augmentations. It has become central to many popular recent semi-supervised approaches in computer vision [22,30,34].
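To make the confidence masking concrete, here is a small sketch of the FixMatch-style unsupervised term. It assumes the class probabilities for the weak and strong views are already available as plain lists; the function name and the toy numbers are hypothetical, and only the masking and pseudo-labeling logic is meant to mirror the description above.

```python
import math

def fixmatch_unsup_loss(probs_weak, probs_strong, tau=0.95):
    """Mean cross-entropy of the strong views against pseudo-labels taken from
    the weak views, counting only examples whose weak-view confidence >= tau."""
    total = 0.0
    for p_weak, p_strong in zip(probs_weak, probs_strong):
        # Pseudo-label: argmax of the weak-view distribution.
        q = max(range(len(p_weak)), key=lambda k: p_weak[k])
        if p_weak[q] >= tau:
            total += -math.log(p_strong[q])
    # Averaged over all unlabeled examples, masked ones contribute zero.
    return total / len(probs_weak)

# First example is confident (0.97), second is not (0.60) and is ignored.
probs_weak = [[0.97, 0.03], [0.60, 0.40]]
probs_strong = [[0.50, 0.50], [0.10, 0.90]]
loss = fixmatch_unsup_loss(probs_weak, probs_strong)
print(loss)
```

In this setting the second example contributes nothing to training, which is exactly the unused signal that a loss covering unconfident examples aims to recover.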
Recently, adaptive thresholding strategies for generating pseudo-labels have been proposed in Dash [35], FlexMatch [37], AdaMatch [6], and FreeMatch [33]. SoftMatch [8] proposes to adjust the contribution of pseudo-labels based on their confidence levels by learning a parametric density function that adaptively assigns a weight to each pseudo-labeled example.
On the other hand, CoMatch [24] and SimMatch [38] introduce an additional contrastive loss that enforces similarity between representations with similar probability distributions. Other semi-supervised approaches [23,31] have already proposed to use the SupCon loss as an extra regularization term applied to labeled or pseudo-labeled examples. Beyond the self-training techniques mentioned above, other very successful semi-supervised learning approaches exist and often rely on self-supervised principles. This may involve using an additional regularization loss as in S4L [7], using contrastive pre-training combined with distillation as in SimCLR V2 [10], or using clustering approaches like PAWS [3] or Suave and Daino [15].
In contrast to all the aforementioned approaches, our method uses a single contrastive loss that handles both the labeled training data and all the unlabeled training examples at the same time, including those on which the model is not confident.

Method
Our approach is a wrapper algorithm that can be easily adapted to various self-training algorithms. We use FixMatch as an example to illustrate our approach because of its simplicity. However, our proposed approach is flexible enough to be applied to more complex self-training algorithms.

Overview
In our approach, we aim to extend the classical SupCon loss so that labeled examples, pseudo-labeled examples, and unlabeled examples on which the model is unconfident are all handled simultaneously within a single loss formulation. The overall architecture of our method is illustrated in Figure 1.
Using the encoder $f$, we first compute $Z^x$, the embeddings of the labeled training data. In a similar way, we generate $Z^u$ by applying two strong data augmentations to the unlabeled training data, and $Z^w$ by applying a weak data augmentation.
Unsupervised part $Z^u$, $y^u$. A key feature of our framework is that it handles all unlabeled examples in the loss, regardless of their confidence. For high-confidence examples, we adopt a strategy similar to online self-training methods such as FixMatch, using the pseudo-labels previously defined as $y^{u\uparrow}$. However, rather than disregarding examples whose posterior probability is below the threshold $\tau$, we assign unique labels to them using $y^{u\downarrow}$. Note that, to make sure these labels are unique, their values are shifted by $K$ so that they do not interfere with existing classes.
This leads to the creation of single positive pairs, mirroring the mechanics of unsupervised contrastive losses such as SimCLR, as shown in Equation 2. By incorporating both confident and unconfident examples within $y^u$, our method is able to leverage all unlabeled training data. Note that, even though the loss does not directly depend on the weakly augmented embeddings, $Z^w$ is used to compute $y^u$.
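The construction of $y^u$ can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: classes are indexed from 0 (so unique labels start at `K`), the function name is hypothetical, and the weak-view probabilities are assumed to be given.

```python
def build_unlabeled_labels(probs_weak, num_classes, tau=0.95):
    """Labels for the two strong views of each unlabeled example.

    Confident examples receive their pseudo-label (argmax of the weak view).
    Unconfident examples receive a unique label, shifted by num_classes so it
    cannot collide with a real class index.
    """
    labels = []
    for i, p in enumerate(probs_weak):
        q = max(range(num_classes), key=lambda k: p[k])
        if p[q] >= tau:
            labels.append(q)                # confident: class pseudo-label
        else:
            labels.append(num_classes + i)  # unconfident: unique id, offset by K
    # Both strong augmentations of the same example share the same label,
    # following the stacked layout of the 2*mu*B strong-view embeddings.
    return labels + labels

probs_weak = [[0.98, 0.01, 0.01],   # confident in class 0
              [0.40, 0.35, 0.25]]   # unconfident
y_u = build_unlabeled_labels(probs_weak, num_classes=3)
print(y_u)
```

With these labels, an unconfident example has exactly one positive (its other strong view), while a confident example is also pulled towards every other embedding carrying the same class label.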
Centroid part $Z^c$, $y^c$. Computing the labels $y^u$ for unlabeled examples requires a projection head $p$ that maps $Z^w$ into a probability distribution. However, training a model with a supervised contrastive loss does not directly produce such a classifier, which is why an extra training phase with cross-entropy is employed when performing classification with the SupCon loss [21].
To address this issue while maintaining a fully contrastive framework, we propose the use of class prototypes [14,16,39]. This consists of using $K$ trainable parametric centers $Z^c \in \mathbb{R}^{K \times d}$ that lie directly in the embedding space. We define the prototype labels as $y^c = [1, 2, \dots, K]^\top$ so that the $k$-th row of $Z^c$ represents the prototype associated with class $k$. These parameters, initialized randomly, are then updated throughout the contrastive training process in the same way as all other embeddings. A novel aspect of our method is to use these prototypes to define a probability distribution for a weakly augmented example $z^w_i$, by applying a softmax function with temperature $T'$ to its cosine similarity with all prototypes:

$$p(z^w_i)_k = \frac{\exp(z^w_i \cdot Z^c_k / T')}{\sum_{k'=1}^{K} \exp(z^w_i \cdot Z^c_{k'} / T')}$$

Training with these prototypes defines a classification head $p$, used to compute $y^u$, without adding an extra cross-entropy loss. Further analysis and a connection with the cross-entropy are discussed below.

A unified loss $\mathcal{L}_{SSC}$. Finally, our loss, denoted as the Semi-Supervised Contrastive (SSC) loss, can be easily expressed using SupCon and the previously defined quantities:

$$\mathcal{L}_{SSC} = \mathcal{L}_{SupCon}\left(\begin{bmatrix} Z^x \\ Z^u \\ Z^c \end{bmatrix}, \begin{bmatrix} y^x \\ y^u \\ y^c \end{bmatrix}\right)$$

Table 1 provides an overview of different state-of-the-art approaches that use labeled, pseudo-labeled, and unconfident unlabeled data in their learning process. By comparison, our proposed approach is the only one that takes advantage of all labeled and unlabeled training data in a fully contrastive framework.
The pseudo-code of the proposed approach is provided in Algorithm 1. First, the algorithm computes the embeddings of the labeled examples, $Z^x$, of the unlabeled examples under strong augmentation, $Z^u$, and of the prototypes, $Z^c$. The labels $y^u$ associated with unlabeled examples are generated by our pseudo-labeling process using both $Z^c$ and the embeddings of the weakly augmented examples, $Z^w$. Finally, it combines the labels of the labeled examples, $y^x$, of the pseudo-labeled examples, $y^u$, and of the prototypes, $y^c$, to train the model using the SSC loss function defined in Equation 7.
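The steps just described can be put together in a compact, dependency-free sketch. It is a hedged illustration and not Algorithm 1 itself: embeddings are passed in as plain lists instead of being produced by an encoder, no gradients are computed, classes are indexed from 0, and all function names are hypothetical. The default values `T=0.01`, `T_prime=0.04`, and `tau=0.95` are the ones reported in the experimental setup.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def prototype_probs(z_w, Zc, T_prime=0.04):
    """Softmax over cosine similarities between a weak embedding and the prototypes."""
    logits = [dot(normalize(z_w), normalize(c)) / T_prime for c in Zc]
    lse = log_sum_exp(logits)
    return [math.exp(l - lse) for l in logits]

def supcon_loss(Z, y, T):
    """Sum of SupCon anchor terms; anchors without positives are skipped."""
    Z = [normalize(z) for z in Z]
    n, loss = len(Z), 0.0
    for i in range(n):
        positives = [p for p in range(n) if p != i and y[p] == y[i]]
        if not positives:
            continue
        sims = {a: dot(Z[i], Z[a]) / T for a in range(n) if a != i}
        lse = log_sum_exp(list(sims.values()))
        loss += -sum(sims[p] - lse for p in positives) / len(positives)
    return loss

def ssc_loss(Zx, yx, Zu_a, Zu_b, Zw, Zc, T=0.01, T_prime=0.04, tau=0.95):
    """Pseudo-label with the prototypes, then apply one SupCon loss to everything."""
    K = len(Zc)
    yu = []
    for i, z_w in enumerate(Zw):
        probs = prototype_probs(z_w, Zc, T_prime)
        q = max(range(K), key=lambda k: probs[k])
        # Confident: class pseudo-label. Unconfident: unique label shifted by K.
        yu.append(q if probs[q] >= tau else K + i)
    # Concatenate labeled, strong-view (two augmentations) and prototype parts.
    Z = Zx + Zu_a + Zu_b + Zc
    y = yx + yu + yu + list(range(K))
    return supcon_loss(Z, y, T), yu

# Toy 2D example with K = 2 prototypes (illustrative values only).
Zc = [[1.0, 0.0], [0.0, 1.0]]
Zx, yx = [[1.0, 0.1], [0.1, 1.0]], [0, 1]
Zw = [[1.0, 0.0], [1.0, 1.0]]       # first is near prototype 0, second is ambiguous
Zu_a = [[1.0, 0.2], [0.9, 1.0]]     # first strong view of each unlabeled example
Zu_b = [[1.0, -0.1], [1.0, 0.8]]    # second strong view
loss, yu = ssc_loss(Zx, yx, Zu_a, Zu_b, Zw, Zc)
print(loss, yu)
```

In a real training loop the prototypes would be trainable parameters and the weak-view branch would be excluded from back-propagation, as stated in the caption of Fig. 1; neither aspect is modeled in this sketch.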

Weighted Semi-SupCon Loss
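Similar to other mixed-loss frameworks that include parameters to balance the supervised, pseudo-labeled, and unsupervised parts, we extend the previously defined semi-supervised contrastive loss $\mathcal{L}_{SSC}$ with anchor weights, using an additional parameter $\lambda$:

$$\mathcal{L}_{SSC}(Z, y, \lambda) = \sum_{i \in I} \frac{-\lambda_i}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(Z_i \cdot Z_p / T)}{\sum_{a \in I \setminus \{i\}} \exp(Z_i \cdot Z_a / T)}$$

where $Z$ and $y$ denote the concatenated embeddings and labels, and $I$ indexes all rows of $Z$. These weights provide a mechanism to give higher importance to some anchors. In practice, we use a very simple strategy with a constant value that depends only on the nature of the anchor:

$$\lambda_i = \begin{cases} \lambda^{x} & \text{if anchor } i \text{ is a labeled example} \\ \lambda^{u\uparrow} & \text{if anchor } i \text{ is a confident pseudo-labeled example} \\ \lambda^{u\downarrow} & \text{if anchor } i \text{ is an unconfident example} \\ \lambda^{c} & \text{if anchor } i \text{ is a prototype} \end{cases}$$

More advanced approaches using adaptive weighting can be easily implemented, for instance with weights based on the confidence of the classifier $p$ [8,24]. From now on, $\mathcal{L}_{SSC}$ always refers to this weighted version of the loss.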
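A possible way to implement the constant per-anchor weighting is sketched below. How the weighted terms are normalized is not specified here, so the sketch simply scales each anchor term by its weight and sums; this choice, the anchor-kind names, and the function name are assumptions. The weight values are those reported in the experimental setup.

```python
import math

# Constant weight per anchor kind (values from the experimental setup).
LAMBDA = {"x": 1.0, "u_conf": 1.0, "u_unconf": 0.2, "c": 1.0}

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def weighted_supcon_loss(Z, y, kinds, lam=LAMBDA, T=0.1):
    """SupCon where each anchor term is scaled by the weight of its kind."""
    Z = [normalize(z) for z in Z]
    n, loss = len(Z), 0.0
    for i in range(n):
        positives = [p for p in range(n) if p != i and y[p] == y[i]]
        if not positives:
            continue
        sims = {a: sum(u * v for u, v in zip(Z[i], Z[a])) / T
                for a in range(n) if a != i}
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        term = -sum(sims[p] - log_denom for p in positives) / len(positives)
        loss += lam[kinds[i]] * term  # weight depends only on the anchor's kind
    return loss

# One labeled / pseudo-labeled pair of class 0 and one unconfident pair (unique id 5).
Z = [[1.0, 0.1], [0.9, 0.0], [0.2, 1.0], [0.0, 0.8]]
y = [0, 0, 5, 5]
kinds = ["x", "u_conf", "u_unconf", "u_unconf"]
uniform = {k: 1.0 for k in LAMBDA}
unweighted = weighted_supcon_loss(Z, y, kinds, lam=uniform)
weighted = weighted_supcon_loss(Z, y, kinds)
print(unweighted, weighted)
```

Down-weighting unconfident anchors reduces their share of the total loss without removing them from the denominators of the other anchors, so they still act as negatives for the rest of the batch.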

Link with cross-entropy
We now establish a relationship between the cross-entropy (CE) loss and our prototype-based contrastive loss, in the classical supervised framework and under mild assumptions. As already observed in previous work [28], both loss functions have inherent similarities, particularly in that negative embeddings are treated similarly to the weights of a linear classification layer. Our prototype-based approach builds on this analogy. If we remove the bias of the last projection layer, the CE loss $H$ can be expressed in terms of the weights of the final linear projection layer $W \in \mathbb{R}^{K \times d}$ as:

$$H(Z^x, y^x; W) = -\sum_{i \in I} \log \frac{\exp(W_{y^x_i} \cdot Z^x_i)}{\sum_{k=1}^{K} \exp(W_k \cdot Z^x_i)}$$

If we set the temperature of the SupCon loss to $T = 1$ and ensure that all embeddings are normalized, it is easy to see that, by replacing the weights $W$ of the last layer with the prototypes $Z^c$, we get:

$$H(Z^x, y^x; Z^c) = -\sum_{i \in I} \log \frac{\exp(Z^x_i \cdot Z^c_{y^x_i})}{\sum_{k=1}^{K} \exp(Z^x_i \cdot Z^c_k)}$$

Each summand is the SupCon term of the anchor $Z^x_i$ when the only other embeddings are the $K$ prototypes. The CE loss is therefore equivalent to applying separate SupCon losses to each example, each of which has only one positive pair, namely its class prototype. Both losses fundamentally aim to learn prototypes $Z^c$, or equivalently weights $W$, that are aligned with the feature vectors of their class, which supports the use of the prototypes to learn a similar probability distribution $p$ on the embedding space.
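The equivalence can be checked numerically on a toy example. The sketch below compares a bias-free cross-entropy, whose weight matrix is set to the prototypes, with the SupCon term of a single anchor whose only companions are the $K$ prototypes, at $T = 1$ with normalized embeddings. All names and numeric values are illustrative assumptions.

```python
import math

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_sum_exp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def cross_entropy_no_bias(z, label, W):
    """CE of a linear classifier without bias: logits are the rows of W dotted with z."""
    logits = [dot(w, z) for w in W]
    return log_sum_exp(logits) - logits[label]

def supcon_anchor_term(Z, y, i, T=1.0):
    """SupCon term of anchor i over the embeddings Z with labels y."""
    n = len(Z)
    positives = [p for p in range(n) if p != i and y[p] == y[i]]
    sims = {a: dot(Z[i], Z[a]) / T for a in range(n) if a != i}
    lse = log_sum_exp(list(sims.values()))
    return -sum(sims[p] - lse for p in positives) / len(positives)

# Three normalized prototypes (K = 3) and one normalized example of class 1.
Zc = [normalize(c) for c in ([1.0, 0.2, 0.0], [0.1, 1.0, 0.3], [0.0, 0.4, 1.0])]
z, label = normalize([0.3, 0.8, -0.2]), 1

ce = cross_entropy_no_bias(z, label, W=Zc)
# Anchor = the example; the other embeddings are the prototypes labeled 0..K-1,
# so the anchor's single positive is the prototype of its own class.
sc = supcon_anchor_term([z] + Zc, [label, 0, 1, 2], i=0, T=1.0)
print(ce, sc)
```

The two quantities coincide because, from the anchor's point of view, the prototypes play the same role as the rows of the final linear layer.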
In the following section, we present our experimental setup, compare our method with established self-training approaches, and evaluate the impact of individual components through an ablation study. We also investigate the transfer performance of our approach and its synergy with self-supervised pre-training, focusing on convergence speed. Finally, we analyze the stability of the hyperparameters in our proposed loss function.

Experimental setup
Our framework is evaluated on three classical benchmark datasets: CIFAR-100 [1], STL-10 [12], and SVHN [26]. For each dataset, we explore two splits, keeping only 4 or 25 labeled examples per class. We ran each experiment with 3 random seeds and report both the mean and the standard deviation. Following the setup in [29], baseline models are reported for 1024 epochs, where an epoch is arbitrarily defined as $2^{10}$ steps, following the literature. However, to demonstrate the efficiency of our approach, we train with $\mathcal{L}_{SSC}$ for only 256 epochs. We use a Wide ResNet WRN-28-2 [36] for all experiments on CIFAR-100 and SVHN, while a larger WRN-37-2 is used for STL-10. Additionally, on top of these architectures, we add a projection head as in SupCon, which consists of a 2-layer MLP with an output dimension of 128 for WRN-28-2 and 256 for WRN-37-2 (following the dimension of the original projection used with CE). For FixMatch, the strong augmentation used is RandAugment [13].
It is important to note that, although RandAugment is commonly used in semi-supervised settings, it is not specifically designed for contrastive learning. Nevertheless, we decided to keep the same augmentation parameters as those used in FixMatch. For a fair comparison, we adopted the exact hyperparameters of the original work, including all optimizer settings such as the learning rate, schedule, weight decay, batch size $B$, and ratio $\mu$. The extra hyperparameters introduced by our framework are kept the same across all experiments. We take $T = 0.01$, which is a common temperature value for the SupCon loss, and we set $T' = 0.04$.
Tuning this last parameter is in fact equivalent to tuning the pseudo-labeling threshold $\tau$, which is kept at $\tau = 0.95$ to be consistent with FixMatch. Indeed, increasing $T'$ causes the posterior distribution $p$ to approach the uniform distribution, which has the same effect on pseudo-labeling as increasing $\tau$. We chose to give the same importance to all embeddings by setting $\lambda^{x} = \lambda^{u\uparrow} = \lambda^{c} = 1$, except for unconfident ones, for which we set $\lambda^{u\downarrow} = 0.2$.
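The interplay between $T'$ and $\tau$ can be illustrated with a tiny example: for fixed cosine similarities to the prototypes, raising $T'$ flattens the softmax and lowers the maximum probability, so fewer examples pass a fixed threshold. The cosine values and the function name below are hypothetical and chosen only for illustration.

```python
import math

def max_prototype_prob(cosines, T_prime):
    """Largest softmax probability over prototypes for given cosine similarities."""
    logits = [c / T_prime for c in cosines]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    return max(exps) / sum(exps)

cosines = [0.9, 0.6, 0.1]  # similarity of one weak embedding to K = 3 prototypes
tau = 0.95
for T_prime in (0.04, 0.2, 1.0):
    confidence = max_prototype_prob(cosines, T_prime)
    # The same example is pseudo-labeled or not depending on T' alone.
    print(T_prime, round(confidence, 4), confidence >= tau)
```

With these values the example passes the threshold only at the smallest temperature, which matches the observation that a larger $T'$ acts like a stricter $\tau$.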

Experimental Results
Performance of $\mathcal{L}_{SSC}$. We begin our evaluation by comparing FixMatch, with and without our proposed semi-supervised contrastive loss, against other leading self-training approaches. We conduct this comparison on the CIFAR-100 and SVHN datasets, with both 4 and 25 labeled training samples per class. We report the results of state-of-the-art approaches as previously published in the literature and, in order to isolate the effect of the proposed approach, we ran FixMatch with and without the SSC loss on our own servers. Based on the results presented in Table 2, UDA [34] demonstrates performance comparable to FixMatch. However, when FixMatch is combined with the proposed approach, denoted as $\mathcal{L}_{SSC}$, the method becomes notably more competitive, particularly when the model is trained with only 4 labeled examples per class. These results underscore the effectiveness of our approach in leveraging all unlabeled data, particularly in scenarios where labeled data is scarce.
Transfer Performance. Classical semi-supervised learning benchmarks typically require training models from scratch, a process that consumes considerable time. Due to these constraints, certain studies advocate for leveraging pre-trained models in semi-supervised approaches [15,32].
Along these lines, we explore the efficacy of integrating self-supervised pre-training with MoCo v2 [11] into our methodology. Specifically, in this section, we use a ResNet-50 architecture either trained from scratch or initialized with MoCo v2 weights obtained after pre-training on ImageNet for 800 epochs. Figure 2 plots the Top-1 accuracy (in percent) with respect to the number of epochs. We first observe that, in addition to reaching higher accuracy, training with the $\mathcal{L}_{SSC}$ loss requires substantially fewer epochs to converge. With only 50 epochs, training with $\mathcal{L}_{SSC}$ from scratch already outperforms the standard approach with 500 epochs. Only 25 epochs are needed when using pre-trained weights, achieving a significantly higher validation accuracy of 69.3%. Using all unlabeled data, including instances on which the model has lower confidence, facilitates efficient training.
As noted, we observe a significant gain from self-supervised pre-training with $\mathcal{L}_{SSC}$, which is not the case when using the classical CE loss. This suggests that our proposed loss facilitates a smoother transition from pre-training methods, particularly those of a contrastive nature like MoCo.
Ablation study. In order to investigate the effect of the different components of $\mathcal{L}_{SSC}$, we perform an extensive ablation study, reported in Table 3, on CIFAR-100 and STL-10, training the models for 256 epochs. We observe that using two strong augmentations slightly improves FixMatch. However, adding the self-supervised SimCLR loss tends to degrade performance, as already observed in [23]. Similarly, ignoring the unconfident embeddings $Z^{u\downarrow}$, or handling them with a separate SimCLR loss, also degrades the performance of $\mathcal{L}_{SSC}$. The full $\mathcal{L}_{SSC}$ consistently achieves the highest accuracy. These results justify our decision to incorporate unconfident embeddings directly into our loss, thus enabling global interaction with all other embeddings and prototypes.
Hyperparameter stability analysis. We examine the sensitivity of our framework to classical self-training hyperparameters, such as the pseudo-labeling confidence threshold $\tau$, the ratio $\mu$ between unlabeled and labeled examples in the batch, and the strength of the strong augmentation $A(\cdot)$. Our contrastive approach, depicted in green, shows significantly lower variance with respect to $\tau$ and $\mu$ than the alternative method. However, both approaches appear equally sensitive to the augmentation strength. This outcome was anticipated, since our contrastive framework relies on $A(\cdot)$ both for consistency regularization and for unsupervised contrastive learning through the unconfident embeddings $Z^{u\downarrow}$.

Conclusion
In this paper, we introduce a new semi-supervised contrastive framework that combines SupCon with an unsupervised contrastive loss, effectively operating within a self-training setting. The proposed framework allows taking advantage of labeled, pseudo-labeled, and unconfident examples simultaneously in the training process.
Moreover, we propose the incorporation of class prototypes into contrastive learning to derive class probabilities, enhancing the interpretability and performance of the model. By applying our approach to the FixMatch framework, we observe substantial performance gains across three datasets. Our method exhibits rapid convergence, benefits from pre-training, and shows stability across various hyperparameters, underscoring its effectiveness and reliability in semi-supervised learning scenarios.
Future research avenues may explore further enhancements to the contrastive learning framework, such as incorporating domain-specific knowledge or adapting the framework to handle noisy or incomplete data. Additionally, investigating the interplay between contrastive learning and other semi-supervised learning techniques could lead to synergistic approaches with even greater performance gains.

Fig. 1. SSC framework. $Z^x$, $Z^u$, and $Z^c$ are the supervised, unsupervised, and prototype embeddings, and $y^x$, $y^u$, and $y^c$ are their corresponding labels, which define the positive pairs in the loss. The three sets of embeddings and their corresponding labels are concatenated and then fed into the loss function. The weakly augmented embeddings $Z^w$ are used only during the pseudo-labeling phase to compute $y^u$ and do not propagate gradients back. The strongly augmented embeddings $Z^u$ use two augmentations to ensure the existence of at least one positive pair for unconfident examples.

Table 1. Comparison of loss types used in various online self-training algorithms. The table indicates which type of loss, either Cross-Entropy (CE), Contrastive Learning (CL), or none (∅), is applied to different parts of the input: $Z^x$ for embeddings of supervised examples, $Z^{u\uparrow}$ for high-confidence pseudo-labeled examples, and $Z^{u\downarrow}$ for unconfident examples (confidence below the threshold $\tau$).

Table 2. Top-1 validation accuracy (%) of various self-training methods compared to FixMatch, without and with integration into our proposed wrapper approach (denoted as FixMatch w. $\mathcal{L}_{SSC}$), obtained after convergence.