An Efficient Self-Supervised Cross-View Training For Sentence Embedding

Abstract
Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. One straightforward approach is to finetune a pretrained language model (PLM) with a representation learning method such as contrastive learning. While this approach achieves impressive performance on larger PLMs, the performance rapidly degrades as the number of parameters decreases. In this paper, we propose a framework called Self-supervised Cross-View Training (SCT) to narrow the performance gap between large and small PLMs. To evaluate the effectiveness of SCT, we compare it to 5 baseline and state-of-the-art competitors on seven Semantic Textual Similarity (STS) benchmarks using 5 PLMs with the number of parameters ranging from 4M to 340M. The experimental results show that SCT outperforms the competitors for PLMs with less than 100M parameters in 18 of 21 cases.


Introduction
Self-supervised sentence representation learning is the task of constructing an embedding space for sentences without relying on human annotation efforts. Recent advancements in self-supervised sentence representation present promising results on various downstream tasks such as Semantic Textual Similarity (STS) and text classification. For example, Gao et al. (2021) found that self-supervised sentence embedding methods could be on par with supervised methods (Reimers and Gurevych, 2019) on various STS benchmarks.
A straightforward approach to self-supervised sentence representation is to finetune a pre-trained language model (PLM), e.g., BERT (Devlin et al., 2019) or RoBERTa (Liu et al., 2019), with a representation learning technique. One popular method is contrastive learning. This learning method enables self-supervised representation learning by creating a self-referencing mechanism through data augmentation (Gao et al., 2021; Zhang et al., 2022b; Zhou et al., 2022; Klein and Nabi, 2022; Yan et al., 2021; Liu et al., 2021; Kim et al., 2021; Cao et al., 2022). These works have demonstrated improvements over existing self-supervised techniques in sentence embedding benchmark datasets (i.e., STS and text classification).
Figure 2 shows how three existing methods, SimCSE (Gao et al., 2021), DiffCSE (Chuang et al., 2022), and DCLR (Zhou et al., 2022), perform on the BERT architecture as we vary the number of parameters from 4M to 340M. While these self-supervised techniques achieve impressive performance on larger PLMs (i.e., those with more than 100M parameters), the performance rapidly degrades as the number of parameters decreases (Wu et al., 2021; Zhang et al., 2022b; Limkonchotiwat et al., 2022). The figure also shows how the data points organize themselves into two distinct groups: LL and HH.
• High Cost, High Performance (HH). As shown in Figure 2, all models in this group, i.e., BERT-Base and BERT-Large, score more than 75, with inference times over 420.9 seconds regardless of the learning method.
• Low Cost, Low Performance (LL). This group contains all methods on models with less than 100M parameters, i.e., BERT-Tiny, BERT-Mini, and BERT-Small. All models in this group score less than 70, with inference times less than 84.7 seconds regardless of the learning method.
Despite the apparent benefit of low computation costs, smaller models, i.e., BERT-Tiny, BERT-Mini, and BERT-Small, are often neglected. Greater emphasis should be placed on exploring the potential to enhance the performance of smaller models through novel learning methods specifically tailored to their unique characteristics.
Figure caption: We calculate similarity score distributions between two networks ($f_\theta$ and $f_{ref}$) from the cross-view pipeline and minimize the discrepancy between them through similarity-score-distribution learning. In addition, the two networks need neither identical architectures nor shared weights. They can be large and small networks (distillation) or Siamese networks.
In this paper, we propose a framework called Self-Supervised Cross-View Training (SCT) to narrow the performance gaps between large and small PLMs. Figure 1 displays the difference between the traditional contrastive learning approach and ours. As shown in Figure 1a, the two views are separated for contrastive learning, and the outputs from $h(\cdot)$ are directly compared to each other. Figure 1b highlights the key distinctions of SCT based on two concepts: cross-view comparison and similarity-score-distribution learning.
• Cross-view comparison: The ability to self-reference is crucial to self-supervised learning. We derive a novel mechanism for two different augmented views to reference each other.
• Similarity-score-distribution learning: The way we quantify loss is critical to any learning process. Our method calculates the loss by measuring the discrepancy between two similarity score distributions obtained from cross-comparing two different views.
The combination of these two concepts provides additional guidance which improves the effectiveness of self-supervised sentence representation learning on small PLMs.
To evaluate the effectiveness of SCT, we compare it to state-of-the-art (SOTA) competitors on STS, re-ranking, and natural language inference (NLI) benchmarks. We also employ a distillation setting using BERT-Large-SimCSE as a teacher model. The experimental results on STS demonstrated that our framework could address the drastic performance degradation problems in small PLMs by outperforming competitors in every case when the number of parameters is less than 100M. For the smallest model (4M parameters), we improved the performance from 64.47 to 69.73 points compared to SimCSE. In the case of large PLMs (i.e., those with more than 100M parameters), our model's performance was on par with the current SOTA model when tested on BERT-Base and BERT-Large. For the distillation setting, we outperformed all distillation competitors on all PLMs. For the re-ranking and NLI tasks, we improved the downstream tasks' performance for nearly all settings.
The contributions of our work are as follows:
• We formulate a cross-view comparison pipeline to provide a more robust self-referencing mechanism for self-supervised sentence representation learning on smaller PLMs (those with less than 100M parameters).
• Based on the cross-view comparison, we propose a method to measure the discrepancies between the cross-view outputs by comparing their respective similarity score distributions rather than the direct outputs.
• We evaluate the effectiveness of SCT against five competitors on three families of PLMs using STS and downstream benchmark datasets. In addition, we also provide an in-depth analysis of different components in the cross-view pipeline to assess their effectiveness individually.

Related Work
Self-supervised learning is becoming more popular as a method to learn sentence representation from pre-trained language models (PLMs) without annotated information from training corpora. We cover well-known self-supervised sentence representation learning techniques in the following subsections.

Contrastive Learning
Contrastive learning constructs an embedding space by treating augmentations of an anchor as positives and other samples as negatives. The anisotropic problem is addressed by pulling positive samples toward an anchor sample and pushing negative samples away from it. Gao et al. (2021) showed that the way we obtain positive and negative samples is critical to the performance of the representation. Kim et al. (2021) and Cao et al. (2022) utilized a different PLM to generate positive and negative samples for each anchor. Fang et al. (2020) derived a method using two back-translations to create two different augmented views. Another popular approach is to generate positive and negative pairs using feature dimension dropouts (Gao et al., 2021; Yan et al., 2021; Liu et al., 2021; Klein and Nabi, 2022). These works outperformed the traditional self-supervised sentence embedding methods in their experiments.
A more advanced technique uses an additional function to help distinguish positive from negative samples. For example, Zhou et al. (2022) proposed an additional debias function by mapping negative samples to the Gaussian distribution while individually assigning a weight to each contrastive negative sample. Zhang et al. (2022b) proposed a virtual augmentation scheme by approximating the nearest neighborhood from the neighboring samples to create the virtual negative samples. Chuang et al. (2022) introduced a discriminator network to contrastive learning by classifying whether each word in a sentence is edited. Although these works demonstrated good performance, contrastive learning requires judicious consideration of negative sampling to prevent false negatives.
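To make the positive/negative mechanics concrete, the following is a minimal sketch of an InfoNCE-style contrastive loss for a single anchor. It is an illustration under stated assumptions, not the exact objective of any cited work: the function name `info_nce`, the use of cosine similarity, and the temperature value are all chosen here for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, tau=0.05):
    """Negative log-probability of the positive among all candidates.

    tau is an illustrative temperature (assumption, not a tuned value).
    """
    logits = [cosine(anchor, positive) / tau]
    logits += [cosine(anchor, n) / tau for n in negatives]
    peak = max(logits)  # log-sum-exp trick for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]


# The loss is small when the positive is the closest candidate to the anchor.
print(info_nce([1.0, 0.0], [0.9, 0.1], [[0.0, 1.0], [-1.0, 0.0]]))
```

A near duplicate of the anchor that ends up in `negatives` is pushed away just like any other negative, which is the false-negative issue noted above.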

Learning Without Negative Samples
A popular method to avoid false negatives is to design a learning process that uses only positive samples. BSL (Zhang et al., 2021) adapted BYOL's learning algorithm (Grill et al., 2020), which maximizes the similarity between two augmented views of each sentence. In particular, BSL created two augmented views from a PLM. The method uses a weighted exponential moving average of embeddings as a self-referencing mechanism. Klein and Nabi (2022) adapted a redundancy representation learning algorithm from Zbontar et al. (2021) and added a cosine similarity term to maximize the similarity between two samples produced by models with high and low feature-dropout rates. While these methods allow us to perform self-supervised learning without negative samples, they are still outperformed by contrastive learning.

Sentence Representation Distillation
Distillation is a widely used technique for creating a small PLM (student) from an existing large PLM (teacher) (Turc et al., 2019; Wang et al., 2020). Several sentence representation works proposed self-supervised distillation frameworks. For instance, Wu et al. (2021) proposed a self-supervised contrastive distillation method. They formulated an anchor and the other components (positives and negatives) of contrastive learning using a small and a large PLM, respectively. Limkonchotiwat et al. (2022) proposed a distillation framework based on the instance queue concept. A large PLM formulated representations for an instance queue, while the small PLM mimicked the relation between its representations and those in the instance queue.
These methods have been shown to reduce the performance gap between small and large PLMs effectively. However, none of the sentence representation works shows how to decrease the gap without utilizing a large PLM. This research question is an important problem that needs to be addressed, especially since utilizing a large PLM may not always be feasible in practice. Therefore, it is crucial to propose techniques that can decrease the gap with or without utilizing knowledge from a large PLM.

Learning From Distribution
A recent approach from computer vision to mitigating the false negative problem is replacing binary labels with a distribution of similarity scores. The main idea is to compare samples a and p using similarity scores computed from the same collection of instances D as soft labels. In particular, the discrepancy between a and p is expressed as the similarity score discrepancy. Fang et al. (2021) proposed a knowledge distillation method by training a student network to imitate the similarity score distribution formulated by a teacher network. Tejankar et al. (2021) introduced a distribution learning paradigm using a similarity score distribution inferred by a momentum encoder over a set of instances. Zheng et al. (2021) proposed a representation learning technique by modeling the relationship distribution between weak and contrastive augmentation schemes. These works' experimental results demonstrated higher performance than contrastive learning and avoided false negatives.

Summary
As discussed in Section 2.1, the main drawback of contrastive learning is the binary distinction between positive and negative samples. Near duplicates can be mistakenly used as negative samples. While there exist learning methods that use only positive samples, they are still outperformed by contrastive learning.
Based on various experimental studies, the paradigm of learning from distribution shows promising results compared to the other two approaches. However, we have found that the direct application of distribution learning to our problem does not yield consistent performance improvement (see Table 4 in Section 5.3.1). We developed a self-referencing mechanism through data augmentation, which is needed to improve the distribution learning strategy for self-supervised sentence representation learning. Moreover, our framework also allows sentence representation learning in a distillation manner. In particular, we employ a larger model as the teacher model to let a smaller model mimic the teacher's properties.

Proposed Method
One of the challenges of using small models is the limited number of parameters. Empirical studies have shown that larger models have enough parameters to solve complex problems with simple techniques, while smaller ones require more guidance to solve complex problems (Brutzkus and Globerson, 2019; Wang et al., 2020). Based on this observation, we design our proposed solution, Self-Supervised Cross-View Training (SCT), to enhance the learning guidance for smaller models (those with less than 100M parameters) by improving the self-referencing and discrepancy measurement mechanisms.
Figure 3 illustrates the SCT pipeline and highlights the two mechanisms we introduce to improve the learning guidance: the cross-view comparison pipeline and similarity-score-distribution learning. In what follows, we describe how the cross-view pipeline improves the robustness of the self-referencing mechanism in Section 3.1. Section 3.2 presents the mechanism we use to measure the discrepancies between cross-view outputs. We explain our proposed SCT loss function in Section 3.3. Finally, we introduce sentence representation distillation into our proposed framework in Section 3.4.

Cross-View Comparison Pipeline
As stated in the introduction, we devise a cross-view comparison pipeline to improve the robustness of the self-referencing mechanism. Figure 3 illustrates how the two augmented views are fed to both the online (updatable) $f_\theta(\cdot)$ and reference (unupdatable) $f_{ref}(\cdot)$ networks and how their outputs are compared in a cross-view pattern. In this way, we use both views as references and do not compare outputs originating from the same view to each other.
Given a new sample $x$, two augmentations $T$ and $T'$ are created from two different back-translations to produce two views $x_1 = T(x)$ and $x_2 = T'(x)$. Our framework allows various data augmentation schemes, e.g., masked language model (MLM) or synonym replacement. We found that back-translation improves the performance of downstream tasks the most, so we use it to create the cross-view inputs (see Section 5.3.3 for the design analysis).
Online representations ($z^{\theta}$). The views $x_1$ and $x_2$ are first encoded by an encoder $f_\theta(\cdot)$ into a sentence representation, which is then mapped by [...]
These instance queues enable the dynamic construction of a large and consistent set of negative samples, facilitating distribution learning (Fang et al., 2021; Tejankar et al., 2021; Zheng et al., 2021). At the beginning of each minibatch, we enqueue and dequeue the instance queues in a "first-in-first-out" manner.
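The first-in-first-out behavior of an instance queue can be sketched as follows. This is only an illustration: the class name `InstanceQueue` and its methods are hypothetical, vectors are plain Python lists, and the random initialization mirrors the statement in the implementation details that queues start from randomly generated vectors. Which representations are pushed into the queue is not specified by this sketch.

```python
from collections import deque
import random


class InstanceQueue:
    """Fixed-size FIFO store of sentence vectors (hypothetical helper)."""

    def __init__(self, size, dim, seed=0):
        rng = random.Random(seed)
        # Start from random vectors; deque(maxlen=...) silently drops the
        # oldest entries once more than `size` items have been added.
        self.items = deque(
            ([rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)),
            maxlen=size,
        )

    def enqueue(self, batch):
        """Append a mini-batch of vectors; the oldest ones fall out."""
        self.items.extend(batch)

    def as_list(self):
        """Return the current contents, oldest first."""
        return list(self.items)


queue = InstanceQueue(size=4, dim=2)
queue.enqueue([[1.0, 0.0], [0.0, 1.0]])
print(len(queue.as_list()), queue.as_list()[-1])
```

Because the queue length stays constant, every mini-batch is compared against the same number of stored instances, while their contents are gradually refreshed.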

Similarity-Score Distribution
The next step is to calculate similarity score distributions for cross-view comparison. As shown in Figure 3, we enforce the online representations $z^{\theta}_1$ and $z^{\theta}_2$ to maintain the consistency of the reference representations $z^{ref}_2$ and $z^{ref}_1$ through instance queues $D_2$ and $D_1$, respectively. When the online network can match the reference representation in a large number of negative samples, the online network gains robustness to unseen inputs, which is necessary for sentence embedding.
We formulate the cross-view and reference distributions as follows:
• We formulate a cross-view distribution that compares the two augmented views:
$$c^{\theta}_1 = SR(z^{\theta}_1, D_2, \tau_{\theta}), \qquad c^{\theta}_2 = SR(z^{\theta}_2, D_1, \tau_{\theta}).$$
• We calculate the self-references with respect to the previous online distributions as follows:
$$c^{ref}_1 = SR(z^{ref}_1, D_1, \tau_{ref}), \qquad c^{ref}_2 = SR(z^{ref}_2, D_2, \tau_{ref}).$$
We define the similarity score distribution function $SR(\cdot)$ as a dot product function between a sentence representation and an instance queue:
$$SR(z, D, \tau) = [p_1, \ldots, p_{|D|}], \quad \text{where } p_j = \frac{e^{sim(z, d_j)/\tau}}{\sum_{d \sim D} e^{sim(z, d)/\tau}}, \tag{1}$$
and $\tau$ is the temperature scaling hyper-parameter, set separately for the online and reference representations, and $sim(\cdot)$ is the dot product similarity function.
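The similarity score distribution described above is a temperature-scaled softmax over dot products between one sentence representation and every entry of an instance queue. The sketch below illustrates this in plain Python; the function name `similarity_distribution` and the toy vectors are assumptions made for the example.

```python
import math


def similarity_distribution(z, queue, tau):
    """Softmax over dot products between z and every queue entry."""
    scores = [sum(a * b for a, b in zip(z, d)) / tau for d in queue]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


queue = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
probs = similarity_distribution([1.0, 0.0], queue, tau=0.05)
print([round(p, 4) for p in probs])
```

A smaller temperature concentrates the probability mass on the most similar queue entries, which is one reason the temperatures for the online and reference sides are tuned separately.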

Self-Supervised Cross-View Training Loss
This step computes the self-supervised cross-view training loss $L_{SCT}$ using the cross-view and reference distributions. In particular, the loss minimizes the discrepancy between the $c^{\theta}_1$ and $c^{ref}_2$ distributions and, likewise, between the $c^{\theta}_2$ and $c^{ref}_1$ distributions. Each of the two terms in $L_{SCT}$ is computed with $L_{KL}$, the KL-divergence loss function that minimizes the discrepancy between online and reference cross-view distributions, with the stop-gradient operator $SG(\cdot)$ applied on the reference side. Using the stop-gradient $SG(\cdot)$ on the reference encoder is essential in avoiding the anisotropic problem (every input generates the same output). As demonstrated in previous sentence embedding works (Li et al., 2020; Yan et al., 2021), when directly adapting BERT to STS tasks, the model tends to produce high similarity scores for all sentences, as it maps all sentences into a small region of the embedding space, also known as a "collapse". Many works have offered explanations for why the stop-gradient can help prevent the collapse issue in self-supervised training. With the mechanisms behind $L_{SCT}$, namely the cross-view comparison pipeline and similarity-score-distribution learning, we circumvent the anisotropic problem that occurs in regular contrastive learning methods.
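As a rough sketch of how the two cross-view terms fit together, the code below sums two KL divergences: the online distribution of view 1 against the reference distribution of view 2, and vice versa. The unweighted sum and the direction KL(reference || online) are assumptions made for illustration; the function names are hypothetical.

```python
import math


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete distributions of equal length."""
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, pred))


def sct_loss(c_online_1, c_online_2, c_ref_1, c_ref_2):
    """Cross-view loss: view-1 online vs. view-2 reference, and vice versa.

    The reference distributions play the role of fixed targets, which is
    what the stop-gradient achieves in an autodiff framework. The
    equal-weight sum of the two terms is an assumption.
    """
    return kl_divergence(c_ref_2, c_online_1) + kl_divergence(c_ref_1, c_online_2)


online_1, online_2 = [0.6, 0.3, 0.1], [0.5, 0.4, 0.1]
ref_1, ref_2 = [0.5, 0.4, 0.1], [0.7, 0.2, 0.1]
print(sct_loss(online_1, online_2, ref_1, ref_2))
```

Note that outputs originating from the same view are never compared to each other: each online distribution is only matched against the reference distribution of the other view.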

Representation Distillation
As discussed in Section 2.3, distillation is a common technique for improving the performance of small PLMs by minimizing the discrepancy between the teacher model (large PLM) and the student model (small PLM). In this work, we incorporate the distillation approach into our novel cross-view framework by replacing the reference network $f_{ref}(\cdot)$ with a larger PLM $f_{large}(\cdot)$ to enable the framework to perform the distillation. We then design the distillation training objective by combining a self-supervised loss ($L_{SCT}$) and a cross-view distillation loss ($L_{CD}$). Here, $L_{SCT}$ is a self-supervised consistency training loss based on the self-referencing mechanism, and $L_{CD}$ is a minimization objective between the large $f_{large}(\cdot)$ and small $f_\theta(\cdot)$ PLMs using the cross-view training pipeline. This loss aims to ensure that the small PLM can generate sentence representations similar to those of the large PLM. To compute $L_{CD}$, we formulate $c^{\theta}$ from the small PLM (Section 3.2). We define the teacher-side distributions $c^{large}$ with $SR(\cdot)$, using the instance queues (e.g., $D_2$) and the temperature $\tau_{ref}$. We produce $z^{large}$ from $f_{large}(\cdot)$, where the input of $f_{large}(\cdot)$ is the same $x_1$ and $x_2$ from Section 3.1. In addition, we apply the stop-gradient technique to prevent the large network from mimicking the online network.
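One plausible way to read the combined objective is sketched below: the same cross-view matching routine is applied twice, once against reference distributions and once against teacher distributions, and the two results are added. This is a sketch under assumptions, since the exact form of the objective is not reproduced here; the names `cross_view_kl`, `distillation_objective`, and the `weight_cd` parameter are hypothetical.

```python
import math


def kl_divergence(target, pred, eps=1e-12):
    """KL(target || pred) for two discrete distributions of equal length."""
    return sum(t * math.log((t + eps) / (p + eps)) for t, p in zip(target, pred))


def cross_view_kl(online_1, online_2, target_1, target_2):
    """Match the view-1 online output to the view-2 target and vice versa."""
    return kl_divergence(target_2, online_1) + kl_divergence(target_1, online_2)


def distillation_objective(c_online, c_ref, c_large, weight_cd=1.0):
    """Each argument is a (view_1, view_2) pair of distributions.

    weight_cd is a hypothetical knob; a plain sum (weight 1.0) is assumed.
    """
    loss_sct = cross_view_kl(c_online[0], c_online[1], c_ref[0], c_ref[1])
    loss_cd = cross_view_kl(c_online[0], c_online[1], c_large[0], c_large[1])
    return loss_sct + weight_cd * loss_cd


student = ([0.6, 0.3, 0.1], [0.5, 0.4, 0.1])
reference = ([0.5, 0.4, 0.1], [0.6, 0.3, 0.1])
teacher = ([0.8, 0.1, 0.1], [0.7, 0.2, 0.1])
print(distillation_objective(student, reference, teacher))
```

The teacher distributions act as fixed targets, so only the student side would receive gradients in a real training loop.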

Experimental Settings

Implementation Details
Architecture. Our experiments cover five BERT PLMs (Turc et al., 2019; Devlin et al., 2019), with the number of parameters ranging from 4M to 340M. To obtain sentence representation vectors, we follow the practice of average word pooling presented by Reimers and Gurevych (2019). The projection head $h(\cdot)$ contains three MLP layers. Each MLP layer has one feed-forward layer with a ReLU activation function, which is then fed into a linear feed-forward layer. The sizes of the first and second feed-forward layers are $uw$ and $u$, respectively, where $u$ is the output vector dimension and $w$ is the first-second layer expansion factor. The default value of $w$ is set to 10.
Training setup. For the training data, we use unlabeled texts from two NLI datasets, namely SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), following prior works (Li et al., 2020; Zhang et al., 2020, 2021). For augmentation schemes, we use English-German-English ($T$) and English-French-English ($T'$) back-translations from Zhang et al. (2021). We use AdamW (Loshchilov and Hutter, 2019) as the optimizer, a linear learning rate warmup over 10% of the training data, and a batch size of 128 for ten epochs. We tune the learning rate, the instance queue's size $k$, and the temperature scaling $\tau_{\theta}$ and $\tau_{ref}$ on the STS-B development set. The best values of these parameters are shown in Table 1. Note that we evaluate on the STS-B development set every 64 training steps, and the best checkpoint is used for the final model. We also initialize the queues by randomly generating vectors.
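The average word pooling mentioned in the architecture description can be sketched as below: token vectors are averaged, and padded positions are excluded through an attention mask. The function name `mean_pool` and the list-based representation are assumptions for illustration; a real implementation would operate on batched tensors.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens; positions with mask 0 are skipped."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dim = len(kept[0])
    return [sum(vec[i] for vec in kept) / len(kept) for i in range(dim)]


tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row is padding
print(mean_pool(tokens, [1, 1, 0]))  # → [2.0, 3.0]
```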

Competitive Methods
We compare our work with a comprehensive range of self-supervised sentence representation methods representing well-known approaches discussed in Section 2.
• SimCSE (Gao et al., 2021). A contrastive learning technique using different random dropout masks in the transformer architecture as the data augmentation.
• DCLR (Zhou et al., 2022). A contrastive learning method that weights negative samples according to the difficulty given by another model.
• DiffCSE (Chuang et al., 2022). A contrastive learning technique that uses additional learning signals from a discriminator to make the model more sensitive to small changes. For the generator model used in this baseline, we employ DistilBERT (Sanh et al., 2019) as described in the original paper.
• CKD (Wu et al., 2021). A self-supervised contrastive distillation method using a memory bank as large-negative samples.
• ConGen (Limkonchotiwat et al., 2022). A self-supervised distillation method using an instance queue for distilling sentence embedding from large to small PLMs.

Evaluation Setup
We utilize Gao et al. (2021)'s evaluation settings by evaluating the effectiveness of our work on the following STS benchmark datasets: STS-B (Cer et al., 2017), SICK-R (Marelli et al., 2014), and STS 2012-2016 (Agirre et al., 2012, 2013, 2014, 2015, 2016). These datasets contain pair-wise sentences, where the similarity of each pair is labeled with a number between 0 and 5, indicating the degree to which the two sentences express the same meaning.
We also evaluate our model on downstream tasks, such as re-ranking (AskUbuntu (Lei et al., 2016) and SciDocs (Cohan et al., 2020)) and NLI (SICK-E (Marelli et al., 2014) and SNLI (Bowman et al., 2015) datasets). For re-ranking, we use the experiment and evaluation settings from the unsupervised sentence embedding benchmark (Wang et al., 2021). For NLI, we use all the datasets from SentEval (Conneau and Kiela, 2018) and use the experiment setting from previous sentence embedding works (Conneau and Kiela, 2018; Limkonchotiwat et al., 2022). In addition, we report the average scores across three random seeds for each experiment, where the SD value is approximately 0.30 points for the STS benchmark, 1.02 points for NLI, and 0.78 points for NLI.
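STS evaluation of this kind compares model similarity scores with the human labels through Spearman's rank correlation. The sketch below shows the computation in plain Python, assuming cosine similarity between the two sentence embeddings of each pair; the function names and the toy numbers are illustrative only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation computed on the ranks of xs and ys."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


# Toy data: three sentence pairs with gold labels on the 0-5 scale.
pairs = [([1.0, 0.0], [0.9, 0.1]), ([1.0, 0.0], [0.5, 0.5]), ([1.0, 0.0], [0.0, 1.0])]
gold = [4.8, 2.5, 0.2]
predicted = [cosine(u, v) for u, v in pairs]
print(spearman(predicted, gold))
```

Because only the ranking matters, the predicted similarities do not need to be rescaled to the 0-5 label range.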

Experimental Results
This section presents results from five sets of studies. Section 5.1 presents results from the main experiments using 7 STS benchmark datasets described in the previous subsection. In Section 5.2, we demonstrate the effectiveness of our method on various downstream benchmark datasets. In Section 5.3, we study the design decisions of the key components, namely (i) the model architecture and loss function; (ii) instance queues; and (iii) data augmentation strategy. Section 5.4 demonstrates the design decision of the distillation loss.

Main Results: STS Benchmark Datasets
Table 2 illustrates the effectiveness of our method (SCT) in comparison to the five competitors: SimCSE, DCLR, DiffCSE, CKD, and ConGen. We separate the results into two groups: fine-tuning (without a large PLM in the framework) and distillation (using a large PLM in the framework).
Fine-tuning results. For the average scores, the experimental results show that our method SCT outperforms all competitors for all PLMs with less than 100M parameters. Let us first look at the results from BERT-Tiny, the smallest one from the BERT family. SCT outperforms SimCSE and DCLR by 5.26 and 4.3 points, respectively, in terms of Spearman's rank correlation. As expected, SCT is outperformed by competitors for models with more than 100M parameters, i.e., BERT-Base and BERT-Large. For BERT-Base, SCT scores lower than the best performer, DiffCSE, by 2.94 points. For BERT-Large, SCT scores lower than DCLR, which is the best performer, by 0.74 points. These findings underscore the importance of incorporating SCT into PLM training, especially in scenarios where computational resources are limited.
Distillation results. The results presented in Table 2 demonstrate that SCT outperforms competing methods across all PLMs. Notably, SCT shows superior performance compared to ConGen, with improvements from 75.89 to 76.43 and 78.72 to 79.58 on BERT-Tiny and BERT-Base, respectively. Furthermore, the SCT method outperforms the teacher model (BERT-Large-SimCSE) when the number of [...]
For models with less than 100M parameters, SCT performs the best in 18 out of 21 trials, i.e., 85.7%. In contrast, for models with more than 100M parameters, SCT is the top performer in only 2 out of 14 cases, i.e., 14.3%. In the distillation setting, SCT outperforms its competitors in 25 out of 28 experiments, i.e., 89.3%, for all models. Moreover, when the number of parameters surpasses 29M, SCT is the best performer in all 14 cases. In addition, the performance of SCT-Distillation-BERT-Small (#param: 29M) is similar to the SOTA on [...], i.e., 78.49 (DiffCSE) vs. 78.16 (SCT). These results conform with the proposed benefit of SCT that we aim to improve the performance of smaller models.

Downstream tasks
In this study, we demonstrate the effectiveness of our method compared to DCLR and DiffCSE (the top performers in Table 2) on re-ranking (AskUbuntu and SciDocs) and natural language inference (SICK-E and SNLI). We report the Mean Average Precision (MAP) for re-ranking and accuracy score for NLI. In addition, we also separate the results into two groups just like in the previous section.
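For readers unfamiliar with the re-ranking metric, the sketch below computes Mean Average Precision from binary relevance flags listed in ranked order. It assumes that every relevant candidate appears somewhere in the ranked list, which holds when a fixed candidate set is re-ranked; the function names are illustrative and do not come from any benchmark toolkit.

```python
def average_precision(ranked_relevance):
    """ranked_relevance: 0/1 flags for the candidates, best-ranked first."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank  # precision at this relevant position
    return total / hits if hits else 0.0


def mean_average_precision(queries):
    """Mean of the per-query average precision values."""
    return sum(average_precision(q) for q in queries) / len(queries)


# Two toy queries: relevant items at ranks 1 and 3, and at rank 2.
print(mean_average_precision([[1, 0, 1], [0, 1]]))
```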
Fine-tuning results. Table 3 demonstrates that while SCT's performance on STS is lower than that of its competitors when the parameter count is more than 100M, it outperforms all competitors in re-ranking and NLI for 26 out of 28 cases (92.9%). For example, on BERT-Large, SCT surpasses DiffCSE and DCLR by 2.52 and 2.64 points in the NLI average case, respectively. The gap between our method and competitive methods is wider on NLI datasets compared to STS benchmark datasets. For re-ranking, we found that SCT consistently outperforms competitive methods except for AskUbuntu on BERT-Large. These results demonstrate that SCT improves the robustness of PLMs on downstream tasks with its cross-view and self-referencing mechanisms.
Distillation results. The results indicate that SCT outperforms all competing distillation methods. Furthermore, our distillation method performs better than the fine-tuning method in comparable setups. For instance, when applied to the smallest PLM (BERT-Tiny), our distillation method improved the performance on NLI datasets from 71.89 to 78.53, outperforming the fine-tuning method. Moreover, SCT-Distillation-BERT-Base surpasses SOTA BERT-Large-finetuning for the average case. These findings highlight the efficacy of SCT in improving PLM performance, whether the teacher model is available or not.
Table 3: Re-ranking and NLI results. We report MAP scores for re-ranking and accuracy for NLI.

Design Analysis
In this subsection, we analyze the key components of SCT as follows. Section 5.3.1 provides an ablation study on the model and loss function. Section 5.3.2 presents an analysis of the instance queue. Section 5.3.3 explores how different data augmentation schemes affect the performance of our method. In Section 5.3.4, we provide the summary of results from the design analysis studies.
Table 4: Ablation studies on model & loss, instance queue studies, and data augmentation studies. We evaluate the performance of these studies on the average score across seven STS datasets.

Model and loss Function
Table 4 presents the results from the proposed SCT (fine-tuning) setup compared to the following variants. For brevity, we focus on models with less than 100M parameters.
The results show that the default version of SCT is the best performer. We can see that changing from distribution learning to contrastive learning incurs performance penalties ranging from 3.24 to 10.52 points. Similarly, changing the view comparison setting from cross-view to identical-view also results in performance penalties ranging from 3.69 to 13.06 points. In contrast, the momentum encoder, cross-entropy, and MLP-removal modifications result in smaller impacts. The results suggest that all design components are crucial to our method's performance, and the penalties for removing the distribution learning and cross-view parts are the most drastic ones.

Instance Queue
We study the impact of the following instance queue modifications: (i) combining two instance queues into one, (ii) keeping the negative samples unchanged, (iii) replacing the queues with in-batch negatives. As shown in Table 4, any modification from the default SCT results in a performance drop for all models. We can also see that keeping the negative samples unchanged suffers the worst impact. For instance, using the same negative samples (no queue updates) decreases the performance from 69.73 to 65.88 on BERT-Tiny. These results imply that the coverage of negative samples is crucial to the performance.
Let us now consider the impact of instance queue size on BERT-Tiny and BERT-Small. In this study, we vary the number of negative samples in the queue from 128 to 262,144 samples (the largest that our hardware supports). As shown in Figure 4, the performance improves as the queue size grows from 128 to 16,384 samples for all cases. However, the optimal queue size varies according to the model architecture, i.e., 131,072 for BERT-Tiny and 65,536 for BERT-Small. These results suggest we should tune the queue size separately for each model architecture.
Figure 4: [...], 1024, 16384, 65536, 131072, and 262144. We average Spearman's rank correlation across the seven STS benchmarks and test on small PLMs, i.e., BERT-Tiny and BERT-Small.

Data augmentation choice
This experiment evaluates the effect of different augmentation schemes widely used in sentence representation learning: (i) two back-translations (default), (ii) masked language model (15%), (iii) synonym replacement (one-word replacement), (iv) dropout mask, and (v) using the same back-translation ($T = T'$). We evaluate Spearman's rank correlation on seven STS benchmark datasets.
The experimental results are shown in Table 4 (Data augmentation studies).
As expected, changing back-translation to other augmentation schemes decreases the performance in all cases. For instance, the performance of BERT-Tiny drops from 69.73 to 64.07 when we change from two back-translations to only one back-translation. This is because the two back-translation schemes generate high-quality synonymous text pairs (different syntax but same meaning), which help the sentence representation distinguish positive and negative samples in the embedding space. In contrast, other augmentation techniques produce text pairs that are either incorrect or nearly identical, which are not useful for sentence representation learning.
Data augmentation analysis. To validate our data augmentation strategy, we assess the syntax and semantic scores on our augmented datasets. We use the edit distance metric to evaluate the syntax changes (dissimilarity) in the augmented datasets compared to the original dataset. Additionally, we use cosine similarity to evaluate the semantic consistency between the original and augmented embeddings. Our base encoders in this analysis are BERT-Tiny-SCT and BERT-Tiny-DiffCSE.
The results in Table 5 show that although MLM produces the highest string dissimilarity, it fails to preserve the semantics of the original texts, so both syntax and semantics change significantly. In contrast, the synonym augmentation scheme exhibits higher embedding similarity than MLM because it keeps most of the original text, resulting in minimal changes to both syntax and semantics. Interestingly, back-translation produces favorable results in both string dissimilarity and embedding similarity: the syntax is altered while the core semantic meaning remains largely unchanged. Although the string dissimilarity of back-translation is slightly lower than that of MLM (only one character difference on average), back-translation achieves higher similarities in the base encoders' embeddings. These findings corroborate the results of our data augmentation choices in Table 4: augmentation methods with desirable properties exhibit both high string dissimilarity and high embedding similarity. Back-translation in particular varies the syntax while preserving semantic consistency, which makes it a promising augmentation technique for improving the embedding space.
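The two scores used in this analysis can be sketched as follows. This is a minimal illustration, not the authors' evaluation script: it assumes a character-level Levenshtein edit distance (consistent with the "one character difference" remark, but not confirmed by the text), and `embed` is a hypothetical stand-in for a base encoder such as BERT-Tiny-SCT that maps a sentence to a vector.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def augmentation_scores(originals, augmented, embed):
    """Average string dissimilarity and embedding similarity over sentence pairs.

    `embed` is a hypothetical callable (sentence -> list of floats).
    """
    n = len(originals)
    pairs = list(zip(originals, augmented))
    dissimilarity = sum(edit_distance(o, a) for o, a in pairs) / n
    similarity = sum(cosine_similarity(embed(o), embed(a)) for o, a in pairs) / n
    return dissimilarity, similarity
```

Under this reading, a desirable augmentation scheme scores high on both returned values: the surface string changes a lot while the embeddings stay close.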

Summary of Design Analysis.
Table 4 presents the desired components of the SCT framework. We found that applying a technique from computer vision requires careful consideration of the architecture and data augmentation schemes. The experimental results from the model and loss studies demonstrate that, on small PLMs, using contrastive learning similar to SimCSE (Gao et al., 2021) or using a momentum encoder similar to MoCo (He et al., 2020; Chen et al., 2020) yields poorer performance than our setting. This is because small PLMs require more guidance, as discussed in Section 3. Thus, the similarity-score-distribution learning paradigm employed in our framework shows promising results in enhancing the performance of small PLMs. However, applying the similarity-score-distribution learning paradigm from Fang et al. (2021) without any adjustments hurts the model's performance more than any other setting, i.e., the performance of BERT-Tiny decreases by 7.82 points when we change from cross-view (our work) to identical-view (computer vision).
Regarding the data augmentation studies (Section 5.3.3), we found that using two back-translations produces more effective augmented sentences than MLM or synonym replacement. Taken together, these findings show that the architectures, loss functions, and data augmentation approaches of prior works need to be adapted to achieve SOTA performance on small PLMs. By carefully considering these adjustments, we can further enhance the capabilities and efficiency of small PLMs in various NLP tasks.

Distillation Studies
In this subsection, we study the components of our distillation method as follows. In Section 5.4.1, we provide an ablation study on the model and loss function. Section 5.4.2 presents an analysis of the distillation loss.

Distillation Design
This study illustrates the efficacy of SCT within distillation settings. We conduct an ablation study to show that every component of SCT contributes to the overall performance. In particular, we ablate the self-supervised loss $L_{\text{SCT}}$ in distillation settings using the setup from Table 4.
The findings in Table 6 show that any departure from the default SCT configuration incurs a notable performance penalty. The largest penalties arise from the Distribution→Contrastive and Cross-view→Identical-view changes. These results confirm that all components of SCT contribute to the performance improvement.
Table 6: Ablation studies on model & loss of our distillation method. We evaluate the performance of these studies on the average score across seven STS datasets.

Distillation Loss
This experiment demonstrates the efficacy of combining self-supervised and distillation losses. We investigate the impact of using a distillation loss alone and the benefits of integrating both distillation and self-supervised losses. In particular, we explore the utility of our SCT loss as a bootstrapping mechanism for existing distillation methods. We also consider a common distillation loss that minimizes the discrepancy between $z_{\text{large}}$ and $z_{\theta}$ with mean squared error ($L_{\text{MSE}}$). Table 7 presents the experimental results for two scenarios: (i) using only a distillation loss and (ii) incorporating both self-supervised and distillation losses. We highlight improvements with an up arrow (↑).
Our experimental findings consistently show that including the SCT loss enhances the performance of existing distillation methods across the board. For example, the SCT loss yields performance boosts of 3.28 and 4.50 points for the $L_{\text{CD}}$ and $L_{\text{MSE}}$ methods on BERT-Tiny, respectively. Moreover, using SCT as the bootstrapping loss improves the performance of the ConGen and CKD methods to a level comparable with $L_{\text{CD}} + L_{\text{SCT}}$. These results underscore the advantages of combining distillation and self-supervised losses for small PLMs and show that the SCT loss is a reliable bootstrapping loss for distillation-based approaches.
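To make the second scenario concrete, the sketch below computes a mean squared error distillation term between teacher and student embeddings and adds a self-supervised term to it. It is an illustration under stated assumptions: the teacher and student vectors are assumed to already share one dimensionality (in practice a projection may be needed), the two terms are summed with equal weight as the notation $L_{\text{CD}} + L_{\text{SCT}}$ suggests, and the value of the SCT loss is passed in as a precomputed number rather than derived here. The function names are hypothetical.

```python
def mse_distillation_loss(z_large, z_theta):
    """Mean squared error between teacher (z_large) and student (z_theta) embeddings.

    Both arguments are batches: lists of equal-length float vectors.
    """
    total, count = 0.0, 0
    for teacher_vec, student_vec in zip(z_large, z_theta):
        for t, s in zip(teacher_vec, student_vec):
            total += (t - s) ** 2
            count += 1
    return total / count


def combined_loss(distillation_loss, sct_loss):
    """Scenario (ii): distillation loss plus the self-supervised loss.

    Equal weighting is an assumption of this sketch.
    """
    return distillation_loss + sct_loss


# Toy batch with one sentence and 2-dimensional embeddings.
z_large = [[1.0, 2.0]]   # teacher output, e.g. from a large PLM
z_theta = [[1.0, 4.0]]   # student output from a small PLM
l_mse = mse_distillation_loss(z_large, z_theta)
l_total = combined_loss(l_mse, sct_loss=0.5)  # 0.5 is a placeholder value
```

Scenario (i) corresponds to optimizing `l_mse` alone, while scenario (ii) optimizes `l_total`.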

Conclusion
We propose a self-supervised sentence representation learning method called Self-Supervised Cross-View Training (SCT). Our work is inspired by the observation that smaller models, when trained in a self-supervised setting, tend to perform poorly or collapse altogether. We hypothesize that this problem can be addressed by providing additional learning guidance to facilitate the self-referencing mechanism in the self-supervised learning pipeline.
Our work consists of three key contributions. First, at the framework level, we formulate a cross-view comparison pipeline to improve the self-referencing mechanism by enabling cross-comparison between two input views. In addition, our framework allows the two input views to be formulated from the same or different PLMs. Second, to facilitate the learning process, we design a new technique to measure the discrepancy between two cross-view outputs. Instead of comparing them directly, we use similarity score distributions. Third, we conduct extensive experimental studies to compare our method against existing competitors and to analyze our design decisions.
The experimental results on the STS tasks show that our method outperforms the competitors for PLMs with fewer than 100M parameters in 18 of 21 cases. With the help of the distillation loss, our method improves the performance of small PLMs more than that of large PLMs. Moreover, our method outperforms competitive methods for all PLMs on the downstream tasks. Furthermore, the results confirm that the cross-view comparison pipeline and the similarity score distribution comparison are crucial to the performance improvement. These findings imply that smaller PLMs benefit from our judiciously designed guidance in a self-supervised setting.

Figure 1: (a) The overview of self-supervised contrastive learning for sentence embedding. Contrastive learning is applied to directly compare the input x produced from the separate-view pipeline T. (b) The Self-Supervised Cross-View Training (SCT) pipeline. We calculate similarity score distributions from two networks ($f_{\theta}$ and $f_{\text{ref}}$) in the cross-view pipeline and minimize the discrepancy between these distributions. In addition, the two networks need not have identical architectures or share weights. They can be large and small networks (distillation) or Siamese networks.

Figure 2: Comparison between sentence representation methods on different model sizes. We averaged Spearman's rank correlation across seven STS datasets. LL denotes the low-cost, low-performance group, and HH denotes the high-cost, high-performance group.

Figure 3: The overview of Self-Supervised Cross-View Training (SCT).
[...] the MLP projector $h(\cdot)$ onto the representations $z^{\theta}_1 = h(f_{\theta}(x_1))$ and $z^{\theta}_2 = h(f_{\theta}(x_2))$.
Reference representations ($z^{\text{ref}}$). The views $x_1$ and $x_2$ are again encoded by the $f_{\text{ref}}(\cdot)$ encoder to be used as references for the next step: $z^{\text{ref}}_1 = f_{\text{ref}}(x_1)$ and $z^{\text{ref}}_2 = f_{\text{ref}}(x_2)$. Note that the architecture and weights of the target network $f_{\text{ref}}(\cdot)$ and the online network $f_{\theta}(\cdot)$ are identical, and all encoder outputs are normalized.
Instance queues ($D$). We denote two instance queues that are formulated from the cross-view reference representations, $z^{\text{ref}}_1$ and $z^{\text{ref}}_2$, as $D_1 = [d^1_1, ..., d^k_1]$ and $D_2 = [d^1_2, ..., d^k_2]$, where $k$ is the queue length and $d$ is the sentence vector obtained from $f_{\text{ref}}$, with $d^k_1 = z^{\text{ref}}_1$ and $d^k_2 = z^{\text{ref}}_2$. These instance queues enable the dynamic construction of a large and consistent set of negative samples, facilitating distribution learning (Fang et al., 2021; Tejankar et al., 2021; Zheng et al., 2021). At the beginning of each minibatch, we enqueue and dequeue the instance queues in a "first-in-first-out" manner.
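The instance queue and the similarity score distributions described above can be sketched in a few lines. This is a simplified illustration rather than the authors' implementation: vectors are plain Python lists, the temperatures and queue length are placeholder values (Table 1 lists the actual parameters), and the use of cross-entropy between a reference distribution and an online distribution is an assumption about how the discrepancy is measured. All function names are hypothetical.

```python
import math
from collections import deque


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_distribution(z, queue, tau):
    """Softmax over temperature-scaled dot products between z and each queue entry."""
    scores = [sum(a * b for a, b in zip(z, d)) / tau for d in queue]
    top = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(p_ref, p_online):
    """Discrepancy between reference and online distributions (assumed form)."""
    return -sum(p * math.log(q) for p, q in zip(p_ref, p_online))


# First-in-first-out instance queue of length k: appending to a full deque
# silently drops the oldest reference vector.
k = 3
queue = deque(maxlen=k)
for ref_vec in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]):
    queue.append(l2_normalize(ref_vec))

# One reference vector and one online vector (toy 2-dimensional values).
z_ref = l2_normalize([0.0, 2.0])
z_online = l2_normalize([0.1, 1.0])

p_ref = similarity_distribution(z_ref, queue, tau=0.5)        # placeholder tau_ref
p_online = similarity_distribution(z_online, queue, tau=0.5)  # placeholder tau_theta
loss = cross_entropy(p_ref, p_online)
```

In a cross-view setting, `z_ref` and `z_online` would come from different views of the same sentence, so the online network is trained to reproduce how the reference network ranks that sentence against the queued samples.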

Table 1: Model parameters, including learning rate, instance queue size $k$, and temperature scaling for the reference ($\tau_{\text{ref}}$) and online ($\tau_{\theta}$) networks.

Table 2: Sentence embedding performance on STS tasks (Spearman's rank correlation). For the distillation setting, we used BERT-Large-SimCSE for all distillation techniques.

Table 5: We evaluate the string dissimilarity and embedding similarity on our training and augmentation datasets. For the string dissimilarity, we use edit distance to evaluate the changes in the augmentation dataset. For the embedding similarity, we use cosine similarity to evaluate how closely the embeddings of the original and augmentation datasets agree.