AMPLIFY: Attention-Based Mixup for Performance Improvement and Label Smoothing in Transformer

Mixup is an effective data augmentation method that generates new augmented samples by aggregating linear combinations of different original samples. However, if the original samples contain noise or aberrant features, mixup may propagate them to the augmented samples, making the model over-sensitive to these outliers. To solve this problem, this paper proposes a new mixup method called AMPLIFY. It uses the attention mechanism of the Transformer itself to reduce the influence of noise and aberrant values in the original samples on the prediction results, without adding trainable parameters and at very low computational cost, thereby avoiding the high resource consumption of common mixup methods such as Sentence Mixup. The experimental results show that, at a smaller computational cost, AMPLIFY outperforms other mixup methods in text classification tasks on seven benchmark datasets, providing new ideas and new ways to further improve the performance of pre-trained models based on the attention mechanism, such as BERT, ALBERT, RoBERTa, and GPT. Our code can be obtained at https://github.com/kiwi-lilo/AMPLIFY.


INTRODUCTION
Data augmentation techniques are widely used in modern machine learning to improve the predictive performance and robustness of computer vision and natural language processing (NLP) models by adding feature noise to the original samples. Compared to computer vision tasks, NLP tasks face samples with more complex data structures, semantic features, and semantic correlations, as natural language is the main channel of human communication and a reflection of human thinking. NLP models are therefore more sensitive to the quality of the dataset, and in practical engineering applications often require a variety of data augmentation techniques to improve their generalization ability, adaptability, and robustness to new data. Common text data augmentation techniques include random synonym replacement [Zhang et al. 2015], back-translation [Xie et al. 2020], and random word insertion and deletion [Wei and Zou 2019]. The core idea of these methods is to generate more training samples by applying a series of feature transformations to the original samples, allowing the model to learn and adapt to changing feature information and to better handle uncertainty in the samples. Additionally, by selectively increasing the size of the training set, data augmentation techniques can effectively alleviate the negative impact of limited data and imbalanced class distributions on model prediction performance.
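As a minimal illustration of the word-level noising techniques mentioned above, random deletion and random swap (in the style of [Wei and Zou 2019]) can be sketched as follows; the helper names are ours, not from any particular library:

```python
import random

def random_deletion(words, p=0.1):
    """Drop each word with probability p, keeping at least one word."""
    if len(words) == 1:
        return words
    kept = [w for w in words if random.random() > p]
    return kept if kept else [random.choice(words)]

def random_swap(words, n_swaps=1):
    """Swap n_swaps random pairs of positions, leaving the label unchanged."""
    words = words[:]
    for _ in range(n_swaps):
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words

random.seed(0)
sent = "the quick brown fox jumps over the lazy dog".split()
print(random_deletion(sent, p=0.2))
print(random_swap(sent))
```

As the RELATED WORK section notes, such noising can impair grammatical integrity, which is one motivation for performing augmentation in feature space instead.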
Standard Mixup is a simple and effective image augmentation method first proposed in 2017 [Zhang et al. 2017]. It aggregates two images and their corresponding labels by linear interpolation to generate a new augmented image and its pseudo-label. The main advantage of standard Mixup is that it purposefully generates beneficial noise by taking a weighted average of the features in the original samples. After adapting to this noise, the model becomes less sensitive to other noise in the training samples, which improves its generalization ability and robustness.
Additionally, because more augmented samples can be generated by aggregating different original sample pairs, there are no duplicate aggregation results even at a high data augmentation magnitude. This greatly enhances the diversity of augmented samples, thereby improving the training efficiency of the model and reducing the risk of overfitting. In the NLP field, the standard Mixup technique can be broadly divided into two categories: Input Mixup [Yun et al. 2019] and Hidden Mixup [Verma et al. 2019]. Input Mixup increases the diversity of learnable features by increasing the number of input samples in the training set, with the main goal of reducing the probability of model overfitting. The main goal of Hidden Mixup is to enhance the robustness of the entire network architecture by increasing the diversity of the features represented in the model's hidden layers, thereby improving the model's generalization ability. Unlike Input Mixup, Hidden Mixup only involves the hidden-layer representations of sample features and does not require additional input data, which saves computational resources and time.

The research of [Verma et al. 2019] has shown that performing Mixup at the input layer of the model leads to underfitting, while performing Mixup at deeper layers makes training easier to converge. However, when the original samples contain noise or outliers, the linear interpolation underlying Mixup may cause the generated augmented samples to contain even more regular noise or outliers. If the model learns too much from this noise, it may exhibit obvious overfitting and prediction bias, reducing its generalization performance. The basic idea of standard Mixup is to linearly aggregate two samples (x_i, y_i) and (x_j, y_j) from the training set D_train to generate a new sample (x_mix, y_mix) as an augmented input for training the network, where x represents a sample and y its corresponding label. This aggregation can be expressed by the following formulae:

x_mix = λ x_i + (1 − λ) x_j,
y_mix = λ y_i + (1 − λ) y_j,

where λ is a weight coefficient sampled from the Beta distribution.

The attention mechanism naturally learns which features to retain and which to discard when aggregating different samples, ultimately integrating the more critical features into the generated output sequence. Therefore, AMPLIFY duplicates the Multi-Head Attention (MHA) output of the sample sequences in the same batch, shuffles the order of these copies, and then performs Mixup between them and the original outputs. This allows the model to aggregate the correlations and attention of two samples' features multiple times at different levels, thereby obtaining more reasonable weights for each position in the aggregated sequence and avoiding unnecessary noise or loss of feature information during the Mixup process. Moreover, since the attention mechanism is itself a method of weighting the input features, AMPLIFY is intuitively more natural and effective.
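The linear interpolation of standard Mixup can be sketched in a few lines (NumPy; the function name and one-hot label encoding are our illustrative choices):

```python
import numpy as np

def standard_mixup(x1, y1, x2, y2, alpha=0.2, rng=np.random.default_rng(0)):
    """Aggregate two samples and their one-hot labels by linear interpolation:
    x_mix = lam * x1 + (1 - lam) * x2, and likewise for the labels,
    with lam drawn from Beta(alpha, alpha)."""
    lam = rng.beta(alpha, alpha)
    x_mix = lam * x1 + (1 - lam) * x2
    y_mix = lam * y1 + (1 - lam) * y2
    return x_mix, y_mix, lam

x1, x2 = np.ones(4), np.zeros(4)
y1, y2 = np.array([1.0, 0.0]), np.array([0.0, 1.0])
x_mix, y_mix, lam = standard_mixup(x1, y1, x2, y2)
print(lam, x_mix, y_mix)  # y_mix = [lam, 1 - lam], a soft pseudo-label
```

With small alpha, Beta(alpha, alpha) is U-shaped, so lam is usually close to 0 or 1 and one sample dominates the mixture.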
Since AMPLIFY aggregates the features of the augmented sample and the corresponding original sample in the hidden layers of the model, it has a lower computational cost and fewer parameters than other data augmentation methods. It also avoids the increased time and resource consumption caused by standard Mixup methods such as Sentence Mixup.
In addition, experimental results show that our method better learns the key features of the input sequence, improving the generalization ability and prediction performance of the model (Table 2). For example, on the MRPC dataset, AMPLIFY's average accuracy is 1.83% higher than the baseline, and 1.04%, 1.72%, and 1.46% higher than EmbedMix, SentenceMix, and TMix, respectively.

RELATED WORK
Data Augmentation refers to a class of methods that generate new data with certain semantic relevance based on the features of existing data. By adding more augmented data, the overall prediction performance of the model can be improved and its robustness enhanced. When the training set is limited in size, data augmentation techniques are especially effective because they can significantly reduce the risk of overfitting and improve the generalization ability of the model. However, the inherent difficulty of NLP tasks makes it hard to construct data augmentation methods similar to those in computer vision (such as cropping and flipping), since such transformations may significantly alter the semantics of the text, making it difficult to balance data quality with feature diversity. Currently, data augmentation methods in the NLP field can be roughly divided into three categories: Paraphrasing [Wang and Yang 2015], Noising [Wei and Zou 2019], and Sampling [Min et al. 2020]. Paraphrasing methods apply certain transformations to the characters, words, phrases, and sentence structures in the text while trying to retain the semantics of the original text. However, such methods may change the textual semantics in different contexts; for example, substituting "I eat an iPhone every day" for "I eat an apple every day" obviously does not make sense. Noising methods add some continuous or discrete noise to the sentence while keeping the label unchanged. Although this has little impact on the semantics of the text, it may significantly affect the basic structure and grammar of the sentence, and it is also limited in how much feature diversity it can add. For example, adding noise to the sentence "i like watching sunrise" to turn it into "i like, watching jogging sunrise" impairs the grammatical integrity of the original sentence. Sampling methods select samples based on the feature distribution of the existing data and use them to augment new data. They require manually defined selection strategies tailored to the features of each dataset, so their range of application is limited and the diversity of the augmented features is relatively poor.
Mixup: a noise-based data augmentation technique introduced by [Zhang et al. 2017].

METHOD
Assume that our AMPLIFY method requires the following inputs, outputs, function definitions, and data structures:

• M_pre is a pre-trained text classification model based on the Transformer architecture.
• D_train = {⟨x_i, y_i⟩}^N_{i=1} is the training set of a downstream text classification task with N samples, where x_i is a sample sequence and y_i is its corresponding label.
• M_b is the mini-batch of M_pre during each training iteration, with a sample size of b.
• Shuffle(M_b) is a random shuffling function that changes the order of samples in the mini-batch.
• Index(M_b) is an index function that retrieves a collection of text sequences M_b with b elements and returns the corresponding index according to the order of these elements.
• Sort(M_b, I_b) is a reordering function that reorders the elements in the collection of text sequences M_b with b elements according to the index I_b, and returns the sorted sequence collection.
• MHA_pre(x) represents the feature sequence output after the input sample sequence x is fed forward through the Multi-Head Attention (MHA) layer in a Transformer block of M_pre.
• Beta(α, α) represents the U-shaped Beta probability distribution function with shape parameter α.
• WS(Beta(α, α)) represents a weight value obtained by sampling from the U-shaped Beta probability distribution.
• LABEL(M_b, I_b) is a label generation function that selects the corresponding labels from a collection of sequences M_b in the order given by the index I_b and assembles them into a label set.
• Pred(M_b) is the prediction function of the text classification model, which generates, for each sequence in the input set M_b, a set of probabilities over c categories.
Based on the above initial conditions, our AMPLIFY algorithm consists of the following steps:

• Step 1: When model M_pre is trained for a downstream text classification task, a corresponding mini-batch, referred to as M_b, is obtained from the training set D_train in each iteration.

• Step 2: A copy of M_b is made and named M'_b. The sample sequences in M'_b are shuffled randomly, and the index of the shuffled elements is then obtained through the indexing function, denoted I_b (it is used to calculate the loss value, as detailed in Equation 8). This process can be expressed as:

M'_b = Shuffle(Copy(M_b)), I_b = Index(M'_b). (2)

• Step 3: Calculate MHA_pre(x_i) for each sample sequence x_i in M_b, obtaining the corresponding feature sequence set for M_b, denoted F_b:

F_b = {MHA_pre(x_i) | x_i ∈ M_b}.

• Step 4: Make a copy of F_b called F'_b, and reorder the sequence-label pairs in F'_b based on I_b using the function Sort(F'_b, I_b):

F'_b = Sort(Copy(F_b), I_b).

• Step 5: Perform an element-wise Mixup operation on F_b and F'_b to obtain the aggregated feature sequence set M_mix, which is then fed into the subsequent hidden layers. AMPLIFY performs the Mixup operation in every Transformer block; if the weight coefficients λ used in different blocks differ, the features in the sequence will be disturbed frequently and intensely, abnormal features or noise will be continually accumulated and strengthened, and the model's loss value will fluctuate significantly during training. When the λ values differ greatly between blocks, this instability is amplified further. Therefore, the λ_max we use differs from the definition in standard Mixup. To improve the stability of linear interpolation and the consistency of feature representation, AMPLIFY first performs multiple samplings from the Beta probability distribution (BPD) and then selects the maximum of the resulting λ values. Specifically, we call Beta(α, α) k times to obtain a weight set Λ with k elements, and then select the maximum weight value λ_max in this set as the weight coefficient for all Mixup operations in the model. This significantly reduces the adverse impact of randomness in the feature sequence on linear interpolation, making one feature sequence the explanatory term and the other the random perturbation term, which is also why we tend to choose smaller α values. In addition, according to [Zhang et al. 2017], as α → ∞, the value sampled from the BPD approaches 0.5, and the training error of the model on the real dataset increases significantly. If the standard Mixup calculation were used for the mixing operation, linear interpolation would likely produce a large number of abnormal features through significant disturbance, which in turn would bias the model's predictions and greatly reduce its generalization performance. On the other hand, as shown in Figure 2, a weight value obtained by sampling only once from the U-shaped BPD may fall into the low-probability area, or even be very close to 0.5; sampling k weights from the BPD and taking the maximum effectively avoids this issue. We also found experimentally that if k is large, the sampled weight values concentrate in the high-probability area and the selected maximum approaches 1. This greatly reduces the weight of the feature sequence F_b or F'_b that serves as the random perturbation term, rendering it ineffective as a perturbation and making the Mixup meaningless. Based on the experimental results in Table 1 (we compared the classification performance of AMPLIFY with different numbers of weight samples on five datasets; the mean in the table is the average performance over the five datasets, which is optimal when sampling five times), we choose α = 0.1 and k = 5 to avoid risky weight values as much as possible while effectively leveraging the random perturbation term. The above process can be described by the following equations:

Λ = {WS(Beta(α, α))}^k_{i=1}, λ_max = max(Λ), M_mix = λ_max · F_b + (1 − λ_max) · F'_b.

• Step 6: Calculate the loss value L_mix based on the model's predictions. Standard Mixup uses two common methods to calculate the loss value. One computes the cross-entropy loss between the output of the text classification head and the ground truth in both the original order and the shuffled order, and then sums the two loss values with weights as the total loss. The other computes the cross-entropy loss between the output and the mixed pseudo-labels. Essentially, the main difference between the two is the order in which the weighting and the cross-entropy calculation are performed: the first computes the cross-entropy losses first and then weights and sums them, while the second weights and sums the targets first and then computes the corresponding cross-entropy loss. In the AMPLIFY algorithm, following the experimental results of [Yoon et al. 2021], we adopt the more effective first method to calculate the loss value L_mix. From the perspective of reflecting the correlation between labels, the label mixing in the AMPLIFY algorithm can be considered an enhanced version of label smoothing. Through the weight coefficient λ_max, we determine what proportion of the cross-entropy loss comes from the explanatory term and what proportion comes from the random perturbation term. This is equivalent to adding moderate noise to the original labels, so that the model's predictions do not concentrate excessively on high-probability categories, leaving some probability to low-probability categories, while effectively reducing the risk of overfitting. In summary, the pseudocode of the AMPLIFY algorithm is shown as follows:
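The steps above can be condensed into a short sketch (NumPy; the function names are illustrative, and a random array `h` stands in for the MHA output of a mini-batch):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_lambda_max(alpha=0.1, k=5, rng=rng):
    """Step 5 weight: draw k values from Beta(alpha, alpha) and keep the maximum,
    so the shuffled copy acts only as a mild random perturbation term."""
    return rng.beta(alpha, alpha, size=k).max()

def amplify_mix(h, lam_max, perm):
    """Mix the MHA output h (batch, seq_len, dim) with a shuffled copy of itself."""
    return lam_max * h + (1.0 - lam_max) * h[perm]

def cross_entropy(probs, labels):
    """Mean negative log-likelihood of the true labels."""
    return -np.log(probs[np.arange(len(labels)), labels]).mean()

def amplify_loss(probs, y, perm, lam_max):
    """Step 6 (first method): weighted sum of the cross-entropy losses computed
    against the original labels and against the shuffled labels."""
    return lam_max * cross_entropy(probs, y) + (1.0 - lam_max) * cross_entropy(probs, y[perm])

# Toy mini-batch: 4 sequences, length 6, hidden size 8
h = rng.normal(size=(4, 6, 8))
perm = rng.permutation(4)          # index of the shuffled copies (I_b)
lam = sample_lambda_max()
h_mix = amplify_mix(h, lam, perm)  # fed to the subsequent hidden layers
```

In the full method, `amplify_mix` would be applied to the MHA output inside every Transformer block with the same lam, while the perturbation-term labels are obtained by indexing the batch labels with the same permutation.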

Benchmark Datasets and Models
When designing our experiments, considering the representativeness of the model and its relevance to text classification tasks, we chose the BERT-base-uncased model [Devlin et al. 2018] from the HuggingFace Transformers library as the backbone network among the many pre-trained models based on the Transformer architecture, and conducted experiments on seven benchmark datasets: MRPC, SST-2 [Wang et al. 2018], SST-5 [Socher et al. 2013], TREC-Fine, TREC-Coarse [Li and Roth 2002], Yelp-5 [Yelp 2014], and IMDB [Maas et al. 2011]. All datasets were obtained from HuggingFace Datasets or from the datasets' official websites.

Baselines
We conducted a detailed experimental comparison of our AMPLIFY method with the following four baseline methods:

• No Mixup: relying solely on the predictive ability of the backbone network, without using any Mixup technique [Devlin et al. 2018].

• EmbedMix: First, zero-padding is used to pad all text sequences in the training set to the same length. After word embedding, pairs of sequences are randomly combined. Then, Mixup operations are performed separately on the two vectorized samples in each sequence pair and on their corresponding classification labels to obtain the augmented sample and its pseudo-label. The Mixup operation of this method thus occurs in the word embedding stage of text preprocessing, involving only semantic features in the word vector space [Guo et al. 2019a].

Manuscript submitted to ACM
• SentenceMixup: First, the encoder of the text model processes all samples in the training set to obtain the corresponding sentence-level sequence encodings. Then, Mixup operations are performed separately on two randomly selected sequence encodings and their labels to obtain the linearly interpolated feature sequence and its pseudo-label. Finally, the mixed result is fed to the softmax layer at the end of the network. The Mixup operation of this method thus occurs in the prediction stage of text processing, and the entire feature aggregation process involves only the hidden layers within the classification head [Guo et al. 2019a].
• TMix: First, two sequences x_i and x_j are randomly selected from the training set and processed by the text model M. Then, a hidden layer of M is selected at random, and the outputs h_i and h_j of x_i and x_j at that layer are extracted. Mixup operations are then performed separately on these two feature sequences h_i and h_j and on their labels, obtaining a mixed feature sequence after linear interpolation and its corresponding pseudo-label, which are fed to the subsequent hidden layers. The Mixup operation of this method thus occurs in the feature extraction stage of text processing, involving only one specific hidden layer of the model [Chen et al. 2020].
For EmbedMix, SentenceMixup, and TMix, we followed the best parameter settings provided in their original papers, with shape parameter α = 0.2 and the Mixup weight coefficient λ sampled as λ = WS(Beta(α, α)). We report the mean and variance of the model after running three times with three different random seeds [Dror et al. 2018].
For our AMPLIFY method, the weight coefficient λ_max is calculated using the equations of Step 5, where α = 0.1 and the number of samples k = 5. Additionally, when training with the TMix method, we randomly select one hidden layer from the 7th, 9th, and 12th blocks of the BERT-base-uncased model for the Mixup operation.

Experimental Settings
In all experiments, AdamW [Loshchilov and Hutter 2017] is chosen as the optimizer for training, with a cosine learning rate schedule [Shazeer and Stern 2018], warmup steps accounting for 10% of the total training steps, an initial learning rate of 2e-5, EPS of 1e-8, a weight decay coefficient of 0.01, a batch size of 32, a maximum sequence length of 256, a maximum of 15 epochs for fine-tuning the pre-trained model, and an early-stopping patience of 5. To ensure the consistency and effectiveness of the experimental process, all neural network models involved are constructed with HuggingFace Transformers [Wolf et al. 2019], and all experiments are completed on the same NVIDIA RTX A6000 GPU with the same configuration file parameters under the PyTorch Lightning framework. Each experiment uses 3 different random seeds and reports the mean and variance of the results.
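The learning rate schedule described above (linear warmup over the first 10% of steps, then cosine decay from 2e-5) can be written out explicitly; this is a plain-Python sketch of the standard warmup-plus-cosine formula, not the exact scheduler implementation used in our experiments:

```python
import math

def cosine_lr_with_warmup(step, total_steps, base_lr=2e-5, warmup_frac=0.10):
    """Linear warmup to base_lr over the first warmup_frac of steps,
    then cosine decay towards zero over the remaining steps."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

total = 1000
print(cosine_lr_with_warmup(0, total))     # start of warmup
print(cosine_lr_with_warmup(100, total))   # peak learning rate 2e-5
print(cosine_lr_with_warmup(1000, total))  # decayed towards zero
```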

Overall Results
Table 2 details the impact of our AMPLIFY method and other standard Mixup methods on the performance of the baseline pre-trained text model (denoted "No Mixup" in the table) on seven benchmark datasets. The results show that AMPLIFY provides better performance gains for the baseline model, and the idea of introducing a random perturbation term also serves as a good regularizer that reduces the risk of overfitting. Moreover, as training iterates, AMPLIFY outperforms the other standard methods in almost all experiments; on the TREC-Fine dataset in particular, its improvement in model accuracy reaches 4.2%. In addition, in terms of the variance of the results, the performance gain of AMPLIFY is relatively stable, indicating good robustness to uncertainty in the samples and better resistance to overfitting than the other Mixup methods.
What caught our attention is that on the Yelp-5 dataset, with 560,000 training samples, almost all Mixup methods failed to provide a net performance gain for the baseline model. Observing the experimental process and results, we believe this is because pre-trained models like BERT generally perform well on large-scale datasets: they are usually pre-trained on massive text corpora, allowing them to learn more language-level knowledge and patterns. They have therefore already achieved excellent performance on datasets such as Yelp-5, which have relatively clear classification features, leaving limited room for Mixup to improve them. On the other hand, Mixup methods essentially augment the samples and their labels using the feature information already present in the same dataset; unlike standard data augmentation strategies, they do not introduce out-of-domain feature information, so they can neither significantly improve the model's performance nor seriously impair it. When the dataset is large, the baseline model can already understand the text features well, and using Mixup may have little effect. However, when the dataset is small and the model's overfitting problem is severe, Mixup, like many other data augmentation methods, can often have a significant effect, helping the model improve its generalization ability and robustness in the face of sample uncertainty [Sun et al. 2020].
For example, the MRPC dataset has only 4500 samples, and using a Mixup method on it leads to performance gains. This is especially true for AMPLIFY, which performs multiple Mixup operations during the model's forward pass, effectively adding multiple mild random perturbations to the feature sequence and producing more significant gains. Additionally, when using Mixup to augment feature sequences, the mixed sequences must differ significantly from the original ones to improve the model's generalization performance. If the number of sequences per class in the dataset is relatively balanced, the distribution of differences between the mixed samples will also be more uniform, making it difficult for Mixup to bring further performance gains, as demonstrated by [Yoon et al. 2021]. From this perspective, the sample sizes of the various categories in Yelp-5 are very similar and the feature distribution of the data is relatively balanced; the differences between the mixed sequences are therefore not significant enough, which renders Mixup ineffective. Furthermore, on SST-2, the AMPLIFY method resulted in negative gains in model performance. Analyzing the reasons, we found that SST-2 is a sentiment binary classification dataset containing audience comments on movies and annotations of audience sentiment. These comment texts vary greatly in length, so after padding, many text sequences contain a large number of meaningless placeholders. When performing Mixup on the feature sequences of short and long texts, placeholders may be mixed with sentiment information, weakening or even submerging the classification features and thereby affecting the model's predictions.
To further validate the effectiveness of AMPLIFY, we selected DistilBERT, the distilled version of BERT, as the backbone network for comparison; Table 3 shows the effectiveness of using AMPLIFY on DistilBERT.

Variance
The experimental results show that, compared to the baseline model and most other Mixup methods, our AMPLIFY achieves better performance gains while also having lower variance. On the TREC-Fine dataset in particular, AMPLIFY exceeds the baseline accuracy by 4.2% while its variance is reduced by 0.32; compared to the other Mixup methods, its variance is lower by 0.533, 1.04, and 0.693, respectively. Analyzing the reasons, we found that TREC-Fine is a multi-classification dataset composed of 6850 questions and their classification labels.
Since the samples are divided into 47 categories, the number of samples in each category is relatively small, and the distribution of samples across categories is very unbalanced. As a consequence, the category distribution of the augmented samples produced by Mixup methods is also very unbalanced. During training, the model's predictions will be biased towards categories with more samples and ignore categories with fewer samples. Although the overall accuracy of the model is not low, its performance on few-shot categories may be very poor. Moreover, when considering the variance of accuracy, the situation is different: in a dataset with imbalanced sample sizes, few-shot categories bring greater accuracy variance, because it is difficult for the model to get sufficient training on these categories and learn their features. This easily leads to larger errors when the model predicts these categories, thereby increasing variance. On the other hand, AMPLIFY can fully exploit the advantages of MHA in preserving local feature information and semantic relevance when processing natural language sequences. By adding mild random perturbations to the feature sequence and mixing the outputs of Attention multiple times, the coherence of the features and the consistency of the semantics are not impaired, making the features of few-shot categories more likely to be learned by the model and reducing variance. This also means that AMPLIFY can better adapt to an imbalanced sample distribution, effectively reducing performance fluctuations, making the model less prone to overfitting while generalizing better, and bringing higher performance and reliability in real application scenarios.

Visualization of Experimental Results
Figure 3 shows the cross-entropy loss values of the four Mixup methods, EmbedMix, TMix, AMPLIFY, and SentenceMix, on the MRPC dataset for the first 12k training iterations. The figure shows that AMPLIFY has a lower loss value and less fluctuation during training than the other Mixup methods, indicating a more stable training process. A Mixup operation may aggregate noise and outlier features from the original sequences into the mixed sequence, leading the model to pay excessive attention to this interference information and reducing its generalization ability. Specifically, the mixed sequence contains features from both original sequences, but the linear interpolation also interferes with the information from each, weakening or even drowning out useful features. Therefore, as a special data augmentation technique, Mixup increases noise during training, making it harder for the model to fit the data and thus increasing the loss value. By comparison, AMPLIFY reduces interference from noise and useless features by adding mild random perturbation terms to the explanatory terms multiple times, while retaining the advantages of the Mixup method. In terms of computation time, the experimental results further show that AMPLIFY saves 24s, 32s, and 472s compared to EmbedMix, SentenceMix, and TMix, respectively, within the first 12k iterations. Thus, while bringing better and more stable performance gains, AMPLIFY also largely avoids the computational cost that other Mixup methods incur during the mixing process.
Figure 4 shows the effect of the two different Mixup operations on the attention mechanism in the same text sequence.
It is clear that after the AMPLIFY operation, the attention between words in the same sentence remains at a high level, while after the EmbedMix operation the MHA cannot recognize the special separator between the two sentences well, and establishes high attention between words in different sentences. This is because the linear interpolation of EmbedMix makes the feature sequence of a mixed sentence contain context information from the other sentence, which can even overwhelm the semantics of the separator itself, confusing the attention mechanism's understanding of sentence context.
Figure 5 shows the effect of the two Mixup operations on the attention mechanism when applied to a specific word in the same sentence. Clearly, AMPLIFY does not negatively affect the output of MHA: the separator is successfully recognized, and closely related words still receive high attention weights, such as "rabbit" and "hopped", which have the highest attention weight. In contrast, EmbedMix negatively affects the output of MHA, making it difficult to select the correct word and assign appropriate attention weights; the separator fails to play its role and the attention is scattered.
Figure 6 demonstrates the impact of different Mixup operations on MHA when querying other words in the same sentence that are highly correlated with the word "it". The query vector q and key vector k jointly determine the correlation between any two words: the dot product q · k determines the attention score, and the softmax over these scores provides the query result as an attention distribution. In the figure, EmbedMix focuses the attention on "too" and "tired", which is not the semantic correlation we expect. In contrast, AMPLIFY makes a more accurate choice by putting the attention on "animal" and "it", consistent with our understanding that "it" should refer to "animal".
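The query/key computation just described is standard scaled dot-product attention, which can be sketched as follows (NumPy; illustrative code, not the visualization tooling used for the figures):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # correlation of every query/key pair
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))   # e.g. 5 tokens with head dimension 8
K = rng.normal(size=(5, 8))
V = rng.normal(size=(5, 8))
out, w = scaled_dot_product_attention(Q, K, V)
print(w.sum(axis=-1))         # each row of attention weights sums to 1
```

The attention maps in Figures 4-6 visualize exactly these per-row weight distributions for selected heads.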
In the semantic relationship graph of the word "it" shown in Figure 7, the sentence first introduces "film", then expresses dissatisfaction with a certain type of movie, and finally expresses a positive attitude towards "film"; from the consistency of the semantic logic and the way the sentence conveys emotion, it can be inferred that the "it" appearing in the sentence refers to "film". However, common pre-trained text models such as BERT split the keyword "cartoonish" into two lower-granularity tokens, "cartoon" and "##ish", in order to solve the out-of-vocabulary problem, which evidently prevented EmbedMix from recognizing that "cartoonish" is a complete adjective describing a movie genre. AMPLIFY, on the other hand, not only focuses on "film" but also extends attention to the grammatical dependencies between subjects and predicates, such as those involving "##ish" and "provides".
For the sake of fairness, both Mixup methods compared in this section copy the outputs of the hidden layers at the same position and perform Mixup after shuffling; other settings are consistent with those in Section 4.3. In addition, all attention visualization patterns in this section are from [Vig 2019] and [Wang et al. 2021]. The input text sequence consists of two identical sentences, "A coming-of-age film that avoids the cartoonish clichés and sneering humor of the genre as it provides a fresh view of an old type". Research in [Clark et al. 2019] and [Vig 2019] found that the lower blocks of BERT-base-uncased (BERT layers 1-4) mainly learn vocabulary, syntax, and semantic information in the text sequence, the middle blocks (BERT layers 5-8) focus more on syntactic information, and the last few blocks (BERT layers 9-12) focus on abstract semantic and context-related information. Therefore, using Mixup in the lower layers can enhance the model's robustness and generalization with respect to word-level features: in the lower layers, BERT learns basic language features and the contextual relationships between words, and Mixup can sharpen BERT's perception of local information in the input text. Using Mixup in the middle layers can enhance robustness and generalization with respect to sentence-level features: in the middle layers, BERT learns higher-level syntactic structures and long-range dependencies between morphemes, and Mixup can strengthen BERT's resistance to noise and out-of-domain features in the sequence. In theory, using Mixup in the highest layers should have the strongest effect, because there BERT learns global feature information of the text, and Mixup can improve BERT's ability to recognize noisy and out-of-domain samples in the dataset, thereby improving its performance on downstream tasks. Moreover, this global feature information is to some extent universal across downstream tasks, so applying Mixup in the high layers can improve generalization and reduce the risk of overfitting. In summary, this gives us good reason to perform Mixup on all layers.
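Because the same mixing coefficient and shuffle order are reused at every depth, the per-layer operation reduces to a single line. A minimal numpy sketch (toy shapes; the variables standing in for layer-4, layer-8, and layer-12 hidden states are hypothetical, not the paper's implementation):

```python
import numpy as np

def mixup_hidden(h, lam, perm):
    """Mix a batch of hidden states with a shuffled copy of itself.

    h    : (batch, seq_len, dim) hidden states at some depth
    lam  : mixing coefficient, e.g. sampled from Beta(alpha, alpha)
    perm : permutation of the batch indices (the label-mixing order)
    """
    return lam * h + (1.0 - lam) * h[perm]

rng = np.random.default_rng(0)
perm = rng.permutation(4)   # reused at every depth so the mixed sample
lam = rng.beta(0.2, 0.2)    # stays aligned with the mixed labels

# Toy stand-ins for hidden states taken at three different depths.
h4, h8, h12 = (rng.normal(size=(4, 8, 16)) for _ in range(3))
mixed = [mixup_hidden(h, lam, perm) for h in (h4, h8, h12)]
print(mixed[0].shape)  # (4, 8, 16)
```

Applying the operation at every depth, rather than a single fixed layer, is what lets the lexical, syntactic, and semantic levels all benefit.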

CONCLUSION AND FUTURE WORK
This article proposes a novel and simple hidden-layer Mixup method called AMPLIFY, which addresses the standard Mixup method's sensitivity to noise and out-of-domain features. By performing Mixup operations on all MHA layers of Transformer-based pre-trained language models and using mild random perturbation terms to augment the feature sequences output by each attention mechanism, AMPLIFY suppresses the effect of noise and out-of-domain features on the mixed results. Compared with standard data augmentation strategies, AMPLIFY better controls the fluctuations in model performance gains; compared with traditional Mixup methods, it has better robustness and generalization. The experimental results show that the proposed method has practical value for exploring the performance potential of models on different NLP tasks. In addition, AMPLIFY is computationally efficient, avoiding part of the resource overhead required by other Mixup methods, reducing the overall cost of the algorithm, and making it more practical for engineering use.
The experimental results also show that, depending on the characteristics of the dataset, applying Mixup at different MHA layers can be the more effective choice. However, constructing a Mixup method that dynamically adjusts its structure and application location for different datasets is very difficult. More generally, research on the learning behavior of BERT shows that the semantic features and grammar-related information learned at different network layers differ, as do the benefits they bring. It is therefore advisable to perform appropriate Mixup at different network layers to obtain more significant overall performance gains, which is consistent with the idea behind AMPLIFY.
For future work, we believe there is still considerable room to explore how Mixup operations can be combined with various attention mechanisms. Extending and optimizing existing Mixup methods is another potential opportunity. For example, our next research direction is to apply AMPLIFY to models outside NLP classification tasks, such as ViT (Vision Transformer) or CLIP (Contrastive Language-Image Pre-training), to evaluate its applicability and effectiveness. We also plan to combine AMPLIFY with other data augmentation techniques (such as Test-Time Augmentation) to further improve the performance of pre-trained language models.
Multi-Head Attention (MHA) is a technique commonly used in sequence-to-sequence models to compute feature correlations. It calculates attention weights for each position in the sequence in parallel using different attention heads, then obtains a representation of the entire sequence by summing the weighted values. MHA helps the model establish correspondences between features in the input and output sequences, improving its expressiveness and accuracy. It can also attend to features at different positions in the input sequence regardless of their distance, assigning different weights to different parts of the input to capture information relevant to the output sequence. To address the problem that noise and out-of-distribution values in the original samples contaminate the aggregated samples under standard Mixup, our AMPLIFY method follows the Hidden Mixup idea and performs Mixup operations on the hidden layers of the Transformer blocks. Since the output of the MHA already encodes the model's attention to the different parts of the input sequence, it can guide the model on which features to retain and which to attenuate during mixing.
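The MHA computation described above can be sketched in a few lines of numpy for a single batch element (random toy weights; this illustrates generic multi-head self-attention, not the exact parameterization of any particular pre-trained model):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, n_heads):
    """Minimal single-example multi-head self-attention.

    X              : (seq_len, d_model) input sequence
    Wq, Wk, Wv, Wo : (d_model, d_model) projection matrices
    Each head attends over the whole sequence in parallel; the
    weighted values are concatenated and projected by Wo.
    """
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    # Project, then split the feature dimension across heads.
    Q = (X @ Wq).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ Wk).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ Wv).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, seq, seq)
    A = softmax(scores, axis=-1)                         # attention weights
    heads = A @ V                                        # weighted values
    out = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return out @ Wo, A

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 8))
W = [rng.normal(size=(8, 8)) * 0.1 for _ in range(4)]
out, A = multi_head_attention(X, *W, n_heads=2)
print(out.shape, A.shape)  # (5, 8) (2, 5, 5)
```

AMPLIFY intervenes on `out`: the MHA output of the mini-batch is mixed with a shuffled copy of itself before entering the rest of the block.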

Fig. 1 .
Fig. 1. A schematic of AMPLIFY. In each encoder block of the Transformer, the forward-propagated input data is duplicated and re-ordered according to the label-mixing order after the results of the Multi-Head Attention are obtained, and the Mixup operation is then performed. No changes are made to other network structures. Similarly, each decoder block can perform the same operation.

Fig. 2 .
Fig. 2. The PDF of the U-shaped Beta probability distribution Beta(α, α) corresponding to different values of the shape parameter α.
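As a quick numerical check of the U-shape (the value α = 0.2 here is illustrative, not necessarily the paper's setting), sampling from Beta(α, α) with α < 1 shows that most mixing coefficients land near 0 or 1, so each mixed sample is dominated by one original sample and only mildly perturbed by the other:

```python
import numpy as np

# For alpha < 1, Beta(alpha, alpha) is U-shaped: sampled mixing
# coefficients concentrate near 0 and 1, so one sample dominates each
# mixture and the perturbation from the other sample stays mild.
rng = np.random.default_rng(42)
alpha = 0.2
lam = rng.beta(alpha, alpha, size=10_000)
near_edges = np.mean((lam < 0.1) | (lam > 0.9))
print(f"fraction of lambda near 0 or 1: {near_edges:.2f}")
```

Larger α flattens the distribution toward uniform, producing more aggressive mixing; α < 1 is what keeps the perturbation "mild" in the sense used above.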

Algorithm 1 :
Algorithm of AMPLIFY.
Input: the pre-trained Transformer-based text classification model f_pre; the training dataset of the downstream text classification task with n samples, D = {⟨x_i, y_i⟩}_{i=1}^{n}; the mini-batch drawn at each training iteration, M_b = {x_1, …}.

Fig. 4 .
Fig. 4. The influence of the fourth MHA layer of the model on the same text sequence after undergoing the AMPLIFY and EmbedMix operations, respectively. The left side of the figure represents the word being updated, while the right side represents the word being attended to. The lines represent the semantic correlations between words, and the color depth reflects the attention weight derived from the correlation. The text sequence consists of two sentences, "the rabbit quickly hopped" and "the turtle slowly crawled", with [SEP] being a special token used to separate the two sentences and [CLS] being a special token used for classifying the text sequence.

Fig. 5 .
Fig. 5. The influence of the fourth MHA layer of the model on a specific word, "rabbit", after undergoing the AMPLIFY and EmbedMix operations, respectively.

Fig. 6 .
Fig. 6. The neuron view of how the fourth MHA layer of the model calculates attention weights from the query and key vectors after being mixed by the two Mixup methods, AMPLIFY and EmbedMix, respectively. Positive values are shown in blue and negative values in orange. The depth of the color indicates the weight, and the lines represent the attention between words. The input text sequence consists of two identical sentences, "The animal didn't cross the street because it was too tired".

Fig. 7 .
Fig. 7. The semantic relationship graph between words corresponding to the output of the first MHA layer, with or without the AMPLIFY operation. The color depth of the nodes represents different attention weights, and the currently focused word is highlighted. The input text sequence consists of two identical sentences, "A coming-of-age film that avoids the cartoonish clichés and sneering humor of the genre as it provides a fresh view of an old type".

Table 1 .
The impact of different weight sampling numbers on AMPLIFY.
Output: the logits z produced by the text classification head; the final loss value of the model, L_mix.
foreach mini-batch M_b ⊂ D do
    Use equation 2 to obtain the shuffled mini-batch M_b' and its index idx_b.
    foreach multi-head attention layer l ∈ f_pre do
        Use equation 3 to obtain the feature sequence set F_l corresponding to M_b.
        Use equation 4 to obtain the re-ordered feature sequence F_l' based on idx_b.
        Use equation 6 to element-wise mixup F_l and F_l', obtaining the aggregated feature sequence set M_mix.
    end
    Use equation 7 to obtain the prediction result z of the model.
    Use equation 8 to calculate the loss value L_mix.
end
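The loop in Algorithm 1 can be sketched in plain numpy as a forward pass (the layer callables, the toy classification head, and the mixed cross-entropy form follow standard Mixup conventions and are illustrative stand-ins, not the paper's implementation):

```python
import numpy as np

def amplify_step(batch_x, batch_y, layers, head, alpha, rng):
    """One AMPLIFY forward pass (numpy sketch, no backprop).

    batch_x : (B, L, D) input feature sequences
    batch_y : (B, C) one-hot labels
    layers  : list of callables, each a stand-in for one MHA block
    head    : callable mapping final features to logits
    The same shuffle index and lambda are reused at every layer so the
    mixed features stay aligned with the mixed labels.
    """
    B = batch_x.shape[0]
    idx = rng.permutation(B)                # eq. 2: shuffled mini-batch order
    lam = rng.beta(alpha, alpha)            # mild, U-shaped mixing weight
    h = batch_x
    for layer in layers:                    # eqs. 3-6: mixup after every MHA
        f = layer(h)                        # feature sequences F
        h = lam * f + (1.0 - lam) * f[idx]  # element-wise mixup with F[idx]
    logits = head(h)                        # eq. 7: classification head
    # eq. 8 (standard Mixup form): cross-entropy mixed over both label sets.
    log_p = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
    loss = -(lam * (batch_y * log_p).sum(-1)
             + (1 - lam) * (batch_y[idx] * log_p).sum(-1)).mean()
    return loss

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 8))
y = np.eye(2)[rng.integers(0, 2, size=4)]
layers = [lambda h: np.tanh(h) for _ in range(3)]  # toy MHA stand-ins
W_head = rng.normal(size=(8, 2))
head = lambda h: h.mean(axis=1) @ W_head           # toy classifier head
loss = amplify_step(x, y, layers, head, alpha=0.2, rng=rng)
print(float(loss))
```

No new trainable parameters are introduced: the step only shuffles, scales, and adds existing activations, which is why the extra cost is negligible.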

Table 2 .
Comparison experimental results of different Mixup methods on seven benchmark datasets. All values in the table are the average accuracy (%).

Table 3 details the impact of AMPLIFY on the performance of DistilBERT on seven benchmark datasets. As can be seen from the results, since DistilBERT has only 6 Transformer encoder blocks compared to BERT's 12, the number of Mixup operations performed by AMPLIFY on DistilBERT is halved, resulting in a less pronounced improvement than on the standard BERT model. However, consistent net performance gains were still achieved, indicating that AMPLIFY has a greater impact on more complex Transformer models.
Table 4 shows in detail the effects of applying the AMPLIFY operation to hidden layers of different depths for the BERT-base-uncased model on seven benchmark datasets.

Table 4 .
The effects of applying the AMPLIFY operation to hidden layers of different depths in the model on seven benchmark datasets. The Input layer, Middle layer, and Last layer correspond to the MHA layers in the 4th, 8th, and 12th blocks, respectively. The experimental settings are the same as in Section 4.3, and the values in the table are the average accuracy (%) and corresponding variance over three runs with three different random seeds.

The experimental results illustrate that, for different datasets, the MHA layer at the appropriate depth should be selected for Mixup to achieve the best performance gain. In other words, performing Mixup on MHA layers fixed at one particular depth only allows the model to achieve ideal results on a few datasets. After weighing the trade-offs, we chose to perform a relatively mild Mixup on all MHA layers so as to obtain relatively better performance gains on as many datasets as possible. BERT (Bidirectional Encoder Representations from Transformers) is a common pre-trained language model consisting of 12 Transformer encoder blocks (referred to as BERT layers), each with its own attention mechanism and feed-forward neural network layer. BERT-base-uncased is the case-insensitive version of the BERT-base model, pre-trained on large-scale unlabeled text data such as Wikipedia, news articles, and website text. Since it only requires subword-level (WordPiece) tokenization of the input text, it can be trained and deployed quickly.