Improving BERT with Self-Supervised Attention

One of the most popular paradigms for applying large pre-trained NLP models such as BERT is to fine-tune them on a smaller dataset. However, one challenge remains: the fine-tuned model often overfits on smaller datasets. A symptom of this phenomenon is that irrelevant or misleading words in the sentence, which are easy for human beings to identify, can substantially degrade the performance of these fine-tuned BERT models. In this paper, we propose a novel technique called Self-Supervised Attention (SSA) to help address this generalization challenge. Specifically, SSA automatically generates weak, token-level attention labels iteratively by probing the fine-tuned model from the previous iteration. We investigate two different ways of integrating SSA into BERT and propose a hybrid approach that combines their benefits. Empirically, on a variety of public datasets, we illustrate significant performance improvements using our SSA-enhanced BERT model.


I. INTRODUCTION
The models based on self-attention, such as the Transformer [1], have shown their effectiveness on a variety of NLP tasks. One popular usage is to take a single pre-trained language model, e.g., BERT [2], and transfer it to a specific downstream task. However, opportunities remain to further improve these fine-tuned models, as many of them overfit, especially on smaller datasets.

Motivating Example: One Symptom of Overfitting
This paper is motivated by the observation that many fine-tuned models are very sensitive to irrelevant or misleading words in a sentence. For example, consider the sentence "a whole lot foul, freaky and funny." from the SST (Stanford Sentiment Treebank) dataset. Vanilla BERT predicts this sentence as negative, while its actual label is positive. As illustrated in Figure 1(a), one reason may be that the word 'foul', which is often associated with negative predictions, receives a disproportionately large attention score from the BERT model. By masking the misleading word 'foul' instead (see Figure 1(b)), BERT is able to focus on the more relevant word 'funny', and the final result flips to the correct label.

Self-Supervised Attention (SSA)
In this paper, we aim to alleviate the above problem by introducing auxiliary knowledge into the attention layer of BERT. An interesting aspect of our approach is that we do not need any additional data or annotation (such as those in MT-DNN [3]) for this auxiliary task. Instead, we propose a novel mechanism called self-supervised attention (SSA), which utilizes self-supervised information as an auxiliary loss.
The idea behind SSA is simple. Given a sentence of n tokens S = (t_1, ..., t_i, ..., t_n) that the model predicts with label y, we change the sentence into S' by, for example, masking a token t_i, and generate the predicted label y'. If y' differs from y, we set the SSA score of t_i to 1, otherwise 0. Intuitively, this means that if the token t_i plays a prominent role in the current model (i.e., the label prediction flips when this token is masked), it is assigned higher importance via a larger SSA score. Therefore, we can leverage an SSA layer to improve the accuracy and generalization ability of the original model by correctly predicting each token's SSA score.
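As a toy sketch (our illustration, not the authors' released code), the flip-based SSA labeling can be written as follows. Here `predict` is a hypothetical stand-in for the fine-tuned model M, implemented as a crude keyword scorer so the example is self-contained and mirrors the motivating sentence:

```python
MASK = "[MASK]"

def predict(tokens):
    # Hypothetical stand-in for the fine-tuned model M: a crude keyword
    # scorer. A real implementation would run the BERT classifier here.
    pos, neg = {"funny", "freaky"}, {"foul"}
    score = sum(t in pos for t in tokens) - 2 * sum(t in neg for t in tokens)
    return "positive" if score > 0 else "negative"

def ssa_labels(tokens):
    """SSA score of t_i: 1 if masking t_i flips the model's prediction, else 0."""
    y = predict(tokens)
    return [
        1 if predict(tokens[:i] + [MASK] + tokens[i + 1:]) != y else 0
        for i in range(len(tokens))
    ]

tokens = "a whole lot foul freaky and funny".split()
print(predict(tokens))      # -> negative (the toy model is misled by 'foul')
print(ssa_labels(tokens))   # -> [0, 0, 0, 1, 0, 0, 0]: only masking 'foul' flips it
```

With this toy scorer, the example behaves like the paper's motivating case: masking 'foul' flips the prediction, so 'foul' is the only token whose SSA label is 1.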
At first glance, it might seem mysterious how the SSA score can help the performance of a deep neural network model, since all the information in SSA comes from the model itself. The intuition behind our approach is to impose constraints on the model that yield several good properties. (1) If a model can predict SSA scores correctly, its decision surface is relatively smooth and less noisy; in other words, the decision becomes more robust to randomly masked irrelevant or misleading words. (2) By forcing the learned features to predict the auxiliary SSA task, the features are made to encode more information about the keywords in a sentence, giving the model better generalization capability.

Summary of Technical Contributions
Our first contribution is a co-training framework that precisely encodes the above motivation by a joint loss of downstream task and SSA. This itself already improves the performance of BERT significantly on some tasks. We then explore the idea of directly integrating SSA as an attention layer into the model. We propose a hybrid framework that combines both co-training and an additional self-supervised attention layer as our second contribution. This further improves the performance of our model on a variety of public datasets. We also demonstrate that the token-level SSA scores learned by the model agree well with human common-sense.

II. RELATED WORK
One way to address the overfitting problem is to increase the volume of training data. STILTs [4] introduces an intermediate training phase rather than directly fine-tuning on the target task, for example, training on the MNLI dataset before fine-tuning on the target RTE dataset. RoBERTa [5] focuses on training a better language model, which yields a more powerful pre-trained checkpoint and improves the performance of downstream tasks significantly; specifically, it collects a larger corpus, employs better hyper-parameters, and trains for more iterations than BERT. [6] proposes a data augmentation method, named conditional BERT contextual augmentation, to generate synthetic data samples as additional training data. Similarly, [7] explores four simple text-editing techniques for enlarging the volume of data and improves the performance of CNN/RNN models. However, it is also noted that the improvement from such data augmentation methods is negligible when using pre-trained models like ELMo and BERT. Another line of research focuses on auxiliary tasks. For instance, [8] leverages BERT to tackle the Aspect-Based Sentiment Analysis (ABSA) task: by constructing an auxiliary sentence from the aspect, the authors convert ABSA into a sentence-pair classification task, akin to Question Answering (QA) and Natural Language Inference (NLI). In this way, the original ABSA task is adapted to BERT, and new state-of-the-art results are established. MT-DNN [3] presents a multi-task learning framework that leverages a large amount of cross-task data samples to generate more general representations. AUTOSEM [9] proposes a two-stage multi-task learning pipeline, where the most useful auxiliary tasks are selected in the first stage and the mixing ratios of the tasks are learned in the second stage. [10] incorporates syntactic information as an auxiliary objective and achieves competitive performance on the task of semantic role labeling.
This work also relates to the attention mechanism, which has been extensively utilized for modeling token relationships in natural language. Previous works mainly focus on how to enhance sentence representations. For instance, [11] proposes a hierarchical attention network to progressively build a document vector for classification; [12] extracts an interpretable sentence embedding in a 2-D matrix through self-attention; [13] utilizes a vector-based multi-head attention pooling method to enhance sentence embeddings. [14] is the method most relevant to our work; it utilizes a graph-based attention mechanism in a sequence-to-sequence framework for document summarization.

FIGURE 2. Label generation procedure
In addition, some interpretability methods are relevant to our SSA score calculation strategy. For example, LIME [15] explains a model's predictions by sampling perturbations of the inputs and treating decision flipping as the key to an explanation. Gradient-based techniques such as Grad-CAM [16] are also applicable to SSA score calculation. The major focus of this paper, however, is to demonstrate the usefulness of token importance for improving model accuracy rather than model interpretability. We note that the SSA score generation strategy may be improved with LIME or Grad-CAM in future work.
The method proposed in this paper differs from previous works in the following respects. (1) Our solution does not require any extra knowledge (such as WordNet [7] or back-translation [17]) to build an augmented dataset. Instead, we explore token-level importance as an auxiliary task and utilize it to improve the generalization ability of the model. (2) We propose a novel auxiliary task, Self-Supervised Attention (SSA), which differentiates the importance of each token with respect to the optimization target.
(3) We apply an additional self-supervised attention layer to BERT for amplifying the effect of relevant tokens and diminishing the impact of irrelevant or misleading tokens. With the self-supervised attention layer, the BERT model becomes more resistant to overfitting in the fine-tuning stage.

III. MODEL
This section presents an overview of the proposed Self-Supervised Attention (SSA) model. In Section III-A, we introduce the auxiliary task of SSA and describe the training data generation procedure for this task. Then, in Section III-B, we introduce the co-training framework, which simultaneously optimizes the target task and the auxiliary task. Finally, in Section III-C, we propose the hybrid model with a self-supervised attention layer.

A. THE AUXILIARY TASK OF SSA
Based on a sentence S, we generate another sentence S' by masking several randomly chosen tokens in the original sentence. Sentence S' can be seen as a noisy counterpart of S. At a high level, by examining the relationship between S and S', we can construct a smoother surface passing through both, and thus allow a more robust local minimum to be reached via optimization. Specifically, if S' has the same prediction label as S, the modified tokens can be de-emphasized to improve the generalization capability of the original model. Otherwise, we want to emphasize the modified tokens to reflect their task-oriented importance. Motivated by this, we propose a novel auxiliary task, Self-Supervised Attention (SSA), which learns a task-oriented weight for each token. When the weights of the tokens are predicted correctly for a specific task, the decision surface between S and S' will be smoother, and the model will therefore generalize better.
Given a sentence S = (t_1, ..., t_i, ..., t_n), the SSA task outputs a binary vector Y = (y_{t_1}, ..., y_{t_i}, ..., y_{t_n}), where y_{t_i} indicates how important token t_i is to the target task. The loss function can be formulated as follows:

L_SSA = Σ_{i=1}^{n} ℓ_SSA( σ( w_i^{SSA} · M(t_i) ), y_{t_i} )

where M(t_i) denotes the output of the model from the previous epoch for token t_i, w_i^{SSA} is the fully connected layer of the i-th token for the SSA task, σ is the softmax operation, y_{t_i} denotes the SSA label of token t_i, and ℓ_SSA is the cross-entropy loss.
The training data generation procedure for the SSA task is as follows. Given a sentence S, we first mask several tokens at random to get a new sentence S'. Then, we infer the predicted labels of the two sentences using the same model M. If both predicted labels are the same, we set the importance labels of the masked tokens to 0, as they have little impact on the target task; otherwise, these tokens should be emphasized, and their importance labels are set to 1. To filter out noise, we only process sentences S on which the prediction of M is correct. The benefit of this restriction is that, as the number of epochs increases, the labels given by M become more and more accurate. The number of masked tokens is proportional to the length of the sentence, with the ratio set to 0.3 empirically. We use a generation ratio γ to control the number of generated sentences; a larger γ means more sentences are generated.
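A minimal sketch of this generation loop (our illustration under stated assumptions, not the released code): `predict` is a hypothetical previous-epoch model, misclassified sentences are skipped, a fraction `mask_ratio` of tokens is masked, and all masked tokens in a variant share the flip outcome as their label:

```python
import random

MASK = "[MASK]"

def generate_ssa_data(sentences, gold_labels, predict,
                      mask_ratio=0.3, gamma=1.0, seed=0):
    """Yield (masked_tokens, token_labels) samples for the SSA task.

    Only sentences the previous-epoch model `predict` already classifies
    correctly are used, to filter out noisy supervision. `gamma` controls
    how many masked variants are generated per sentence.
    """
    rng = random.Random(seed)
    samples = []
    for tokens, gold in zip(sentences, gold_labels):
        if predict(tokens) != gold:            # skip misclassified sentences
            continue
        for _ in range(max(1, round(gamma))):
            k = max(1, int(mask_ratio * len(tokens)))
            idx = set(rng.sample(range(len(tokens)), k))
            masked = [MASK if i in idx else t for i, t in enumerate(tokens)]
            flipped = predict(masked) != gold
            # masked tokens get label 1 only if masking them flipped the label
            labels = [1 if (i in idx and flipped) else 0
                      for i in range(len(tokens))]
            samples.append((masked, labels))
    return samples

# Toy usage with a hypothetical one-keyword sentiment model
toy = lambda ts: "pos" if "funny" in ts else "neg"
data = generate_ssa_data([["so", "funny", "indeed"]], ["pos"], toy)
```

In this toy run a single token is masked; its label is 1 exactly when the masked token is the sentiment-bearing 'funny', since only then does the toy model's prediction flip.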
We train different samples generated from the same sentence in different iterations without slowing down the training procedure. Figure 2 shows two training samples generated from one sentence across iterations. For example, given the input sentence S = "It is not even half the interest", we generate masked variants of S and label the masked tokens according to whether the prediction flips.

B. CO-TRAINING FRAMEWORK
A straightforward way to leverage the SSA task is to train the target task and SSA jointly. Specifically, the overall loss can be defined as a linear combination of two parts:

L = α · L_target + (1 − α) · L_SSA

where L_target denotes the loss function of the target task, e.g., negative log-likelihood for sentiment classification or mean-squared error for regression; y_{s_i} and ŷ_{s_i} denote the actual and predicted labels of sentence s_i for the target task, respectively; L_SSA denotes the loss of the SSA task; and α is a linear combination ratio that controls the relative importance of the two losses.
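As an illustrative sketch of the linear combination of target and SSA losses (toy probabilities, standard cross-entropy; the helper names are ours, not the paper's):

```python
import math

def cross_entropy(probs, label):
    """Standard cross-entropy for one example given predicted probabilities."""
    return -math.log(probs[label])

def combined_loss(target_probs, target_label,
                  token_probs, token_ssa_labels, alpha=0.9):
    """L = alpha * L_target + (1 - alpha) * L_SSA,
    with L_SSA averaged over the per-token binary SSA predictions."""
    l_target = cross_entropy(target_probs, target_label)
    l_ssa = sum(cross_entropy(p, y)
                for p, y in zip(token_probs, token_ssa_labels)) / len(token_ssa_labels)
    return alpha * l_target + (1 - alpha) * l_ssa

# One sentence with two tokens; alpha = 0.5 weights both losses equally
loss = combined_loss([0.25, 0.75], 1, [[0.5, 0.5], [0.5, 0.5]], [0, 1], alpha=0.5)
```

The paper's grid search later restricts α to {0.7, 0.9}, i.e., the target loss dominates and SSA acts as an auxiliary signal.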
The model architecture for co-training is illustrated in Figure 3. Each token in the input sentence is mapped to a representation and then fed into an encoder such as BERT or a multi-layer Transformer. The output of the encoder consists of a sentence representation vector R_s and token-level representation vectors [R_t1, R_t2, ..., R_tn]. The sentence representation is used for target-task prediction, and the token representation vectors are leveraged in the SSA task. The co-training framework optimizes the two tasks by alternating between the target loss and the auxiliary SSA loss. The pseudo-code is described in Algorithm 1.

C. HYBRID MODEL WITH SSA LAYER
The limitation of the co-training framework is that the SSA task cannot impact the target prediction explicitly. It only acts as a regularizer on the loss function, forcing the learned token embedding vectors to encode their relative importance. Intuitively, if an irrelevant or misleading token exists in a training sentence, we can mask the token explicitly and guide the model to capture more important information, alleviating the overfitting problem. Therefore, we add an additional self-supervised attention (SSA) layer on top of the vanilla BERT model, which readjusts the weight of each token.
The hybrid model with the SSA layer is illustrated in Figure 4, with the mathematical formulation

R_o = β · Σ_i σ(ŷ_{t_i}) · R_i + (1 − β) · R_[CLS]

It yields an extra sentence representation by summing all token embedding vectors R_i weighted by the SSA prediction scores ŷ_{t_i} after the softmax operation σ. The extra sentence embedding output by the SSA layer and the original sentence embedding vector R_[CLS] are then linearly combined into the final sentence representation R_o, where β is a hyper-parameter controlling the relative weight of the linear combination. Figure 5 further demonstrates a concrete example of the hybrid model. Taking "It is not even half the interest" as input, the discrete tokens are first mapped to embedding representations, and the embedding vectors are then transformed by a multi-layer Transformer encoder (e.g., BERT) to generate the sentence embedding and token representations. The SSA layer identifies that 'It', 'not', 'half', and 'interest' are important for the target sentiment classification task, and constructs an extra sentence embedding by a weighted summation of the token-level embedding vectors based on the corresponding SSA scores. The final sentence representation is a linear combination of the original sentence embedding and the extra sentence embedding. The loss functions of the primary task and the SSA task are jointly optimized in the co-training framework.
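The SSA-layer combination just described can be sketched in plain Python (our toy illustration with list vectors; a real implementation would operate on tensors inside the network):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def hybrid_sentence_repr(token_vecs, ssa_scores, cls_vec, beta=0.5):
    """Combine the SSA-weighted token sum with the [CLS] embedding:
    R_o = beta * sum_i softmax(ssa)_i * R_i + (1 - beta) * R_[CLS]."""
    w = softmax(ssa_scores)
    dim = len(cls_vec)
    extra = [sum(w[i] * token_vecs[i][d] for i in range(len(token_vecs)))
             for d in range(dim)]
    return [beta * extra[d] + (1 - beta) * cls_vec[d] for d in range(dim)]

# Two tokens with equal SSA scores: their embeddings are averaged,
# then mixed with the [CLS] vector (beta = 0.5).
r_o = hybrid_sentence_repr([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 1.0])
print(r_o)  # -> [0.75, 0.75]
```

Raising a token's SSA score shifts the extra embedding toward that token's vector, which is exactly how the layer amplifies relevant tokens and diminishes misleading ones.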
As with the co-training framework, the model training step and SSA data generation are performed alternately. The detailed training procedure is identical to Algorithm 1, except that we train the hybrid model instead of the joint model.

IV. EXPERIMENTS
In this section, we describe the experimental details of the proposed models, SSA-Co and SSA-Hybrid, on the GLUE benchmark and ABSA tasks.

A. TASKS
The General Language Understanding Evaluation (GLUE) benchmark is a collection of datasets for training and evaluating neural language understanding models. The statistics of the datasets are summarized in Table 1, where CoLA [18] and SST-2 [19] are datasets for single-sentence classification tasks, STS-B [20] is a text similarity task, MRPC [21] and QQP are binary classification tasks, and the rest are natural language inference tasks, including MNLI [22], QNLI [23], RTE [24] and WNLI [25]. We follow the default evaluation metrics for tasks in the GLUE benchmark.
For each task, we use the default train/dev/test split. The model is trained on the training set, while the hyper-parameters are chosen based on the development set. We submit the prediction results on the test set to the GLUE evaluation service and report the final evaluation scores.
To further evaluate the effectiveness of the proposed models, we conduct experiments on the aspect-based sentiment analysis (ABSA) task. ABSA is more than a pure sentiment polarity classification task; it also involves aspect detection. We argue that this more complex task can also benefit from the proposed SSA task. We follow [8] to conduct experiments on two ABSA datasets: SentiHood [26] and SemEval-2014 Task 4 [26]. Note that for SemEval-2014 Task 4, we jointly evaluate Subtask 3 (Aspect Category Detection) and Subtask 4 (Aspect Category Polarity). Since SemEval-2014 Task 4 only provides a train/test split, we randomly select 10% of the samples from the training set as validation data. For SentiHood we follow the default train/dev/test split. The detailed statistics of these two datasets are also listed in Table 1.

TABLE 1. Statistics of all datasets. The numbers on the left and right side of the character "/" for task MNLI represent MNLI-m and MNLI-mm correspondingly; "#C." is the number of categories in the task; "*" denotes a regression task.

B. ALGORITHMS
The pre-training and fine-tuning frameworks have evolved rapidly since BERT was proposed. As introduced in the Related Work section, there are many existing works on data augmentation and on leveraging auxiliary tasks. Our proposed algorithm is orthogonal to these techniques and can be applied to more advanced models. In this paper, the comparison is mainly against vanilla BERT, while more experiments on other model variants (e.g., MT-DNN) are left to future work. The configurations of the baselines and our solutions are described as follows.
BERT [2] is a multi-layer bidirectional Transformer. It is first pre-trained on a large corpus containing 3,300M words and then fine-tuned on downstream tasks. We download the checkpoints of BERT-base and BERT-large from the official website.
BERT-mask is a simple baseline to our algorithm, where the training data is augmented by randomly masking tokens in the original sentences in the same way as SSA. By comparing our solution with this simple baseline, we can clearly examine the superiority of self-supervised attention.
BERT-EDA [7] is another baseline for our algorithm, where training sentences are edited by four operations: synonym replacement, random insertion, random swap, and random deletion. We increase the data volume to match that used by the other models. This baseline uses WordNet as guidance, while ours only extracts token-importance information from the original data.
BERT-ABSA [8] enhances BERT by constructing an auxiliary sentence from the aspect, converting the ABSA task into a sentence-pair classification task; it is designed for the SentiHood and SemEval-2014 Task 4 datasets. We compare with the BERT-single (original data format) and BERT-pair (sentence-pair data format) models proposed in that paper. Correspondingly, for the ABSA task we name our SSA-based models BERT-single-H and BERT-pair-H.
RoBERTa [5] is an improved recipe of BERT. It is worth noting that the results reported on the leaderboard are ensembles of single-task models. For a fair comparison, the results of RoBERTa in our experiments are reproduced from the officially released checkpoint.
SSA-Co and SSA-Hybrid are the models proposed in this paper, where SSA-Co takes SSA as an auxiliary task and leverages the co-training framework to optimize the two tasks jointly; SSA-Hybrid (or SSA-H) takes the hybrid model which contains an additional self-supervised attention (SSA) layer in the network structure.

C. EVALUATION METRICS
For evaluation on the GLUE benchmark, following [27], we use different evaluation metrics for different datasets: MCC (Matthews correlation coefficient) for CoLA; Pearson and Spearman correlation coefficients for the regression task STS-B; accuracy and F1 score for MRPC and QQP; and accuracy for the others. For the SentiHood task, we consider the four aspects that appear most frequently (general, price, transit-location and safety) [26]. The evaluation metrics for aspect detection are strict accuracy, Macro-F1 and AUC, while those for sentiment classification are accuracy and macro-average AUC. For SemEval-2014 Task 4, we use Precision, Recall, and Micro-F1 to evaluate Subtask 3 (aspect category detection), and accuracy for Subtask 4 (aspect category polarity) [28]. The 4-way, 3-way and 2-way settings refer to how many categories are included in the calculation.
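For reference, the MCC used for CoLA can be computed from a binary confusion matrix as follows (a standard formula, sketched here for completeness; library implementations such as scikit-learn's exist):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Ranges from -1 (total disagreement) to 1 (perfect prediction);
    a zero denominator (e.g., a constant classifier) is mapped to 0.0.
    """
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

print(mcc(45, 40, 5, 10))
```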

D. DETAILED SETTINGS
All experiments are based on the publicly available PyTorch implementations of BERT and RoBERTa. For GLUE tasks, we follow the fine-tuning regime specified in [2] with the open-sourced code, and our reproduction numbers are on par with the numbers reported on the GLUE leaderboard (https://gluebenchmark.com/leaderboard).

TABLE 2. "BERT-m" represents our baseline model "BERT-mask"; "BERT-EDA" is another baseline model proposed by [7]; "BERT-SSA-Co" and "BERT-SSA-H" denote our SSA-Co and SSA-Hybrid methods respectively. The results on the left and right side of the character "/" for task MNLI represent MNLI-m and MNLI-mm correspondingly. We also conduct significance tests between the BERT-SSA models and BERT, where * means the improvement over vanilla BERT is significant at the 0.05 significance level and ** at the 0.01 significance level.

While many submissions to the GLUE leaderboard depend on multi-task fine-tuning or ensembles, our submission depends only on single-task fine-tuning. We employ grid search to find the optimal hyper-parameters over a pre-defined range: learning rate lr ∈ {1e-5, 2e-5}, batch size b ∈ {16, 32}, epochs T ∈ {2, 3, 5}, loss combination ratio α ∈ {0.7, 0.9}, linear combination ratio β ∈ {0.2, 0.5, 0.9}, and auxiliary data generation ratio for SSA and BERT-m γ ∈ {0.6, 1.0, 2.0}. The maximum sequence length is set to 512 in all of our experiments, and we truncate any extra length in each batch to accelerate training.

TABLE 3. GLUE test set results of BERT-large.
Further, we evenly split the validation set into three subsets when fine-tuning BERT models, and we choose the hyper-parameters that achieve high average performance with the smallest variance across the three validation subsets.

E. GLUE RESULTS
The experimental results on the GLUE benchmark are summarized in Table 2 and Table 3. It is a known problem that the train/dev split for WNLI is correct but somewhat adversarial, and many papers exclude WNLI from their tables, including the original BERT paper [2]. Therefore, we do not compare performance on this dataset (it is listed only for completeness).
As shown in Table 2, our methods consistently achieve better results than base BERT. Specifically, SSA-Co obtains better results on seven datasets while keeping on-par performance on the rest, demonstrating that the auxiliary SSA task is helpful for model generalization. Moreover, the hybrid model SSA-Hybrid performs better on all the datasets and pushes the average score of base BERT (reproduced version) from 78.4 to 79.3. The BERT-mask baseline also shows some advantage on several datasets, which can be explained by the effect of data augmentation; however, the decay on other datasets (MRPC, STS-B, QQP and MNLI) indicates that such improvement is unstable. The BERT-EDA baseline obtains results similar to BERT-mask, confirming that models pre-trained on massive corpora gain negligible improvement from simple data augmentation techniques. Our model is clearly more robust, demonstrating the superiority of the SSA-based hybrid model over models that only use data augmentation.
We also apply our solutions to more advanced baselines, i.e., BERT-large, RoBERTa-base, and RoBERTa-large. As shown in Table 3, SSA-Hybrid consistently outperforms other baselines on different datasets. For BERT-large, it wins on all datasets and leads to a 0.7 absolute increase on the average score. The results further verify the advantages of self-supervised attention layer. Moreover, SSA-Hybrid also achieves improvement on RoBERTa, especially on RoBERTa-base. For RoBERTa-large, one of the strongest single models in the literature, this still leads to an absolute increase of 0.5 on the average score. Considering that no external knowledge is leveraged and only a few extra weights are introduced to the model, these improvements are significant.
In addition, we discuss the gains across different data sizes. Among the GLUE classification datasets, RTE, MRPC and CoLA have the smallest training sets. Our model achieves 1.7%, 2.7% and 4.4% relative lifts on these tasks respectively compared to BERT-base, which is more significant than on larger datasets like MNLI. This is reasonable, because smaller datasets benefit more from generalization techniques.

F. ABSA RESULTS
TABLE 4. Base BERT performance on the SentiHood dataset, with the best performances in bold. Top: "BERT-single" represents the baseline models fine-tuned on BERT with the original datasets; "BERT-pair" is another baseline model proposed by [8], trained with auxiliary sentences. Bottom: "BERT-single-H" and "BERT-pair-H" denote the models incorporating our SSA layers.

TABLE 5. Base BERT performance on SemEval-2014 Task 4, Subtask 3 and Subtask 4. "n-w" means the accuracy of the n-way setting in Subtask 4. We also conduct significance tests between the SSA-Hybrid-enhanced models and vanilla BERT in Subtask 4, where * means the improvement over the vanilla model is significant at the 0.05 significance level.

Tables 4 and 5 present the results of applying our approach to the ABSA tasks and show that SSA-based BERT obtains consistently increased performance on both the BERT-single and the data-augmented BERT-pair configuration. Specifically, from the experiments on SentiHood, we learn that the proposed SSA provides complementary information for pre-trained BERT: not only the sentiment classification task but also the aspect detection task benefits from our model. Across all the evaluation metrics, BERT-pair-H, which combines the SSA layer with the data augmentation method, performs best. BERT-pair-H obtains 82.0% accuracy on the aspect detection task, a 2.2% absolute improvement over the original BERT-pair model. We also observe that performance on the sentiment classification task improves when using the SSA hybrid model. For SemEval-2014 Task 4, the F1 scores are pushed forward substantially, by at least 0.4% compared to the baseline models. Notably, our proposed model achieves further improvement on top of the data augmentation of constructing auxiliary sentences [8]. We conjecture that the generalization ability learned by SSA plays a different role from that of existing data-augmentation methods, which will be studied further.
G. TRAINING COST
To evaluate the extra training cost SSA pays for the performance improvement, we analyze the training cost on the SST-2 dataset. The results are shown in Figure 6. As expected, BERT and BERT-m converge faster than BERT-SSA-H in the initial iterations, since the SSA-based solution contains more parameters and is harder to optimize. After one epoch, however, the SSA-based model begins to show its advantage and achieves a lower loss than the baseline models for the same training time. Note that each time the loss drops, the SSA-based model lags slightly behind the two baselines, indicating the time cost of label generation. Nevertheless, as shown in the figure, this cost is minor compared with the total training time, and the final performance improvement justifies it.

H. SENSITIVITY STUDY
To investigate the effectiveness of the SSA label generation strategy, we conduct experiments on SST-2 under different generation ratios γ; a larger γ means more sentences are generated as training samples. We depict the comparison in Figure 7. Initially, all three models continue to improve as more sentences are generated. Once γ exceeds 0.8, however, the performance of BERT-mask degrades dramatically. In contrast, our solutions BERT-SSA-Co and BERT-SSA-Hybrid retain their gains and converge to better performance. The robustness of our model stems from the self-supervised attention layer, which identifies relevant words, and from a strict label generation strategy that leverages pre-trained knowledge to obtain self-supervised labels.

I. CASE STUDY
In this section, we visualize two cases to explain why the self-supervised attention layer improves the model. As illustrated, the SSA scores learned by the hybrid model are in line with human common sense. The first case is the motivating example, "a whole lot foul, freaky and funny". As shown in Figure 8(a), vanilla BERT is misled by the strongly negative word 'foul' and makes a wrong prediction. Our proposed SSA layer instead identifies 'funny' as the important token and puts less emphasis on 'foul'. The SSA layer correctly captures the relative word importance and generates a better sentence embedding for label prediction. As a result, the sentence is correctly classified as positive by our SSA-enhanced hybrid model.
The second case is the sentence "in its ragged, cheaps and unassuming way, the movie works.". It is more complicated than the first case because the additional negative words in this sentence may lead to a wrong sentiment prediction. As shown in Figure 8(b), vanilla BERT pays much more attention to 'ragged' and 'cheaps' than to 'works'. The reason is obvious: 'ragged' and 'cheaps' carry strong negative sentiment, while 'works' is not even an emotional adjective. However, the self-supervised attention layer is able to identify the actually important token, i.e., 'works', in such a misleading context. In this way, the final prediction is corrected by the SSA-based model.

J. ERROR ANALYSIS
To examine our results in more detail, we perform an error analysis on the test predictions and show two typical kinds of mistakes below. Table 6 presents three examples collected from the SST-2 dataset as illustrations.
The SSA mechanism has weaknesses in commonsense reasoning. When there are no explicit words relevant to the true label in the sentence, it is difficult to obtain improvement over the baseline models. For example, in the first sentence, the SSA-H model assigns a higher attention score to the token 'dominatrixes' while BERT focuses on 'raunchy'. Both words carry negative sentiment and lead to a wrong prediction, even though the sentence expresses that the actors are good at acting. Similarly, BERT labels the second sentence as strongly positive because of the token 'worthwhile'. The SSA-H model also makes a mistake here: although our method puts more attention on the word 'follow', that word does not provide a clear emotional tendency from which to derive the correct result. Therefore, SSA cannot supply extra background knowledge for commonsense reasoning, because all of its information comes from the model and the dataset. Under these circumstances, we conjecture that a pre-trained model integrated with domain knowledge would be a better solution.
The SSA mechanism obtains limited improvement in capturing intent revealed by consecutive snippets. Taking the third sentence as an illustration, we notice that both BERT and SSA make wrong predictions. Specifically, BERT assigns the largest attention score to the token 'glad', while SSA takes a step forward by highlighting the word 'truth'. For this case, however, the sentiment revealed by an individual word is not enough for the model to identify the sentiment polarity of the entire sentence; instead, the intent contained in the snippet "I was glad when it was over" is much more critical. Since we use a random masking mechanism, snippets consisting of multiple consecutive words are rarely sampled as a whole. Therefore, our future work will focus on improving the efficiency of SSA by designing better sampling strategies that cover more varied patterns.

V. CONCLUSIONS AND FUTURE WORK
In this paper, we propose a novel technique called self-supervised attention (SSA) to prevent BERT from overfitting when fine-tuned on small datasets. The hybrid model adds a self-supervised attention layer on top of the vanilla BERT model, which can be trained jointly through an auxiliary SSA task without introducing any external knowledge. We conduct extensive experiments on a variety of natural language understanding tasks, and the results demonstrate the effectiveness of the proposed methodology. The case study further shows that the token-level SSA scores learned by the model are in line with human common sense. In parallel with our research, many recent works focus on data augmentation and multi-task learning. As a next step, we plan to integrate our methodology with these advanced techniques to achieve further improvement and to evaluate the generality of our proposed solution.
YIREN CHEN received the bachelor's degree from School of Automation Science and Electrical Engineering, Beihang University. He is currently pursuing the Ph.D. degree with School of Electronics Engineering and Computer Science, Peking University. His main research interests include natural language processing and deep learning.
XIAOYU KOU received the bachelor's degree from School of Computer Science and Technology, Shandong University and the master's degree from School of Electronics Engineering and Computer Science, Peking University. Her main research interests include knowledge graph and media intelligent computing.
JIANGANG BAI received the bachelor's degree from School of Computer Science and Technology, Sichuan University. He is currently pursuing the Ph.D. degree with School of Electronics Engineering and Computer Science, Peking University. His main research interests include natural language processing and data mining.
YUNHAI TONG received the Ph.D. degree in computer science from Peking University in 2002. Currently he is a professor in the School of Electronics Engineering and Computer Science, Peking University. His main research interests include data mining, media intelligent computing and big data analysis.