Abstract
Attention networks often base their decisions on only a few tokens, even
when those tokens are not truly indicative of the meaning or intent of
the full context. This can lead to overfitting in Transformers and
hinder their ability to generalize.
Attention regularization and sparsity-based methods have been proposed
to mitigate this issue. However, these methods cannot guarantee that all
tokens have sufficient receptive fields for inferring global
information, so the impact of individual token biases is not effectively
reduced. As a result, these approaches generalize only slightly better
from training data to new data. To address these limitations, we propose
a balanced sparsity (BaS) regularized attention network built on the
Transformer, called BaSFormer. BaS regularization imposes a K-regular
graph constraint on self-attention connections and replaces SoftMax with
SparseMax in the attention transformation.
In BaS-regularized self-attention, SparseMax assigns exactly zero
attention to low-scoring connections, highlighting influential and
meaningful contexts. The K-regular graph constraint ensures that every
token has an equal-sized receptive field for aggregating information,
which lets global tokens participate in the feature updates of each
layer and reduces the impact of individual biases. Because no continuous
loss function directly encodes the K-regular graph constraint, we
propose an exponential extremum loss with an augmented Lagrangian.
Experimental results show that BaSFormer debiases more effectively than
the newest large language models, such as ChatGPT, GPT-4, and LLaMA. In
addition, BaSFormer achieves new state-of-the-art results on text
generation tasks. Interestingly, our evaluation also shows that
BaSFormer learns hierarchical linguistic dependencies, visible in its
gradient attributions, which improves interpretability and adversarial
robustness.
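
For concreteness, the following is a minimal sketch, assuming PyTorch,
of self-attention with SoftMax swapped for SparseMax (Martins &
Astudillo, 2016) so that low-scoring connections receive exactly zero
weight. The function names are illustrative rather than taken from the
paper, and the K-regular graph constraint itself is enforced during
training by the proposed exponential extremum loss, which is not
reproduced here.

```python
import torch

def sparsemax(scores: torch.Tensor) -> torch.Tensor:
    """Sparsemax over the last dimension: the Euclidean projection of the
    score vector onto the probability simplex, which zeros low entries."""
    z, _ = torch.sort(scores, dim=-1, descending=True)
    cssv = z.cumsum(-1) - 1.0                       # cumulative sums minus 1
    ks = torch.arange(1, scores.size(-1) + 1,
                      device=scores.device, dtype=scores.dtype)
    support = ks * z > cssv                         # entries kept in the support
    k_z = support.sum(dim=-1, keepdim=True)         # support size per row
    tau = cssv.gather(-1, k_z - 1) / k_z.to(scores.dtype)
    return torch.clamp(scores - tau, min=0.0)       # exact zeros off-support

def sparse_attention(q: torch.Tensor, k: torch.Tensor,
                     v: torch.Tensor) -> torch.Tensor:
    """Single-head scaled dot-product attention with SparseMax weights."""
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return sparsemax(scores) @ v

# Example: many of the resulting attention weights are exactly zero.
q, k, v = (torch.randn(2, 5, 8) for _ in range(3))  # (batch, tokens, dim)
out = sparse_attention(q, k, v)                      # shape (2, 5, 8)
```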