Attention Fusion: a light yet efficient late fusion mechanism for task adaptation in NLU

Fine-tuning a pre-trained language model using annotated data has become the de facto standard for adapting general-purpose pre-trained models like BERT to downstream tasks. However, given the trend toward larger pre-trained models, fine-tuning these models for each downstream task is parameter-inefficient and computationally expensive, making this approach sub-optimal for adoption by NLU systems. In recent years, various approaches have been proposed for parameter-efficient task adaptation, such as Adapter, BitFit, prompt tuning and prefix tuning. However, most of these efforts insert task-specific parameters in between or inside intermediate layers of the pre-trained encoder, resulting in higher computational cost due to back-propagation of errors to all layers. To mitigate this issue, we propose a light but efficient attention-based fusion module which computes task-attuned token representations by aggregating intermediate layer representations from a pre-trained network. Our proposed fusion module trains only 0.0009% of total parameters and achieves performance competitive with the standard fine-tuning approach on various tasks. It is also decoupled from the pre-trained network, making it efficient during computation and scalable during deployment. Last but not least, we demonstrate that our proposed attention-fusion mechanism can transfer effectively to different languages for further re-use and expansion.


Introduction
Aligned with recent advancements in deep learning research, most state-of-the-art (SOTA) NLU models are built upon neural networks, especially transformer-based (Vaswani et al., 2017) architectures. However, these models require a large amount of domain-specific labeled examples for training, which is prohibitively expensive. The recent adoption of self-supervised pre-training and transfer learning mitigates the issues stemming from scarcity of labeled data (Yang et al., 2017), by pre-training with unsupervised tasks established upon massive unlabeled corpora (Devlin et al., 2019; Liu et al., 2019; Radford et al., 2019; Raffel et al., 2020). The resulting models encode syntactic and semantic linguistic information and can be fine-tuned with limited labeled examples on downstream NLU tasks, such as Question-answering (QA) (Rajpurkar et al.), Textual-entailment (Dagan et al., 2006) and Slot Labeling (SL). Fine-tuning is a commonly used method that adapts a pre-trained model to a downstream task and has been shown to achieve SOTA results in various NLU tasks. However, with larger pre-trained models and many downstream tasks, fine-tuning the whole model for each downstream task is inefficient and expensive for several reasons: higher memory consumption, since gradients and optimizer states need to be stored for all parameters; higher computational cost, since the error needs to be back-propagated through all layers; and the higher cost of hosting a large model for each task.
Parameter-efficient domain adaptation has been an area of interest in recent literature, comprising various approaches such as Adapter (Houlsby et al., 2019), BitFit (Ben-Zaken et al., 2021), DiffPrune (Guo et al., 2020) and Prompt tuning (Lester et al., 2021; Liu et al., 2021). Most of these efforts insert or append task-specific parameters in between or inside the pre-trained encoder layers; we refer to these approaches as early-fusion techniques, as task-specific parameters are fused inside the pre-trained network. Some drawbacks of early-fusion based methods are: during training, the loss has to be back-propagated to all layers, making them slower; and they are hard to scale in NLU systems, as the pre-trained encoder and task-specific modules are tightly coupled together. In contrast to early fusion, one can place task-specific modules after the pre-trained network, so that the pre-trained network is untouched regardless of downstream tasks; we refer to this as late-fusion. One late-fusion option is to concatenate (Cao et al., 2020) all layers from a pre-trained network and project them to a lower dimension for task-specific decoders. However, the projection matrices can be considerably large for larger models; e.g., concatenating and then projecting the hidden layers of a BERT-large model to a dimension of 256 amounts to 6.2 million parameters, leading to increased computational cost. These challenges hinder the deployment of SOTA transfer-learning based models to downstream NLU systems.
To address these challenges, we propose attention-fusion, a light but effective task-specific late-fusion based module, for adapting pre-trained models to downstream NLU tasks. Our proposed architecture decouples general purpose pre-trained models from downstream task-specific decoder layers with an attention-fusion module. The fusion module enables decoders to effectively adapt hidden representations from intermediate layers of the pre-trained network.
To examine the effectiveness of the attention-fusion mechanism, we conduct experiments on popular language understanding tasks, including QQP (Quora Question Pair), QNLI (Question-answering NLI), SST-2 (Stanford Sentiment Treebank), CONLL-03 (Named Entity Recognition) and a multilingual Spoken Language Understanding (SLU) (Tür et al., 2002; Huang and Chen, 2019) task using the mATIS dataset. Our results demonstrate that the attention-fusion module achieves comparable performance to the fine-tuning approach while tuning only a small number of parameters. Our attention-fusion approach is a late-fusion based mechanism and thus exhibits lower computation cost, since back-propagation is limited to the task-specific fusion module and decoder layers. Furthermore, we empirically show that the task-specific attention-fusion module is transferable across languages. We aim to release our code on GitHub to support further experimentation. In summary, our primary contributions are three-fold:
• Propose a light but efficient task-specific late-fusion module called attention-fusion, which is capable of aggregating representations from intermediate layers of the pre-trained model to adapt to a downstream NLU task.
• Demonstrate the benefit of the proposed module by evaluating both accuracy and computation efficiency on various tasks.
• Analyze how the attention-fusion module interacts with pre-trained models and show that such a module is task-specific and can transfer effectively to different languages.

Related Work
The importance of efficiently fine-tuning and deploying pre-trained networks to NLU systems has gained wider recognition. In this section, we discuss various approaches proposed in literature.

Model Compression and Distillation
One research direction focuses on building compact pre-trained networks with techniques like model compression (Bucilua et al., 2006; Ganesh et al., 2020), pruning (Gordon et al., 2020; Han et al., 2016; Wang et al., 2019b), quantization and knowledge distillation (Hinton et al., 2015). DistilBERT (Sanh et al., 2019) and TinyBERT (Jiao et al., 2020) use a knowledge distillation framework to train a smaller student network by matching its layer outputs with those of a larger teacher model. ALBERT (Lan et al., 2019) reduces parameters through weight-sharing across all transformer layers and by factorizing the embedding matrix. Zafrir et al. (2019) applied 8-bit integer quantization to reduce BERT model size by 4x. However, these approaches still suffer from sub-optimal accuracy as the model size gets smaller.

Lightweight Fine-tuning
Another line of research focuses on using a small number of extra parameters along with the pre-trained network. Among popular methods, Adapter, proposed by Houlsby et al. (2019), inserts a task-specific bottleneck module between pre-trained network layers. Other ideas re-parameterize the pre-trained network partially: Ben-Zaken et al. (2021) proposed to tune only the bias terms of the pre-trained network for each task; Guo et al. (2020) formulate task-specific fine-tuning as learning a diff vector that is added to the pre-trained network. Both are shown to match the full fine-tuning approach on accuracy while using less than 0.1% trainable parameters. More recently, Lester et al. (2021) and Liu et al. (2021) suggested appending extra prompt tokens to layers of the model to control the output while keeping the network frozen. There are also efforts focusing on using intermediate layers of the pre-trained network for different tasks: Peters et al. (2018) proposed to learn a weighted-sum representation from the intermediate layers of the model, while Cao et al. (2020) suggested concatenating the intermediate layers of the pre-trained network.

Probing in Transformers
The significant performance gain brought about by pre-training has emphasized the need to better understand the correlation between pre-trained network architecture and the resulting language representations. Studies (Tenney et al., 2019b,a; Kovaleva et al., 2019) suggest that pre-trained models can encode a range of syntactic and semantic information in different layers of the network. Complex linguistic structures are represented hierarchically in the higher layers of the model. In comparison, simple language clues are encoded in lower layers. Inspired by these findings, we propose a task-specific attention-fusion architecture to more effectively utilize hidden representations of different granularity from pre-trained networks.

Approach
In this section, we describe the proposed task-specific fusion model architecture, which augments general-purpose pre-trained models with task-dependent attention over encoded representations of multiple granularities, followed by prediction layers for NLU tasks. The attention-fusion module aims to improve performance and parameter efficiency by sharing parameters of the pre-trained model with other tasks.

Late-Fusion vs. Early-Fusion
Most of the existing methods adopt early-fusion for task adaptation, by inserting a light-weight task-specific module or appending prompts inside the pre-trained encoder, or by tuning only a small portion of the parameters of the pre-trained network to adapt to a downstream task. With early-fusion, the light-weight module can take advantage of the depth of the pre-trained network to adjust the model output to the downstream task. One drawback of early-fusion based approaches is that the task-specific module is tightly coupled with the pre-trained network, making the training process costly and slow. Early-fusion based methods require the loss to be back-propagated to all layers of the pre-trained network, since the light tunable modules reside in each layer of the network. One solution to address this inefficiency is to adopt late-fusion, where the entire pre-trained network is kept frozen and the task-specific module is placed after the pre-trained encoder. In such a setting, the pre-trained network is decoupled from the downstream task and the training loss is only back-propagated to the task-specific parameters, i.e., the attention-fusion module and decoder layers, making the training process more efficient.

Figure 1: The architecture of the proposed task-specific attention-fusion module. It also depicts the shareability of a pre-trained network among different tasks.

Model Architecture
We propose attention fusion, a late-fusion module that utilizes hidden representations from a shareable, general-purpose pre-trained model, such as BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019), for downstream task adaptation. Figure 1 visualizes the proposed architecture, including the fusion layer and how it fits into a pre-trained network (in our case, BERT-large or multi-lingual BERT). The parameters in the pre-trained network (including the embedding layer), colored in blue, are frozen, making the pre-trained model shareable with other tasks. The pre-trained encoder takes a sequence of tokens as input and generates an encoded representation for every token at each layer. The task-specific decoding network, colored in green, is tailored per downstream task. For NLU tasks, decoders typically include a feed-forward layer and a softmax layer for the final output. The attention fusion module, also colored in green, connects the pre-trained encoder to the task-specific decoder. This fusion module is used to extract useful features from the intermediate (and final) layers of the pre-trained encoder for the task-specific decoder. During training, errors are propagated only to the task-specific parts (the green components in the figure), which makes training memory and computation efficient.
The fusion module is also extremely light-weight: for a BERT-large model, the attention fusion module adds parameters amounting to only 0.0009% of the total parameters of the pre-trained encoder.
The proposed approach is scalable when the system grows to include more tasks, as the most computation- and memory-intensive component, the encoder, is shared and frozen, while the task-specific decoder and attention fusion module are decoupled from the pre-trained network and trained on each downstream task. Moreover, with the addition of a fusion module, we can achieve comparable performance to fine-tuning without adapting the pre-trained network, by surfacing the pertinent information already encoded and buried in the intermediate layers for different downstream tasks.

Attention-Fusion Module
The encoded representation of a token is obtained by focusing on different layers of a pre-trained network for a given downstream task, and this focus shifts based on the task at hand. To attend to the corresponding token-level representation across different layers for a given downstream task, we propose an attention-fusion module that learns a task-specific token representation by pooling intermediate layer representations at the token level. More specifically, for each task $t$, we use an attention query vector, denoted as $Q_t$. This query vector is a task-specific representation which can be either learned during training or adopted from a pre-trained one (learned on the same task, on the same or a different dataset). We further denote the representation of token $i$ at layer $j$ as $V_i^j$ and the attention weight of token $i$ at layer $j$ for task $t$ as $\alpha_i^j(t)$, which can be calculated as a softmax over layers of the scores between the query and the layer representations:

$$\alpha_i^j(t) = \frac{\exp\left(Q_t^\top V_i^j\right)}{\sum_{k} \exp\left(Q_t^\top V_i^k\right)}$$

Thus, the contextual representation of token $i$ for task $t$, denoted as $c_i(t)$, can be calculated as the weighted sum of the representations of token $i$ across all vertical layers:

$$c_i(t) = \sum_{j} \alpha_i^j(t)\, V_i^j$$

We denote this attention-based pooling mechanism as attention-fusion in our experiments. The re-computed fused token representation is then projected and connected to feed-forward layers and a final softmax layer.
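The attention-based pooling described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: it assumes the attention score is the unscaled dot product between the query vector and each layer's token vector, normalized by a softmax across layers, and the function names and toy numbers are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_fusion(layer_states, query):
    """Fuse per-layer token vectors into one vector per token.

    layer_states[j][i] is the hidden vector of token i at layer j.
    query is the task-specific query vector (same size as a hidden vector).
    Returns (fused, weights), where fused[i] is the fused vector of token i
    and weights[i][j] is the attention weight of token i on layer j.
    """
    num_layers = len(layer_states)
    num_tokens = len(layer_states[0])
    hidden = len(query)
    fused, weights = [], []
    for i in range(num_tokens):
        # Assumed scoring function: plain dot product with the query.
        scores = [dot(query, layer_states[j][i]) for j in range(num_layers)]
        alpha = softmax(scores)  # weights over layers sum to 1 for each token
        fused.append([
            sum(alpha[j] * layer_states[j][i][d] for j in range(num_layers))
            for d in range(hidden)
        ])
        weights.append(alpha)
    return fused, weights


# Toy input: 3 layers, 2 tokens, hidden size 2 (made-up numbers).
layer_states = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [1.0, 1.0]],
    [[3.0, 0.0], [0.5, 0.5]],
]
query = [1.0, 0.0]
fused, weights = attention_fusion(layer_states, query)
```

Because the weights are computed per token, two tokens in the same sentence can draw on different layers, and the only fusion parameter in this sketch is the query vector itself.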
There are other ways to extract token representations from an encoder. Peters et al. (2018) proposed that all layers be combined with a weighted average pooling operation, $c_k = \sum_{j=0}^{L} s_j h_{k,j}$. The weight vector is optimized as part of the task model, so that it may preferentially mix contextual information represented in different layers of the model for the task. We refer to this approach as linear-fusion. Cao et al. (2020) proposed to concatenate all BERT layers, then project to a feed-forward layer before passing to decoders. We refer to this approach as concat-fusion. For comparison purposes, we add linear-fusion and concat-fusion as our baseline alternatives for late-fusion approaches.
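The two baseline late-fusion variants can be sketched in the same style. This is an illustrative sketch only: the layer weights are passed in as given numbers (how they are normalized or learned is not modeled), the projection after concatenation is not implemented, and all function names are hypothetical.

```python
def linear_fusion(layer_states, layer_weights):
    """Weighted sum over layers, with the SAME weights for every token.

    layer_states[j][k] is the hidden vector of token k at layer j;
    layer_weights[j] is the scalar weight of layer j.
    """
    num_tokens = len(layer_states[0])
    hidden = len(layer_states[0][0])
    return [
        [
            sum(s * layer[k][d] for s, layer in zip(layer_weights, layer_states))
            for d in range(hidden)
        ]
        for k in range(num_tokens)
    ]


def concat_fusion(layer_states):
    """Concatenate every layer's vector for each token (projection omitted)."""
    num_tokens = len(layer_states[0])
    return [
        [x for layer in layer_states for x in layer[k]]
        for k in range(num_tokens)
    ]


def concat_projection_weights(hidden, num_layers, out_dim):
    """Weight count of a dense layer projecting the concatenation to out_dim."""
    return hidden * num_layers * out_dim


# Toy input: 3 layers, 2 tokens, hidden size 2 (made-up numbers).
layer_states = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 2.0], [1.0, 1.0]],
    [[3.0, 0.0], [0.5, 0.5]],
]
linear = linear_fusion(layer_states, [0.2, 0.3, 0.5])
concat = concat_fusion(layer_states)

# BERT-large-like sizes: hidden 1024, 24 layers, projection to 256 units.
big = concat_projection_weights(1024, 24, 256)   # 6,291,456 weights
single = concat_projection_weights(1024, 1, 256)  # 262,144 weights
```

The contrast with attention-fusion is that linear-fusion applies one weight per layer to all tokens, while concat-fusion keeps everything and pays for it in the projection layer, whose weight count grows linearly with the number of layers (a factor of 24 in the sizes above).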

Experiment
We evaluate attention-fusion as well as other popular light-weight fine-tuning approaches on 5 popular general language understanding tasks.

Dataset
Sentiment Analysis: We use the SST-2 dataset belonging to the GLUE benchmark (Wang et al., 2019a) to perform a single-sentence binary classification task. We report accuracy on the development set.
Paraphrase Similarity: We use the QQP dataset belonging to the GLUE benchmark to perform a sentence-pair binary classification task. We report accuracy on the development set.
Natural Language Inference: We use the QNLI dataset belonging to the GLUE benchmark to perform a sentence-pair multi-class classification task. We report accuracy on the development set.
Named Entity Recognition: We use the CoNLL-2003 dataset, which is a widely adopted NER benchmark (Tjong Kim Sang and De Meulder, 2003). We report micro-F1 score on the test set.
Spoken Language Understanding: We use the public mATIS (Mansour and Batool, 2020) dataset to perform Intent Classification (IC) and Slot Labeling (SL) tasks. The dataset is originally transcribed in English and then manually translated into four languages: EN, FR, DE and ES; thus, the ontology of the data in all 4 languages is the same. We report micro F1 score on the test set.

Experiment Setup
For monolingual tasks, we compare the proposed attention-fusion module on four popular general language understanding tasks (QQP, QNLI, SST-2 and CONLL-03) with:
• two baselines: a standard fine-tuning mechanism that trains the entire network on the downstream task (denoted as fine-tune), and the case where the pre-trained encoder is kept frozen and only the parameters in decoder layers are fine-tuned (denoted as last-layer)
• different late-fusion mechanisms such as concat-fusion and ELMo style linear-fusion.
We evaluate the multilingual and cross-lingual capabilities of the attention fusion module on IC and SL tasks using the mATIS dataset under two learning regimes: using full-sized training data and few-shot learning.
For monolingual tasks, we use a BERT-large encoder from the publicly available GluonNLP; for multilingual tasks, we use an in-house pre-trained mBERT base encoder trained on 8 languages. Both pre-trained models are trained on public data, including Wikipedia, Books corpus, and CommonCrawl corpus. For decoders, we use two feed-forward layers of hidden size 256, followed by a softmax layer for sequence classification tasks, or by a CRF layer and a softmax layer for sequence labeling tasks.
Throughout our experiments, we train all models (baselines and fusion variants) with mini-batch sizes ranging from 16 to 64, on 2 Nvidia Tesla V100 GPUs. We adopt the Adam optimizer for all our experiments and use a learning rate of 2e-5 for the fine-tune baseline and 2e-3 for the late-fusion models. For all experiments, we report the mean over runs with 3 random seeds.

Table 1: Results on monolingual English tasks comparing the attention-fusion module against various light-weight fine-tuning approaches. We report the accuracy metric on each of these tasks (higher scores are better). Additionally, we present the percentage of trainable parameters for each of these approaches as well as the fusion type. A score shown in bold with underscore indicates the best score across all approaches, while bold font indicates the best score among light-weight fine-tuning approaches. * indicates accuracy on the dev set. For results on BitFit, DiffPrune, Prompt-tuning v1 and v2, we quote the numbers listed in the paper if available; otherwise, we produce the numbers using their code and settings. Columns: Model, Fusion type, % params, QQP*, QNLI*, SST-2*, CONLL-03, AVG.

Results and Discussion
In this section, we present the results on monolingual and multilingual tasks, compare the training efficiency and analyze the interactions of attention fusion with different layers across different tasks.

Evaluation on Monolingual Tasks
Table 1 compares the performance of the attention-fusion module with various light-weight fine-tuning approaches proposed in the literature on the QQP, QNLI, SST-2 and CONLL-03 datasets. We quote numbers for BitFit, DiffPrune, Prompt-tuning and Prompt-tuning v2 from their published results, while for Adapter, we reproduced results using the code and settings suggested by the authors since their published numbers are on the test set only. For late-fusion baselines, we compare against ELMo style linear-fusion and concat-fusion. We also record the percentage of additional trainable parameters to demonstrate the computational cost associated with each approach. Across all tasks, the attention-fusion module improves by an average of 13.2 absolute points over the last-layer baseline model, which indicates the need to harness the intermediate layer representations of a network for a downstream task. Among the different light-weight fine-tuning mechanisms, attention-fusion achieves the best performance on the QQP and CONLL-03 tasks and comparable performance on SST-2, while seeing a degradation of 4 points on QNLI. We hypothesize that the NLI task requires some hierarchical mapping of semantic features from the representation space, which limits the performance of late-fusion based methods on such a task. On average, attention-fusion is 1.9 absolute points behind the fine-tune approach and achieves comparable performance against other early-fusion approaches while only training a small fraction (0.0009%) of parameters. Among late-fusion methods, ELMo style linear-fusion is behind attention fusion by 8.3 absolute points, with significant degradation in pair-utterance tasks like QQP and QNLI, indicating that using an identical weight assignment for all tokens in the sequence is suboptimal and that a more flexible and nuanced fusion mechanism is required. Concat-fusion shows comparable performance to attention-fusion, but it is not as light-weight as attention-fusion due to the extra parameters added to project the concatenated layers to the downstream decoders. When the concatenated hidden representations are utilized by the decoder, the size of the FFN layers in the decoder increases: for a BERT-large model with hidden size 1024 and 24 layers, the first FFN layer in the decoder has (1024 * 24) * decoder_hidden_units weights, which is 24x the size of the FFN layer in other approaches.

Evaluation on Multilingual Tasks
Table 2 shows the performance of mBERT models on IC and SL tasks with different types of late-fusion modules applied, along with the fine-tune and last-layer baselines, on four languages in the mATIS dataset. We observe a significant improvement in performance for both IC and SL tasks across all languages (an average of 6.4 absolute F1 score increase in IC and 3.6 absolute F1 score increase in SL) using attention-fusion compared to the last-layer model. Attention-fusion also achieves comparable performance to the fine-tune model in both IC and SL tasks (an average of 0.12 absolute F1 score increase in IC and 0.13 absolute F1 score increase in SL). Linear-fusion and concat-fusion also outperform the last-layer baseline, demonstrating the effectiveness of utilizing intermediate layer representations of the encoder.
With the rising popularity of NLU systems, there is a need to expand them to new languages. An open challenge with language expansion is the scarcity of annotated data in the new language. A popular way to tackle this challenge is through transfer learning; thus, we examine the language transferability of our proposed architecture in a few-shot learning setting compared to the baseline models. We measure F1 score on IC and SL tasks on four languages (EN, FR, DE, and ES) for the public mATIS dataset. We create a few-shot dataset for each of the four languages by down-sampling the original training dataset to 5% of the original size. Table 3 summarizes the IC and SL performance of models fine-tuned on full-sized training data in the source language plus few-shot data in the target language, and evaluated on the target language. For example, EN -> FR indicates a model fine-tuned on full-size EN data and few-shot FR data, and evaluated on the FR dataset. We observe that attention-fusion improves transferability over the last-layer baseline by a large margin (an average increase of 13 points F1 score for IC and 5.4 points F1 score for SL), achieving comparable performance with the fine-tune model on IC. Even though the F1 scores are lower than those of the fine-tune model for SL in DE and ES, using attention-fusion allows for language expansion at a significantly lower cost compared to the fine-tune baseline. The overall result suggests that attention-fusion can effectively improve knowledge transfer across languages.
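The construction of the transfer training sets can be sketched as follows. The text only states that the target-language training data is down-sampled to 5%; uniform random sampling with a fixed seed is an assumption made here for illustration (stratifying by intent, for example, would be an equally plausible choice), and the function names are hypothetical.

```python
import random


def make_few_shot(examples, fraction=0.05, seed=0):
    """Keep a random `fraction` of the examples (at least one).

    Uniform sampling is an assumption of this sketch, not a detail
    taken from the text.
    """
    rng = random.Random(seed)
    keep = max(1, round(len(examples) * fraction))
    return rng.sample(list(examples), keep)


def build_transfer_set(source_full, target_full, fraction=0.05, seed=0):
    """Full source-language data plus few-shot target-language data."""
    return list(source_full) + make_few_shot(target_full, fraction, seed)


# Toy stand-ins for an EN -> FR setup: 1000 examples per language.
en_train = [("en", n) for n in range(1000)]
fr_train = [("fr", n) for n in range(1000)]
en_to_fr = build_transfer_set(en_train, fr_train)
```

In this toy setup the combined set holds all 1000 source examples and 50 target examples; evaluation would then use the target-language test set only.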

Training Efficiency
In this section, we examine training efficiency through the number of trainable parameters and the convergence speed of attention-fusion in comparison with other methods. A popular approach is Adapter, in which the number of extra parameters is num_layers × (2 × m × d + m + d); for BERT-large (m = 1024, num_layers = 24) with a bottleneck dimension of 256, this results in 12.6 million trainable parameters. BitFit adjusts the bias terms in the Query, Key and Value matrices, as well as in the projection and feed-forward modules of all layers; for BERT-large, this amounts to 270K trainable parameters. Prompt-tuning appends prompts to the first layer or to all layers (Liu et al., 2021), and the preferred prompt length ranges from 20 to 100 tokens; for BERT-large, the number of trainable parameters is (1024 × prompt length) if the prompt is inserted into the first layer, or (1024 × prompt length × num_layers) when inserted into all layers. In comparison, the attention-fusion mechanism is more efficient than early-fusion based methods for the following reasons:
• it uses a query vector of the same size as the pre-trained encoder hidden dimension (e.g., 1024 parameters for BERT-large), which adds significantly fewer parameters for training compared to early-fusion mechanisms
• the size of the fusion module does not grow with the number of layers in the pre-trained network, unlike some other approaches
• it uses a late-fusion mechanism, in which the backward pass and model update are performed only on the fusion module and decoder layers, resulting in faster training.
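The parameter counts quoted above follow directly from the stated formulas, which the short script below evaluates for BERT-large. The function names are hypothetical, and the BitFit count is omitted because the text gives it only as a total (270K) without a formula.

```python
def adapter_params(num_layers, hidden, bottleneck):
    """num_layers x (2 x m x d + m + d), as given in the text."""
    return num_layers * (2 * hidden * bottleneck + hidden + bottleneck)


def prompt_params(hidden, prompt_length, num_layers=1):
    """hidden x prompt length, times num_layers if added to every layer."""
    return hidden * prompt_length * num_layers


def attention_fusion_params(hidden):
    """One query vector of the encoder's hidden size."""
    return hidden


HIDDEN, LAYERS, BOTTLENECK = 1024, 24, 256  # BERT-large sizes from the text

counts = {
    "Adapter": adapter_params(LAYERS, HIDDEN, BOTTLENECK),         # 12,613,632
    "Prompt (first layer, len 20)": prompt_params(HIDDEN, 20),     # 20,480
    "Prompt (first layer, len 100)": prompt_params(HIDDEN, 100),   # 102,400
    "Prompt (all layers, len 20)": prompt_params(HIDDEN, 20, LAYERS),    # 491,520
    "Prompt (all layers, len 100)": prompt_params(HIDDEN, 100, LAYERS),  # 2,457,600
    "Attention-fusion query": attention_fusion_params(HIDDEN),     # 1,024
}

for name, n in counts.items():
    print(f"{name:>30}: {n:,}")
```

Only the attention-fusion count is independent of both the number of layers and any length or bottleneck hyperparameter, which is the second point in the list above.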
In Figure 3, we report training loss over time for attention fusion and some other methods. We chose BitFit to represent an early-fusion approach due to its simplicity and effectiveness, and linear-fusion as an alternative late-fusion mechanism. We train all models with the same number of GPUs and batch size, and adopt the learning rate suggested by the respective published paper. As shown in the plot, attention-fusion converges faster than BitFit. We hypothesize that this is because late fusion does not need to back-propagate the loss to all layers, making it faster to train and converge. The last-layer baseline is the least performant, given that it does not harness intermediate representations; on the other hand, despite using intermediate layers, linear-fusion does not perform as well as attention fusion, suggesting the effectiveness and efficiency of the task-specific attention mechanism.

Analysis on Attention-fusion
In this section, we analyse the role and nature of the attention-fusion module. We visualize the distribution of attention weights after the softmax operation for different layers of mBERT in Figure 2. Along with the 12 layers of the mBERT-base encoder, we also attend to the embedding layer. Hence, the X-axis of all plots indicates layers 1-13, with 1 being the embedding layer and 13 being the 12th layer of BERT. The Y-axis denotes the attention weight associated with a layer. The attention weights across layers sum up to 1.0.
We show the attention weight distribution for IC and SL tasks in Figure 2 to investigate the learned attention patterns for different tasks. We observe that the attention-fusion module attends to mBERT layers differently for different tasks. IC focuses on mid-late layers while SL focuses on early and mid layers. The result demonstrates that the learned attention weights vary across tasks and thus attention-fusion can improve task adaptability with its flexibility in using intermediate representations. We hypothesize that IC relies on higher-level semantic information, while SL attends to both token-level embedding input from lower layers as well as contextual information from higher layers.
We also visualize the weights learned for IC and SL tasks on four languages in Figure 2 to compare the attention patterns across languages. We observe that all languages learn a similar attention weight distribution over the mBERT layers for each of the tasks. This observation explains the language transferability of the attention-fusion model as described in the previous section. To further demonstrate that the attention-fusion module is task-specific and language-agnostic, we take a pre-trained attention-fusion module trained on EN, freeze it and use it for different languages such as DE, FR and ES. We then compare this with the baseline in which the attention-fusion module is trained and evaluated on the same target language. As shown in Table 4, we observe comparable results against the baseline; this suggests that the attention-fusion module is task-dependent and can transfer effectively across different languages.
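One simple way to summarize per-token attention weights into a per-layer profile is to average them over tokens, as sketched below. Whether the figure plots means or full distributions is not stated, so the averaging is an assumption of this sketch; the function name and the toy weights are hypothetical, and the layer numbering follows the convention above (1 is the embedding layer).

```python
def mean_layer_weights(token_weights):
    """Average attention weight per layer over all tokens.

    token_weights[i][j] is the attention weight of token i on layer j,
    with each row summing to 1. Returns (layer_number, mean_weight) pairs,
    numbering layers from 1 (the embedding layer).
    """
    num_tokens = len(token_weights)
    num_layers = len(token_weights[0])
    means = [
        sum(row[j] for row in token_weights) / num_tokens
        for j in range(num_layers)
    ]
    return list(enumerate(means, start=1))


# Toy weights for 2 tokens over 4 layers (made-up numbers).
profile = mean_layer_weights([
    [0.1, 0.2, 0.3, 0.4],
    [0.3, 0.2, 0.1, 0.4],
])
```

Since every row sums to 1, the averaged profile also sums to 1, so profiles from different tasks or languages can be compared on the same scale.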

Conclusion
With the rising popularity of transfer learning in NLU, the challenge of adapting pre-trained models to NLU tasks effectively and efficiently is becoming increasingly relevant. To address this challenge, we propose a light yet efficient task-specific attention-fusion module which enables parameter sharing and efficient fine-tuning for downstream tasks. We demonstrate that our proposed late-fusion module achieves comparable performance to other popular methods as well as the fine-tuning approach, while using fewer tunable parameters per task. We also show that the task-specific attention-fusion module is transferable across languages, enabling language expansion work in NLU at a much lower cost.