Explain and Improve: LRP-Inference Fine-Tuning for Image Captioning Models

This paper analyzes the predictions of image captioning models with attention mechanisms beyond visualizing the attention itself. We develop variants of layer-wise relevance propagation (LRP) and gradient-based explanation methods, tailored to image captioning models with attention mechanisms. We systematically compare the interpretability of attention heatmaps against the explanations provided by explanation methods such as LRP, Grad-CAM, and Guided Grad-CAM. We show that explanation methods simultaneously provide pixel-wise image explanations (supporting and opposing pixels of the input image) and linguistic explanations (supporting and opposing words of the preceding sequence) for each word in the predicted captions. We demonstrate with extensive experiments that explanation methods 1) can reveal additional evidence used by the model to make decisions compared to attention; 2) correlate with object locations with high precision; 3) are helpful to "debug" the model, e.g. by analyzing the reasons for hallucinated object words. With the observed properties of explanations, we further design an LRP-inference fine-tuning strategy that reduces the issue of object hallucination in image captioning models while maintaining sentence fluency. We conduct experiments with two widely used attention mechanisms: the adaptive attention mechanism calculated with additive attention and the multi-head attention mechanism calculated with the scaled dot product.


I. INTRODUCTION
Image captioning is a setup that aims at generating text descriptions from image representations. This task requires a comprehensive understanding of the image content and a well-performing decoder which translates image features into sentences. The combination of a convolutional neural network (CNN) and a recurrent neural network (RNN) is a commonly used structure in image captioning models, with CNN as the image encoder and RNN as the sentence decoder [1], [2], [3]. An established feature of image captioning is the attention mechanism that enables the decoder to focus on a sub-region of the image when predicting the next word of the caption [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15]. Attentions are usually visualized as attention heatmaps, indicating which parts of the image are related to the generated words. As such, they are a natural resource to explain the prediction of a word. Furthermore, attention heatmaps are usually considered as the qualitative evaluations of image captioning models in addition to the quantitative evaluation metrics such as BLEU [16], METEOR [17], ROUGE-L [18], CIDEr [19], SPICE [20]. Attention heatmaps provide a certain level of interpretability for image captioning models since they can reflect the locations of objects. However, the outputs of image captioning models rely on not only the image input but also the previously generated word sequence. Attention heatmaps alone meet difficulties in disentangling the contributions of the image input and the text input.
To gain more insight into image captioning models, we adapt layer-wise relevance propagation (LRP) and gradient-based explanation methods (Grad-CAM, Guided Grad-CAM [21], and GuidedBackpropagation [22]) to explain image captioning predictions with respect to the image content and the words of the sentence generated so far. These approaches provide high-resolution image explanations for CNN models [22], [23]. LRP also provides plausible explanations for LSTM architectures [24], [25]. Figure 1 shows an example of the explanation results of attention-guided image captioning models. Taking LRP as an example, both positive and negative evidence is shown in two aspects: 1) for image explanations, the contribution of the image input is visualized as heatmaps; 2) for linguistic explanations, the contribution of the previously generated words to the latest predicted word is shown.
The explanation results in Figure 1 exhibit an intuitive correspondence of the explained word to the image content and the related sequential input. However, to the best of our knowledge, few works quantitatively analyze how accurately the image explanations are grounded to the relevant image content and whether the highlighted inputs are used as evidence by the model to make decisions. We study these two questions by quantifying the grounding property of attention and explanation methods and by designing an ablation experiment for both the image explanations and the linguistic explanations. We will demonstrate that explanation methods can generate image explanations with an accurate spatial grounding property and, at the same time, reveal more related inputs (pixels of the image input and words of the linguistic sequence input) that are used as evidence for the model decisions. Also, explanation methods can disentangle the contributions of the image and text inputs and provide more interpretable information than purely image-centered attention.
With explanation methods [26], we have a deeper understanding of image captioning models beyond visualizing the attention. We also observe that image captioning models sometimes hallucinate words from the learned sentence correlations without looking at the images and sometimes use irrelevant evidence to make predictions. The hallucination problem is also discussed in [27], where the authors state that it is possibly caused by language priors or visual mis-classification, which could be partially due to the biases present in the dataset. The image captioning models tend to generate those words and sentence patterns that appear more frequently during training. The language priors are helpful, though, in some cases. [28] incorporates the inductive bias of natural language with scene graphs to facilitate image captioning. However, language bias is not always correct, for example, not only men ride snowboards [29] and bananas are not always yellow [30], [31]. To this end, [29] and [31] attempted to generate more grounded captions by guiding the model to make the right decisions using the right reasons. They adopted additional annotations, such as the instance segmentation annotation and the human-annotated rank of the relevant image patches, to design new losses for training.
In this paper, we reduce object hallucination by a simple LRP-inference fine-tuning (LRP-IFT) strategy, without any additional annotations. We first show that the explanations, especially LRP, can weakly differentiate the grounded (true-positive) and hallucinated (false-positive) words. Secondly, based on the findings that LRP reveals the related features of the explained words and that the sign of its relevance scores indicates supporting versus opposing evidence (as shown in Figure 1), we utilize LRP explanations to design a re-weighting mechanism for the context representation. During fine-tuning, we up-scale the supporting features and down-scale the opposing ones using a weight calculated from LRP relevance scores. Finally, we use the re-weighted context representation to predict the next word for fine-tuning.
LRP-IFT is different from standard fine-tuning which weights the gradients of parameters with small learning rates to gradually adapt the model parameters. Instead, it pinpoints the related features/evidence for a decision and guides the model to tune more on those related features. This fine-tuning strategy resembles how we correct our cognition bias. For example, when we see a green banana, we will update the color feature of bananas and keep the other features such as the shape.
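To make the re-weighting idea concrete, the following is a minimal sketch rather than the exact formulation, which is not given in this part of the text. The function name `reweight_context` and the weight `1 + R_i / max_j |R_j|` are assumptions chosen only so that features with positive relevance are up-scaled and features with negative relevance are down-scaled.

```python
def reweight_context(context, relevance):
    """Scale each context feature by a weight derived from its LRP relevance.

    Assumption (illustrative only): weight_i = 1 + R_i / max_j |R_j|, so
    supporting features (R_i > 0) grow and opposing features (R_i < 0) shrink.
    """
    max_abs = max(abs(r) for r in relevance)
    if max_abs == 0.0:
        return list(context)  # no evidence either way: keep features unchanged
    return [c * (1.0 + r / max_abs) for c, r in zip(context, relevance)]


# Toy context representation with one supporting, one opposing, one neutral feature.
context = [0.5, -1.0, 2.0]
relevance = [0.2, -0.4, 0.0]
print(reweight_context(context, relevance))
```

In this sketch the neutral feature is left untouched, which mirrors the green-banana analogy: only the features tied to the evidence for a decision are adjusted.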
We will demonstrate that LRP-IFT can help to de-bias image captioning models from frequently occurring object words. Though language bias is intrinsic, we can guide the model to be more precise when generating frequent object words rather than hallucinate them. We implement the LRP-IFT on top of pre-trained image captioning models trained with Flickr30K [32] and MSCOCO2017 [33] datasets and effectively improve the mean average precision (mAP) of predicted frequent object words evaluated across the test set. At the same time, the overall performance in terms of sentence-level evaluation metrics is maintained.
The contributions of this paper are as follows:
• We establish explanation methods that disentangle the contributions of the image and text inputs and explain image captioning models beyond visualizing attention.
• We quantitatively measure and compare the properties of explanation methods and attention mechanisms, including tasks of finding the related features/evidence for model decisions, grounding to image content, and the capability of debugging the models (in terms of providing possible reasons for object hallucination and differentiating hallucinated words).
• We propose an LRP-inference fine-tuning strategy that reduces object hallucination and guides the models to be more precise and grounded on image evidence when predicting frequent object words. Our proposed fine-tuning strategy requires no additional annotations and successfully improves the mean average precision of predicted frequent object words.
In the rest of this paper, Section II introduces recent image captioning models, the state-of-the-art explanation methods for neural networks, and other related works. In Section III, we will introduce the image captioning model structures applied in this paper. The adaptations of explanation methods to attention-guided image captioning models are summarized in Section IV. The analyses of attention and explanations and our proposed LRP-inference fine-tuning strategy are introduced in Section V.

A. Image Captioning
Many models adopt the encoder-decoder approach to bridge the gap between image and text, usually with a CNN as the image encoder and an RNN as the sentence decoder [1], [2], [3]. Considering that it might be helpful to focus on a subregion of the image when generating a word of the caption, various attention mechanisms have been developed, guiding the model to focus on the relevant parts of the image when predicting a word. Some representative works include hard or soft attention [4], semantic attention [6], adaptive attention [7], bottom-up and top-down attention [8], adaptive attention time [9], hierarchical attention [10], X-Linear attention [34], and spatio-temporal memory attention [35]. Recently, many works build the attention mechanism with the multi-head attention originated from Transformer models [11], such as attention on attention [12], entangled transformer [13], multi-modal transformer [14], meshed-memory transformer [15]. These attention mechanisms effectively facilitate image captioning models to better recognize and locate the objects in an image. We will analyze the adaptive attention mechanism [7], [8], [9], [10] and the multi-head attention mechanism [11], [36], [12], [13], [14], [15]. Both attention mechanisms are employed as a sub-module in a number of works.
Recognizing and locating the objects in an image is often not sufficient to generate fine-grained captions. In addition to studying attention mechanisms, a branch of research explores the relations of objects (e.g. playing with balls) and object attributes (e.g. a wooden desk). Many of these works build a graph to capture the relation and attribute representations of objects, such as the scene graph [28], [37], [38], [39], [40] and the visual relation graph [41]. Some other works aim to generate more fine-grained captions by learning global and local representations in a distilling fashion [42], by gradually learning the representation via a context-aware visual policy [43], or by parsing and utilizing the noun chunks in the reference captions [44]. The unified VLP [45] learns unified image-text representations in the spirit of the BERT embedding [46]. VIVO [47] and OSCAR [48] further enhance the unified representation by incorporating external image-tag pairs for training. These unified representations can be used in various visual-language tasks. [49] uses additional rank annotations of the referenced captions.
There are also other challenging directions of image captioning like novel object captioning (NOC) and captioning with different styles. NOC tries to predict novel objects that are not in the image-caption training pairs, which overcomes the limitation of fixed training vocabulary and achieves better generalization [50], [51], [52], [53], [54], [47], [55]. [56] and [57] attempt to generate captions with controlled sentiments and styles.

B. Towards de-biasing visual-language models
The intrinsic composition of the training data can lead to biased visual-language models. To this end, many works aim to reduce model bias and improve the grounding property of visual-language models. For visual-question-answering (VQA) models, [30] learns the language bias in advance by using the textual question-answer pairs in order to increase the loss computation for biased answers during training. [58] proposes a grounded visual question answering model that disentangles the yes/no questions and visual concept-related questions. Both show an effective reduction of the bias for the VQA models. As for image captioning models, [29] designs an appearance confusion loss and a confidence loss using segmentation annotations to reduce the gender bias of the captioning models. [31] adopts external human-annotated attention maps to guide the model to generate more grounded captions. Different from the above methods, we propose an LRP-inference fine-tuning strategy that requires no additional annotations to mitigate the influence of language bias for image captioning models. The guidance comes from the explanation scores obtained from explanation methods.

C. Explanation methods for image captioning models
Many explanation methods have been proposed to explain the predictions of DNNs, such as gradient-based methods [59], [22], [60], [21], decomposition-based methods [61], [23], [24], [62], [63], [64], [65], and sampling-based methods [66], [67], [68], [69], [70]. These explanation methods have provided plausible explanations for various DNN architectures including CNNs [23], [21], [62], [66], [63], [64], [67], [68], [71], RNNs [24], [65], [66], [64], graph neural networks (GNNs) [72], [73], [74], [75], [76], and clustering models [77], making it practical to derive explanation methods for image captioning models. However, to the best of our knowledge, only a few works have studied the interpretability of image captioning models so far. In principle, gradient-based methods can be directly applied to image captioning models. Grad-CAM and Guided Grad-CAM have been used to explain non-attention image captioning models [21]. [78] introduces an explanation method for video captioning models. They further adapt the method to image captioning models by slicing an image with grids to form a sequence of image patches, treated as video frames. However, the slicing operation may cut through object structures. Attention heatmaps are usually considered as explanations of image captioning models. The question of to what extent attention is suitable as an explanation has been raised in the natural language processing context [79], [80], [81]. For the image captioning task, although attention heatmaps can show the locations of object words, they cannot disentangle the contributions of the image and text inputs. Furthermore, attention heatmaps have difficulty providing pixel-wise explanations that reflect the positive and negative contributions of pixels and regions. These issues can be addressed by several explanation methods. To keep the scope of the analyses within reasonable limits, we adapt LRP, Grad-CAM, and Guided Grad-CAM to image captioning models as representative examples.

D. Explanation-guided training
Recently, some studies have observed that explainable AI is not limited to providing post-hoc insights into neural networks but can also be applied to train a model. [82] utilizes the saliency maps of Grad-CAM and Guided Grad-CAM to design a pixel-wise cross-entropy loss for transfer learning. They show that the pixel-wise cross-entropy loss can guide the model to make the right decisions using the right reasons while also improving image classification accuracy. [31] also uses Grad-CAM saliency maps together with additional human-annotated attention maps to design a ranking loss for image captioning models. They show that the ranking loss can help to generate more grounded captions and maintain sentence fluency. [83] adopts LRP explanations to guide few-shot classification models. They demonstrate that explanation-guided training can improve the model generalization and classification accuracy for cross-domain datasets. We will show that LRP explanations can also help to mitigate the influence of language bias for image captioning models.

III. BACKGROUNDS OF IMAGE CAPTIONING MODELS

A. Notations for image captioning models
In this section, we recapitulate common structures of image captioning models, which consist of an image encoder, a sentence decoder, and a word predictor module. To caption a given image, we first encode the image with pre-trained CNNs or detection modules such as a Faster R-CNN and extract a visual feature $I \in \mathbb{R}^{n_v \times d_v}$, where $n_v$ and $d_v$ are the number and dimension of the visual feature. For $I$ from a Faster R-CNN, $n_v$ would be the number of regions of interest (ROIs), and for $I$ from a CNN, $n_v$ would be the number of spatial elements in a feature map. Then, the visual feature $I$ is decoded by an LSTM augmented with an attention mechanism to generate a context representation. Finally, the word predictor takes the context representation and the hidden state of the decoder as inputs to predict the next word.
During training, there is a reference sentence as the ground truth, $S = (w_t)_{t=1}^{l}$, where $w_t$ is a word token, and $l$ is the sentence length. At each time step $t$, the LSTM updates the hidden state $h_t$ and memory cell $m_t$ as follows.
where $[\cdot]$ denotes concatenation, $E_m$ is a word embedding layer that encodes words to vectors, $E_m(w_{t-1}) \in \mathbb{R}^{d_w}$. $I_g = \frac{1}{n_v} \sum_{k=1}^{n_v} I^{(k)}$ represents an averaged global visual feature. During inference, $w_{t-1}$ is the word predicted at the previous step. Then, an attention mechanism $ATT(\cdot)$ uses $h_t$ and $I$ to generate a context representation $c_t$ for word prediction.
where $p_t$ is the predicted score over the vocabulary. The concrete implementations of $E_m$, $I_g$, $ATT(\cdot)$, and $Predictor$ may vary across different models.

B. Attention mechanisms used in this study
We choose two representative attention mechanisms, adaptive attention [7] and a modified multi-head attention [11], [12]. Variants of both are employed by several image captioning models, so studying them supports the generalizability of our findings.
1) Adaptive attention mechanism: The adaptive attention mechanism generates a context representation by calculating a set of weights over the visual feature $I$ and a sentinel feature $s_t$ that represents the textual information. At time step $t$: $d_h$ and $d_x$ denote the dimension of the hidden state and $x_t$, respectively. $\sigma$ denotes the sigmoid function. The weights for $I$ and $s_t$ are calculated as follows: where $W_I \in \mathbb{R}^{d_h \times n_v}$, $W_s$ and $W_g \in \mathbb{R}^{n_v \times d_h}$, $w_a \in \mathbb{R}^{n_v}$ are trainable parameters$^{1}$. $\alpha_t \in \mathbb{R}^{n_v}$ is the attention weight for $I$. It tells the model which regions within the image to use for generating the next word. $\beta_t$ is the $(n_v + 1)$-th element of the softmax over $[a, b]$, corresponding to the weight for the component $b$. It balances the visual and textual information used to predict the next word. We use the following expression to summarize the adaptive attention mechanism.
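As an illustration of how $\alpha_t$ and $\beta_t$ interact, the sketch below starts from already computed scores $a$ (one per visual region) and $b$ (for the sentinel). It is not the summarizing expression of the paper. Two points are assumptions: the visual features and the sentinel share one dimension, and the context is formed as $\beta_t s_t + (1 - \beta_t) \sum_k \alpha_t^{(k)} I^{(k)}$, the combination used in the original adaptive attention work [7]. The function names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def adaptive_attention(a, b, visual, sentinel):
    """a: n_v region scores, b: sentinel score,
    visual: n_v rows of d features, sentinel: d features (assumed same d)."""
    alpha = softmax(a)             # attention weights over the visual feature I
    beta = softmax(a + [b])[-1]    # (n_v + 1)-th element of the softmax over [a, b]
    d = len(sentinel)
    attended = [sum(alpha[k] * visual[k][j] for k in range(len(visual)))
                for j in range(d)]
    # Assumed combination: beta weights the textual sentinel,
    # (1 - beta) weights the attended visual feature.
    context = [beta * sentinel[j] + (1.0 - beta) * attended[j] for j in range(d)]
    return alpha, beta, context


alpha, beta, context = adaptive_attention(
    a=[0.0, 0.0], b=0.0, visual=[[1.0, 0.0], [0.0, 1.0]], sentinel=[3.0, 3.0])
print(alpha, beta, context)
```

A large $\beta_t$ in this sketch shifts the context toward the sentinel, which is the behaviour expected for words that depend mostly on the preceding text.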
2) Multi-head attention mechanism: The multi-head attention is defined with a triplet of query ($Q$), key ($K$), and value ($V$). To apply the multi-head attention to the sentence decoder, we adopt $h_t$ as the query and two distinct linear projections of $I$ as $K$ and $V$.
where $W_K, W_V \in \mathbb{R}^{d_v \times d_h}$. We evenly split the hidden dimension $d_h$ to obtain multiple triplets of $(Q^{(i)}, K^{(i)}, V^{(i)})$, denoted as multiple heads. For each head, the attention weight over $V^{(i)}$ is the scaled dot product of $Q^{(i)}$ and $K^{(i)}$, and we can obtain a weighted feature $v^{(i)}$ as follows.
where $n_h$ is the number of heads$^{2}$. By concatenating the weighted feature of each head, we can obtain the integral attended feature $v$, which is further fed to a linear layer to generate the visual representation.
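The per-head computation can be sketched as follows. This is a minimal illustration that assumes the keys and values have already been projected to dimension $d_h$, that the scaling factor is the square root of the per-head dimension $d_h / n_h$ (the usual Transformer convention [11]), and that the final linear layer is omitted. The function name `multi_head_attention` is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_attention(query, keys, values, n_heads):
    """query: d_h floats (h_t); keys, values: n_v rows of d_h floats."""
    d_head = len(query) // n_heads          # even split of the hidden dimension
    attended = []
    for i in range(n_heads):
        lo, hi = i * d_head, (i + 1) * d_head
        # Scaled dot product between the head's query slice and each key slice.
        scores = [sum(query[j] * key[j] for j in range(lo, hi)) / math.sqrt(d_head)
                  for key in keys]
        weights = softmax(scores)
        # Weighted feature of this head, appended to form the concatenation.
        attended += [sum(w * value[j] for w, value in zip(weights, values))
                     for j in range(lo, hi)]
    return attended


values = [[1.0, 2.0, 3.0, 4.0], [3.0, 4.0, 5.0, 6.0]]
print(multi_head_attention([0.0] * 4, values, values, n_heads=2))
```

With an all-zero query every region receives the same weight, so each head simply averages its slice of the values.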
Under the image captioning setup, there are cases where the visual feature is less relevant to the predicted word, e.g. "a" and "the". Thus, we add another gate to control the visual information, which is consistent with many recent image captioning models using the multi-head attention module [12], [13], [15]. This also shares the same spirit as $\beta_t$ in the adaptive attention mechanism, which controls the proportion of image and textual information. Specifically, we generate the gate using the hidden state, and the gated output $c_t$ is the context representation for prediction.
where $W_{mh} \in \mathbb{R}^{d_h \times d_h}$ and $b_{mh} \in \mathbb{R}^{d_h}$ are trainable parameters and $\sigma$ is the sigmoid function. We briefly summarize the multi-head attention mechanism as follows.
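The gating step can be illustrated with a short sketch. It assumes the gate is $\sigma(W_{mh} h_t + b_{mh})$ and that it multiplies the visual representation element-wise, which matches the description above but is not copied from the paper's equation. The function name `gated_context` is hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_context(hidden, visual_repr, w_mh, b_mh):
    """hidden: d_h floats (h_t); visual_repr: d_h floats;
    w_mh: d_h x d_h nested list; b_mh: d_h floats."""
    gate = [sigmoid(sum(w_mh[i][j] * hidden[j] for j in range(len(hidden))) + b_mh[i])
            for i in range(len(b_mh))]
    # Element-wise gating of the visual representation gives the context c_t.
    return [g * v for g, v in zip(gate, visual_repr)]


# With zero weights and biases the gate is 0.5 everywhere.
zeros = [[0.0, 0.0], [0.0, 0.0]]
print(gated_context([1.0, -1.0], [2.0, 4.0], zeros, [0.0, 0.0]))
```

A gate close to zero suppresses the visual representation, which is the intended behaviour for words such as "a" and "the".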
3) Image captioning models with adaptive attention and multi-head attention: We build two image captioning models in this paper. The details of the two models are illustrated in Figure 2. The left of Figure 2 is the Ada-LSTM model that consists of an adaptive attention module and an LSTM followed by a fully connected (fc) layer as the word predictor. Note that $x_t$ is adjusted accordingly to incorporate the predictor. On the right is the MH-FC model that adopts a multi-head attention module followed by an fc layer as the word predictor. Both model structures are commonly used [7], [8], [12], [43], [44].
The image captioning models are usually trained with cross-entropy loss in the first stage: where $p = (p_t)_{t=0}^{l}$ is the predicted scores over the vocabulary, $l$ is the sentence length, and $y$ is the ground truth label of a referenced caption. Then, the models are further optimized with the SCST algorithm from [84]. SCST optimizes non-differentiable evaluation metrics, e.g. the CIDEr score [19], using reinforcement learning: where $R$ is the reward, $S_s$ is the sentence sampled from the predicted distribution $p = (p_t)_{t=0}^{l}$, $S_{greedy}$ is the sentence predicted with greedy search, and $S_{gt}$ is the referenced caption. The training objective is to obtain a higher reward $R$. CIDEr is usually adopted to calculate the reward, and some papers also refer to this algorithm as CIDEr optimization [8], [12].
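The two training stages can be sketched on per-word probabilities. This is an illustrative simplification: the cross-entropy is written as the summed negative log-probability of the reference words, and the SCST objective follows the self-critical form of [84], in which the reward of the greedy caption serves as a baseline for the sampled caption. The function names and the scalar rewards passed in are hypothetical placeholders for a metric such as CIDEr.

```python
import math


def cross_entropy_loss(reference_word_probs):
    """reference_word_probs[t]: probability assigned to the reference word at step t."""
    return -sum(math.log(p) for p in reference_word_probs)


def scst_loss(sampled_word_probs, reward_sampled, reward_greedy):
    """sampled_word_probs[t]: probability of the sampled word at step t.
    reward_sampled / reward_greedy: metric scores of S_s and S_greedy against S_gt."""
    advantage = reward_sampled - reward_greedy
    # Minimising this loss raises the likelihood of samples that beat the greedy caption.
    return -advantage * sum(math.log(p) for p in sampled_word_probs)


print(cross_entropy_loss([0.9, 0.8, 0.95]))
print(scst_loss([0.5, 0.4], reward_sampled=1.2, reward_greedy=1.0))
```

When the sampled caption scores the same as the greedy one, the advantage is zero and the sample contributes no gradient.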

MODELS
In this section, we will explain how to adapt LRP [23], Grad-CAM, and Guided Grad-CAM [21] for use in attention-guided image captioning models. For brevity, we will use Grad* to denote Grad-CAM and Guided Grad-CAM.
Grad* methods are based on gradient backpropagation and can be directly applied to the attention-guided image captioning models. Grad* methods first backpropagate the gradient of a prediction to the visual feature $I$, denoted as $g(I) \in \mathbb{R}^{n_v \times d_v}$. Then, we can obtain a channel-wise weight from $g(I)$ for the visual feature $I$, which is $w_I = \sum_{k=1}^{n_v} g(I)^{(k)} \in \mathbb{R}^{d_v}$. $I$ is further summed up over the feature dimension, weighted by $w_I$, to generate the class activation map, which reflects the importance of each pixel in the feature map. Grad-CAM reshapes and up-samples the class activation map to generate the image explanations. To obtain fine-grained and high-resolution explanations, Grad-CAM is fused with GuidedBackpropagation [22] by element-wise multiplication. GuidedBackpropagation can be easily implemented in PyTorch by writing a custom torch.autograd.Function wrapping the stateless ReLU layers. This fused method is Guided Grad-CAM. The linguistic explanations of Grad* methods are obtained by summing up the gradients of the word embeddings. Next, we will elaborate on LRP for image captioning models.
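The channel weighting described above can be sketched on plain nested lists. The sketch assumes the gradient $g(I)$ has already been obtained by backpropagation and covers only the steps stated in the text (channel weight, then weighted sum over the feature dimension); reshaping to the spatial grid and up-sampling are left out. The function name `class_activation_map` is hypothetical.

```python
def class_activation_map(features, grads):
    """features, grads: n_v rows (spatial positions or regions) of d_v channels."""
    d_v = len(features[0])
    # Channel-wise weight w_I: gradient summed over the n_v positions.
    w = [sum(row[d] for row in grads) for d in range(d_v)]
    # Weighted sum over the feature dimension gives one score per position.
    return [sum(w[d] * row[d] for d in range(d_v)) for row in features]


features = [[1.0, 2.0], [3.0, 4.0]]
grads = [[1.0, 0.0], [1.0, 1.0]]
print(class_activation_map(features, grads))
```

For a 14 × 14 CNN feature map, the resulting list of 196 scores would be reshaped to 14 × 14 before up-sampling to the image size.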
We briefly introduce the basics of LRP. For an in-depth introduction, we refer to a book chapter like [85]. LRP explains neural networks by assigning a relevance score to every neuron within the network. The relevance assignment is achieved by backpropagating the relevance score of a target prediction along the network topology until the inputs according to LRP rules.
Consider the basic component of neural networks as a linear transformation followed by an activation $f(\cdot)$.
where $y_i$ is the input neuron, $z_j$ is the linear output, and $\hat{z}_j$ is the activation output. We use $R(\cdot)$ to denote the relevance score of a neuron. Suppose $R(\hat{z}_j)$ is known; we would like to distribute $R(\hat{z}_j)$ to all of its input neurons $y_i$, denoted as relevance attribution $R_{i \leftarrow j}$. We refer to two LRP rules for relevance backpropagation that are frequently applied [23], [86], [87], [88]:
1) $\epsilon$-rule where $\epsilon$ is a small positive number. The stabilizer term $\epsilon \cdot \mathrm{sign}(z_j)$ guarantees that the denominator is non-zero.
2) $\alpha$-rule where $\alpha \geq 0$, $(\cdot)^{+} = \max(\cdot, 0)$, and $(\cdot)^{-} = \min(\cdot, 0)$. By separating $y_i w_{ij}$ and $z_j$ into positive and negative parts, the $\alpha$-rule ensures boundedness of the relevance terms. The parameter $\alpha$ determines the ratio of focus on positive and negative contributions during relevance backpropagation, from the output $\hat{z}_j$ to all of its inputs $y_i$.
The relevance of neuron $y_i$ is the summation of all its incoming relevance attribution flows.
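Both rules can be illustrated on a single linear layer. The sketch below is written under stated assumptions rather than taken from the paper: the bias is omitted, sign(0) is treated as +1, and the α-rule is parameterised as $(1 + \alpha)$ times the positive share minus $\alpha$ times the negative share, which is one common convention compatible with $\alpha \geq 0$. The function names `lrp_epsilon` and `lrp_alpha` are hypothetical.

```python
def lrp_epsilon(y, w, relevance_out, eps=0.01):
    """y: inputs y_i; w[i][j]: weight from input i to output j;
    relevance_out[j]: R(z_hat_j). Returns R(y_i) for every input."""
    n_in, n_out = len(y), len(relevance_out)
    rel_in = [0.0] * n_in
    for j in range(n_out):
        z_j = sum(y[i] * w[i][j] for i in range(n_in))
        denom = z_j + eps * (1.0 if z_j >= 0 else -1.0)   # stabilised denominator
        for i in range(n_in):
            rel_in[i] += relevance_out[j] * y[i] * w[i][j] / denom
    return rel_in


def lrp_alpha(y, w, relevance_out, alpha=0.0):
    """Assumed form: R_{i<-j} = R_j * ((1 + alpha) * c+ / z+ - alpha * c- / z-)."""
    n_in, n_out = len(y), len(relevance_out)
    rel_in = [0.0] * n_in
    for j in range(n_out):
        contribs = [y[i] * w[i][j] for i in range(n_in)]
        z_pos = sum(max(c, 0.0) for c in contribs)
        z_neg = sum(min(c, 0.0) for c in contribs)
        for i, c in enumerate(contribs):
            share = 0.0
            if z_pos > 0.0:
                share += (1.0 + alpha) * max(c, 0.0) / z_pos
            if z_neg < 0.0:
                share -= alpha * min(c, 0.0) / z_neg
            rel_in[i] += relevance_out[j] * share
    return rel_in


print(lrp_epsilon([1.0, 2.0], [[1.0], [1.0]], [3.0]))
print(lrp_alpha([1.0, 1.0], [[2.0], [-1.0]], [1.0], alpha=1.0))
```

In both functions the relevance of an input is accumulated over all outputs it feeds, which corresponds to the summation of incoming relevance flows described above.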
LRP has provided plausible explanations for CNNs [23], RNNs such as LSTM [24], and also GNNs [72]. These modules are commonly used in image captioning models. To explain image captioning models with LRP, we define next how to apply LRP to the attention mechanisms.
From Section III, we have seen that attention mechanisms involve non-linear interactions of the visual features and the hidden states of the decoder. However, the attention mechanisms mainly serve as weighting operations for features. Thus, we consider an attention mechanism as a linear combination over a set of features with weights such that LRP relevance scores are not backpropagated through the weights. This is consistent with the "signal-take-all" redistribution explored in [89]. In this way, we can directly apply LRP rules to distribute the relevance score of the context representation to the visual features according to the attention weights and bypass the computations within the attention mechanisms.
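Treating the attention as a fixed linear combination can be sketched as follows. The sketch assumes the context is $c = \sum_k \alpha^{(k)} I^{(k)}$ and applies an ε-style redistribution in which the attention weights act as constant coefficients, so no relevance is sent back through them. The function name `lrp_through_attention` and the choice of ε are illustrative assumptions.

```python
def lrp_through_attention(attn, visual, relevance_context, eps=0.01):
    """attn: n_v attention weights (treated as constants);
    visual: n_v rows of d features; relevance_context: d relevance scores R(c)."""
    n_v, d = len(visual), len(relevance_context)
    rel = [[0.0] * d for _ in range(n_v)]
    for j in range(d):
        c_j = sum(attn[k] * visual[k][j] for k in range(n_v))
        denom = c_j + eps * (1.0 if c_j >= 0 else -1.0)
        for k in range(n_v):
            # Each region receives relevance in proportion to its weighted contribution.
            rel[k][j] = relevance_context[j] * attn[k] * visual[k][j] / denom
    return rel


print(lrp_through_attention([0.5, 0.5], [[2.0], [6.0]], [4.0], eps=0.0))
```

Regions with larger attention weights and larger feature values receive a larger share of the relevance of the context representation.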
To give an overview of LRP for image captioning models, we take the Ada-LSTM model as an example and elaborate on each step of the explanation in Figure 3 and Algorithm 1. It is important to realize here, that LRP follows topologically the same flow as the gradient backpropagation (except the attention mechanisms) along the edges of a directed acyclic graph. The difference lies in replacing the partial derivatives on the edges by LRP redistribution rules motivated by the deep Taylor framework [61].
We initialize the relevance score of a target word, $R(w_T)$, from the output of the last fc layer (the logits). Then, as illustrated in Figure 3, LRP-type operations for computing $R(\cdot)$ are applied to the layers fc, ⊕, Language LSTM, $ATT_{ada}$, Decoder LSTM, and Encoder. The LRP operations used for these layers are marked with ⟹ in Algorithm 1. For each word to be explained, LRP assigns a relevance score to every pixel of the input image ($R(image)$) and every word of the sequence input ($R(w_{T-1}), \ldots, R(w_1)$). We can visualize the image explanation as a heatmap after averaging $R(image)$ over the channel dimension. The relevance score of each preceding word is the summation of the relevance scores over the word embedding. In the experiments, we will also use the relevance score to denote the explanation scores of gradient-based methods.
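The two aggregation steps at the end of the explanation can be sketched directly. The layout of the inputs (channels first for the image, one relevance value per embedding dimension for a word) and the function names are assumptions made for the example.

```python
def image_heatmap(relevance_image):
    """relevance_image[c][i][j]: relevance of channel c at pixel (i, j).
    Returns an H x W heatmap by averaging over the channel dimension."""
    n_c = len(relevance_image)
    height, width = len(relevance_image[0]), len(relevance_image[0][0])
    return [[sum(relevance_image[c][i][j] for c in range(n_c)) / n_c
             for j in range(width)] for i in range(height)]


def word_relevance(relevance_embedding):
    """Relevance of a preceding word: sum over its embedding dimensions."""
    return sum(relevance_embedding)


# Three channels, 1 x 2 image.
r_image = [[[0.3, -0.3]], [[0.6, 0.0]], [[0.0, -0.6]]]
print(image_heatmap(r_image))
print(word_relevance([0.2, -0.05, 0.1]))
```

Positive heatmap values mark pixels that support the explained word and negative values mark pixels that oppose it.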

A. Model preparation and implementation details
We train the Ada-LSTM model and the MH-FC model on the Flickr30K [32] and MSCOCO2017 [33] datasets for the following experiments$^{3}$.
Dataset: We prepare the Flickr30K dataset as per the Karpathy split [2]. For MSCOCO2017, we use the original validation set as the offline test set and extract 5000 images from the training set as the validation set. The train/validation/test sets contain 110000/5000/5000 images. Vocabularies are built only on the training set. We encode the words that appear fewer than 3 and 4 times as an unknown token <unk> for Flickr30K and MSCOCO2017, respectively, resulting in vocabulary sizes of 9585 and 11026 for the two datasets.
Encoder: We experiment with a CNN and a Faster R-CNN as the image encoder. The CNN features are extracted from VGG16 [90] pre-trained on ImageNet. Specifically, we use the output of "block5 conv3" with a shape of 14 × 14 × 512. The Faster R-CNN encoder provides bottom-up image features corresponding to the candidate regions for object detection. We refer to Detectron2
LRP parameters: We follow the suggestions of [86] on the best practice for LRP rules. We use the $\alpha$-rule for convolutional layers with $\alpha = 0$ and the $\epsilon$-rule for fully connected layers and LSTM layers with $\epsilon = 0.01$.
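The vocabulary construction described under Dataset can be sketched with the standard library. Whitespace tokenisation, the sorted ordering, and the function names are assumptions for illustration; only the thresholding rule (words below a minimum count become <unk>) comes from the text.

```python
from collections import Counter


def build_vocab(training_captions, min_count):
    """Keep words seen at least min_count times in the training captions."""
    counts = Counter(word for caption in training_captions for word in caption.split())
    kept = sorted(word for word, n in counts.items() if n >= min_count)
    return ["<unk>"] + kept


def encode(caption, vocab):
    """Replace out-of-vocabulary words with the unknown token."""
    known = set(vocab)
    return [word if word in known else "<unk>" for word in caption.split()]


captions = ["a dog runs", "a cat sits", "a dog sits"]
vocab = build_vocab(captions, min_count=2)
print(vocab)
print(encode("a cat runs", vocab))
```

Under this sketch, `min_count=3` would correspond to the Flickr30K setting and `min_count=4` to the MSCOCO2017 setting.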
Training details: We adopt the Adam optimizer for training, with $\beta_1 = 0.8$, $\beta_2 = 0.999$, and a learning rate $lr = 0.0005$. We anneal $lr$ by 20% when the CIDEr score does not improve for the last 3 epochs and stop the training when the CIDEr score does not improve for 6 epochs. We further optimize the models with the SCST optimization [84] using the CIDEr score with $lr = 0.0001$. For the models using CNN features, we also fine-tune the CNN encoder with $lr = 0.0001$ before applying the SCST optimization. Table I lists the performance of the Ada-LSTM model and the MH-FC model. We generate the captions with beam search (beam size = 3) and report five evaluation metrics of the image captioning task: METEOR [17], ROUGE-L [18], SPICE [20], CIDEr [19], and the $F_{BERT}$ (idf) metric of BERTScore [92]. To validate our models, we include the performance of some benchmark image captioning models with similar model structures. AdaATT [7] is the first paper that proposes the adaptive attention mechanism. SCST [84] adapts reinforcement learning to image captioning and optimizes non-differentiable evaluation metrics. BUTD [8] adopts the bottom-up features and uses an LSTM as the word predictor. We can see that our models are properly trained and achieve comparable performance.
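The learning-rate annealing and stopping criterion from the training details can be sketched as a small loop over per-epoch validation scores. Two interpretations are assumptions: "anneal by 20%" is read as multiplying the learning rate by 0.8, and the stale-epoch counter is reset only when the CIDEr score improves on the best value so far. The function name `run_schedule` is hypothetical.

```python
def run_schedule(cider_scores, lr=0.0005, factor=0.8, anneal_after=3, stop_after=6):
    """Return (final learning rate, number of epochs run) for a score sequence."""
    best = float("-inf")
    stale = 0          # consecutive epochs without a new best CIDEr score
    epochs = 0
    for score in cider_scores:
        epochs += 1
        if score > best:
            best, stale = score, 0
            continue
        stale += 1
        if stale >= stop_after:
            break                      # early stopping
        if stale % anneal_after == 0:
            lr *= factor               # anneal by 20%
    return lr, epochs


print(run_schedule([1.0, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9, 0.9]))
```

In this example the score never improves after the first epoch, so the learning rate is annealed once and training stops at the seventh epoch.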

B. Explanation results and evaluation
Section I has shown some examples of the explanation results generated by LRP, Grad-CAM, Guided Grad-CAM. In comparison to attention heatmaps, we observe the following.
Firstly, explanation methods can disentangle the contributions of the image input and the textual input, which is beyond the interpretability that attention mechanisms can provide. Secondly, some explanation methods provide high-resolution, pixel-wise image explanations, such as LRP and Guided Grad-CAM. Thirdly, LRP explicitly shows the positive and negative evidence used by the model to make decisions. In the following experiments, we will quantitatively evaluate the information content of attention, LRP, Grad-CAM, and Guided Grad-CAM with two ablation experiments and one object localization experiment. The ablation experiments aim to measure the information in the visual domain and the text domain, expressed by the relevance scores assigned to pixels and words. The object localization experiment evaluates the visual grounding property of relevance scores for image regions.
1) Ablation experiment: We conduct the ablation experiment for both the image explanations and the linguistic explanations, as illustrated in Figure 4. We demonstrate the approach using the same example in Section I based on the caption: A red fire hydrant sitting in the grass in a field.
The first row of Figure 4 shows the image explanations of the word hydrant, which highlight parts of the image related to the hydrant. To assess whether the highlighted areas contribute to the prediction, we firstly segment the image into non-overlapping 8 × 8 patches. Secondly, we sum the relevance scores within each patch as the patch relevance. Thirdly, we mask the top-20 high-relevance patches with the training data mean, to eliminate the contributions of these patches. The top-20 high-relevance patches found by different explanation methods are shown in the second row of Figure 4. Finally, we predict a caption on the masked image. If the masked areas are important to the prediction, the model will be less confident to predict the target word or will not generate the target word at all from the masked image.
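The masking steps of the image ablation can be sketched on a single-channel image stored as nested lists. The sketch assumes that "8 × 8 patches" means patches of 8 × 8 pixels and that the fill value stands in for the training data mean. The function name `mask_top_patches` and its parameters are hypothetical.

```python
def mask_top_patches(image, relevance, patch=8, top_k=20, fill=0.0):
    """image, relevance: H x W nested lists. Returns a masked copy of image."""
    height, width = len(image), len(image[0])
    scored = []
    for r in range(0, height, patch):
        for c in range(0, width, patch):
            # Patch relevance: sum of the relevance scores inside the patch.
            total = sum(relevance[i][j]
                        for i in range(r, min(r + patch, height))
                        for j in range(c, min(c + patch, width)))
            scored.append((total, r, c))
    scored.sort(key=lambda item: item[0], reverse=True)
    masked = [row[:] for row in image]
    for _, r, c in scored[:top_k]:
        for i in range(r, min(r + patch, height)):
            for j in range(c, min(c + patch, width)):
                masked[i][j] = fill    # overwrite with the (assumed) data mean
    return masked


image = [[1.0] * 4 for _ in range(4)]
relevance = [[5.0, 5.0, 0.0, 0.0],
             [5.0, 5.0, 0.0, 0.0],
             [0.0, 0.0, 1.0, 0.0],
             [0.0, 0.0, 0.0, 0.0]]
for row in mask_top_patches(image, relevance, patch=2, top_k=1, fill=0.5):
    print(row)
```

The masked image would then be passed to the captioning model to check whether the target word is still generated.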
The linguistic explanations reflect the contributions of the previously generated sequence. For example, when generating the word field, the model perhaps uses the words "sitting", "in", and "a" as related evidence. Similar to the image ablation experiment, we remove the top-3 relevant words in the preceding sequence and forward the modified sequence to the model in a teacher-forcing manner. Finally, we observe the new probability of the target word. We do not modify the image for the word ablation experiment. If the removed words are strongly related to the prediction, the new probability of the target word will drop considerably compared to its original value.
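The word ablation can be sketched in the same spirit. The snippet below is only an illustration under stated assumptions: `model_prob` is a hypothetical callable that returns the teacher-forced probability of the target word given the image and the preceding words, and `relevance` holds one score per preceding word.

```python
def ablate_top_words(words, relevance, k=3):
    """Remove the k preceding words with the highest relevance, keeping word order."""
    ranked = sorted(range(len(words)), key=lambda i: relevance[i], reverse=True)
    drop = set(ranked[:k])
    return [w for i, w in enumerate(words) if i not in drop]


def probability_drop(model_prob, image, words, relevance, target, k=3):
    """Original minus new probability of `target` after removing the top-k words.

    `model_prob(image, words, target)` is a hypothetical interface, not part of
    any specific library; the image is left unchanged.
    """
    original = model_prob(image, words, target)
    ablated = model_prob(image, ablate_top_words(words, relevance, k), target)
    return original - ablated
```

A positive return value means the model became less confident in the target word once the most relevant preceding words were removed.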
We conduct the ablation experiment using image captioning models trained on the MSCOCO2017 dataset with CNN features and report the results on the test set. For the word ablation experiment, we consider the predicted words with a sequence index greater than 6, so that the preceding word sequence is long enough to avoid evaluating purely frequency-based predictions. For the image ablation experiment, we consider all the predicted object words. A random ablation is included as a baseline.

Figure 5 shows the results of the word ablation experiment. The words we explain are split into object words and stopwords. We show the frequency of probability drop and the difference between the original word probability and the new word probability after the word deletion (denoted as the average score of probability drop). A higher average score of probability drop means that the model is less confident in the original prediction after ablation and, therefore, that the ablated words are more strongly related to the prediction. LRP and gradient-based explanation methods decrease the prediction probability more often and with greater impact than the random ablation, indicating that the words found by explanation methods are used by the model as important evidence to predict the target word. LRP achieves both the highest frequency and the highest average score of probability drop.

In our word ablation experiment, the multi-head attention mechanism of the MH-FC model uses 8 heads, resulting in 8 sets of attention weights. This is computationally too heavy for the image ablation experiment. We therefore implement the image ablation experiment with the Ada-LSTM model and show how often the model fails to generate the target word after the image ablation, as shown in Figure 6 (left).
We can see that high-resolution explanations from the evaluated explanation methods LRP, Guided Grad-CAM, and GuidedBackpropagation achieve a higher frequency of object words vanishing, indicating that the highlighted areas are related to the evidence for model decisions.
With the above experiment results, we verify that using explanation methods adds information compared to relying on attention heatmaps alone.
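The two word-ablation statistics reported in Figure 5 can be sketched as below. This is one plausible reading, stated as an assumption: the frequency of probability drop is taken as the fraction of cases in which the new probability is strictly lower than the original one, and the average score is the mean of (original − new) over all evaluated cases. The function name is hypothetical.

```python
def drop_statistics(original_probs, new_probs):
    """Return (frequency of probability drop, average probability drop).

    Assumed definitions: frequency counts strict decreases; the average is
    taken over all cases, including those where the probability increased.
    """
    diffs = [o - n for o, n in zip(original_probs, new_probs)]
    frequency = sum(d > 0 for d in diffs) / len(diffs)
    average_drop = sum(diffs) / len(diffs)
    return frequency, average_drop
```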
2) Measuring the correlation of explanations to object locations: Many studies employ attention heatmaps as a tool to verify the visual grounding property qualitatively [4], [7], [12], [10], [35]. In this part, we quantify the correlation of explanation results to object locations and show that high-resolution explanations can also achieve a high correlation to the object locations.
To assess the correlation of explanations to object locations, we utilize the bounding box annotations of the MSCOCO2017 dataset and extend the correctness measure from [93], which evaluates the grounding property of attention heatmaps, to the explanation results. For a correctly predicted object word, we first obtain the relevance scores of the image input, R(image), with explanation methods and average R(image) over the channel dimension, resulting in a spatial explanation $E \in \mathbb{R}^{h \times w}$, where h and w are the height and width of the image.
We keep the positive scores of E for object localization, $E_p = \mathrm{norm}(\max(E, 0))$. The correctness is the proportion of the relevance scores within the bounding box,

$$\mathrm{correctness} = \frac{\sum_{(i,j) \in \mathrm{bbox}} E_p[i,j]}{\sum_{(i,j)} E_p[i,j]},$$

where norm(·) is the normalization with the maximal absolute value. For the MH-FC model with the multi-head attention mechanism, we generate the explanations for each head, $R(\mathrm{image})^{(i)}$, by only backpropagating the relevance scores or gradients through head i. The correctness of the MH-FC model is the maximum across the $\mathrm{correctness}^{(i)}$ of all the heads, i.e., $\mathrm{correctness} = \max_i \mathrm{correctness}^{(i)}$.
Higher correctness means the relevance scores concentrate more within the bounding box, indicating a better grounding property. Figure 6 (right) shows the average correctness of all the correctly predicted object words across the MSCOCO2017 test set, evaluated with image captioning models trained using CNN features. First of all, the MH-FC model achieves consistently higher correctness than the Ada-LSTM model, indicating that there is at least one head of the MH-FC model that accurately locates the object, especially for attention and LRP where there is a large discrepancy of the correctness between the Ada-LSTM and the MH-FC models.
Secondly, high-resolution explanations provided by LRP, Guided Grad-CAM, and GuidedBackpropagation achieve comparable or higher correctness than attention. The notable exception is due to the spatial localization property of the multiple heads in the MH-FC model. Combining the results of the ablation experiments, explanation methods tend to find parts of objects which correlate well to the prediction.
Thirdly, to gain further insight into the role of the sign of the relevance scores, we calculate the correctness using the absolute value of the negative relevance scores, $E_n = \mathrm{norm}(\max(-E, 0))$. As shown in Figure 6 (right), the low correctness of "LRP-neg" and the high correctness of "G. Grad-CAM-neg" verify that the positive/negative sign of LRP relevance scores reveals the support/opposition of a pixel to the predictions, while for Guided Grad-CAM, both positive and negative relevance scores are related to the predictions and irrelevant pixels have low absolute relevance scores.
Last but not least, our correctness evaluation results over various explanation methods under the image captioning scenario are consistent with some prior works. GuidedBackpropagation and LRP generate more coherent explanations for MRI data than other gradient-based methods [94], despite failing certain sanity checks postulated in [95]. This underlines the importance of considering multiple criteria in contrast to decisions based on selected axiomatic requirements. Furthermore, the sign of LRP relevance scores is meaningful [86]. Both properties can be helpful for model debugging [95], [96], [97].
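The correctness measure of this subsection can be sketched as follows. The snippet is an illustration under assumptions, not the authors' code: the explanation E is a nested list, the bounding box is given as (top, left, bottom, right) with exclusive bottom and right bounds, and the `sign` argument (hypothetical) switches between the positive part and the negative part used for the "-neg" variants.

```python
def norm(E):
    """Normalize a 2-D map by its maximal absolute value."""
    m = max(abs(v) for row in E for v in row)
    if m == 0:
        return [row[:] for row in E]
    return [[v / m for v in row] for row in E]


def correctness(E, bbox, sign=1):
    """Share of the rectified, normalized relevance that falls inside the box.

    sign=+1 keeps the positive scores, sign=-1 keeps the magnitudes of the
    negative scores. bbox = (top, left, bottom, right), bottom/right exclusive.
    """
    rectified = norm([[max(sign * v, 0.0) for v in row] for row in E])
    total = sum(v for row in rectified for v in row)
    if total == 0:
        return 0.0
    top, left, bottom, right = bbox
    inside = sum(rectified[i][j]
                 for i in range(top, bottom)
                 for j in range(left, right))
    return inside / total


def multi_head_correctness(head_maps, bbox):
    """Maximum correctness over per-head explanations (multi-head case)."""
    return max(correctness(E, bbox) for E in head_maps)
```

Because the same normalization constant appears in the numerator and the denominator, it cancels in the ratio; it is kept here only to mirror the description in the text.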
In the next section, we will show how we use LRP to "debug" and improve image captioning models.

C. Reducing object hallucination with explanation
In our experiment, we observe the common hallucination problem of image captioning models. Image captioning models sometimes generate object words that are not related to the image content, which is possibly caused by the learned language priors. The vocabulary and sentence patterns of the image-caption pairs are intrinsically biased toward frequent occurrences. As illustrated in Figure 7, the vocabulary count distribution of the predicted words is close to that of the training vocabulary.
A language bias can be helpful for image captioning models. [28] learns the inductive language bias to guide the model to deduce the object relations and attributions. However, it can also cause mistakes. For example, the models could be flawed when predicting gender [29] or always paint bananas yellow irrespective of their actual color [30], [31]. To this end, we explore the explanations of hallucinated words and investigate using approaches from explainability to reduce object hallucination.
1) Exploring the explanations of hallucinated words: Based on the findings in Section V-B that high-resolution explanations obtained by LRP and Guided Grad-CAM correlate to the object locations and reflect well the related evidence for predictions, we explore the difference in image explanations between grounded (true-positive) and hallucinated (false-positive) object words. Figure 8 illustrates some examples of LRP image explanations for hallucinated words.
In Figure 8 (a) to (e), the LRP image explanations show more negative scores, implying that the model generates hallucinated words mainly with the linguistic information rather than the image information. In the remaining example of Figure 8, the model mistakes the yellow frisbee for a banana, as evidenced by red pixels (positive scores). We now quantify the difference in image explanations between true-positive and false-positive object words. Specifically, we use the statistics of image explanations (the E mentioned in Section V-B2) to differentiate the hallucinated words.

[Figure axis labels (frequent words): man, people, woman, street, table, person, field, tennis, train, plate, room, dog, cat, water, baseball, bathroom, sign, kitchen, food, grass, bus, pizza, building, clock.]

[Figure 8 example captions: "A black and white cat standing next to a person." "A man sitting on a chair in front of a TV." "A man holding a banana in his hand." "A close up of a person on a cellphone." "A person sitting on a bench with a skateboard." "A bedroom with a bed a chair and a television."]

We assign a label 1/0 to the true-positive/false-positive predicted words, respectively. Each word is also assigned a statistic calculated from the image explanation E, such as the maximum value (max(E)), the 5% and 50% quantiles (quantile-5%/50%(E)), and the mean (mean(E)). We also evaluate $1 - \beta$ from Eq. (10) of the adaptive attention mechanism. Recall that the adaptive attention mechanism contains a sentinel feature $s_t$ that represents the text-dominant information. It then learns a weight, $\beta_t$, which controls the proportion of linguistic information used for predictions. Thus, it is a model-intrinsic baseline to show differences between grounded and hallucinated object words. We calculate the AUC scores using the labels and statistics of true-positive and false-positive words. A higher AUC score indicates a better differentiation between hallucinated and grounded words. Table II lists the AUC scores computed with various explanation methods. We conduct the experiment with the Ada-LSTM model trained on the Flickr30K dataset, because its vocabulary is more imbalanced than that of the MSCOCO2017 dataset. The results are reported on the test set of Flickr30K. The evaluated words are the top-20 frequent object words⁵, with 715 false-positive and 1,027 true-positive cases.
The LRP quantile-5%(E) achieves a slightly higher AUC score than $1 - \beta$ and can weakly recognize the hallucinated words, which indicates that true-positive words usually have a higher LRP quantile-5%(E) and false-positive words a lower one. The statistics of LRP all obtain AUC scores greater than 0.5, which verifies that the LRP image explanations contain lower relevance scores for false-positive words and thus reflect less supporting evidence for the hallucinated words.
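The AUC evaluation described above can be sketched with a small, dependency-free example. It assumes labels of 1 for true-positive and 0 for false-positive words, one scalar statistic per word, and a nearest-rank quantile; the quantile convention and the function names are assumptions made for illustration only.

```python
import math


def quantile(values, q):
    """Nearest-rank quantile of a flat list (one of several common conventions)."""
    ordered = sorted(values)
    idx = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[min(idx, len(ordered) - 1)]


def auc(labels, scores):
    """AUC as the probability that a grounded word scores higher than a
    hallucinated one, with ties counted as one half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In this sketch, each score would be a statistic such as `quantile(flattened_E, 0.05)` computed from the image explanation of one predicted object word.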
In the next section, we will introduce a fine-tuning strategy that builds upon LRP-based explanations to reduce object hallucination.
2) Using LRP explanations to reduce object hallucination: We introduce an LRP-inference fine-tuning (LRP-IFT) strategy that can help to de-bias a pre-trained image captioning model and reduce object hallucination. We design a re-weighting mechanism inspired by two properties of LRP explanations: 1) meaningfulness of the positive and negative sign of LRP relevance scores, indicating the support and opposition to the predictions; 2) the property of finding the regions and evidence in the image used by the model to make predictions. In particular, we design weights for the input features of the last fc layer using the LRP relevance scores and embed the reweighted features into the model for fine-tuning. We elaborate on each step of the fine-tuning strategy with Algorithm 2 and detail the underlying idea as follows.
To fine-tune an image captioning model M, we generate an initial caption first.
where I is the image, $p_t \in \mathbb{R}^{V}$ is the probability distribution over the vocabulary at time step t, V is the vocabulary size, and $h(w_t)$ is the label of the word $w_t$. If $w_t$ is not a stop-word, we explain the predicted label $h(w_t)$ through the last fc layer using LRP and obtain the relevance scores of the context representation and the hidden state, $R(c_t)$ and $R(h_t)$. (Recall that $c_t + h_t$ is the input of the last fc layer.) We then normalize $R(c_t)$ and $R(h_t)$ with the maximal absolute value, so that their values are in [−1, +1], and generate a new word probability distribution $\hat{p}_t$ as follows.

In LRP explanations, positive relevance is attributed to features supporting the prediction of the target class and negative relevance is attributed to contradicting features. The operations performed in Eqs. (28) and (29) construct a weight ω such that ω < 1 for the opposing features and ω > 1 for the supporting features. The re-weighting mechanism thus up-scales the supporting features and down-scales the opposing ones.
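A minimal sketch of such a re-weighting is given below. The exact form of Eqs. (28) and (29) is not reproduced here; the sketch assumes ω = 1 + norm(R), which is one simple choice consistent with the description (ω > 1 for positive relevance, ω < 1 for negative relevance). It further assumes that $R(c_t)$ and $R(h_t)$ are normalized separately, that the relevance scores are already available from an LRP pass, and that the last fc layer is a plain affine map followed by a softmax. All function names are hypothetical.

```python
import math


def normalize(relevance):
    """Scale relevance scores into [-1, 1] by the maximal absolute value."""
    m = max(abs(r) for r in relevance)
    if m == 0:
        return [0.0] * len(relevance)
    return [r / m for r in relevance]


def lrp_reweight(features, relevance):
    """Assumed weighting: omega = 1 + norm(R), applied element-wise."""
    return [(1.0 + r) * f for f, r in zip(features, normalize(relevance))]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lrp_inference_probs(c_t, h_t, R_c, R_h, W, b):
    """Re-weighted word distribution from the last fc layer (weights W, bias b)."""
    x = [c + h for c, h in zip(lrp_reweight(c_t, R_c), lrp_reweight(h_t, R_h))]
    logits = [sum(w * v for w, v in zip(row, x)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)
```

With this choice, a feature whose normalized relevance is −1 is suppressed entirely, while a feature with normalized relevance +1 is doubled.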
During fine-tuning, we use the LRP-inference prediction $\hat{p} = (\hat{p}_t)_{t=0}^{l}$ to calculate the loss. For the cross-entropy loss function, we can combine both the original loss and the new loss with a parameter λ ∈ [0, 1]. The loss function from Eq. (17) is updated as follows.
where $L_{ce}$ denotes the cross-entropy loss and y is the ground truth label. We can also use $\hat{p}$ for the SCST optimization, and the reward formula from Eq. (18) is re-written as follows.

$$R = \mathbb{E}_{S^{s}, S^{greedy} \sim \hat{p}} \left[ \mathrm{metric}(S^{s}, S^{gt}) - \mathrm{metric}(S^{greedy}, S^{gt}) \right] \quad (32)$$

where we replace the original probability distribution p with the LRP-inference one $\hat{p}$. R is the reward, $S^{s}$ is the sampled sentence, $S^{greedy}$ is the greedily sampled sentence, and $S^{gt}$ is the reference caption.
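The combination of the original and the LRP-inference cross-entropy terms described above can be sketched as follows. Which of the two terms is weighted by λ is not stated in the surrounding text, so the assignment below (λ on the original loss, 1 − λ on the LRP-inference loss) is an assumption; for the value λ = 0.5 used later in the experiments, both assignments coincide. The function names are hypothetical.

```python
import math


def cross_entropy(prob_seq, targets):
    """Mean negative log-likelihood of the ground truth labels over time steps."""
    return -sum(math.log(p[y]) for p, y in zip(prob_seq, targets)) / len(targets)


def lrp_ift_loss(p_seq, p_hat_seq, targets, lam=0.5):
    """Convex combination of the original and the LRP-inference losses.

    Assumed form: lam * L_ce(p, y) + (1 - lam) * L_ce(p_hat, y).
    """
    return (lam * cross_entropy(p_seq, targets)
            + (1.0 - lam) * cross_entropy(p_hat_seq, targets))
```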
Different from standard fine-tuning, LRP-IFT disentangles the contributions of the visual information, $R(c_t)$, and the hidden state, $R(h_t)$. It selects and fine-tunes the more related features rather than training all the features generally.
To evaluate the performance of LRP-IFT, we observe the mean average precision (mAP) of the frequent object words⁶. The motivation of LRP-IFT is to guide the model to generate more grounded captions rather than to thoroughly enumerate all objects within an image. Therefore, we do not use the recall and F1 score. Table III lists the mAP of the models with or without LRP-IFT. We implement LRP-IFT on two sets of pre-trained models. The first set consists of the models from Table I, which are optimized with SCST; we use Eq. (32) to fine-tune these models for one epoch. The second set consists of models trained only with cross-entropy loss, denoted as (ce) in the table; we use Eq. (31) with λ = 0.5 to fine-tune these models for one epoch. For the baseline models, we fine-tune the two sets of models with standard SCST optimization or cross-entropy loss with the same training hyperparameters. As shown in Table III, the mAP is effectively improved after LRP-IFT for both sets of models, except for the MH-FC models trained on the MSCOCO2017 dataset. We discuss the mAP results from three aspects: 1) the MSCOCO2017 dataset has a more balanced vocabulary and more training data than the Flickr30K dataset, which results in less biased models. This also explains the more pronounced improvement of mAP on the Flickr30K dataset; 2) the multi-head attention mechanism has a better grounding property, as discussed in Section V-B2, which is a possible reason why LRP-IFT obtains similar mAP for the MH-FC model trained on the MSCOCO2017 dataset; 3) as expected, the image captioning models with bottom-up features consistently obtain higher mAP than those with CNN features, demonstrating the potential of better feature representation for visual-language models such as VIVO [47] and OSCAR [48].
Furthermore, LRP-IFT maintains the overall performance on the sentence level, as shown in Table IV. Figure 9 illustrates some example captions of the baseline models and the LRP-inference fine-tuned models. LRP-IFT is conducted on the non-stop words and can improve the precision of the frequent object words. As shown in Figure 9, with LRP-IFT, the model can correct or remove the hallucinated words and maintain the sentence structure. This can partially explain why the sentence-level performance is very close to that of the baseline models. We will provide more detailed analyses of the sentence-level performance in Section V-D.
From the above analyses, LRP-IFT can effectively de-bias a biased image captioning model and reduce its object hallucination while maintaining the sentence-level performance in terms of $F_{BERT}$, CIDEr, SPICE, METEOR, and ROUGE-L. On the other hand, this fine-tuning strategy does not notably degrade the performance of a less biased image captioning model. We remark that LRP-IFT requires no additional training parameters and no human annotations. The fine-tuning procedure is also analogous to the human recognition process, in which we first build prior knowledge by learning the objects, relations, and attributes, and then update related features when facing new shifts in distributions.

D. Discussion and outlook
In the experiments of LRP-IFT, we have observed that LRP-IFT measurably alleviates the object hallucination issue of image captioning models. However, we can also see that LRP-IFT does not effectively improve sentence-level performance. In this part, we further analyze the effects of the LRP re-weighting mechanism and take a closer look at the samples where LRP-IFT improves the sentence-level performance. We conclude by proposing a potential future direction where the LRP-inference training can be helpful.
1) On limitations of the LRP re-weighting mechanism: We performed an analysis on the samples where LRP-IFT improves or degrades sentence-level performance. First, for each word in a ground truth caption, we computed the count of that word within the training set. Then, for each ground truth caption in the test set, we find the minimum of the word counts, denoted as $c(S^{gt})$, over the non-stop words in the caption $S^{gt}$:

$$c(S^{gt}) = \min_{w_t \in S^{gt}} \mathrm{count}(w_t),$$

where $\mathrm{count}(w_t)$ returns the count of the word $w_t$ in the training set. This statistic $c(S^{gt})$ for test set captions can be viewed as a heuristic 1-gram estimate of the training data density for the linguistic modality of image captioning. For images with multiple ground truth captions, we take the minimum of $c(S^{gt})$ over all the captions of one image. We verified that taking the average yields the same qualitative results.

Table V: The average $c(S^{gt})$ over the ground truth captions from two sets of samples: the LRP-IFT-improved set, where LRP-IFT increases the CIDEr scores, and the LRP-IFT-degraded set, where LRP-IFT decreases the CIDEr scores. (ce) denotes that the models are trained only with cross-entropy loss. The other models are further optimized with SCST. BU and CNN denote bottom-up features and CNN features. Bold numbers indicate lower counts of the ground truth words in the training set. This statistic can be interpreted as a heuristic for training data density.
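The density heuristic defined in this part can be sketched as follows. The snippet assumes whitespace tokenization, lower-casing, and a user-supplied stop-word set; these preprocessing choices and the function names are illustrative assumptions, not details taken from the experiments.

```python
from collections import Counter


def min_word_count(caption, counts, stop_words):
    """Minimum training-set count over the non-stop words of one caption."""
    content = [w for w in caption.lower().split() if w not in stop_words]
    if not content:
        return None  # caption consists of stop-words only
    return min(counts[w] for w in content)  # unseen words count as 0


def image_density(captions, counts, stop_words):
    """Minimum of the per-caption statistic over all captions of one image."""
    values = [min_word_count(c, counts, stop_words) for c in captions]
    values = [v for v in values if v is not None]
    return min(values) if values else None


# Toy training corpus used only to build the word counts.
train_counts = Counter("a man holding a banana a man sitting".split())
```

Averaging `image_density` over the improved and the degraded image sets would then give the two numbers compared per model.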
Finally, we compute the average of this heuristic $c(S^{gt})$ for two sets of images: 1) the images on which LRP-IFT improves the predictions compared to the baseline model and 2) the images for which LRP-IFT degrades the predictions compared to the baseline model. We use sentence-level evaluation metrics, such as the CIDEr score, to separate the two sets of image samples. Table V lists the results of the average $c(S^{gt})$ using CIDEr scores for performance comparison. We observe a clear correlation across all but one of the models: the LRP-IFT-improved set exhibits a lower average $c(S^{gt})$, while the LRP-IFT-degraded set shows a higher average $c(S^{gt})$. In summary, LRP-IFT achieves a tradeoff. It performs worse on those test images with a higher estimate of the sample density, where the base model seemingly generalizes sufficiently well. On the other hand, it achieves an improvement on images with lower training data density. The results using other metric scores for comparison lead to the same finding. This makes intuitive sense, as one can expect that captions supported by a larger amount of training data would profit less from learning with explanations. A similar correlation for using explanations to improve age prediction models using image data is reported in [98]. The authors observe that using explanations improves predictions on the poorly performing age subset 48-53 years, which has a small sample size, while slightly degrading the performance on age subsets with larger sample sizes.
There are further possible reasons for the non-improved sentence-level performance. [27] points out that hallucinating less does not necessarily render higher sentence-level evaluation metrics [29], [31], [99], which is also in line with our observations in Table IV. Furthermore, LRP-IFT implements the re-weighting mechanism on top of pre-trained models as a fine-tuning step, making it challenging to achieve larger changes over pre-trained models.
2) An outlook for re-weighting mechanisms based on explanations: Based on the above analyses, we surmise that the LRP re-weighting mechanism could be helpful for novel object captioning (NOC). NOC aims to predict those object words that are unseen by the model during training. It also faces the challenge of unbalanced training data, in an even more extreme case where some object words are not shown in the training data. For example, [52] proposed a pointing mechanism to combine the sentence correlation representation and object representation, which dynamically decides whether to include an object word from a detection model. The LRP re-weighting mechanism could be helpful here to better guide the model when and where to include the detected objects in the caption.

VI. CONCLUSION
We adapt LRP and gradient-based explanation methods to explain attention-guided image captioning models beyond visualizing attention. With extensive qualitative and quantitative experiments, we demonstrate that explanation methods provide more interpretable information than attention, disentangle the contributions of the visual and linguistic information, and help to debug image captioning models, for example by mining the reasons for the hallucination problem. Building on the properties of LRP explanations, we propose an LRP-inference fine-tuning strategy that can successfully de-bias image captioning models and alleviate object hallucination. The proposed fine-tuning strategy requires no additional annotations and no additional training parameters.