Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering

Integrating outside knowledge for reasoning in visio-linguistic tasks such as visual question answering (VQA) is an open problem. Given that pretrained language models have been shown to include world knowledge, we propose to use a unimodal (text-only) training and inference procedure based on automatic off-the-shelf captioning of images and pretrained language models. Our results on a visual question answering task which requires external knowledge (OK-VQA) show that our text-only model outperforms pretrained multimodal (image-text) models with a comparable number of parameters. In contrast, our model is less effective in a standard VQA task (VQA 2.0), confirming that our text-only method is especially effective for tasks requiring external knowledge. In addition, we show that increasing the language model's size notably improves its performance, yielding results comparable to the state-of-the-art with our largest model, significantly outperforming current multimodal systems, even though the latter are augmented with external knowledge. Our qualitative analysis on OK-VQA reveals that automatic captions often fail to capture relevant information in the images, which seems to be balanced by the better inference ability of the text-only language models. Our work opens up possibilities to further improve inference in visio-linguistic tasks.


Introduction
Most visio-linguistic tasks are framed in such a way that all the information necessary to solve them is in the images and texts provided in the dataset. That is the case of visual question answering (VQA) (Antol et al., 2015) or visual entailment (Xie et al., 2019), to name a few. In addition, some tasks require access to external knowledge in order to solve them. An example is Outside Knowledge VQA (OK-VQA) (Marino et al., 2019), where the image content is not sufficient to answer the questions. Contrary to self-contained VQA tasks, which can be solved by grounding images and text alone, these tasks require methods that leverage external knowledge resources and are able to do inference on that knowledge.
Figure 1: Given a question and image, we verbalize the contents of the image and apply a pretrained language model for inference. We show that current text-only models are better in generalization and inference than multimodal models for knowledge-based QA. Example: the caption (C) "Three teddy bears sitting next to each other on a couch." and the question (Q) "Which american president is most associated with the stuffed animal seen here?" are given to the language model, which answers (A) "Teddy Roosevelt".
Footnote 1: Our code will be publicly available soon.
External knowledge useful for OK-VQA can be broadly classified into two categories, according to Marino et al. (2020): (i) symbolic knowledge, which can be represented using graphs, for example ConceptNet (Speer et al., 2017), and (ii) implicit knowledge, which is encoded in the weights of neural networks trained on different datasets. Supporting the latter case, transformer-based language models (LMs) pretrained on large corpora, such as BERT (Devlin et al., 2019), have been successfully used as implicit knowledge bases (Petroni et al., 2019). In any case, the best results on the OK-VQA dataset have been reported by systems that use both pretrained models and symbolic knowledge, usually integrating external knowledge sources (Gardères et al., 2020; Marino et al., 2020; Wu et al., 2021; Shevchenko et al., 2021).
In this paper we focus on the use of implicit knowledge in the form of pretrained LMs. While using LMs is relatively common in OK-VQA, they are usually integrated into multimodal transformers by diverse means, so as to combine the visual and textual inputs of the task. Given that LMs were originally designed to process textual input and are extensively trained on textual corpora, we hypothesize that a system that relies exclusively on text will allow LMs to better leverage their implicit knowledge. Because OK-VQA is a visio-linguistic task, we propose to use automatic image captioning as a way to verbalize the information in the image: the captions are descriptions of the images which are used as input to the LMs. Once the captions are generated, all the inference in our method is done using text-only models. We are aware that captions do not contain all the information in an image, and want to check whether the text-only models can compensate for that initial loss of information.
The approach proposed in this paper, named caption-based model, can be seen in Figure 1.
To validate our hypothesis, we present extensive experiments on the OK-VQA dataset, comparing our proposed caption-based model with the de facto standard of visio-linguistic tasks, i.e. multimodal transformers, which are widely used in VQA tasks to process the questions (text) and the images. We also analyze the compatibility between images and captions based on two different fusion strategies. As a result of our experiments, we find that:
• Captions are more effective than images for OK-VQA when pretrained language and multimodal models are used as is, and achieve similar results when both are fine-tuned on additional VQA datasets.
• The combination of the two approaches improves results further, showing that the text-only and multimodal models make complementary inferences.
• The larger contribution of captions on OK-VQA with respect to results on a regular VQA dataset (Goyal et al., 2017) shows that our text-only system is especially effective when external knowledge is needed.
• Our combined system is best among systems using implicit knowledge only, and nearly matches the results of state-of-the-art systems that integrate symbolic knowledge graphs.

Related Work
There are many visual question-answering datasets in the literature (Antol et al., 2015; Goyal et al., 2017), where, given an image and a question about the contents of that image, a system has to provide a textual answer. Some VQA datasets also demand leveraging external knowledge to infer the answer and, thus, they are known as knowledge-based VQA tasks. Good examples are KB-VQA (Wang et al., 2017b), KVQA (Sanket Shah and Talukdar, 2019), FVQA (Wang et al., 2017a) and OK-VQA (Marino et al., 2019). KVQA requires knowledge about named entities (e.g. Barack Obama, White House, United Nations) and that knowledge is already provided as a graph. FVQA annotates questions by selecting a fact from a fixed knowledge base but its size is relatively small. KB-VQA is even smaller, presenting template-based questions whose answers can be obtained by reasoning over commonsense resources or Wikipedia. In contrast, OK-VQA requires knowledge from unspecified external resources and, although smaller than KVQA in terms of the number of images and question-answer pairs, it is considerably bigger than the other knowledge-based VQA datasets.
Currently, multimodal transformers are the most successful systems for VQA and can be broadly classified into two types: single-stream and double-stream transformers. A good example of the former is VisualBERT (Li et al., 2019), where the BERT architecture (Devlin et al., 2019) is used, adding visual features obtained by an object detector as input and using visio-linguistic pretraining tasks, such as image-text matching. OSCAR (Li et al., 2020) follows a very similar philosophy, adding object tags to the input and proposing different pretraining strategies. Among double-stream transformers, VilBERT (Lu et al., 2019) and LXMERT (Tan and Bansal, 2019) use a dedicated transformer for each modality (text and image) and fuse them with a cross-modal transformer. Their differences lie mainly in some architectural choices and in the selection of pretraining tasks.
Regarding OK-VQA systems, multimodal transformers have also been used to provide implicit knowledge from pretraining tasks. For example, VilBERT uses a pretrained BERT to encode the questions, so it uses the implicit knowledge that BERT acquired during its pretraining. Additionally, VilBERT is further trained on Conceptual Captions (Sharma et al., 2018), a very large image-caption dataset from which additional knowledge can be acquired. Those multimodal transformers are the backbone of the best performing systems for OK-VQA, which also use symbolic knowledge to gain some extra performance.
ConceptBert (Gardères et al., 2020) was the first system to use multimodal transformers and symbolic knowledge for OK-VQA. It is based on a combination of a pretrained BERT to encode questions, a graph convolutional neural network to encode triples extracted from the ConceptNet knowledge graph (Speer et al., 2017) and a multimodal transformer (VilBERT) to jointly represent and reason over image features and encoded question tokens.
A similar approach was followed by KRISP (Marino et al., 2020), combining again a multimodal transformer with symbolic knowledge. In this case, the multimodal transformer, called MMBERT, is based on VisualBERT (Li et al., 2019) and initialized with the weights of a pretrained BERT. Additionally, the authors built a knowledge graph fusing DBPedia (Auer et al., 2007), ConceptNet (Speer et al., 2017), VisualGenome (Krishna et al., 2017) and hasPart KB (Bhakthavatsalam et al., 2020). They used different image feature encoders and the question tokens to obtain a subset of the full graph relevant to the target question and image. Finally, using a graph convolutional neural network, they combined the symbolic and implicit knowledge to predict the final answer.
Two recent approaches, MAVEx (Wu et al., 2021) and RVL (Shevchenko et al., 2021), showed different ways to combine implicit and symbolic knowledge. MAVEx used a pretrained VilBERT to generate various candidate answers which were later validated using answer-specific knowledge retrieval. The authors used both textual and visual knowledge resources, including images searched using Google, sentences from Wikipedia articles, and concepts from ConceptNet. On the other hand, RVL trained the two-stream multimodal transformer LXMERT (Tan and Bansal, 2019) with an auxiliary objective that aligned its representations with knowledge graph embeddings retrieved from ConceptNet and Wikidata.
Regarding the use of captions for VQA, to the best of our knowledge, Mucko (Zhu et al., 2020) is the only system that explores this idea. Mucko uses dense captions (Johnson et al., 2016) to query a knowledge graph to extract relevant information to answer the question. The reported results on OK-VQA are well below the state-of-the-art. Dense captions describe different regions of an image using short sentences. Our method differs in the use of a single caption which is the input to the LM, and does not require any knowledge graph.

Implemented models
In this section we describe the implemented models. We use PyTorch (Paszke et al., 2019) and the Transformers library (Wolf et al., 2020) for all the implementation work.

Caption-based model (CBM)
Our caption-based model, denoted by CBM, is divided into two steps: (i) a caption generation system that generates a short description of a given image and (ii) a language model that takes this caption and a question in order to answer it.
To generate captions from images we use OSCAR (Li et al., 2020), a transformer encoder that produces state-of-the-art results on several multimodal tasks including image captioning. As is common in multimodal transformers, OSCAR uses a pretrained object detector called FasterRCNN (Ren et al., 2015) to obtain region features from images and their respective labels. Both features and labels, alongside manually annotated captions, are then fed to the transformer during pretraining, following the work of Anderson et al. (2018). The image-captioning performance of the base and large models is similar, so we use OSCAR-base as our image-captioning system for all of our experiments.
On the other hand, the LM we use in all the experiments is a pretrained BERT-base model (Devlin et al., 2019). We feed sequences of tokenized captions and questions $T^{(0)} = \{t^{(0)}_i \mid i = 1, \dots, n_t\}$ to BERT, and take the output of the [CLS] or first token of the sequence, $t^{(n_l)}_1$, where $n_t$ is the number of tokens in the sequence and $n_l$ is the number of transformer layers.
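As an illustration of the text-only input described above, the following minimal sketch packs a caption and a question into one BERT-style sequence. The caption-first ordering, the naive whitespace tokenizer and the helper name `build_text_input` are assumptions made for the example; an actual system would rely on BERT's WordPiece tokenizer.

```python
def build_text_input(caption, question):
    """Pack a caption and a question into a single BERT-style token sequence.

    Assumptions (not stated in the text): the caption comes first, the
    question second, and tokens come from lowercase whitespace splitting
    rather than WordPiece.
    """
    caption_tokens = caption.lower().split()
    question_tokens = question.lower().split()
    tokens = ["[CLS]"] + caption_tokens + ["[SEP]"] + question_tokens + ["[SEP]"]
    # Segment ids: 0 for [CLS] + caption + first [SEP], 1 for question + last [SEP].
    segment_ids = [0] * (len(caption_tokens) + 2) + [1] * (len(question_tokens) + 1)
    return tokens, segment_ids


tokens, segment_ids = build_text_input(
    "Three teddy bears sitting next to each other on a couch.",
    "Which american president is most associated with the stuffed animal seen here?",
)
# tokens[0] is "[CLS]"; its final-layer representation is what the classifier reads.
```

The point of the sketch is only that image content reaches the language model as ordinary text, so no change to the embedding layer is needed.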
Although VQA (Antol et al., 2015; Goyal et al., 2017) and OK-VQA (Marino et al., 2019) were defined with open-ended answers, recent state-of-the-art models (Zhang et al., 2021; Marino et al., 2020) cast these tasks as classification problems, building a fixed vocabulary of answers from the training dataset. In order to fine-tune the language model for VQA, we add a classification head to the [CLS] embedding. Our classification head is a multilayer perceptron (MLP) with one hidden layer after $t^{(n_l)}_1$. We define our MLP in Eq. 1.
We use a GELU activation function as well as layer normalization (Ba et al., 2016). The sizes of the trainable parameters are determined by $n_{label}$, the number of labels in a given classification task, and $d_h$, which equals 768.
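A head matching this description (one hidden layer, GELU, layer normalization) could take the following form. The ordering of the operations, the hidden width $d_h$ and the symbols $W_1$, $b_1$, $W_2$, $b_2$, $h$ and $\hat{y}$ are assumptions made for illustration, not the authors' stated definition.

```latex
\begin{align}
h &= \mathrm{LayerNorm}\left(\mathrm{GELU}\left(W_1\, t^{(n_l)}_1 + b_1\right)\right), \\
\hat{y} &= \mathrm{softmax}\left(W_2\, h + b_2\right),
\end{align}
% Assumed shapes:
%   W_1 \in \mathbb{R}^{d_h \times d_h},        b_1 \in \mathbb{R}^{d_h},
%   W_2 \in \mathbb{R}^{n_{label} \times d_h},  b_2 \in \mathbb{R}^{n_{label}}.
```

Under these assumed shapes, only $W_2$ and $b_2$ depend on the answer vocabulary, so the same hidden layer can be kept when moving between tasks with different numbers of labels.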

Question-only baseline (BERT Q )
In order to assess the contribution of captions, we also trained a model which only had the question in the input, without any information about the image or caption, denoted as BERT Q . This model can be seen as an ablation of CBM.

Multimodal transformer (MMBERT)
We compare our CBM model with the multimodal transformer-based MMBERT (Marino et al., 2020), a variant of BERT that uses the question text and image region features as input. While BERT is designed to only process textual inputs, MMBERT adapts its embedding layer in order to be able to process features from images.
We use a FasterRCNN with a ResNeXt-152 (Xie et al., 2016) as its backbone to extract a total of $n_v$ region features $V = \{v_1, \dots, v_{n_v}\}$ per image. Each of these features $v_i \in \mathbb{R}^{d_v}$ represents an object that appears in the image, where $d_v$ equals 2048. $V$ lacks positional information about the objects, which can be solved by concatenating the corresponding bounding box coordinates to each feature. After some initial experiments, we concluded that this extra information does not improve performance on either VQA 2.0 or OK-VQA. We use the MMF multimodal framework (Singh et al., 2020) to extract the image region features that are fed into MMBERT.
In order to allow for easier comparison between our CBM and MMBERT we use the output representation for [CLS] to feed into the classification multilayer perceptron (see Section 3.1). Note that this is slightly different from the original MMBERT (Marino et al., 2020), which uses the average of all token representations in the last transformer layer.

Loss function
Contrary to previous works in VQA, we do not use binary cross-entropy loss, as initial experiments showed that cross-entropy loss with soft labels (SCE) converges faster with similar results. SCE loss is defined in Eq. 2, where y is the ground truth vector with probabilities proportional to the VQA evaluation metric (Eq. 3) assigned to each class.
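A minimal sketch of cross-entropy with soft labels is given below, in plain Python for readability. It assumes the common form $-\sum_c y_c \log p_c$, with $p$ the softmax of the model logits and $y$ obtained by normalizing per-class VQA scores $\min(1, x_c/3)$ so that they sum to one. The function names and the exact normalization are assumptions for the example.

```python
import math


def soft_targets(answer_counts):
    """Turn per-class annotation counts into a soft label distribution.

    Each class is scored with min(1, count / 3) and the scores are then
    normalized to sum to 1 (this normalization is an assumption).
    """
    scores = [min(1.0, c / 3.0) for c in answer_counts]
    total = sum(scores)
    if total == 0:
        return scores  # no annotated answer is in the vocabulary
    return [s / total for s in scores]


def soft_cross_entropy(logits, targets):
    """Cross-entropy between softmax(logits) and a soft target distribution."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    log_probs = [l - log_z for l in logits]
    return -sum(t * lp for t, lp in zip(targets, log_probs))


# Toy example with 3 answer classes; the 10 annotators voted 6 / 3 / 1.
y = soft_targets([6, 3, 1])
loss = soft_cross_entropy([2.0, 1.0, 0.1], y)
```

With a one-hot target the function reduces to the usual cross-entropy, which is a quick sanity check on the implementation.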

Combining both modalities
We are also interested in analyzing the complementarity of both models, i.e. the text-only modality using questions and captions, and the image-text modality with image region features and questions. Therefore, we define two different approaches to check how they complement each other.
Early fusion. For each question we feed both caption and image features alongside the question to the language model. This system can be seen as an MMBERT which processes a multimodal input composed of a question (text), a caption (text) and image region features. We initialize the weights of this model with the weights of the base language model (BERT-base) and fine-tune it on the target training data.
Late fusion. We train the caption-based model (Section 3.1) and MMBERT (Section 3.3) separately, each of them with their corresponding inputs, and combine their outputs at inference time to obtain the final answer. The combination is done by multiplying the output probabilities of both models for each class and taking the answer with the highest value.
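The late fusion rule can be sketched in a few lines. The function name `late_fusion`, the toy answer vocabulary and the probability values below are hypothetical; only the combination rule (per-class product followed by argmax) comes from the description above.

```python
def late_fusion(probs_cbm, probs_mmbert, answers):
    """Multiply the per-class probabilities of two models and pick the best answer."""
    if not (len(probs_cbm) == len(probs_mmbert) == len(answers)):
        raise ValueError("both models must score the same answer vocabulary")
    fused = [p * q for p, q in zip(probs_cbm, probs_mmbert)]
    best = max(range(len(fused)), key=fused.__getitem__)
    return answers[best], fused


answers = ["grizzly", "polar", "panda"]  # hypothetical answer vocabulary
p_cbm = [0.50, 0.40, 0.10]               # hypothetical caption-based model output
p_mmbert = [0.30, 0.60, 0.10]            # hypothetical MMBERT output
prediction, fused = late_fusion(p_cbm, p_mmbert, answers)
# prediction is "polar": 0.40 * 0.60 beats 0.50 * 0.30
```

The product is left unnormalized because rescaling does not change which class attains the maximum. In this toy case the fused answer differs from the caption-based model's own top choice, which is the kind of correction the combination is meant to allow.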

Datasets
The main dataset for our experiments is OK-VQA (Marino et al., 2019), since it allows us to evaluate the usage of the implicit knowledge of LMs in a multimodal task. We also run experiments on the VQA 2.0 dataset (Goyal et al., 2017) with a double motivation: (i) to use it as additional pretraining before applying the model to OK-VQA; (ii) to analyze the performance differences among models on a knowledge-based VQA dataset and a regular VQA dataset.
Figure 2 (example questions and answers):
VQA: What is the weather like? cloudy
OK-VQA: Why would one suspect that this is not chicago? sign
VQA: What color is the bear? brown
OK-VQA: What species of bear is this? grizzly
VQA: Are the animals in captivity? yes
OK-VQA: Which valuable material grows on this animal's face? ivory

VQA 2.0
This dataset contains open-ended questions about images where questions focus on identifying objects in the image and their attributes, detecting relations between them, as well as counting those objects. The dataset is composed of 204K images taken from the COCO dataset (Lin et al., 2014) and 1.1M questions, each question having 10 (possibly repeated) annotations as accepted answers. Following the classification setting of VQA tasks, which is currently the dominant paradigm, VQA 2.0 has 3129 different possible answers, extracted from the most frequent answers of the training split.
VQA 2.0 is divided into three splits named train, dev and test. Some of the images from the development split of VQA 2.0 are reused in OK-VQA's test split. So, in order to avoid any contamination, we do not use the VQA 2.0 dev set for any training or hyper-parameter tuning.
Antol et al. (2015) proposed a standard evaluation metric for VQA tasks where a system answer is considered totally correct if it appears at least three times in the ten ground-truth annotations. Considering that a given answer appears x times in a question's annotations, this accuracy metric is defined in Eq. 3.
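The metric described above (full credit for at least three matching annotations, partial credit otherwise) is commonly written as $\min(1, x/3)$. The sketch below assumes that form and exact string matching between the prediction and the annotations; the function name and the toy annotations are illustrative.

```python
def vqa_accuracy(predicted, annotations):
    """Score one prediction against the annotated answers as min(1, x / 3),
    where x is the number of annotations equal to the prediction."""
    x = sum(1 for a in annotations if a == predicted)
    return min(1.0, x / 3.0)


# Hypothetical annotations for one question (ten annotators).
annotations = ["brown"] * 7 + ["dark brown"] * 2 + ["black"]

full_credit = vqa_accuracy("brown", annotations)          # 7 matches
partial_credit = vqa_accuracy("dark brown", annotations)  # 2 matches
no_credit = vqa_accuracy("white", annotations)            # 0 matches
```

Because several answers can receive non-zero credit for the same question, the metric naturally yields graded targets rather than a single correct class.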

OK-VQA
The OK-VQA dataset is built upon 14,031 images from the COCO dataset and 14,055 crowd-sourced questions. Each question has ten annotated answers (possibly repeated), and the evaluation metric is the same as in VQA 2.0 (Eq. 3). As a knowledge-based VQA dataset, OK-VQA requires outside knowledge to answer the questions. However, this outside knowledge is neither provided nor identified, i.e. there is not a list of available knowledge sources for this task, making the task more challenging.
There are two versions of this dataset, depending on how the stemming of the answers provided by the crowd-sourcers is handled. The stemming used in OK-VQA v1.0 results in some "non-word" answers (such as "poni tail" instead of "pony tail"). OK-VQA v1.1 applied a different stemming algorithm, resulting in a more coherent answer vocabulary. We use OK-VQA v1.1 throughout our experiments, except for the state-of-the-art comparison, as most published systems report results on the v1.0 version.

Experiments and results
This section provides the results of the models defined in Section 3 and compares them with the state-of-the-art.

Experimental settings
We use the same hyperparameters as Marino et al. (2020) for fine-tuning the CBM, MMBERT, BERT Q and early fusion models on both the VQA 2.0 and OK-VQA tasks. We train our models for 88K steps using the AdamW optimizer (Loshchilov and Hutter, 2019). Our batch size is 56, with a maximum learning rate of $5 \cdot 10^{-5}$ following a cosine schedule with a linear warmup of 2K steps. All experiments were run on a single GPU with 12GB of VRAM, and their runtimes are at most 12 hours.
Table 1 shows the results for the three models presented in Section 3, which share the same architecture and initial parameters. The topmost rows correspond to the models fine-tuned only on OK-VQA (tagged as "Without VQA pretraining"), and the bottom rows to the same models fine-tuned on VQA 2.0 before being fine-tuned on OK-VQA. We observe that using questions alone (BERT Q) offers poor performance compared to the other two systems, achieving up to 13 points less accuracy. This shows that having some representation of the image (captions or image region features) is key to answering questions correctly. This is further supported by comparing the improvement that VQA pretraining entails: BERT Q improves by less than 2 points, whereas the other two improve their accuracy by 4-6 points.
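The learning-rate schedule from the experimental settings (linear warmup followed by cosine decay) can be sketched as follows. The step counts and the peak rate come from the text. Starting the warmup at zero, decaying to exactly zero at the last step and using a single half cosine are assumptions of this sketch.

```python
import math

MAX_LR = 5e-5          # maximum learning rate reported in the text
WARMUP_STEPS = 2_000   # linear warmup length reported in the text
TOTAL_STEPS = 88_000   # total number of training steps reported in the text


def learning_rate(step):
    """Learning rate at a given step: linear warmup, then cosine decay.

    Assumes warmup starts at 0 and the cosine phase ends at 0 (not stated
    in the text).
    """
    if step < WARMUP_STEPS:
        return MAX_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(1.0, progress)  # clamp if called past the final step
    return 0.5 * MAX_LR * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points of training.
samples = {s: learning_rate(s) for s in (0, 1_000, 2_000, 45_000, 88_000)}
```

Under these assumptions the rate peaks at step 2,000 and is back to half of the peak at step 45,000, the midpoint of the decay phase.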

Images vs. captions
Contribution of captions. When we compare the performance of CBM and MMBERT, we see that, when there is no visio-linguistic pretraining involved, CBM performs better in OK-VQA. However, when we pretrain these models on a similar multimodal task like VQA 2.0, their accuracy increases by 4-6 points and both obtain similar performance. As OK-VQA's training set is comparatively small (9K instances vs. VQA's 410K instances), we hypothesize that training MMBERT on OK-VQA is not enough to adapt the model to the new input modality. However, as CBM uses only text, fine-tuning with such a small training set is more effective.

Combining CBM and MMBERT
Given the different nature of the inputs, we wanted to check whether CBM and MMBERT are complementary. Our hypothesis is that the former can take advantage of the implicit knowledge acquired by the language model, whereas the latter has access to more fine-grained information found in image regions. Following the approaches of early and late fusion defined in Section 3.5, we show their performance in Table 2. These fusion models improve the performance of both CBM and MMBERT by 2 points in almost all cases. The only case where there is no improvement compared to CBM is early fusion without VQA pretraining. This may be caused again by the small training split of OK-VQA, which makes it difficult to learn how to ground the textual and visual modalities. However, this is solved when VQA pretraining is added to the model, vastly increasing the amount of data seen by the models and yielding similar performance for the early and late fusion models. The results validate our hypothesis, showing that image region features and captions are complementary.

Comparison with the state of the art
To compare our models with state-of-the-art models in OK-VQA, we had to repeat the experiments in OK-VQA v1.0. The results vary slightly, as can be seen in Table 3. In that table, we show the results of various models using only implicit knowledge and combining it with symbolic knowledge. As our models do not use symbolic knowledge, the corresponding column is empty.
The performance of KRISP, MAVEx and RVL is very similar. However, RVL has a contamination issue, as images from OK-VQA's test split were used to train its multimodal transformer. In Table 3 we observe that using symbolic knowledge improves the results by around 2 points on average. The highest improvement is achieved by MAVEx, with 3.5 points. Notice that all four systems use different ways to integrate symbolic knowledge from different resources.
If we look at our caption-based model CBM, we see that its performance is on par with the multimodal transformers used by the other systems. We believe this is remarkable, since we do not directly use any visual features in our models. Furthermore, when we use late fusion, the results we obtain are comparable to those of the systems which also use symbolic knowledge. Notice that we only use implicit knowledge for our systems and match the performance of systems which combine implicit and symbolic knowledge.

Analysis
In this section we first contrast the results on OK-VQA with those obtained in VQA 2.0, discussing the reasons for the different performance. We then present some qualitative analysis.

Results on VQA 2.0
Even though both unimodal and multimodal methods perform similarly in OK-VQA, we observed a different trend in VQA 2.0. Table 4 shows that CBM obtains 59.6, while MMBERT achieves 6 points more. We think this is due to the information loss when converting an image into a caption, as relevant information that is needed to answer the question can be lost. This is especially important for VQA 2.0, where the questions refer directly to image contents, spatial relations and object attributes (see Figure 2). Captions do not usually provide that additional information, and tend to focus on the description of the most relevant information. However, looking at the performance in OK-VQA, we see that captions contain enough information to effectively use the implicit knowledge of the BERT language model. Regarding the early and late fusion models, both of them improve the performance of MMBERT by 2-3 points, showing that our model is complementary to multimodal methods also in the VQA dataset.

Qualitative Analysis on OK-VQA
Both unimodal and multimodal algorithms perform similarly (see Table 1), but in 54.3% of the test examples their outputs differ. Figure 3 shows some OK-VQA test examples where the outputs of CBM and MMBERT with VQA pretraining differ.
Starting with the top-left example, CBM can infer that elephants are native to Africa whereas MMBERT does not. In fact, the generated caption includes the information that the animal found in the image is an elephant, performing the first step needed to answer the question. This way, the LM can focus on using its implicit knowledge in order to answer correctly.
The other two examples in the top row behave similarly. The caption facilitates the grounding between the question and the image. Whenever a question refers to the image ("this fruit" and "these items"), if the caption already mentions these objects ("bananas" and "traffic light", respectively), the LM seems to better leverage its implicit knowledge and reasoning capabilities to answer the question. The top-right example is interesting in this regard. While the image shows red traffic lights, the question asks about the effects of green lights. This may trick MMBERT into answering the effect that red lights produce, not the green ones.
The bottom row of Figure 3 shows three examples where the caption does not give enough information to infer the answer. In the first case CBM cannot decide whether the meat is steamed, fried or grilled by only examining the caption, while MMBERT does have access to the visual cues of the image, where we can see that the meat is grilled. This also happens in the second example, as the caption does not specify any ingredient of the beverage while we can see fruits in the image. The rightmost example illustrates a case where the caption could support the inference, but where CBM is wrong: with the given caption, "this game" refers to baseball, but CBM is unable to infer that three strikes are enough for a strikeout, whereas MMBERT manages to give the correct answer.
All in all, these examples support our hypothesis that visual features and captions are complementary. They also show that our system has some advantages regarding interpretability, especially in the cases where our method is wrong. In some cases, like the two leftmost examples in the bottom row, the object or feature needed to answer the question is missing from the caption. In other cases, the required information is in the caption, but the inference is erroneous.

Conclusions
In this paper we present a VQA system which describes each image with a caption and then ignores the image completely. We show that such a system performs surprisingly well in OK-VQA, where the questions cannot be answered with the image alone, requiring access to external knowledge. Our analysis indicates that the loss of information when summarizing the image into a caption is compensated by the better inference ability of text-only pretrained language models. In the future we would like to explore whether richer descriptions of images might improve results further, and whether text-only language models are more effective when incorporating symbolic knowledge graphs than current multimodal models.