Improving reasoning with contrastive visual information for visual question answering

Visual Question Answering (VQA) aims to output a correct answer based on cross-modality inputs including a question and visual content. In the general pipeline, information reasoning plays the key role in producing a reasonable answer. However, visual information is commonly not fully exploited in many popular models. To address this challenge, a new strategy is proposed in this work to make the best use of visual information during reasoning. Specifically, visual information is divided into two subsets: (1) a question-relevant visual set, and (2) a question-irrelevant visual set. Both sets are then employed during reasoning to generate reasonable outputs. Experiments conducted on the benchmark VQAv2 dataset demonstrate the effectiveness of the proposed strategy. The project page can be found at https://mic.tongji.edu.cn/e6/8d/c9778a190093/page.htm.

✉ Email: hanliwang@tongji.edu.cn
Introduction: Visual Question Answering (VQA) is a cross-modality task that aims to output a reasonable answer by processing and fusing a target image and the corresponding question. In most of the popular works [1][2][3][4], an integrated framework consists of three parts: (1) a question feature extraction module, (2) a visual feature extraction module, and (3) a cross-modality feature reasoning module that fuses the features and outputs an answer. In general, the reasoning ability of the third part is the most significant for producing a correct and decent answer.
Nowadays, there are a few effective models for efficient reasoning. However, an evaluation of the effectiveness of each input reveals that only the question text is fully used in most current models, while the comprehensive key information of the visual content is not sufficiently mined or exploited. For instance, the prediction accuracy of BLOCK [3] remains high (i.e., 44% on the VQAv2 dataset [5]) even when only the question is used, without any visual information.
Additionally, the attention mechanism, which is used to align and select features in the reasoning module, is widely employed in a variety of models [2][3][4]. During this procedure, only a few visual objects in the image contribute positively to reasoning, while the other visual objects are discarded. The discarded visual content also contains structured information that cannot be aligned with the question by the attention mechanism, and it should be classified as question-irrelevant visual information. Inspired by the observation that contrastive visual information can help improve reasoning ability, a new strategy is proposed in this work to fully use all of these visual semantics; an overview of the proposed framework is shown in Figure 1.
As shown in the feature extraction stage, given an image V and a question Q, the visual and question representations are extracted by Faster R-CNN and RNN/BERT, denoted as V_R and Q_R, respectively. Both representations are then sent to the following coarse reasoning stage, where the visual information is split into a question-relevant visual set and a question-irrelevant visual set, denoted as V_R_re and V_R_ir respectively. Next, the two subsets are used in the fine-grained reasoning stage together with the question feature. Clearly, besides a new coarse reasoning module for vision splitting, a new cross-modality reasoning branch that shares parameters with the regular reasoning branch is required, as well as the corresponding new prediction target and loss function. The rest of this paper is organized as follows. First, related works are reviewed, and then the details of the proposed strategy are presented. Next, experiments are conducted to demonstrate the effectiveness of the proposed strategy. Finally, the conclusion and future works are discussed.

Related works:
The attention mechanism is usually employed to search for semantic image regions in the VQA task. In particular, a stacked attention module is employed to iteratively discover the relevant semantic visual regions guided by the question query [2]. Besides visual attention for reasoning, the fusion strategy commonly plays the most important role in fusing the cross-modality inputs, and a series of bilinear pooling strategies have been developed to replace the simple non-linear projection [6]. Recently, BERT, a powerful framework derived from natural language processing (NLP), has also been introduced for better cross-modality feature fusion [4,7,8]. In addition, language bias draws increasing attention from researchers. In the work of GVQA [9], the training and testing sets of VQAv2 are re-split, clearly revealing that many models are driven by superficial linguistic correlations. A few intuitive models have then been designed to alleviate this bias [10]. However, none of these works pays attention to the unaligned visual information.
In fact, irrelevant or negative information is also used in other fields. For metric learning, the Siamese network [11] is a classic model, in which the loss function is established by a correlation measurement of the inputs. The training strategy of [11] is similar to the positive-and-negative strategy used in video-text retrieval [12], where the model is optimized by minimizing both the intra-sample loss and the inter-sample loss. However, the reasoning procedure in the VQA task involves not only aligning the cross-modality information but also reasoning on the aligned visual semantic information.
Algorithm: The attention mechanism, which is inspired by the human visual psychological system, is frequently employed in vision-language tasks owing to its good intuition and excellent performance. In general, an attention mechanism pushes the model to select valuable features, which is consistent with the causal reasoning of humans. In this procedure, the unnoticed information may be discarded, but it can also be regarded as question-irrelevant information. Motivated by this, both the noticed and the unnoticed visual information are employed to enhance the learning ability of the model in this work. In practice, BLOCK [3] and LXMERT [4] are employed as backbones to represent the UpDn based models [3,6,13] and the BERT based models [4,7,8] respectively, to fully reveal the effectiveness of the proposed strategy.
Coarse reasoning: One of the important functions of the proposed framework is to align cross-modality features. To this end, the question-relevant visual features are selected, weighted, and then fused with the question feature. As shown in Figure 1, the proposed Coarse Reasoning Module (CRM) focuses on screening the question-relevant visual features, which acts as a sub-task of the original reasoning procedure. The structure of CRM implemented with each backbone model can be the same as the cross-modality feature reasoning module of that backbone. The details of CRM as used in BLOCK and LXMERT are presented in Figure 2.
For the CRM module of BLOCK (the top part of Figure 2), it consists of a fusion unit (i.e., the Block Fusion Unit), two non-linear layers that map the feature to a suitable size, and a sigmoid layer that squashes the values into (0, 1). The CRM structure of LXMERT is similar to that of BLOCK, except that each adopts the fusion-unit architecture of its own backbone. In the BERT based model, the output size is equal to that of the input, and only the outputs corresponding to the visual inputs are used in the following procedure. As shown in the coarse reasoning stage in Figure 1, the output of CRM can be formatted as Gate ∈ R^N, where N is the number of visual objects. The visual representation V_R is then split into two subsets: (1) the question-relevant visual set V_R_re = V_R × Gate, and (2) the opposite question-irrelevant visual set V_R_ir = V_R × (1 − Gate), which are fed to the following reasoning module.
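The gating split above can be sketched as follows; this is a minimal illustration (variable names and tensor shapes are assumptions, not the authors' code), showing how a sigmoid gate over N objects partitions the visual features into two complementary subsets.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 36, 2048                       # assumed: objects per image, feature size
V_R = rng.standard_normal((N, D))     # visual representations from Faster R-CNN
logits = rng.standard_normal(N)       # stand-in for the CRM fusion + non-linear layers

gate = 1.0 / (1.0 + np.exp(-logits))  # Gate in (0, 1)^N via the sigmoid layer
V_R_re = V_R * gate[:, None]          # question-relevant visual set
V_R_ir = V_R * (1.0 - gate)[:, None]  # question-irrelevant visual set

# The two subsets are complementary: they sum back to the original representation.
assert np.allclose(V_R_re + V_R_ir, V_R)
```

Because the gate is a soft weighting rather than a hard mask, every object contributes to both subsets in complementary proportions.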
Fine-grained reasoning: In order to enhance the reasoning ability, the question-irrelevant visual information is employed at the training phase in addition to the question-relevant visual information. In particular, a new reasoning branch is designed and optimized during training to make full use of the visual semantic information, which means corresponding loss functions are also required during training. The structure of the new reasoning branch is the same as the regular reasoning branch in the backbone, and the two branches share the same parameters.
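The parameter sharing between the two branches can be sketched as below; this is a simplified illustration under assumed names and shapes (a single linear head stands in for the backbone's full reasoning module), not the actual architecture.

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 2048, 3129                    # assumed feature size and answer count (K+1 logits)
W = rng.standard_normal((D, K + 1)) * 0.01   # one shared set of reasoning parameters

def reason(v_fused, W):
    """Shared reasoning head: the same W is applied regardless of branch."""
    return v_fused @ W               # logits over K+1 classes

v_re = rng.standard_normal(D)        # stand-in for the fused question-relevant feature
v_ir = rng.standard_normal(D)        # stand-in for the fused question-irrelevant feature

P_Ans = reason(v_re, W)              # regular branch output
P_Ir = reason(v_ir, W)               # new branch output, same parameters
assert P_Ans.shape == P_Ir.shape == (K + 1,)
```

Sharing W means the new branch adds a training signal without adding reasoning parameters; only CRM introduces extra weights.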
Regarding the regular reasoning branch in the backbone, it is used to generate the reasonable answer. As usual, the binary cross entropy loss is used to optimize this branch by minimizing the distance between the output and the target. The loss function is written as L_Ans = −(T_Ans · log(σ(P_Ans)) + (1 − T_Ans) · log(1 − σ(P_Ans))), where P_Ans ∈ R^(K+1) and T_Ans ∈ R^(K+1) are the output vector and the target vector of the regular reasoning branch, K is the number of predicted answers, and σ is the sigmoid function.
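A minimal numpy sketch of this binary cross entropy term (the `eps` guard and the mean reduction are implementation assumptions, not part of the formula):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce_loss(P_Ans, T_Ans):
    """L_Ans as written above, averaged over the K+1 answer logits."""
    s = sigmoid(P_Ans)
    eps = 1e-12                      # numerical safety only
    return -np.mean(T_Ans * np.log(s + eps) + (1 - T_Ans) * np.log(1 - s + eps))

# Toy example: three answer logits, first one is the soft-labeled target.
P = np.array([2.0, -1.0, 0.5])
T = np.array([1.0, 0.0, 0.0])
loss = bce_loss(P, T)
assert loss > 0
```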
The new additional reasoning branch is designed to make the model further understand the visual content during cross-modality information reasoning. The state in which the model fails to align the cross-modality inputs should be identified, which means this branch has to be trained to generate a new answer. As a total of K answers are predicted and the maximum label of the output is K − 1, K is adopted as the target label for this branch. The output vector and the target vector are written as P_Ir ∈ R^(K+1) and T_Ir ∈ R^1, respectively. For the loss function, the cross entropy is also employed, formatted as L_Ir = −log(exp(P_Ir[K]) / Σ_j exp(P_Ir[j])). Compared to the regular branch, the convergence of this reasoning branch is unstable. Thus, a new flat loss L_Flat is built to solve this problem, where the corresponding prediction target is a vector T_Flat ∈ R^(K+1). To stabilize convergence, T_Flat should remain relatively consistent with the output of the regular branch, from which it is computed. As a consequence, the loss function of the whole model is L = λ_0 · L_Ans + λ_1 · L_Ir + λ_2 · L_Flat, in which λ_0, λ_1 and λ_2 are the weights balancing the three loss terms.

Results: Besides the aforementioned models, two other classic UpDn based models, SAN and BAN, are also employed as backbone models to comprehensively reveal the effectiveness of the proposed strategy. The fusion unit of CRM in SAN is the solo attention used in the backbone model, while for simplicity the fusion unit of CRM in BAN is the same as that used in BLOCK. The optimization method for all the UpDn based models is Adamax, while BertAdam is employed for the BERT based model. As shown in Table 1, it is obvious that the performance of the proposed strategy with each of these models is better than that of the baseline backbone. In particular, the improvement with the LXMERT based model is especially encouraging.
In addition, the BERT based models are usually pre-trained on other larger datasets, but in our experiments the parameters of CRM and the backbone are initialized from the original model, and all the other additional parameters (i.e., the parameters in the appended fully connected layer) are randomly initialized. The better performance shows that even without any pre-training procedure, the proposed strategy can still gain an obvious performance improvement. This suggests that the reasoning mode of the original architecture is not changed by the proposed method; rather, the reasoning ability of the regular branch is enhanced by the appended new branch. Besides the results listed above, further experiments on the sensitivity of the parameters λ_0, λ_1 and λ_2 are conducted. To maintain the effectiveness of the proposed method, the value of λ_1 should be less than 1/batchsize and the value of λ_2 should be larger than or equal to 1.
By contrast, when deprived of proper visual input, the model using the proposed strategy performs worse than the backbone, as seen from the results in Table 2. Concretely, the visual information is altered in the validation stage: 'WV' means each image in the dataset is replaced by another image, while 'ZV' means each image is replaced by a zero matrix. The performance trend is consistent with the intuition that the newly designed branch makes the model capture more correlations between the visual information and the question while relying less on the correlation between the input question and the output label. This shows that the reasoning ability is further enhanced by the newly designed reasoning branch.
Ablation study: Ablation experiments are implemented to further reveal the effectiveness of each component of the proposed strategy. The models BLOCK and LXMERT are used as the backbones for our implementation. Compared to the backbone models, the model using the proposed strategy requires more parameters, which mainly come from the new CRM module. To evaluate the effectiveness of CRM, the CRM module is appended to the backbone models while the new reasoning branch and the corresponding loss functions are removed, and all the hyper-parameters are kept the same as those used in the backbone. The results are presented in Table 3, where it can be seen that both models benefit slightly from using CRM, and the main improvement of the proposed strategy comes from the use of the new reasoning branch.
The effectiveness of the new reasoning branch derives from using the loss functions L_Ir and L_Flat together. To evaluate the effect of these two loss functions, each of them is appended individually. Note that the model without L_Flat cannot converge stably unless λ_1 is small enough (i.e., 1e-7). The experimental results are shown in Table 4 (where +L_Ir means only L_Ir is used). On the BLOCK backbone, the performance when only using L_Ir drops compared to the model using CRM, and the improvement on the LXMERT backbone is subtle. This may be caused by the magnitude of λ_1, as the effect of the proposed method becomes nearly negligible with such a small value. As for L_Flat, which is designed to stabilize the convergence of the new reasoning branch, both models benefit from using it.
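As a toy illustration of the extra-class target discussed above (helper names and shapes are assumptions, not the authors' code), the cross entropy that pushes the question-irrelevant branch toward label K can be sketched as:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def ce_to_class(P_Ir, target):
    """Cross entropy of the irrelevant branch toward a single target label."""
    return -np.log(softmax(P_Ir)[target])

K = 4                                 # toy answer vocabulary of size K
P_Ir = np.zeros(K + 1)                # uniform logits over the K+1 classes
loss = ce_to_class(P_Ir, K)           # target label is the extra class K

# With uniform logits, the softmax assigns 1/(K+1) to each class,
# so the loss equals log(K+1).
assert np.isclose(loss, np.log(K + 1))
```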

Conclusion:
To further exploit the visual information in the reasoning procedure of the VQA task, a novel reasoning strategy is proposed in this work, which makes the model align the question not only with the question-relevant visual features but also with the question-irrelevant visual features. As a consequence, a more accurate correlation between the question and the visual image is obtained. The experimental results demonstrate the effectiveness of the proposed strategy. However, some problems remain to be addressed. For instance, the objective function of the question-irrelevant reasoning branch needs further investigation, and the coarse reasoning module requires further exploration in future works.