Multi-modal adaptive gated mechanism for visual question answering

Visual Question Answering (VQA) is a multimodal task in which natural language questions are asked and answered based on image content. For multimodal tasks, obtaining accurate modality feature information is crucial. Existing research on VQA models mainly approaches the problem from the perspectives of attention mechanisms and multimodal fusion, and tends to overlook both modal interaction learning and the noise introduced during modal fusion, which affect the overall performance of the model. This paper proposes a novel and efficient multimodal adaptive gated mechanism model, MAGM. The model adds an adaptive gating mechanism to intra- and inter-modality learning and to the modal fusion process. It can effectively filter irrelevant noise, obtain fine-grained modal features, and improve the model's ability to adaptively control the contribution of the two modalities to the predicted answer. In the intra- and inter-modality learning modules, self-attention gated (SAG) and self-guided-attention gated (SGAG) units are designed to effectively filter noise from text and image features. In the modal fusion module, an adaptive gated modal feature fusion structure is designed to obtain fine-grained modal features and improve the accuracy of the model's answers. Quantitative and qualitative experiments on two VQA benchmark datasets, VQA 2.0 and GQA, show that the proposed method outperforms existing methods: MAGM achieves an overall accuracy of 71.30% on VQA 2.0 and 57.57% on GQA.


Review Comments to the Author
Reviewer #1: Please can you explain in more detail how the gating mechanism is used to reduce noise.
Response: Thank you for reviewing our paper and for your suggestions. We have carefully considered your question and added new content at lines 341-353, 380-391, and 450-454 of the manuscript, marked in blue font; older content is marked with a yellow background and strikethrough so that the changes are easy to distinguish.
In subsection 3.2.2.1, we introduce a gating mechanism in the SAG unit. Through formula (6), the output of the multi-head self-attention mechanism X1 and the output of the feed-forward network X2 are concatenated; this step thoroughly combines the semantic dependencies captured by self-attention with the local text features captured by the feed-forward network. Formulas (7) and (8) then constrain the information in the concatenated feature vector X3, so that the Gated Linear Unit (GLU) in formula (9) can adaptively control the information flowing through it. By fusing X2 back in via formula (9), more low- and high-level semantic features are retained while irrelevant and redundant information is effectively filtered out. Because the question texts in the VQA task are short, their critical semantic features must be fully extracted and mined; the above process enhances the semantic representation ability of the feature subspace and better captures the critical semantic information in the question texts. We marked the relevant revisions in blue font at lines 352-365 of the manuscript.
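For clarity, the following minimal PyTorch sketch illustrates the data flow described above. The class name, hidden dimensions, and the exact form of the limiting projections in formulas (7)-(8) are simplified for illustration and do not reproduce our released code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAGUnit(nn.Module):
    """Illustrative self-attention gated (SAG) unit for question features."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )
        # Stand-in for the limiting projections of formulas (7)-(8): maps the
        # concatenated vector to 2*dim so the GLU can halve it back to dim.
        self.limit = nn.Linear(2 * dim, 2 * dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, _ = self.attn(x, x, x)              # multi-head self-attention output X1
        x2 = self.ffn(x1)                       # feed-forward output X2
        x3 = torch.cat([x1, x2], dim=-1)        # formula (6): concatenation X3
        gated = F.glu(self.limit(x3), dim=-1)   # formulas (7)-(9): constrain, then gate
        return gated + x2                       # formula (9): fuse X2 back in
```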
In subsection 3.2.2.2, we introduce a gating mechanism in the SGAG unit. Through formula (12), the original image feature input Y and the output of the feed-forward network Y2 are concatenated; this step thoroughly combines the text-image region interaction features captured by the multi-head attention and feed-forward layers with the original image features, yielding rich image semantics that incorporate textual information. Formulas (13) and (14) then constrain the concatenated feature vector Y3, so that the GLU in formula (15) can adaptively control the information flowing through it. By fusing Y2 back in via formula (15), more low- and high-level semantic features are retained while irrelevant and redundant information is effectively filtered out. This process strengthens the semantic representation of question-guided image features, improves the semantic association between text and image, and better captures the image-region features that are critical to the question. We marked the relevant revisions in blue font at lines 387-400 of the manuscript.
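A matching sketch for the SGAG unit is given below. Here the gating is applied to the concatenation of the original image features Y with the feed-forward output Y2; the use of a single cross-attention layer for the question-guided interaction is a simplification for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SGAGUnit(nn.Module):
    """Illustrative self-guided-attention gated (SGAG) unit for image features."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.guided_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )
        # Stand-in for the limiting projections of formulas (13)-(14).
        self.limit = nn.Linear(2 * dim, 2 * dim)

    def forward(self, y: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
        # Image regions (queries) attend to question words (keys/values).
        y1, _ = self.guided_attn(y, x, x)
        y2 = self.ffn(y1)                       # feed-forward output Y2
        y3 = torch.cat([y, y2], dim=-1)         # formula (12): concat original Y with Y2
        gated = F.glu(self.limit(y3), dim=-1)   # formulas (13)-(15): constrain, then gate
        return gated + y2                       # formula (15): fuse Y2 back in
```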
In Section 3.3, we introduce an adaptive gated feature fusion structure. Through formulas (20), (21), and (23), deep, fine-grained information filtering is applied to the preliminarily fused question feature X̃, the image feature Ỹ, and their weighted-sum feature h. Finally, formulas (24)-(26) realize the adaptive fusion of the text and image feature representations, so that the model can thoroughly learn high-level fused semantic representations across modalities. We marked the relevant revisions in blue font at lines 447-452 of the manuscript.
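The sketch below illustrates one plausible shape for this fusion step. The scalar weighting used to form h and the exact inputs to the two gates are assumptions made for illustration; the manuscript's formulas (20)-(26) define the precise computation:

```python
import torch
import torch.nn as nn

class AdaptiveGatedFusion(nn.Module):
    """Illustrative adaptive gated fusion of question and image features."""
    def __init__(self, dim: int):
        super().__init__()
        self.w_x = nn.Linear(dim, 1)   # scalar weight for the question feature
        self.w_y = nn.Linear(dim, 1)   # scalar weight for the image feature
        self.gate_x = nn.Linear(2 * dim, dim)
        self.gate_y = nn.Linear(2 * dim, dim)

    def forward(self, x_fused: torch.Tensor, y_fused: torch.Tensor) -> torch.Tensor:
        # Weighted-sum feature h of the two preliminarily fused modal features.
        a = torch.softmax(
            torch.cat([self.w_x(x_fused), self.w_y(y_fused)], dim=-1), dim=-1
        )
        h = a[..., :1] * x_fused + a[..., 1:] * y_fused
        # Gates conditioned on each modal feature together with h (cf. (20)-(23)).
        g_x = torch.sigmoid(self.gate_x(torch.cat([x_fused, h], dim=-1)))
        g_y = torch.sigmoid(self.gate_y(torch.cat([y_fused, h], dim=-1)))
        # Adaptive fusion (cf. (24)-(26)): each gate controls how much its
        # modality contributes to the final representation.
        return g_x * x_fused + g_y * y_fused
```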
We have carefully considered your constructive questions and revised the manuscript accordingly, providing detailed explanations so that readers can clearly understand our intentions.
Thank you again for reviewing our manuscript and acknowledging the relevant work we have done.
In equation 28, the left hand side of equation should be BCELoss not "N". Response: We apologize for this mistake in Equation (28). Following your suggestion, we have carefully corrected it: the left-hand side of Equation (28) now reads "BCELoss" instead of "N".
In equation 28, what does "n" represent? It should be replaced with N (as N represents the number of answer categories). Response: We apologize for this mistake in Equation (28). Following your reminder, we double-checked the equation; "n" should indeed be "N", and we have carefully corrected it.
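For completeness, and assuming Equation (28) is the standard binary cross-entropy over the N answer categories (which matches the reviewer's reading), the corrected equation takes the form:

```latex
\mathrm{BCELoss} = -\frac{1}{N} \sum_{i=1}^{N}
  \left[ y_i \log \hat{y}_i + (1 - y_i) \log\left(1 - \hat{y}_i\right) \right]
```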
Reviewer #2: This article improves performance on the VQA task by introducing a gating mechanism in intra- and inter-modal interactive learning, along with a multi-modal adaptive gated mechanism during multi-modal fusion. To better control the contribution of each modal feature to the predicted answer, the authors propose a self-attention gated unit and a self-guided-attention gated unit. Experimental results on the VQA 2.0 and GQA benchmark datasets show the relevance of the proposed model. In general, the contribution of this work is significant. I recommend this work for publication.