
Co-attention graph convolutional network for visual question answering

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Visual Question Answering (VQA) is a challenging task that requires a fine-grained understanding of both the visual content of images and the textual content of questions. Conventional visual attention models, designed primarily from the perspective of the attention mechanism, lack the ability to reason about relationships between visual objects and ignore the multimodal interactions between questions and images. In this work, we propose a model that combines a graph convolutional network with a co-attention network to circumvent these problems. The model employs binary relational reasoning as its graph learner module to learn a graph structure that captures the relationships between visual objects, and uses spatial graph convolution to learn a question-specific image representation that is aware of spatial location. We then perform parallel co-attention learning by passing the image representations and the question word features through a deep co-attention module. Experimental results demonstrate that our model achieves an overall accuracy of \(68.67\%\) on the test-std split of the benchmark VQA v2.0 dataset, outperforming most existing models.
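To make the pipeline concrete, below is a minimal PyTorch sketch of the three stages the abstract describes: a graph learner that scores object pairs (binary relational reasoning) conditioned on the question, a spatial graph convolution that aggregates region features through the learned adjacency, and a parallel co-attention step between image regions and question words. All module names, hidden sizes, the single-head affinity formulation, and the bounding-box encoding are our assumptions for illustration; this is not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphLearner(nn.Module):
    # Binary relational reasoning: score every ordered object pair,
    # conditioned on the question, to obtain a soft adjacency matrix.
    def __init__(self, v_dim, q_dim, h_dim=512):
        super().__init__()
        self.pair_mlp = nn.Sequential(
            nn.Linear(2 * v_dim + q_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, 1))

    def forward(self, v, q):                      # v: (B, N, Dv), q: (B, Dq)
        B, N, _ = v.shape
        vi = v.unsqueeze(2).expand(B, N, N, v.size(-1))   # object i
        vj = v.unsqueeze(1).expand(B, N, N, v.size(-1))   # object j
        qq = q[:, None, None, :].expand(B, N, N, q.size(-1))
        scores = self.pair_mlp(torch.cat([vi, vj, qq], dim=-1)).squeeze(-1)
        return F.softmax(scores, dim=-1)          # (B, N, N), rows sum to 1


class SpatialGCN(nn.Module):
    # One graph-convolution step over the learned graph; spatial awareness
    # comes from normalized bounding-box coordinates concatenated to the
    # region features before aggregation.
    def __init__(self, v_dim, out_dim, box_dim=4):
        super().__init__()
        self.proj = nn.Linear(v_dim + box_dim, out_dim)

    def forward(self, v, boxes, adj):             # boxes: (B, N, 4), normalized
        h = torch.cat([v, boxes], dim=-1)
        return F.relu(self.proj(torch.bmm(adj, h)))


class ParallelCoAttention(nn.Module):
    # Parallel co-attention with a shared affinity matrix: question words
    # attend over image regions while regions attend over words.
    def __init__(self, dim):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)

    def forward(self, img, words):                # img: (B, N, D), words: (B, T, D)
        A = torch.bmm(self.affinity(words), img.transpose(1, 2))    # (B, T, N)
        attended_img = torch.bmm(F.softmax(A, dim=2), img)          # per word
        attended_words = torch.bmm(F.softmax(A, dim=1).transpose(1, 2), words)
        return attended_img, attended_words

A usage sketch under the same assumptions: given, say, 2048-d region features v with boxes and a question vector q, adj = GraphLearner(2048, 512)(v, q) yields the learned graph, img = SpatialGCN(2048, 512)(v, boxes, adj) yields spatially aware, question-conditioned region representations, and ParallelCoAttention(512) fuses them with the word features (projected to the same width) before answer classification.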


Data availability

The datasets generated and/or analyzed during the current study are available at https://visualqa.org and https://visualgenome.org.


Acknowledgements

We would like to express our sincere gratitude to Professor Xiaowang Zhang and Ph.D. candidate Shaojuan Wu of Tianjin University for their valuable feedback and helpful discussions. This work was supported by the National Natural Science Foundation of China (11801007, 12171002), the Key Scientific Research Projects of Colleges and Universities in Anhui Province (2022AH050093), and the Scientific Research Projects for Graduate Students of the Anhui Province Education Department (YJS20210510, 2022cxcysj150).

Author information


Corresponding author

Correspondence to Ming Zhu.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Communicated by S. Vrochidis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, C., Tan, YY., Xia, TT. et al. Co-attention graph convolutional network for visual question answering. Multimedia Systems 29, 2527–2543 (2023). https://doi.org/10.1007/s00530-023-01125-7

