
Co-attention graph convolutional network for visual question answering

  • Regular Paper
  • Published in Multimedia Systems

Abstract

Visual Question Answering (VQA) is a challenging task that requires a fine-grained understanding of both the visual content of images and the textual content of questions. Conventional visual attention models, designed primarily from the perspective of the attention mechanism, lack the ability to reason about relationships between visual objects and ignore the multimodal interactions between questions and images. In this work, we propose a model that combines a graph convolutional network with a co-attention network to circumvent these problems. The model employs binary relational reasoning as its graph learner module to learn a graph structure that captures the relationships between visual objects, and uses spatial graph convolution to learn a question-specific image representation that is aware of spatial location. We then perform parallel co-attention learning by passing the image representations and the question word features through a deep co-attention module. Experimental results demonstrate that our model achieves an overall accuracy of \(68.67\%\) on the test-std split of the benchmark VQA v2.0 dataset, outperforming most existing models.
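To make the pipeline concrete, below is a minimal PyTorch sketch of the three stages the abstract describes: a graph learner that scores object pairs (binary relational reasoning) conditioned on the question, a spatial graph convolution that aggregates region features through the learned adjacency, and a parallel co-attention step between image regions and question words. All module names, hidden sizes, the single-head affinity formulation, and the bounding-box encoding are our assumptions for illustration; this is not the authors' released implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphLearner(nn.Module):
    # Binary relational reasoning: score every ordered object pair,
    # conditioned on the question, to obtain a soft adjacency matrix.
    def __init__(self, v_dim, q_dim, h_dim=512):
        super().__init__()
        self.pair_mlp = nn.Sequential(
            nn.Linear(2 * v_dim + q_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, 1))

    def forward(self, v, q):                      # v: (B, N, Dv), q: (B, Dq)
        B, N, _ = v.shape
        vi = v.unsqueeze(2).expand(B, N, N, v.size(-1))   # object i
        vj = v.unsqueeze(1).expand(B, N, N, v.size(-1))   # object j
        qq = q[:, None, None, :].expand(B, N, N, q.size(-1))
        scores = self.pair_mlp(torch.cat([vi, vj, qq], dim=-1)).squeeze(-1)
        return F.softmax(scores, dim=-1)          # (B, N, N), rows sum to 1


class SpatialGCN(nn.Module):
    # One graph-convolution step over the learned graph; spatial awareness
    # comes from normalized bounding-box coordinates concatenated to the
    # region features before aggregation.
    def __init__(self, v_dim, out_dim, box_dim=4):
        super().__init__()
        self.proj = nn.Linear(v_dim + box_dim, out_dim)

    def forward(self, v, boxes, adj):             # boxes: (B, N, 4), normalized
        h = torch.cat([v, boxes], dim=-1)
        return F.relu(self.proj(torch.bmm(adj, h)))


class ParallelCoAttention(nn.Module):
    # Parallel co-attention with a shared affinity matrix: question words
    # attend over image regions while regions attend over words.
    def __init__(self, dim):
        super().__init__()
        self.affinity = nn.Linear(dim, dim, bias=False)

    def forward(self, img, words):                # img: (B, N, D), words: (B, T, D)
        A = torch.bmm(self.affinity(words), img.transpose(1, 2))    # (B, T, N)
        attended_img = torch.bmm(F.softmax(A, dim=2), img)          # per word
        attended_words = torch.bmm(F.softmax(A, dim=1).transpose(1, 2), words)
        return attended_img, attended_words

A usage sketch under the same assumptions: given, say, 2048-d region features v with boxes and a question vector q, adj = GraphLearner(2048, 512)(v, q) yields the learned graph, img = SpatialGCN(2048, 512)(v, boxes, adj) yields spatially aware, question-conditioned region representations, and ParallelCoAttention(512) fuses them with the word features (projected to the same width) before answer classification.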


Data availability

The datasets generated and/or analyzed during the current study are available at https://visualqa.org and https://visualgenome.org.


Acknowledgements

We would like to express our sincere gratitude to Professor Xiaowang Zhang and Ph.D. candidate Shaojuan Wu of Tianjin University for their valuable feedback and helpful discussions. This work was supported by the National Natural Science Foundation of China (11801007, 12171002), the Key Scientific Research Projects of Colleges and Universities in Anhui Province (2022AH050093), and the Scientific Research Projects for Graduate Students of the Anhui Province Education Department (YJS20210510, 2022cxcysj150).

Author information


Corresponding author

Correspondence to Ming Zhu.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Communicated by S. Vrochidis.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Liu, C., Tan, YY., Xia, TT. et al. Co-attention graph convolutional network for visual question answering. Multimedia Systems 29, 2527–2543 (2023). https://doi.org/10.1007/s00530-023-01125-7

