Abstract
Capturing question-relevant spatio-temporal information in videos remains a key challenge in video question answering (VideoQA). Although great progress has been made, most existing methods do not sufficiently model the correlation among appearance, motion, and object features, making it difficult to fully exploit spatio-temporal relationships at different granularities. Moreover, recent studies typically apply the same interaction method when each visual feature interacts with the question features separately, ignoring the distinct spatio-temporal characteristics of appearance and motion and thereby causing a spatio-temporal mismatch. In this paper, we propose an Appearance-Motion Dual-Stream Heterogeneous Network (AMHN) for VideoQA, which exploits the synergy among the three feature types through heterogeneous interactions tailored to their spatio-temporal characteristics. AMHN combines object features with appearance features and motion features, respectively, to obtain two high-level visual representations that carry object information; these are fed into an object-relational reasoning module to acquire relation-aware visual features. We then use a bilinear attention network for the appearance stream and propose a Video-Text Symmetric Attention Network (VTSAN) for the motion stream to obtain diverse features, which are fused under the guidance of the question to predict the final answer. We evaluate AMHN on two VideoQA benchmark datasets and perform an extensive ablation study; the experimental results demonstrate its state-of-the-art performance.
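The dual-stream interaction described above can be sketched as follows. This is a minimal illustration with numpy, not the paper's actual implementation (which is behind the access wall): the feature shapes, the mean pooling, the single shared similarity matrix in the symmetric attention, and all function names are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16          # shared hidden size (assumed)
T, L = 8, 6     # number of video clips / question tokens (assumed)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bilinear_attention(V, Q, W):
    """Appearance stream: bilinear attention map A = softmax(V W Q^T),
    then pool the question-attended features into one vector."""
    A = softmax(V @ W @ Q.T)         # (T, L) attention map
    return (A @ Q).mean(axis=0)      # (d,)

def symmetric_attention(M, Q):
    """Motion stream: one shared similarity matrix attends symmetrically
    in both directions (video-to-text and text-to-video)."""
    S = M @ Q.T                      # (T, L) similarity
    m2q = softmax(S, axis=1) @ Q     # motion attends to question, (T, d)
    q2m = softmax(S.T, axis=1) @ M   # question attends to motion, (L, d)
    return m2q.mean(axis=0) + q2m.mean(axis=0)   # (d,)

V = rng.standard_normal((T, d))   # appearance (+ object) features
M = rng.standard_normal((T, d))   # motion (+ object) features
Q = rng.standard_normal((L, d))   # question token features
W = rng.standard_normal((d, d))   # bilinear weight

# Heterogeneous interactions: each stream meets the question differently,
# and the two stream summaries are fused for answer prediction.
fused = bilinear_attention(V, Q, W) + symmetric_attention(M, Q)
print(fused.shape)
```

The point of the sketch is the asymmetry: the appearance stream uses a learned bilinear form against the question, while the motion stream uses symmetric cross-attention, so the two streams are not forced through the same interaction method.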
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Xu, F., Zhong, Z., Zhu, Y., Zhou, Y., Li, G. (2024). Appearance-Motion Dual-Stream Heterogeneous Network for VideoQA. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14556. Springer, Cham. https://doi.org/10.1007/978-3-031-53311-2_16
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-53310-5
Online ISBN: 978-3-031-53311-2