
Hierarchical Conditional Relation Networks for Multimodal Video Question Answering

Published in: International Journal of Computer Vision

Abstract

Video Question Answering (Video QA) challenges modelers on multiple fronts. Modeling video requires building not only spatio-temporal models for the dynamic visual channel but also multimodal structures for associated information channels such as subtitles or audio. Video QA adds at least two further layers of complexity: selecting relevant content for each channel in the context of the linguistic query, and composing spatio-temporal concepts and relations hidden in the data in response to the query. To address these requirements, we start from two insights: (a) content selection and relation construction can be jointly encapsulated in a conditional computational structure, and (b) video-length structures can be composed hierarchically. For (a), this paper introduces a general, reusable neural unit dubbed the Conditional Relation Network (CRN), which takes as input a set of tensorial objects and translates it into a new set of objects that encode relations among the inputs. The generic design of the CRN eases the typically complex model-building process of Video QA: models are assembled by simply stacking and rearranging blocks, with the flexibility to accommodate diverse input modalities and conditioning features across both visual and linguistic domains. We then realize insight (b) by introducing the Hierarchical Conditional Relation Network (HCRN) for Video QA. The HCRN aims to exploit intrinsic properties of a video's visual content and its accompanying channels, namely compositionality, hierarchy, and near-term and far-term relations. The HCRN is applied to Video QA in two forms: short-form, where answers are reasoned solely from the visual content of a video, and long-form, where an additional associated information channel, such as movie subtitles, is presented. Our rigorous evaluations show consistent improvements over state-of-the-art methods on well-studied benchmarks, including large-scale real-world datasets such as TGIF-QA and TVQA, demonstrating the strong capabilities of the CRN unit and the HCRN for complex domains such as Video QA. To the best of our knowledge, the HCRN is the first method that handles both long-form and short-form multimodal Video QA at the same time.
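To make the CRN idea concrete, the sketch below illustrates one plausible reading of the unit described in the abstract: a set of input tensors is mapped to a new set of tensors encoding k-ary relations among subsets of the inputs, modulated by a conditioning feature such as the encoded question. This is a minimal illustration in PyTorch, not the authors' implementation; the subset sizes, sampling strategy, and concatenation-plus-MLP fusion are assumptions made for brevity.

```python
import itertools
import random

import torch
import torch.nn as nn


class CRNUnit(nn.Module):
    """Illustrative Conditional Relation Network (CRN) unit.

    Maps a set of input tensors to a new set of tensors that encode k-ary
    relations among subsets of the inputs, conditioned on a feature such as
    the encoded question. Subset sizes, subset sampling, and the
    concat+MLP fusion are assumptions for illustration only.
    """

    def __init__(self, dim, subset_sizes=(2, 3), samples_per_size=2):
        super().__init__()
        self.subset_sizes = subset_sizes
        self.samples_per_size = samples_per_size
        self.g = nn.Sequential(nn.Linear(dim, dim), nn.ELU())      # encodes a subset relation
        self.p = nn.Sequential(nn.Linear(2 * dim, dim), nn.ELU())  # fuses relation with condition

    def forward(self, inputs, condition):
        """inputs: list of (batch, dim) tensors; condition: (batch, dim) tensor."""
        outputs = []
        n = len(inputs)
        for k in self.subset_sizes:
            if k > n:
                continue
            subsets = list(itertools.combinations(range(n), k))
            random.shuffle(subsets)
            for idx in subsets[: self.samples_per_size]:
                # Aggregate the chosen subset (mean pooling), then encode its relation.
                pooled = torch.stack([inputs[i] for i in idx]).mean(dim=0)
                rel = self.g(pooled)
                # Condition the relation on the query/conditioning feature.
                outputs.append(self.p(torch.cat([rel, condition], dim=-1)))
        return outputs  # a new set of relation-encoding tensors


# Toy usage: four clip-level features conditioned on a question embedding.
if __name__ == "__main__":
    dim, batch = 16, 2
    unit = CRNUnit(dim)
    clips = [torch.randn(batch, dim) for _ in range(4)]
    question = torch.randn(batch, dim)
    out = unit(clips, question)
    print(len(out), out[0].shape)  # 4 relation objects, each of shape (2, 16)
```

Stacking such units, so that the outputs of clip-level units become the inputs of video-level units, gives the hierarchical composition the abstract refers to as the HCRN.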




Notes

  1. https://github.com/kenshohara/video-classification-3d-cnn-pytorch.

  2. https://github.com/thaolmk54.

  3. https://github.com/huggingface/transformers.


Author information

Correspondence to Thao Minh Le.

Additional information

Communicated by Alexander Schwing.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Le, T.M., Le, V., Venkatesh, S. et al. Hierarchical Conditional Relation Networks for Multimodal Video Question Answering. Int J Comput Vis 129, 3027–3050 (2021). https://doi.org/10.1007/s11263-021-01514-3
