Abstract
Video-grounded dialogue requires a multimodal chatbot that can answer a sequence of human questions about video content, audio, captions, and dialogue history. Although existing deep learning approaches have demonstrated impressive performance, their success often rests on fusing multimodal information within limited datasets rather than on understanding the interactions and dependencies among the individual modalities: video-audio, caption, dialogue history, question, and answer. In this paper, we present CFM (Coarse and Fine Grained Masking), a novel approach built on the pre-trained GPT-2 model that aims to enhance cross-modal understanding among these modalities in video-grounded dialogue. CFM applies distinct coarse-grained and fine-grained masking strategies to differentiate the various inputs, including video-audio, caption, dialogue history, question, and answer. Furthermore, we extend GPT-2 with a multimodal feedforward network to strengthen its ability to integrate video-audio and text information. Extensive experiments on the Audio Visual Scene-Aware Dialog (AVSD) datasets show that our approach achieves promising performance, highlighting its effectiveness at capturing the dependencies and interactions among individual modalities.
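The abstract names a multimodal feedforward network without specifying its internals, so the sketch below is one plausible reading rather than the paper's implementation: a GPT-2 style block whose position-wise feedforward layer is split into modality-specific experts, with video-audio tokens and text tokens routed by a boolean mask. All names (MultimodalFeedForward, av_mask) and the default GPT-2 dimensions are our assumptions.

```python
import torch
import torch.nn as nn

class MultimodalFeedForward(nn.Module):
    """Hypothetical modality-specific feedforward layer (not the paper's code).

    Replaces the single position-wise FFN of a GPT-2 block with two
    experts: one for video-audio tokens, one for text tokens.
    """

    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.ffn_av = nn.Sequential(    # expert applied to video-audio tokens
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ffn_text = nn.Sequential(  # expert applied to text tokens
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, hidden: torch.Tensor, av_mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); av_mask: (batch, seq) bool,
        # True where the token belongs to the video-audio segment.
        # Both experts run on every token for simplicity; a real
        # implementation could gather/scatter tokens per modality instead.
        return torch.where(av_mask.unsqueeze(-1),
                           self.ffn_av(hidden),
                           self.ffn_text(hidden))

# Minimal usage check: 2 dialogues, 10 tokens each, first 4 tokens video-audio.
block = MultimodalFeedForward()
h = torch.randn(2, 10, 768)
m = torch.zeros(2, 10, dtype=torch.bool)
m[:, :4] = True
assert block(h, m).shape == (2, 10, 768)
```

The coarse-grained and fine-grained masking strategies would presumably operate over these same segment boundaries (e.g., as attention or loss masks per modality), but the abstract does not give enough detail to sketch them faithfully.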
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, F., Zhou, W., Sun, T., Lu, J., Yu, Z., Li, G. (2024). A Coarse and Fine Grained Masking Approach for Video-Grounded Dialogue. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14555. Springer, Cham. https://doi.org/10.1007/978-3-031-53308-2_30
Print ISBN: 978-3-031-53307-5
Online ISBN: 978-3-031-53308-2