Abstract
Video-grounded dialogue requires a multimodal chatbot that can answer a sequence of human questions about video content, audio, captions, and dialogue history. Although existing deep learning approaches have demonstrated impressive performance, their success often rests on fusing multimodal information within limited datasets rather than on understanding the interactions and dependencies among the individual modalities: video-audio, caption, dialogue history, question, and answer. In this paper, we present CFM (Coarse and Fine Grained Masking), a novel approach built on the pre-trained GPT-2 model that aims to enhance cross-modal understanding among these modalities in video-grounded dialogue. CFM applies distinct coarse-grained and fine-grained masking strategies to differentiate the various inputs, including video-audio, caption, dialogue history, question, and answer. Furthermore, we extend GPT-2 with a multimodal feedforward network to strengthen its ability to integrate video-audio and text information. Extensive experiments on the Audio Visual Scene-Aware Dialog (AVSD) datasets show that our approach achieves promising performance, highlighting its effectiveness at capturing the dependencies and interactions among individual modalities.
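The abstract names a multimodal feedforward network without specifying its internals, so the sketch below is one plausible reading rather than the paper's implementation: a GPT-2 style block whose position-wise feedforward layer is split into modality-specific experts, with video-audio tokens and text tokens routed by a boolean mask. All names (MultimodalFeedForward, av_mask) and the default GPT-2 dimensions are our assumptions.

```python
import torch
import torch.nn as nn

class MultimodalFeedForward(nn.Module):
    """Hypothetical modality-specific feedforward layer (not the paper's code).

    Replaces the single position-wise FFN of a GPT-2 block with two
    experts: one for video-audio tokens, one for text tokens.
    """

    def __init__(self, d_model: int = 768, d_ff: int = 3072):
        super().__init__()
        self.ffn_av = nn.Sequential(    # expert applied to video-audio tokens
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.ffn_text = nn.Sequential(  # expert applied to text tokens
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, hidden: torch.Tensor, av_mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq, d_model); av_mask: (batch, seq) bool,
        # True where the token belongs to the video-audio segment.
        # Both experts run on every token for simplicity; a real
        # implementation could gather/scatter tokens per modality instead.
        return torch.where(av_mask.unsqueeze(-1),
                           self.ffn_av(hidden),
                           self.ffn_text(hidden))

# Minimal usage check: 2 dialogues, 10 tokens each, first 4 tokens video-audio.
block = MultimodalFeedForward()
h = torch.randn(2, 10, 768)
m = torch.zeros(2, 10, dtype=torch.bool)
m[:, :4] = True
assert block(h, m).shape == (2, 10, 768)
```

The coarse-grained and fine-grained masking strategies would presumably operate over these same segment boundaries (e.g., as attention or loss masks per modality), but the abstract does not give enough detail to sketch them faithfully.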
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xu, F., Zhou, W., Sun, T., Lu, J., Yu, Z., Li, G. (2024). A Coarse and Fine Grained Masking Approach for Video-Grounded Dialogue. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14555. Springer, Cham. https://doi.org/10.1007/978-3-031-53308-2_30
Print ISBN: 978-3-031-53307-5
Online ISBN: 978-3-031-53308-2