A Coarse and Fine Grained Masking Approach for Video-Grounded Dialogue

  • Conference paper
  • MultiMedia Modeling (MMM 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14555)

Abstract

The task of Video-Grounded Dialogue involves developing a multimodal chatbot capable of answering sequential questions from humans about video content, audio, captions, and dialog history. Although existing approaches built on deep learning models have demonstrated impressive performance, their success often relies on fusing multimodal information within limited datasets rather than on understanding the interactions and dependencies among individual modalities such as video-audio, caption, dialog history, question, or answer. In this paper, we present CFM (Coarse and Fine Grained Masking), a novel approach built on the pre-trained model GPT2 that aims to enhance cross-modal understanding among individual modalities in video-grounded dialogue. CFM achieves this by employing distinct coarse-grained and fine-grained masking strategies to differentiate the various inputs, including video-audio, caption, dialog history, question, and answer. Furthermore, we improve the GPT2 model by incorporating a multimodal feedforward network, strengthening its ability to integrate video-audio with textual information. Through extensive experiments on the Audio Visual Scene-Aware Dialog (AVSD) datasets, our proposed approach demonstrates promising performance, highlighting the benefits of our method in effectively capturing dependencies and interactions among individual modalities.
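
The full paper sits behind the access wall, so the sketch below is a rough illustration only of the two ideas the abstract names: segment-level (coarse) versus token-level (fine) attention masking over a concatenated multimodal sequence, and a feedforward layer that treats video-audio and text tokens differently. Every name and design choice here (the segment ids, the rule that answer tokens attend everywhere, the link-dropping rate, routing tokens through separate feedforward branches) is our assumption, not the authors' implementation.

```python
# Hypothetical sketch of coarse- and fine-grained masking for a
# video-grounded dialogue model; NOT the authors' code.
import torch
import torch.nn as nn

# Segment ids for the five input types listed in the abstract.
VIDEO_AUDIO, CAPTION, HISTORY, QUESTION, ANSWER = range(5)

def coarse_mask(segments: torch.Tensor) -> torch.Tensor:
    """Coarse grain: mask whole segments. As one plausible choice, answer
    tokens may attend to every segment, while other tokens attend only
    within their own segment."""
    same = segments.unsqueeze(-1) == segments.unsqueeze(-2)  # (L, L)
    from_answer = (segments == ANSWER).unsqueeze(-1)         # answer rows
    return same | from_answer

def fine_mask(segments: torch.Tensor, drop_p: float = 0.15) -> torch.Tensor:
    """Fine grain: start from the coarse mask and randomly drop individual
    token-to-token links, pushing the model to use cross-modal context."""
    m = coarse_mask(segments)
    keep = torch.rand_like(m, dtype=torch.float) > drop_p
    return m & keep

class MultimodalFFN(nn.Module):
    """One reading of 'multimodal feedforward network': route video-audio
    tokens and text tokens through separate feedforward branches inside a
    transformer block."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.text_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                      nn.Linear(d_ff, d_model))
        self.av_ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(),
                                    nn.Linear(d_ff, d_model))

    def forward(self, x: torch.Tensor, segments: torch.Tensor) -> torch.Tensor:
        # Both branches run on all tokens; torch.where selects per token.
        is_av = (segments == VIDEO_AUDIO).unsqueeze(-1)      # (L, 1)
        return torch.where(is_av, self.av_ffn(x), self.text_ffn(x))

# Usage: a 10-token sequence of video-audio, caption, question, answer tokens.
segments = torch.tensor([VIDEO_AUDIO] * 4 + [CAPTION] * 2 +
                        [QUESTION] * 2 + [ANSWER] * 2)
x = torch.randn(10, 768)
ffn = MultimodalFFN(d_model=768, d_ff=3072)
out = ffn(x, segments)
attn_mask = fine_mask(segments)  # (10, 10) boolean attention mask
print(out.shape, attn_mask.shape)
```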



Author information

Corresponding author

Correspondence to Wang Zhou.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Xu, F., Zhou, W., Sun, T., Lu, J., Yu, Z., Li, G. (2024). A Coarse and Fine Grained Masking Approach for Video-Grounded Dialogue. In: Rudinac, S., et al. MultiMedia Modeling. MMM 2024. Lecture Notes in Computer Science, vol 14555. Springer, Cham. https://doi.org/10.1007/978-3-031-53308-2_30

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-53308-2_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-53307-5

  • Online ISBN: 978-3-031-53308-2

  • eBook Packages: Computer Science, Computer Science (R0)
