Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding

Authors

  • Peijun Bao, Nanyang Technological University
  • Yong Xia, Northwestern Polytechnical University
  • Wenhan Yang, Peng Cheng Laboratory
  • Boon Poh Ng, Nanyang Technological University
  • Meng Hwa Er, Nanyang Technological University
  • Alex C. Kot, Nanyang Technological University

DOI:

https://doi.org/10.1609/aaai.v38i2.27831

Keywords:

CV: Language and Vision, CV: Multi-modal Vision, CV: Video Understanding & Activity Analysis, NLP: Language Grounding & Multi-modal NLP

Abstract

This paper is the first to leverage multi-modal videos for weakly-supervised temporal video grounding. Because labeling video moments is labor-intensive and subjective, weakly-supervised approaches have gained increasing attention in recent years. However, these approaches can inherently compromise performance due to inadequate supervision. To tackle this challenge, we are the first to exploit complementary information extracted from multi-modal videos (e.g., RGB frames, optical flow), which naturally introduces richer supervision in the weakly-supervised setting. Our motivation is that by integrating different modalities of the videos, the model learns from synergistic supervision and thereby attains superior generalization capability. However, processing multiple modalities inevitably introduces additional computational overhead and may become inapplicable if a particular modality is inaccessible. To solve this issue, we adopt a novel route: we build a multi-modal distillation algorithm that capitalizes on multi-modal knowledge as supervision for model training, while still being able to work with only single-modal input during inference. As such, we can exploit the benefits brought by the complementary nature of multiple modalities without compromising applicability in practical scenarios. Specifically, we first propose a cross-modal mutual learning framework and train a sophisticated teacher model that learns collaboratively from the multi-modal videos. We then identify two types of knowledge from the teacher model, i.e., temporal boundaries and the semantic activation map, and devise a local-global distillation algorithm to transfer this knowledge to a student model with single-modal input at both local and global levels. Extensive experiments on large-scale datasets demonstrate that our method achieves state-of-the-art performance with and without multi-modal inputs.
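To make the local-global distillation idea in the abstract concrete, below is a minimal, hedged PyTorch sketch of the two kinds of knowledge transfer it describes: a local term that aligns the student's predicted temporal boundaries with the teacher's, and a global term that matches the student's query-conditioned semantic activation map over the whole video to the teacher's. The function names, tensor shapes, and specific loss choices (smooth L1 plus temporal IoU locally, temperature-softened KL globally) are illustrative assumptions and not the authors' released implementation.

```python
# Hedged sketch of local-global distillation losses (assumed formulation,
# not the paper's official code).
import torch
import torch.nn.functional as F


def local_boundary_distillation(student_bounds, teacher_bounds):
    """Local-level distillation: align predicted temporal boundaries.

    Both tensors have shape (batch, 2) with normalized (start, end) in [0, 1].
    """
    # Regression term on the boundaries themselves (one plausible choice).
    reg = F.smooth_l1_loss(student_bounds, teacher_bounds)
    # Temporal IoU term encourages the segments to overlap as a whole.
    inter = (torch.min(student_bounds[:, 1], teacher_bounds[:, 1])
             - torch.max(student_bounds[:, 0], teacher_bounds[:, 0])).clamp(min=0)
    union = (torch.max(student_bounds[:, 1], teacher_bounds[:, 1])
             - torch.min(student_bounds[:, 0], teacher_bounds[:, 0])).clamp(min=1e-6)
    iou = inter / union
    return reg + (1.0 - iou).mean()


def global_activation_distillation(student_logits, teacher_logits, tau=2.0):
    """Global-level distillation: match semantic activation maps over clips.

    Both tensors have shape (batch, num_clips); tau is a softening temperature.
    """
    t = F.softmax(teacher_logits / tau, dim=-1)          # teacher distribution
    s = F.log_softmax(student_logits / tau, dim=-1)      # student log-distribution
    return F.kl_div(s, t, reduction="batchmean") * (tau ** 2)


# Example combination: the multi-modal teacher supplies both targets, while the
# student sees only RGB features; lambda_g is a hypothetical weighting factor.
# loss = local_boundary_distillation(s_bounds, t_bounds) \
#        + lambda_g * global_activation_distillation(s_act, t_act)
```

In this sketch the teacher, trained on RGB and optical flow jointly, is frozen at distillation time; only the single-modal student receives gradients, which is what allows single-modal inference at test time.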

Published

2024-03-24

How to Cite

Bao, P., Xia, Y., Yang, W., Ng, B. P., Er, M. H., & Kot, A. C. (2024). Local-Global Multi-Modal Distillation for Weakly-Supervised Temporal Video Grounding. Proceedings of the AAAI Conference on Artificial Intelligence, 38(2), 738-746. https://doi.org/10.1609/aaai.v38i2.27831

Section

AAAI Technical Track on Computer Vision I