M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition

Wang, Mengmeng; Xing, Jiazheng; Jiang, Boyuan; Chen, Jun; Mei, Jianbiao; Zuo, Xingxing; Dai, Guang; Wang, Jingdong; Liu, Yong


arXiv:2401.11649 (cs)

[Submitted on 22 Jan 2024]


Abstract: Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with Parameter-Efficient Fine-Tuning (PEFT), has attracted substantial attention in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of the models' generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges, preserving both high supervised performance and robust transferability. Firstly, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Secondly, we design a multi-task decoder with a rich set of supervisory signals to satisfy the need for both strong supervised performance and generalization within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.
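
The abstract names two temporal operations for the visual adapter: a global enhancement over all frames and a local difference between neighboring frames. The toy sketch below only illustrates those two ideas on plain per-frame feature vectors. It is not the paper's TED-Adapter: the function names, the mean-over-time context, the zero difference for the first frame, and the mixing weights `alpha` and `beta` are all assumptions made for illustration, and a real adapter would use learned projections inside a transformer block.

```python
from typing import List

Frame = List[float]  # one feature vector per video frame (hypothetical layout)


def local_temporal_difference(frames: List[Frame]) -> List[Frame]:
    """Difference between each frame and its predecessor (zeros for the first frame)."""
    diffs = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        diffs.append([c - p for c, p in zip(cur, prev)])
    return diffs


def global_temporal_context(frames: List[Frame]) -> Frame:
    """Mean of every feature dimension over all frames (assumed form of 'global' context)."""
    num_frames = len(frames)
    return [sum(column) / num_frames for column in zip(*frames)]


def ted_like_update(frames: List[Frame], alpha: float = 0.5, beta: float = 0.5) -> List[Frame]:
    """Residual update: frame + alpha * global context + beta * local difference.

    alpha and beta are illustrative constants, not values from the paper.
    """
    context = global_temporal_context(frames)
    diffs = local_temporal_difference(frames)
    return [
        [x + alpha * g + beta * d for x, g, d in zip(frame, context, diff)]
        for frame, diff in zip(frames, diffs)
    ]


if __name__ == "__main__":
    clip = [[1.0, 2.0], [3.0, 6.0]]  # two frames, two feature dimensions
    print(ted_like_update(clip))
```

The residual form keeps the original frame features intact and adds the temporal signals on top, which is the usual reason adapters are attached this way to a frozen backbone.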

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2401.11649 [cs.CV]
	(or arXiv:2401.11649v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2401.11649
Journal reference:	AAAI2024

Submission history

From: Mengmeng Wang
[v1] Mon, 22 Jan 2024 02:03:31 UTC (6,071 KB)
