Abstract
In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Different from previous research that considers feature fusion only at one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing a convex combination of the features is preferable to modeling their correlations by computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can further be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016–2020) justify LAFF as a new baseline for text-to-video retrieval.
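To make the core idea concrete, here is a minimal PyTorch sketch of the convex-combination fusion the abstract describes. This is our illustration rather than the authors' released code; the class and layer names, the common embedding size d, and the input feature dimensionalities are all assumptions.

```python
import torch
import torch.nn as nn

class LAFF(nn.Module):
    """Fuse k diverse features by a learned convex combination (illustrative sketch)."""
    def __init__(self, feat_dims, d=512):
        super().__init__()
        # project each input feature into a common d-dimensional space
        self.projs = nn.ModuleList(nn.Linear(dim, d) for dim in feat_dims)
        # a single shared linear layer scores each projected feature
        self.attn = nn.Linear(d, 1, bias=False)

    def forward(self, feats):
        # feats[i]: (batch, feat_dims[i]); tanh keeps the full [-1, 1] range usable
        h = torch.stack([torch.tanh(p(f)) for p, f in zip(self.projs, feats)], dim=1)
        w = torch.softmax(self.attn(h).squeeze(-1), dim=1)  # (batch, k), rows sum to 1
        return (w.unsqueeze(-1) * h).sum(dim=1)             # convex combination: (batch, d)

# usage: fuse three off-the-shelf features of different dimensionalities
laff = LAFF(feat_dims=[512, 2048, 768])
fused = laff([torch.randn(4, 512), torch.randn(4, 2048), torch.randn(4, 768)])
print(fused.shape)  # torch.Size([4, 512])
```

Because the attention weights form a single softmax distribution over the input features, they are directly interpretable as per-feature importance, which is what enables the feature selection mentioned above.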
F. Hu, A. Chen and Z. Wang contributed equally.
Notes
1.
2. Other non-linear activations such as ReLU and sigmoid make every dimension non-negative, constraining the feature space to the first quadrant and consequently putting a lower bound of 0 on the cosine similarity. As such, the similarity is less discriminative than its tanh counterpart (see the numerical check after these notes).
3. We prefer the two-tower version to the single-tower version, as the latter has to compute video and text embeddings online, which does not scale to real applications (see the sketch after these notes).
4. Video features: clip-ft, x3d, ircsn and tf. Text features: clip-ft, bow, w2v and gru.
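The claim in note 2 is easy to verify numerically. The following snippet is our own example, not from the paper: it samples random embedding pairs and checks that ReLU-activated embeddings never reach negative cosine similarity, while tanh-activated ones do.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x, y = torch.randn(1000, 64), torch.randn(1000, 64)

cos_relu = F.cosine_similarity(torch.relu(x), torch.relu(y))
cos_tanh = F.cosine_similarity(torch.tanh(x), torch.tanh(y))

print(cos_relu.min().item() >= 0)  # True: non-negative vectors lie in the first quadrant
print(cos_tanh.min().item() < 0)   # True: tanh embeddings can still disagree in sign
```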
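For note 3, a sketch of the scalability argument, with hypothetical encoders video_tower and text_tower standing in for the trained networks: in a two-tower model the video side is embedded once offline, so answering a query costs only one text forward pass plus a matrix product.

```python
import torch

@torch.no_grad()
def index_videos(video_tower, videos):
    # offline: embed and L2-normalize the whole collection once
    emb = torch.stack([video_tower(v) for v in videos])
    return emb / emb.norm(dim=1, keepdim=True)

@torch.no_grad()
def search(text_tower, query, video_index, topk=10):
    # online: one text embedding, then cosine similarity against the cached index
    q = text_tower(query)
    q = q / q.norm()
    scores = video_index @ q          # (num_videos,)
    return scores.topk(topk)
```

A single-tower model, by contrast, must run a joint forward pass for every (query, video) pair, so query latency grows linearly with the collection size.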
Acknowledgments
This work was supported by NSFC (No. 62172420, No. 62072463), BJNSF (No. 4202033), and Public Computing Cloud, Renmin University of China.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Hu, F., Chen, A., Wang, Z., Zhou, F., Dong, J., Li, X. (2022). Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13674. Springer, Cham. https://doi.org/10.1007/978-3-031-19781-9_26
DOI: https://doi.org/10.1007/978-3-031-19781-9_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19780-2
Online ISBN: 978-3-031-19781-9
eBook Packages: Computer Science, Computer Science (R0)