Abstract
In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Different from previous research that considers feature fusion only at one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing a convex combination of the features is preferable to modeling their correlations by computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can further be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016–2020) justify LAFF as a new baseline for text-to-video retrieval.
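To make the core idea concrete, here is a minimal PyTorch sketch of the convex-combination fusion the abstract describes. This is our illustration rather than the authors' released code; the class and layer names, the common embedding size d, and the input feature dimensionalities are all assumptions.

```python
import torch
import torch.nn as nn

class LAFF(nn.Module):
    """Fuse k diverse features by a learned convex combination (illustrative sketch)."""
    def __init__(self, feat_dims, d=512):
        super().__init__()
        # project each input feature into a common d-dimensional space
        self.projs = nn.ModuleList(nn.Linear(dim, d) for dim in feat_dims)
        # a single shared linear layer scores each projected feature
        self.attn = nn.Linear(d, 1, bias=False)

    def forward(self, feats):
        # feats[i]: (batch, feat_dims[i]); tanh keeps the full [-1, 1] range usable
        h = torch.stack([torch.tanh(p(f)) for p, f in zip(self.projs, feats)], dim=1)
        w = torch.softmax(self.attn(h).squeeze(-1), dim=1)  # (batch, k), rows sum to 1
        return (w.unsqueeze(-1) * h).sum(dim=1)             # convex combination: (batch, d)

# usage: fuse three off-the-shelf features of different dimensionalities
laff = LAFF(feat_dims=[512, 2048, 768])
fused = laff([torch.randn(4, 512), torch.randn(4, 2048), torch.randn(4, 768)])
print(fused.shape)  # torch.Size([4, 512])
```

Because the attention weights form a single softmax distribution over the input features, they are directly interpretable as per-feature importance, which is what enables the feature selection mentioned above.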
F. Hu, A. Chen and Z. Wang contributed equally.
Notes
1.
2. Other non-linear activations such as ReLU and sigmoid make every dimension non-negative, constraining the feature space to the first quadrant and consequently putting a lower bound of 0 on the cosine similarity. As such, the similarity is less discriminative than its tanh counterpart (see the numerical check after these notes).
3. We prefer the two-tower version to the single-tower version, as the latter has to compute video and text embeddings online, which does not scale to real applications (see the sketch after these notes).
4. Video features: clip-ft, x3d, ircsn and tf. Text features: clip-ft, bow, w2v and gru.
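The claim in note 2 is easy to verify numerically. The following snippet is our own example, not from the paper: it samples random embedding pairs and checks that ReLU-activated embeddings never reach negative cosine similarity, while tanh-activated ones do.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x, y = torch.randn(1000, 64), torch.randn(1000, 64)

cos_relu = F.cosine_similarity(torch.relu(x), torch.relu(y))
cos_tanh = F.cosine_similarity(torch.tanh(x), torch.tanh(y))

print(cos_relu.min().item() >= 0)  # True: non-negative vectors lie in the first quadrant
print(cos_tanh.min().item() < 0)   # True: tanh embeddings can still disagree in sign
```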
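For note 3, a sketch of the scalability argument, with hypothetical encoders video_tower and text_tower standing in for the trained networks: in a two-tower model the video side is embedded once offline, so answering a query costs only one text forward pass plus a matrix product.

```python
import torch

@torch.no_grad()
def index_videos(video_tower, videos):
    # offline: embed and L2-normalize the whole collection once
    emb = torch.stack([video_tower(v) for v in videos])
    return emb / emb.norm(dim=1, keepdim=True)

@torch.no_grad()
def search(text_tower, query, video_index, topk=10):
    # online: one text embedding, then cosine similarity against the cached index
    q = text_tower(query)
    q = q / q.norm()
    scores = video_index @ q          # (num_videos,)
    return scores.topk(topk)
```

A single-tower model, by contrast, must run a joint forward pass for every (query, video) pair, so query latency grows linearly with the collection size.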
Acknowledgments
This work was supported by NSFC (No. 62172420, No. 62072463), BJNSF (No. 4202033), and Public Computing Cloud, Renmin University of China.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Hu, F., Chen, A., Wang, Z., Zhou, F., Dong, J., Li, X. (2022). Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13674. Springer, Cham. https://doi.org/10.1007/978-3-031-19781-9_26
DOI: https://doi.org/10.1007/978-3-031-19781-9_26
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19780-2
Online ISBN: 978-3-031-19781-9
eBook Packages: Computer Science, Computer Science (R0)