
Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13674)

Abstract

In this paper we revisit feature fusion, an old-fashioned topic, in the new context of text-to-video retrieval. Different from previous research that considers feature fusion only at one end, be it video or text, we aim for feature fusion at both ends within a unified framework. We hypothesize that optimizing a convex combination of the features is preferable to modeling their correlations with computationally heavy multi-head self-attention. We propose Lightweight Attentional Feature Fusion (LAFF). LAFF performs feature fusion at both early and late stages and at both the video and text ends, making it a powerful method for exploiting diverse (off-the-shelf) features. The interpretability of LAFF can further be used for feature selection. Extensive experiments on five public benchmark sets (MSR-VTT, MSVD, TGIF, VATEX and TRECVID AVS 2016–2020) justify LAFF as a new baseline for text-to-video retrieval.
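As a rough illustration of the idea, below is a minimal PyTorch sketch of what a LAFF-style fusion block could look like: each (possibly heterogeneous) feature is projected into a shared space, and the fused output is a convex combination whose weights come from a single lightweight attention layer. The class name, dimensions and layer choices are illustrative assumptions rather than the authors' implementation; the official code is linked in Note 1 below.

```python
import torch
import torch.nn as nn

class LAFFBlock(nn.Module):
    """Illustrative LAFF-style fusion: an attention-weighted convex combination
    of multiple projected features (hypothetical sketch, not the official code)."""

    def __init__(self, input_dims, embed_dim=512):
        super().__init__()
        # one linear projection per input feature, so heterogeneous features
        # (e.g. outputs of different video/text encoders) land in a common space
        self.projections = nn.ModuleList([nn.Linear(d, embed_dim) for d in input_dims])
        # lightweight attention: a single scalar score per projected feature
        self.attention = nn.Linear(embed_dim, 1, bias=False)

    def forward(self, features):
        # features: list of tensors, features[i] of shape (batch, input_dims[i])
        projected = torch.stack(
            [torch.tanh(proj(f)) for proj, f in zip(self.projections, features)], dim=1
        )                                                           # (batch, n, embed_dim)
        weights = torch.softmax(self.attention(projected), dim=1)   # sums to 1 over features
        return (weights * projected).sum(dim=1)                     # convex combination, (batch, embed_dim)


# usage: fuse three hypothetical video features of different dimensionalities
fuse = LAFFBlock(input_dims=[512, 2048, 768])
video_emb = fuse([torch.randn(4, 512), torch.randn(4, 2048), torch.randn(4, 768)])
```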

F. Hu, A. Chen and Z. Wang: Equal contribution.


Notes

  1.

    https://github.com/ruc-aimc-lab/laff.

  2.

    Other non-linear activations such as ReLU and sigmoid make each dimension non-negative, constraining the feature space to the non-negative orthant and consequently putting a lower bound of 0 on the cosine similarity. As such, the similarity is less discriminative than with the tanh counterpart (see the numerical check after these notes).

  3.

    We prefer the two-tower version to the single-tower version, as the latter has to compute both video and text embeddings online, which does not scale to real applications (see the retrieval sketch after these notes).

  4.

    Video features: clip-ft, x3d, ircsn and tf. Text features: clip-ft, bow, w2v and gru.
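The argument in Note 2 is easy to check numerically. The snippet below is a small, hypothetical PyTorch check (not from the paper): ReLU-activated features are non-negative, so their cosine similarity cannot fall below 0, whereas tanh-activated features retain the full [-1, 1] similarity range.

```python
import torch
from torch.nn.functional import cosine_similarity

torch.manual_seed(0)
x, y = torch.randn(1000, 256), torch.randn(1000, 256)

# ReLU maps every dimension to [0, inf), so dot products (and cosines) are never negative
relu_sims = cosine_similarity(torch.relu(x), torch.relu(y))
# tanh keeps the sign of each dimension, so dissimilar pairs can still get negative cosines
tanh_sims = cosine_similarity(torch.tanh(x), torch.tanh(y))

print(relu_sims.min().item())  # >= 0: similarities squeezed into [0, 1]
print(tanh_sims.min().item())  # < 0 for some pairs: a wider, more discriminative range
```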
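Likewise, the scalability point in Note 3 is that a two-tower model lets all video embeddings be computed and indexed offline, so serving a query needs only one text-encoder pass plus a matrix multiplication. The sketch below is an illustrative assumption about such an interface, not the authors' implementation.

```python
import torch

def retrieve_top_k(query, text_encoder, video_embeddings, k=10):
    """Two-tower retrieval: video_embeddings (num_videos, dim) were encoded and
    L2-normalized offline; only the text tower runs at query time."""
    q = text_encoder(query)         # (dim,) text embedding computed online
    q = q / q.norm()                # normalize so the dot product equals cosine similarity
    scores = video_embeddings @ q   # (num_videos,) similarities in one matrix-vector product
    return scores.topk(k).indices   # indices of the k best-matching videos
```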


Acknowledgments

This work was supported by NSFC (No. 62172420, No. 62072463), BJNSF (No. 4202033), and Public Computing Cloud, Renmin University of China.

Author information


Corresponding author

Correspondence to Xirong Li.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 212 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Hu, F., Chen, A., Wang, Z., Zhou, F., Dong, J., Li, X. (2022). Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13674. Springer, Cham. https://doi.org/10.1007/978-3-031-19781-9_26


  • DOI: https://doi.org/10.1007/978-3-031-19781-9_26


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19780-2

  • Online ISBN: 978-3-031-19781-9

  • eBook Packages: Computer Science, Computer Science (R0)
