ABSTRACT
With the explosive growth of video and text data on the web, text-video retrieval has become a vital task for online video platforms. Recently, text-video retrieval methods based on pre-trained models have attracted considerable attention. However, existing methods cannot effectively capture the fine-grained information in videos, and they typically suffer from the hubness problem, in which a collection of similar videos is retrieved by a large number of different queries. In this paper, we propose Match4Match, a new text-video retrieval method based on CLIP (Contrastive Language-Image Pretraining) and graph optimization theory. To balance computational efficiency and model accuracy, Match4Match seamlessly supports three inference modes for different application scenarios. In fast vector retrieval mode, we embed texts and videos in the same space and employ a vector retrieval engine to obtain the top-K videos. In fine-grained alignment mode, our method fully exploits the pre-trained knowledge of the CLIP model to align words with their corresponding video frames, and uses this fine-grained information to compute text-video similarity more accurately. In flow-style matching mode, to alleviate the detrimental impact of the hubness problem, we model retrieval as a combinatorial optimization problem and solve it with a minimum-cost maximum-flow algorithm. To demonstrate the effectiveness of our method, we conduct experiments on five public text-video datasets. Our method outperforms state-of-the-art methods in overall performance. We also evaluate the computational efficiency of Match4Match: thanks to its three flexible inference modes, it can respond to a large number of query requests with low latency, or achieve high recall with acceptable time consumption.
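The abstract does not detail how the flow-style matching mode is formulated, but the general idea of casting retrieval as min-cost max-flow on a bipartite graph can be sketched as follows. This is a minimal illustration, not the paper's implementation: queries and videos become nodes, each query-video edge carries a cost equal to the negated (integer-scaled) similarity, and a per-video capacity on the video-to-sink edges caps how many queries any single "hub" video can absorb. The class name, the toy similarity matrix, and the capacity values are all assumptions for the example.

```python
from collections import deque

class MinCostMaxFlow:
    """SPFA-based min-cost max-flow; edges stored as [to, cap, cost, rev_index]."""

    def __init__(self, n):
        self.n = n
        self.graph = [[] for _ in range(n)]

    def add_edge(self, u, v, cap, cost):
        # Forward edge plus zero-capacity reverse edge for the residual graph.
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s, t):
        total_flow, total_cost = 0, 0
        while True:
            # Shortest augmenting path by cost (handles negative edge costs).
            dist = [float("inf")] * self.n
            in_queue = [False] * self.n
            prev = [None] * self.n  # (node, edge index) used to reach each node
            dist[s] = 0
            q = deque([s])
            while q:
                u = q.popleft()
                in_queue[u] = False
                for i, (v, cap, cost, _) in enumerate(self.graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, i)
                        if not in_queue[v]:
                            in_queue[v] = True
                            q.append(v)
            if prev[t] is None:
                return total_flow, total_cost
            # Bottleneck capacity along the augmenting path.
            f, v = float("inf"), t
            while v != s:
                u, i = prev[v]
                f = min(f, self.graph[u][i][1])
                v = u
            # Push flow and update residual capacities.
            v = t
            while v != s:
                u, i = prev[v]
                self.graph[u][i][1] -= f
                self.graph[v][self.graph[u][i][3]][1] += f
                v = u
            total_flow += f
            total_cost += f * dist[t]

# Toy instance: 3 queries, 2 videos; sim[q][v] is a hypothetical similarity.
sim = [[0.9, 0.2], [0.8, 0.7], [0.85, 0.6]]
num_q, num_v = 3, 2
s, t = 0, 1 + num_q + num_v          # node layout: source, queries, videos, sink
mcmf = MinCostMaxFlow(t + 1)
for qi in range(num_q):
    mcmf.add_edge(s, 1 + qi, 1, 0)   # each query is matched once
    for vi in range(num_v):
        # Maximize similarity == minimize negated, integer-scaled similarity.
        mcmf.add_edge(1 + qi, 1 + num_q + vi, 1, -round(sim[qi][vi] * 100))
for vi in range(num_v):
    mcmf.add_edge(1 + num_q + vi, t, 2, 0)  # capacity 2 limits hub videos

flow, cost = mcmf.solve(s, t)

# Recover the assignment: a saturated query->video edge (cap 0) was used.
matches = {}
for qi in range(num_q):
    for v, cap, _, _ in mcmf.graph[1 + qi]:
        if 1 + num_q <= v < 1 + num_q + num_v and cap == 0:
            matches[qi] = v - 1 - num_q
```

All three queries prefer video 0 in isolation, but the capacity constraint diverts the query that loses the least similarity (query 1, whose second choice scores 0.7) to video 1, which is exactly the kind of global trade-off a greedy nearest-neighbor search cannot make.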
Match4Match: Enhancing Text-Video Retrieval by Maximum Flow with Minimum Cost