Are All Combinations Equal? Combining Textual and Visual Features with Multiple Space Learning for Text-Based Video Retrieval

Galanopoulos, Damianos; Mezaris, Vasileios

arXiv:2211.11351 (cs)

[Submitted on 21 Nov 2022]

Abstract: In this paper we tackle the cross-modal video retrieval problem and, more specifically, we focus on text-to-video retrieval. We investigate how to optimally combine multiple diverse textual and visual features into feature pairs that generate multiple joint feature spaces, which encode text-video pairs into comparable representations. To learn these representations, our proposed network architecture is trained following a multiple space learning procedure. Moreover, at the retrieval stage, we introduce additional softmax operations for revising the inferred query-video similarities. Extensive experiments in several setups based on three large-scale datasets (IACC.3, V3C1, and MSR-VTT) lead to conclusions on how to best combine text-visual features and document the performance of the proposed network. Source code is made publicly available at: this https URL
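
The abstract does not spell out how the joint spaces are combined or what form the softmax revision takes, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that each (textual feature, visual feature) pair has its own pair of linear projections into a joint space, that the overall query-video similarity is the sum of cosine similarities over all such spaces, and that the revision step weights each similarity by a softmax taken across queries (a dual-softmax-style operation). All function and parameter names are hypothetical.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def project(matrix, vec):
    """Apply a linear projection given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def multi_space_similarity(text_feats, video_feats, text_proj, video_proj):
    """Sum cosine similarities over one joint space per (text, visual) feature pair.

    text_proj[(i, j)] and video_proj[(i, j)] are the (assumed linear) projections
    of textual feature i and visual feature j into joint space (i, j).
    """
    total = 0.0
    for i, j in product(range(len(text_feats)), range(len(video_feats))):
        t = project(text_proj[(i, j)], text_feats[i])
        v = project(video_proj[(i, j)], video_feats[j])
        total += cosine(t, v)
    return total


def softmax(values, temperature=1.0):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp((x - peak) / temperature) for x in values]
    norm = sum(exps)
    return [e / norm for e in exps]


def revise_with_softmax(sim, temperature=1.0):
    """Reweight sim[q][v] by a softmax across queries for each video.

    Assumption: this dual-softmax-style rule stands in for the paper's
    "additional softmax operations"; the actual formulation may differ.
    """
    n_queries, n_videos = len(sim), len(sim[0])
    revised = [[0.0] * n_videos for _ in range(n_queries)]
    for v in range(n_videos):
        weights = softmax([sim[q][v] for q in range(n_queries)], temperature)
        for q in range(n_queries):
            revised[q][v] = sim[q][v] * weights[q]
    return revised
```

With two textual and two visual features there are four joint spaces, and a video is ranked for a query by its summed (and then revised) similarity. Summation is only one plausible way of fusing the spaces; the paper's experiments are about which feature pairings are worth including in the first place.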

Comments: Accepted for publication; to be included in Proc. ECCV Workshops 2022. The version posted here is the "submitted manuscript" version
Subjects: Computer Vision and Pattern Recognition (cs.CV)
Cite as: arXiv:2211.11351 [cs.CV] (or arXiv:2211.11351v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2211.11351

Submission history

From: Vasileios Mezaris
[v1] Mon, 21 Nov 2022 11:08:13 UTC (983 KB)
