FlexCap: Generating Rich, Localized, and Flexible Captions in Images

Dwibedi, Debidatta; Jain, Vidhi; Tompson, Jonathan; Zisserman, Andrew; Aytar, Yusuf

arXiv:2403.12026 (cs)

[Submitted on 18 Mar 2024]

Abstract: We introduce a versatile $\textit{flexible-captioning}$ vision-language model (VLM) capable of generating region-specific descriptions of varying lengths. The model, FlexCap, is trained to produce length-conditioned captions for input bounding boxes, and this allows control over the information density of its output, with descriptions ranging from concise object labels to detailed captions. To achieve this, we create large-scale training datasets of image region descriptions of varying length, starting from captioned images. This flexible-captioning capability has several valuable applications.
First, FlexCap demonstrates superior performance in dense captioning tasks on the Visual Genome dataset. Second, a visual question answering (VQA) system can be built by employing FlexCap to generate localized descriptions as inputs to a large language model. The resulting system achieves state-of-the-art zero-shot performance on a number of VQA datasets. We also demonstrate that a $\textit{localize-then-describe}$ approach with FlexCap can be better at open-ended object detection than a $\textit{describe-then-localize}$ approach with other VLMs. We highlight a novel characteristic of FlexCap: its ability to extract diverse visual information through prefix conditioning. Finally, we qualitatively demonstrate FlexCap's broad applicability in tasks such as image labeling, object attribute recognition, and visual dialog. Project webpage: this https URL.

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2403.12026 [cs.CV]
	(or arXiv:2403.12026v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2403.12026

Submission history

From: Debidatta Dwibedi
[v1] Mon, 18 Mar 2024 17:57:02 UTC (4,628 KB)
