Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding

Yuan, Zhihao; Ren, Jinke; Feng, Chun-Mei; Zhao, Hengshuang; Cui, Shuguang; Li, Zhen

arXiv:2311.15383 (cs)

[Submitted on 26 Nov 2023 (v1), last revised 23 Mar 2024 (this version, v2)]

Abstract: 3D Visual Grounding (3DVG) aims to localize 3D objects based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary, which can be restrictive. To address this issue, we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG, leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method, engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this, we design a visual program that consists of three types of modules, i.e., view-independent, view-dependent, and functional modules. These modules, specifically tailored for 3D scenarios, work collaboratively to perform complex reasoning and inference. Furthermore, we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines, marking a significant stride towards effective 3DVG.
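
The three module types named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the `Box` type and the functions `loc`, `closest`, `right_of`, and `lowest` are hypothetical names chosen for illustration, and the geometry (centre-point distance, a 2D cross-product test for "right of") is an assumption about how such modules might plausibly behave.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical detected object: a label and a centre point (x, y, z), z up."""
    label: str
    center: tuple


def loc(scene, label):
    """Hypothetical localization step: all boxes carrying the given label."""
    return [b for b in scene if b.label == label]


def closest(targets, anchors):
    """View-independent relation: the target nearest to any anchor."""
    return min(
        targets,
        key=lambda t: min(math.dist(t.center, a.center) for a in anchors),
    )


def right_of(targets, anchor, viewer_xy):
    """View-dependent relation: targets on the viewer's right when facing the anchor.

    Uses the sign of the 2D cross product between the viewing direction
    (viewer -> anchor) and the direction viewer -> target.
    """
    fx, fy = anchor.center[0] - viewer_xy[0], anchor.center[1] - viewer_xy[1]
    result = []
    for t in targets:
        tx, ty = t.center[0] - viewer_xy[0], t.center[1] - viewer_xy[1]
        if fx * ty - fy * tx < 0:  # negative cross product: target lies to the right
            result.append(t)
    return result


def lowest(targets):
    """Functional module: select the target with the smallest height (z)."""
    return min(targets, key=lambda t: t.center[2])


# Toy scene: two chairs and a table (coordinates are made up).
chair_a = Box("chair", (1.0, 0.0, 0.4))
chair_b = Box("chair", (-3.0, 0.0, 0.0))
table = Box("table", (0.0, 0.0, 0.0))
scene = [chair_a, chair_b, table]

# "the chair closest to the table"
nearest_chair = closest(loc(scene, "chair"), loc(scene, "table"))
# "the chair to the right of the table", seen from a viewer standing at (0, -5)
right_chairs = right_of(loc(scene, "chair"), table, viewer_xy=(0.0, -5.0))
# "the lowest chair"
lowest_chair = lowest(loc(scene, "chair"))
```

In this sketch a textual query corresponds to a short composition of module calls, which is the general idea of a visual program; how the paper's LLM generates such programs and how its modules are actually defined is not specified in the abstract.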

Comments:	Accepted by CVPR 2024
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2311.15383 [cs.CV]
	(or arXiv:2311.15383v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.15383

Submission history

From: Jinke Ren
[v1] Sun, 26 Nov 2023 19:01:14 UTC (4,462 KB)
[v2] Sat, 23 Mar 2024 05:21:14 UTC (4,462 KB)
