Contextual Object Detection with Multimodal Large Language Models

Zang, Yuhang; Li, Wei; Han, Jun; Zhou, Kaiyang; Loy, Chen Change

arXiv:2305.18279 (cs)

[Submitted on 29 May 2023]

Abstract: Recent Multimodal Large Language Models (MLLMs) are remarkable in vision-language tasks, such as image captioning and question answering, but lack the essential perception ability, i.e., object detection. In this work, we address this limitation by introducing a novel research problem of contextual object detection -- understanding visible objects within different human-AI interactive contexts. Three representative scenarios are investigated, including the language cloze test, visual captioning, and question answering. Moreover, we present ContextDET, a unified multimodal model that is capable of end-to-end differentiable modeling of visual-language contexts, so as to locate, identify, and associate visual objects with language inputs for human-AI interaction. Our ContextDET involves three key submodels: (i) a visual encoder for extracting visual representations, (ii) a pre-trained LLM for multimodal context decoding, and (iii) a visual decoder for predicting bounding boxes given contextual object words. The new generate-then-detect framework enables us to detect object words within human vocabulary. Extensive experiments show the advantages of ContextDET on our proposed CODE benchmark, open-vocabulary detection, and referring image segmentation. Github: this https URL.
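
The three-stage, generate-then-detect data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every name below (Region, visual_encoder, context_decoder, visual_decoder, generate_then_detect, the [MASK] token, and the box format) is a hypothetical stand-in, and each learned component is replaced by a trivial lookup so that only the ordering of the stages is shown, using the language cloze scenario as the example.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Region:
    """Toy stand-in for one visual token: an object label and its location."""
    label: str
    box: Box


def visual_encoder(image: List[Region]) -> List[Region]:
    # Stage (i): a real encoder would produce feature vectors.
    # This toy version simply passes the regions through as "visual tokens".
    return list(image)


def context_decoder(visual_tokens: List[Region], prompt: str) -> List[str]:
    # Stage (ii): stand-in for the pre-trained LLM. It fills each [MASK]
    # in a cloze prompt with a label taken from the visual tokens, in order.
    n_masks = prompt.count("[MASK]")
    return [token.label for token in visual_tokens[:n_masks]]


def visual_decoder(words: List[str], visual_tokens: List[Region]) -> Dict[str, Box]:
    # Stage (iii): return a box for each generated object word.
    # A toy lookup by label replaces the learned box predictor.
    lookup = {token.label: token.box for token in visual_tokens}
    return {word: lookup[word] for word in words if word in lookup}


def generate_then_detect(image: List[Region], prompt: str) -> Dict[str, Box]:
    tokens = visual_encoder(image)
    words = context_decoder(tokens, prompt)  # generate object words first
    return visual_decoder(words, tokens)     # then detect a box per word


# Toy "image" with two annotated regions (made-up values).
image = [
    Region("dog", (10, 20, 110, 220)),
    Region("frisbee", (150, 40, 200, 90)),
]
result = generate_then_detect(image, "A [MASK] catches a [MASK].")
print(result)
```

The point of the ordering is that the detector is conditioned on words produced by the language stage rather than on a fixed label set, which is how the abstract motivates detecting object words within human vocabulary.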

Comments:	Github: this https URL, Project Page: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2305.18279 [cs.CV]
	(or arXiv:2305.18279v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.18279

Submission history

From: Yuhang Zang [view email]
[v1] Mon, 29 May 2023 17:50:33 UTC (12,925 KB)
