Compositional Chain-of-Thought Prompting for Large Multimodal Models

Mitra, Chancharik; Huang, Brandon; Darrell, Trevor; Herzig, Roei

arXiv:2311.17076 (cs)

[Submitted on 27 Nov 2023 (v1), last revised 1 Apr 2024 (this version, v3)]

Abstract: The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However, recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning, such as attributes and relationships between objects. One solution is to utilize scene graphs (SGs), a formalization of objects and their relations and attributes that has been extensively used as a bridge between the visual and textual domains. Yet, scene graph data requires scene graph annotations, which are expensive to collect and thus not easily scalable. Moreover, fine-tuning an LMM based on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this, inspired by chain-of-thought methods, we propose Compositional Chain-of-Thought (CCoT), a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations to extract compositional knowledge from an LMM. Specifically, we first generate an SG using the LMM, and then use that SG in the prompt to produce a response. Through extensive experiments, we find that the proposed CCoT approach not only improves LMM performance on several VL compositional benchmarks but also improves the performance of several popular LMMs on general multimodal benchmarks, without the need for fine-tuning or annotated ground-truth SGs. Code: this https URL
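
The two-step procedure described in the abstract (first generate a scene graph with the LMM, then place that scene graph in the prompt for the final response) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `lmm` callable, the function names, and the wording of both prompt templates are hypothetical, and the JSON scene-graph layout is an assumed convenience rather than something the abstract specifies.

```python
import json
from typing import Any, Callable

# Hypothetical interface: any function mapping (image, prompt) to generated text.
LMM = Callable[[Any, str], str]

# Illustrative wording only; NOT the prompt text used by the authors.
SG_PROMPT = (
    "For the provided image and the question below, generate a scene graph "
    "in JSON with the keys 'objects', 'attributes', and 'relationships' "
    "that are relevant to answering the question.\n"
    "Question: {question}\nScene graph:"
)

ANSWER_PROMPT = (
    "Scene graph:\n{scene_graph}\n\n"
    "Use the image and the scene graph as context to answer the question.\n"
    "Question: {question}\nAnswer:"
)


def generate_scene_graph(lmm: LMM, image: Any, question: str) -> str:
    """Step 1: ask the LMM itself for a scene graph (no annotated SGs needed)."""
    raw = lmm(image, SG_PROMPT.format(question=question))
    try:
        # Normalize well-formed JSON so the second prompt is consistently laid out.
        return json.dumps(json.loads(raw), indent=2)
    except json.JSONDecodeError:
        # Models do not always emit valid JSON; fall back to the raw text.
        return raw.strip()


def ccot_answer(lmm: LMM, image: Any, question: str) -> str:
    """Step 2: include the generated scene graph in the prompt for the answer."""
    scene_graph = generate_scene_graph(lmm, image, question)
    prompt = ANSWER_PROMPT.format(scene_graph=scene_graph, question=question)
    return lmm(image, prompt).strip()
```

Both calls go to the same frozen model, so the sketch involves no fine-tuning; any wrapper around a real LMM that accepts an image and a text prompt could be passed in as `lmm`.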

Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI); Computation and Language (cs.CL); Machine Learning (cs.LG)
Cite as:	arXiv:2311.17076 [cs.CV]
	(or arXiv:2311.17076v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2311.17076

Submission history

From: Chancharik Mitra
[v1] Mon, 27 Nov 2023 22:23:27 UTC (2,910 KB)
[v2] Thu, 28 Mar 2024 23:02:27 UTC (6,252 KB)
[v3] Mon, 1 Apr 2024 03:17:09 UTC (6,252 KB)
