CogCoM: Train Large Vision-Language Models Diving into Details through Chain of Manipulations

Qi, Ji; Ding, Ming; Wang, Weihan; Bai, Yushi; Lv, Qingsong; Hong, Wenyi; Xu, Bin; Hou, Lei; Li, Juanzi; Dong, Yuxiao; Tang, Jie

arXiv:2402.04236 (cs)

[Submitted on 6 Feb 2024]

Abstract: Vision-Language Models (VLMs) have demonstrated their widespread viability thanks to extensive training in aligning visual instructions to answers. However, this conclusive alignment leads models to ignore critical visual reasoning, which in turn results in failures on meticulous visual problems and in unfaithful responses. In this paper, we propose Chain of Manipulations, a mechanism that enables VLMs to solve problems with a series of manipulations, where each manipulation refers to an operation on the visual input, either from intrinsic abilities (e.g., grounding) acquired through prior training or from imitating human-like behaviors (e.g., zoom in). This mechanism encourages VLMs to generate faithful responses with evidential visual reasoning, and permits users to trace error causes in the interpretable paths. We thus train CogCoM, a general 17B VLM with a memory-based compatible architecture endowed with this reasoning mechanism. Experiments show that our model achieves state-of-the-art performance across 8 benchmarks from 3 categories, and that a limited number of training steps with the data swiftly yields competitive performance. The code and data are publicly available at this https URL.
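
As a rough illustration of the idea described in the abstract, the sketch below chains two toy manipulations on a visual input and records each step so the path can be inspected afterwards. It is not the CogCoM implementation: the names `Chain`, `ground_brightest`, and `crop_and_zoom` are hypothetical, the "image" is a plain grid of integers, and the grounding step is a trivial stand-in (it boxes the brightest pixel) for an ability that a trained VLM would supply.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), end-exclusive


def ground_brightest(image: Image) -> Box:
    """Hypothetical stand-in for grounding: a 1x1 box around the brightest pixel."""
    _, x, y = max((v, x, y) for y, row in enumerate(image) for x, v in enumerate(row))
    return (x, y, x + 1, y + 1)


def crop_and_zoom(image: Image, box: Box, factor: int) -> Image:
    """Crop `box` and enlarge it by an integer factor (nearest neighbour)."""
    x0, y0, x1, y1 = box
    cropped = [row[x0:x1] for row in image[y0:y1]]
    return [
        [v for v in row for _ in range(factor)]   # repeat each pixel horizontally
        for row in cropped
        for _ in range(factor)                    # repeat each row vertically
    ]


@dataclass
class Chain:
    """Applies manipulations in sequence and keeps a readable trace of them."""
    image: Image
    steps: List[str] = field(default_factory=list)

    def grounding(self) -> Box:
        box = ground_brightest(self.image)
        self.steps.append(f"grounding -> {box}")
        return box

    def zoom_in(self, box: Box, factor: int) -> "Chain":
        self.image = crop_and_zoom(self.image, box, factor)
        self.steps.append(f"crop_and_zoom({box}, x{factor})")
        return self


if __name__ == "__main__":
    toy = [
        [0, 0, 0],
        [0, 9, 0],
        [0, 0, 0],
    ]
    chain = Chain(toy)
    region = chain.grounding()      # locate the region of interest
    chain.zoom_in(region, factor=2) # enlarge it for closer inspection
    for step in chain.steps:        # the recorded path can be read back step by step
        print(step)
```

The recorded `steps` list is what makes the path traceable in this sketch: if the final answer is wrong, a reader can check whether the grounding box or the zoom step was at fault.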

Comments:	17 pages, 7 figures
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Computation and Language (cs.CL)
Cite as:	arXiv:2402.04236 [cs.CV]
	(or arXiv:2402.04236v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2402.04236

Submission history

From: Ji Qi
[v1] Tue, 6 Feb 2024 18:43:48 UTC (20,764 KB)
