VIGC: Visual Instruction Generation and Correction

Wang, Bin; Wu, Fan; Han, Xiao; Peng, Jiahui; Zhong, Huaping; Zhang, Pan; Dong, Xiaoyi; Li, Weijia; Li, Wei; Wang, Jiaqi; He, Conghui


arXiv:2308.12714 (cs)

[Submitted on 24 Aug 2023 (v1), last revised 4 Feb 2024 (this version, v3)]


Abstract: The integration of visual encoders and large language models (LLMs) has driven recent progress in multimodal large language models (MLLMs). However, high-quality instruction-tuning data for vision-language tasks remains scarce. The leading paradigm, exemplified by LLaVA, relies on language-only GPT-4 to generate data. This requires pre-annotated image captions and detection bounding boxes, and it struggles to capture image details. A practical alternative is to use available MLLMs to generate instruction data for vision-language tasks. However, currently accessible MLLMs are not as powerful as their LLM counterparts: they tend to produce inadequate responses and generate false information. To address this, the paper proposes the Visual Instruction Generation and Correction (VIGC) framework, which enables multimodal large language models to generate instruction-tuning data and progressively improve its quality on the fly. Specifically, Visual Instruction Generation (VIG) guides the vision-language model to generate diverse instruction-tuning data. To ensure generation quality, Visual Instruction Correction (VIC) adopts an iterative update mechanism to correct inaccuracies in the data produced by VIG, reducing the risk of hallucination. Using the diverse, high-quality data generated by VIGC, we finetune mainstream models and validate data quality through various evaluations. Experimental results demonstrate that VIGC not only compensates for the shortcomings of language-only data generation methods but also improves benchmark performance. The models, datasets, and code are available at this https URL.
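
The generate-then-correct idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify any API, so the `generate` and `correct` callables, the `max_rounds` limit, the stop-when-unchanged rule, and the toy stand-in functions are all assumptions made for illustration.

```python
from typing import Callable, Tuple

# Hypothetical interfaces (assumptions; not taken from the paper or its code).
Generator = Callable[[str], Tuple[str, str]]   # image -> (question, draft answer)
Corrector = Callable[[str, str, str], str]     # (image, question, answer) -> revised answer


def generate_and_correct(image: str, generate: Generator, correct: Corrector,
                         max_rounds: int = 3) -> Tuple[str, str]:
    """Draft an instruction pair, then revise the answer until it stops changing."""
    question, answer = generate(image)          # generation step (VIG-like)
    for _ in range(max_rounds):                 # iterative correction (VIC-like)
        revised = correct(image, question, answer)
        if revised == answer:                   # no further corrections were made
            break
        answer = revised
    return question, answer


# Toy stand-ins for a real vision-language model, for illustration only.
def toy_generate(image: str) -> Tuple[str, str]:
    return "What is in the image?", "A cat and a dog on a sofa."


def toy_correct(image: str, question: str, answer: str) -> str:
    # Pretend the image shows only a cat, so the unsupported "dog" is dropped.
    return answer.replace(" and a dog", "")
```

Calling `generate_and_correct("img.jpg", toy_generate, toy_correct)` with these stand-ins drafts an answer that mentions a dog, removes it in the first correction round, and stops in the second round because the answer no longer changes.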

Comments:	Accepted by AAAI 2024, Project Website: this https URL, Code and Pretrained Model: this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2308.12714 [cs.CV]
	(or arXiv:2308.12714v3 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2308.12714

Submission history

From: Bin Wang
[v1] Thu, 24 Aug 2023 11:21:05 UTC (7,775 KB)
[v2] Mon, 11 Sep 2023 08:29:48 UTC (9,652 KB)
[v3] Sun, 4 Feb 2024 06:46:03 UTC (12,387 KB)
