Can MLLMs Perform Text-to-Image In-Context Learning?

Zeng, Yuchen; Kang, Wonjun; Chen, Yicong; Koo, Hyung Il; Lee, Kangwook

arXiv:2402.01293 (cs)

[Submitted on 2 Feb 2024 (v1), last revised 15 Apr 2024 (this version, v2)]

Abstract: The evolution from Large Language Models (LLMs) to Multimodal Large Language Models (MLLMs) has spurred research into extending In-Context Learning (ICL) to its multimodal counterpart. Existing studies have concentrated primarily on image-to-text ICL. However, Text-to-Image ICL (T2I-ICL), with its unique characteristics and potential applications, remains underexplored. To address this gap, we formally define the task of T2I-ICL and present CoBSAT, the first T2I-ICL benchmark dataset, encompassing ten tasks. Using our dataset to benchmark six state-of-the-art MLLMs, we uncover considerable difficulties that MLLMs encounter in solving T2I-ICL. We identify the primary challenges as the inherent complexity of multimodality and image generation, and show that strategies such as fine-tuning and Chain-of-Thought prompting help mitigate these difficulties, leading to notable improvements in performance. Our code and dataset are available at this https URL.

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL)
Cite as:	arXiv:2402.01293 [cs.LG]
	(or arXiv:2402.01293v2 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2402.01293

Submission history

From: Wonjun Kang
[v1] Fri, 2 Feb 2024 10:30:05 UTC (44,945 KB)
[v2] Mon, 15 Apr 2024 21:30:10 UTC (36,921 KB)
