MedXChat: Bridging CXR Modalities with a Unified Multimodal Large Model

Yang, Ling; Wang, Zhanyu; Zhou, Luping

Abstract:Despite the success of Large Language Models (LLMs) in general image tasks, a gap persists in the medical field for a multimodal large model adept at handling the nuanced diversity of medical images. Addressing this, we propose MedXChat, a unified multimodal large model designed for seamless interactions between medical assistants and users. MedXChat encompasses three key functionalities: CXR(Chest X-ray)-to-Report generation, CXR-based visual question-answering (VQA), and Text-to-CXR synthesis. Our contributions are as follows. Firstly, our model showcases exceptional cross-task adaptability, displaying adeptness across all three defined tasks and outperforming the benchmark models on the MIMIC dataset in medical multimodal applications. Secondly, we introduce an innovative Text-to-CXR synthesis approach that utilizes instruction-following capabilities within the Stable Diffusion (SD) architecture. This technique integrates smoothly with the existing model framework, requiring no extra parameters, thereby maintaining the SD's generative strength while also bestowing upon it the capacity to render fine-grained medical images with high fidelity. Comprehensive experiments validate MedXChat's synergistic enhancement across all tasks. Our instruction data and model will be open-sourced.

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2312.02233 [cs.CV]
	(or arXiv:2312.02233v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2312.02233

Computer Science > Computer Vision and Pattern Recognition

Title:MedXChat: Bridging CXR Modalities with a Unified Multimodal Large Model

Submission history

Access Paper:

References & Citations

Bookmark

Bibliographic and Citation Tools

Code, Data and Media Associated with this Article

Demos

Recommenders and Search Tools

arXivLabs: experimental projects with community collaborators