Rewrite Caption Semantics: Bridging Semantic Gaps for Language-Supervised Semantic Segmentation

Xing, Yun; Kang, Jian; Xiao, Aoran; Nie, Jiahao; Shao, Ling; Lu, Shijian


arXiv:2309.13505 (cs)

[Submitted on 24 Sep 2023 (v1), last revised 4 Jan 2024 (this version, v4)]



Abstract: Vision-language pre-training has demonstrated remarkable zero-shot recognition ability and the potential to learn generalizable visual representations from language supervision. Taking a step further, language-supervised semantic segmentation enables spatial localization of textual inputs by learning pixel grouping solely from image-text pairs. Nevertheless, state-of-the-art methods suffer from clear semantic gaps between the visual and textual modalities: many visual concepts that appear in images are missing from their paired captions. Such semantic misalignment propagates through pre-training, leading to inferior zero-shot performance in dense prediction because the textual representations capture too few visual concepts. To close this semantic gap, we propose Concept Curation (CoCu), a pipeline that leverages CLIP to compensate for the missing semantics. For each image-text pair, we establish a concept archive that maintains potential visually matched concepts using our proposed vision-driven expansion and text-to-vision-guided ranking. Relevant concepts can thus be identified via cluster-guided sampling and fed into pre-training, thereby bridging the gap between visual and textual semantics. Extensive experiments over a broad suite of 8 segmentation benchmarks show that CoCu achieves superb zero-shot transfer performance and improves the language-supervised segmentation baseline by a large margin, suggesting the value of bridging the semantic gap in pre-training data.
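The three stages named in the abstract (expansion, ranking, sampling) can be illustrated with a toy sketch. This is not the authors' implementation: the hand-written 3-d vectors stand in for CLIP embeddings, the function names (`expand_concepts`, `rank_concepts`, `sample_per_cluster`) and the `cluster_of` labels are hypothetical, and the ranking step is simplified to a direct cosine similarity between a concept vector and the query image vector. One plausible reading of each stage, under those assumptions:

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between each row of a and each row of b."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

def expand_concepts(query_img, bank_imgs, bank_concepts, k=2):
    """Expansion (assumed form): collect concepts from the captions of
    the k images most similar to the query image."""
    sims = cosine_sim(query_img[None, :], bank_imgs)[0]
    archive = []
    for i in np.argsort(-sims)[:k]:
        for concept in bank_concepts[i]:
            if concept not in archive:
                archive.append(concept)
    return archive

def rank_concepts(query_img, archive, concept_emb):
    """Ranking (simplified assumption): score each archived concept by
    the similarity of its embedding to the query image embedding."""
    embs = np.stack([concept_emb[c] for c in archive])
    scores = cosine_sim(query_img[None, :], embs)[0]
    return [(archive[i], float(scores[i])) for i in np.argsort(-scores)]

def sample_per_cluster(ranked, cluster_of):
    """Sampling (assumed form): keep the best-ranked concept per cluster
    so that the selected concepts stay diverse."""
    seen, picked = set(), []
    for concept, _ in ranked:
        if cluster_of[concept] not in seen:
            seen.add(cluster_of[concept])
            picked.append(concept)
    return picked

# Toy stand-ins for CLIP image and text embeddings (illustrative only).
query_img = np.array([1.0, 0.2, 0.0])
bank_imgs = np.array([[1.0, 0.1, 0.0], [0.9, 0.3, 0.0], [0.0, 0.0, 1.0]])
bank_concepts = [["dog", "grass"], ["puppy", "frisbee"], ["car"]]
concept_emb = {
    "dog": np.array([1.0, 0.0, 0.0]),
    "puppy": np.array([0.8, 0.0, 0.3]),
    "grass": np.array([0.6, 0.8, 0.0]),
    "frisbee": np.array([0.5, 0.5, 0.7]),
    "car": np.array([0.0, 0.0, 1.0]),
}
cluster_of = {"dog": "animal", "puppy": "animal", "grass": "scene",
              "frisbee": "object", "car": "vehicle"}

archive = expand_concepts(query_img, bank_imgs, bank_concepts, k=2)
ranked = rank_concepts(query_img, archive, concept_emb)
picked = sample_per_cluster(ranked, cluster_of)
print(picked)
```

In this toy setup the caption of the dissimilar third image contributes nothing to the archive, and the two concepts in the same cluster are reduced to the higher-ranked one. A real pipeline would replace the hand-written vectors with CLIP features and would likely differ in how ranking and sampling are defined.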

Comments:	NeurIPS 2023. Code is available at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2309.13505 [cs.CV]
	(or arXiv:2309.13505v4 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2309.13505

Submission history

From: Yun Xing
[v1] Sun, 24 Sep 2023 00:05:39 UTC (1,215 KB)
[v2] Wed, 27 Sep 2023 02:39:40 UTC (1,215 KB)
[v3] Tue, 24 Oct 2023 11:01:24 UTC (739 KB)
[v4] Thu, 4 Jan 2024 06:46:53 UTC (739 KB)
