CLIP-Count: Towards Text-Guided Zero-Shot Object Counting

Jiang, Ruixiang; Liu, Lingbo; Chen, Changwen

doi:10.1145/3581783.3611789

arXiv:2305.07304 (cs)

[Submitted on 12 May 2023 (v1), last revised 10 Aug 2023 (this version, v2)]

Abstract: Recent advances in vision-language models have shown remarkable zero-shot text-image matching ability that is transferable to downstream tasks such as object detection and segmentation. Adapting these models for object counting, however, remains a formidable challenge. In this study, we first investigate transferring vision-language models (VLMs) for class-agnostic object counting. Specifically, we propose CLIP-Count, the first end-to-end pipeline that estimates density maps for open-vocabulary objects with text guidance in a zero-shot manner. To align the text embedding with dense visual features, we introduce a patch-text contrastive loss that guides the model to learn informative patch-level visual representations for dense prediction. Moreover, we design a hierarchical patch-text interaction module to propagate semantic information across different resolution levels of visual features. Benefiting from the full exploitation of the rich image-text alignment knowledge of pretrained VLMs, our method effectively generates high-quality density maps for objects of interest. Extensive experiments on the FSC-147, CARPK, and ShanghaiTech crowd counting datasets demonstrate state-of-the-art accuracy and generalizability of the proposed method. Code is available: this https URL.
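
The abstract names a patch-text contrastive loss but does not give its formula. As a rough illustration only, the sketch below shows one plausible InfoNCE-style form, under the assumption that each image patch is labeled positive when it contains the queried object and negative otherwise. The function names, the positive/negative labeling rule, and the temperature value of 0.07 are all hypothetical and are not taken from the paper; the actual formulation may differ.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def patch_text_contrastive_loss(patch_embs, text_emb, positive_mask, temperature=0.07):
    """Hypothetical InfoNCE-style loss between patch embeddings and one text embedding.

    patch_embs:    list of patch embedding vectors
    text_emb:      embedding vector of the text prompt
    positive_mask: list of bools, True where a patch is assumed to contain the object
    temperature:   softmax temperature (0.07 is an assumed value)
    """
    if not any(positive_mask):
        raise ValueError("at least one positive patch is required")
    logits = [cosine(p, text_emb) / temperature for p in patch_embs]
    # Subtract the maximum logit for numerical stability before exponentiating.
    shift = max(logits)
    exps = [math.exp(l - shift) for l in logits]
    pos_mass = sum(e for e, is_pos in zip(exps, positive_mask) if is_pos)
    total_mass = sum(exps)
    # The loss is low when positive patches carry most of the similarity mass.
    return -math.log(pos_mass / total_mass)

# Toy example: two 2-D patch embeddings and a text embedding aligned with the first.
patches = [[1.0, 0.0], [0.0, 1.0]]
text = [1.0, 0.0]
loss_aligned = patch_text_contrastive_loss(patches, text, [True, False])
loss_misaligned = patch_text_contrastive_loss(patches, text, [False, True])
```

In this toy setup the loss is near zero when the positive patch is the one aligned with the text embedding, and large when the positive label sits on the orthogonal patch, which is the behavior a contrastive objective of this kind is meant to encourage.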

Comments:	Accepted by ACM MM 2023
Subjects:	Computer Vision and Pattern Recognition (cs.CV); Artificial Intelligence (cs.AI)
Cite as:	arXiv:2305.07304 [cs.CV]
	(or arXiv:2305.07304v2 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2305.07304
Related DOI:	https://doi.org/10.1145/3581783.3611789

Submission history

From: Ruixiang Jiang
[v1] Fri, 12 May 2023 08:19:39 UTC (6,675 KB)
[v2] Thu, 10 Aug 2023 04:04:37 UTC (3,969 KB)
