Abstract
The Contrastive Language-Image Pre-training (CLIP) model has recently gained attention in the zero-shot domain. However, it still falls short in addressing cross-modal perception and the semantic gap between seen and unseen classes in Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR). To overcome these obstacles, we propose a Task-Like Training paradigm (TLT). In this work, we treat cross-modal perception and the semantic gap as a multi-task learning process. Before tackling these challenges, we fully exploit CLIP's text encoder and propose a text-based identification learning mechanism that helps the model learn discriminative features quickly. Next, we propose text prompt tutoring and cross-modal consistency learning to address cross-modal perception and the semantic gap, respectively. Meanwhile, we present a collaborative architecture to explore the potential shared information between tasks. Extensive experiments show that our approach significantly outperforms state-of-the-art methods on the Sketchy, Sketchy-No, TU-Berlin, and QuickDraw datasets.
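The multi-task objective described above can be illustrated with a minimal sketch. The snippet below is purely schematic and not the paper's implementation: the feature matrices are random placeholders standing in for sketch-branch, image-branch, and CLIP text-prompt embeddings, the two cosine-based losses only gesture at the identification and consistency tasks, and the task weights are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise-normalized cosine similarity matrix between two feature sets."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
sketch_feat = rng.normal(size=(4, 8))  # placeholder sketch-branch embeddings
image_feat = rng.normal(size=(4, 8))   # placeholder image-branch embeddings
text_feat = rng.normal(size=(4, 8))    # placeholder CLIP text-prompt embeddings

# Illustrative "identification" term: pull each sketch feature toward the
# text embedding of its class (diagonal of the similarity matrix).
ident_loss = np.mean(1.0 - np.diag(cosine_sim(sketch_feat, text_feat)))

# Illustrative "cross-modal consistency" term: align paired sketch/image
# embeddings so both modalities land in a shared space.
consist_loss = np.mean(1.0 - np.diag(cosine_sim(sketch_feat, image_feat)))

# Multi-task objective: a weighted sum of the per-task losses. The weights
# here are arbitrary placeholders, not values from the paper.
total_loss = 1.0 * ident_loss + 1.0 * consist_loss
```

Each per-sample term lies in [0, 2] (since cosine similarity lies in [-1, 1]), so minimizing the weighted sum drives paired features toward alignment in both tasks simultaneously.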
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 52204177 and Grant 52304182, and in part by the Fundamental Research Funds for the Central Universities under Grant 2020QN49.
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, H., Cheng, D., Jiang, H. et al. Task-like training paradigm in CLIP for zero-shot sketch-based image retrieval. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-17675-x