
Task-like training paradigm in CLIP for zero-shot sketch-based image retrieval

Published in: Multimedia Tools and Applications

Abstract

The Contrastive Language-Image Pre-training model (CLIP) has recently gained attention in the zero-shot domain. However, it still falls short in addressing cross-modal perception and the semantic gap between seen and unseen classes in Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR). To overcome these obstacles, we propose a Task-Like Training paradigm (TLT), which views cross-modal perception and the semantic gap as a multi-task learning process. Before tackling these challenges, we fully exploit CLIP’s text encoder and propose a text-based identification learning mechanism that helps the model learn discriminative features quickly. We then propose text prompt tutoring and cross-modal consistency learning to address cross-modal perception and the semantic gap, respectively. Meanwhile, we present a collaborative architecture to explore the shared information between tasks. Extensive experiments show that our approach significantly outperforms state-of-the-art methods on the Sketchy, Sketchy-No, TU-Berlin, and QuickDraw datasets.
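The retrieval setting the abstract describes can be illustrated with a minimal sketch: once sketches and photos are embedded into a shared space (in the paper, via CLIP's encoders trained under the proposed tasks), retrieval reduces to ranking gallery photos by cosine similarity to the query sketch. The toy embeddings and helper names below are illustrative assumptions, not the paper's actual features or API.

```python
import math

def l2_normalize(v):
    # Unit-length vectors make the dot product equal cosine similarity,
    # mirroring CLIP's normalized embedding space.
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def rank_gallery(sketch_emb, gallery_embs):
    """Rank gallery photos by cosine similarity to the sketch query."""
    q = l2_normalize(sketch_emb)
    sims = [sum(a * b for a, b in zip(l2_normalize(g), q))
            for g in gallery_embs]
    # Indices sorted from most to least similar.
    return sorted(range(len(sims)), key=lambda i: -sims[i]), sims

# Toy 3-d embeddings standing in for CLIP features.
gallery = [[1.0, 0.0, 0.0],
           [0.0, 1.0, 0.0],
           [0.9, 0.1, 0.0],
           [0.0, 0.0, 1.0]]
sketch = [1.0, 0.05, 0.0]
order, _ = rank_gallery(sketch, gallery)
print(order)  # gallery indices, nearest photo first
```

In the zero-shot setting, the gallery contains photos from classes never seen during training; the quality of the ranking then depends entirely on how well the shared embedding space generalizes, which is what the paper's training tasks target.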


[Figures 1–8 appear in the full article.]


Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.


Acknowledgements

This work was supported in part by the National Natural Science Foundation of China under Grant 52204177 and Grant 52304182, and in part by the Fundamental Research Funds for the Central Universities under Grant 2020QN49.

Author information

Corresponding authors

Correspondence to Deqiang Cheng or He Jiang.

Ethics declarations

Competing interests

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Zhang, H., Cheng, D., Jiang, H. et al. Task-like training paradigm in CLIP for zero-shot sketch-based image retrieval. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-17675-x

