Abstract
The Contrastive Language-Image Pre-training (CLIP) model has recently gained attention in the zero-shot domain. However, it still falls short in addressing cross-modal perception and the semantic gap between seen and unseen classes in Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR). To overcome these obstacles, we propose a Task-Like Training paradigm (TLT). In this work, we treat cross-modal perception and the semantic gap as a multi-task learning process. Before tackling these challenges, we fully exploit CLIP's text encoder and propose a text-based identification learning mechanism that helps the model learn discriminative features quickly. Next, we propose text prompt tutoring and cross-modal consistency learning to address cross-modal perception and the semantic gap, respectively. Meanwhile, we present a collaborative architecture to explore the potential shared information between tasks. Extensive experiments show that our approach significantly outperforms state-of-the-art methods on the Sketchy, Sketchy-No, TU-Berlin, and QuickDraw datasets.
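The multi-task objective described above can be illustrated with a minimal sketch. The snippet below is purely schematic and not the paper's implementation: the feature matrices are random placeholders standing in for sketch-branch, image-branch, and CLIP text-prompt embeddings, the two cosine-based losses only gesture at the identification and consistency tasks, and the task weights are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    """Row-wise-normalized cosine similarity matrix between two feature sets."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

rng = np.random.default_rng(0)
sketch_feat = rng.normal(size=(4, 8))  # placeholder sketch-branch embeddings
image_feat = rng.normal(size=(4, 8))   # placeholder image-branch embeddings
text_feat = rng.normal(size=(4, 8))    # placeholder CLIP text-prompt embeddings

# Illustrative "identification" term: pull each sketch feature toward the
# text embedding of its class (diagonal of the similarity matrix).
ident_loss = np.mean(1.0 - np.diag(cosine_sim(sketch_feat, text_feat)))

# Illustrative "cross-modal consistency" term: align paired sketch/image
# embeddings so both modalities land in a shared space.
consist_loss = np.mean(1.0 - np.diag(cosine_sim(sketch_feat, image_feat)))

# Multi-task objective: a weighted sum of the per-task losses. The weights
# here are arbitrary placeholders, not values from the paper.
total_loss = 1.0 * ident_loss + 1.0 * consist_loss
```

Each per-sample term lies in [0, 2] (since cosine similarity lies in [-1, 1]), so minimizing the weighted sum drives paired features toward alignment in both tasks simultaneously.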
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China under Grant 52204177 and Grant 52304182, and in part by the Fundamental Research Funds for the Central Universities under Grant 2020QN49.
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Zhang, H., Cheng, D., Jiang, H. et al. Task-like training paradigm in CLIP for zero-shot sketch-based image retrieval. Multimed Tools Appl (2023). https://doi.org/10.1007/s11042-023-17675-x