ABSTRACT
Image retrieval systems (IRS) are commonly used on e-commerce platforms for a wide range of applications such as price comparison and commodity recommendation. However, customers may encounter inconsistent retrieval results: although the retrieved image contains the query object, the main product of the retrieved image is not associated with the query product. This is caused by locating the wrong product instance when building the product image retrieval library. Because the product title makes it easy to determine which product is on sale, we propose Text-Guided MuliModal Product Location (TMML), which uses the product title to help locate the product instance that is actually being sold. We design a weakly-aligned region-text data collection method that generates region-text pseudo-labels from the IRS and user behavior on the e-commerce platform. To mitigate the impact of data noise, we propose a Mutual-Aware Contrastive Loss. Our results show that the proposed TMML outperforms the state-of-the-art method GLIP [11] by 3.95% in top-1 precision on our multi-object test set, and that 2.53% error-located images in AliExpress have been corrected, which greatly alleviates retrieval inconsistencies in the IRS.
References
- aliyun. 2022. TP Toolbox. https://ai.aliyun.com/nlp/ke.
- Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen, Mengchen Liu, Lu Yuan, and Lei Zhang. 2021. Dynamic Head: Unifying Object Detection Heads with Attentions. Computer Vision and Pattern Recognition (2021).
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics (2018).
- Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. 2021. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021).
- Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, and Derek Hoiem. 2020. Contrastive Learning for Weakly Supervised Phrase Grounding. European Conference on Computer Vision (2020).
- Bo Han, Jiangchao Yao, Gang Niu, Mingyuan Zhou, Ivor W. Tsang, Ya Zhang, and Masashi Sugiyama. 2018. Masking: A New Perspective of Noisy Supervision. Neural Information Processing Systems (2018).
- Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, and Masashi Sugiyama. 2018. Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels. Neural Information Processing Systems (2018).
- Xiao Han, Licheng Yu, Xiatian Zhu, Li Zhang, Yi-Zhe Song, and Tao Xiang. 2022. FashionViL: Fashion-Focused Vision-and-Language Representation Learning.
- Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, and Nicolas Carion. 2021. MDETR - Modulated Detection for End-to-End Multi-Modal Understanding. International Conference on Computer Vision (2021).
- Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven C. H. Hoi. 2021. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. Neural Information Processing Systems (2021).
- Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. 2022. Grounded Language-Image Pre-training.
- Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2016. Feature Pyramid Networks for Object Detection. arXiv: Computer Vision and Pattern Recognition (2016).
- Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. International Conference on Computer Vision (2021).
- Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
- Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748-8763.
- Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. 2019. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 658-666.
- Zhi Tian, Xiangxiang Chu, Xiaoming Wang, Xiaolin Wei, and Chunhua Shen. 2022. Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images. arXiv preprint arXiv:2205.13764 (2022).
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. Neural Information Processing Systems (2017).
- Josiah Wang and Lucia Specia. 2019. Phrase Localization Without Paired Training Examples. International Conference on Computer Vision (2019).
- Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, and Dong Yu. 2021. Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation. Computer Vision and Pattern Recognition (2021).
- Hongxin Wei, Lue Tao, Renchunzi Xie, and Bo An. 2021. Open-set Label Noise Can Improve Robustness Against Inherent Label Noise. arXiv: Learning (2021).
- Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. 2022. Language as Queries for Referring Video Object Segmentation.
- Fan Yang, Ajinkya Gorakhnath Kale, Yury Bubnov, Leon Stein, Qiaosong Wang, M. Hadi Kiapour, and Robinson Piramuthu. 2017. Visual Search at eBay. Knowledge Discovery and Data Mining (2017).
- Mouxing Yang, Yunfan Li, Zhenyu Huang, Zitao Liu, Peng Hu, and Xi Peng. 2021. Partially View-Aligned Representation Learning With Noise-Robust Contrastive Loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 1134-1143.
- Raymond A. Yeh, Minh N. Do, and Alexander G. Schwing. 2018. Unsupervised Textual Grounding: Linking Words to Image Concepts. Computer Vision and Pattern Recognition (2018).
- Licheng Yu, Jun Chen, Animesh Sinha, Mengjiao MJ Wang, Hugo Chen, Tamara L. Berg, and Ning Zhang. 2022. CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval. Knowledge Discovery and Data Mining (2022).
- Xunlin Zhan, Yangxin Wu, Xiao Dong, Yunchao Wei, Minlong Lu, Yichi Zhang, Hang Xu, and Xiaodan Liang. 2021. Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pretraining. International Conference on Computer Vision (2021).
- Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. 2020. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9759-9768.
- Yanhao Zhang, Pan Pan, Yun Zheng, Kang Zhao, Yingya Zhang, Xiaofeng Ren, and Rong Jin. 2022. Visual Search at Alibaba.
- Fang Zhao, Jianshu Li, Jian Zhao, and Jiashi Feng. 2018. Weakly Supervised Phrase Localization with Multi-scale Anchored Transformer Network. Computer Vision and Pattern Recognition (2018).
Index Terms
- TMML: Text-Guided MuliModal Product Location For Alleviating Retrieval Inconsistency in E-Commerce