short-paper

TMML: Text-Guided MuliModal Product Location For Alleviating Retrieval Inconsistency in E-Commerce

Authors:
Youhua Tang, Alibaba Group, Hangzhou, China (ORCID 0009-0007-9037-8128)
Xiong Xiong, Alibaba Group, Hangzhou, China (ORCID 0000-0002-5992-9733)
Siyang Sun, Alibaba Group, Hangzhou, China (ORCID 0009-0006-3185-8431)
Baoliang Cui, Alibaba Group, Hangzhou, China (ORCID 0009-0008-7863-5387)
Yun Zheng, Alibaba Group, Hangzhou, China (ORCID 0009-0001-5767-014X)
Haihong Tang, Alibaba Group, Hangzhou, China (ORCID 0000-0002-7103-975X)

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, July 2023, Pages 3275–3279, https://doi.org/10.1145/3539618.3591836

Published: 18 July 2023

ABSTRACT

Image retrieval systems (IRS) are widely used on e-commerce platforms for applications such as price comparison and commodity recommendation. However, customers may encounter inconsistent retrieval results: a retrieved image contains the query object, but the main product of that image is unrelated to the query product. This happens when the wrong product instance is located while the product image retrieval library is being built. Because the product title readily indicates which product is on sale, we propose Text-Guided MuliModal Product Location (TMML), which uses product titles as additional input to help locate the product instance actually being sold. We design a weakly-aligned region-text data collection method that generates region-text pseudo-labels from the IRS and from user behavior on the e-commerce platform. To mitigate the impact of noise in this data, we propose a Mutual-Aware Contrastive Loss. Our results show that TMML outperforms the state-of-the-art method GLIP [11] by 3.95% in top-1 precision on our multi-object test set, and that 2.53% of AliExpress images whose product had been located incorrectly have been corrected, which greatly alleviates retrieval inconsistency in the IRS.
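The evaluation idea in the abstract can be illustrated with a small sketch. It is not the TMML model: it assumes that each candidate region and each title already have an embedding vector, picks the region whose embedding is most similar to the title, and counts a prediction as correct when its box overlaps the ground-truth box with IoU of at least 0.5. The function names (`locate_product`, `top1_precision`), the toy vectors, and the 0.5 threshold are all hypothetical choices made for illustration; the abstract does not specify how top-1 precision is computed.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def locate_product(regions, title_vec):
    """Return the box of the candidate region most similar to the title."""
    best = max(regions, key=lambda r: cosine(r["vec"], title_vec))
    return best["box"]


def top1_precision(samples, iou_threshold=0.5):
    """Fraction of images whose top-scoring region matches the labeled box.

    The IoU threshold is an assumed convention, not taken from the paper.
    """
    hits = sum(
        iou(locate_product(s["regions"], s["title_vec"]), s["gt_box"]) >= iou_threshold
        for s in samples
    )
    return hits / len(samples)


# Toy multi-object images: each has two candidate regions (hypothetical data).
BOX_A, BOX_B = (0, 0, 10, 20), (12, 5, 18, 12)
samples = [
    {   # title matches region A, and A is the product on sale
        "title_vec": [1.0, 0.0],
        "regions": [{"box": BOX_A, "vec": [0.9, 0.1]}, {"box": BOX_B, "vec": [0.1, 0.9]}],
        "gt_box": (0, 0, 10, 19),
    },
    {   # title matches region B, and B is the product on sale
        "title_vec": [0.0, 1.0],
        "regions": [{"box": BOX_A, "vec": [0.8, 0.2]}, {"box": BOX_B, "vec": [0.3, 0.7]}],
        "gt_box": BOX_B,
    },
    {   # noisy embeddings: the title is closer to B, but A is on sale
        "title_vec": [1.0, 0.0],
        "regions": [{"box": BOX_A, "vec": [0.2, 0.8]}, {"box": BOX_B, "vec": [0.7, 0.3]}],
        "gt_box": BOX_A,
    },
]

precision = top1_precision(samples)
print(f"top-1 precision: {precision:.3f}")
```

The third sample shows the failure mode the abstract describes: the selected region is a real object in the image, but not the product being sold, so it counts as a miss.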

Supplemental Material

SIGIR2023-fp3713.mp4 (mp4, 21.3 MB)

References

[1] aliyun. 2022. TP Toolbox. https://ai.aliyun.com/nlp/ke.
[2] Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen, Mengchen Liu, Lu Yuan, and Lei Zhang. 2021. Dynamic Head: Unifying Object Detection Heads with Attentions. Computer Vision and Pattern Recognition (2021).
[3] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics (2018).
[4] Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. 2021. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021).
[5] Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, and Derek Hoiem. 2020. Contrastive Learning for Weakly Supervised Phrase Grounding. European Conference on Computer Vision (2020).
[6] Bo Han, Jiangchao Yao, Gang Niu, Mingyuan Zhou, Ivor W. Tsang, Ya Zhang, and Masashi Sugiyama. 2018. Masking: A New Perspective of Noisy Supervision. Neural Information Processing Systems (2018).
[7] Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, and Masashi Sugiyama. 2018. Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels. Neural Information Processing Systems (2018).
[8] Xiao Han, Licheng Yu, Xiatian Zhu, Li Zhang, Yi-Zhe Song, and Tao Xiang. 2022. FashionViL: Fashion-Focused Vision-and-Language Representation Learning.
[9] Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, and Nicolas Carion. 2021. MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding. International Conference on Computer Vision (2021).
[10] Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven C. H. Hoi. 2021. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. Neural Information Processing Systems (2021).
[11] Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. 2022. Grounded Language-Image Pre-training.
[12] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2016. Feature Pyramid Networks for Object Detection. arXiv: Computer Vision and Pattern Recognition (2016).
[13] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. International Conference on Computer Vision (2021).
[14] Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
[15] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
[16] Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. 2019. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 658–666.
[17] Zhi Tian, Xiangxiang Chu, Xiaoming Wang, Xiaolin Wei, and Chunhua Shen. 2022. Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images. arXiv preprint arXiv:2205.13764 (2022).
[18] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. Neural Information Processing Systems (2017).
[19] Josiah Wang and Lucia Specia. 2019. Phrase Localization Without Paired Training Examples. International Conference on Computer Vision (2019).
[20] Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, and Dong Yu. 2021. Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation. Computer Vision and Pattern Recognition (2021).
[21] Hongxin Wei, Lue Tao, Renchunzi Xie, and Bo An. 2021. Open-set Label Noise Can Improve Robustness Against Inherent Label Noise. arXiv: Learning (2021).
[22] Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. 2022. Language as Queries for Referring Video Object Segmentation.
[23] Fan Yang, Ajinkya Gorakhnath Kale, Yury Bubnov, Leon Stein, Qiaosong Wang, M. Hadi Kiapour, and Robinson Piramuthu. 2017. Visual Search at eBay. Knowledge Discovery and Data Mining (2017).
[24] Mouxing Yang, Yunfan Li, Zhenyu Huang, Zitao Liu, Peng Hu, and Xi Peng. 2021. Partially View-Aligned Representation Learning With Noise-Robust Contrastive Loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 1134–1143.
[25] Raymond A. Yeh, Minh N. Do, and Alexander G. Schwing. 2018. Unsupervised Textual Grounding: Linking Words to Image Concepts. Computer Vision and Pattern Recognition (2018).
[26] Licheng Yu, Jun Chen, Animesh Sinha, Mengjiao MJ Wang, Hugo Chen, Tamara L. Berg, and Ning Zhang. 2022. CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval. Knowledge Discovery and Data Mining (2022).
[27] Xunlin Zhan, Yangxin Wu, Xiao Dong, Yunchao Wei, Minlong Lu, Yichi Zhang, Hang Xu, and Xiaodan Liang. 2021. Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pretraining. International Conference on Computer Vision (2021).
[28] Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. 2020. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 9759–9768.
[29] Yanhao Zhang, Pan Pan, Yun Zheng, Kang Zhao, Yingya Zhang, Xiaofeng Ren, and Rong Jin. 2022. Visual Search at Alibaba.
[30] Fang Zhao, Jianshu Li, Jian Zhao, and Jiashi Feng. 2018. Weakly Supervised Phrase Localization with Multi-scale Anchored Transformer Network. Computer Vision and Pattern Recognition (2018).

Index Terms

1. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals

Published in
SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023
3567 pages
ISBN:9781450394086
DOI:10.1145/3539618
General Chairs:
Hsin-Hsi Chen, National Taiwan University
Wei-Jou (Edward) Duh, National Taiwan University
Hen-Hsen Huang, Academia Sinica
Program Chairs:
Makoto P. Kato, Spotify
Josiane Mothe, Universite de Toulouse
Barbara Poblete, University of Chile and Amazon Visiting Academic
Copyright © 2023 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
image retrieval
multimodal learning
product location
Qualifiers
- short-paper
Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%