DOI: 10.1145/3539618.3591836
short-paper

TMML: Text-Guided MuliModal Product Location For Alleviating Retrieval Inconsistency in E-Commerce

Published: 18 July 2023

ABSTRACT

Image retrieval systems (IRS) are commonly used on e-commerce platforms for a wide range of applications, such as price comparison and commodity recommendation. However, customers may encounter inconsistent retrieval results: although the retrieved image contains the query object, the main product of the retrieved image is not associated with the query product. This inconsistency is caused by locating the wrong product instance when building the product image retrieval library. Because the product title makes it easy to determine which product is on sale, we propose Text-Guided MuliModal Product Location (TMML), which uses product titles as additional input to help locate the product instance that is actually being sold. We design a weakly-aligned region-text data collection method that generates region-text pseudo-labels from the IRS and user behavior on the e-commerce platform. To mitigate the impact of data noise, we propose a Mutual-Aware Contrastive Loss. Our results show that TMML outperforms the state-of-the-art method GLIP [11] by 3.95% in top-1 precision on our multi-object test set, and that 2.53% of mislocated images on AliExpress have been corrected, which greatly alleviates retrieval inconsistency in the IRS.
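The idea of letting the title pick the selling product among several detected regions, and then scoring the result by top-1 precision, can be illustrated with a minimal sketch. This is not the TMML model: the abstract does not describe the architecture, the Mutual-Aware Contrastive Loss, or the exact metric definition. The sketch assumes that each candidate region and each title already have an embedding in a shared space, that the region with the highest cosine similarity to the title is selected, and that a selection counts as correct when its box overlaps the ground-truth box with IoU of at least 0.5. All function names, the embeddings, and the threshold are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout
Sample = Tuple[Sequence[float], List[Sequence[float]], List[Box], Box]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def locate_product(title_emb: Sequence[float],
                   region_embs: List[Sequence[float]]) -> int:
    """Return the index of the region most similar to the title (assumed rule)."""
    scores = [cosine(title_emb, r) for r in region_embs]
    return max(range(len(scores)), key=scores.__getitem__)


def top1_precision(samples: List[Sample], iou_threshold: float = 0.5) -> float:
    """Fraction of images whose title-selected box matches the ground truth.

    The IoU threshold of 0.5 is an assumption, not taken from the paper.
    """
    hits = 0
    for title_emb, region_embs, boxes, gt_box in samples:
        best = locate_product(title_emb, region_embs)
        if iou(boxes[best], gt_box) >= iou_threshold:
            hits += 1
    return hits / len(samples) if samples else 0.0
```

In a multi-object image, a purely visual locator has no way to tell which of two detected products is the one for sale; in this sketch the title embedding breaks that tie, which is the intuition the abstract describes.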

Supplemental Material

SIGIR2023-fp3713.mp4 (mp4, 21.3 MB)

References

  1. aliyun. 2022. TP Toolbox. https://ai.aliyun.com/nlp/ke.
  2. Xiyang Dai, Yinpeng Chen, Bin Xiao, Dongdong Chen, Mengchen Liu, Lu Yuan, and Lei Zhang. 2021. Dynamic Head: Unifying Object Detection Heads with Attentions. Computer Vision and Pattern Recognition (2021).
  3. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. North American Chapter of the Association for Computational Linguistics (2018).
  4. Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun. 2021. Yolox: Exceeding yolo series in 2021. arXiv preprint arXiv:2107.08430 (2021).
  5. Tanmay Gupta, Arash Vahdat, Gal Chechik, Xiaodong Yang, Jan Kautz, and Derek Hoiem. 2020. Contrastive Learning for Weakly Supervised Phrase Grounding. European Conference on Computer Vision (2020).
  6. Bo Han, Jiangchao Yao, Gang Niu, Mingyuan Zhou, Ivor W. Tsang, Ya Zhang, and Masashi Sugiyama. 2018. Masking: A New Perspective of Noisy Supervision. Neural Information Processing Systems (2018).
  7. Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor W. Tsang, and Masashi Sugiyama. 2018. Co-teaching: Robust Training of Deep Neural Networks with Extremely Noisy Labels. Neural Information Processing Systems (2018).
  8. Xiao Han, Licheng Yu, Xiatian Zhu, Li Zhang, Yi-Zhe Song, and Tao Xiang. 2022. FashionViL: Fashion-Focused Vision-and-Language Representation Learning.
  9. Aishwarya Kamath, Mannat Singh, Yann LeCun, Ishan Misra, Gabriel Synnaeve, and Nicolas Carion. 2021. MDETR -- Modulated Detection for End-to-End Multi-Modal Understanding. International Conference on Computer Vision (2021).
  10. Junnan Li, Ramprasaath R. Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven C. H. Hoi. 2021. Align before Fuse: Vision and Language Representation Learning with Momentum Distillation. Neural Information Processing Systems (2021).
  11. Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. 2022. Grounded Language-Image Pre-training.
  12. Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. 2016. Feature Pyramid Networks for Object Detection. arXiv: Computer Vision and Pattern Recognition (2016).
  13. Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. 2021. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. International Conference on Computer Vision (2021).
  14. Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
  15. Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning. PMLR, 8748–8763.
  16. Hamid Rezatofighi, Nathan Tsoi, JunYoung Gwak, Amir Sadeghian, Ian Reid, and Silvio Savarese. 2019. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 658–666.
  17. Zhi Tian, Xiangxiang Chu, Xiaoming Wang, Xiaolin Wei, and Chunhua Shen. 2022. Fully Convolutional One-Stage 3D Object Detection on LiDAR Range Images. arXiv preprint arXiv:2205.13764 (2022).
  18. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is All you Need. Neural Information Processing Systems (2017).
  19. Josiah Wang and Lucia Specia. 2019. Phrase Localization Without Paired Training Examples. International Conference on Computer Vision (2019).
  20. Liwei Wang, Jing Huang, Yin Li, Kun Xu, Zhengyuan Yang, and Dong Yu. 2021. Improving Weakly Supervised Visual Grounding by Contrastive Knowledge Distillation. Computer Vision and Pattern Recognition (2021).
  21. Hongxin Wei, Lue Tao, Renchunzi Xie, and Bo An. 2021. Open-set Label Noise Can Improve Robustness Against Inherent Label Noise. arXiv: Learning (2021).
  22. Jiannan Wu, Yi Jiang, Peize Sun, Zehuan Yuan, and Ping Luo. 2022. Language as Queries for Referring Video Object Segmentation.
  23. Fan Yang, Ajinkya Gorakhnath Kale, Yury Bubnov, Leon Stein, Qiaosong Wang, M. Hadi Kiapour, and Robinson Piramuthu. 2017. Visual Search at eBay. Knowledge Discovery and Data Mining (2017).
  24. Mouxing Yang, Yunfan Li, Zhenyu Huang, Zitao Liu, Peng Hu, and Xi Peng. 2021. Partially View-Aligned Representation Learning With Noise-Robust Contrastive Loss. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 1134–1143.
  25. Raymond A. Yeh, Minh N. Do, and Alexander G. Schwing. 2018. Unsupervised Textual Grounding: Linking Words to Image Concepts. Computer Vision and Pattern Recognition (2018).
  26. Licheng Yu, Jun Chen, Animesh Sinha, Mengjiao MJ Wang, Hugo Chen, Tamara L. Berg, and Ning Zhang. 2022. CommerceMM: Large-Scale Commerce MultiModal Representation Learning with Omni Retrieval. Knowledge Discovery and Data Mining (2022).
  27. Xunlin Zhan, Yangxin Wu, Xiao Dong, Yunchao Wei, Minlong Lu, Yichi Zhang, Hang Xu, and Xiaodan Liang. 2021. Product1M: Towards Weakly Supervised Instance-Level Product Retrieval via Cross-Modal Pretraining. International Conference on Computer Vision (2021).
  28. Shifeng Zhang, Cheng Chi, Yongqiang Yao, Zhen Lei, and Stan Z Li. 2020. Bridging the gap between anchor-based and anchor-free detection via adaptive training sample selection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 9759–9768.
  29. Yanhao Zhang, Pan Pan, Yun Zheng, Kang Zhao, Yingya Zhang, Xiaofeng Ren, and Rong Jin. 2022. Visual Search at Alibaba.
  30. Fang Zhao, Jianshu Li, Jian Zhao, and Jiashi Feng. 2018. Weakly Supervised Phrase Localization with Multi-scale Anchored Transformer Network. Computer Vision and Pattern Recognition (2018).

    • Published in

      SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
      July 2023
      3567 pages
      ISBN: 9781450394086
      DOI: 10.1145/3539618

      Copyright © 2023 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      Overall Acceptance Rate: 792 of 3,983 submissions, 20%