Adaptive Uncertainty-Based Learning for Text-Based Person Retrieval

Authors

  • Shenshen Li, University of Electronic Science and Technology of China
  • Chen He, University of Electronic Science and Technology of China
  • Xing Xu, University of Electronic Science and Technology of China
  • Fumin Shen, University of Electronic Science and Technology of China
  • Yang Yang, University of Electronic Science and Technology of China
  • Heng Tao Shen, University of Electronic Science and Technology of China

DOI:

https://doi.org/10.1609/aaai.v38i4.28101

Keywords:

CV: Image and Video Retrieval, CV: Language and Vision, ML: Calibration & Uncertainty Quantification

Abstract

Text-based person retrieval aims at retrieving a specific pedestrian image from a gallery based on textual descriptions. The primary challenge is how to overcome the inherent heterogeneous modality gap under significant intra-class variation and minimal inter-class variation. Existing approaches commonly employ vision-language pre-training or attention mechanisms to learn appropriate cross-modal alignments from noisy inputs. Despite commendable progress, current methods inevitably suffer from two defects: 1) Matching ambiguity, which mainly derives from unreliable matching pairs; 2) One-sided cross-modal alignments, stemming from the failure to explore one-to-many correspondences, i.e., coarse-grained semantic alignment. These critical issues significantly deteriorate retrieval performance. To this end, we propose a novel framework termed Adaptive Uncertainty-based Learning (AUL) for text-based person retrieval from the uncertainty perspective. Specifically, our AUL framework consists of three key components: 1) Uncertainty-aware Matching Filtration, which leverages Subjective Logic to effectively mitigate the disturbance of unreliable matching pairs and select high-confidence cross-modal matches for training; 2) Uncertainty-based Alignment Refinement, which not only simulates coarse-grained alignments by constructing uncertainty representations but also performs progressive learning to incorporate coarse- and fine-grained alignments properly; 3) Cross-modal Masked Modeling, which aims at exploring more comprehensive relations between vision and language. Extensive experiments demonstrate that our AUL method consistently achieves state-of-the-art performance on three benchmark datasets in supervised, weakly supervised, and domain generalization settings. Our code is available at https://github.com/CFM-MSG/Code-AUL.
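To make the Subjective Logic idea behind Uncertainty-aware Matching Filtration concrete, the sketch below shows the standard mapping from non-negative per-class evidence to belief masses and an overall uncertainty mass, and a simple threshold rule for discarding unreliable matching pairs. The evidence values, threshold, and filtering rule are illustrative assumptions, not the authors' exact implementation (see the linked repository for that).

```python
def subjective_logic_uncertainty(evidence):
    """Map non-negative per-class evidence e_k to belief masses and an
    uncertainty mass u, following Subjective Logic:
    alpha_k = e_k + 1, S = sum(alpha), b_k = e_k / S, u = K / S."""
    K = len(evidence)
    alpha = [e + 1.0 for e in evidence]
    S = sum(alpha)
    beliefs = [e / S for e in evidence]
    u = K / S  # uncertainty grows as total evidence shrinks
    return beliefs, u

def filter_pairs(pair_evidence, max_uncertainty=0.5):
    """Keep only cross-modal matching pairs whose uncertainty mass is
    below a confidence threshold (hypothetical filtering rule)."""
    kept = []
    for pair_id, evidence in pair_evidence:
        _, u = subjective_logic_uncertainty(evidence)
        if u < max_uncertainty:
            kept.append(pair_id)
    return kept

pairs = [("pair_a", [9.0, 1.0]),   # strong evidence -> low uncertainty
         ("pair_b", [0.2, 0.3])]   # weak evidence -> high uncertainty
print(filter_pairs(pairs))  # -> ['pair_a']; pair_b is treated as unreliable
```

The key property is that uncertainty is driven by the *total* amount of evidence rather than by class probabilities alone, so a pair the model has seen little support for is flagged even if its normalized scores look decisive.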

Published

2024-03-24

How to Cite

Li, S., He, C., Xu, X., Shen, F., Yang, Y., & Shen, H. T. (2024). Adaptive Uncertainty-Based Learning for Text-Based Person Retrieval. Proceedings of the AAAI Conference on Artificial Intelligence, 38(4), 3172-3180. https://doi.org/10.1609/aaai.v38i4.28101

Section

AAAI Technical Track on Computer Vision III