DOI: 10.1145/3581783.3611709
research-article

Towards Unified Text-based Person Retrieval: A Large-scale Multi-Attribute and Language Search Benchmark

Published: 27 October 2023

ABSTRACT

In this paper, we introduce MALS, a large Multi-Attribute and Language Search dataset for text-based person retrieval, and explore the feasibility of pre-training on attribute recognition and image-text matching jointly. In particular, MALS contains 1,510,330 image-text pairs, about 37.5× larger than the prevailing CUHK-PEDES, and every image is annotated with 27 attributes. Considering privacy concerns and annotation costs, we leverage off-the-shelf diffusion models to generate the dataset. To verify the feasibility of learning from the generated data, we develop a new joint Attribute Prompt Learning and Text Matching Learning (APTM) framework that exploits the knowledge shared between attributes and text. As the name implies, APTM contains an attribute prompt learning stream and a text matching learning stream. (1) Attribute prompt learning leverages attribute prompts for image-attribute alignment, which enhances text matching learning. (2) Text matching learning facilitates representation learning on fine-grained details and, in turn, boosts attribute prompt learning. Extensive experiments validate the effectiveness of pre-training on MALS, achieving state-of-the-art retrieval performance via APTM on three challenging real-world benchmarks. In particular, APTM improves Recall@1 accuracy by +6.96%, +7.68%, and +16.95% on the CUHK-PEDES, ICFG-PEDES, and RSTPReid datasets, respectively. The dataset, model, and code are available at https://github.com/Shuyu-XJTU/APTM.
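
For readers unfamiliar with the Recall@1 metric quoted above, the following is a minimal sketch of how Recall@K is commonly computed for text-to-image person retrieval: a text query counts as a hit if any of its K highest-scoring gallery images shows the same person. The function name `recall_at_k`, the toy similarity matrix, and the identity labels are all illustrative assumptions; this is not taken from the APTM code base, whose evaluation protocol may differ in detail.

```python
import numpy as np

def recall_at_k(similarity, query_ids, gallery_ids, k=1):
    """Fraction of text queries whose top-k retrieved images include
    at least one image with the same person identity.

    Hypothetical helper for illustration; not the authors' evaluation code.
    """
    similarity = np.asarray(similarity, dtype=float)
    query_ids = np.asarray(query_ids)
    gallery_ids = np.asarray(gallery_ids)
    # Indices of the k highest-scoring gallery images for each query.
    top_k = np.argsort(-similarity, axis=1)[:, :k]
    # hits[i] is True if any top-k image shares query i's identity.
    hits = (gallery_ids[top_k] == query_ids[:, None]).any(axis=1)
    return float(hits.mean())

# Toy data (assumed): 3 text queries scored against 4 gallery images.
sim = np.array([
    [0.9, 0.1, 0.3, 0.2],  # query 0 ranks image 0 first
    [0.2, 0.4, 0.8, 0.1],  # query 1 ranks image 2 first
    [0.5, 0.6, 0.1, 0.3],  # query 2 ranks image 1 first
])
query_ids = [10, 11, 12]        # person identity described by each query
gallery_ids = [10, 11, 12, 12]  # person identity shown in each image

r1 = recall_at_k(sim, query_ids, gallery_ids, k=1)
r2 = recall_at_k(sim, query_ids, gallery_ids, k=2)
print(f"Recall@1 = {r1:.3f}, Recall@2 = {r2:.3f}")
```

In this toy setup only query 0 retrieves the correct identity at rank 1, while query 1 recovers it within the top 2, so Recall@2 is higher than Recall@1.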

Supplemental Material

mmfp0088-video.mp4 (MP4, 40.6 MB)

Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Copyright © 2023 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%
