SHaRPose: Sparse High-Resolution Representation for Human Pose Estimation

Authors

  • Xiaoqi An PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology; State Key Laboratory of Integrated Services Networks, Xidian University
  • Lin Zhao PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology; State Key Laboratory of Integrated Services Networks, Xidian University
  • Chen Gong PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology
  • Nannan Wang State Key Laboratory of Integrated Services Networks, Xidian University
  • Di Wang State Key Laboratory of Integrated Services Networks, Xidian University
  • Jian Yang PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, Jiangsu Key Lab of Image and Video Understanding for Social Security, School of Computer Science and Engineering, Nanjing University of Science and Technology

DOI:

https://doi.org/10.1609/aaai.v38i2.27826

Keywords:

CV: Biometrics, Face, Gesture & Pose, CV: Representation Learning for Vision

Abstract

High-resolution representation is essential for achieving good performance in human pose estimation models. To obtain such features, existing works utilize high-resolution input images or fine-grained image tokens. However, this dense high-resolution representation brings a significant computational burden. In this paper, we address the following question: "Only sparse human keypoint locations are detected for human pose estimation, is it really necessary to describe the whole image in a dense, high-resolution manner?" Based on dynamic transformer models, we propose a framework that only uses Sparse High-resolution Representations for human Pose estimation (SHaRPose). In detail, SHaRPose consists of two stages. At the coarse stage, the relations between image regions and keypoints are dynamically mined while a coarse estimation is generated. Then, a quality predictor is applied to decide whether the coarse estimation results should be refined. At the fine stage, SHaRPose builds sparse high-resolution representations only on the regions related to the keypoints and provides refined high-precision human pose estimations. Extensive experiments demonstrate the outstanding performance of the proposed method. Specifically, compared to the state-of-the-art method ViTPose, our model SHaRPose-Base achieves 77.4 AP (+0.5 AP) on the COCO validation set and 76.7 AP (+0.5 AP) on the COCO test-dev set, and runs inference 1.4x faster than ViTPose-Base. Code is available at https://github.com/AnxQ/sharpose.
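
The two-stage control flow described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation (see the repository linked above for that). All names here (`select_regions`, `estimate_pose`, `quality_threshold`, `keep_ratio`) are hypothetical, the three stage callables are placeholders for learned models, and the sketch assumes that the coarse stage yields one relevance score per image region and that the quality predictor scores the coarse pose directly.

```python
from typing import Callable, List, Sequence, Tuple

Pose = List[Tuple[float, float]]  # one (x, y) per keypoint


def select_regions(relevance: Sequence[float], keep_ratio: float) -> List[int]:
    """Return indices of the coarse regions most related to the keypoints.

    `relevance` holds one hypothetical score per coarse region (for example,
    how strongly keypoint tokens attend to that region).
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    k = max(1, round(len(relevance) * keep_ratio))
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:k])  # keep spatial order of the selected regions


def estimate_pose(
    image: object,
    coarse_stage: Callable[[object], Tuple[Pose, Sequence[float]]],
    quality_predictor: Callable[[Pose], float],
    fine_stage: Callable[[object, List[int]], Pose],
    quality_threshold: float = 0.5,  # assumed value, for illustration only
    keep_ratio: float = 0.25,        # assumed value, for illustration only
) -> Tuple[Pose, bool]:
    """Run the coarse stage, then refine only when predicted quality is low.

    Returns the pose and a flag telling whether the fine stage ran.
    """
    coarse_pose, relevance = coarse_stage(image)
    if quality_predictor(coarse_pose) >= quality_threshold:
        return coarse_pose, False  # coarse estimate accepted, refinement skipped
    regions = select_regions(relevance, keep_ratio)
    # Only the selected regions would be re-encoded at high resolution.
    return fine_stage(image, regions), True
```

In this sketch the saving comes from two places: images whose coarse estimate is judged good enough never reach the fine stage, and the fine stage only receives the subset of regions ranked as relevant to the keypoints.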

Published

2024-03-24

How to Cite

An, X., Zhao, L., Gong, C., Wang, N., Wang, D., & Yang, J. (2024). SHaRPose: Sparse High-Resolution Representation for Human Pose Estimation. Proceedings of the AAAI Conference on Artificial Intelligence, 38(2), 691-699. https://doi.org/10.1609/aaai.v38i2.27826

Section

AAAI Technical Track on Computer Vision I