Abstract
When estimating the pose of a person from a partial image, we humans do not confine our estimate to the visible area; we can readily localize keypoints outside the image by referring to visual cues such as body size. Computational methods for human pose estimation, however, ignore such keypoints and focus only on the bounded area of the given image. In this paper, we propose a neural network and a data augmentation method that extend the range of human pose estimation beyond the bounding box. The Position Puzzle Network expands the spatial range of keypoint localization by refining the position and size of the target's bounding box, while Position Puzzle Augmentation enables the keypoint detector to estimate keypoints not only within but also beyond the input image. On a cropped-image dataset prepared for proper evaluation, we show that the proposed method improves the baseline keypoint detectors by 39.5% in mAP and 30.5% in mAR on average by enabling the localization of keypoints outside the bounding box. Additionally, we verify that the proposed method does not degrade performance on the original benchmarks and instead improves it by alleviating false-positive errors.
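To make the augmentation idea concrete: training a detector to localize keypoints beyond the image border requires crops whose ground-truth keypoints are allowed to fall outside the cropped region. The sketch below is not the paper's implementation; it is a minimal, hypothetical illustration (function name and coordinate conventions are my own) of how one might crop an image while keeping out-of-crop keypoints as supervision targets, expressed in the crop's coordinate frame rather than clipped away.

```python
import numpy as np

def crop_with_out_of_box_targets(image, keypoints, crop_box):
    """Crop `image` to `crop_box` = (x0, y0, x1, y1), but keep ALL keypoints
    as supervision targets re-expressed in the crop's coordinate frame.

    Keypoints outside the crop get coordinates < 0 or >= crop size; a detector
    trained on an extended target range can still learn to regress them.
    `keypoints` is an (N, 2) float array of (x, y) pixel positions.
    """
    x0, y0, x1, y1 = crop_box
    cropped = image[y0:y1, x0:x1]
    # Shift keypoints into the crop frame; deliberately do NOT clip them,
    # so targets beyond the crop border remain usable for training.
    shifted = keypoints - np.array([x0, y0], dtype=keypoints.dtype)
    size = np.array([x1 - x0, y1 - y0], dtype=keypoints.dtype)
    inside = ((shifted >= 0) & (shifted < size)).all(axis=1)
    return cropped, shifted, inside
```

A keypoint at (10, 10) under a (20, 20, 80, 80) crop becomes (-10, -10): a valid target outside the crop rather than a discarded annotation, which is the property the augmentation relies on.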
Acknowledgements
This research was supported by the Ministry of Culture, Sports and Tourism and the Korea Creative Content Agency (Project Number: R2020070002).
Ethics declarations
Conflict of interest
The authors have no relevant financial or non-financial interests to disclose.
Supplementary Information
The supplementary materials contain additional explanations and visualizations. (10,674 KB)
About this article
Cite this article
Park, S., Park, J. Position Puzzle Network and Augmentation: localizing human keypoints beyond the bounding box. Machine Vision and Applications 34, 129 (2023). https://doi.org/10.1007/s00138-023-01471-6