Dual Graph Networks for Pose Estimation in Crowded Scenes

International Journal of Computer Vision

Abstract

Pose estimation in crowded scenes is key to understanding human behavior in real-life applications. Most existing CNN-based pose estimation methods depend on the appearance of visible parts as cues to localize human joints. However, occlusion is common in crowded scenes, and invisible body parts have no valid features for joint localization. Introducing prior information about the human pose structure to infer the locations of occluded parts is a natural solution to this problem. In this paper, we argue that learning structural information from human joints alone is not enough to address human body variations and can be prone to overfitting. Viewing the human pose as a dual representation of joints and limbs, we propose a pose refinement network, termed the dual graph network (DGN), that jointly learns the structural information of body joints and limbs by incorporating cooperative constraints between two branches. Specifically, our DGN has two coupled graph convolutional network (GCN) branches that model the structural information of joints and limbs. Each stage in a branch is composed of a feature aggregator for inter-branch information fusion and a GCN module for intra-branch context extraction. In addition, to enhance the modeling capacity of the GCN, we design an adaptive GCN layer (AGL), embedded in the GCN module, that handles each pose instance based on its graph structure. We also propose heatmap-guided sampling, which leverages the features of body parts to provide rich visual features for inferring occluded parts. We perform extensive experiments on five challenging datasets to demonstrate the effectiveness of our DGN on pose estimation. On the CrowdPose dataset, our DGN improves performance from 67.9 to 72.4 mAP (a gain of 4.5 mAP) with the same CNN-based pose estimator and training strategy as OPEC-Net. This shows that, compared with OPEC-Net, which considers only joints, our DGN has a clear advantage because it models joints and limbs together. Meanwhile, our DGN is also helpful for pose estimation on general datasets (i.e., COCO and PoseTrack) with less occlusion and mutual interference, demonstrating the generalization power of DGN in refining human poses.
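The dual representation described in the abstract can be illustrated with a small sketch. The code below is not the authors' DGN: it uses a hypothetical five-joint toy skeleton, assumes the limb graph links two limbs whenever they share a joint, and applies a standard symmetrically normalized graph convolution in each branch. The adaptive GCN layer, the feature aggregator, and heatmap-guided sampling are not modeled, and all names (`LIMBS`, `joint_adjacency`, `limb_adjacency`, `incidence`, `gcn_layer`) are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy skeleton (not the keypoint layout of any dataset).
JOINT_NAMES = ["head", "neck", "l_shoulder", "l_elbow", "hip"]
# Each limb connects two joints, given as indices into JOINT_NAMES.
LIMBS = [(0, 1), (1, 2), (2, 3), (1, 4)]


def joint_adjacency(num_joints, limbs):
    """Joint graph: joints are nodes, limbs are edges."""
    adj = np.zeros((num_joints, num_joints))
    for a, b in limbs:
        adj[a, b] = adj[b, a] = 1.0
    return adj


def limb_adjacency(limbs):
    """Limb graph (assumed): limbs are nodes, linked when they share a joint."""
    n = len(limbs)
    adj = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            if set(limbs[i]) & set(limbs[j]):
                adj[i, j] = adj[j, i] = 1.0
    return adj


def incidence(num_joints, limbs):
    """Joint-by-limb incidence matrix, one way to map features across branches."""
    inc = np.zeros((num_joints, len(limbs)))
    for k, (a, b) in enumerate(limbs):
        inc[a, k] = inc[b, k] = 1.0
    return inc


def normalize(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 of a standard GCN."""
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(features, adj_norm, weight):
    """One graph convolution: aggregate neighbours, project, apply ReLU."""
    return np.maximum(adj_norm @ features @ weight, 0.0)


rng = np.random.default_rng(0)
A_joint = joint_adjacency(len(JOINT_NAMES), LIMBS)
A_limb = limb_adjacency(LIMBS)
B = incidence(len(JOINT_NAMES), LIMBS)

# Random stand-ins for per-joint visual features (8 channels).
joint_feat = rng.normal(size=(len(JOINT_NAMES), 8))
# Initialise each limb feature as the mean of its two endpoint joints.
limb_feat = (B.T @ joint_feat) / 2.0

# Random weights stand in for learned parameters.
W_joint = rng.normal(size=(8, 8))
W_limb = rng.normal(size=(8, 8))
joint_out = gcn_layer(joint_feat, normalize(A_joint), W_joint)
limb_out = gcn_layer(limb_feat, normalize(A_limb), W_limb)
print(joint_out.shape, limb_out.shape)
```

In this toy example the two branches run independently after the limb features are initialised from the joints; a cooperative design such as the one the abstract describes would additionally exchange features between branches at every stage.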

Data Availability

This manuscript develops its method on publicly available datasets: CrowdPose, OCPose, OCHuman, PoseTrack2017, and MS COCO. There is no specific data associated with this manuscript.

Acknowledgements

This work is supported by the National Key R&D Program of China (No. 2022ZD0160900), the National Natural Science Foundation of China (No. 62076119, No. 61921006), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.

Author information

Corresponding author

Correspondence to Limin Wang.

Additional information

Communicated by Xiaowei Zhou.

About this article

Cite this article

Tu, J., Wu, G. & Wang, L. Dual Graph Networks for Pose Estimation in Crowded Scenes. Int J Comput Vis 132, 633–653 (2024). https://doi.org/10.1007/s11263-023-01901-y

  • DOI: https://doi.org/10.1007/s11263-023-01901-y
