
Ultra-FastNet: an end-to-end learnable network for multi-person posture prediction

Published in: The Journal of Supercomputing

Abstract

Current top-down approaches to multi-person pose estimation require a separate pedestrian detection algorithm. In this paper, we propose Ultra-FastNet, an end-to-end trainable human pose estimation network with three main components: a shape knowledge extractor, a corner prediction module, and a human body geometric knowledge encoder. First, the shape knowledge extractor is built from ultralightweight bottleneck blocks, which reduce network parameters while effectively learning high-resolution local representations of keypoints; a global attention module is incorporated into each bottleneck block to capture keypoint shape knowledge and build high-resolution features. Second, a Transformer-based human body geometric knowledge encoder models and discovers body geometric knowledge in the data. The network infers keypoints from both shape knowledge and body geometric knowledge, an approach we call knowledge enhancement. Finally, the corner prediction module recasts pedestrian detection as a keypoint detection task, so an end-to-end multitask network can perform multi-person pose estimation without incorporating a separate pedestrian detection algorithm. Experiments show that Ultra-FastNet achieves high accuracy on the COCO2017 and MPII datasets and outperforms mainstream lightweight networks.
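The three-component design described above can be summarized as a pipeline: shape features feed both the geometric knowledge encoder and the corner prediction module, and keypoints are inferred from the two knowledge sources jointly. The sketch below is a minimal structural illustration of that data flow, assuming hypothetical class and method names; it is not the authors' implementation.

```python
# Schematic sketch of the Ultra-FastNet pipeline from the abstract.
# All class names, methods, and return values are illustrative
# assumptions, not the paper's actual API.

class ShapeKnowledgeExtractor:
    """Ultralightweight bottleneck blocks with global attention;
    yields high-resolution local keypoint (shape) features."""
    def __call__(self, image):
        # Placeholder: a real extractor would return a feature map.
        return f"shape_features({image})"

class GeometricKnowledgeEncoder:
    """Transformer-based encoder that models body geometric knowledge."""
    def __call__(self, shape_features):
        return f"geometric_knowledge({shape_features})"

class CornerPredictionModule:
    """Recasts pedestrian detection as corner-keypoint detection,
    removing the need for a separate pedestrian detector."""
    def __call__(self, shape_features):
        return f"person_boxes({shape_features})"

class UltraFastNet:
    """End-to-end multitask pipeline: one forward pass yields both
    person boxes (via corners) and knowledge-enhanced keypoints."""
    def __init__(self):
        self.extractor = ShapeKnowledgeExtractor()
        self.encoder = GeometricKnowledgeEncoder()
        self.corners = CornerPredictionModule()

    def __call__(self, image):
        shape = self.extractor(image)
        geometry = self.encoder(shape)
        boxes = self.corners(shape)
        # Keypoints are inferred from BOTH knowledge sources
        # ("knowledge-enhanced" inference, in the paper's terms).
        keypoints = ("fuse", shape, geometry)
        return {"keypoints": keypoints, "boxes": boxes}

result = UltraFastNet()("image")
```

The point of the sketch is the topology: corner prediction and keypoint inference share the shape extractor's output, which is what makes the network trainable end to end without an external detector.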


The full article includes Algorithm 1 and Figs. 1–5.


Data availability

Data sharing is not applicable to this article, as no datasets were generated or analyzed during the current study.


Acknowledgements

This work was supported by the Natural Science Foundation of Fujian Province, China, under grant 2020J01082.

Author information

Authors and Affiliations

Authors

Contributions

  • T. Peng: ideas, programming, model validation (specifically performing the experiments), visualization, writing (original and final draft)
  • Y. Luo: overarching research goals, writing (review), supervision
  • Z. Ou: programming, model validation, writing (original draft), visualization
  • J. Du: computing resources, visualization, formal analysis
  • G. Lin: data curation, formal analysis

Corresponding author

Correspondence to Yanmin Luo.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Peng, T., Luo, Y., Ou, Z. et al. Ultra-FastNet: an end-to-end learnable network for multi-person posture prediction. J Supercomput 80, 26462–26482 (2024). https://doi.org/10.1007/s11227-024-06444-8
