Abstract
Current top-down approaches to multi-person pose estimation require a separate pedestrian detection algorithm. In this paper, we propose an end-to-end trainable human pose estimation network named Ultra-FastNet, which has three main components: a shape knowledge extractor, a corner prediction module, and a human body geometric knowledge encoder. First, the shape knowledge extractor is built from ultralightweight bottleneck modules, which reduce the number of network parameters while learning high-resolution local representations of keypoints; a global attention module is introduced into the ultralightweight bottleneck block to capture keypoint shape knowledge and build high-resolution features. Second, the human body geometric knowledge encoder, built on a Transformer, models and discovers body geometric knowledge in the data. The network infers keypoints from both shape knowledge and body geometric knowledge, a combination we refer to as knowledge enhancement. Finally, the corner prediction module models pedestrian detection as a keypoint detection task, so multi-person pose estimation can be performed by a single end-to-end multitask network without an additional pedestrian detector. Experiments show that Ultra-FastNet achieves high accuracy on the COCO2017 and MPII datasets and outperforms mainstream lightweight networks.
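To make the data flow concrete, the sketch below chains the three components in the order described above: a stack of lightweight bottleneck blocks as the shape knowledge extractor, a Transformer encoder as the geometric knowledge encoder, and two convolutional heads that emit keypoint heatmaps and the corner heatmaps used in place of a separate person detector. It is a minimal PyTorch illustration under stated assumptions, not the authors' implementation: the module names (UltraFastNetSketch, LightweightBottleneck), channel sizes, the omission of the global attention module, and the way the heads are attached are all hypothetical.

# A minimal PyTorch sketch of the three-stage pipeline described in the abstract.
# All module names, channel sizes, and head designs are illustrative assumptions,
# not the authors' implementation (the global attention module is omitted).
import torch
import torch.nn as nn

class LightweightBottleneck(nn.Module):
    """Depthwise-separable residual block standing in for the ultralightweight bottleneck."""
    def __init__(self, channels):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, groups=channels, bias=False),
            nn.BatchNorm2d(channels),
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.block(x)  # residual connection preserves high-resolution detail

class UltraFastNetSketch(nn.Module):
    def __init__(self, num_keypoints=17, channels=64):
        super().__init__()
        # 1) Shape knowledge extractor: stacked lightweight bottlenecks on a downsampled feature map.
        self.stem = nn.Conv2d(3, channels, 7, stride=4, padding=3)
        self.extractor = nn.Sequential(*[LightweightBottleneck(channels) for _ in range(4)])
        # 2) Human body geometric knowledge encoder: Transformer over the flattened feature map.
        layer = nn.TransformerEncoderLayer(d_model=channels, nhead=8, batch_first=True)
        self.geom_encoder = nn.TransformerEncoder(layer, num_layers=2)
        # 3) Heads: per-keypoint heatmaps plus corner heatmaps that replace a separate person detector.
        self.keypoint_head = nn.Conv2d(channels, num_keypoints, 1)
        self.corner_head = nn.Conv2d(channels, 2, 1)  # top-left / bottom-right person-box corners

    def forward(self, x):
        feat = self.extractor(self.stem(x))        # B x C x H x W shape features
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)   # B x HW x C tokens
        tokens = self.geom_encoder(tokens)         # inject geometric (keypoint-relation) context
        feat = tokens.transpose(1, 2).reshape(b, c, h, w)
        return self.keypoint_head(feat), self.corner_head(feat)

if __name__ == "__main__":
    model = UltraFastNetSketch()
    kps, corners = model(torch.randn(1, 3, 256, 192))
    print(kps.shape, corners.shape)  # (1, 17, 64, 48) and (1, 2, 64, 48)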
Data availability
Data sharing not applicable to this article as no datasets were generated or analyzed during the current study.
Acknowledgements
This work was supported by the Natural Science Foundation of Fujian Province, China, under grant 2020J01082.
Author information
Contributions
T. Peng: ideas, programming, model validation (specifically performing the experiments), visualization, writing - original and final draft. Y. Luo: overarching research goals, writing - review, supervision. Z. Ou: programming, model validation, writing - original draft, visualization. J. Du: computing resources, visualization, formal analysis. G. Lin: data curation, formal analysis.
Ethics declarations
Conflict of interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Peng, T., Luo, Y., Ou, Z. et al. Ultra-FastNet: an end-to-end learnable network for multi-person posture prediction. J Supercomput 80, 26462–26482 (2024). https://doi.org/10.1007/s11227-024-06444-8
DOI: https://doi.org/10.1007/s11227-024-06444-8