ConvFormer: parameter reduction in transformer models for 3D human pose estimation by leveraging dynamic multi-headed convolutional attention

  • Original article
  • The Visual Computer

Abstract

Recently, fully transformer-based architectures have replaced the de facto convolutional architectures for the 3D human pose estimation task. In this paper, we propose ConvFormer, a novel convolutional transformer that leverages a new dynamic multi-headed convolutional self-attention mechanism for monocular 3D human pose estimation. We design a spatial and a temporal convolutional transformer to comprehensively model human joint relations within individual frames and globally across the motion sequence. Moreover, we introduce the novel notion of a temporal joints profile for our temporal ConvFormer, which fuses complete temporal information immediately for a local neighborhood of joint features. We quantitatively and qualitatively validate our method on three common benchmark datasets: Human3.6M, MPI-INF-3DHP, and HumanEva. Extensive experiments were conducted to identify the optimal hyper-parameter set. These experiments demonstrate that we achieve a significant parameter reduction relative to prior transformer models while attaining state-of-the-art (SOTA) or near-SOTA results on all three datasets. Additionally, we achieve SOTA for Protocol III on H36M for both GT and CPN detection inputs. Finally, we obtain SOTA on all three metrics for the MPI-INF-3DHP dataset and for all three subjects on HumanEva under Protocol II.
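
The abstract describes the attention mechanism only at a high level, so the sketch below is a minimal, hedged illustration (in PyTorch, not the authors' implementation) of a multi-headed convolutional self-attention block of this general kind: queries, keys, and values are produced by 1D convolutions along the token (joint or frame) axis instead of pointwise linear projections, so every projection already mixes in a local neighborhood of features. The class name, kernel size, and hyper-parameters are assumptions.

    import torch
    import torch.nn as nn


    class ConvMultiHeadAttention(nn.Module):
        """Illustrative multi-headed self-attention with convolutional projections."""

        def __init__(self, dim=256, heads=8, kernel_size=3):
            super().__init__()
            assert dim % heads == 0
            self.heads = heads
            self.scale = (dim // heads) ** -0.5
            # Conv1d projections mix each token with its local neighborhood.
            self.to_q = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
            self.to_k = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
            self.to_v = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
            self.proj = nn.Linear(dim, dim)

        def forward(self, x):                      # x: (batch, tokens, dim)
            b, n, d = x.shape
            h = self.heads
            xt = x.transpose(1, 2)                 # (batch, dim, tokens) for Conv1d
            q, k, v = self.to_q(xt), self.to_k(xt), self.to_v(xt)
            # Split channels into heads: (batch, heads, tokens, dim_head).
            q, k, v = (t.reshape(b, h, d // h, n).transpose(2, 3) for t in (q, k, v))
            attn = (q @ k.transpose(-2, -1) * self.scale).softmax(dim=-1)
            out = (attn @ v).transpose(2, 3).reshape(b, d, n).transpose(1, 2)
            return self.proj(out)


    # Example: 17 joint tokens, matching the Human3.6M skeleton used in the paper.
    tokens = torch.randn(2, 17, 256)
    print(ConvMultiHeadAttention()(tokens).shape)  # torch.Size([2, 17, 256])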


References

  1. Banik, S., García, A.M., Knoll, A.: 3D human pose regression using graph convolutional network. In: 2021 IEEE International Conference on Image Processing (ICIP), Anchorage, AK, USA, pp. 924–928 (2021)

  2. Cai, Y., Ge, L., Liu, J., Cai, J., Cham, T.-J., Yuan, J., Thalmann, N.M.: Exploiting spatial-temporal relationships for 3D pose estimation via graph convolutional networks. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2272–2281 (2019)

  3. Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. arXiv:2005.12872 (2020)

  4. Kaul, C., Mitton, J., Dai, H., Murray-Smith, R.: CpT: convolutional point transformer for 3D point cloud processing. arXiv:2111.10866 (2021)

  5. Chen, T., Fang, C., Shen, X., Zhu, Y., Chen, Z., Luo, J.: Anatomy-aware 3D human pose estimation with bone-based pose decomposition. IEEE Trans. Circuits Syst. Video Technol. (2021)

  6. Chen, Y., Wang, Z., Peng, X., Zhang, Z., Yu, G., Sun, J.: Cascaded pyramid network for multi-person pose estimation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7103–7112 (2018)

  7. Dabral, R., Mundhada, A., Kusupati, U., Afaque, S., Sharma, A., Jain, A.: Learning 3D human pose from structure and motion. In: European Conference on Computer Vision (ECCV) (2018)

  8. Diaz-Arias, A., Messmore, M., Shin, D., Baek, S.: On the role of depth predictions for 3D human pose estimation. arXiv:2103.02521 (2021)

  9. Fang, H.S., Xu, Y., Wang, W., Liu, X., Zhu, S.C.: Learning pose grammar to encode human body configuration for 3D pose estimation. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)

  10. He, Y., Yan, R., Fragkiadaki, K., Yu, S.: Epipolar transformers. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7776–7785 (2020)

  11. Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: large scale datasets and predictive methods for 3D human sensing in natural environments. IEEE Trans. Pattern Anal. Mach. Intell. 36(7), 1325–1339 (2013)

  12. Liu, J., Rojas, J., Li, Y., Liang, Z., Guan, Y., Xi, N., Zhu, H.: A graph attention spatio-temporal convolutional network for 3D human pose estimation in video. In: 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, pp. 3374–3380 (2021)

  13. Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1725–1732 (2014)

  14. Kitaev, N., Kaiser, L., Levskaya, A.: Reformer: the efficient transformer. In: Proceedings of International Conference on Learning Representations (ICLR) (2020)

  15. Kocabas, M., Karagoz, S., Akbas, E.: Self-supervised learning of 3D human pose using multi-view geometry. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1077–1086 (2019)

  16. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021)

  17. Li, W., Liu, H., Tang, H., Wang, P., Van Gool, L.: MHFormer: multi-hypothesis transformer for 3D human pose estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  18. Li, S., Ke, L., Pratama, K., Tai, Y.W., Tang, C.K., Cheng, K.T.: Cascaded deep monocular 3D human pose estimation with evolutionary training data. arXiv:2006.07778 (2020)

  19. Li, W., Liu, H., Ding, R., Liu, M., Wang, P., Yang, W.: Exploiting temporal contexts with strided transformer for 3D human pose estimation. IEEE Trans. Multimed. (2022)

  20. Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J., Stoica, I.: Tune: A Research Platform for Distributed Model Selection and Training. arXiv:1807.05118 (2018)

  21. Lin, J., Lee, G.: Trajectory space factorization for deep video-based 3D human pose estimation. In: British Machine Vision Conference (BMVC) (2019)

  22. Lin, K., Wang, L., Liu, Z.: End-to-end human pose and mesh reconstruction with transformers. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2021)

  23. Liu, R., Shen, J., Wang, H., Chen, C., Cheung, S.C., Asari, V.: Attention mechanism exploits temporal contexts: real-time 3D human pose reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5064–5073 (2020)

  24. Liu, Z., Luo, S., Li, W., Lu, J., Wu, Y., Li, C., Yang, L.: ConvTransformer: a convolutional transformer network for video frame synthesis. arXiv:2011.10185 (2020)

  25. Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2640–2649 (2017)

  26. Mehta, D., Rhodin, H., Casas, D., Fua, P., Sotnychenko, O., Xu, W., Theobalt, C.: Monocular 3D human pose estimation in the wild using improved CNN supervision. In: 2017 Fifth International Conference on 3D Vision (3DV) (2017)

  27. Michel, P., Levy, O., Neubig, G.: Are sixteen heads really better than one? In: Proceedings of the International Conference on Neural Information Processing Systems (NeurIPS) (2019)

  28. Panteleris, P., Argyros, A.: PE-former: pose estimation transformer. arXiv:2112.04981 (2021)

  29. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic Differentiation in PyTorch, NIPS 2017 Workshop on Autodiff (2017)

  30. Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Coarse-to-fine volumetric prediction for single-image 3D human pose. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7025–7034 (2017)

  31. Pavllo, D., Feichtenhofer, C., Grangier, D., Auli, M.: 3d human pose estimation in video with temporal convolutions and semi-supervised training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7753–7762 (2019)

  32. Rayat Imtiaz Hossain, M., Little, J.J.: Exploiting temporal information for 3D human pose estimation. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 68–84 (2018)

  33. Sarlin, P.-E., DeTone, D., Malisiewicz, T., Rabinovich, A.: SuperGlue: learning feature matching with graph neural networks. In: 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, pp. 4937–4946 (2020)

  34. Jaszczur, S., Chowdhery, A., Mohiuddin, A., Kaiser, Ł., Gajewski, W., Michalewski, H., Kanerva, J.: Sparse is enough in scaling transformers. In: Advances in Neural Information Processing Systems (NeurIPS) (2021)

  35. Shuai, H., Wu, L., Liu, Q.: Adaptively Multi-view and Temporal Fusing Transformer for 3D Human Pose Estimation. arXiv:2110.05092 (2021)

  36. Sigal, L., Balan, A.O., Black, M.J.: HumanEva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. Int. J. Comput. Vis. 87(1–2), 4–27 (2010)

  37. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(56), 1929–1958 (2014)

  38. Sun, X., Xiao, B., Liang, S., Wei, Y.: Integral human pose regression. arXiv:1711.08229 (2017)

  39. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers and distillation through attention. In: International Conference on Machine Learning (ICML) (2021)

  40. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L.U., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NIPS), pp. 5998–6008 (2017)

  41. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. In: International Conference on Learning Representations (ICLR) (2018)

  42. Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: European Conference on Computer Vision (ECCV), pp. 646–661 (2016)

  43. Wang, J., Yan, S., Xiong, Y., Lin, D.: Motion guided 3d pose estimation from videos. arXiv:2004.13985 (2020)

  44. Wu, Z., Liu, Z., Lin, J., Han, S.: Lite transformer with long-short range attention. In: International Conference on Learning Representations (ICLR) (2020)

  45. Wu, J., Hu, D., Xiang, F., Yuan, X., Su, J.: 3D human pose estimation by depth map. Vis. Comput. 36(7), 1401–1410 (2020)

  46. Yeh, R., Hu, Y.T., Schwing, A.: Chirality nets for human pose regression. In: International Conference on Neural Information Processing Systems (NeurIPS) (2019)

  47. Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z., Tay, F., Feng, J., Yan, S.: Tokens-to-token ViT: training vision transformers from scratch on ImageNet. In: IEEE International Conference on Computer Vision (ICCV), pp. 538–547 (2021)

  48. Zeng, A., Sun, X., Huang, F., Liu, M., Xu, Q., Lin, S.: SRNet: improving generalization in 3D human pose estimation with a split-and-recombine approach. In: European Conference on Computer Vision (ECCV) (2020)

  49. Zheng, C., Zhu, S., Mendieta, M., Yang, T., Cheng, C., Ding, Z.: 3D human pose estimation with spatial and temporal transformers. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV) (2021)

  50. Zhou, K., Han, X., Jiang, N., Jia, K., Lu, J.: HEMlets pose: learning part-centric heatmap triplets for accurate 3D human pose estimation. In: International Conference on Computer Vision (ICCV) (2019)

  51. Zhou, K., Han, X., Jiang, N., Jia, K., Lu, J.: HEMlets PoSh: learning part-centric heatmap triplets for 3D human pose and shape estimation. IEEE Trans. Pattern Anal. Mach. Intell. (2021)

  52. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929 (2016)

Author information

Corresponding author

Correspondence to Alec Diaz-Arias.

Ethics declarations

Conflict of interest

Both authors, Alec Diaz-Arias and Dmitriy Shin, declare that there is no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix A: Attention visualization

We provide in Fig. 4 additional visualizations of temporal attention heads for part of our ablation on the number of attention heads, with quantitative results reported in Table 3. We note that our 8-head model achieves the lowest MPJPE on H3.6M. Although the visualizations clearly show that, as the number of heads increases, redundancies with only subtle variations appear within the attention maps, we hypothesize that this redundancy acts as a noise-filtering mechanism by highlighting critical information. In the NLP landscape, an extensive analysis of BERT was conducted to understand the optimal number of heads and how heads can be pruned at test time without a substantial performance impact [27].
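
To make the redundancy observation concrete, the short sketch below (our illustration, not the analysis performed in the paper) measures how similar the per-head attention maps are by averaging their pairwise cosine similarity; a value near 1 indicates nearly identical, i.e. redundant, heads. The tensor layout is an assumption.

    import torch
    import torch.nn.functional as F

    def head_redundancy(attn):                    # attn: (heads, tokens, tokens)
        flat = attn.flatten(1)                    # one vector per head
        sim = F.cosine_similarity(flat.unsqueeze(1), flat.unsqueeze(0), dim=-1)
        h = sim.shape[0]
        # Mean off-diagonal similarity: 1.0 would mean all heads are identical.
        return (sim.sum() - sim.diagonal().sum()) / (h * (h - 1))

    # e.g. 8 temporal heads over a 143-frame sequence
    attn = torch.softmax(torch.randn(8, 143, 143), dim=-1)
    print(head_redundancy(attn).item())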

We also provide visualizations of all attention heads used in our model, for both the spatial and the temporal ConvFormer. We evaluate the attention heads for subject S9 from H3.6M on the Directions action. The spatial self-attention maps are shown in Fig. 5; their x-axis corresponds to the 17 joints of the H3.6M skeleton and their y-axis to the attention output. These maps correspond to the 143-frame model, so the x-axis of the temporal attention maps spans the 143 frames of the sequence, while the y-axis gives the attention at each frame. The attention heads return different attention magnitudes, which represent either spatial correlations or frame-wise global correlations learned from the temporal joints profiles.
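
For readers who want to reproduce maps like those in Fig. 5, the snippet below is a minimal plotting sketch; it assumes the spatial attention maps have shape (heads, 17, 17) over the Human3.6M joints and the temporal maps have shape (heads, 143, 143) over the input frames. The variable names, and how the maps are captured from the model, are assumptions.

    import matplotlib.pyplot as plt
    import numpy as np

    def plot_attention(maps, title, axis_label):  # maps: (heads, n, n) array
        heads = maps.shape[0]
        fig, axes = plt.subplots(1, heads, figsize=(3 * heads, 3))
        for i, ax in enumerate(np.atleast_1d(axes)):
            ax.imshow(maps[i], cmap="viridis")    # brighter = larger attention weight
            ax.set_title(f"head {i}")
            ax.set_xlabel(axis_label)
        fig.suptitle(title)
        plt.tight_layout()
        plt.show()

    # Random placeholders; replace with attention maps captured from the model.
    plot_attention(np.random.rand(4, 17, 17), "Spatial ConvFormer (S9, Directions)", "joint")
    plot_attention(np.random.rand(4, 143, 143), "Temporal ConvFormer (S9, Directions)", "frame")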

Fig. 4

Temporal attention for the 9-frame ConvFormer trained on H36M using CPN detections as input: (a) is our one-head model, (b) the 2-head model, (c) the 4-head model, and (d) the 8-head model, which achieves the lowest MPJPE

Fig. 5

Example attention maps: the top row shows the spatial ConvFormer and the bottom row the temporal ConvFormer for the 143-frame model trained on CPN detections for H3.6M. These maps were generated for S9 on the Directions action

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Diaz-Arias, A., Shin, D. ConvFormer: parameter reduction in transformer models for 3D human pose estimation by leveraging dynamic multi-headed convolutional attention. Vis Comput 40, 2555–2569 (2024). https://doi.org/10.1007/s00371-023-02936-5
