Abstract
Still image-based human action recognition is a highly sought-after but challenging field in computer vision; the challenge stems mainly from the limited information available in a single image. Efficient extraction of visual appearance features and other valuable cues from an image is therefore crucial for action recognition. To this end, on the one hand, we use a convolutional neural network (CNN) classifier built on the EfficientNetV2-S backbone as the main pathway for extracting appearance features and performing classification. To make the CNN classifier focus on important spatial features, we propose a residual spatial attention module (RSAM) and incorporate it into the classifier. In addition, we leverage transfer learning to improve the training speed and recognition accuracy of the CNN classifier. On the other hand, in an auxiliary pathway, we use the OpenPose algorithm to extract the coordinates of human key-points and perform feature extraction and classification on the obtained key-points with a purpose-built network. Finally, we use a one-dimensional convolution to merge the outputs of the two classifiers: it automatically learns a weight for each result and fuses them according to their importance. Experimental results on three challenging datasets, namely Stanford 40 Actions, People Playing Musical Instruments (PPMI), and MPII Human Pose, demonstrate the superiority of the proposed method.
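The fusion step described above can be sketched as follows. A one-dimensional convolution with kernel size 1 over the two stacked per-class score vectors reduces to a learned weighted sum of the two pathways for each class. This is a minimal illustrative sketch, not the paper's implementation; the weights, scores, and bias below are hypothetical placeholder values, and in the actual method the weights would be learned during training.

```python
import numpy as np

def fuse_scores(cnn_scores, keypoint_scores, w, b=0.0):
    """Fuse two per-class score vectors with a kernel-size-1 1D convolution.

    Stacking the two pathway outputs as channels of shape (2, num_classes),
    a Conv1d(in_channels=2, out_channels=1, kernel_size=1) computes, per class,
    w[0] * cnn_score + w[1] * keypoint_score + b. Here `w` plays the role of
    the learned convolution weights (illustrative values, not trained ones).
    """
    stacked = np.stack([cnn_scores, keypoint_scores])  # shape (2, num_classes)
    return w @ stacked + b                             # shape (num_classes,)

# Hypothetical class scores from the two pathways for a 3-class toy problem.
cnn = np.array([0.7, 0.2, 0.1])   # appearance (CNN) pathway
kpt = np.array([0.5, 0.4, 0.1])   # key-point (auxiliary) pathway
w = np.array([0.6, 0.4])          # learned pathway weights (placeholders)

fused = fuse_scores(cnn, kpt, w)  # [0.62, 0.28, 0.10] -> predicts class 0
```

Because the kernel size is 1, the convolution applies the same pair of pathway weights to every class, which is what lets the network learn how much to trust each pathway overall rather than per class.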
Contributions
XL contributed to conceptualization and methodology. HX was involved in methodology and manuscript writing and provided software. CY contributed to data interpretation and manuscript editing. XX was involved in validation and manuscript editing. ZL contributed to supervision and manuscript editing.
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
About this article
Cite this article
Lu, X., Xing, H., Ye, C. et al. A key-points-assisted network with transfer learning for precision human action recognition in still images. SIViP 18, 1561–1575 (2024). https://doi.org/10.1007/s11760-023-02862-y