A key-points-assisted network with transfer learning for precision human action recognition in still images

  • Original Paper
  • Published in Signal, Image and Video Processing

Abstract

Human action recognition in still images is a highly sought-after but challenging task in computer vision; the challenge stems mainly from the limited information available in a single image. Efficient extraction of visual appearance features and other valuable image information is therefore crucial for action recognition. To this end, on the one hand, we use a convolutional neural network (CNN) classifier based on the EfficientNetV2-S network as the main pathway for extracting appearance features from images and classifying them. To make the CNN classifier focus on important spatial features, we propose a residual spatial attention module (RSAM) and incorporate it into the classifier, and we leverage transfer learning to improve its training speed and recognition precision. On the other hand, in an auxiliary pathway, we use the OpenPose algorithm to extract the coordinates of human key-points and classify them with a network of our own design. Finally, a one-dimensional convolution merges the outputs of the two classifiers; it automatically learns a weight for each output and fuses them according to their importance. Experimental results on three challenging datasets, namely Stanford 40 Actions, People Playing Musical Instruments (PPMI) and MPII Human Pose, demonstrate the superiority of the proposed method.
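To make the two-pathway design concrete, here is a minimal sketch of the architecture the abstract describes, written in PyTorch. Everything specific in it is an assumption for illustration: the CBAM-style spatial attention standing in for RSAM, the tiny convolutional stack standing in for the transfer-learned EfficientNetV2-S backbone, the layer sizes, and the 18-key-point (COCO-format OpenPose) input. Only the overall wiring, two score vectors fused by a learned one-dimensional convolution, follows the abstract.

```python
# Minimal sketch of the two-pathway design described in the abstract.
# All layer sizes and the exact RSAM form are illustrative assumptions;
# the paper's main pathway is a transfer-learned EfficientNetV2-S.
import torch
import torch.nn as nn


class ResidualSpatialAttention(nn.Module):
    """Hypothetical stand-in for RSAM: CBAM-style spatial attention
    with a residual connection around the re-weighted features."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        # Input has 2 channels: per-pixel mean and max over feature channels.
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)
        mx, _ = x.max(dim=1, keepdim=True)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x + x * attn  # residual: original features plus attended ones


class TwoPathwayClassifier(nn.Module):
    def __init__(self, num_classes: int, num_keypoints: int = 18):
        super().__init__()
        # Main pathway: a tiny conv stack stands in for EfficientNetV2-S.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            ResidualSpatialAttention(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.cnn_head = nn.Linear(32, num_classes)
        # Auxiliary pathway: classify OpenPose (x, y) key-point coordinates.
        self.kp_net = nn.Sequential(
            nn.Linear(num_keypoints * 2, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )
        # Fusion: a 1-D convolution with kernel size 1 learns one weight
        # per pathway and merges the two score vectors accordingly.
        self.fuse = nn.Conv1d(2, 1, kernel_size=1)

    def forward(self, image: torch.Tensor, keypoints: torch.Tensor) -> torch.Tensor:
        s_cnn = self.cnn_head(self.backbone(image))   # (B, num_classes)
        s_kp = self.kp_net(keypoints.flatten(1))      # (B, num_classes)
        stacked = torch.stack([s_cnn, s_kp], dim=1)   # (B, 2, num_classes)
        return self.fuse(stacked).squeeze(1)          # (B, num_classes)


# Smoke test with random inputs (batch of 2, 40 action classes).
model = TwoPathwayClassifier(num_classes=40)
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 18, 2))
print(logits.shape)  # torch.Size([2, 40])
```

The kernel-size-1 Conv1d over the stacked scores is one plausible reading of "automatically learn the weights of these two results": it learns a scalar weight (plus bias) per pathway, shared across all classes.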


Data availability

Stanford40: http://vision.stanford.edu/Datasets/40actions.html
PPMI: http://ai.stanford.edu/~bangpeng/ppmi.html
MPII: http://human-pose.mpi-inf.mpg.de/#download


Author information


Contributions

XL contributed to conceptualization and methodology. HX was involved in methodology and manuscript writing and provided software. CY contributed to data interpretation and manuscript editing. XX was involved in validation and manuscript editing. ZL contributed to supervision and manuscript editing.

Corresponding author

Correspondence to Hao Xing.

Ethics declarations

Conflict of interest

All authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Lu, X., Xing, H., Ye, C. et al. A key-points-assisted network with transfer learning for precision human action recognition in still images. SIViP 18, 1561–1575 (2024). https://doi.org/10.1007/s11760-023-02862-y

