Abstract
Still image-based human action recognition is a highly sought-after but challenging field in computer vision; the challenge stems mainly from the limited information available in a single image. Efficient extraction of visual appearance features and other valuable cues from an image is therefore crucial for action recognition. To this end, on the one hand, we use a convolutional neural network (CNN) classifier built on the EfficientNetV2-S backbone as the main pathway for extracting appearance features and performing classification. To make the CNN classifier focus on important spatial features, we propose a residual spatial attention module (RSAM) and incorporate it into the classifier. In addition, we leverage transfer learning to improve the training speed and recognition accuracy of the CNN classifier. On the other hand, in an auxiliary pathway, we use the OpenPose algorithm to extract the coordinates of human key-points and perform feature extraction and classification on the obtained key-points with a purpose-built network. Finally, we use a one-dimensional convolution to merge the outputs of the two classifiers: it automatically learns a weight for each result and fuses them according to their importance. Experimental results on three challenging datasets, namely Stanford 40 Actions, People Playing Musical Instruments (PPMI), and MPII Human Pose, demonstrate the superiority of the proposed method.
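The fusion step described above can be sketched as follows. A one-dimensional convolution with kernel size 1 over the two stacked per-class score vectors reduces to a learned weighted sum of the two pathways for each class. This is a minimal illustrative sketch, not the paper's implementation; the weights, scores, and bias below are hypothetical placeholder values, and in the actual method the weights would be learned during training.

```python
import numpy as np

def fuse_scores(cnn_scores, keypoint_scores, w, b=0.0):
    """Fuse two per-class score vectors with a kernel-size-1 1D convolution.

    Stacking the two pathway outputs as channels of shape (2, num_classes),
    a Conv1d(in_channels=2, out_channels=1, kernel_size=1) computes, per class,
    w[0] * cnn_score + w[1] * keypoint_score + b. Here `w` plays the role of
    the learned convolution weights (illustrative values, not trained ones).
    """
    stacked = np.stack([cnn_scores, keypoint_scores])  # shape (2, num_classes)
    return w @ stacked + b                             # shape (num_classes,)

# Hypothetical class scores from the two pathways for a 3-class toy problem.
cnn = np.array([0.7, 0.2, 0.1])   # appearance (CNN) pathway
kpt = np.array([0.5, 0.4, 0.1])   # key-point (auxiliary) pathway
w = np.array([0.6, 0.4])          # learned pathway weights (placeholders)

fused = fuse_scores(cnn, kpt, w)  # [0.62, 0.28, 0.10] -> predicts class 0
```

Because the kernel size is 1, the convolution applies the same pair of pathway weights to every class, which is what lets the network learn how much to trust each pathway overall rather than per class.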
Contributions
XL contributed to conceptualization and methodology. HX was involved in methodology and manuscript writing and provided software. CY contributed to data interpretation and manuscript editing. XX was involved in validation and manuscript editing. ZL contributed to supervision and manuscript editing.
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
About this article
Cite this article
Lu, X., Xing, H., Ye, C. et al. A key-points-assisted network with transfer learning for precision human action recognition in still images. SIViP 18, 1561–1575 (2024). https://doi.org/10.1007/s11760-023-02862-y