Action recognition based on adaptive region perception

  • Original Article
  • Published:
Neural Computing and Applications

Abstract

Temporal modelling remains challenging for action recognition in video. To alleviate this problem, this paper proposes a new video architecture, the Adaptive Region Perception (ARP) network, which combines short-range and long-range temporal information to perform effective action recognition. The core of ARP is the Movement and Spatial-Temporal (MST) module, which consists of two components: a movement-information branch and a spatial-temporal-information branch. MST uses frame differencing to extract short-range temporal information, and adaptive region sensing together with temporal convolution to extract long-range temporal information, so that the module covers the temporal extent of the entire video. Specifically, for local temporal information, we use the motion difference between successive frames to extract a fine-grained representation of motion, yielding short-range temporal cues. For global temporal information, we apply adaptive region awareness over the whole video to strengthen the model's spatio-temporal representation and model the result with temporal convolution to obtain long-range temporal cues. We insert the MST module into ResNet-50 to build the ARP network and evaluate it on the Something-Something V1, Something-Something V2 and Kinetics-400 datasets, where it achieves strong performance.
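To make the two-branch design concrete, the sketch below shows a minimal PyTorch-style block assuming only the description above: a short-range branch that differences features of successive frames, and a long-range branch that applies a simple per-channel gating (a stand-in for the paper's adaptive region awareness) followed by a temporal convolution. The module name, layer choices and fusion scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a two-branch temporal block inspired by the MST description.
# NOT the authors' implementation: the gating below is a simple stand-in for
# "adaptive region awareness", and all layer sizes are illustrative.
import torch
import torch.nn as nn


class MSTSketch(nn.Module):
    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        self.num_frames = num_frames
        # Short-range branch: transform per-frame features before differencing.
        self.short_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Long-range branch: per-channel gate (assumed stand-in for adaptive
        # region awareness), then a temporal convolution across frames.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size=3,
                                       padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * num_frames, channels, height, width), as in TSN-style 2D backbones.
        nt, c, h, w = x.shape
        t = self.num_frames
        n = nt // t

        # --- Short-range branch: difference between successive frames ---
        feat = self.short_conv(x).view(n, t, c, h, w)
        diff = feat[:, 1:] - feat[:, :-1]                       # (n, t-1, c, h, w)
        diff = torch.cat([diff, diff.new_zeros(n, 1, c, h, w)], dim=1)
        short = diff.reshape(nt, c, h, w)

        # --- Long-range branch: adaptive gating + temporal convolution ---
        gated = x * self.gate(x)                                # (nt, c, h, w)
        g = gated.view(n, t, c, h, w).mean(dim=[3, 4])          # (n, t, c)
        g = self.temporal_conv(g.permute(0, 2, 1))              # (n, c, t)
        long_range = g.permute(0, 2, 1).reshape(nt, c, 1, 1)

        # Residual-style fusion of both branches with the input.
        return x + short + x * long_range
```

A block like this could be wrapped around the residual bottlenecks of a 2D ResNet-50, with frames of the same clip stacked contiguously along the batch dimension, which is one plausible reading of how the MST module is inserted into the backbone.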


Data availability

The data that support the findings of this study are not openly available due to site restrictions and are available from the corresponding author upon reasonable request (https://20bn.com/datasets/something-something/).


Acknowledgements

This work is supported by the Hubei Technology Innovation Project (2019AAA045), the National Natural Science Foundation of China (62171328), the National Natural Science Foundation of China (62171327) and the Graduate Innovative Fund of Wuhan Institute of Technology (CX2021276).

Author information

Corresponding author

Correspondence to Tongwei Lu.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lu, T., Yang, Q., Min, F. et al. Action recognition based on adaptive region perception. Neural Comput & Applic 36, 943–959 (2024). https://doi.org/10.1007/s00521-023-09069-9

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00521-023-09069-9
