Action recognition based on adaptive region perception

  • Original Article
  • Published:
Neural Computing and Applications

Abstract

Temporal modelling remains challenging for action recognition in video. To alleviate this problem, this paper proposes a new video architecture, the Adaptive Region Perception (ARP) network, which combines short-range and long-range temporal information to perform effective action recognition. The core of ARP is the Movement and Spatial-Temporal (MST) module, which consists of two components: a movement-information branch and a spatial-temporal-information branch. MST uses frame differencing to extract short-range temporal information, and adaptive region sensing together with temporal convolution to extract long-range temporal information, so that the module covers the temporal extent of the entire video. Specifically, for local temporal information, we use the motion difference between successive frames to extract a fine-grained representation of motion, yielding short-range temporal cues. For global temporal information, we apply adaptive region awareness over the whole video to strengthen the model's spatio-temporal representation and model the result with temporal convolution to obtain long-range temporal cues. We insert the MST module into ResNet-50 to build the ARP network and evaluate it on the Something-Something V1, Something-Something V2 and Kinetics-400 datasets, where it achieves strong performance.
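To make the two-branch design concrete, the sketch below shows a minimal PyTorch-style block assuming only the description above: a short-range branch that differences features of successive frames, and a long-range branch that applies a simple per-channel gating (a stand-in for the paper's adaptive region awareness) followed by a temporal convolution. The module name, layer choices and fusion scheme are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a two-branch temporal block inspired by the MST description.
# NOT the authors' implementation: the gating below is a simple stand-in for
# "adaptive region awareness", and all layer sizes are illustrative.
import torch
import torch.nn as nn


class MSTSketch(nn.Module):
    def __init__(self, channels: int, num_frames: int):
        super().__init__()
        self.num_frames = num_frames
        # Short-range branch: transform per-frame features before differencing.
        self.short_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # Long-range branch: per-channel gate (assumed stand-in for adaptive
        # region awareness), then a temporal convolution across frames.
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        self.temporal_conv = nn.Conv1d(channels, channels, kernel_size=3,
                                       padding=1, groups=channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * num_frames, channels, height, width), as in TSN-style 2D backbones.
        nt, c, h, w = x.shape
        t = self.num_frames
        n = nt // t

        # --- Short-range branch: difference between successive frames ---
        feat = self.short_conv(x).view(n, t, c, h, w)
        diff = feat[:, 1:] - feat[:, :-1]                       # (n, t-1, c, h, w)
        diff = torch.cat([diff, diff.new_zeros(n, 1, c, h, w)], dim=1)
        short = diff.reshape(nt, c, h, w)

        # --- Long-range branch: adaptive gating + temporal convolution ---
        gated = x * self.gate(x)                                # (nt, c, h, w)
        g = gated.view(n, t, c, h, w).mean(dim=[3, 4])          # (n, t, c)
        g = self.temporal_conv(g.permute(0, 2, 1))              # (n, c, t)
        long_range = g.permute(0, 2, 1).reshape(nt, c, 1, 1)

        # Residual-style fusion of both branches with the input.
        return x + short + x * long_range
```

A block like this could be wrapped around the residual bottlenecks of a 2D ResNet-50, with frames of the same clip stacked contiguously along the batch dimension, which is one plausible reading of how the MST module is inserted into the backbone.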


Data availability

The data that support the findings of this study are not openly available due to site restrictions and are available from the corresponding author upon reasonable request (https://20bn.com/datasets/something-something/).


Acknowledgements

This work is supported by the Hubei Technology Innovation Project (2019AAA045), the National Natural Science Foundation of China (62171328), the National Natural Science Foundation of China (62171327) and the Graduate Innovative Fund of Wuhan Institute of Technology (CX2021276).

Author information

Corresponding author

Correspondence to Tongwei Lu.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Lu, T., Yang, Q., Min, F. et al. Action recognition based on adaptive region perception. Neural Comput & Applic 36, 943–959 (2024). https://doi.org/10.1007/s00521-023-09069-9

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00521-023-09069-9
