A deep learning model based on sequential object feature accumulation for sport activity recognition

Ryu, Kwanghyun; Kim, Heechan; Lee, Soowon

doi:10.1007/s11042-023-15022-8

Published: 22 March 2023

Volume 82, pages 37387–37406, (2023)

Multimedia Tools and Applications

Abstract

Activity recognition is the problem of recognizing which activities occur in a video. An activity consists of the spatial movements of objects over time. Therefore, to recognize the activity in a video, it is important to understand the object information and how it changes over time. To capture the temporal and spatial structure of an activity, we propose an end-to-end activity recognition model based on sequential object feature accumulation. Sequential object feature accumulation, the method presented in this paper, integrates the features extracted from each block of a deep neural network and refines them to classify the activity and the characteristics of the sub-objects constituting the activity. The proposed model consists of two sub-units: an object feature extraction unit, which identifies the objects of a specific activity in an input video, and an activity feature extraction unit, which identifies changes in the objects' movements over time. We used the Major League Baseball (MLB) YouTube dataset from the sports domain to evaluate the proposed model. In experiments, the proposed model achieved a higher score than existing sports activity recognition models.
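
The abstract gives only a high-level picture of the two sub-units, so the following is a rough, illustrative sketch of the general idea rather than the authors' implementation. It assumes that each frame yields one feature map per network block, that "accumulation" can be approximated by average-pooling each block's map and concatenating the results, and that change over time can be summarized by the mean absolute frame-to-frame difference. All function names and the data layout are hypothetical, and the actual model uses learned layers rather than these fixed operations.

```python
from typing import List, Sequence

Vector = List[float]
# Hypothetical layout: one feature map is a list of channels,
# and each channel is a flat list of spatial activations.
FeatureMap = List[List[float]]


def pool_block(feature_map: FeatureMap) -> Vector:
    """Average-pool one block's feature map to a single value per channel."""
    return [sum(channel) / len(channel) for channel in feature_map]


def accumulate_object_features(block_maps: Sequence[FeatureMap]) -> Vector:
    """Concatenate the pooled features of every block, in block order.

    This stands in for the object feature extraction unit (an assumption).
    """
    accumulated: Vector = []
    for feature_map in block_maps:
        accumulated.extend(pool_block(feature_map))
    return accumulated


def activity_features(frame_features: Sequence[Vector]) -> Vector:
    """Summarize change over time as the mean absolute frame-to-frame difference.

    This stands in for the activity feature extraction unit (an assumption).
    """
    if len(frame_features) < 2:
        raise ValueError("need at least two frames to measure change")
    dim = len(frame_features[0])
    steps = len(frame_features) - 1
    return [
        sum(abs(frame_features[t + 1][i] - frame_features[t][i]) for t in range(steps)) / steps
        for i in range(dim)
    ]


# Toy video: two frames, each with feature maps from two blocks.
frame0 = [[[1.0, 3.0]], [[0.0, 0.0], [4.0, 4.0]]]
frame1 = [[[3.0, 5.0]], [[1.0, 1.0], [4.0, 4.0]]]

per_frame = [accumulate_object_features(frame) for frame in (frame0, frame1)]
motion = activity_features(per_frame)
print(per_frame, motion)
```

In this toy setup the per-frame vector grows by one entry per channel of each block, and the motion summary has the same length. A classifier would then operate on the motion summary, alone or combined with the per-frame vectors.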

Data availability

The MLB YouTube dataset is available from Piergiovanni and Ryoo (2018).

References

Afrasiabi M, Khotanlou H, Mansoorizadeh M (2020) DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features. Vis Comput 36:1127–1139. https://doi.org/10.1007/s00371-019-01722-6
Atto A, Benoit A, Lambert P (2020) Timed-image based deep learning for action recognition in video sequences. Pattern Recogn 104:107353. https://doi.org/10.1016/j.patcog.2020.107353
Bagautdinov T, Alahi A, Fleuret F, Fua P, Savarese S (2017) Social scene understanding: end-to-end multi-person action localization and collective activity recognition. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp. 4315-4324
Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp.6299-6308
Deliege A, Cioppa GS, Seilvandi MJ, Dueholm JV, Nasrollahi K, Ghanem B, Moeslund TB, Droogenbroeck MV (2021) SoccerNet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 4508–4519
Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp. 2625-2634
Du W, Wang Y, Qiao Y (2018) Recurrent spatial-temporal attention network for action recognition in videos. IEEE Trans Image Process 27:3–1360. https://doi.org/10.1109/TIP.2017.2778563
Feichtenhofer C, Fan H, Malik J, He K (2019) Slowfast networks for video recognition. In proceedings of the IEEE/CVF international conference on computer vision (ICCV). pp. 6202-6211
Gammulle H, Denman S, Sridharan S, Fookes C (2018) Multi-level sequence GAN for group activity recognition. In proceedings of the Asian conference on computer vision (ACCV). pp. 331-346. https://doi.org/10.1007/978-3-030-20887-5_21
Giancola S, Amine M, Dghaily T, Ghanem B (2018) SoccerNet: a scalable dataset for action spotting in soccer videos. Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 1711-1721
Girdhar R, Carreira J, Doersch C, Zisserman A (2019) Video action transformer network. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 244-253
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In proceedings of the advances in neural information processing systems 27, Montréal, Canada
Gu X, Xue X, Wang F (2020) Fine-grained action recognition on a novel basketball dataset. ICASSP IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 2563-2567. https://doi.org/10.1109/ICASSP40776.2020.9053928
He K, Zhang X, Ren S, Sun J (2016) Identity mappings in deep residual networks. In proceedings of the European conference in computer vision, Amsterdam, the Netherlands
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Hussain T, Muhammad K, Ullah A, Zehong C, Baik S, Albuquerque V (2020) Cloud-assisted multiview video summarization using CNN and bidirectional LSTM. IEEE Trans Industr Inform 16(1):77–86. https://doi.org/10.1109/TII.2019.2929228
Jones ML, Levy K (2018) Sporting chances: robot referees and the automation of enforcement. We robot. Retrieved from https://ssrn.com/abstract=3293076
Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1725-1732
Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P, Suleyman M, Zisserman A (2017) The kinetics human action video dataset. arXiv preprint arxiv:1705.06950. https://doi.org/10.48550/arXiv.1705.06950
Khan S, Haq I, Rho S, Baik S, Lee M (2019) Cover the violence: a novel deep-learning-based approach towards violence-detection in movies. Appl Sci 9(22):4963. https://doi.org/10.3390/app9224963
Khowaja SA, Yahya BN, Lee SL (2020) CAPHAR: context-aware personalized human activity recognition using associated learning in smart environments. Human-centric Comput Inform Sci 10:35. https://doi.org/10.1186/s13673-020-00240-y
Kim H, Lee S (2021) A video captioning method based on multi-representation switching for sustainable computing. Sustainability 13:2250. https://doi.org/10.3390/su13042250
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. In proceedings of the international conference on learning representations, San Diego, CA, USA
Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In proceedings of the international conference on computer vision (ICCV), pp. 2556-2563 https://doi.org/10.1109/ICCV.2011.6126543
Liu S, Ma X, Wu H, Li Y (2020) An end to end framework with adaptive spatio-temporal attention module for human action recognition. IEEE Access 8:47220–47231. https://doi.org/10.1109/ACCESS.2020.2979549
Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In proceedings of the international conference on machine learning 27, Haifa, Israel.
Perše M, Kristan M, Perš J, Mušič G, Vučkovič G, Kovačič S (2010) Analysis of multi-agent activity using Petri nets. Pattern Recogn 43(4):1491–1501. https://doi.org/10.1016/j.patcog.2009.11.011
Piergiovanni AJ, Ryoo MS (2018) Fine-grained activity recognition in baseball videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 1740-1748
Piergiovanni AJ, Ryoo MS (2019) Representation flow for action recognition. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 9945-9953
Piergiovanni AJ, Fan C, Ryoo M (2017) Learning latent subevents in activity using temporal attention filters. Thirty-First AAAI Conference on Artificial Intelligence 31:1. https://ojs.aaai.org/index.php/AAAI/article/view/11240
Qi S, Ning X, Yang G, Zhang L, Long P, Cai W (2021) Review of multi-view 3D object recognition methods based on deep learning. Displays, 69, 102053. https://doi.org/10.1016/j.displa.2021.102053
Rahmad NA, As’ari MA, Ghazali NF, Sufri NAJ (2018) A survey of video based action recognition in sports. Indonesian Journal of Electrical Engineering and Computer Science 987–993. https://doi.org/10.11591/ijeecs.v11.i3.pp987-993
Ren Q (2021) A video expression recognition method based on multi-mode convolution neural network and multiplicative feature fusion. J Inform Proc Syst 17(3):556–570. https://doi.org/10.3745/JIPS.02.0156
Robertson MR (2015) 500 hours of video uploaded to YouTube every minute. Tubular insights. Retrieved from https://tubularinsights.com/hours-minute-uploaded-youtube/
Shih H (2017) A survey of content-aware video analysis for sports. IEEE Trans Circ Syst Video Technol 28:1212–1231. https://doi.org/10.1109/TCSVT.2017.2655624
Shim M, Kim YH, Kim K, Kim SJ (2018) Teaching machines to understand baseball games: large-scale baseball video database for multiple video understanding tasks. Proceedings of the European conference on computer vision (ECCV). pp. 404-420
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In proceedings of the advances in neural information processing systems 27, Montréal, Canada
Singh R, Sonawane A, Srivastava R (2020) Recent evolution of modern datasets for human activity recognition: a deep survey. Multimedia Systems 26:83–106. https://doi.org/10.1007/s00530-019-00635-7
Soomro K, Zamir A, Shah M (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arxiv:1212.0402. https://doi.org/10.48550/arXiv.1212.0402
Sun B, Kong D, Wang S, Li J, Yin B, Luo X (2022) GAN for vision, KG for relation: a two-stage network for zero-shot action recognition. Pattern Recogn 126:108563. https://doi.org/10.1016/j.patcog.2022.108563
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In proceedings of the IEEE international conference on computer vision (ICCV), pp. 4489-4497
Tu H, Xu R, Chi R, Peng Y (2021) Multiperson interactive activity recognition based on interaction relation model. J Math 2021:5576369. https://doi.org/10.1155/2021/5576369
Wang X, Girshick R, Gupta A, He K (2018) Non-local neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp.7794-7803
Wang C, Wang X, Zhang J, Zhang L, Bai X, Ning X, Zhou J, Hancock E (2022) Uncertainty estimation for stereo matching based on evidential deep learning. Pattern Recogn 124:108498. https://doi.org/10.1016/j.patcog.2021.108498
Wemlinger Z, Holder L (2018) Cross-environment activity recognition using a shared semantic vocabulary. Pervasive Mob Comput 51:150–159. https://doi.org/10.1016/j.pmcj.2018.10.004
Woo S, Park J, Lee J, Kweon I (2018) CBAM: convolutional block attention module. In proceedings of the European conference on computer vision (ECCV), pp. 3-19
Yoon D, Cho N, Lee S (2020) A novel online action detection framework from untrimmed video streams. Pattern Recogn 106:107396. https://doi.org/10.1016/j.patcog.2020.107396
Zhou X (2021) Video expression recognition method based on spatiotemporal recurrent neural network and feature fusion. J Inform Proc Syst 17(2):337–351. https://doi.org/10.3745/JIPS.01.0067

Code availability

The code used in this study is available from the authors upon reasonable request.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the Innovative Human Resource Development for Local Intellectualization support program (IITP-2022-RS-2022-00156360) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation).

Author information

Authors and Affiliations

Department of Software Convergence, Soongsil University, Seoul, 06978, South Korea
Kwanghyun Ryu & Heechan Kim
School of Software, Soongsil University, Seoul, 06978, South Korea
Soowon Lee

Contributions

Conceptualization, Kwanghyun Ryu and Soowon Lee; Methodology, Kwanghyun Ryu; Software, Kwanghyun Ryu and Heechan Kim; Validation, Kwanghyun Ryu and Soowon Lee; Investigation, Kwanghyun Ryu; Resources, Soowon Lee; Writing - Original Draft Preparation, Kwanghyun Ryu; Writing - Review & Editing, Heechan Kim and Soowon Lee; Visualization, Kwanghyun Ryu; Supervision, Soowon Lee; Project Administration, Soowon Lee; Funding Acquisition, Soowon Lee. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Soowon Lee.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ryu, K., Kim, H. & Lee, S. A deep learning model based on sequential object feature accumulation for sport activity recognition. Multimed Tools Appl 82, 37387–37406 (2023). https://doi.org/10.1007/s11042-023-15022-8

Received: 16 February 2022
Revised: 22 June 2022
Accepted: 27 February 2023
Published: 22 March 2023
Issue Date: October 2023
DOI: https://doi.org/10.1007/s11042-023-15022-8
