Abstract
Activity recognition is the problem of identifying which activities occur in a video. An activity consists of the spatial movements of objects over time. Recognizing an activity therefore requires understanding both the objects in the video and how they change over time. To capture the spatial and temporal structure of an activity, we propose an end-to-end activity recognition model based on sequential object feature accumulation. Sequential object feature accumulation, the method presented in this paper, integrates the features extracted from each block of a deep neural network and refines them to classify the activity and the characteristics of the sub-objects constituting it. The proposed model consists of two sub-units: an object feature extraction unit, which identifies the objects involved in a specific activity from an input video, and an activity feature extraction unit, which identifies changes in object movements over time. We evaluated the proposed model on the Major League Baseball (MLB) YouTube dataset from the sports domain. In experiments, the proposed model achieved a higher score than existing sports activity recognition models.
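The two-unit idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the network, so the blocks here are hypothetical random linear maps with ReLU, the "activity" unit is approximated by mean-pooled frame-to-frame differences, and all names (`make_block`, `accumulate_block_features`, `temporal_differences`, `mean_pool`) and dimensions are assumptions chosen for illustration.

```python
import random
from typing import Callable, List

Vector = List[float]
Block = Callable[[Vector], Vector]


def make_block(in_dim: int, out_dim: int, rng: random.Random) -> Block:
    """Hypothetical stand-in for one network block: random linear map + ReLU."""
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)]
               for _ in range(out_dim)]

    def block(x: Vector) -> Vector:
        return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

    return block


def accumulate_block_features(frame: Vector, blocks: List[Block]) -> Vector:
    """Pass a frame through the blocks in order and concatenate every
    block's output, so shallow and deep features are both retained."""
    accumulated: Vector = []
    x = frame
    for block in blocks:
        x = block(x)
        accumulated.extend(x)
    return accumulated


def temporal_differences(per_frame: List[Vector]) -> List[Vector]:
    """Change in accumulated object features between consecutive frames."""
    return [[b - a for a, b in zip(prev, cur)]
            for prev, cur in zip(per_frame, per_frame[1:])]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


rng = random.Random(0)
dims = [8, 6, 4, 3]  # assumed toy sizes: input dim, then three block outputs
blocks = [make_block(i, o, rng) for i, o in zip(dims, dims[1:])]

# A toy "video": 5 frames, each already flattened to an 8-dimensional vector.
video = [[rng.random() for _ in range(dims[0])] for _ in range(5)]

# Object feature extraction unit (sketch): accumulated features per frame.
object_features = [accumulate_block_features(frame, blocks) for frame in video]

# Activity feature extraction unit (sketch): summarize change over time.
activity_feature = mean_pool(temporal_differences(object_features))

print(len(object_features[0]), len(activity_feature))
```

In this sketch each frame yields a 13-dimensional accumulated vector (6 + 4 + 3), and the activity feature has the same length; a real model would feed such a representation to a learned temporal module and a classifier rather than a fixed mean.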
Data availability
The MLB YouTube dataset is available from Piergiovanni and Ryoo (2018).
References
Afrasiabi M, Khotanlou H, Mansoorizadeh M (2020) DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features. Vis Comput 36:1127–1139. https://doi.org/10.1007/s00371-019-01722-6
Atto A, Benoit A, Lambert P (2020) Timed-image based deep learning for action recognition in video sequences. Pattern Recogn 104:107353. https://doi.org/10.1016/j.patcog.2020.107353
Bagautdinov T, Alahi A, Fleuret F, Fua P, Savarese S (2017) Social scene understanding: end-to-end multi-person action localization and collective activity recognition. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp. 4315-4324
Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp. 6299-6308
Deliege A, Cioppa A, Giancola S, Seikavandi MJ, Dueholm JV, Nasrollahi K, Ghanem B, Moeslund TB, Van Droogenbroeck M (2021) SoccerNet-v2: A dataset and benchmarks for holistic understanding of broadcast soccer videos. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 4508–4519
Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrell T (2015) Long-term recurrent convolutional networks for visual recognition and description. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp. 2625-2634
Du W, Wang Y, Qiao Y (2018) Recurrent spatial-temporal attention network for action recognition in videos. IEEE Trans Image Process 27(3):1347–1360. https://doi.org/10.1109/TIP.2017.2778563
Feichtenhofer C, Fan H, Malik J, He K (2019) Slowfast networks for video recognition. In proceedings of the IEEE/CVF international conference on computer vision (ICCV). pp. 6202-6211
Gammulle H, Denman S, Sridharan S, Fookes C (2018) Multi-level sequence GAN for group activity recognition. In proceedings of the Asian conference on computer vision (ACCV). pp. 331-346. https://doi.org/10.1007/978-3-030-20887-5_21
Giancola S, Amine M, Dghaily T, Ghanem B (2018) SoccerNet: a scalable dataset for action spotting in soccer videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 1711-1721
Girdhar R, Carreira J, Doersch C, Zisserman A (2019) Video action transformer network. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 244-253
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In proceedings of the advances in neural information processing systems 27, Montréal, Canada
Gu X, Xue X, Wang F (2020) Fine-grained action recognition on a novel basketball dataset. In proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP). pp. 2563-2567. https://doi.org/10.1109/ICASSP40776.2020.9053928
He K, Zhang X, Ren S, Sun J (2016) Identity mappings in deep residual networks. In proceedings of the European conference on computer vision (ECCV), Amsterdam, the Netherlands
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735
Hussain T, Muhammad K, Ullah A, Zehong C, Baik S, Albuquerque V (2020) Cloud-assisted multiview video summarization using CNN and bidirectional LSTM. IEEE Trans Industr Inform 16(1):77–86. https://doi.org/10.1109/TII.2019.2929228
Jones ML, Levy K (2018) Sporting chances: robot referees and the automation of enforcement. We robot. Retrieved from https://ssrn.com/abstract=3293076
Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1725-1732
Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P, Suleyman M, Zisserman A (2017) The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. https://doi.org/10.48550/arXiv.1705.06950
Khan S, Haq I, Rho S, Baik S, Lee M (2019) Cover the violence: a novel deep-learning-based approach towards violence-detection in movies. Appl Sci 9(22):4963. https://doi.org/10.3390/app9224963
Khowaja SA, Yahya BN, Lee SL (2020) CAPHAR: context-aware personalized human activity recognition using associated learning in smart environments. Human-centric Comput Inform Sci 10:35. https://doi.org/10.1186/s13673-020-00240-y
Kim H, Lee S (2021) A video captioning method based on multi-representation switching for sustainable computing. Sustainability 13:2250. https://doi.org/10.3390/su13042250
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. In proceedings of the international conference on learning representations, San Diego, CA, USA
Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In proceedings of the international conference on computer vision (ICCV), pp. 2556-2563 https://doi.org/10.1109/ICCV.2011.6126543
Liu S, Ma X, Wu H, Li Y (2020) An end to end framework with adaptive spatio-temporal attention module for human action recognition. IEEE Access 8:47220–47231. https://doi.org/10.1109/ACCESS.2020.2979549
Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In proceedings of the international conference on machine learning 27, Haifa, Israel.
Perše M, Kristan M, Perš J, Mušič G, Vučkovič G, Kovačič S (2010) Analysis of multi-agent activity using Petri nets. Pattern Recogn 43(4):1491–1501. https://doi.org/10.1016/j.patcog.2009.11.011
Piergiovanni AJ, Ryoo MS (2018) Fine-grained activity recognition in baseball videos. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops. pp. 1740-1748
Piergiovanni AJ, Ryoo MS (2019) Representation flow for action recognition. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 9945-9953
Piergiovanni AJ, Fan C, Ryoo M (2017) Learning latent subevents in activity using temporal attention filters. Thirty-First AAAI Conference on Artificial Intelligence 31:1. https://ojs.aaai.org/index.php/AAAI/article/view/11240
Qi S, Ning X, Yang G, Zhang L, Long P, Cai W (2021) Review of multi-view 3D object recognition methods based on deep learning. Displays 69:102053. https://doi.org/10.1016/j.displa.2021.102053
Rahmad NA, As’ari MA, Ghazali NF, Sufri NAJ (2018) A survey of video based action recognition in sports. Indonesian Journal of Electrical Engineering and Computer Science 11(3):987–993. https://doi.org/10.11591/ijeecs.v11.i3.pp987-993
Ren Q (2021) A video expression recognition method based on multi-mode convolution neural network and multiplicative feature fusion. J Inform Proc Syst 17(3):556–570. https://doi.org/10.3745/JIPS.02.0156
Robertson MR (2015) 500 hours of video uploaded to YouTube every minute. Tubular insights. Retrieved from https://tubularinsights.com/hours-minute-uploaded-youtube/
Shih H (2017) A survey of content-aware video analysis for sports. IEEE Trans Circ Syst Video Technol 28:1212–1231. https://doi.org/10.1109/TCSVT.2017.2655624
Shim M, Kim YH, Kim K, Kim SJ (2018) Teaching machines to understand baseball games: large-scale baseball video database for multiple video understanding tasks. Proceedings of the European conference on computer vision (ECCV). pp. 404-420
Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In proceedings of the advances in neural information processing systems 27, Montréal, Canada
Singh R, Sonawane A, Srivastava R (2020) Recent evolution of modern datasets for human activity recognition: a deep survey. Multimedia Systems 26:83–106. https://doi.org/10.1007/s00530-019-00635-7
Soomro K, Zamir A, Shah M (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402. https://doi.org/10.48550/arXiv.1212.0402
Sun B, Kong D, Wang S, Li J, Yin B, Luo X (2022) GAN for vision, KG for relation: a two-stage network for zero-shot action recognition. Pattern Recogn 126:108563. https://doi.org/10.1016/j.patcog.2022.108563
Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)
Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In proceedings of the IEEE international conference on computer vision (ICCV), pp. 4489-4497
Tu H, Xu R, Chi R, Peng Y (2021) Multiperson interactive activity recognition based on interaction relation model. J Math 2021:5576369. https://doi.org/10.1155/2021/5576369
Wang X, Girshick R, Gupta A, He K (2018) Non-local neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp. 7794-7803
Wang C, Wang X, Zhang J, Zhang L, Bai X, Ning X, Zhou J, Hancock E (2022) Uncertainty estimation for stereo matching based on evidential deep learning. Pattern Recogn 124:108498. https://doi.org/10.1016/j.patcog.2021.108498
Wemlinger Z, Holder L (2018) Cross-environment activity recognition using a shared semantic vocabulary. Pervasive Mob Comput 51:150–159. https://doi.org/10.1016/j.pmcj.2018.10.004
Woo S, Park J, Lee J, Kweon I (2018) CBAM: convolutional block attention module. In proceedings of the European conference on computer vision (ECCV), pp. 3-19
Yoon D, Cho N, Lee S (2020) A novel online action detection framework from untrimmed video streams. Pattern Recogn 106:107396. https://doi.org/10.1016/j.patcog.2020.107396
Zhou X (2021) Video expression recognition method based on spatiotemporal recurrent neural network and feature fusion. J Inform Proc Syst 17(2):337–351. https://doi.org/10.3745/JIPS.01.0067
Code availability
The code used in this study is available from the authors upon reasonable request.
Funding
This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the Innovative Human Resource Development for Local Intellectualization support program (IITP-2022-RS-2022-00156360) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation).
Author information
Contributions
Conceptualization, Kwanghyun Ryu and Soowon Lee; Methodology, Kwanghyun Ryu; Software, Kwanghyun Ryu and Heechan Kim; Validation, Kwanghyun Ryu and Soowon Lee; Investigation, Kwanghyun Ryu; Resources, Soowon Lee; Writing, Original Draft Preparation, Kwanghyun Ryu; Writing, Review & Editing, Heechan Kim and Soowon Lee; Visualization, Kwanghyun Ryu; Supervision, Soowon Lee; Project Administration, Soowon Lee; Funding Acquisition, Soowon Lee. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ryu, K., Kim, H. & Lee, S. A deep learning model based on sequential object feature accumulation for sport activity recognition. Multimed Tools Appl 82, 37387–37406 (2023). https://doi.org/10.1007/s11042-023-15022-8
DOI: https://doi.org/10.1007/s11042-023-15022-8