
A deep learning model based on sequential object feature accumulation for sport activity recognition

Published in Multimedia Tools and Applications.

Abstract

Activity recognition is the problem of recognizing which activities occur in a video. An activity consists of the spatial movements of objects over time; therefore, recognizing the activity in a video requires understanding both the objects involved and how they change over time. To capture the temporal and spatial structure of an activity, we propose an activity recognition model based on sequential object feature accumulation, trained in an end-to-end manner. Sequential object feature accumulation, the method presented in this paper, integrates the features extracted from each block of a deep neural network and refines them to classify the activity and the characteristics of the sub-objects constituting it. The proposed model consists of two sub-units: an object feature extraction unit, which identifies the objects of a specific activity from an input video, and an activity feature extraction unit, which identifies changes in object movements over time. We evaluated the proposed model on the Major League Baseball YouTube dataset in the sports domain. In experiments, the proposed model scored higher than existing sports activity recognition models.
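The accumulation idea described in the abstract — integrating the features produced by each block of the backbone and refining a running representation — can be sketched as follows. This is a minimal NumPy illustration under assumed shapes, with random stand-in weights; `block_features`, the projection matrices, and the normalization-as-refinement step are all hypothetical, not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def block_features(x, out_dim):
    # Stand-in for one block of a deep backbone: a random linear map
    # followed by ReLU (hypothetical; the paper's blocks are part of a CNN).
    w = rng.standard_normal((x.shape[-1], out_dim)) * 0.01
    return np.maximum(x @ w, 0.0)

def accumulate(frames, block_dims, acc_dim=64):
    # Sequential feature accumulation: each block's output is projected to a
    # shared dimension and added into a running accumulator, which is then
    # lightly refined (here, simply L2-renormalized per frame).
    acc = np.zeros((frames.shape[0], acc_dim))
    x = frames
    for d in block_dims:
        x = block_features(x, d)
        proj = rng.standard_normal((d, acc_dim)) * 0.01
        acc = acc + x @ proj                                   # integrate this block
        acc = acc / (np.linalg.norm(acc, axis=-1, keepdims=True) + 1e-8)  # refine
    return acc

frames = rng.standard_normal((16, 128))      # 16 frames, 128-dim frame features
video_feat = accumulate(frames, block_dims=[256, 256, 512])
print(video_feat.shape)                      # (16, 64)
```

The point of the sketch is only the control flow: later blocks do not replace earlier features but are folded into the same accumulator, so the final representation reflects every depth of the network.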


Data availability

The MLB YouTube dataset is available in Piergiovanni and Ryoo (2018) [28].

References

  1. Afrasiabi M, Khotanlou H, Mansoorizadeh M (2020) DTW-CNN: time series-based human interaction prediction in videos using CNN-extracted features. Vis Comput 36:1127–1139. https://doi.org/10.1007/s00371-019-01722-6


  2. Atto A, Benoit A, Lambert P (2020) Timed-image based deep learning for action recognition in video sequences. Pattern Recogn 104:107353. https://doi.org/10.1016/j.patcog.2020.107353


  3. Bagautdinov T, Alahi A, Fleuret F, Fua P, Savarese S (2017) Social scene understanding: end-to-end multi-person action localization and collective activity recognition. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp. 4315-4324

  4. Carreira J, Zisserman A (2017) Quo vadis, action recognition? A new model and the kinetics dataset. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp.6299-6308

  5. Deliège A, Cioppa A, Giancola S, Seikavandi MJ, Dueholm JV, Nasrollahi K, Ghanem B, Moeslund TB, Van Droogenbroeck M (2021) SoccerNet-v2: a dataset and benchmarks for holistic understanding of broadcast soccer videos. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR) workshops, pp. 4508-4519

  6. Donahue J, Hendricks LA, Guadarrama S, Rohrbach M, Venugopalan S, Saenko K, Darrel T (2015) Long-term recurrent convolutional networks for visual recognition and description. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp.2625-2634

  7. Du W, Wang Y, Qiao Y (2018) Recurrent spatial-temporal attention network for action recognition in videos. IEEE Trans Image Process 27(3):1347–1360. https://doi.org/10.1109/TIP.2017.2778563


  8. Feichtenhofer C, Fan H, Malik J, He K (2019) Slowfast networks for video recognition. In proceedings of the IEEE/CVF international conference on computer vision (ICCV). pp. 6202-6211

  9. Gammulle H, Denman S, Sridharan S, Fookes C (2018) Multi-level sequence GAN for group activity recognition. In proceedings of the Asian conference on computer vision (ACCV). pp. 331-346. https://doi.org/10.1007/978-3-030-20887-5_21

  10. Giancola S, Amine M, Dghaily T, Ghanem B (2018) SoccerNet: a scalable dataset for action spotting in soccer videos. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) workshops, pp. 1711-1721

  11. Girdhar R, Carreira J, Doersch C, Zisserman A (2019) Video action transformer network. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 244-253

  12. Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial nets. In proceedings of the advances in neural information processing systems 27, Montréal, Canada

  13. Gu X, Xue X, Wang F (2020) Fine-grained action recognition on a novel basketball dataset. In proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP), pp. 2563-2567. https://doi.org/10.1109/ICASSP40776.2020.9053928

  14. He K, Zhang X, Ren S, Sun J (2016) Identity mappings in deep residual networks. In proceedings of the European conference in computer vision, Amsterdam, the Netherlands

  15. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735


  16. Hussain T, Muhammad K, Ullah A, Zehong C, Baik S, Albuquerque V (2020) Cloud-assisted multiview video summarization using CNN and bidirectional LSTM. IEEE Trans Industr Inform 16(1):77–86. https://doi.org/10.1109/TII.2019.2929228


  17. Jones ML, Levy K (2018) Sporting chances: robot referees and the automation of enforcement. We robot. Retrieved from https://ssrn.com/abstract=3293076

  18. Karpathy A, Toderici G, Shetty S, Leung T, Sukthankar R, Fei-Fei L (2014) Large-scale video classification with convolutional neural networks. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp. 1725-1732

  19. Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P, Suleyman M, Zisserman A (2017) The kinetics human action video dataset. arXiv preprint arxiv:1705.06950. https://doi.org/10.48550/arXiv.1705.06950

  20. Khan S, Haq I, Rho S, Baik S, Lee M (2019) Cover the violence: a novel deep-learning-based approach towards violence-detection in movies. Appl Sci 9(22):4963. https://doi.org/10.3390/app9224963


  21. Khowaja SA, Yahya BN, Lee SL (2020) CAPHAR: context-aware personalized human activity recognition using associated learning in smart environments. Human-centric Comput Inform Sci 10:35. https://doi.org/10.1186/s13673-020-00240-y


  22. Kim H, Lee S (2021) A video captioning method based on multi-representation switching for sustainable computing. Sustainability 13:2250. https://doi.org/10.3390/su13042250


  23. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization, in proceedings of the international conference on learning representations, San Diego, CA, USA

  24. Kuehne H, Jhuang H, Garrote E, Poggio T, Serre T (2011) HMDB: a large video database for human motion recognition. In proceedings of the international conference on computer vision (ICCV), pp. 2556-2563 https://doi.org/10.1109/ICCV.2011.6126543

  25. Liu S, Ma X, Wu H, Li Y (2020) An end to end framework with adaptive spatio-temporal attention module for human action recognition. IEEE Access 8:47220–47231. https://doi.org/10.1109/ACCESS.2020.2979549


  26. Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In proceedings of the international conference on machine learning 27, Haifa, Israel.

  27. Perše M, Kristan M, Perš J, Mušič G, Vučkovič G, Kovačič S (2010) Analysis of multi-agent activity using Petri nets. Pattern Recogn 43(4):1491–1501. https://doi.org/10.1016/j.patcog.2009.11.011


  28. Piergiovanni AJ, Ryoo MS (2018) Fine-grained activity recognition in baseball videos. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR) workshops, pp. 1740-1748

  29. Piergiovanni AJ, Ryoo MS (2019) Representation flow for action recognition. In proceedings of the IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp. 9945-9953

  30. Piergiovanni AJ, Fan C, Ryoo MS (2017) Learning latent sub-events in activity videos using temporal attention filters. In proceedings of the thirty-first AAAI conference on artificial intelligence (AAAI). https://ojs.aaai.org/index.php/AAAI/article/view/11240

  31. Qi S, Ning X, Yang G, Zhang L, Long P, Cai W (2021) Review of multi-view 3D object recognition methods based on deep learning. Displays 69:102053. https://doi.org/10.1016/j.displa.2021.102053

  32. Rahmad NA, As’ari MA, Ghazali NF, Sufri NAJ (2018) A survey of video based action recognition in sports. Indonesian Journal of Electrical Engineering and Computer Science 11(3):987–993. https://doi.org/10.11591/ijeecs.v11.i3.pp987-993

  33. Ren Q (2021) A video expression recognition method based on multi-mode convolution neural network and multiplicative feature fusion. J Inform Proc Syst 17(3):556–570. https://doi.org/10.3745/JIPS.02.0156


  34. Robertson MR (2015) 500 hours of video uploaded to YouTube every minute. Tubular insights. Retrieved from https://tubularinsights.com/hours-minute-uploaded-youtube/

  35. Shih H (2017) A survey of content-aware video analysis for sports. IEEE Trans Circ Syst Video Technol 28:1212–1231. https://doi.org/10.1109/TCSVT.2017.2655624


  36. Shim M, Kim YH, Kim K, Kim SJ (2018) Teaching machines to understand baseball games: large-scale baseball video database for multiple video understanding tasks. Proceedings of the European conference on computer vision (ECCV). pp. 404-420

  37. Simonyan K, Zisserman A (2014) Two-stream convolutional networks for action recognition in videos. In proceedings of the advances in neural information processing systems 27, Montréal, Canada

  38. Singh R, Sonawane A, Srivastava R (2020) Recent evolution of modern datasets for human activity recognition: a deep survey. Multimedia Systems 26:83–106. https://doi.org/10.1007/s00530-019-00635-7


  39. Soomro K, Zamir A, Shah M (2012) UCF101: a dataset of 101 human actions classes from videos in the wild. arXiv preprint arxiv:1212.0402. https://doi.org/10.48550/arXiv.1212.0402

  40. Sun B, Kong D, Wang S, Li J, Yin B, Luo X (2022) GAN for vision, KG for relation: a two-stage network for zero-shot action recognition. Pattern Recogn 126:108563. https://doi.org/10.1016/j.patcog.2022.108563


  41. Szegedy C, Vanhoucke V, Ioffe S, Shlens J, Wojna Z (2016) Rethinking the inception architecture for computer vision. In proceedings of the IEEE conference on computer vision and pattern recognition (CVPR)

  42. Tran D, Bourdev L, Fergus R, Torresani L, Paluri M (2015) Learning spatiotemporal features with 3d convolutional networks. In proceedings of the IEEE international conference on computer vision (ICCV), pp. 4489-4497

  43. Tu H, Xu R, Chi R, Peng Y (2021) Multiperson interactive activity recognition based on interaction relation model. J Math 2021:5576369. https://doi.org/10.1155/2021/5576369


  44. Wang X, Girshick R, Gupta A, He K (2018) Non-local neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). pp.7794-7803

  45. Wang C, Wang X, Zhang J, Zhang L, Bai X, Ning X, Zhou J, Hancock E (2022) Uncertainty estimation for stereo matching based on evidential deep learning. Pattern Recogn 124:108498. https://doi.org/10.1016/j.patcog.2021.108498


  46. Wemlinger Z, Holder L (2018) Cross-environment activity recognition using a shared semantic vocabulary. Pervasive Mob Comput 51:150–159. https://doi.org/10.1016/j.pmcj.2018.10.004


  47. Woo S, Park J, Lee J, Kweon I (2018) CBAM: convolutional block attention module. In proceedings of the European conference on computer vision (ECCV), pp. 3-19

  48. Yoon D, Cho N, Lee S (2020) A novel online action detection framework from untrimmed video streams. Pattern Recogn 106:107396. https://doi.org/10.1016/j.patcog.2020.107396


  49. Zhou X (2021) Video expression recognition method based on spatiotemporal recurrent neural network and feature fusion. J Inform Proc Syst 17(2):337–351. https://doi.org/10.3745/JIPS.01.0067



Code availability

Codes in this study are available from the authors upon reasonable request.

Funding

This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the Innovative Human Resource Development for Local Intellectualization support program (IITP-2022-RS-2022-00156360) supervised by the IITP (Institute for Information & communications Technology Planning & Evaluation).

Author information


Contributions

Conceptualization, Kwanghyun Ryu and Soowon Lee; Methodology, Kwanghyun Ryu; Software, Kwanghyun Ryu and Heechan Kim; Validation, Kwanghyun Ryu and Soowon Lee; Investigation, Kwanghyun Ryu; Resources, Soowon Lee; Writing - Original Draft Preparation, Kwanghyun Ryu; Writing - Review & Editing, Heechan Kim and Soowon Lee; Visualization, Kwanghyun Ryu; Supervision, Soowon Lee; Project Administration, Soowon Lee; Funding Acquisition, Soowon Lee. All authors have read and agreed to the published version of the manuscript.

Corresponding author

Correspondence to Soowon Lee.

Ethics declarations

Conflict of interest

The authors declare no conflict of interest.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Ryu, K., Kim, H. & Lee, S. A deep learning model based on sequential object feature accumulation for sport activity recognition. Multimed Tools Appl 82, 37387–37406 (2023). https://doi.org/10.1007/s11042-023-15022-8
