
Fine grained sport action recognition with Twin spatio-temporal convolutional neural networks

Application to table tennis

Published in: Multimedia Tools and Applications

Abstract

Human action recognition in video is one of the key problems in visual data interpretation. Despite intensive research, recognizing actions with low inter-class variability remains a challenge. This paper presents a new Twin Spatio-Temporal Convolutional Neural Network (TSTCNN) for this purpose. Applied to table tennis, it detects and recognizes 20 table tennis strokes. The model has been trained on a dedicated dataset, TTStroke-21, recorded in natural conditions at the Faculty of Sports of the University of Bordeaux. Our model takes as inputs an RGB image sequence and its computed optical flow. The proposed Twin architecture is a two-stream network, each stream comprising 3 spatio-temporal convolutional layers, followed by a fully connected layer where the data from both streams are fused. Our method reaches an accuracy of 91.4%, against 43.1% for our baseline, a Two-Stream Inflated 3D ConvNet (I3D).
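The architecture described in the abstract can be sketched as follows. This is a minimal illustrative sketch in PyTorch, not the authors' implementation: the layer widths, kernel sizes, clip length, and input resolution are assumptions chosen for brevity; only the overall shape (two twin branches of 3 spatio-temporal convolutional layers, one on RGB and one on optical flow, fused in a fully connected layer) follows the paper's description.

```python
# Sketch of a Twin Spatio-Temporal CNN: two identical 3D-convolutional
# branches (RGB frames and 2-channel optical flow) fused by one FC layer.
# All hyperparameters below are illustrative, not the paper's values.
import torch
import torch.nn as nn


class Branch(nn.Module):
    """One stream: 3 blocks of 3D convolution + ReLU + pooling."""

    def __init__(self, in_channels):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(in_channels, 8, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool3d(2),
        )

    def forward(self, x):
        # x: (batch, channels, time, height, width)
        return self.features(x).flatten(start_dim=1)


class TSTCNN(nn.Module):
    """Twin network: RGB branch + optical-flow branch, fused by a FC layer."""

    def __init__(self, num_classes=20, clip_len=16, size=32):
        super().__init__()
        self.rgb = Branch(in_channels=3)   # RGB frames
        self.flow = Branch(in_channels=2)  # horizontal/vertical flow components
        # Each branch halves time/height/width three times (three poolings).
        feat = 32 * (clip_len // 8) * (size // 8) * (size // 8)
        self.fc = nn.Linear(2 * feat, num_classes)

    def forward(self, rgb_clip, flow_clip):
        fused = torch.cat([self.rgb(rgb_clip), self.flow(flow_clip)], dim=1)
        return self.fc(fused)  # class scores for the 20 stroke classes


if __name__ == "__main__":
    model = TSTCNN()
    rgb = torch.randn(1, 3, 16, 32, 32)   # (batch, channels, time, H, W)
    flow = torch.randn(1, 2, 16, 32, 32)
    print(model(rgb, flow).shape)  # torch.Size([1, 20])
```

The key design point mirrored here is late fusion: each modality is processed by its own spatio-temporal convolutional stack, and the streams only meet in the fully connected layer that produces the stroke-class scores.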




Notes

  1. http://www.multimediaeval.org/mediaeval2019/

  2. https://github.com/P-eMartin/crisp


Acknowledgements

We would like to thank Alain Coupet, expert and teacher in table tennis at the Faculty of Sports, for the proposed table tennis stroke taxonomy, and all the players and annotators for their involvement in the acquisition and annotation processes leading to TTStroke-21.

Author information


Corresponding author

Correspondence to Pierre-Etienne Martin.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This work was supported by the CRISP project of the Nouvelle-Aquitaine Region and the Bordeaux IDEX Initiative.


About this article


Cite this article

Martin, PE., Benois-Pineau, J., Péteri, R. et al. Fine grained sport action recognition with Twin spatio-temporal convolutional neural networks. Multimed Tools Appl 79, 20429–20447 (2020). https://doi.org/10.1007/s11042-020-08917-3

