Abstract
Human action recognition in video is one of the key problems in visual data interpretation. Despite intensive research, recognizing actions with low inter-class variability remains a challenge. This paper presents a new Twin Spatio-Temporal Convolutional Neural Network (TSTCNN) for this purpose. Applied to table tennis, it detects and recognizes 20 table tennis strokes. The model was trained on a dedicated dataset, TTStroke-21, recorded in natural conditions at the Faculty of Sports of the University of Bordeaux. The model takes as input an RGB image sequence and the optical flow computed from it. The proposed Twin architecture is a two-stream network whose streams each comprise three spatio-temporal convolutional layers, followed by a fully connected layer where the data are fused. Our method reaches an accuracy of 91.4%, against 43.1% for our baseline, a Two-Stream Inflated 3D ConvNet (I3D).
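To make the two-stream layout more concrete, the following is a minimal shape-tracing sketch rather than the model itself. It follows an RGB clip and an optical-flow clip through three spatio-temporal convolution blocks per stream and reports the size of the vector reaching the fused fully connected layer. Every number in it (clip size, channel counts, kernel, padding, pooling) and the use of concatenation as the fusion step are illustrative assumptions that the abstract does not specify.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class ConvBlock:
    """One spatio-temporal block: stride-1 3D convolution, then max pooling.

    All default values are assumptions made for illustration only.
    """
    out_channels: int
    kernel: int = 3    # assumed cubic kernel (time, height, width)
    padding: int = 1   # assumed padding that preserves the size
    pool: int = 2      # assumed pooling factor on every axis


def block_output_shape(shape, block):
    """Return the (C, T, H, W) shape after one convolution + pooling block."""
    _, t, h, w = shape

    def out(n):
        after_conv = n + 2 * block.padding - block.kernel + 1
        return after_conv // block.pool

    return (block.out_channels, out(t), out(h), out(w))


def stream_feature_count(input_shape, blocks):
    """Number of features one stream produces once flattened."""
    shape = input_shape
    for block in blocks:
        shape = block_output_shape(shape, block)
    return prod(shape)


# Hypothetical configuration: three blocks per stream, as in the abstract,
# with made-up channel counts and a made-up clip size.
BLOCKS = [ConvBlock(16), ConvBlock(32), ConvBlock(64)]
RGB_CLIP = (3, 32, 64, 64)    # 3 colour channels, 32 frames of 64 x 64
FLOW_CLIP = (2, 32, 64, 64)   # 2 flow components (horizontal, vertical)

rgb_features = stream_feature_count(RGB_CLIP, BLOCKS)
flow_features = stream_feature_count(FLOW_CLIP, BLOCKS)

# Assumed fusion: concatenate both flattened streams before the
# fully connected layer.
fused_features = rgb_features + flow_features

print(rgb_features, flow_features, fused_features)
```

Under these assumptions both streams yield the same number of features, because only the input channel count differs between RGB and optical flow, and that count is replaced by the first block's output channels.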
Acknowledgements
We would like to thank Alain Coupet from the Faculty of Sports, expert and teacher in table tennis, for the proposed taxonomy of table tennis strokes, and all the players and annotators for their involvement in the acquisition and annotation processes leading to TTStroke-21.
Additional information
This work was supported by the CRISP project of the Nouvelle-Aquitaine Region and the Bordeaux IDEX Initiative.
Cite this article
Martin, PE., Benois-Pineau, J., Péteri, R. et al. Fine grained sport action recognition with Twin spatio-temporal convolutional neural networks. Multimed Tools Appl 79, 20429–20447 (2020). https://doi.org/10.1007/s11042-020-08917-3
DOI: https://doi.org/10.1007/s11042-020-08917-3