Abstract
Research on hand action recognition has achieved impressive performance in recent years, notably thanks to deep learning methods. These improvements open new perspectives for real applications of Human-Machine Interfaces (HMI) built on such recognition. Developing these new interactions and interfaces iteratively, towards the best possible user experience, requires data. However, current datasets for hand action recognition in an egocentric view, while perfectly suitable for the recognition problem itself, generally lack a limited but coherent context for the proposed actions. Indeed, these datasets tend to provide a wide range of actions, more or less related to each other, which does not help to create a meaningful context for HMI application purposes. We therefore present in this paper a new dataset, FirstPiano, for hand action recognition in an egocentric view, in the context of piano training. FirstPiano provides a total of 672 video sequences directly extracted from the sensors of the Microsoft HoloLens Augmented Reality device. Each sequence is provided as depth, infrared and grayscale data, with four different points of view for the grayscale modality, for a total of six streams per video. We also present a first benchmark of experiments using a Capsule Network over different classification problems and different stream combinations. Our dataset and experiments should therefore be of interest to the action recognition and human-machine interface research communities.
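To make the six-stream structure described in the abstract concrete, the following minimal Python sketch indexes such a dataset. The directory layout, file extension, and the stream names `gray_pv1` through `gray_pv4` are hypothetical assumptions made for illustration; they do not describe the official FirstPiano distribution format.

```python
import os
from dataclasses import dataclass
from typing import Dict, List

# Stream names follow the paper's description (one depth stream, one
# infrared stream, and four grayscale points of view). The names and the
# <root>/<sequence_id>/<stream>.avi layout are assumptions, not the
# published format.
STREAMS = ["depth", "infrared", "gray_pv1", "gray_pv2", "gray_pv3", "gray_pv4"]

@dataclass
class PianoSequence:
    """One recorded sequence, with one video file path per sensor stream."""
    sequence_id: str
    streams: Dict[str, str]  # stream name -> path to the video file

def index_dataset(root: str) -> List[PianoSequence]:
    """Scan the assumed <root>/<sequence_id>/<stream>.avi layout."""
    sequences = []
    for seq_id in sorted(os.listdir(root)):
        seq_dir = os.path.join(root, seq_id)
        if not os.path.isdir(seq_dir):
            continue
        streams = {s: os.path.join(seq_dir, f"{s}.avi") for s in STREAMS}
        # Keep only sequences for which all six streams are present.
        if all(os.path.isfile(p) for p in streams.values()):
            sequences.append(PianoSequence(seq_id, streams))
    return sequences

if __name__ == "__main__":
    dataset = index_dataset("FirstPiano")
    print(f"Indexed {len(dataset)} sequences")  # 672 when the set is complete
```

Grouping the streams per sequence like this also makes it straightforward to experiment with the stream combinations evaluated in the benchmark, e.g. depth only, grayscale only, or all six streams together.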
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Voillemin, T., Wannous, H., Vandeborre, JP. (2022). FirstPiano: A New Egocentric Hand Action Dataset Oriented Towards Augmented Reality Applications. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13233. Springer, Cham. https://doi.org/10.1007/978-3-031-06433-3_15
DOI: https://doi.org/10.1007/978-3-031-06433-3_15
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06432-6
Online ISBN: 978-3-031-06433-3
eBook Packages: Computer Science, Computer Science (R0)