Abstract
This paper describes the data used in the ChaLearn gesture challenges that took place in 2011/2012, whose results were discussed at the CVPR 2012 and ICPR 2012 conferences. The task can be characterized as: user-dependent, small vocabulary, fixed camera, one-shot learning. The data include 54,000 hand and arm gestures recorded with an RGB-D \(\hbox {Kinect}^\mathrm{TM}\) camera. The data are organized into batches of 100 gestures, each batch pertaining to a small gesture vocabulary of 8–12 gestures recorded by the same user. Gestures are recorded as short continuous sequences of 1–5 randomly selected gestures. We provide manual annotations (temporal segmentation into individual gestures, alignment of RGB and depth images, and body part locations) and a library of functions to preprocess and automatically annotate the data. We also provide a subset of batches in which the user’s horizontal position is randomly shifted or scaled. We report on the results of the challenge and distribute sample code to facilitate the development of new solutions. The data, data collection software and gesture vocabularies are downloadable from http://gesture.chalearn.org. We have set up a forum for researchers working on these data: http://groups.google.com/group/gesturechallenge.
Notes
For round 1: http://www.kaggle.com/c/GestureChallenge. For round 2: http://www.kaggle.com/c/GestureChallenge2.
For ease of visualization, earlier experiments were recorded in a different format: depth was encoded as gray levels, and the RGB and depth images were concatenated vertically and stored as a single Matlab movie. However, we later realized that we were losing depth resolution for some videos because Matlab movies use only 8 bits of resolution (256 levels), while the depth resolution of our videos sometimes exceeded 1,000 levels. Hence, we recorded later batches using cell arrays for K.
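The quantization loss described in this note can be illustrated with a short sketch (the depth values below are illustrative, not taken from the actual recordings): once a depth map with more than 256 distinct levels is stored as an 8-bit gray-level frame, decoding can recover at most 256 levels.

```python
import numpy as np

# Simulated raw depth map with 1,101 distinct levels (illustrative values).
depth = np.linspace(500, 1600, num=1101).reshape(1, -1)

# Encoding depth as an 8-bit gray-level movie frame collapses it to 256 levels.
lo, hi = depth.min(), depth.max()
encoded = np.round((depth - lo) / (hi - lo) * 255).astype(np.uint8)

# Decoding back can only recover the 256 surviving levels.
decoded = encoded.astype(np.float64) / 255 * (hi - lo) + lo

print(len(np.unique(depth)))    # 1101 distinct input levels
print(len(np.unique(encoded)))  # 256 levels after 8-bit encoding
```

This is exactly why later batches stored the raw depth values in cell arrays instead of 8-bit movie frames.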
References
Accelerative Integrated Method (AIM) foreign language teaching methodology, http://www.aimlanguagelearning.com/
Computer vision datasets on the web. http://www.cvpapers.com/datasets.html
Imageclef—the clef cross language image retrieval track. http://www.imageclef.org/
The Pascal visual object classes homepage. http://pascallin.ecs.soton.ac.uk/challenges/VOC/
Alon, J., Athitsos, V., Yuan, Q., Sclaroff, S.: A unified framework for gesture recognition and spatiotemporal gesture segmentation. IEEE Trans. Patt. Anal. Mach. Intell. 31(9), 1685–1699 (2009)
Beyer, M.: Teach your baby to sign: an illustrated guide to simple sign language for babies. Fair Winds Press, Minneapolis (2007)
Calatroni, A., Roggen, D., Tröster, G.: Collection and curation of a large reference dataset for activity recognition. In: Systems, Man, and Cybernetics (SMC), 2011 IEEE International Conference on, pp. 30–35. (2011)
Carroll, C., Carroll, R.: Mudras of India: a comprehensive guide to the hand gestures of yoga and Indian dance. Jessica Kingsley Publishers, London (2012)
Chavarriaga, R., Sagha, H., Calatroni, A., Digumarti, S.T., Tröster, G., del R. Millán, J., Roggen, D.: The Opportunity challenge: a benchmark database for on-body sensor-based activity recognition. Patt. Recogn. Lett. (2013)
Private communication
Curwen, J.: The standard course of lessons & exercises in the Tonic Sol-Fa Method of teaching music: (Founded on Miss Glover’s Scheme for Rendering Psalmody Congregational. A.D. 1835.).. Nabu Press, Charleston (2012)
Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection, pp. 886–893. CVPR, Providence (2005)
Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: Proceedings of the 9th European conference on Computer Vision—Volume Part II. ECCV’06, pp. 428–441. Springer-Verlag, Berlin, (2006)
De la Torre Frade, F., Hodgins, J.K., Bargteil, A.W., Martin A., Xavier, M., Justin C., Collado I Castells, A., Beltran, J.: Guide to the carnegie mellon university multimodal activity (cmu-mmac) database. In: Technical Report CMU-RI-TR-08-22, Robotics Institute, Pittsburgh, (2008)
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR09, (2009)
Dreuw, P., Neidle, C., Athitsos, V, Sclaroff, S., Ney, H.: Benchmark databases for video-based automatic sign language recognition. In: Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), European Language Resources Association (ELRA), Marrakech, (2008)
Eichner, M., Marín-Jiménez, M.J., Zisserman, A., Ferrari, V.: 2D articulated human pose estimation and retrieval in (almost) unconstrained still images. Intern. J. Comp. Vis. 99(2), 190–214 (2012)
Escalante, H.J., Guyon, I.: Principal motion: PCA-based reconstruction of motion histograms. In: Technical report, ChaLearn Technical Memorandum, (2012). http://www.causality.inf.ethz.ch/Gesture/principal_motion.pdf
Escalante, H.J., Guyon, I., Athitsos, V., Jangyodsuk, P., Wan, J.: Principal motion components for gesture recognition using a single-example. CoRR abs/1310.4822 (2013). http://arxiv.org/abs/1310.4822
Escalera, S., Gonzàlez, J., Baró, X., Reyes, M., Lopes, O., Guyon, I., Athitsos, V., Escalante, H.J.: Multi-modal gesture recognition challenge 2013: Dataset and results. In: Technical report, ChaLearn Technical Memorandum, (2013)
Glomb, P., Romaszewski, M., Opozda, S., Sochan, A.: Choosing and modeling the hand gesture database for a natural user interface. In: Proceedings of the 9th international conference on Gesture and Sign Language in Human–Computer Interaction and Embodied Communication. GW’11, pp. 24–35. Springer-Verlag, Berlin, (2012)
Gross, R., Shi, J.: The cmu motion of body (mobo) database. In: Technical Report CMU-RI-TR-01-18. Robotics Institute, Carnegie Mellon University, Pittsburgh, (2001)
Guyon, I., Athitsos, V., Jangyodsuk, P., Escalante, H.J.: ChaLearn gesture demonstration kit. In: Technical report, ChaLearn Technical Memorandum, (2013)
Guyon, I., Athitsos, V., Jangyodsuk, P., Escalante, H.J., Hamner, B.: Results and analysis of the ChaLearn gesture challenge 2012. In: Advances in Depth Image Analysis and Applications, volume 7854 of Lecture Notes in Computer Science, pp. 186–204. (2013)
Guyon, I., Athitsos, V., Jangyodsuk, P., Hamner, B., Escalante, H.J.: ChaLearn gesture challenge: design and first results. In: CVPR Workshops, pp. 1–6. IEEE (2012)
Hargrave, J.L.: Let me see your body talk. Kendall/Hunt Pub. Co., Dubuque (1995)
Hwang, B.-W., Kim, S., Lee, S.-W.: A full-body gesture database for automatic gesture recognition. In: FG, pp. 243–248. IEEE Computer Society (2006)
Kendon, A.: Gesture: visible action as utterance. Cambridge University Press, Cambridge (2004)
Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: HMDB: a large video database for human motion recognition. In: ICCV (2011)
Laptev, I.: On space–time interest points. Intern. J. Comp. Vis. 64(2–3), 107–123 (2005)
Larsson, M., Serrano V.I., Kragic, D., Kyrki V.: Cvap arm/hand activity database, http://www.csc.kth.se/~danik/gesture_database/
Malgireddy, M., Nwogu, I., Govindaraju, V.: Language-motivated approaches to action recognition. JMLR 14, 2189–2212 (2013)
Martínez, A.M., Wilbur, R.B., Shay, R., Kak, A.C.: Purdue RVL-SLLL ASL database for automatic recognition of American Sign Language. In: Proceedings of the 4th IEEE International Conference on Multimodal Interfaces. ICMI ’02, pp. 167–172. IEEE Computer Society, Washington, (2002)
McNeill, D.: Hand and mind: what gestures reveal about thought. Psychology/cognitive science. University of Chicago Press, Chicago (1996)
Moeslund, T.B., Bajers, F.: Summaries of 107 computer vision-based human motion capture papers (1999)
Moeslund, T.B., Hilton, A., Krüger, V., Sigal, L. (eds.): Visual analysis of humans—looking at people. Springer, Berlin (2011)
Müller, M., Röder, T., Clausen, M., Eberhardt, B., Krüger, B., Weber, A.: Documentation mocap database hdm05. In: Technical Report CG-2007-2, Universität Bonn, (2007)
Munari, B.: Speak Italian: the fine art of the gesture. Chronicle Books, San Francisco (2005)
World Federation of the Deaf, Unification of Signs Commission: Gestuno: international sign language of the deaf (Langage gestuel international des sourds). British Deaf Association [for] the World Federation of the Deaf (1975)
Raptis, M., Kirovski, D., Hoppe, H.: Real-time classification of dance gestures from skeleton animation. In: Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation, (2011)
Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., Blake, A.: Real-time human pose recognition in parts from single depth images. In: CVPR (2011)
Sigal, L., Balan, A.O., Black, M.J.: HumanEva: synchronized video and motion capture dataset and baseline algorithm for evaluation of articulated human motion. Int. J. Comp. Vision 87(1–2), 4–27 (2010)
Torralba, A., Fergus, R., Freeman, W.T.: 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Trans. Patt. Anal. Mach. Intell. 30(11) (2008)
Viterbi, A.: Error bounds for convolutional codes and an asymptotically optimum decoding algorithm. IEEE Trans. Inf. Theory 13(2), 260–269 (1967)
von Laban, R., Lange, R.: Laban’s principles of dance and movement notation. Macdonald & Evans, Canada (1975)
Wagner, M., Armstrong, N.: Field guide to gestures: how to identify and interpret virtually every gesture known to man. Field Guide, Quirk Books, Philadelphia (2003)
Wan, J., Ruan, Q., Li, W.: One-shot learning gesture recognition from RGB-D data using bag of features. JMLR (2013)
Acknowledgments
This challenge was organized by ChaLearn http://chalearn.org whose directors are gratefully acknowledged. The submission website was hosted by Kaggle http://kaggle.com and we thank Ben Hamner for his wonderful support. Our sponsors include Microsoft (Kinect for Xbox 360) and Texas Instruments, who donated prizes. We are very grateful to Alex Kipman and Laura Massey at Microsoft and to Branislav Kisacanin at Texas Instruments who made this possible. We also thank the committee members and participants of the CVPR 2011, CVPR 2012, and ICPR 2012 gesture recognition workshops, the judges of the demonstration competitions hosted in conjunction with CVPR 2012 and ICPR 2012, and the Pascal2 reviewers who made valuable suggestions. We are particularly grateful to Richard Bowden, Philippe Dreuw, Ivan Laptev, Jitendra Malik, Greg Mori, and Christian Vogler, who provided us with useful guidance in the design of the dataset.
Additional information
This effort was initiated by the DARPA Deep Learning program and was supported by the US National Science Foundation (NSF) under grants ECCS 1128436 and ECCS 1128296, the EU Pascal2 network of excellence and the Challenges in Machine Learning foundation. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the funding agencies.
Appendix
1.1 Results by challenge participant
We used the code provided by the 15 top-ranking participants in both challenge rounds to compute performances on the validation and final evaluation sets (Table 5). We also provide results on 20 other batches selected for our translation experiments. Untranslated data are referred to as “utran” and translated data as “tran”. Details on the methods employed by the participants can be found in reference [24] and on the website of the challenge.
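The performance figures in these tables are edit-distance based recognition error rates. The sketch below is our own minimal illustration of normalized Levenshtein scoring over predicted versus true gesture label sequences, not the official evaluation code; the example sequences are made up.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two label sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[m][n]

# Sum the edit distances over all test sequences, then normalize by the
# total number of true gestures to obtain an error score.
truth = [[3, 7], [1], [5, 2, 8]]   # hypothetical true label sequences
preds = [[3, 7], [4], [5, 8]]      # hypothetical predictions
errors = sum(levenshtein(p, t) for p, t in zip(preds, truth))
score = errors / sum(len(t) for t in truth)
print(score)  # 2 errors over 6 true gestures ≈ 0.333
```

A score of 0 means every sequence was recognized exactly; a missed or spurious gesture in a continuous sequence costs one edit.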
1.2 Development data lexicons
The development data were recorded using a subset of thirty lexicons (Table 6), each recorded at least 11 times by different users. We list in Table 7 the lexicons used for the validation and final evaluation data. Note that some validation lexicons also appear in the development data, but the final evaluation data include only new lexicons found in no other set.
1.3 Results by data batch
We show in Table 7 the performances by batch. We computed the best and average performance over the 15 top-ranking participants in rounds 1 and 2: Alfnie1, Alfnie2, BalazsGodeny, HITCS, Immortals, Joewan, Manavender, OneMillionMonkeys, Pennect, SkyNet, TurtleTamers, Vigilant, WayneZhang, XiaoZhuWudi, and Zonga.
1.4 Depth parameters
We also provide the parameters necessary to reconstruct the original depth data from normalized values (Table 8).
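A de-normalization of this kind can be sketched as follows. The linear mapping and the parameter names (`depth_min`, `depth_max`) are our assumptions for illustration; the actual per-batch parameters are those of Table 8.

```python
import numpy as np

def reconstruct_depth(normalized, depth_min, depth_max):
    """Map normalized 8-bit depth values back to the original depth range.

    Hypothetical linear de-normalization: 0 maps to depth_min and
    255 maps to depth_max. The true per-batch parameters come from
    Table 8 of the paper.
    """
    normalized = np.asarray(normalized, dtype=np.float64)
    return normalized / 255.0 * (depth_max - depth_min) + depth_min

# Example with made-up parameters: 0 -> 400, 255 -> 3000.
print(reconstruct_depth([0, 255], depth_min=400, depth_max=3000))
```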
Cite this article
Guyon, I., Athitsos, V., Jangyodsuk, P. et al. The ChaLearn gesture dataset (CGD 2011). Machine Vision and Applications 25, 1929–1951 (2014). https://doi.org/10.1007/s00138-014-0596-3
Keywords
- Computer vision
- Gesture recognition
- Sign language recognition
- RGBD cameras
- Kinect
- Dataset
- Challenge
- Machine learning
- Transfer learning
- One-shot-learning