short-paper

Are You Watching Closely? Content-based Retrieval of Hand Gestures

Authors:
Mahnaz Amiri Parian, University of Basel, Basel, Switzerland
Luca Rossetto, University of Zurich, Zurich, Switzerland
Heiko Schuldt, University of Basel, Basel, Switzerland
Stéphane Dupont, University of Mons, Mons, Belgium

ICMR '20: Proceedings of the 2020 International Conference on Multimedia Retrieval, June 2020, Pages 266–270. https://doi.org/10.1145/3372278.3390723

Published: 08 June 2020

ABSTRACT

Gestures play an important role in our daily communication. However, recognizing and retrieving gestures in the wild is a challenging task that has not been explored thoroughly in the literature. In this paper, we explore the problem of identifying and retrieving gestures in a large-scale video dataset provided by the computer vision community, based on queries recorded in the wild. Our proposed pipeline, I3DEF, extracts spatio-temporal features from intermediate layers of an I3D network, a state-of-the-art network for action recognition, and fuses the feature maps obtained from RGB and optical flow input. The resulting embeddings are used to train a triplet network that captures the similarity between gestures. We further explore the effect of a person and body-part masking step aimed at improving both retrieval performance and recognition rate. Our experiments show that I3DEF can recognize and retrieve gestures similar to the queries independently of the depth modality. This performance holds both for queries taken from the test data and for queries using recordings of different people performing relevant gestures in a different setting.
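The abstract names three steps: fusing RGB and optical-flow features, training a triplet network, and retrieving gestures similar to a query. The following is a minimal sketch of what these steps could look like on plain feature vectors. It is not the authors' implementation: the abstract does not state the fusion operator, the distance, or the margin, so the concatenation-based `fuse`, the squared Euclidean distance, the margin of 0.2, and all function names here are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(rgb_feat, flow_feat):
    """Assumed fusion: normalize each stream, then concatenate the two vectors."""
    return l2_normalize(rgb_feat) + l2_normalize(flow_feat)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: zero once the negative is farther from the
    anchor than the positive by at least `margin` (margin value is assumed)."""
    d_pos = squared_distance(anchor, positive)
    d_neg = squared_distance(anchor, negative)
    return max(d_pos - d_neg + margin, 0.0)


def rank_gallery(query, gallery):
    """Return gallery indices ordered from most to least similar to the query."""
    return sorted(range(len(gallery)),
                  key=lambda i: squared_distance(query, gallery[i]))


if __name__ == "__main__":
    # Toy embeddings standing in for fused gesture descriptors.
    query = fuse([3.0, 4.0], [0.0, 2.0])
    gallery = [fuse([4.0, 3.0], [1.0, 1.0]),
               fuse([3.0, 4.0], [0.0, 1.0]),
               fuse([-3.0, -4.0], [0.0, -2.0])]
    print(rank_gallery(query, gallery))
    print(triplet_loss(query, gallery[1], gallery[2]))
```

In a trained system the embeddings passed to `rank_gallery` would come from the learned triplet network rather than directly from the fused features. The sketch only shows how the loss shapes the distances that retrieval later relies on.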

Published in
ICMR '20: Proceedings of the 2020 International Conference on Multimedia Retrieval
June 2020
605 pages
ISBN: 9781450370875
DOI: 10.1145/3372278
General Chairs: Cathal Gurrin (Dublin City University, Ireland), Björn Þór Jónsson (IT University of Copenhagen, Denmark), Noriko Kando (National Institute of Informatics, Tokyo)
Program Chairs: Klaus Schoeffmann (Klagenfurt University, Austria), Phoebe Chen (La Trobe University, Australia), Noel E. O'Connor (Dublin City University, Ireland)
Copyright © 2020 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
content-based gesture video retrieval
deep-neural embedding
Qualifiers
- short-paper
Acceptance Rates
Overall Acceptance Rate: 254 of 830 submissions, 31%