ABSTRACT
Gestures play an important role in our daily communications. However, recognizing and retrieving gestures in-the-wild is a challenging task that has not been explored thoroughly in the literature. In this paper, we explore the problem of identifying and retrieving gestures in a large-scale video dataset provided by the computer vision community, based on queries recorded in-the-wild. Our proposed pipeline, I3DEF, extracts spatio-temporal features from intermediate layers of an I3D network, a state-of-the-art network for action recognition, and fuses the feature maps obtained from RGB and optical flow input. The resulting embeddings are used to train a triplet network that captures the similarity between gestures. We further explore the effect of a person and body part masking step as a means of improving both retrieval performance and recognition rate. Our experiments show that I3DEF can recognize and retrieve gestures similar to the queries independently of the depth modality. This performance holds both for queries taken from the test data and for queries recorded from different people performing relevant gestures in a different setting.
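To make the fusion and triplet training steps concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the I3DEF implementation: the abstract does not specify the fusion rule, the distance, or the margin, so concatenation, Euclidean distance, and a margin of 1.0 are assumed here, and the names `fuse`, `euclidean`, and `triplet_loss` are hypothetical. The toy vectors stand in for features that would come from an I3D network.

```python
import math


def fuse(rgb_features, flow_features):
    """Fuse two feature vectors by concatenation (an assumed fusion rule)."""
    return list(rgb_features) + list(flow_features)


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: pull the positive closer than the negative.

    The loss is zero once the negative is at least `margin` farther from
    the anchor than the positive is. The margin value is an assumption.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


# Toy stand-ins for per-stream features (real ones would come from I3D layers).
anchor = fuse([1.0, 0.0], [0.0, 1.0])
positive = fuse([1.0, 0.0], [0.0, 1.0])   # same gesture as the anchor
negative = fuse([4.0, 0.0], [0.0, 5.0])   # different gesture

well_separated = triplet_loss(anchor, positive, negative)
violating = triplet_loss(anchor, negative, positive)
```

In this toy case the well-separated triplet incurs no loss, while swapping the positive and negative produces a positive loss that training would push down. At retrieval time, the same distance could be used to rank database videos against a query embedding.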
Index Terms
- Are You Watching Closely? Content-based Retrieval of Hand Gestures