DOI: 10.1145/3372278.3390723
short-paper

Are You Watching Closely? Content-based Retrieval of Hand Gestures

Published: 08 June 2020

ABSTRACT

Gestures play an important role in our daily communication. However, recognizing and retrieving gestures in the wild is a challenging task that has not been explored thoroughly in the literature. In this paper, we explore the problem of identifying and retrieving gestures in a large-scale video dataset provided by the computer vision community, based on queries recorded in the wild. Our proposed pipeline, I3DEF, extracts spatio-temporal features from intermediate layers of an I3D network, a state-of-the-art network for action recognition, and fuses the feature maps obtained from RGB and optical flow input. The resulting embeddings are used to train a triplet network that captures the similarity between gestures. We further explore the effect of a person and body part masking step for improving both retrieval performance and recognition rate. Our experiments show that I3DEF can recognize and retrieve gestures similar to the queries independently of the depth modality. This performance holds both for queries taken from the test data and for queries using recordings of different people performing relevant gestures in a different setting.
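The abstract names three building blocks: per-stream features, fusion of the RGB and optical flow streams, and a triplet objective over the fused embeddings. The following is a minimal sketch of how such pieces could fit together, not the authors' implementation. Everything concrete in it is an assumption made for illustration: fusion by L2-normalising and concatenating the two vectors, the squared Euclidean distance, the margin of 0.2, and all function names. The I3D feature extraction itself is not shown; plain lists stand in for its output.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (a zero vector is returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def fuse(rgb_feat, flow_feat):
    """Fuse two per-stream feature vectors.

    Assumption: each stream is L2-normalised and the results are concatenated.
    The abstract only states that the RGB and optical flow outputs are fused.
    """
    return l2_normalize(rgb_feat) + l2_normalize(flow_feat)


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The margin value is hypothetical and chosen only for this example.
    """
    d_ap = squared_distance(anchor, positive)
    d_an = squared_distance(anchor, negative)
    return max(0.0, (d_ap - d_an) + margin)


def rank_by_similarity(query, gallery):
    """Return gallery indices ordered from nearest to farthest from the query."""
    return sorted(range(len(gallery)), key=lambda i: squared_distance(query, gallery[i]))


# Toy stand-ins for per-stream features of three gesture clips.
anchor = fuse([1.0, 0.0], [0.0, 1.0])
positive = fuse([2.0, 0.0], [0.0, 3.0])   # same direction as the anchor in both streams
negative = fuse([0.0, 1.0], [1.0, 0.0])   # different direction in both streams

loss_ok = triplet_loss(anchor, positive, negative)    # triplet already satisfied
loss_bad = triplet_loss(anchor, negative, positive)   # roles swapped, so violated
ranking = rank_by_similarity(anchor, [negative, positive])
```

In this toy setup a satisfied triplet contributes no loss, a violated one contributes a positive loss, and retrieval reduces to sorting gallery embeddings by distance to the query embedding.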


Published in

ICMR '20: Proceedings of the 2020 International Conference on Multimedia Retrieval
June 2020
605 pages
ISBN: 9781450370875
DOI: 10.1145/3372278

Copyright © 2020 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall acceptance rate: 254 of 830 submissions, 31%
