Abstract
Hand motion capture is a popular research field that has recently gained more attention due to the ubiquity of RGB-D sensors. However, even the most recent approaches focus on the case of a single isolated hand. In this work, we focus on hands that interact with other hands or objects and present a framework that successfully captures motion in such interaction scenarios for both rigid and articulated objects. Our framework combines a generative model with discriminatively trained salient points to achieve a low tracking error, and with collision detection and physics simulation to achieve physically plausible estimates even in the case of occlusions and missing visual data. Since all components are unified in a single objective function that is differentiable almost everywhere, it can be optimized with standard optimization techniques. Our approach works for monocular RGB-D sequences as well as for setups with multiple synchronized RGB cameras. For a qualitative and quantitative evaluation, we captured 29 sequences with a large variety of interactions and up to 150 degrees of freedom.
Notes
All annotated sequences are available at http://files.is.tue.mpg.de/dtzionas/hand-object-capture.html
Acknowledgments
Financial support was provided by the DFG Emmy Noether program (GA 1927/1-1).
Additional information
Communicated by Junsong Yuan, Wanqing Li, Zhengyou Zhang, David Fleet, Jamie Shotton.
Cite this article
Tzionas, D., Ballan, L., Srikantha, A. et al. Capturing Hands in Action Using Discriminative Salient Points and Physics Simulation. Int J Comput Vis 118, 172–193 (2016). https://doi.org/10.1007/s11263-016-0895-4
DOI: https://doi.org/10.1007/s11263-016-0895-4