Upper Body Detection and Tracking in Extended Signing Sequences

International Journal of Computer Vision

Abstract

The goal of this work is to detect and track the articulated pose of a human in signing videos of more than one hour in length. In particular, we wish to accurately localise hands and arms, despite fast motion and a cluttered and changing background.

We cast the problem as inference in a generative model of the image, and propose a complete model which accounts for self-occlusion of the arms. Under this model, limb detection is expensive due to the very large number of possible configurations each part can assume. We make the following contributions to reduce this cost: (i) efficient sampling from a pictorial structure proposal distribution to obtain reasonable configurations; (ii) identifying a large number of frames where configurations can be correctly inferred, and exploiting temporal tracking elsewhere.

Results are reported for signing footage with challenging image conditions and for different signers. We show that the method is able to identify the true arm and hand locations with high reliability. The results exceed the state of the art for the length and stability of continuous limb tracking.
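
Contribution (i) above, sampling from a pictorial structure proposal distribution, can be illustrated with a generic sketch. The abstract does not give the details of the proposal distribution, so the following is only a toy example of a standard technique: exact sampling from a chain-structured model with discrete part states, using a sum-product pass followed by ancestral sampling. The function name `sample_chain`, the three-part chain, the four states per part, and the smoothness potential are all assumptions made for illustration. The sketch does not model self-occlusion, which the full model in the paper accounts for.

```python
import numpy as np

def sample_chain(unary, pairwise, n_samples, rng):
    """Sample part configurations from a chain-structured model (hypothetical helper).

    The distribution is p(x) proportional to
        prod_i unary[i][x_i] * prod_i pairwise[i][x_i, x_{i+1}].
    unary:    list of 1-D non-negative arrays, one per part.
    pairwise: list of 2-D non-negative arrays linking part i to part i+1.
    """
    n = len(unary)
    # Upward (sum-product) pass: msg[i][a] sums the weight of all
    # configurations of parts i+1..n-1, given that part i is in state a.
    msg = [None] * n
    msg[n - 1] = np.ones_like(unary[n - 1])
    for i in range(n - 2, -1, -1):
        msg[i] = pairwise[i] @ (unary[i + 1] * msg[i + 1])

    samples = np.empty((n_samples, n), dtype=int)
    for s in range(n_samples):
        # Sample the root part from its marginal.
        p = unary[0] * msg[0]
        x = rng.choice(len(p), p=p / p.sum())
        samples[s, 0] = x
        # Sample each remaining part conditioned on its parent's state.
        for i in range(1, n):
            p = pairwise[i - 1][x] * unary[i] * msg[i]
            x = rng.choice(len(p), p=p / p.sum())
            samples[s, i] = x
    return samples

# Toy usage: 3 parts (e.g. shoulder, elbow, wrist), 4 candidate states each.
rng = np.random.default_rng(0)
K = 4
unary = [rng.random(K) + 0.1 for _ in range(3)]            # stand-in appearance scores
idx = np.arange(K)
smooth = np.exp(-np.abs(idx[:, None] - idx[None, :]))      # assumed: nearby states preferred
samples = sample_chain(unary, [smooth, smooth], n_samples=5, rng=rng)
```

Each row of `samples` is one candidate configuration. In a setting like the one the abstract describes, such samples would then be scored under the complete (more expensive) model, so that only a small number of plausible configurations needs to be evaluated.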

Author information

Corresponding author

Correspondence to Mark Everingham.

About this article

Cite this article

Buehler, P., Everingham, M., Huttenlocher, D.P. et al. Upper Body Detection and Tracking in Extended Signing Sequences. Int J Comput Vis 95, 180–197 (2011). https://doi.org/10.1007/s11263-011-0480-9

  • DOI: https://doi.org/10.1007/s11263-011-0480-9
