
Volumetric Features for Video Event Detection

International Journal of Computer Vision

Abstract

Real-world actions often occur in crowded, dynamic environments. This poses a challenge for current approaches to video event detection, because distracting motion from other objects in the scene makes it difficult to segment the actor from the background. We propose a technique for event recognition in crowded videos that reliably identifies actions in the presence of partial occlusion and background clutter. Our approach is based on three key ideas: (1) we efficiently match the volumetric representation of an event against oversegmented spatio-temporal video volumes; (2) we augment our shape-based features using flow; (3) rather than treating an event template as an atomic entity, we match it by parts (in both space and time), enabling robustness against occlusions and actor variability. Our experiments on human actions, such as picking up a dropped object or waving in a crowd, show reliable detection with few false positives.
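To make idea (3) concrete, the following is a minimal toy sketch, not the authors' algorithm. It assumes an event template and a video are both given as sets of binary (t, y, x) voxels, scores a match by simple voxel coverage, and ignores the oversegmentation and flow features mentioned in the abstract. All function names (`box`, `shift`, `coverage`, `split_in_time`, `best_score`) are hypothetical and chosen only for illustration. It shows why matching temporal parts separately can tolerate an actor who drifts sideways partway through the action, where a rigid whole-template match cannot.

```python
from itertools import product

def box(t_range, y_range, x_range):
    """Return the set of (t, y, x) voxels inside an axis-aligned box."""
    return set(product(range(*t_range), range(*y_range), range(*x_range)))

def shift(voxels, dy, dx):
    """Translate a voxel set spatially by (dy, dx)."""
    return {(t, y + dy, x + dx) for (t, y, x) in voxels}

def coverage(template, video):
    """Fraction of template voxels that are also set in the video."""
    return len(template & video) / len(template)

def split_in_time(template, cut):
    """Split a template into an early and a late part at frame `cut`."""
    early = {v for v in template if v[0] < cut}
    late = {v for v in template if v[0] >= cut}
    return early, late

def best_score(part, video, max_shift=1):
    """Best coverage of `part` over small spatial shifts (toy search)."""
    shifts = range(-max_shift, max_shift + 1)
    return max(coverage(shift(part, dy, dx), video)
               for dy in shifts for dx in shifts)

# Toy template: 4 frames, 4 rows, 2 columns of "actor" voxels.
template = box((0, 4), (0, 4), (0, 2))

# Toy video: same shape, but the actor drifts one pixel right in frames 2-3.
video = box((0, 2), (0, 4), (0, 2)) | box((2, 4), (0, 4), (1, 3))

# Rigid match: one spatial shift must explain the whole template.
whole_score = best_score(template, video)

# Part-based match: each temporal part picks its own best shift.
early, late = split_in_time(template, cut=2)
parts_score = (best_score(early, video) + best_score(late, video)) / 2

print(whole_score, parts_score)
```

In this toy setting the rigid template can align with either the early or the late frames but not both, whereas each part finds its own alignment. A real system would also need to penalize implausible relative displacements between parts; that is omitted here.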



Author information

Corresponding author

Correspondence to Yan Ke.

About this article

Cite this article

Ke, Y., Sukthankar, R. & Hebert, M. Volumetric Features for Video Event Detection. Int J Comput Vis 88, 339–362 (2010). https://doi.org/10.1007/s11263-009-0308-z

  • DOI: https://doi.org/10.1007/s11263-009-0308-z
