Abstract
We address the recognition of human actions in uncontrolled videos that may contain complex temporal structures. The problem is difficult because of large intra-class variations in viewpoint, video length, motion pattern, etc. To address these difficulties, we propose a novel system that represents each action class by hidden temporal models. In this system, the crucial action event of each category is represented by a video segment that covers a fixed number of frames and can move temporally within the sequence. To capture temporal structure, the segment is described by a temporal pyramid model. To handle large intra-class variations, multiple models are combined via an Or operation to represent alternative structures. The model index and the start frame of the segment are both treated as hidden variables, and we implement a learning procedure based on the latent SVM method. The proposed approach is evaluated on two difficult benchmarks: the Olympic Sports and HMDB51 datasets. The experimental results show that our system is comparable to state-of-the-art methods in the literature.
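The scoring scheme the abstract describes can be summarized as a maximization over the two hidden variables: the index of the Or-combined model and the start frame of the fixed-length segment, with each candidate segment described by a temporal pyramid. The following is a minimal illustrative sketch, not the authors' implementation; the frame-feature matrix, the pyramid depth, and the function names are assumptions for illustration.

```python
import numpy as np

def pyramid_features(video_feats, start, seg_len, levels=2):
    """Temporal pyramid descriptor of a fixed-length segment.

    Concatenates mean-pooled frame features over the whole segment
    and over progressively finer temporal subdivisions.
    """
    seg = video_feats[start:start + seg_len]            # (seg_len, d)
    parts = []
    for level in range(levels):
        n_cells = 2 ** level                            # 1, 2, 4, ... cells
        for cell in np.array_split(seg, n_cells):
            parts.append(cell.mean(axis=0))
    return np.concatenate(parts)                        # (d * (2**levels - 1),)

def score(video_feats, models, seg_len):
    """Max over the hidden variables: model index and segment start frame."""
    n_frames = len(video_feats)
    best = -np.inf
    for w in models:                                    # Or over alternative models
        for start in range(n_frames - seg_len + 1):     # segment slides temporally
            best = max(best, w @ pyramid_features(video_feats, start, seg_len))
    return best
```

In a latent SVM setting, each weight vector `w` would be learned discriminatively while the maximizing `(model, start)` pair is re-estimated during training; the sketch above only shows the inference-time scoring.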
Acknowledgments
This work was supported by the National Basic Research Program of China (2013CB329401), the Natural Science Foundation of China (61375034, 61203263), and the NUDT Open Project of the National Key Lab of High Performance Computing.
Cite this article
Wu, J., Hu, D. & Chen, F. Action recognition by hidden temporal models. Vis Comput 30, 1395–1404 (2014). https://doi.org/10.1007/s00371-013-0899-9