Elsevier

Pattern Recognition

Volume 59, November 2016, Pages 55-62

Sparsity-inducing dictionaries for effective action classification

https://doi.org/10.1016/j.patcog.2016.03.011

Highlights

  • Sparsity-inducing dictionaries serve as an effective representation for action classification in videos.

  • Features obtained from the sparsity-based representation provide sufficient discriminative information for classifying action videos.

  • The constructed dictionaries are distinct across a large number of action classes, resulting in a significant improvement in classification accuracy.

Abstract

Action recognition in unconstrained videos is one of the most important challenges in computer vision. In this paper, we propose sparsity-inducing dictionaries as an effective representation for action classification in videos. We demonstrate that features obtained from a sparsity-based representation provide discriminative information useful for classifying action videos into various action classes. We show that the constructed dictionaries are distinct for a large number of action classes, resulting in a significant improvement in classification accuracy on the HMDB51 dataset. We further demonstrate the efficacy of dictionaries and sparsity-based classification on other large action video datasets such as UCF50.

Introduction

Action recognition is the process of extracting human action patterns from real video streams. It has diverse applications, such as automated indexing of large online video repositories like YouTube and Vimeo, analysis of video surveillance in public places, human-computer interaction, and sports analysis. Actions are defined as single-person activities like “walking”, “waving”, or “punching”. If a video contains only one distinct human action, the task is to classify the video into one of several categories. It has been shown in [1] that both spatial and temporal information are important for action representation. However, features that are shared across action classes are not suitable for building discriminative dictionaries. For example, “running” is part of both “cricket bowling” and “soccer penalty”, and the main action (bowling or penalty taking) occupies only a small fraction of the video's duration. Hence, it is difficult to classify such actions reliably with spatio-temporal descriptors alone. Action bank [2] captures the similarity of a video with its own class and its dissimilarity with other classes. Since running occurs before bowling (or penalty taking), this temporal dependence can be exploited to produce a more distinctive representation for “cricket bowling” (or “soccer penalty”) that is useful for classification.

In this work, we construct sparsity-inducing dictionaries built specifically for action classification. Such a sparse dictionary based representation highlights discriminative information about the various action classes, and the dictionaries distinctly represent the different action classes of the HMDB51 dataset. Since dictionary learning has no strict convergence criterion, the dictionaries are trained until reasonable classification performance is obtained. On the HMDB51 dataset, which contains many diverse and challenging views of human actions, the dictionaries achieve a very low misclassification rate.

The rest of the paper is organized as follows. Section 2 provides an overview of the various feature descriptors and sparsity-based methods that have been applied to action classification. Section 3 presents the proposed sparsity-based classification scheme in detail. Section 4 describes the performance of the proposed approach on two large action datasets, UCF50 and HMDB51. Finally, Section 5 concludes the paper.


Related work and analysis

The challenges in action recognition have been studied with great interest in the computer vision community. Schuldt et al. [3] introduced the KTH [4] dataset, which consists of six action categories; a support vector machine (SVM) with local space-time features was used for classification. In [5], Kläser et al. presented the histogram of oriented 3D spatio-temporal gradients, essentially a set of quantized 2D histograms computed from each frame of the video. Kuehne et al. [6]

Sparsity-inducing dictionaries for action classification

In this section, a detailed discussion of the proposed method is presented. The classification scheme in typical dictionary learning consists of two phases: dictionary construction from training examples (training) and sparsity-based evaluation of a test clip (testing). A detailed block diagram of the entire approach is given in Fig. 1. In the training phase, a dictionary is constructed for each class using online dictionary learning (ODL); these per-class dictionaries are then concatenated to form a single dictionary.
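The two-phase scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: it uses scikit-learn's MiniBatchDictionaryLearning as a stand-in for the SPAMS online dictionary learning routine, toy low-rank Gaussian features in place of real video descriptors, and an SRC-style minimum-residual rule for the sparsity-based evaluation; the feature dimension, atom counts, and sparsity level are all illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import orthogonal_mp

rng = np.random.default_rng(0)
n_feat, atoms_per_class, n_nonzero = 64, 16, 8   # illustrative sizes

# Toy stand-in for per-class video descriptors: each class occupies
# its own low-dimensional subspace of the feature space.
classes = ["wave", "punch"]
bases = {c: rng.normal(size=(n_feat, 8)) for c in classes}
def sample(c, n):
    return rng.normal(size=(n, 8)) @ bases[c].T

# Training phase: learn one dictionary per class via (mini-batch)
# online dictionary learning, then concatenate the per-class atoms
# into a single joint dictionary.
per_class = {}
for c in classes:
    odl = MiniBatchDictionaryLearning(n_components=atoms_per_class,
                                      batch_size=32, random_state=0)
    per_class[c] = odl.fit(sample(c, 200)).components_  # (atoms, n_feat)
D = np.vstack([per_class[c] for c in classes])          # joint dictionary

# Testing phase: sparse-code the test descriptor over the joint
# dictionary, then assign the class whose atoms alone give the
# smallest reconstruction residual (an SRC-style decision rule).
def classify(x):
    alpha = orthogonal_mp(D.T, x, n_nonzero_coefs=n_nonzero)
    errs = {}
    for i, c in enumerate(classes):
        a = np.zeros_like(alpha)
        block = slice(i * atoms_per_class, (i + 1) * atoms_per_class)
        a[block] = alpha[block]          # keep only class-c coefficients
        errs[c] = np.linalg.norm(x - D.T @ a)
    return min(errs, key=errs.get)

print(classify(sample("punch", 1)[0]))
```

Because the per-class dictionaries are concatenated, the sparse code of a test clip tends to concentrate on the atoms of its own class, which is what makes the residual comparison discriminative.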

Results and evaluation

In this section, a critical evaluation of the proposed method is presented. The main goal is to establish the robustness of sparse representation on large datasets like HMDB51 and UCF50. Further evaluation is done to determine the optimal dictionary size with respect to classification accuracy.
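A dictionary-size sweep of the kind described above can be mimicked on synthetic data. This is a hypothetical sketch: it uses toy subspace features instead of HMDB51/UCF50 descriptors and scikit-learn's MiniBatchDictionaryLearning in place of SPAMS ODL, so the accuracies it prints say nothing about the paper's actual numbers; only the procedure (train per-class dictionaries of varying size, classify held-out samples by minimum residual) is being illustrated.

```python
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning
from sklearn.linear_model import orthogonal_mp

rng = np.random.default_rng(0)
n_feat, n_classes, rank = 64, 3, 8
# Each toy class lives in its own random low-dimensional subspace.
bases = [rng.normal(size=(n_feat, rank)) for _ in range(n_classes)]
train = [rng.normal(size=(200, rank)) @ B.T for B in bases]
test = [rng.normal(size=(30, rank)) @ B.T for B in bases]

def accuracy(atoms_per_class):
    """Residual-rule classification accuracy for a given dictionary size."""
    D = np.vstack([
        MiniBatchDictionaryLearning(n_components=atoms_per_class,
                                    batch_size=32, random_state=0)
        .fit(X).components_
        for X in train])
    hits = total = 0
    for label, Xt in enumerate(test):
        for x in Xt:
            alpha = orthogonal_mp(D.T, x, n_nonzero_coefs=rank)
            errs = []
            for c in range(n_classes):
                a = np.zeros_like(alpha)
                block = slice(c * atoms_per_class, (c + 1) * atoms_per_class)
                a[block] = alpha[block]
                errs.append(np.linalg.norm(x - D.T @ a))
            hits += int(np.argmin(errs) == label)
            total += 1
    return hits / total

for k in (4, 8, 16):      # candidate per-class dictionary sizes
    print(k, accuracy(k))
```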

Conclusion

The main goal of this work was to study dictionaries as an effective representation for action classification in videos. Sparse representation of multi-frame features was exploited to obtain discriminative dictionaries. It was shown that these dictionaries distinctly represent the different action classes, and that dictionaries learned from action bank features yielded a four-fold improvement in classification accuracy over naïve action bank features on the HMDB51

Conflict of interest

None declared.

Acknowledgment

We would like to thank Dr. Jason Corso for making the Action Bank features available for the UCF50 and HMDB51 datasets. We would also like to thank Dr. Julian Mairal for the SPAMS toolbox.


References

  • S. Lu et al.

    Fast human action classification and VOI localization with enhanced sparse coding

    J. Vis. Commun. Image Represent.

    (2013)
  • H. Wang, C. Schmid, Action recognition with improved trajectories, in: International Conference on Computer Vision...
  • S. Sadanand, J. Corso, Action bank: a high-level representation of activity in video, in: IEEE Conference on Computer...
  • C. Schuldt, I. Laptev, B. Caputo, Recognizing human actions: a local SVM approach, in: Proceedings of the 17th...
  • I. Laptev, B. Caputo,...
  • A. Kläser, M. Marszalek, C. Schmid, A spatio-temporal descriptor based on 3d-gradients, in: British Machine Vision...
  • H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, T. Serre, HMDB: a large video database for human motion recognition, in:...
  • H. Jhuang,...
  • O. Kliper-Gross, Y. Gurovich, T. Hassner, L. Wolf, Motion interchange patterns for action recognition in unconstrained...
  • B. Solmaz et al.

    Classifying web videos using a global video descriptor

    Mach. Vis. Appl.

    (2013)
  • Y.-G. Jiang, Q. Dai, X. Xue, W. Liu, C.-W. Ngo, Trajectory-based modeling of human actions with motion reference...
  • J. Wu et al.

    Learning effective event models to recognize a large number of human actions

    IEEE Trans. Multimed.

    (2014)
  • X. Liang, L. Lin, L. Cao, Learning latent spatio-temporal compositional model for human action recognition, in:...
  • A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, L. Fei-Fei, Large-scale video classification with...
  • J.Y. Ng, M.J. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, G. Toderici, Beyond short snippets: deep networks...
  • X. Liang et al.

    Deep human parsing with active template regression

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2015)
  • G. Gkioxari, R.B. Girshick, J. Malik, Contextual action recognition with R*CNN, CoRR abs/1505.01197, URL...
  • L. Lin et al.

    A deep structured model with radius-margin bound for 3d human activity recognition

    Int. J. Comput. Vis.

    (2015)
  • K. Wang, X. Wang, L. Lin, M. Wang, W. Zuo, 3d human activity recognition with reconfigurable convolutional neural...
  • L. Sun, K. Jia, D.-Y. Yeung, B.E. Shi, Human action recognition using factorized spatio-temporal convolutional...
  • Q. Qiu, Z. Jiang, R. Chellappa, Sparse dictionary-based representation and recognition of action attributes, in: 2011...
  • A. Castrodad et al.

    Sparse modeling of human actions from motion imagery

    Int. J. Comput. Vis.

    (2012)

    Debaditya Roy is currently pursuing his Ph.D. in the Department of Computer Science and Engineering, Indian Institute of Technology, Hyderabad. He graduated with a silver medal in M.Tech., computer science from the Department of Computer Science and Engineering, National Institute of Technology, Rourkela, India in 2013. He received his Bachelor of Technology in computer science and engineering from West Bengal University of Technology in 2011. His research interests include deep learning, generative models and feature selection.

    M. Srinivas received his Ph.D. in computer science and engineering from Indian Institute of Technology, Hyderabad, India in 2015. He received his M.Tech. in computer science from Jawaharlal Nehru Technological University, Hyderabad, India in 2009. His research interests include sparsity based methods, deep learning and biomedical imaging.

    C. Krishna Mohan is currently an associate professor with the Department of Computer Science and Engineering, Indian Institute of Technology, Hyderabad, India. He received his Ph.D. in computer science and engineering from Indian Institute of Technology, Madras, India in 2007. He received the Master of Technology in system analysis and computer applications from National Institute of Technology, Surathkal, India in 2000. He received the Master of Computer Applications degree from S. J. College of Engineering, Mysore, India in 1991 and the Bachelor of Science Education (B.Sc.Ed) degree from Regional Institute of Education in 1988. His research interests include video content analysis, pattern recognition, and neural networks.
