Signal Processing

Volume 112, July 2015, Pages 74-82

Coupled hidden conditional random fields for RGB-D human action recognition

https://doi.org/10.1016/j.sigpro.2014.08.038

Highlights

  • We propose cHCRF to learn sequence-specific and sequence-shared temporal structure.

  • We contribute a novel RGB-D human action dataset containing 1200 samples.

  • Experiments on 3 popular datasets show the superiority of the proposed method.

Abstract

This paper proposes a human action recognition method based on a coupled hidden conditional random fields (cHCRF) model that fuses RGB and depth sequential information. The coupled model extends the standard hidden-state conditional random fields model, which handles only a single chain-structured observation sequence, to multiple chain-structured observation sequences, i.e., synchronized sequence data captured in multiple modalities. For model formulation, we propose a specific graph structure for the interaction among multiple modalities and design the corresponding potential functions. We then propose model learning and inference methods to discover the latent correlation between RGB and depth data as well as to model the temporal context within each individual modality. Extensive experiments show that the proposed model can boost the performance of human action recognition by taking advantage of the complementary characteristics of the RGB and depth modalities.

Introduction

Human action recognition is currently a hot research topic in computer vision and machine learning since it plays an essential role in applications such as intelligent visual surveillance and natural user interfaces. In particular, with the emergence of multiple sensors, such as depth cameras and laser scanners, we can capture the signals of human action in multiple modalities, and consequently multimodal human action recognition has become extremely popular in recent years [1], [2], [3], [4], [5].

The task of human action recognition is challenging because of the high variability of appearances and shapes and the potential occlusions. Related methods can be classified into two categories. The first is the space-time feature-based method. The extraction of space-time features usually involves local feature detectors and descriptors [6], [7], [8], [9]. The detectors typically rely on specific objective functions for the selection of X–Y–T locations; representative local feature detectors include Harris3D [10], Cuboid [11], 3D Hessian [12] and DSTIP [13] on RGB or depth imagery. Feature descriptors [12], [14], [15], [16], [17], [18], [19], [20] can then be computed to represent the characteristics of shape and motion around the detected local space-time points. With the recent advent of Kinect, depth cameras have received increasing attention, and many researchers are engaged in formulating depth-based local saliency descriptors [21], [22], [13]. Finally, the bag-of-words (BoW) method [23], [24] is usually leveraged for video representation and model learning. Probabilistic models can also be utilized to overcome constraints imposed by camera views [25], [26].

The second category focuses on learning the sequential dynamics within an action image sequence captured by a traditional RGB camera [27], [28], [29]. Graph-based methods [30], [31], [32], [33] for sequential modeling with single-modality information have been developed and evaluated on several benchmark datasets. Conditional random fields (CRF) [30] are designed for sequence annotation given an observation sequence and are able to incorporate both overlapping features and long-range dependencies into the model. CRF can also be utilized to recognize continuous human actions from unsegmented motion sequences [34]. However, CRF is limited because it cannot capture intermediate structure using hidden-state variables and it assumes the label sequences to be fully observable. Therefore, Quattoni et al. developed hidden conditional random fields (HCRF) [32] for action sequence modeling [35]. Morency et al. proposed the latent-dynamic conditional random field (LDCRF) to capture both inter-class and intra-class dynamics during human actions [33]. Recently, Liu et al. proposed a bidirectional-integrated random fields model for this task [36]. Different from the random fields models above, which model a human action sequence in one shot, this model leverages CRF for sequence segmentation and HCRF for sequence classification, and bridges the two by modifying the feature functions so that classification and segmentation information propagate between them. Consequently, the sequence classification result from HCRF and the sequence segmentation result from CRF direct each other's decision making, and the performance of both models is boosted iteratively.
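For reference, the standard HCRF of [32] scores a class label y for an observation sequence x through a layer of hidden states h = (h_1, ..., h_T), one per frame. A minimal sketch of its general form, written in our own notation rather than the paper's, is

    P(y \mid \mathbf{x}; \theta) = \frac{\sum_{\mathbf{h}} \exp \Psi(y, \mathbf{h}, \mathbf{x}; \theta)}{\sum_{y'} \sum_{\mathbf{h}} \exp \Psi(y', \mathbf{h}, \mathbf{x}; \theta)},

where the potential \Psi typically sums per-frame observation terms and transition terms between neighboring hidden states along a single chain. The coupled model introduced in this paper generalizes this single-chain structure to multiple chains.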

The release of the Microsoft Kinect and other advanced sensors has made it convenient to capture RGB images, depth images, and even other useful information simultaneously with affordable devices. Since such multimodal signals represent the same scene in different modalities and are consequently complementary to each other, fusing all of them for sequence modeling benefits human action recognition [37], [38], [39], [40]. However, fusing multimodal sequence information automatically is extremely challenging for two reasons: (1) multimodal features usually belong to different feature subspaces, and the variations lying in different manifold spaces may cause asynchrony [41], [42]; (2) the feature dimensions of the modalities may differ, and the modality with a markedly higher dimension would dominate the overall performance [43]. To tackle these problems, a novel and challenging research topic, the fusion of multimodal sequential signals, has appeared [44]. The intuitive approaches are feature-level and decision-level fusion [45]. Wang et al. [46] integrated local binary patterns and the histogram of oriented gradients for support vector machine learning; their experiments showed that the high-dimensional fused feature incurs high computational complexity. Comparatively, Spinello and Arras [47] fused the detection results of the HOD descriptor in the depth image and the HOG descriptor in the color image via a weighted fusion method. Gao proposed a sophisticated distance learning method for similarity measurement [48]. Although these methods can improve performance with multimodal information, they ignore the sequential structure learning used for inference in [32], [33], [49]. However, current sequential models, such as HCRF and LDCRF, cannot be directly applied to sequential signal fusion because of the deficiency of their single-chain graph structure. On one hand, when the features from the individual views are concatenated [50], the exponentially increasing number of states in the latent space means the model needs a large amount of training data to estimate the underlying distributions, which makes model learning impractical for real applications. On the other hand, an improper correlation heuristically imposed among different sequences by hand may lead to worse performance, since there exist asynchrony and noise among the sequential data [51]. Therefore, it is essential to learn the latent dynamics in the multimodal data. Brand et al. [39] proposed a coupled hidden Markov model for human action recognition. Chen et al. [52] presented a multi-view latent space Markov network for multimodal object grouping.
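To make the state-space argument concrete, consider an illustrative count (the numbers are hypothetical, not from the paper): if each modality is modeled well with K hidden states, a single-chain model over concatenated features must effectively cover a joint hidden space of size

    |\mathcal{H}_{RGB}| \times |\mathcal{H}_{D}| = K \times K,

e.g. 16 \times 16 = 256 joint states per frame versus 16 + 16 = 32 states for two coupled chains, with the number of transition parameters growing quadratically in the per-frame state count. This is why simple concatenation quickly becomes data-hungry.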

To tackle this problem, we propose a human action recognition method based on the coupled hidden conditional random fields (cHCRF) model. To represent an action sample captured simultaneously in the RGB and depth modalities, the visual feature of the human region in each RGB/depth frame is extracted to form a visual feature sequence for each modality. Both visual feature sequences are then input into the proposed cHCRF for model learning. To learn the modality-specific and modality-shared knowledge, we design a specific graph structure and propose the corresponding model learning and inference methods (a rough illustrative sketch of this pipeline is given after the contribution list below). We demonstrate the superiority of the proposed method on three popular RGB-D human action datasets: DHA [53], UTKinect [54], and TJU, which we prepared ourselves. The contributions lie in two aspects:

  • Different from direct feature-level and decision-level fusion, the proposed cHCRF can learn the temporal structure within each individual RGB/depth sequence and transfer the common structural information between the sequences. It therefore preserves the dynamics of each individual sequence while sharing the complementary information from the different modalities.

  • We contribute a novel RGB-D human action dataset (TJU) to the community. The dataset not only covers most of the popular action categories of KTH [55], DHA [53] and so on but also contains action samples in both light and dark environments. To our knowledge, TJU, containing 1200 samples, is the largest RGB-D human action dataset to date. The dataset can be downloaded from http://media.tju.edu.cn/tju_dataset.html.
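As referenced above, a rough illustrative sketch of the feature-extraction and learning pipeline follows; the per-frame descriptor and the cHCRF learner interface are hypothetical placeholders, not the authors' implementation:

    import numpy as np

    def extract_frame_features(frame, dim=256):
        # Hypothetical per-frame descriptor of the human region; any fixed-length
        # feature (e.g. a shape/motion histogram) could be substituted here.
        v = np.asarray(frame, dtype=np.float32).ravel()
        v = np.resize(v, dim)
        return v / (np.linalg.norm(v) + 1e-8)

    def build_sequences(rgb_frames, depth_frames):
        # One feature vector per frame, kept as two synchronized sequences
        # (the two chains fed to the coupled model).
        x_rgb = np.stack([extract_frame_features(f) for f in rgb_frames])
        x_d = np.stack([extract_frame_features(f) for f in depth_frames])
        return x_rgb, x_d

    # A hypothetical cHCRF learner would then consume pairs of synchronized
    # sequences together with their action labels, e.g.:
    #   model = CoupledHCRF(num_hidden_states=8)   # placeholder interface
    #   model.fit(train_sequence_pairs, train_labels)
    #   y_pred = model.predict((x_rgb, x_d))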

The rest of the paper is organized as follows. Section 2 presents the coupled hidden conditional random fields model. Section 3 introduces the experimental method and Section 4 illustrates the experimental results. Finally, we conclude the paper in Section 5.

Section snippets

Model formulation

In this section, we detail the proposed coupled hidden conditional random fields (cHCRF) model. In particular, we design the specific graph structure and the corresponding potential functions for cHCRF and propose the methods for model learning and inference.

Consider that each training/test action sample X consists of two synchronized sequences captured in the RGB and depth modalities, namely X = {x^RGB, x^D}, where x^RGB and x^D are the observation sequences in the RGB and depth modalities, respectively.
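As a hedged sketch of the general form such a two-chain model takes (our notation and assumptions, not necessarily the authors' exact potentials), the class posterior can be written as

    P(y \mid \mathbf{x}^{RGB}, \mathbf{x}^{D}; \theta) = \frac{\sum_{\mathbf{h}^{RGB}, \mathbf{h}^{D}} \exp \Psi(y, \mathbf{h}^{RGB}, \mathbf{h}^{D}, \mathbf{x}^{RGB}, \mathbf{x}^{D}; \theta)}{\sum_{y'} \sum_{\mathbf{h}^{RGB}, \mathbf{h}^{D}} \exp \Psi(y', \mathbf{h}^{RGB}, \mathbf{h}^{D}, \mathbf{x}^{RGB}, \mathbf{x}^{D}; \theta)},

where \Psi would decompose into within-chain terms (observation and transition potentials modeling the temporal context of each modality) and cross-chain terms linking h_t^{RGB} and h_t^{D} (the latent correlation between modalities).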

Data

The proposed method is evaluated on three popular RGB-D datasets as shown in Fig. 2.

DHA: Lin et al. released a depth-included human action video dataset (DHA) [53]. DHA contains 17 action categories: (1) bend, (2) jack, (3) jump, (4) one-hand-wave, (5) pjump, (6) run, (7) side, (8) skip, (9) two-hand-wave, (10) walk, (11) clap-front, (12) arm-swing, (13) kick-leg, (14) pitch, (15) swing, (16) boxing, and (17) tai-chi. Each action was performed by 21 people (12 males and 9 females) and there are

Experimental results

For cHCRF model learning, the number of hidden states in each sub-modality may have a significant influence on performance. Therefore, it should be selected adaptively by cross validation. We varied the number of hidden states from 4 to 16 per modality and plotted the ROC curve for each hidden-state number. The best parameter is selected when the area under the curve (AUC) for the corresponding hidden-state number reaches its maximum. From Fig. 3 we can see that the cHCRF model on each dataset can
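A minimal sketch of this selection procedure, assuming scikit-learn for the AUC computation and an arbitrary sequence classifier supplied through make_model (no public cHCRF implementation is assumed here):

    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import roc_auc_score

    def select_num_hidden_states(seq_pairs, labels, make_model,
                                 candidates=range(4, 17), n_folds=5):
        # Pick the hidden-state count whose cross-validated one-vs-rest AUC is
        # highest; make_model(k) must return an object with fit/predict_proba.
        labels = np.asarray(labels)
        best_k, best_auc = None, -np.inf
        skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=0)
        for k in candidates:
            fold_aucs = []
            for tr, va in skf.split(np.zeros(len(labels)), labels):
                model = make_model(k)
                model.fit([seq_pairs[i] for i in tr], labels[tr])
                scores = model.predict_proba([seq_pairs[i] for i in va])
                fold_aucs.append(roc_auc_score(labels[va], scores, multi_class='ovr'))
            mean_auc = float(np.mean(fold_aucs))
            if mean_auc > best_auc:
                best_k, best_auc = k, mean_auc
        return best_k, best_auc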

Conclusion

In this paper we propose a coupled hidden conditional random fields model for human action recognition. With the designed graph structure and the corresponding potential functions, the proposed cHCRF model can take advantage of both the temporal context within the individual RGB/depth sequential data and the latent correlation between them to boost performance. The comparison experiments show that the proposed model can benefit the fusion of multimodal sequential data and consequently outperforms the

Acknowledgments

The authors would like to thank the anonymous reviewers for the constructive suggestions.

This work was supported in part by the National Natural Science Foundation of China (61472275, 61100124, 21106095, 61170239, 61202168), the Grant of Elite Scholar Program of Tianjin University, the Grant of Introducing Talents to Tianjin Normal University (5RL123), the Grant of Introduction of One Thousand High-level Talents in Three Years in Tianjin.

References (61)

  • A. Liu, N. Xu, Y. Su, H. Lin, T. Hao, Z. Yang, Single/multi-view human action recognition via regularized multi-task...
  • A. Liu et al., Partwise bag of words-based multi-task learning for human action recognition, Electron. Lett. (2013).
  • I. Laptev, T. Lindeberg, Space-time interest points, in: ICCV'03, 2003, pp....
  • P. Dollar, V. Rabaud, G. Cottrell, S. Belongie, Behavior recognition via sparse spatio-temporal features, in: VS-PETS,...
  • G. Willems, T. Tuytelaars, L.J.V. Gool, An efficient dense and scale-invariant spatio-temporal interest point detector,...
  • L. Xia, J. Aggarwal, Spatio-temporal depth cuboid similarity feature for activity recognition using depth camera, in:...
  • I. Laptev, Local Spatio-Temporal Image Features for Motion Interpretation (Ph.D. thesis), Department of Numerical...
  • Y. Yang, Y. Gao, H. Zhang, J. Shao, T. Chua, Image tagging with social assistance, in: ICMR'14, 2014, pp....
  • I. Laptev, T. Lindeberg, Local descriptors for spatio-temporal recognition, in: First International Workshop on Spatial...
  • P. Scovanner, S. Ali, M. Shah, A 3-dimensional sift descriptor and its application to action recognition, in: ACM...
  • H. Jhuang, T. Serre, L. Wolf, T. Poggio, A biologically inspired system for action recognition, in: ICCV'07, 2007, pp....
  • I. Laptev, M. Marszalek, C. Schmid, B. Rozenfeld, Learning realistic human actions from movies, in: CVPR'08,...
  • Z. Gao et al., Human action recognition using pyramid histograms of oriented gradients and collaborative multi-task learning, KSII Trans. Internet Inf. Syst. (2014).
  • Y. Zhao, Z. Liu, L. Yang, H. Cheng, Combining rgb and depth map features for human activity recognition, in: APSIPA...
  • O. Oreifej, Z. Liu, Hon4d: histogram of oriented 4d normals for activity recognition from depth sequences, in: CVPR'13,...
  • H. Wang, M.M. Ullah, A. Klaser, I. Laptev, C. Schmid, Evaluation of local spatio-temporal features for action...
  • R. Ji et al., Location discriminative vocabulary coding for mobile landmark search, Int. J. Comput. Vis. (2012).
  • Y. Gao et al., Camera constraint-free view-based 3-d object retrieval, IEEE Trans. Image Process. (2012).
  • L. Zhang et al., Probabilistic graphlet transfer for photo cropping, IEEE Trans. Image Process. (2013).
  • Y. Su et al., Max margin discriminative random fields for multimodal human action recognition, Electron. Lett. (2014).