Jointly Learning Multiple Sequential Dynamics for Human Action Recognition

Discovering visual dynamics during human actions is a challenging task for human action recognition. To deal with this problem, we theoretically propose the multi-task conditional random fields model and explore its application on human action recognition. For visual representation, we propose the part-induced spatiotemporal action unit sequence to represent each action sample with multiple partwise sequential feature subspaces. For model learning, we propose the multi-task conditional random fields (MTCRFs) model to discover the sequence-specific structure and the sequence-shared relationship. Specifically, the multi-chain graph structure and the corresponding probabilistic model are designed to represent the interaction among multiple part-induced action unit sequences. Moreover we propose the model learning and inference methods to discover temporal context within individual action unit sequence and the latent correlation among different body parts. Extensive experiments are implemented to demonstrate the superiority of the proposed method on two popular RGB human action datasets, KTH & TJU, and the depth dataset in MSR Daily Activity 3D.


Introduction
Human action recognition is an essential issue in computer vision and machine learning due to its wide and significant applications on multimedia content analysis and retrieval, human computer interaction and so on [1]. Recently, the importance is strongly highlighted by the urgent need for intelligent video surveillance in security sensitive environments.

Motivation and Overview
Although the current methods have shown the superior performance on this task, there still exist two problems.
• Only leveraging the global characteristics of one action while seldomly considering the body part information. One kind of methods utilize the bag-of-visual-word model with space-time interest point detectors and descriptors for global visual representation [2] [3]. The silhouette-based method is another representative method, which has been widely utilized for action recognition [4]. However, this kind of methods highly rely on accurate foreground extractions, which seriously limits their application. Since human action usually shows a dynamic visual pattern, many sophisticated sequential modeling methods have been designed for this task. Grammar-based methods [5,6] have been successfully applied for complex human activity recognition. Pei et al. [7] proposed a Stochastic Context Sensitive Grammar-based method for video event prediction. They constructed the hierarchical structure to represent the spatial and temporal relationships between the sub-events. The syntactic-based methods are another widely used method to represent human activity with high level temporal logic complexity. Hamid et al. [8] proposed to encode the sequential dynamic within human activities with a set of local action units. Consequently, they handled the problem of discovering action dynamic by feature selection. Graphical model-based methods [9] have recently attracted great interests due to its strong ability for sequential modeling. Lv et al. [10] trained the hidden Markov model with 3D joint trajectories to model the dynamic within human motion. Han et al. [11] utilized the conditional random fields model for the identification of continuous human activity. As the extension of the conditional random fields model, multiple advanced graphical models, including the hidden conditional random fields [12], the latent dynamic conditional random fields [13], the bidirectional-integrated random fields model [14], the semi-Markov model [15], etc., were developed for sequential modeling. Although these powerful methods have shown great performances on this task, they only consider the sequential dynamics of entire body within one action. Since the visual features of different body parts might lie in different feature subspaces, it would be difficult to learn the latent state spaces of one action by simply considering the sequential visual features of the entire body. Consequently, it is mandatory to take advantage of the characteristics of the body and parts together.
• Ignoring the sequential dynamics of body part. Recently, much more work has been done on the part-based method since part-based representation and modeling is more discriminative by focusing on local regions [16]. The most famous part-based method is the deformable part models (DPM) [17] which can simultaneously learn the characteristics of local and global body regions and their correlation with the latent support vector machine. Motivated by the DPM method, Wang et al. [18] [19] leveraged the motion patterns of both part regions and body region to construct a graphical model with the latent states for human action recognition. To avoid these highly structured models, Sharma et al. proposed the expanded parts model to automatically discover the parts and learn corresponding discriminative templates with their respective locations from a pool of candidate pars. Tian et al. [20] further improved the 2D DPM method, which used to be utilized for the still image, and designed the 3D DPM method to model the local and global spatiotemporal action pattern. We also propose the part-regularized multi-task structural learning method for both multiple-view and single-view action recognition [21]. This method can couple the body-based classification and the part-based classification to benefit intrinsic relatedness sharing across multiple action categories and consequently augment the performance of action recognition. These methods ignore the temporal context of partwise features for action modeling. Therefore, the specific sequential modeling methods need to be developed to take advantage of the sequential partwise features for temporal dynamic modeling.
To tackle both problems, it is mandatory to develop a method, which can take advantage of both global temporal context and partwise temporal context within one action to discover both sequence-specific structure and the sequence-shared correlation for action modeling. In this paper, we propose the multi-task conditional random fields model and explore its application on the task of human action. First, we partition human body into several parts with the prior knowledge of body structure and propose the partwise spatio-temporal action unit sequence (ST-AUS) to represent the temporal context of individual body part during one action. Second, we propose the multi-task conditional random fields (MTCRFs) model to jointly learn sequential dynamics of entire body and individual parts and the correlation inbetween. We demonstrate the superiority of the proposed method on on two RGB human action datasets, KTH [22] & TJU, and the depth dataset in MSR Daily Activity 3D [23].

Contributions
The main contributions of the proposed method can be summarized as follows.
• Motivated by the theory of multi-task learning (MTL) [24], we originally propose the multitask conditional random fields (MTCRFs) model. Different from the direct feature-level fusion and decision-level fusion, the proposed MTCRFs can learn both temporal structure within individual partwise spatio-temporal action unit sequence and transfer the latent correlation inbetween. Therefore, it can preserve the dynamics of individual sequence while sharing the complementary information of different body parts.
• We propose the partwise spatio-temporal action unit sequence (ST-AUS) to represent the multi-level visual dynamics. Individual ST-AUS can represent the part-specific visual dynamic while multiple ST-AUSs together can convey the latent correlations among multiple body parts. Therefore, the proposed ST-AUS representation can well depict the diverse visual characteristics of each action video.
• We contribute a novel human action dataset (TJU) to the community. TJU contains 22 types of human actions. There are totally 1760 action sequence. The synchronized RGB/depth/ skeleton sequences of one action were taken with a Microsoft Kinect sensor with 20fps frame rate and 640 × 480 resolution. To obtain the satisfactory skeleton data, the dataset was recorded in the lab and no occlusion was set. However, it is still challenging since there are more complex and similar actions and both light and dark environments are concerned. This dataset can be downloaded from http://dx.doi.org/10.7910/DVN/YDDJ9F.
The rest of the paper is organized as follows. Section 2 briefly introduces the framework of the proposed method. Section 3 and 4 introduce the proposed spatio-temporal action unit sequence and the multi-task conditional random fields model, respectively. Section 5 explains the experimental method and Section 6 illustrates the experimental results. At last, we conclude the paper in Section 7.

Framework
The proposed method aims to automatically identify the action performing in one query video by discovering and modeling both partwise and bodywise dynamics. It contains two key steps, ST-AUS representation and MTCRFs modeling.
1) ST-AUS representation: We propose the partwise spatio-temporal action unit sequence (ST-AUS) as shown in Fig 1. We utilize the prior knowledge of body structure to define seven body parts, e.g., head, left/right limbs, left/right legs, and left/right feet. One specific part region lasting for T frames is considered as an action unit. Then the action units belonging to one part region in an action video is considered as a partwise action unit sequence. Each partwise action unit sequence focuses on the dynamic of specific body area. Consequently, each action video can be represented in multiple sequential feature spaces.
2) MTCRFs modeling: With the ST-AUS representation, we propose the multi-task conditional random fields (MTCRFs) model. The partwise spatio-temporal action unit sequences can be considered as the sequential observations of MTCRFs to discover the latent state space and the transition inbetween. Since different partwise spatio-temporal action unit sequences might have different temporal dynamics, they cannot be linked or fused directly. In MTCRFs, each sequential observation is designed to connect to one sequential hidden state and all the hidden state sequences are correlated to the sequence label. Therefore, the proposed MTCRFs can flexibly learn the temporal dynamics of individual sequence and discover the correlation among them.

Spatio-temporal Action Unit Sequence
The proposed spatio-temporal action unit sequence(ST-AUS) representation contains three main steps (Fig 1).
First, we extract the local saliency descriptors. The extraction of local space-time feature can be accomplished by local saliency point detection and description. The popular local space-   [29] can be implemented for this step. With the extracted local space-time interest points, one video can be considered as a collection of local spatiotemporal points. We will further partition them into different groups depending of the prior knowledge of human body structure.
To achieve the body structure information, we implement two methods for body part localization: 1) For the classic RGB human action datasets, we implement the part model-based method [17] to localize 7 body parts (head, left/right limbs, left/right legs, and left/right feet). With the localized part regions, we can obtain the centers of individual regions and then utilized them to group the space-time interest points into different part-induced categories. Depending on different feature pooling methods, the proposed ST-AUS can be classified into three levels.
Low Level (LL) ST-AUS: LL ST-AUS represents the basic partwise visual dynamics. The left and right limbs/legs/ feet are respectively considered as one category to avoid the sparse interesting points in individual regions. There are totally 4 kinds of parts, including head/limb/leg/ foot and consequently the space-time interest points can be grouped into the corresponding categories. Then, we can learn individual codebook with the K-means algorithm. The specific part regions in adjacent F frames is defined as a spatio-temporal action unit (ST-AU) which represent the local saliency of a specific body part locating in a special spatio-temporal volume. Therefore the dynamic evolution of a specific body part performing an action can be represented by the sequential ST-AUs, which is defined as the spatio-temporal action unit sequence (ST-AUS). With the learned dictionary corresponding to each part, the standard Bag-of-Words (BoW) scheme can be implemented on each ST-AU for BoW representation and consequently ST-AUS can be represented by the sequential BoW features.

Multi-task Conditional Random Fields
In this section, we respectively present the probabilistic model of the multi-task conditional random fields (MTCRFs) model and the methods for MTCRFs learning and inference.

Probabilistic Model of MTCRFs
We design the specific graph structure G = {V, E} (Fig 3) for the MTCRFs model. V means the node set, including both observation nodes and hidden state nodes. E means the edge set, including the transition between adjacent hidden states and the correlation between the hidden state and the action label. In terms of the designed graph structure, each action video, depicted by the extracted ST-AUS representation, can be represented by P parallel sequences S ¼ fs p g P p¼1 with the chain structure, where s p ¼ fs p i g I i¼1 denotes individual part-induced ST-AUS which represents the temporal dynamics of a specific body part during one action. S is assigned with a specific action label A 2 A. To model the state transition within individual ST-AUS (s p ), we utilize the hidden state layer L ¼ fl p g P p¼1 ðL 2 LÞ to correlate the adjacent observations, in which l p ¼ fl p i g I i¼1 means the hidden state sequence corresponding to s p . Each l p i is a member of a finite discrete set L p of the p th ST-AUS. All the hidden states are correlated by the edge between individual hidden state and the action label node. Consequently, all ST-AUSs can be correlated and will contribute for the modeling and inference of the action category.
To take advantage of both inner-sequence and inter-sequence context for MTCRFs modeling, the proposed probabilistic model of MTCRFs can be formulated with two parts: where Θ = {ϕ, ψ} denotes the weight coefficients of the model; P 1 (AjS, ϕ) and P 2 (AjS, ψ) denotes the inner-sequence sub-model and the inter-sequence sub-model, respectively. The inner-sequence sub-model P 1 (AjS, ϕ) can be formulated as follows: where Z 1 is the partition function for normalization. The potential function of the innersequence sub-model, ϕ > Áμ(A, S, L) 2 R, can be defined with the first-order attribute between the observation and the corresponding hidden state and the second-order attribute between adjacent hidden states: where the first term denotes the correlation between the observation s p i and the corresponding hidden state l p i and the second term denotes the transition between adjacent hidden states. F 1;k ðl p i ; s p i Þ 2 R is a general expression of the first order attribute function which represents the relationship between pairwise observation and hidden state nodes. F 2;k ðl p i ; l p iþ1 Þ 2 R is a general expression of the second order attribute function which represents the relationship between pairwise hidden state nodes. Different from previous work [30] which only consider the first order attribute between the observation node and the corresponding hidden state node, we leverage both first order and second order attributes to formulate the inner-sequence submodel. It is intuitive that different actions might have the similar action unit. For example, both boxing and handclapping can have similar hand/limb motion. It is not discriminative enough to distinguish each other only with the body motion during one state. However, it will be more easily to discriminate both when considering the motion changes between adjacent states since there will be different motion intensity, direction, etc. These motion changes can be well represented by the second-order attribute.
The inter-sequence sub-model P 2 (AjS, ψ) can be likewise formulated as follows: where Z 2 is another partition function for normalization. The potential function of the intersequence sub-model, ψ > Áν(A, S, L) 2 R, can be defined with the first-order attribute between the sequence class and the corresponding hidden state and the second-order attribute between the sequence class and the pairwise hidden states: where the first term denotes the correlation between the sequence label A and individual hidden state l p i and the second term denotes the correlation between the sequence label A and pairwise hidden states. F 0 1;k ðl p i ; A 0 Þ 2 R is the first order attribute function which represents the relationship between the sequence label and individual hidden state node. F 0 2;k ðl p i ; l p iþ1 ; A 0 Þ 2 R is the second order attribute function which represents the relationship between the sequence label and pairwise hidden state nodes. Since all the hidden states are linked by the sequence label node, this model can learn the latent correlation among different ST-AUSs. This is quite understandable since each action has specific motion pattern and the movement of different body parts during one action is implicitly constrained by each other. Different from previous work [31] which designed the edges between the pairwise hidden states from two different chain-structured sequences to learn the inter-sequence correlation, we omit this edges due to two reasons: 1) there might exist asynchrony among different ST-AUSs since different partinduced ST-AUSs can have different motion dynamics. It is unreasonable to impose strong constraint by linking the pairwise hidden stated with the same index from two sequences. 2) the inter-sequence edges linking pairwise hidden states will significantly increase the complexity of the graph structure and directly make model learning much more difficult.

MTCRFs Learning and Inference
Since both sub-models have the similar formulation in the exponential family manner, they can be separately optimized by the gradient descent method [32]. Here we take the innersequence sub-model, P 1 (AjS, ϕ), as an example to illustrate the model learning method. Model learning of P 1 (AjS, ϕ) can be accomplished by maximize the following likelihood objective function, given the training set with N samples G ¼ fðS i ; A i Þg N i¼1 : where we suppose ϕ 2 R K (K = K 1 +K 2 ). The objective function L(Γ;ϕ) is the summation of loglikelihood of all training samples minus a regularization term. The second term 1 2s 2 kk 2 is a an L 2 -norm regularization when parameters are assumed to be Gaussian distributed with variance σ 2 . It is set to avoid overfitting. To achieve the optimal parameter ϕ Ã , we can take a partial derivative of L(Γ;ϕ) with respect to each entity ϕ k as: Since ϕ 1, k only exists in the first-order attribute term, the partial derivative of the core of Eq 7 can be derived as: In the same way, ϕ 2, k only exists in the second attribute term and the partial derivative of the core of Eq 7 is: With the partial derivative of Eq 7 with respect to both ϕ 1, k and ϕ 2, k above, the parameters can be updated as follows: where t means the iteration index; γ means the learning rate. The parameter ψ of the inter-sequence sub-model P 2 (AjS, ψ) can be optimized in the same way and we omit the related derivation. For model inference, the optimal sequence label A Ã can be predicted with the belief propagation algorithm [33]: where Θ Ã = {ϕ Ã , ψ Ã }.

Experiment Method Data
The proposed method is evaluated on the popular datasets in both RGB and depth modalities, including KTH, TJU, MSR Daily Activity 3D (MDA) respectively shown in Fig 2a, 2b and 2c. We will briefly introduce the datasets as follows.
• KTH: KTH [22] contains 6 action categories: 1. boxing, 2. hand clapping, 3. hand waving, 4. jogging, 5. running and 6. walking. Each action was performed by 25 people in 4 different scenarios, including indoor, outdoor, changes in clothing and variations in scale. Each video sample contains one subject engaged in a single activity in a certain condition. For fair comparison, we strictly followed the experiment setting as [34] in the "split" manner.

Implementation Details
For RGB datasets (KTH and TJU), we utilized STIP [36] for spatiotemporal interest point localization and representation because it has shown superior performances in the BoW+SVM framework [34]. For depth dataset (MDA), the local HON4D descriptor and the 3Djoint position feature were utilized. As Oreifej [37], we concatenated HON4D and 3Djoint position features for representation and further utilized them for dictionary learning and BoW representation. The original parameter settings were implemented to extract STIP and HON4D features for fair comparison. For KTH, the part-based model [17] was implemented on each frame to localize body part regions. For TJU and MDA, we directly used the skeleton data provided in each dataset for body part localization. Then the space-time feature points were grouped into different body parts in terms of the geometric distances between the point and the center of each part region. To construct the visual vocabulary of individual part, we clustered 20,000 feature points of each part region sampled from the training videos with the K-means algorithm. The number of visual words was empirically set with 100, which showed satisfactory performance for partwise ST-AUS representation. The ST-AUS in our experiment lasted for 30 frames in temporal domain and the overlap between two adjacent ST-AUSs was set with 15.

Experimental Results
To show the superiority of the proposed method, the following two comparison experiments were implemented.

Comparison against other fusion methods
With the ST-AUS representation, the proposed MTCRFs model can be trained and utilized for action recognition based on the Maximum A Posteriori criteria. To select the best hidden state for temporal modeling, we plotted the ROC curve of each hidden state number and the best parameter can be selected when the area under curve (AUC) of the corresponding ROC reached the maximum. In our experiments, we varied the number of hidden states from 3 to 6 per chain for parameter selection. From Figs 4, 5 and 6, it is obvious that the MTCRFs model on each dataset can achieve the best performance with hidden_state = 5.
To show the proposed method can benefit sequential dynamic learning, we implemented the following strategies for comparison: • HL-HCRF: HCRF model trained only with HL-AUS (as shown in Fig 1(h)), which utilized the bodywise feature for temporal modeling.
• ML-DDF: Direct decision-level fusion by linearly combining the posterior probabilities of HCRF models, trained with ML-AUS1 and ML-AUS2 respectively, with equal weights.  The comparison results are shown in Table 1. The confusion matrixes of the optimal performance on three datasets by All-MTCRFs are respectively shown in Fig 7a, 7b and 7c.
From Table 1 we can see that HL-HCRF worked in a traditional manner for HCRF learning simply with global feature and only acieved 90.0%, 80.5% and 71.2% for KTH, TJU, and MDA respectively. ML-DDF, LL-DDF, and All-DDF worked in a decision-fusion manner and cannot take advantage of the temporal context of all sequences for sequential structure learning. It can only improve the performance slightly. Comparatively, ML-DFF, LL-DFF, and All-DFF worked in a feature-fusion manner and can leverage the temporal context of all partwise sequences for sequential structure learning. Consequently the feature-level fusion method can further augment the performance. Since the proposed MTCRFs can preserve and learn the temporal context of individual ST-AUS while relaxing the interaction inbetween to handle the asynchronization, it can achieve the best accuracy of 94.8%, 90.5% and 89.8% on three datasets respectively and consistently outperform all the others.
The action category-wise comparison among HL-HCRF, LL-MTCRFs, ML-MTCRFs, and All-MTCRFs for three datasets are listed in Fig 8a, 8b and 8c. For KTH, All-MTCRFs can work better or equally to the others in 4 out of 6 actions. Especially it can drastically improve the accuracy of Jogging from around 80% to 91.2%, which is the most challenging one in KTH. For TJU, All-MTCRFs can rank 1st in 17 out of 22 action. It can augment the performance of 4 actions (clapping, jacks, p-jump, draw-circle) with more than 10% accuracy. For MDA in the depth modality, All-MTCRFs can also improve the performance and rank 1st in 14 out of 16 action. It can augment the performance of 6 actions (drink, read book, write on a paper, sit still, toss paper, lie down on sofa) with more than 20% accuracy. This comparison demonstrates that the proposed method is more robust to high intra variation caused by more complex actions and diverse environments. The action-wise comparison among the first three methods in Fig 8 shows that LL-MTCRFs can work better or equally to HL-HCRF and ML-MTCRFs in 3 out of 6 actions in KTH, 16 out of 22 actions in TJU, 12 out of 16 actions in MDA. It further demonstrates that the body part-induced ST-AUS representation is more discriminative for local dynamic description and consequently facilitates human action recognition.

Comparison against state-of-the-art methods
To show the superiority of the proposed method, we compared its performances on KTH, TJU, and MDA to the state-of-the-art methods. Tables 2 and 3 show the comparison on KTH & MDA. The performances of the competing methods are cited from the corresponding papers. From Tables 2 and 3 it is obvious that the proposed method can outperform the competing methods in both RGB and depth modalities. Especially, the proposed method can outperform the slow feature-based method [47], which can also explore the temporal context. Theoretically, Zhang et al. [47] utilized the slow feature learning for frame-wise feature transformation and further computed the accumulated squared derivative (ASD) feature of video-wise representation by leveraging the statistical characteristics of frame-wise slow features. Although the ASD feature also conveys the temporal characteristics, it loses the multiple sequential context  Learning Sequential Dynamics for Action Recognition represented by the proposed part-induced spatiotemporal action unit sequence. Therefore, it is reasonable that the proposed method can outperform the slow feature-based method. Table 4 shows the comparison on TJU. Since TJU is a new dataset prepared by our group and no off-the-shelf comparison available, we re-implemented six representative methods for comparison: 1) STIP-BoW+SVM: the standard BoW+SVM method with the STIP feature [28]; 2) 3DHOG-BoW+SVM: the standard BoW+SVM method with the 3DHOG feature [3]; 3) BoW+SRC: the standard BoW representation with STIP and sparse representation-based classification [52]; 4) BoW+CRC: the standard BoW representation with STIP and collaborative representation-based classification [53]; 5) HL-AUS+HMM: bodywise action unit sequence with STIP and hidden markov model [54]; 6) HL-AUS+semi-DRF: bodywise action unit sequence with STIP and semi-Markov discriminative random fields [55]. The first four methods belong to the popular BoW-based methods for classification without utilizing temporal

Method
Year Accuracy(%) context. Therefore it is expected that their performances are relatively lower than the other temporal inference methods. HMM and semi-DRF are two popular models which leverage sequence structure for temporal modeling. HMM is theoretically limited by its inability to accommodate long-range dependencies among observations or multiple overlapping features because they assume that the observations are conditionally independent. Therefore semi-DRF works better than HMM. However, both HMM and semi-DRF cannot take advantage of multiple partwise ST-AUS simultaneously for sequence structuring learning and the latent correlation discovery. Consequently, the proposed MTCRFs model can achieve the best performance of 90.5% accuracy.

Conclusion
In this paper we propose a novel human action recognition method based on multi-task conditional random fields model. For feature representation, we propose the part-induced spatiotemporal action unit sequence to represent each action sample with multiple partwise sequential feature subspaces. For model learning, we propose the multi-task conditional random fields (MTCRFs) model to discover the sequence-specific structure and the sequenceshared relationship. Specifically, we propose the probabilistic model and the corresponding learning and inference methods to discover temporal context within individual action unit sequence and the latent correlation among different body parts. The comparison experiments demonstrated its superiority on two RGB human action datasets, KTH & TJU, and one depth dataset, MSR Daily Activity 3D.