Nonnegative Tensor-Based Linear Dynamical Systems for Recognizing Human Action from 3D Skeletons



Introduction
Human action recognition based on spatiotemporal data has been one of the most prominent research topics owing to its applications in human-computer interfaces [1], gaming [2], and surveillance systems [3]. Over the past few decades, numerous methods have been proposed for recognizing human actions from monocular RGB videos [4]. However, monocular RGB data is very sensitive to background clutter, occlusion, viewpoint variations, and illumination changes. Thus, despite the significant research conducted over the past few decades, accurately recognizing human actions from RGB videos remains a challenging problem. As a human skeleton can be viewed as an articulated system of rigid bodies connected by bone joints, a human action can be described as the spatiotemporal evolution of a series of skeletons. Therefore, if human skeleton sequences can be accurately extracted from RGB videos, action recognition can be performed by classifying the skeleton sequences. However, reliably extracting the human skeleton from monocular video sensors is extremely difficult. With the development of cost-effective depth sensors [5], it has become easier to extract the three-dimensional (3D) positions of skeletal joints from depth data. Hence, skeleton-based action recognition has once again become an active area of research.
When recognizing human actions, representations embodying the temporal dynamics can provide a more relevant description than static data [6]. A linear dynamical system (LDS) [7] is an effective tool for capturing spatiotemporal data in various disciplines. Hence, the authors of [6, 8] employed an LDS model to capture the spatiotemporal information of an action (skeleton sequence) and used the singular value decomposition (SVD) [9] or Tucker decomposition [10] to estimate the model parameters (A, C). The parameters were used to build the finite observability matrix O_k^T = [C^T, (CA)^T, (CA^2)^T, ..., (CA^{k-1})^T]. Hence, an action, represented by an LDS, was alternately identified as a point on the finite Grassmann manifold corresponding to the column space of O_k^T.
Motivated by the above methods, this paper proposes a novel approach to model and analyze human actions. The overall approach is shown in Figure 1. In this study, an action (a skeleton sequence) is represented as a third-order nonnegative tensor, and each skeleton is converted into a second-order nonnegative tensor. To retain the spatiotemporal information of an action to the maximum extent, a nonnegative tensor-based LDS (nLDS) is proposed to model the action. In this work, nonnegative Tucker decomposition (NTD) [11] is used to decompose the third-order nonnegative tensor to improve the accuracy of action recognition. Because NTD is a powerful tool for extracting a part-based representation of high-dimensional tensors, an action can be represented by a linear combination of relevant components.

The parameter tuple (A, C) can be learned from the nLDS model and the NTD of an action. The parameters C and A represent the appearance and dynamics of the nLDS model, respectively. Thus, it is appropriate to regard the extended observability sequence O_∞^T = [C^T, (CA)^T, (CA^2)^T, ..., (CA^n)^T, ...] as the feature descriptor of the action. The conventional method approximates O_∞^T by a finite-order matrix O_m^T and maps O_m^T to a point on a finite Grassmann manifold. However, the value of the order m affects both the asymptotic behavior of O_∞^T and the computational complexity. To avoid the limitations introduced by choosing the order m, an action is represented as a point on the infinite Grassmann manifold S_∞^T consisting of the orthonormalized extended observability sequences. Finally, classification is performed using dictionary learning and sparse coding on the infinite Grassmann manifold.
The main contributions of this study are as follows: (1) To retain the spatiotemporal information of an action, an nLDS is used to model the action, which is represented by a third-order nonnegative tensor.
(2) Compared with the Tucker decomposition, the uniqueness of NTD allows the representation of an action to be more discriminative. Thus, to further improve the accuracy of action recognition, NTD is used to estimate the parameters of the nLDS model. The parameters are utilized to build an extended observability sequence O_∞^T = [C^T, (CA)^T, (CA^2)^T, ..., (CA^n)^T, ...], which can be considered as the feature descriptor of the action.
(3) To overcome the limitation caused by approximating the extended observability sequence O_∞^T with a finite-order matrix O_k^T, an action is represented as a point on the infinite Grassmann manifold S_∞^T consisting of the orthonormalized extended observability sequences.
The rest of the paper is organized as follows: Section 2 reviews related work; Section 3 briefly introduces fundamental concepts of the Tucker model, NTD, and LDS; Section 4 elaborates the nLDS model and describes how to represent an action as a point on the infinite Grassmann manifold; Section 5 presents our experimental results; and Section 6 concludes this paper.

Related Work
A brief overview of skeleton-based action recognition approaches is provided in this section. Existing skeleton-based action recognition approaches can be categorized into two types. The first type represents the human skeleton as a set of skeletal joints. Wang et al. [22] employed pairwise relative positions of the joints to represent a human skeleton and used a hierarchy of Fourier coefficients to model the temporal evolution of this representation. To obtain discriminative joint combinations, they used a multiple kernel learning approach to characterize the human actions. Li et al. [23] used a graph-based model to represent the relative spatial variations between various skeletal joints. To obtain the most informative skeletal joint pairs, they utilized the relative variance of the joint relative distance (RVJRD) [24] to indicate the activity levels of joints in a human action. To derive the spatiotemporal compatibility of the skeletal joints between different actions, Koniusz et al. [25] used a sequence compatibility kernel (SCK) to capture the spatial and temporal similarities between the skeletons in an action. Furthermore, they employed a dynamics compatibility kernel to represent the similarity between a pair of skeletons in a given action, capturing the spatiotemporal dynamics of that action. Zhu et al. [26] fed the raw 3D skeletal joint locations to an end-to-end fully connected deep long short-term memory (LSTM) network for recognizing skeleton-based human actions. Ding et al. [6] represented a human skeleton as a 3D joint-based tensor, so that a human action could be considered as a third-order tensor time series. They proposed a tensor-based LDS (tLDS) to model a tensor time series and estimated the parameters of the LDS model using the Tucker decomposition. To eliminate the noise and occlusion in 3D skeleton data, Liu et al. [17] proposed a spatiotemporal long short-term memory (ST-LSTM) network with trust gates. By analyzing the reliability of the skeleton data, the trust gates can dynamically update the long-term context information stored in the memory cell. Lee et al. [14] built an Ensemble Temporal Sliding LSTM (TS-LSTM) network composed of short-term, medium-term, and long-term TS-LSTM subnetworks, which capture the temporal dependencies between skeletons and the spatial dependency within each skeleton.

The second type of skeleton-based action recognition approach represents a human skeleton as a set of connected rigid bodies. The authors of [12] represented human actions as curves in a Lie group. To simplify the classification task, the actions represented in the Lie group were mapped to its Lie algebra, a vector space. In [8], the authors divided a skeleton into smaller body parts and employed certain bio-inspired shape features to represent each body part; an LDS was used to learn the temporal evolution of the bio-inspired features. Using motion velocities, the direction of motion, and the curvatures of the 3D trajectories, Ding et al. [27] categorized actions into two types of action units: dynamic instants and intervals. They utilized self-organizing maps (SOMs) [28] to cluster the action units by their spatiotemporal features and employed the resulting sequences of discrete symbols for each action to build profile hidden Markov models (PHMMs) [29] capturing the spatiotemporal information between the action units in the given actions. Huang et al. [21] designed a neural network architecture to learn the most informative Lie group representations; based on the proposed network structure, they fed the Lie group features into rotation mapping layers to obtain the desired results.

Brief Review of Basic Concepts
A brief overview of the Tucker model, NTD, and LDS is presented here, which will help in understanding the tensor-based LDS. Let P ∈ R^{I_1×I_2×···×I_N} denote an N-order tensor, P_(n) denote the mode-n matricization of tensor P, P denote a matrix, and W^(n) denote a mode-n matrix belonging to the Tucker model. If all elements of tensor P are nonnegative (i.e., P_{i_1,i_2,...,i_N} ≥ 0, where 1 ≤ n ≤ N and 1 ≤ i_n ≤ I_n), P is called a nonnegative tensor. The mode-n matricization of tensor P rearranges the elements of P into the matrix P_(n) ∈ R^{I_n × (I_1···I_{n-1}I_{n+1}···I_N)}.

Tucker Model. The mode-n product of tensor P ∈ R^{I_1×···×I_N} with matrix W ∈ R^{J×I_n}, written K = P ×_n W, is the tensor K ∈ R^{I_1×···×I_{n-1}×J×I_{n+1}×···×I_N} with entries

K_{i_1,...,i_{n-1},j,i_{n+1},...,i_N} = Σ_{i_n=1}^{I_n} P_{i_1,...,i_N} W_{j,i_n}   (1)

for all index values. With the help of the mode-n product, (1) can be rewritten in terms of matrix unfoldings by fixing the n-th mode:

K_(n) = W P_(n),   (2)

where P_(n) and K_(n) are the matrix unfoldings of P and K, respectively. The Tucker model decomposes an N-order tensor P ∈ R^{I_1×I_2×···×I_N} into mode products of a core tensor K ∈ R^{J_1×J_2×···×J_N} and N mode matrices W^(n):

P = K ×_1 W^(1) ×_2 W^(2) ··· ×_N W^(N),   (3)

where the W^(n) are the factor matrices and W^(n)T W^(n) = I. The mode-n matricization of tensor P in (3) can be represented by the mode-n matricizations of the core tensor and the mode matrices:

P_(n) = W^(n) K_(n) (W^(N) ⊗ ··· ⊗ W^(n+1) ⊗ W^(n-1) ⊗ ··· ⊗ W^(1))^T,   (4)

where ⊗ denotes the Kronecker product.
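As a concrete illustration of (3) and (4), the following NumPy sketch implements the mode-n matricization and reconstructs a tensor from its core and mode matrices via repeated mode-n products (`unfold`, `fold`, and `tucker_reconstruct` are illustrative names, not from the paper):

```python
import numpy as np

def unfold(P, mode):
    """Mode-n matricization: move axis `mode` to the front, then flatten."""
    return np.moveaxis(P, mode, 0).reshape(P.shape[mode], -1)

def fold(M, mode, shape):
    """Inverse of `unfold`: restore the tensor of the given shape."""
    full = [shape[mode]] + [s for i, s in enumerate(shape) if i != mode]
    return np.moveaxis(M.reshape(full), 0, mode)

def tucker_reconstruct(K, Ws):
    """P = K x_1 W^(1) x_2 W^(2) ... x_N W^(N), applied one mode at a time
    using the matricized form K_(n) -> W K_(n) of the mode-n product."""
    P = K
    for n, W in enumerate(Ws):
        new_shape = tuple(W.shape[0] if i == n else s for i, s in enumerate(P.shape))
        P = fold(W @ unfold(P, n), n, new_shape)
    return P
```

For a third-order tensor this reconstruction agrees with the direct contraction `np.einsum('abc,ia,jb,kc->ijk', K, W1, W2, W3)`, which is a convenient way to sanity-check the unfolding convention.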

Nonnegative Tucker Decomposition.
Given a nonnegative N-order tensor P ∈ R^{I_1×I_2×···×I_N}, the NTD of P obtains a core tensor K and mode matrices W^(n) ∈ R^{I_n×J_n}, which are restricted to having only nonnegative elements, such that

P ≈ P̂ = K ×_1 W^(1) ×_2 W^(2) ··· ×_N W^(N).   (5)

To search for the approximate factorization P̂ of tensor P, a cost function is used to quantify the quality of the approximation. The generalized Kullback-Leibler divergence (or I-divergence) is usually used to construct the cost function:

D(P ‖ P̂) = Σ (P log(P ⊘ P̂) − P + P̂),   (6)

where the sum runs over all entries and ⊘ is element-wise division. To obtain the mode matrices W^(n) and core tensor K of tensor P, Kim et al. [11] used multiplicative updating algorithms to minimize cost function (6), as follows.
They define W^(⊗\n) = W^(N) ⊗ ··· ⊗ W^(n+1) ⊗ W^(n-1) ⊗ ··· ⊗ W^(1), the Kronecker product of all mode matrices except W^(n), represented in backward cyclic form. The mode-n matricization of the NTD can then be rewritten as

P̂_(n) = W^(n) K_(n) (W^(⊗\n))^T.   (7)

Kim et al. [11] derived multiplicative updating algorithms for the mode matrices W^(n) and the core tensor K of the NTD as follows. The update rule for the mode matrices W^(n) is

W^(n) ← W^(n) ⊛ [(P_(n) ⊘ P̂_(n)) W^(⊗\n) K_(n)^T] ⊘ [1 W^(⊗\n) K_(n)^T],   (8)

and the update rule for the core tensor K is

K ← K ⊛ [(P ⊘ P̂) ×_1 W^(1)T ··· ×_N W^(N)T] ⊘ [ξ ×_1 W^(1)T ··· ×_N W^(N)T],   (9)

where ⊘ is element-wise division, ⊛ denotes the Hadamard (element-wise) product, and ξ is a tensor all of whose elements are 1. The I-divergence D(P ‖ P̂) is nonincreasing under these update rules.
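The update rules above can be sketched in NumPy for a third-order tensor. The alternating update order, the `eps` guard, and the helper names (`ntd_kl`, `i_divergence`) are our own illustrative choices, in the spirit of the rules of Kim et al. [11] rather than a faithful reproduction of them:

```python
import numpy as np

def unfold(T, mode):
    """Mode-n matricization."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def ntd_kl(P, ranks, iters=100, eps=1e-9, seed=0):
    """Multiplicative I-divergence updates for a third-order NTD (a sketch)."""
    rng = np.random.default_rng(seed)
    Ws = [rng.random((P.shape[n], ranks[n])) for n in range(3)]
    K = rng.random(ranks)
    for _ in range(iters):
        for n in range(3):
            Phat = np.einsum('abc,ia,jb,kc->ijk', K, *Ws)
            R = P / (Phat + eps)
            # B: core contracted with every mode matrix except mode n, unfolded along n,
            # so that Phat_(n) = W^(n) B  (the matricized form of eq. (7))
            others = [np.eye(ranks[m]) if m == n else W for m, W in enumerate(Ws)]
            B = unfold(np.einsum('abc,ia,jb,kc->ijk', K, *others), n)
            Ws[n] *= (unfold(R, n) @ B.T) / (np.ones_like(unfold(R, n)) @ B.T + eps)
        # core update: contract the ratio tensor with each W^(n)T along its mode
        Phat = np.einsum('abc,ia,jb,kc->ijk', K, *Ws)
        R = P / (Phat + eps)
        num = np.einsum('ijk,ia,jb,kc->abc', R, *Ws)
        den = np.multiply.outer(np.multiply.outer(
            Ws[0].sum(0), Ws[1].sum(0)), Ws[2].sum(0)) + eps
        K *= num / den
    return K, Ws

def i_divergence(P, Phat, eps=1e-9):
    """Generalized Kullback-Leibler (I-) divergence D(P || Phat), eq. (6)."""
    return float(np.sum(P * np.log((P + eps) / (Phat + eps)) - P + Phat))
```

Because each sub-update has the standard KL multiplicative form, the I-divergence is nonincreasing over iterations, which is easy to verify empirically on random nonnegative tensors.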

Linear Dynamical Systems. An LDS belongs to the class of multivariate time series (MTS) models. The MTS model utilizes hidden states to indirectly represent the observation sequences.
Given an MTS Y = [y(1), ..., y(t), ..., y(λ)], an LDS is usually described as

x(t+1) = A x(t) + w(t),
y(t) = C x(t) + v(t),

where the discrete variable t is the time index, x(t) denotes the d-dimensional hidden state at time t, y(t) represents the n-dimensional observed state at time t, and d is the order of the LDS. A ∈ R^{d×d} is the transition matrix, which maps the current hidden state x(t) to the next hidden state x(t+1). The observation matrix C ∈ R^{n×d} maps hidden state x(t) to observed state y(t). The noise components w(t) and v(t) follow multivariate normal distributions with zero mean and covariance matrices Q ∈ R^{d×d} and R ∈ R^{n×n}, respectively. In [30], Doretto et al. employed the singular value decomposition (SVD) Y = UΣV^T of the observed sequence to obtain the best estimates of C and A:

Ĉ = U,  X̂ = ΣV^T,  Â = X̂_{[2:λ]} X̂_{[1:λ−1]}^†,

where X̂ = [x̂(1), ..., x̂(λ)] is the estimated hidden-state sequence and † denotes the Moore-Penrose pseudoinverse. The parameters (A, C) of an LDS do not lie in a linear space. Transition matrix A is constrained to be stable, with its eigenvalues distributed inside the unit circle. Observation matrix C is an orthonormal matrix; in fact, C lies on a Stiefel manifold. The parameters (A, C) can be utilized to describe the intrinsic characteristics of the LDS model [31] because A and C represent the dynamics and spatial appearance, respectively. Therefore, the pair (A, C) can be used to describe a set of joint trajectories of an articulated body model. The extended observability matrix [32] for a tuple (A, C) has the form

O_∞^T = [C^T, (CA)^T, (CA^2)^T, ...].

In current skeleton-based human action research, a human action is usually described as a finite skeleton sequence. Therefore, a human action can be defined by a k-length extended observability matrix

O_k^T = [C^T, (CA)^T, ..., (CA^{k-1})^T],

where k is the total number of frames in the skeleton sequence.
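The closed-form estimate above can be sketched in a few lines of NumPy. This is a simplified version of the suboptimal identification of [30]: C is taken from the leading left singular vectors and A is fit by least squares on the estimated state sequence (`estimate_lds` is an illustrative name):

```python
import numpy as np

def estimate_lds(Y, d):
    """Suboptimal closed-form LDS identification in the style of Doretto et al.:
    Y is the n-by-lambda observation matrix, d the desired model order."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :d]                          # observation matrix, orthonormal columns
    X = np.diag(s[:d]) @ Vt[:d, :]        # estimated hidden-state sequence
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])  # transition matrix by least squares
    return A, C, X
```

On a noiseless sequence generated by a rank-d LDS, the recovered C is orthonormal, C X reproduces Y, and the one-step prediction residual of A on the state sequence is numerically zero.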
The Grassmann manifold G(p, q) [33] is the set of q-dimensional linear subspaces of R^p. Each point on the Grassmann manifold is a particular subspace spanned by the columns of a p × q orthogonal matrix. To obtain the subspace spanned by the columns of O_k^T, Gram-Schmidt orthonormalization can be used to compute an orthonormal basis. Thus, a human action can be represented as a point on the Grassmann manifold G(kn, d) corresponding to the column space of O_k^T.
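In code, the mapping from a tuple (A, C) to a Grassmann point reduces to stacking the observability blocks and orthonormalizing the columns; a sketch (QR factorization is used as a numerically stable stand-in for Gram-Schmidt, and `observability_point` is an illustrative name):

```python
import numpy as np

def observability_point(A, C, k):
    """Stack [C; CA; ...; CA^{k-1}] and orthonormalize its columns,
    yielding a subspace representative, i.e. a point on G(kn, d)."""
    blocks, M = [], C
    for _ in range(k):
        blocks.append(M)
        M = M @ A
    O = np.vstack(blocks)
    Q, _ = np.linalg.qr(O)   # orthonormal basis of the column space of O
    return Q
```

Any orthonormal basis of the same column space represents the same Grassmann point, so the specific Q returned is only one representative of the subspace.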

Nonnegative Tensor Representation of Skeleton Sequence.
The continuity of human motion implies that a skeleton sequence, which describes a human action, is a combination of interrelated skeletons. Instead of the traditional skeleton feature vectorization, a skeleton sequence is represented as a nonnegative tensor based on a time series. By searching for the nonnegative tensor components of a human action, the tensor representation can better reflect the independence of each skeleton and the variation between different skeletons. Figure 2(a) shows a human skeleton with 19 rigid bodies and 20 joints. We preprocess the complete skeletons in the action datasets, which makes it more accurate and easier to represent a skeleton sequence as a nonnegative tensor. We use the preprocessing method elaborated in Section 5.9 to keep the joints of a skeleton in the first octant of the global coordinate system (refer to Figure 2(b)).
Let W = (C, E) be a preprocessed skeleton (a preprocessed skeleton is located in the first octant of the global coordinate system). C = {c_1, c_2, ..., c_N} is the set of N joints in the first octant of the global coordinate system.

Nonnegative Tensor-Based LDS Model.
Mathematically, a third-order tensor can be unfolded into a second-order tensor time series. It is well known that a second-order tensor time series can be modeled as the output of an LDS. Therefore, a third-order nonnegative tensor, representing a skeleton sequence, is used to build an LDS model. The parameters (A, C) are estimated using the LDS model and the NTD of the action. The resulting LDS model is called the nonnegative tensor-based LDS (nLDS) model, as shown in Figure 3.
A skeleton sequence can be represented as a third-order nonnegative tensor Y ∈ R^{J_1×J_2×λ}, where λ is the number of frames. The NTD of Y uses the update rules proposed by Kim et al. [11] to obtain the mode matrices W^(i), i = 1, 2, 3, and the encoding variable matrix S_(3)(W) is computed from the core tensor and the first two mode matrices. Then, the NTD of Y is given by Y ≈ S ×_1 W^(1) ×_2 W^(2) ×_3 W^(3), where the core tensor S ∈ R^{d_1×d_2×τ} and mode matrices W^(1) ∈ R^{J_1×d_1}, W^(2) ∈ R^{J_2×d_2}, and W^(3) ∈ R^{λ×τ} are restricted to having only nonnegative elements in the factorization (Figure 4 illustrates the nonnegative Tucker decomposition of a third-order nonnegative tensor).
Algorithm 1 presents pseudocode for building the nLDS model and solving for its parameters.

Representing a Human Action as a Point on Infinite Grassmann Manifold.
Starting from the vectorized initial state vec(X(1)), the expected observation sequence of an nLDS model is

E[Y] = [Ĉ x(1), ĈÂ x(1), ĈÂ^2 x(1), ...].

Here the transition matrix Â ∈ R^{I×I} is stable with eigenvalues inside the unit circle, i.e., μ(Â) < 1, where μ(Â) is the largest eigenvalue of Â in magnitude, and the observation matrix Ĉ ∈ R^{J×I} is an orthonormal matrix. The expected observation sequence lies in the column space of the extended observability sequence O_∞^T = [Ĉ^T, (ĈÂ)^T, (ĈÂ^2)^T, ...]. The column space of O_∞^T can be seen as the descriptor of an LDS owing to its invariance to the choice of basis of the state space. In this manner, the nLDS model of an action can be represented by its O_∞^T, which means that O_∞^T can be seen as the feature descriptor of the action.
The traditional methods [6, 30] approximate the extended observability sequence O_∞^T by the finite (k-order) observability matrix O_k^T = [Ĉ^T, (ĈÂ)^T, ..., (ĈÂ^{k-1})^T]. Thus, an LDS can be represented as a point on a Grassmann manifold corresponding to the column space of O_k^T. The value of the order k influences the approximation of the extended observability matrix: if k is too small, the k-order observability matrix cannot adequately represent the behavior of the extended observability matrix; conversely, the finite observability matrix approaches the extended observability matrix as k increases, but this also increases the computational cost. To avoid these limitations, we use the method proposed in [34] to project infinite-order observability matrices to points on the infinite Grassmann manifold. Let Z_∞^T denote the orthonormalization of O_∞^T, and let S_∞^T denote the quotient space of Z_∞^T. According to its definition, S_∞^T is the infinite Grassmann manifold G(I, ∞) with an extra intrinsic structure. Thus, an action, represented by O_∞^T, can alternately be identified as a point on the infinite Grassmann manifold S_∞^T consisting of the orthonormalized extended observability sequences.
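Computations involving extended observability sequences are tractable despite their infinite length because inner products between them exist in closed form: since both transition matrices are stable, the infinite sum defining O_i^T O_j satisfies a discrete-time Sylvester (Stein) equation. The sketch below solves that equation by brute-force vectorization; the method of [34] may use more refined machinery, and `obs_gram` is an illustrative name:

```python
import numpy as np

def obs_gram(Ai, Ci, Aj, Cj):
    """G = O_i^T O_j = sum_k (Ai^T)^k Ci^T Cj Aj^k, the Gram matrix of two
    extended observability sequences. G solves the Stein equation
    G = Ci^T Cj + Ai^T G Aj, which converges because Ai, Aj are stable."""
    Q = Ci.T @ Cj
    d1, d2 = Q.shape
    # vec(Ai^T G Aj) = (Aj^T kron Ai^T) vec(G), with column-major vec
    M = np.eye(d1 * d2) - np.kron(Aj.T, Ai.T)
    g = np.linalg.solve(M, Q.reshape(-1, order='F'))
    return g.reshape((d1, d2), order='F')
```

The Kronecker-based solve costs O((d1 d2)^3) and is only practical for the small state dimensions typical of action models; the closed form is what makes distances and kernels on the infinite Grassmann manifold computable at all.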

Sparse Coding and Dictionary Learning on Infinite Grassmann Manifold.
To classify the actions projected to the infinite Grassmann manifold S_∞^T, an efficient method [35] is used to perform sparse coding and dictionary learning on the infinite Grassmann manifold. Given a dictionary D = {D_1, D_2, ..., D_K}, an action set X = {X_1, X_2, ..., X_N}, and coefficients y = {y_1, y_2, ..., y_K} (D_j, X_i ∈ S_∞^T and y_(j,i) ∈ y_i), the sparse coding objective on S_∞^T minimizes the reconstruction error of each action in terms of the dictionary elements together with a sparsity penalty on the coefficients. The purpose of dictionary learning is to find a good dictionary that can represent all the actions with a small reconstruction error. Let the parameter tuples of the element D_h ∈ S_∞^T and the action X_i ∈ S_∞^T be (A_h, C_h) and (Â_i, Ĉ_i), respectively. Then, the problem of dictionary learning on the infinite Grassmann manifold can be expressed as the optimization min_D Σ_{h=1}^{K} 2Θ(h), where Θ(h) involves the Cholesky decomposition matrices P_h, P_j, and P_i associated with dictionary elements D_h, D_j, and action X_i, respectively.

Alternative Nonnegative Tensor Representation of Skeleton Sequence.
To obtain the nonnegative tensor representation more easily, the whole skeleton is translated to the first octant. Each skeleton in the first octant contains N joint points, N−1 joint angles, and N−1 rigid bodies, and an action sequence consists of λ skeletons. To verify the effectiveness of our approach, we compare it with the following four alternative nonnegative tensor representations.

Second-Order Nonnegative Joint Positions (2NJP). An action sequence is represented as a second-order nonnegative tensor of size 3N × λ, in which each skeleton is seen as a nonnegative vector consisting of the 3D coordinates of all joint points.

Second-Order Nonnegative Rigid Body Direction (2NRBD). An action sequence is represented as a second-order nonnegative tensor of size 3(N−1) × λ, in which each skeleton is seen as a nonnegative vector consisting of the directions of all rigid bodies (the direction of a rigid body is represented by the three angles between the rigid body and the x-, y-, and z-axes, respectively).

Third-Order Nonnegative Joint Angle and Joint Positions (3NJAP).
Given two adjacent rigid bodies, we can obtain the joint angle between the two rigid bodies and a coordinate tuple comprising the coordinates of the three joints in the two rigid bodies. An action sequence can be represented as a third-order nonnegative tensor of size (N−1) × 10 × λ, where each skeleton is seen as a second-order nonnegative tensor that consists of all the joint angles and coordinate tuples.

Third-Order Nonnegative Joint Angle and Direction (3NJAD).
Given two adjacent rigid bodies, we can obtain a direction tuple containing only the directions of the two rigid bodies and the joint angle between the two rigid bodies. An action sequence can be represented as a third-order nonnegative tensor of size N(N−1) × 7 × λ, in which each skeleton is seen as a second-order nonnegative tensor consisting of all the joint angles and direction tuples.

Parameter Estimation.
In the MSR-Action3D, UTKinect-Action, and G3D-Gaming datasets, all skeletons contain the same number of joints and rigid bodies (19 rigid bodies and 20 joint points). NTD computes the best rank-(d_1, d_2, τ) approximation of a nonnegative tensor, where S ∈ R^{d_1×d_2×τ} is the core tensor. Thus, d_1, d_2, and τ affect the cost function D(Y ‖ Ŷ) between a nonnegative tensor Y and its best rank approximation Ŷ. Each skeleton sequence in the three datasets can be represented as a nonnegative tensor Y ∈ R^{18×9×λ} (λ is the length of the skeleton sequence). To compute the best rank approximation of Y ∈ R^{18×9×λ}, we set d_1 = 9, d_2 = 4, and τ = Range(λ) (τ = 23 on the MSR-Action3D dataset).

Experiments on MSR-Action3D Dataset.
Following the experimental protocol in [36], the 20 actions in the dataset are categorized into three subsets, AS1, AS2, and AS3, each including 8 actions. Subsets AS1 and AS2 include actions with similar movements, whereas subset AS3 includes more complex actions. The cross-subject evaluation method, in which half of the subjects are used for training and the remaining subjects for testing, is utilized on each subset. The average recognition rate is reported over 10 different combinations of training and testing sets. Table 1 shows that the proposed approach outperforms various other methods that extract action features from 3D joint positions. Our approach achieves an average accuracy of 97.63% on the MSR-Action3D dataset, outperforming the other action recognition approaches. The average accuracy of our approach is 0.41% better than that of Ensemble TS-LSTM [14], 2.78% better than tLDS [6], 5.79% better than Bi-LSTM [15], 1.81% better than 3NJAP-nLDS, and 0.17% better than 3NJAD-nLDS. The superior performance on the three subsets indicates that our approach is better than the other methods at both distinguishing similar actions and recognizing complex actions.

Following the experimental protocol in [8], we test all the actions in the MSR-Action3D dataset. The experiment on the entire dataset is more challenging than that in [36]. Our approach achieves an accuracy of 96.97%, as shown in Table 2.

Table 2: Recognition rate (%) on the MSR-Action3D dataset based on the experimental protocol of [8].

Approach                              Recognition Accuracy
SE3 [12]                              89.48
Grassmann Manifold [8]                91.21
Learning features combination [16]    90.36 ± 2.45
ST-LSTM + Trust Gate [17]             94.80
Bi-LSTM [15]                          86.18
Our approach                          96.97
Figure 5 shows the classification confusion matrix for the entire MSR-Action3D dataset. Observing the confusion matrix, we find that the recognition rate of most actions reaches 100%. Classification errors occur mainly when two actions are extremely similar, such as draw tick and horizontal arm wave.

Experiments on UTKinect-Action Dataset.
The UTKinect-Action dataset is used to further evaluate our approach. This dataset comprises 10 types of human actions captured by a single stationary Kinect in indoor settings. The 10 actions are walking, sitting down, standing up, picking up, carrying, throwing, pushing, pulling, waving, and clapping hands. Ten different subjects (9 males, 1 female) performed each action twice. Overall, there are 6220 frames in the 199 action sequences. This dataset is very challenging. Firstly, as the body parts in some actions move out of the field of view, parts of the human body are invisible. Secondly, different subjects perform the same action with different limbs, such as left-hand waving and right-hand waving. Thirdly, the action sequences are captured from different views, which causes difficulties for action recognition.

Table 3: Recognition rate (%) on the UTKinect-Action dataset.

Approach                              Recognition Accuracy
SE3 [12]                              97.08
EigenJoints [13]                      97.10
Grassmann Manifold [8]                88.5
Key-Pose-Motifs [18]                  93.47
Learning features combination [16]    98.00
Ensemble TS-LSTM [14]                 96.97
tLDS [6]                              96.48
Bi-LSTM [15]                          96.89
Our approach                          98.23
To appropriately compare our approach with the state-of-the-art algorithms, the leave-one-sequence-out cross validation (LOOCV) method is applied in our experiment on this dataset. For each iteration, we choose one action sequence for testing and use the remaining action sequences for training. Each testing sequence was randomly chosen, and the experiment was performed ten times. Table 3 presents the results achieved by our approach and other state-of-the-art methods. The recognition rate of our approach on this dataset is 98.23%, outperforming SE3 [12], EigenJoints [13], Grassmann manifold [8], Key-Pose-Motifs [18], learning features combination [16], Ensemble TS-LSTM [14], tLDS [6], and Bi-LSTM [15], which achieve recognition rates of 97.08%, 97.10%, 88.5%, 93.47%, 98.00%, 96.97%, 96.48%, and 96.89%, respectively.

The main reason for this may be that our approach more accurately reflects the relationships between the skeletons in the action sequence.

Table 4: Recognition rate (%) on the G3D-Gaming dataset.

Approach              Recognition Accuracy
GB-RBM+HMM [19]       86.40
SE [12]               91.09
SO [20]               87.95
LieNet [21]           89.10
tLDS [6]              90.60
Our approach          92.56

G3D-Gaming Dataset.
The G3D-Gaming dataset consists of 663 sequences of 20 different gaming actions captured by a Microsoft Kinect. Ten different subjects performed each gaming action more than twice. The dataset provides three types of data: synchronized video, depth, and skeleton data; only the skeleton data is used in our experiment. Experiments on this dataset are also challenging owing to two factors. First, when body parts are occluded, the Kinect tracker gives inferred results that influence the recognition rates of actions such as TennisSwingBackhand, Golf, and ThrowBowlingBall. Second, if the movement ranges of two different actions are relatively small, the two actions may easily be confused with each other during recognition. The cross-subject evaluation method is used to perform our experiment, and the average recognition results are reported over ten different combinations of training and testing sets. The proposed approach is compared with the state-of-the-art methods reported for the G3D-Gaming dataset, as listed in Table 4. GB-RBM+HMM [19] and LieNet [21] use deep learning to recognize human actions: GB-RBM+HMM incorporates a Gaussian binary restricted Boltzmann machine (GB-RBM) with a hidden Markov model (HMM) to capture the global and local dynamic features of the joint trajectories, while LieNet combines Lie group structures with a deep network architecture to obtain more appropriate Lie group features for recognizing human actions. Our approach achieves a recognition accuracy of 92.56%, outperforming GB-RBM+HMM [19], SE [12], SO [20], LieNet [21], and tLDS [6], which achieve recognition rates of 86.40%, 91.09%, 87.95%, 89.10%, and 90.60%, respectively.

Discussion about Nonnegative Tensor Representation and Extended Observability Sequence.
In this work, an action is represented by a third-order nonnegative tensor. Following the approaches proposed in [6], the finite observability matrix O_k^T = [Ĉ^T, (ĈÂ)^T, ..., (ĈÂ^{k-1})^T] of the third-order nonnegative tensor can be considered as the feature descriptor for the action, where k is the truncation parameter of the extended observability sequence O_∞^T. The subspace spanned by the columns of the finite observability matrix corresponds to a point on a Grassmann manifold. Therefore, to verify the effectiveness of the infinite Grassmann manifold, we use dictionary learning and sparse coding on the (finite) Grassmann manifold to classify the nonnegative tensor-based actions represented as points on a Grassmann manifold, whereas in our approach the full extended observability sequence is mapped to a point on the infinite Grassmann manifold. The experimental results, shown in Table 5, demonstrate that using the extended observability sequence without truncation is effective in improving the accuracies on the three datasets.

In [6], an action is represented by a third-order tensor. The third-order tensor can be mapped to a point on an infinite Grassmann manifold using the approach proposed in [34]. Then, to verify the effectiveness of the nonnegative tensor-based action representation, we use dictionary learning and sparse coding on the infinite Grassmann manifold to classify the third-order tensors represented as points on an infinite Grassmann manifold. The experimental results, shown in Table 6, demonstrate that the nonnegative tensor-based action representation is effective in improving the accuracies on the three datasets.

Furthermore, the subspace spanned by the columns of O_k^T corresponds to a point on a Grassmann manifold (k is the truncation parameter of O_∞^T), so the actions, represented by third-order nonnegative tensors, can be mapped to points on a Grassmann manifold. Therefore, to verify the effectiveness of NTD, we use LTBSVM [8] to classify the actions represented as points on a Grassmann manifold. The experimental results, shown in Table 7, demonstrate that NTD is effective in improving the accuracies on the three datasets.

Evaluating the Effect of Infinite Mapping and NTD.
In [8], the extended observability matrix of an action is built from the parameters (C, A), which are estimated by the ARMA model of the action sequence. We map this extended observability matrix to a point on an infinite Grassmann manifold using the approach proposed in [34]. Then, dictionary learning and sparse coding on the infinite Grassmann manifold are employed to classify the actions represented as points on the infinite Grassmann manifold, in order to verify the effectiveness of the infinite mapping (i.e., the infinite Grassmann manifold). The experimental results, shown in Table 8, demonstrate that the infinite mapping is effective in improving the accuracies on the three datasets.

Computation Complexity and Run Time. Computation complexity comprises time complexity and space complexity.
The time complexity of our algorithm is O(M·N²), where N is the scale of a skeleton sequence (or an action) and M is the total number of skeleton sequences in a dataset. Its space complexity is O(M·N³). MATLAB is used to run our experiments on a machine with a 3.60 GHz Intel Core i7-4790 CPU. The run time comprises the training time and the testing time. Table 9 shows the run times on the three datasets.

Preprocessing and Skeleton Translation
The Preliminary Preprocessing and Translation of Skeletons. Before all skeletons in an action dataset are translated to the local coordinate system, we preprocess the skeletons as follows.
Preliminary Preprocessing. For each action dataset, all skeletons are transformed to a global coordinate system whose origin is located at the hip center; this makes the skeletal data invariant to the absolute location of the human in the scene. One of the skeletons is chosen as a reference skeleton, and all other skeletons are normalized (without changing their joint angles) such that their body-part lengths equal the corresponding body-part lengths of the reference skeleton, which makes the skeletons scale-invariant. All skeletons are then rotated such that the global x-axis is aligned with the ground-plane projection of the vector from the left hip to the right hip, which makes the skeletons view-invariant. Figure 6(a) shows a preprocessed skeleton.
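A minimal sketch of the translation and rotation steps of this preliminary preprocessing (the joint-index arguments, the choice of y as the vertical axis, and the omission of the bone-length normalization step are our assumptions for illustration):

```python
import numpy as np

def preprocess_skeleton(joints, hip_center, left_hip, right_hip):
    """Translate so the hip center is at the origin, then rotate about the
    vertical (y) axis so the ground-plane projection of the left-to-right
    hip vector aligns with the global x-axis. `joints` is an (N, 3) array."""
    J = joints - joints[hip_center]          # location invariance
    h = J[right_hip] - J[left_hip]
    theta = np.arctan2(h[2], h[0])           # angle of the hip vector in the x-z plane
    c, s = np.cos(theta), np.sin(theta)
    R = np.array([[c, 0., s],                # rotation about the y-axis by -theta
                  [0., 1., 0.],
                  [-s, 0., c]])
    return J @ R.T                           # view invariance
```

Since the transform is a rigid translation plus rotation, all inter-joint distances (and hence joint angles) are preserved, which is what allows the subsequent bone-length normalization to be applied independently.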
Obviously, the hip center of each preprocessed skeleton is located at the origin o of the global coordinate system. Hence, whether a subject moves a lot or a little, the amount by which each joint must be translated in the preprocessed skeleton sequence is small (the preprocessed skeleton sequence still represents the locomotion of the subject). In other words, when the locomotion of subjects is represented by preprocessed skeleton sequences, the translation required to make all joint coordinates lie in the first octant is small. The locomotion of the subjects therefore does not influence the amount of translation, because all skeletons have been preprocessed by the preliminary preprocessing; the amount of translation depends only on the joint positions in the preprocessed skeleton sequences, and that amount is small. For our approach, the preliminary preprocessing imposes no special limitations and also helps to improve the accuracy of action recognition.
Given an action dataset D in which all skeletons have been preprocessed by the preliminary preprocessing, let p_t^i(j) = (x_t^i(j), y_t^i(j), z_t^i(j)), for j = 1,…,J, t = 1,…,T, and i = 1,…,I, denote the 3D position of joint j in skeleton t of skeleton sequence i, where J is the total number of joints in a skeleton, T is the total number of skeletons in sequence i, and I is the total number of skeleton sequences in D. Let o' = (x_o', y_o', z_o') be the origin of the local coordinate system. To make all joints lie in the first octant, all preprocessed skeletons in D are translated to the local coordinate system with its origin at o', so that the hip center of each preprocessed skeleton is placed at o'.
Discussion about Skeleton Translation. Let d_op = |op| denote the distance between the origin p of the local coordinate system and the origin o of the global coordinate system, where p = (x_o' + α, y_o' + β, z_o' + γ), α ⩾ 0, β ⩾ 0, and γ ⩾ 0. Figure 7 shows the relationship between d_op and the recognition rates. We found that the recognition rates gradually decreased as d_op increased, which means that translating the skeletons affects the accuracy of action recognition. Therefore, to reduce this negative effect as far as possible, we set the origin of the local coordinate system to o', which ensures that the joints of the translated skeletons lie in the first octant of the global coordinate system while d_op achieves its minimum d_oo'.
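The translation step can be sketched as follows. Choosing the common offset from the componentwise minimum over the whole dataset is our own illustrative assumption, not necessarily the paper's exact construction of o':

```python
import numpy as np

def translate_to_first_octant(sequences):
    """Translate all skeleton sequences of a dataset by a common
    offset so that every joint coordinate is nonnegative, i.e. all
    joints lie in the first octant. Taking the offset as the
    componentwise minimum over the dataset is one plausible
    realization of the local origin o' (an assumption)."""
    stacked = np.concatenate([s.reshape(-1, 3) for s in sequences])
    offset = stacked.min(axis=0)      # componentwise minimum over D
    return [s - offset for s in sequences]

# two tiny sequences, each of shape (frames, joints, 3)
seqs = [np.array([[[-1., 2., 0.], [0., 1., 3.]]]),
        np.array([[[2., -3., 1.], [1., 0., 2.]]])]
out = translate_to_first_octant(seqs)
```

Because every sequence is shifted by the same offset, relative joint geometry is unchanged while all coordinates become nonnegative, which is what the nonnegative tensor representation requires.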

Conclusion and Future Work
In this paper, an action is represented as a third-order nonnegative tensor. To capture the original spatiotemporal information of the action, an nLDS is used to model it, and NTD is employed to estimate the parameters of the nLDS model. The extended observability sequence of the parameters, which serves as the feature descriptor of the action, is mapped to a point on an infinite Grassmann manifold. Dictionary learning and sparse coding on the infinite Grassmann manifold are then used to perform classification. The experimental results demonstrate that our approach achieves better performance than other state-of-the-art skeleton-based action recognition approaches. Following the theory proposed by Zhou et al. [38], future research will focus on the effect of unique and sparse NTD on action recognition.

Figure 1: Complete framework of the proposed method.

Figure 2: (a) A human skeleton including 20 joints and 19 rigid bodies. (b) Using the translation to move a skeleton to the first octant.

Figure 3:

Algorithm 1: Learning the nLDS model with a second-order tensor time series (step 3: finding the best estimate of the parameter A via the pseudo-inverse over frames 1 : λ − 1).
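For reference, the classical SVD-based estimate of the parameters (C, A) from a matrix of observations proceeds as below. This is the standard matrix-valued procedure (cf. [9]) rather than the tensor-based nLDS algorithm itself, and the toy system is illustrative:

```python
import numpy as np

def estimate_lds_params(Y, n):
    """Classical SVD-based estimate of LDS parameters (C, A) from
    an observation matrix Y of shape (p, T): Y ~= C X, then A is
    the least-squares solution of X[:, 1:] ~= A X[:, :-1],
    obtained via the pseudo-inverse."""
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n]                               # measurement matrix
    X = np.diag(s[:n]) @ Vt[:n, :]             # hidden state sequence
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])   # transition matrix
    return C, A, X

# toy check: data generated by a known 1-state system y_t = c * 0.5^t
t = np.arange(20)
Y = np.outer([1.0, 2.0], 0.5 ** t)
C, A, X = estimate_lds_params(Y, n=1)
```

On this toy system the recovered transition matrix equals the true decay factor 0.5, and C X reconstructs Y exactly; the NTD-based estimation plays the analogous role for the nonnegative tensor time series.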
Dataset. The MSR-Action3D dataset includes 557 human pose sequences captured from 10 subjects performing 20 actions, each action with 2 or 3 repetitions. Each pose in this dataset provides the 3D locations of 20 joints, and each action sequence includes approximately 50 frames. Experiments on this dataset are challenging because many actions are very similar, and pose sequences of the same action class can have large intraclass variation owing to variations in performing style.

Figure 5: nLDS model with the third-order nonnegative tensor.
Figure 6(b) shows a preprocessed skeleton translated to the local coordinate system.

Figure 6: (a) A preprocessed skeleton in the global coordinate system. (b) The translation of the preprocessed skeleton.

Figure 7: The influence of skeletal translation on action recognition.

Table 4: Recognition rates on the G3D-Gaming dataset.

Table 6: Discussion about the nonnegative tensor representation.

Table 7: Comparison between SVD and NTD.