Key-Skeleton-Pattern Mining on 3D Skeletons Represented by Lie Group for Action Recognition

The human skeleton can be considered as a tree system of rigid bodies connected by bone joints. In recent research, substantial progress has been made in both theory and experiment on skeleton-based action recognition. However, it remains challenging to accurately represent the skeleton and to precisely eliminate noisy skeletons from an action sequence. This paper proposes a novel skeletal representation, composed of two subfeatures, for recognizing human action: static features and dynamic features. First, to avoid scale variations from subject to subject, the orientations of the rigid bodies in a skeleton are employed to capture the scale-invariant spatial information of the skeleton. The static feature of the skeleton is defined as a combination of these orientations. Unlike previous orientation-based representations, the orientation of a rigid body in the skeleton is defined by the rotations between the rigid body and the coordinate axes in three-dimensional space. Each rotation is mapped to the special orthogonal group SO(3). Next, the rigid-body motions between the skeleton and its preceding skeletons are utilized to capture the temporal information of the skeleton. The dynamic feature of the skeleton is defined as a combination of these motions. Similarly, the motions are represented as points in the special Euclidean group SE(3). Therefore, the proposed skeleton representation lies in the Lie group (SE(3) × ⋅ ⋅ ⋅ × SE(3), SO(3) × ⋅ ⋅ ⋅ × SO(3)), which is a manifold. Using the proposed representation, an action can be considered as a series of points in this Lie group. Then, to recognize human action more accurately, a new pattern-growth algorithm named MinP-PrefixSpan is proposed to mine the key-skeleton-patterns from the training dataset. Because the algorithm reduces the number of new patterns in each growth step, it is more efficient than the PrefixSpan algorithm.
Finally, the key-skeleton-patterns are used to discover the most informative skeleton sequence of each action. Our approach achieves accuracies of 94.70%, 98.87%, and 95.01% on three action datasets, outperforming other related action recognition approaches, including LieNet, Lie group, the Grassmann manifold, and a graph-based model.


Introduction
Human action recognition is currently the most dynamic research topic in the field of computer vision, owing to its applications in intelligent surveillance, video games, robotics, and other fields. Several approaches have been proposed to recognize human action from RGB video sequences over the past few decades [1], but their performance is unsatisfactory because RGB data are very sensitive to factors such as perspective changes, occlusions, and background clutter. Although significant research results have been achieved, human action recognition remains a challenging problem.
Because the human skeleton can generally be regarded as an articulated system of rigid segments connected by joints, human action can be viewed as a continuous evolution of the spatial configuration constructed by these rigid segments [2]. Therefore, if human skeleton sequences can be accurately extracted from RGB videos, action recognition can be performed by classifying these sequences. However, it is very difficult to accurately extract a skeleton sequence from RGB videos [3]. With the advent of cost-effective RGB-D cameras, it has become easier to extract the three-dimensional (3D) human skeleton from depth maps. Although this alleviates the appearance and viewpoint variations to a certain extent [4][5][6][7], the following two challenges cause large intraclass variations and remain unresolved. First, different people can perform the same action in different ways. Second, the 3D human skeleton is sometimes imprecise because depth maps include noisy information. However, psychological research has found that humans can easily recognize an action from a pose sequence [8]. Based on [8], Yang et al. considered that actions can be classified by a single key pose [9]. This suggests that a set of key skeletons, rather than the entire skeleton sequence, can be used to perform action classification. Since a representation based on key poses is robust to outlier poses, this approach should improve the accuracy of action recognition as long as the key poses are accurate.
The general framework of the proposed approach is shown in Figure 1. Observing human action in daily life suggests that the orientations and motions of rigid bodies carry a great deal of useful information for action recognition. In this paper, a new skeletal representation, composed of a static feature and a dynamic feature, is proposed for 3D skeleton-based action recognition. The static feature is used to represent the spatial information in a given skeleton t. To capture scale-invariant spatial information, the orientations of the rigid bodies in the skeleton are employed to construct its static feature. In this work, the orientation of a rigid body in the skeleton is represented as six rotation matrices between the rigid body and the three coordinate axes in 3D space. The rotation matrices are mapped to the special orthogonal group SO(3) [10]. Next, the dynamic feature is employed to represent the temporal information of skeleton t. The dynamic feature is composed of the rigid-body motions between skeletons t and t−1 and those between skeletons t and 1 (the three skeletons belong to the same sequence or action). The motions are represented as points in the special Euclidean group SE(3) [11]. Hence, skeleton t is represented by a point in the Lie group (SE(3) × ⋅ ⋅ ⋅ × SE(3), SO(3) × ⋅ ⋅ ⋅ × SO(3)), where the operation × represents the direct product between groups in group theory. Using the proposed skeleton representation, a human action (skeleton sequence) can be represented as points in the Lie group. However, it is typically a very complicated task to classify human actions represented in a Lie group directly. Many standard classification approaches, such as the support vector machine (SVM) [12], are not directly applicable to Lie groups.
To overcome the classification difficulties, the actions (skeleton sequences) are mapped from the Lie group to its Lie algebra (se(3) × ⋅ ⋅ ⋅ × se(3), so(3) × ⋅ ⋅ ⋅ × so(3)), which is the tangent space of the manifold at the identity element. The Lie algebra is a vector space, which makes action classification easier.
An action (skeleton sequence) usually includes many noisy skeletons, which can reduce the action recognition accuracy. In this study, the key-skeleton-patterns are used to eliminate noisy skeletons from an action, and the remaining skeletons in the action are called the most informative skeleton sequence. First, a pattern is defined as a short skeleton sequence, which is not necessarily adjacent in the original skeleton sequences. If the short skeleton sequence appears in many skeleton sequences of an action class, the pattern is called the key-skeleton-pattern in that class. Next, to mine the key-skeleton-patterns, k-means is used to learn the symbolic dictionary from all skeletons in the dataset. Each symbol in the dictionary represents a class of similar skeletons, which means that each skeleton is quantized (represented) by a symbol in the dictionary. Then, a skeleton sequence can be represented as a symbol sequence. In this paper, probability is used to measure the distance between a skeleton and its corresponding symbol in order to minimize the effect of quantization errors (e.g., two different skeletons are quantized by the same symbol). Hence, each skeleton is represented by a distance-based probability, and an action is represented as a probability sequence. Then, a new pattern-growth algorithm named MinP-PrefixSpan is proposed to mine the key-skeleton-patterns of an action class from the symbol sequences and the probability sequences that correspond to the action class. Compared with the PrefixSpan algorithm, our algorithm achieves higher efficiency by reducing the number of new skeleton patterns in each growth step. Finally, the key-skeleton-patterns are utilized to eliminate noisy skeletons from the action in order to capture the most informative skeleton sequence of the action. An SVM is employed to classify the most informative skeleton sequences.
The main contributions of this study are as follows. (1) To capture scale-invariant skeletal information, the orientations of rigid bodies in a skeleton are utilized to construct the static feature. Different from previous orientation-based approaches, in this study, a rigid-body orientation is represented as six rotation matrices, and each rotation matrix is represented as a point in SO(3).
(2) Traditional approaches based on Lie groups [5,13,14] only consider the spatial information of a skeleton but ignore the temporal information between different skeletons. Therefore, our approach employs the rigid-body motions between different skeletons to describe the temporal variation. Likewise, the motions can be represented as points in SE(3). (3) Traditional approaches also ignore the influence of noisy skeletons in an action on the accuracy of action recognition. In this study, based on the PrefixSpan algorithm [15] in data mining, a new pattern-growth algorithm is proposed to mine the key-skeleton-patterns of each action class, and the key-skeleton-patterns are used to eliminate noisy skeletons.

Related Work
A brief overview of the related work on human action recognition approaches based on skeletons is provided in this section, and various sequential pattern-mining algorithms are reviewed.
The existing skeleton-based action recognition approaches can be classified into three main categories. The first class of approaches ignores the influence of noisy skeletons on the accuracy of action recognition. Slama et al. represented an action by an observability matrix, which was characterized by an element of a finite Grassmann manifold [16]. However, their method does not eliminate noisy skeletons from an action, and it is insufficient to approximate an extended observability sequence with a finite Grassmann manifold. Ding et al. divided actions into subactions and used the profile hidden Markov model (HMM) to align them [13]. Although their approach accurately extracts the spatial features of an action, it does not solve the following two problems: eliminating noisy skeletons and reducing the time complexity of the profile HMMs. Liu et al. proposed a new spatiotemporal representation, called "Skepxels", to transform skeleton videos into images of flexible dimensions, and employed the resulting images to build a CNN-based framework for effective human action recognition [17]. Likewise, their approach does not eliminate noisy skeletons from an action. In this study, the key-skeleton-patterns of an action are utilized to eliminate noisy skeletons from the action in order to improve the accuracy of action recognition.
The second class of approaches ignores scale variations from subject to subject, which means that the spatial feature of an action cannot be accurately represented. Chaudhry et al. hierarchically divided the human skeleton into smaller parts and employed certain bio-inspired shape features to represent each part [18]. The temporal evolutions of these bio-inspired features are modeled by linear dynamical systems (LDSs). Although their approach takes full advantage of the correlation between the skeletal parts, it ignores the features of the rigid bodies in a skeleton and the scale variations between different subjects. Xia et al. proposed a view-invariant representation of the human skeleton using histograms of 3D joint locations [19]. The temporal evolutions of this skeletal representation are modeled by a discrete HMM. However, their approach ignores not only the relativity between the rigid bodies in a skeleton but also the normalization of the skeleton data. Li et al. represented an action by a special graph based on the top-K relative variance of joint relative distance (RVJRD) [20]. One potential limitation of this approach is that the graph-based model does not handle scale variations, which may cause incorrect spatial information to be selected by the top-K RVJRD. In contrast, our proposed approach uses the orientations of the rigid bodies in a skeleton to capture scale-invariant skeletal features.
The third class of approaches ignores the temporal information of an action and treats the poses in the action independently. Evangelidis et al. used a local skeleton descriptor to encode the relative positions of joint quadruples [21]. The descriptor of an action was represented by a multilevel Fisher vector composed of the local skeleton descriptors in the action. However, the action descriptor not only ignores the temporal information between different skeletons but also has high time complexity. Huang et al. combined the Lie group structure with a deep network framework [22]. Their learning structure (LieNet) has a rotation mapping layer that transforms the Lie group features into a traditional neural network model. One main limitation of this approach is that LieNet ignores the rich temporal information of human actions. Vemulapalli et al. described the relative geometry between the rigid-body parts using the special Euclidean group SE(3) [5]. Therefore, the entire skeleton in an action can be represented as a point in SE(3) × ⋅ ⋅ ⋅ × SE(3), and an action is represented as a curve in this Lie group. Although their approach can accurately extract the spatial information of a skeleton, it ignores the temporal cues between the skeletons in an action and does not eliminate noisy skeletons from the action. Our proposed dynamic feature models the temporal structure of an action using the rigid-body motions between different skeletons in the action.
Sequential pattern mining aims to discover frequent subsequences as patterns in a sequence database. Traditional sequential pattern mining algorithms [23][24][25][26] are usually used to mine frequent sequential patterns from deterministic databases. However, these approaches cannot be directly applied to uncertain (or probabilistic) data. Unfortunately, the existing pattern mining algorithms on uncertain datasets [27,28] are not adapted to our probabilistic sequence model. Therefore, considering the amount of noise in our uncertain (probabilistic) datasets, a new pattern-growth algorithm is proposed to mine the key-skeleton-patterns from the datasets.

Fundamental Concepts.
In this subsection, a brief overview of the special Euclidean group SE(3) and the special orthogonal group SO(3) is presented, which is necessary for further understanding of the Lie group. We refer the reader to [2,10,11] for a general introduction to Lie groups. Important notation is shown in Table 1.
The special orthogonal group SO(3) is the set of all 3 × 3 rotation matrices:

SO(3) = {A ∈ ℝ^(3×3) | AAᵀ = I₃, det(A) = 1},

where I₃ denotes the 3 × 3 identity matrix and A is a rotation matrix. In 3D space, a rotation A is an element of SO(3) and transforms a vector x = [x₁, x₂, x₃]ᵀ to Ax. The group SO(3) has an associated Lie algebra, which is the tangent space at the identity element I₃. The Lie algebra of SO(3), denoted by so(3), is the set of all real 3 × 3 skew-symmetric matrices:

so(3) = {Ω ∈ ℝ^(3×3) | Ωᵀ = −Ω}.

The exponential map exp: so(3) → SO(3) and the logarithm map log: SO(3) → so(3) are the matrix exponential and matrix logarithm, respectively. Analogously, the special Euclidean group SE(3) is the set of 4 × 4 matrices of the form [A, t; 0, 1] with A ∈ SO(3) and t ∈ ℝ³, and its Lie algebra se(3) is the corresponding tangent space at the identity.
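The exponential and logarithm maps above can be sketched numerically. The following numpy sketch (not tied to the authors' code) implements the Rodrigues formula for exp and recovers the rotation vector for log:

```python
import numpy as np

def hat(w):
    """Map a 3-vector to its 3x3 skew-symmetric matrix in so(3)."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def so3_exp(w):
    """Exponential map so(3) -> SO(3) via the Rodrigues formula."""
    theta = np.linalg.norm(w)
    if theta < 1e-10:
        return np.eye(3)
    W = hat(np.asarray(w) / theta)
    return np.eye(3) + np.sin(theta) * W + (1.0 - np.cos(theta)) * (W @ W)

def so3_log(A):
    """Logarithm map SO(3) -> so(3), returned as a rotation vector."""
    theta = np.arccos(np.clip((np.trace(A) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-10:
        return np.zeros(3)
    # The off-diagonal differences recover the rotation axis for 0 < theta < pi.
    v = np.array([A[2, 1] - A[1, 2], A[0, 2] - A[2, 0], A[1, 0] - A[0, 1]])
    return theta * v / (2.0 * np.sin(theta))
```

For example, so3_exp([0, 0, π/2]) is the 90° rotation about the z-axis, and so3_log recovers the rotation vector exactly.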

Explanation of Fundamental Concepts.
According to the concepts described in Section 3.2.1, the orientation of a rigid body is represented by six rotation matrices. Mathematically, a rotation matrix is a point in SO(3); therefore, the orientation of a rigid body in a skeleton can be represented as six points in SO(3). Then the static feature, composed of the orientations of the rigid bodies in the skeleton, is represented as a point in the Lie group (SO(3) × ⋅ ⋅ ⋅ × SO(3)), as shown in Figure 1.
The motion of a rigid body is generally regarded as its rotations and translations in 3D space. Mathematically, the rotations and translations of a rigid body are elements of SE(3); therefore, a rigid-body motion between skeletons t and t−1 (or skeletons t and 1) can be represented as a point in SE(3). Then, the dynamic feature, which is composed of the rigid-body motions between skeletons t and t−1 and those between skeletons t and 1, is represented as a point in the Lie group (SE(3) × ⋅ ⋅ ⋅ × SE(3)), as shown in Figure 1. A skeletal representation, composed of the static feature and the dynamic feature, can thus be represented as a point in the combined Lie group. In Figure 1, the wavy surface represents a Lie group, a whole circle in the wavy surface represents an action (skeleton sequence), and each black dot in the circle represents a skeleton. An action can therefore be represented as points in the Lie group (the points belonging to the same circle). To overcome the classification difficulties, an action (a whole circle) is mapped from the Lie group to its Lie algebra, as shown in Figure 1; the Lie algebra is a vector space. To describe the orientation of a given rigid body, the global coordinate system is translated to the local coordinate system of the rigid body, and three rotations transform the rigid body to the three coordinate axes, as shown in Figure 3.

Extraction of Skeleton
Let R_{n,x}, R_{n,y}, and R_{n,z} denote the three rotations that transform rigid body b_n to the coordinate axes x, y, and z, respectively, where R_{n,x}, R_{n,y}, R_{n,z} ∈ SO(3). Similarly, R_{x,n}, R_{y,n}, and R_{z,n} denote the three rotations that transform the axes x, y, and z to the rigid body b_n, respectively, as shown in Figure 3(d), where R_{x,n}, R_{y,n}, R_{z,n} ∈ SO(3). These six rotations can be used to describe the orientation of the rigid body.
Given a skeleton, O(b_n) = {R_{n,x}, R_{n,y}, R_{n,z}, R_{x,n}, R_{y,n}, R_{z,n}} is used to represent the orientation of rigid body b_n in the skeleton. In this work, the skeletal static feature is defined as the set of the orientations of the rigid bodies in the skeleton, S = {O(b_1), . . . , O(b_M)}, where M is the total number of rigid bodies in the human skeleton.
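As an illustration of the static feature, the rotation between a bone direction and a coordinate axis can be built with the standard cross-product construction. This is a hedged numpy sketch: `rotation_between` and `static_feature` are hypothetical helper names, and the per-bone loop is an assumption about how the six rotations per rigid body are collected.

```python
import numpy as np

def rotation_between(a, b):
    """Rotation matrix in SO(3) that rotates unit vector a onto unit vector b."""
    a = np.asarray(a, dtype=float); a = a / np.linalg.norm(a)
    b = np.asarray(b, dtype=float); b = b / np.linalg.norm(b)
    v = np.cross(a, b)
    c = float(np.dot(a, b))
    if np.linalg.norm(v) < 1e-10:          # parallel or anti-parallel vectors
        if c > 0:
            return np.eye(3)
        # anti-parallel: rotate by pi about any axis u perpendicular to a
        u = np.cross(a, [1.0, 0.0, 0.0])
        if np.linalg.norm(u) < 1e-10:
            u = np.cross(a, [0.0, 1.0, 0.0])
        u = u / np.linalg.norm(u)
        return 2.0 * np.outer(u, u) - np.eye(3)
    V = np.array([[0, -v[2], v[1]], [v[2], 0, -v[0]], [-v[1], v[0], 0]])
    return np.eye(3) + V + V @ V * (1.0 - c) / (np.linalg.norm(v) ** 2)

def static_feature(skeleton_bones):
    """Hypothetical static feature: for each bone (a 3D direction vector),
    collect the six rotations between the bone and the x, y, z axes."""
    feat = []
    for bone in skeleton_bones:
        for axis in np.eye(3):
            feat.append(rotation_between(bone, axis))   # bone -> axis
            feat.append(rotation_between(axis, bone))   # axis -> bone
    return feat
```

With M bones this yields 6M rotation matrices, matching the six-rotations-per-rigid-body definition above.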

Dynamic Feature of Skeleton.
Rigid-body motion is generally regarded as rotations and translations in 3D space. Mathematically, the rotations and translations of a rigid body can be denoted by SE(3). In this study, SE(3) is employed to describe the rigid-body motions between different skeletons. Let b_n(i) ∈ ℝ³ be rigid body n in skeleton i and b_n(j) ∈ ℝ³ be rigid body n in skeleton j (i ≠ j). Given a point q(i) ∈ b_n(i) and the corresponding point q(j) ∈ b_n(j), we have

q(i) = R_n(i, j) q(j) + t_n(i, j),

where P_n(i, j) = [R_n(i, j), t_n(i, j); 0, 1] ∈ SE(3), and R_n(i, j) and t_n(i, j) are the rotation and translation that transform b_n(j) to the position and orientation of b_n(i), respectively, as shown in Figure 4(b). Collecting the motions of all rigid bodies gives P(i, j) = {P_1(i, j), . . . , P_M(i, j)}, where M is the total number of rigid bodies in the skeleton.
In this study, our approach considers only the rigid-body motions between skeletons t and t−1 and those between skeletons t and 1. According to formula (24), the rigid-body motions between skeletons t and t−1 can be represented by P(t, t−1). Similarly, the rigid-body motions between skeletons t and 1 can be represented by P(t, 1). Then, the skeletal dynamic feature is defined as the set of rigid-body motions D(t) = {P(t, t−1), P(t, 1)}. Skeleton t is represented by the static feature and the dynamic feature as F(t) = {S(t), D(t)}, where S(t) represents the static feature of the skeleton and D(t) represents its dynamic feature.
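The relative rigid-body motion between two skeletons can be illustrated as follows. `se3_matrix` and `relative_motion` are hypothetical helper names, and the sketch assumes each body part's pose is available as a 4 × 4 homogeneous transform:

```python
import numpy as np

def se3_matrix(R, t):
    """Build a 4x4 homogeneous transform [R t; 0 1] in SE(3)."""
    P = np.eye(4)
    P[:3, :3] = R
    P[:3, 3] = t
    return P

def relative_motion(P_i, P_j):
    """Rigid-body motion P(i, j) in SE(3) that carries the pose of a body part
    in skeleton j onto its pose in skeleton i, i.e. P_i = P(i, j) @ P_j."""
    return P_i @ np.linalg.inv(P_j)
```

For instance, if a part is only translated between frames, the relative motion has an identity rotation block and the frame-to-frame displacement as its translation.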

Lie Group Representation of Skeleton Sequence.
Using the proposed skeletal feature, a skeleton sequence or an action is represented by F = {F(1), . . . , F(T)}, where T is the total number of frames in the sequence.

Lie Algebra Representation of Skeleton Sequence.
Since most classification methods (such as SVM) cannot be directly applied to manifolds, F(t) is mapped to its Lie algebra (so(3) × ⋅ ⋅ ⋅ × so(3), se(3) × ⋅ ⋅ ⋅ × se(3)). The Lie algebra of S(t) is obtained by applying the logarithm map of SO(3) to each of its rotations, and the Lie algebra of each motion P_n(i, j) is obtained by applying the logarithm map of SE(3). The Lie algebra representation of skeleton t is f(t) = {s(t), d(t)} = {s(t), p(t, t−1), p(t, 1)}. A human action can then be represented by the Lie algebra structure f = {f(1), . . . , f(T)}, where T is the total number of frames in the sequence and M is the total number of rigid bodies in a skeleton. Given a rigid body in skeleton t, the Lie algebra representation of its orientation is an 18-dimensional vector, so s(t) is an 18M-dimensional vector. d(t) = {p(t, t−1), p(t, 1)} is the Lie algebra representation of the dynamic feature of skeleton t, which is a (24M = 12M + 12M)-dimensional vector. f(t) = {s(t), d(t)} is the Lie algebra representation of skeleton t, which is a (42M = 18M + 24M)-dimensional vector. Hence, a human action can be seen as the temporal evolution of a 42M-dimensional vector.
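The SE(3) logarithm used to vectorize each motion can be sketched with numpy alone. Note that the paper's accounting uses 12 numbers per motion (possibly a flattened 3 × 4 matrix); the canonical 6-dimensional se(3) vector is shown here as an assumption, not the authors' exact parametrization:

```python
import numpy as np

def se3_log(P):
    """Map an element of SE(3) (a 4x4 homogeneous transform) to a
    6-dimensional vector in se(3): rotation part omega, translation part v."""
    R, t = P[:3, :3], P[:3, 3]
    theta = np.arccos(np.clip((np.trace(R) - 1.0) / 2.0, -1.0, 1.0))
    if theta < 1e-10:
        return np.concatenate([np.zeros(3), t])      # pure translation
    # so(3) logarithm of the rotation block, as a rotation vector w
    w = theta / (2.0 * np.sin(theta)) * np.array(
        [R[2, 1] - R[1, 2], R[0, 2] - R[2, 0], R[1, 0] - R[0, 1]])
    W = np.array([[0, -w[2], w[1]], [w[2], 0, -w[0]], [-w[1], w[0], 0]])
    # inverse of the left Jacobian recovers the translational component v
    V_inv = (np.eye(3) - 0.5 * W
             + (1.0 / theta**2
                - (1.0 + np.cos(theta)) / (2.0 * theta * np.sin(theta))) * (W @ W))
    return np.concatenate([w, V_inv @ t])
```

Stacking these 6-vectors over all rigid bodies and frames yields the flat feature vectors that a linear classifier can consume.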

Key-Skeleton-Pattern Mining.
In the previous subsections, a skeleton sequence is represented as the Lie algebra structure = { ( ), ∈ [1, ]}, where T is the total number of frames in a skeleton sequence. However, a skeleton sequence can include many noisy skeletons, which reduce the accuracy and efficiency of the action recognition. In this subsection, the key-skeleton-patterns are used to eliminate noisy skeletons in a skeleton sequence in order to capture the most informative skeleton sequences.

Formal Definitions.
To mine the key-skeleton-patterns, classic k-means is used to quantize all skeletons, represented by their Lie algebra vectors, into K symbols. Let V = {v_1, . . . , v_K} be the set of K symbols and C = {c_1, . . . , c_K} be the set of centroids. Then, a skeleton sequence can be represented as a symbol sequence L = [l_1, . . . , l_T] (l_t ∈ V, t ∈ [1, T]). Since different skeletons may be quantized as the same symbol, to minimize the effect of quantization errors, each skeleton f(t) in a sequence is also represented by a probability p_t, which measures the distance between the skeleton f(t) and the centroid of its assigned symbol; as Equation (31) shows, the probability is inversely proportional to this distance. Now, a skeleton sequence can also be represented by a probability sequence P = [p_1, . . . , p_T].
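The quantization-with-probability step might look like the following sketch. Since the exact form of Equation (31) is not reproduced here, a softmax over negative distances is used as one plausible inverse-distance weighting; this weighting, and the function name, are assumptions rather than the paper's formula:

```python
import numpy as np

def quantize_with_probability(skeletons, centroids):
    """Assign each skeleton (a Lie-algebra vector) to its nearest centroid
    (symbol) and attach a probability that decreases with the distance."""
    centroids = np.asarray(centroids, dtype=float)
    symbols, probs = [], []
    for x in skeletons:
        d = np.linalg.norm(centroids - np.asarray(x, dtype=float), axis=1)
        k = int(np.argmin(d))          # nearest symbol
        w = np.exp(-d)                 # larger distance -> smaller weight
        symbols.append(k)
        probs.append(float(w[k] / w.sum()))
    return symbols, probs
```

A skeleton far from all other centroids but close to its own receives a probability near 1, while an ambiguous skeleton near a cluster boundary receives a lower probability.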
Definitions. Some terms used in this paper are defined as follows (important notations are listed in Table 2).
Definition 2 (mining sequence). A mining sequence Q = {P, L, f} is used to mine the key-skeleton-patterns, where P is a probability sequence representing a skeleton sequence, L is the symbol sequence corresponding to P, and f is the skeleton sequence represented by the Lie algebraic structure.
Definition 6 (key-skeleton-pattern). Given a pattern m and a mining sequence dataset U of an action class, if the support of m in U is larger than a threshold δ and the probability of m in U is larger than a threshold τ, then m is called a key-skeleton-pattern of that action class. A key-skeleton-pattern of length n is called an n-pattern.

MinP-PrefixSpan Algorithm.
In this subsection, a new pattern-growth algorithm, called MinP-PrefixSpan, is proposed to mine the key-skeleton-patterns of an action class by searching over the enormous space of the symbol sequences and probability sequences of that class. The algorithm is shown in Algorithm 1. In Lines 2-8, the projected dataset of the current prefix is used to construct a new projected dataset. In Lines 9-11, if the grown pattern is a key-skeleton-pattern, it is appended to the result set and a symbol table is constructed from its projected dataset. In Line 12, MinP-PrefixSpan is called recursively to grow the key-skeleton-pattern until all key-skeleton-patterns are found.
In Line 10, the trim algorithm is used to improve the efficiency of the MinP-PrefixSpan algorithm by eliminating non-key-skeleton-patterns. Algorithm 2 shows the implementation details of the trim algorithm, which consists mainly of trimming rules. Given a mining sequence dataset of an action class and a pattern m (according to Definition 3, U|m is the m-projected dataset), the rules trim non-key-skeleton-patterns as follows: (1) if the support of m in U|m is not larger than the threshold δ, then pattern m is a non-key-skeleton-pattern.
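A PrefixSpan-style growth with a probability-based trim rule can be sketched as follows. The function name, the use of the accumulated matched probability as the pattern's score, and the thresholds `min_sup`/`min_prob` (playing the roles of the two thresholds in Definition 6) are assumptions; the MinP-specific pruning details are not fully specified in the text:

```python
def prefixspan_keyskeleton(sequences, probabilities, min_sup, min_prob, max_len=5):
    """Sketch of PrefixSpan-style pattern growth over symbol sequences paired
    with per-symbol probabilities. A pattern's support counts the sequences
    that contain it as a subsequence; its score sums the products of the
    matched symbols' probabilities over those sequences."""
    def project(db, symbol):
        # db holds (suffix, suffix_probs, accumulated_probability) triples;
        # keep the suffix after the first occurrence of `symbol`.
        out = []
        for seq, probs, acc in db:
            for i, s in enumerate(seq):
                if s == symbol:
                    out.append((seq[i + 1:], probs[i + 1:], acc * probs[i]))
                    break
        return out

    results = []
    def grow(prefix, db):
        if len(prefix) >= max_len:
            return
        candidates = {s for seq, _, _ in db for s in seq}
        for sym in sorted(candidates):
            proj = project(db, sym)
            score = sum(acc for _, _, acc in proj)
            if len(proj) >= min_sup and score >= min_prob:   # trim rule
                results.append(prefix + [sym])
                grow(prefix + [sym], proj)                   # recursive growth
    grow([], [(s, p, 1.0) for s, p in zip(sequences, probabilities)])
    return results
```

Patterns failing either threshold are pruned before growth, which is what keeps the number of new patterns per growth step small.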

Discovering the Most Informative Skeleton Sequence.
The task of Algorithm 3 is to discover the most informative skeleton sequences for all actions. Let U be a mining sequence dataset of all actions, K be a dataset used to store the key-skeleton-patterns of all action classes, and I be a dataset used to store the most informative skeleton sequences of all actions. In Lines 2-11, the key-skeleton-patterns of each action class are mined from the training dataset and appended to dataset K. In Lines 12-22, the key-skeleton-patterns in dataset K are employed to discover the most informative skeleton sequences of the actions in dataset U, and all of the most informative skeleton sequences are stored in dataset I (refer to Figure 5).
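The filtering step of Algorithm 3 (keeping only the skeletons at positions matched by some key-skeleton-pattern) can be sketched as below, under the assumption that a greedy left-to-right subsequence match marks the positions to keep:

```python
def informative_skeletons(symbol_seq, skeletons, key_patterns):
    """Keep only the skeletons whose positions are matched by some
    key-skeleton-pattern occurring as a subsequence of the action's symbol
    sequence; the unmatched skeletons are treated as noise."""
    keep = [False] * len(symbol_seq)
    for pattern in key_patterns:
        i, matched = 0, []
        for pos, s in enumerate(symbol_seq):
            if i < len(pattern) and s == pattern[i]:
                matched.append(pos)
                i += 1
        if i == len(pattern):               # pattern occurs as a subsequence
            for pos in matched:
                keep[pos] = True
    return [sk for pos, sk in enumerate(skeletons) if keep[pos]]
```

The surviving skeletons form the most informative skeleton sequence that is later fed to the classifier.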
Dynamic Time Warping (DTW) [33] performs excellently in searching for an optimal alignment between time sequences. Therefore, for each action class, our model uses the action standardization algorithm proposed by the authors of [5] to compute a nominal action and employs DTW to warp all training or testing actions onto this nominal action. SVMs are extensively used in computer vision, achieving excellent performance in image and video classification. To achieve better classification results, a linear SVM is used to classify the most informative skeleton sequences.
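The DTW alignment can be sketched with the classic dynamic program (a generic implementation, not the authors' code):

```python
import numpy as np

def dtw_distance(X, Y):
    """Classic dynamic time warping between two sequences of feature vectors,
    used here to align each action to the nominal action of its class."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(np.asarray(X[i - 1]) - np.asarray(Y[j - 1]))
            # extend the cheapest of the three admissible alignment moves
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]
```

The full algorithm additionally backtracks through D to recover the warping path used to resample each action onto the nominal one.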

Datasets.
In this study, three standard 3D human action datasets are employed to study the effectiveness of the proposed method.
The MSRAction3D dataset [34] was captured using a depth camera similar to the Kinect device. This dataset consists of 20 actions performed by 10 subjects, with each action having two or three repetitions. In total, there are 557 action sequences. The dataset provides the 3D locations of 20 joints. The horizontal and vertical locations of each skeleton joint are stored in screen coordinates, and the joint's depth is stored in global coordinates. The actions in this dataset cover various types of motions related to the arms, legs, torso, and their combinations. Experiments on this dataset are challenging, but the dataset is widely applied to test the accuracy and robustness of recognition methods for various actions.

Algorithm 3: InformativeSkeletonSequence(K, U)
Input: key-skeleton-pattern dataset K, mining sequence dataset U of all actions
Output: the dataset I of the most informative skeleton sequences
(1) obtain the training dataset from U
(2) K ← ∅; I ← ∅
(3) for the dataset U_c of each action class c ⊂ U do
(4)     T is the table that includes all symbols in U_c
(5)     for each symbol a ∈ T do
(6)         if the support of a in U_c ≥ δ and the probability of a in U_c ≥ τ then
(7)             construct the a-projected dataset U_c|a
(8)             append a to K
(9)             MinP-PrefixSpan(U_c|a, T, a)
(10)        end if
(11)    end for
(12) end for
(13) for each element d ∈ U do
(14)    patternpos ← a zero vector of length Len(d.L); s ← an empty sequence
(15)    for each key-skeleton-pattern m ∈ K do
(16)        if m is a subsequence of d.L then
(17)            for each position z of d.L matched by m do
(18)                patternpos[z] ← 1
(19)            end for
(20)        end if
(21)    end for
(22)    for z = 1 to Len(d.L) do
(23)        if patternpos[z] == 1 then
(24)            append d.LA[z] to s
(25)        end if
(26)    end for
(27)    append s to I
(28) end for
(29) return I

Figure 5: Red skeletons denote the most informative skeleton sequences discovered by the key-skeleton-patterns.
The UTKinect-Action dataset [19] was captured using a stationary Kinect sensor. It consists of 10 human actions from daily life: walking, sitting down, standing up, picking up, carrying, throwing, pushing, pulling, waving, and clapping hands. Each action is performed by 10 different subjects (nine males and one female) two or three times. In total, there are 199 action sequences. This dataset is very challenging. First, in some action sequences, parts of the human body are invisible because they are out of the field of view. Second, subjects performed the same action using different limbs, such as waving the left hand versus waving the right hand. Third, it is very difficult to capture the action sequences with invariance to the viewpoint.
Figure 6: Recognition rate on the MSRAction3D dataset based on AS1, AS2, and AS3.
The G3D dataset [23] consists of 663 sequences of 20 gaming actions captured by Kinect. Each actor performed each gaming action more than twice. Although the dataset provides synchronized video, depth, and skeleton data, only the skeleton data are used in our experiments. The dataset is challenging for two reasons: (1) if body parts are occluded, the Kinect device gives inferred results, which may reduce the accuracy of the action recognition; (2) if two different actions have very small interclass variations, the two actions may easily be confused with each other during recognition.

Experimental Results
Skeleton preprocessing is performed as follows. A human action is composed of a continuous evolution of a series of skeletons. To make each skeleton view-invariant, all 3D joint coordinates in the skeleton are transformed to a coordinate system that places the hip center at the origin. The entire skeleton is then rotated until the global x-axis is aligned with the ground-plane projection of the vector from the left hip to the right hip (refer to Figure 2(b)).
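The preprocessing described above might be implemented as follows. The joint indices and the choice of the y-axis as the vertical are assumptions that depend on the dataset's joint layout:

```python
import numpy as np

def normalize_skeleton(joints, hip_center, left_hip, right_hip):
    """View-invariant preprocessing sketch: translate the hip center to the
    origin, then rotate about the vertical (y) axis so that the ground-plane
    projection of the left-hip -> right-hip vector lies along the x-axis."""
    J = np.asarray(joints, dtype=float)
    J = J - J[hip_center]                   # hip center at the origin
    d = J[right_hip] - J[left_hip]
    angle = np.arctan2(d[2], d[0])          # angle of the projection in the x-z plane
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, 0.0, s],
                  [0.0, 1.0, 0.0],
                  [-s, 0.0, c]])            # rotation about y that undoes `angle`
    return J @ R.T
```

After normalization, the hip vector's ground-plane projection points along the global x-axis for every frame, so the representation no longer depends on the camera's viewing direction.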

Experiments on the MSRAction3D Dataset.
Following the experimental protocol of [4], the 20 actions in the MSRAction3D dataset are divided into three subsets AS1, AS2, and AS3, each including eight actions. AS1 and AS2 include actions with similar movements, while AS3 includes more complex actions. Half of the subjects are chosen for training, and the remaining subjects are used for testing. The experiment is run on ten different combinations of training and testing sets, and the mean performance is reported. Figure 6 shows that our approach outperforms various other representations. Our approach achieves a mean accuracy of 94.58% on the MSRAction3D dataset, outperforming other action recognition approaches, including Bag of 3D Points [4], Eigenjoints [29], and Lie group [5], which achieved accuracies of 74.7%, 83.3%, and 91.88%, respectively. Our approach performs better than the others both in distinguishing similar actions and in recognizing complex actions, mainly because the informative skeleton sequences, represented by the Lie group, are used to train the SVM classifiers.
Following the experimental protocol of [16], the dataset containing all actions is tested. This experimental setting is more challenging than that of [4]. Our approach achieves an accuracy of 97.4%, outperforming other related action recognition approaches, including the Grassmann manifold [16], graph-based model [20], and Lie group [5], which achieved accuracies of 91.21%, 92.2%, and 92.46%, respectively, as shown in Table 3. Figure 7 shows the classification confusion matrix on the whole MSRAction3D dataset. Most actions in the dataset can be correctly recognized by our approach, but classification errors occur when two actions are extremely similar, such as draw tick and the other drawing actions.
Matlab is used to run the experiments on a machine with a 3.60 GHz Intel Core i7-4790 CPU. The average testing time for one action sequence in the dataset is only 35.1 ms, which is lower than that of the Lie group approach (72.5 ms). The reason is that the skeleton feature dimension of our approach (798 dimensions) is lower than that of the Lie group (2052 dimensions). However, since the authors of the Grassmann manifold and graph-based model approaches have not released their source code, their average testing times cannot be obtained.
The average testing time for one action sequence in the dataset is 33.6 ms, which is lower than that of the Lie group approach (69.2 ms) but higher than that of learning feature combination (13.7 ms). The reason is that the skeleton feature dimension of our approach (798 dimensions) is lower than that of the Lie group (2052 dimensions) but higher than that of learning feature combination (256 dimensions). Unfortunately, the average testing times of the Grassmann manifold and Eigenjoints approaches cannot be obtained without their source code.

Table 3: Recognition rate on the MSRAction3D dataset based on the experimental protocol of [16].

Approach | Recognition accuracy (%)
Lie group [5] | 97.08
Grassmann manifold [16] | 97.91
Eigenjoints [29] | 97.1
Learning feature combination [30] | 98.00
Our approach | 98.87

Experiments on the G3D-Gaming Dataset.
The cross-subject test setting, in which half of the subjects are used for training and the remaining subjects for testing, is used to perform recognition on this dataset. Table 5 compares our approach with other approaches on the dataset (GB-RBM+HMM [31] and LieNet [22] use deep-learning methods to recognize human action). Our approach achieves a higher recognition rate. The average testing time for one action sequence in the dataset is only 34.8 ms, which is lower than that of the Lie group approach (71.2 ms) and the SO(3) approach (58.9 ms). The reason is that the skeleton feature dimension of our approach (798 dimensions) is lower than that of the Lie group (2052 dimensions) and SO(3) (1026 dimensions). However, since the authors of tLDS have not released their source code, its average testing time cannot be obtained. Moreover, because deep learning-based approaches usually use a GPU to accelerate their models while non-deep learning-based approaches usually run on a CPU, it is hard to implement a fair comparison between these two classes of approaches (our approach is a non-deep learning-based approach, whereas LieNet and GB-RBM+HMM are deep learning-based approaches).

Table 5: Recognition rate on the G3D-Gaming dataset.

Approach | Recognition accuracy (%)
SO(3) [14] | 87.95
Lie group [5] | 91.09
GB-RBM+HMM [31] | 86.40
LieNet [22] | 89.10
tLDS [32] | 90.60
Our approach | 95.01

Conclusion and Future Work
This paper proposes a new skeleton-based action representation, which consists of static and dynamic features. First, the orientation of a rigid body is regarded as six rotation matrices, and each rotation matrix is represented as a point in SO(3). The rigid-body orientations in a skeleton are used to construct the static feature in order to avoid dealing with skeletal scale variations. Second, the motions of rigid bodies, represented by SE(3), are used to construct the dynamic feature in order to capture the temporal information of the skeleton. Finally, based on the proposed representation, the key-skeleton-patterns are employed to discover the most informative skeleton sequences. The experimental results show that our approach achieves better performance than other state-of-the-art skeleton-based action recognition approaches. Further research should combine the Lie group with a linear dynamical system to model human actions as a tensor time series.