Optimizing Feature Quality: A Normalized Covariance Fusion Framework for Skeleton Action Recognition

Action recognition based on 3D skeleton sequences has gained considerable attention in recent years. Because they effectively represent the spatial and temporal characteristics of skeleton sequences, Covariance Matrix (CM) features combined with a Long Short-Term Memory (LSTM) network form an effective and reasonable roadmap for enhancing action recognition accuracy. However, the CM features in the existing recognition models are computed from the raw data without normalization or with static normalization. Moreover, a CM feature is calculated from all coordinates in one frame, treating the coordinates on the three axes identically and neglecting the relationships of the coordinates on the same axis. In this paper, an end-to-end deep learning framework is proposed that includes a normalization layer dynamically adapting to the data distribution and the inference procedure. After normalization, three covariance feature sequences, one per coordinate axis, are produced from the sliding windows and are fused into one fusion matrix using a convolution layer. Finally, the fusion matrix is sequentially fed into an LSTM network to recognize skeleton actions. The novelty of the proposed framework lies in combining adaptive preprocessing and feature fusion with the LSTM network, improving the recognition accuracy by optimizing the quality of the features rather than the network construction. In the experiments, the proposed framework is verified on public datasets and on one student action dataset collected from a real classroom. The experimental results demonstrate that the proposed method achieves a significant improvement in accuracy compared to the state-of-the-art methods. It can be concluded that the proposed framework can not only accurately capture the correlations of joints in the same frame but can also effectively express the dependencies of sequential frames.


I. INTRODUCTION
(The associate editor coordinating the review of this manuscript and approving it for publication was Claudio Cusano.)

Human action recognition has a wide range of applications, including video surveillance, human-machine interaction, interactive entertainment and multimedia information retrieval [1]. Nowadays, the prevalence of highly accurate and affordable depth devices has increased the use of depth videos for computer vision tasks in various applications. Action recognition based on 3D skeleton sequences has attracted widespread attention from both academia and industry. Compared to RGB videos, skeleton data is more robust to cluttered backgrounds and illumination changes, with human actions described as movements of the skeleton joints [2]. Recurrent Neural Networks (RNN) with Long Short-Term Memory (LSTM) units [3], [4] have been widely used to model the spatial-temporal information of skeleton sequences for action recognition. However, it is difficult to construct deep LSTM networks to learn high-level features of skeleton data. The Covariance Matrix (CM) can be considered a high-level feature. It is a general representation method that has been successfully applied to image, vision and text processing [5]. For skeleton data, the CM is computed on the coordinates of skeleton joints in a time-series manner. It has been proved that the CM combined with temporal features can effectively describe the spatial characteristics of action sequences. Combining the CM and the LSTM network is therefore an effective and reasonable roadmap to enhance the accuracy of action recognition. To date, several LSTM models with CM features as inputs have been proposed. However, all these models have two limitations, as follows. On the one hand, normalization is a data preprocessing technique that transforms data with attributes of different units into a uniform scale.
After normalization, all attributes become relatively independent and better conditioned for convergence. Due to different body shapes and deviations in the movements of different subjects, the performance of deep learning models may degrade rapidly if the skeleton data is not appropriately normalized. The traditional normalization methods include z-score, min-max and decimal scaling normalization [6]. Several deep learning normalization approaches are also available, including batch normalization [7], layer normalization [8] and group normalization [9]. The traditional normalization techniques can easily be applied to static data. However, it is still difficult for the existing approaches to handle evolving data, i.e., data whose value ranges and classes change continuously. Although the performance can be slightly improved through a deep learning scheme, the space complexity is usually high. Moreover, for tasks such as action recognition based on skeletal data, some important information may be lost when normalizing in the traditional manner [10]. Therefore, to dynamically normalize the skeleton data, this paper proposes a dynamic normalization module that is capable of normalizing the data adaptively during inference according to the distribution of the measurements of the current skeletal data. On the other hand, the traditional methods use all coordinates of all joints in one frame to compute the CM feature, which is not reasonable: it is meaningless to compare the x coordinate of one joint with the y coordinate of another joint. Paying more attention to the spatial and temporal relationships between the coordinates from the same axis may uncover more valuable information for motion recognition. To encode the temporal dependency between the same axes within and across frames, each skeleton sequence is transformed into three parts in the proposed framework, and each part consists of one axis of the skeleton sequence data.
Then all three skeleton sequences are fed into a sliding window with variable window length. The advantage of the proposed novel representation is that deep learning models can be leveraged to learn fine-grained intra- and inter-frame features. In general, the limitations of the existing methods are as follows. Firstly, CM features are computed from raw data without normalization, or with static normalization. Secondly, the existing methods mainly compute the CM feature of all coordinates in one frame and neglect the relationships of the coordinates on the same axis. The main contributions of this paper are summarized as follows:
1) A dynamic normalization module is proposed that learns to normalize the data according to their distribution instead of using fixed normalization schemes.
2) A covariance fusion module (CFM) is proposed to learn fine-grained intra- and inter-frame features. The CFM first generates the CM features on the three axes through the sliding window, which ensures the nonsingularity of the CMs. Then, the three CMs are fused using a convolution operation, and a fusion CM is generated as the input of the LSTM network.
3) A lightweight strong baseline is developed on several datasets, which is more powerful than most of the state-of-the-art methods.
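The axis decomposition described above, one part per coordinate axis, can be sketched as follows. This is a minimal NumPy sketch; the function name and the assumed (T, J, 3) sequence layout are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def split_axes(seq):
    """Split a skeleton sequence of shape (T, J, 3) -- T frames, J joints,
    3D coordinates -- into three per-axis parts of shape (T, J)."""
    return seq[..., 0], seq[..., 1], seq[..., 2]

# Toy sequence: 4 frames, 5 joints, 3D coordinates.
seq = np.arange(4 * 5 * 3, dtype=float).reshape(4, 5, 3)
x, y, z = split_axes(seq)
```

Each of the three parts is then processed by its own sliding window, so the covariance features never mix coordinates from different axes.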
The rest of the paper is organized as follows. Section II reviews the related works. Section III elaborates the normalized covariance fusion framework, which includes the normalization layer design, the CM fusion layer design and the loss function design. Section IV introduces the utilized datasets and the experimental results. Finally, Section V concludes the article.

II. RELATED WORKS
This section briefly reviews the relevant literature on normalization, data fusion and skeleton-based action recognition methods using deep learning networks.

A. NORMALIZATION ON SKELETON DATA
In 3D skeleton-based action classification, an action is described as a collection of time series of the 3D positions of the joints in the skeleton. Generally, three normalization paradigms have been investigated in skeleton-based action classification. First, recognition by indicating the joint angles. This representation depends on the selection of the reference coordinate system, which may differ according to the recording environment and biometric differences. To deal with these issues, a collection of coordinate system transform methods have been proposed. The authors of [11] considered the joint angles between any two connected limbs and represented an action as a time series of joint angles. Pham et al. calculated angles between the joint positions for normalization. Dynamic Time Warping (DTW) was used for classification and resulted in an accuracy of over 90% for the recognition of five gestures such as hand waves and greetings [12]. In [13], data has been initially registered into a common coordinate system to make the joint coordinates comparable to each other. Second, recognition by setting the skeleton to a fixed position. In [2], human poses have been normalized by aligning the torsos and the shoulders. In [14] and [15], all the 3D joint coordinates have been transformed from the world coordinate system to a person-centric coordinate system placed at the hip center. In [16], the skeletons have been aligned based on the head location. Next, all the other coordinates in the skeleton were normalized by the head length [17]. Gaglio et al. [18] and Cippitelli et al. [19] normalized the 3D skeleton data by translating the positions to a new coordinate system fixed at the torso. They achieved accuracy rates above 90% using Support Vector Machines (SVMs) on the KARD dataset, which contains highly distinguishable gestures such as hand waves and side kicks. Laraba et al.
[20] excluded the arm joints from the dataset and normalized the other joint positions by setting the distances to the pelvis joint. They recognized dance steps with an accuracy above 70%. Third, transforming the data into a range. In [5], the coordinates of the 3D joints have been normalized to a range of [0, 1] in all the dimensions over the sequence. Each coordinate of the normalized pose vector has been smoothed across time with a 5-by-1 Gaussian filter. The poses were normalized to compensate for biometric differences. A reference skeleton was learned from the training data. The length of the skeleton limbs in the reference skeleton was adjusted to unit norm. All the skeletons in the test and the training data were transformed to impose the same limb segment lengths as in the reference skeleton while preserving the direction vectors. A similar normalization has been applied in [21], where inter-frame linear interpolation has also been applied to account for missing values in the skeletal data.

B. DATA FUSION METHODS
Data fusion has a long history and is of great significance in mining the value of data. Early data fusion methods transformed the data into a single feature-based representation and treated the transformed data as a single dataset. Nowadays, the common data fusion methods study the value of data from different perspectives. For example, the simplest data fusion method is to directly combine two one-dimensional datasets that have the same meaning in that dimension. Moreover, data fusion can be performed from the perspective of extracting features of different dimensions. Different data fusion methods can provide different optimization results for a machine learning model. The representative deep learning architectures of data fusion models include the deep belief net, the stacked autoencoder, the convolutional neural network and the recurrent neural network. The restricted Boltzmann machine is the basic block of the deep belief net [22]. Srivastava and Salakhutdinov [23] proposed a multimodal generative model based on the deep Boltzmann machine and learned multimodal representations by fitting the joint distributions of multimodal data over various modalities, such as image, text, and audio. Suk et al. and the Alzheimer's Disease Neuroimaging Initiative [24] proposed a multimodal Boltzmann model that could fuse complementary knowledge from multimodal data to effectively diagnose Alzheimer's disease at an early stage. To accurately estimate human poses, Ouyang et al. [25] designed a multisource deep learning model that learned a multimodal representation from mixture type, appearance score, and deformation modalities by extracting the joint distribution of the body pattern in a high-order space. To generate visually and semantically effective human skeletons from a series of images, especially videos, Hong et al. [26] proposed a multimodal deep autoencoder to capture the fusion relationship between the images and the poses. Wang et al.
[27] designed a multimodal stacked autoencoder for feature learning of words, in which association and gating mechanisms were adopted to improve the word features. Khattar et al. [28] designed a multimodal variational framework based on an encoder-decoder architecture. To model the semantic mapping distribution between images and sentences, Ma et al. [29] proposed a multimodal convolutional neural network. In order to fully capture the semantic correlations, a three-level fusion strategy, covering the word level, the phrase level, and the sentence level, has been proposed. Frome et al. [30] presented a multimodal convolutional neural network that leverages the semantic information from text data to scale the visual recognition system to an unlimited number of discrete categories. There are also some newer RNN-based multimodal deep learning methods. For example, Abdulnabi et al. [31] designed a multimodal RNN to label indoor scenes, in which the intra-modality and cross-modality features are learned by the RNN and the transform layers. Narayanan et al. [32] designed a gated recurrent cell with multimodal sensor data to model driver behaviors. Sano et al. [33] proposed a multimodal BiLSTM to detect ambulatory sleep, in which the BiLSTM was used to extract features from data collected by wearable devices. Each intra-modality feature was then concatenated by a fully connected network. The methods of producing covariance descriptors for classification tasks have been intensively studied. In particular, [34] proposed patch-specific covariance descriptors, efficiently computed with integral images. Relying on systematically encoding mutual relationships inside the data, covariance descriptors have been applied to many different applications such as face recognition [35], person identification [36] and more general classification tasks [37].
Furthermore, covariance features have been proposed to measure the similarities across data samples [38]. This latter direction is grounded in the mathematical properties of positive definite matrices, exploiting Riemannian metrics on the manifold for image classification. Once moved from a finite- to an infinite-dimensional space, the performance improves [39], [40], and only recently have deep learning approaches been shown to be superior. However, one of the main limitations of the covariance matrix is that it only captures linear inter-relationships [41]. For instance, principal component analysis exploits a covariance matrix to remove the linear correlation of data points [42]. Among the attempts to model more complicated relationships, additional statistics, such as entropy and mutual information [43], as well as kernels [44], have been adopted. A different approach is to model non-linear behaviors by applying a preliminary preprocessing step and encoding the raw data with a transformation that increases the feature space. Such an approach has been used with spatial and temporal derivatives for gesture recognition in [45], with different color spaces and edge detectors for image classification, and with filter bank responses as features to estimate head orientation.

C. SKELETON-BASED ACTION RECOGNITION
The RNNs with LSTM units have been used to model the spatial-temporal information of the skeleton sequences for action recognition. Du et al. [46] divided the skeleton joints into five sets, and fed them into five LSTMs for feature fusion and classification. Du et al. [47] used random rotation and scale transformations during the training process to improve the robustness of the models.

FIGURE 1. Framework of the proposed end-to-end model. Given a skeleton sequence with T_i frames, the skeleton sequence is adaptively normalized by a Dynamic Normalization Module. Three axis covariance matrices corresponding to the 3D coordinates of the skeleton joints are generated with a sliding window. After that, a Covariance Fusion Module is used to fuse the three covariance matrices, and each feature matrix represents the temporal information of the skeleton sequence and a particular spatial relationship between the skeleton joints. Finally, the LSTM networks are used to classify the action.

Shahroudy et al. [48] grouped
the skeleton joints into five parts and fed them to a part-aware LSTM, which consisted of five sets of input and forget gates for the five body parts. Veeriah et al. [49] designed a differential RNN (dRNN) model and computed different orders of the Derivative of States to capture the salient motions of actions. Zhu et al. [50] fed the skeleton joints to a deep LSTM at each time slot to learn the inherent co-occurrence features of the skeleton joints. Liu et al. [51] designed a spatial-temporal LSTM with a Trust Gate to jointly learn both the spatial and the temporal information of skeleton sequences and to automatically remove the noisy joints. Although the LSTM networks are designed to explore long-term temporal dependencies, it is still difficult for the LSTM to memorize the information of an entire sequence with many time steps [49], [52]. Furthermore, it is also difficult to construct a deep LSTM to extract high-level features [53], [54]. In contrast to the LSTM, the CNNs can extract such high-level features, but cannot model the long-term temporal dependency of an entire video [55].

III. NORMALIZED COVARIANCE FUSION FRAMEWORK
Fig. 1 shows the overall architecture of the proposed method. A skeleton sequence is first dynamically normalized. The three axis covariance matrices corresponding to the 3D coordinates of the skeleton joints are generated with a sliding window. After that, a Covariance Fusion Module is used to fuse the three covariance matrices. The fused matrices are then fed to an LSTM network for action recognition. This section describes the details of the proposed framework.

A. DYNAMIC NORMALIZATION MODULE DESIGN
As shown in Fig. 2, let X_i ∈ R^{T_i×d}, i = 1, 2, ..., N, be a collection of N skeletal sequences, each composed of T_i d-dimensional measurements (or features). The notation x^i_j ∈ R^d, j = 1, 2, ..., T_i, is used to refer to the d features observed at time point j in skeletal sequence i. The target of the proposed method is to learn to normalize the measurements x^i_j using the following formula:

x̃^i_j = (x^i_j − µ) ⊘ σ, (1)

where ⊘ is the Hadamard division operator. Hadamard division is defined element-wise as:

a ⊘ b = [a_1/b_1, a_2/b_2, ..., a_d/b_d]^T. (2)

The global z-score normalization uses

µ_k = (1 / Σ_{i=1}^{N} T_i) Σ_{i=1}^{N} Σ_{j=1}^{T_i} x^i_{j,k}, (3)

σ_k = sqrt( (1 / Σ_{i=1}^{N} T_i) Σ_{i=1}^{N} Σ_{j=1}^{T_i} (x^i_{j,k} − µ_k)^2 ), (4)

where µ_k and σ_k refer to the global average and the standard deviation of the k-th feature, respectively. As discussed above, µ and σ might not be optimal for normalizing every possible measurement vector. Therefore, this work proposes to normalize each skeletal sequence so that µ and σ are learned and depend on the current input of each sliding window, instead of being global averages calculated over the entire dataset. The initial estimate of the mean of the current skeletal sequence X_i is

ā_i = (1/T_i) Σ_{j=1}^{T_i} x^i_j. (5)
A matrix W_µ ∈ R^{d×d} of learnable model weight parameters is defined, and the dynamic mean of the current skeletal sequence is computed as

a_i = W_µ ā_i, (6)

where a_i ∈ R^d denotes the dynamic mean of the current skeletal sequence data. The magnitude of the variation around the adaptive mean of the skeletal sequence data can be calculated as

b̄_i = sqrt( (1/T_i) Σ_{j=1}^{T_i} (x^i_j − a_i)^2 ). (7)
The dynamic standard deviation is then obtained with a second weight matrix W_σ ∈ R^{d×d}:

b_i = W_σ b̄_i. (8)
For x^i_j, adaptive regularization can be performed as

x̃^i_j = (x^i_j − a_i) ⊘ b_i. (9)
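The adaptive normalization of equations (5)-(9) can be sketched as follows. This is a minimal NumPy sketch under stated assumptions: the parameter names `W_mu` and `W_sigma` are hypothetical, the weights are initialized to identity for illustration, and the small `eps` added for numerical stability is an implementation detail not stated in the text:

```python
import numpy as np

def dynamic_normalize(X, W_mu, W_sigma, eps=1e-8):
    """Adaptively normalize one skeletal sequence X of shape (T, d).

    Per-sequence statistics are estimated from X itself and re-weighted
    by learned matrices (hypothetical names W_mu, W_sigma), so the shift
    and scale adapt to the current input rather than to global dataset
    statistics."""
    a_bar = X.mean(axis=0)                        # initial mean estimate, eq. (5)
    a = W_mu @ a_bar                              # dynamic mean, eq. (6)
    b_bar = np.sqrt(((X - a) ** 2).mean(axis=0))  # variation magnitude, eq. (7)
    b = W_sigma @ b_bar                           # dynamic standard deviation, eq. (8)
    return (X - a) / (b + eps)                    # adaptive regularization, eq. (9)

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 6))  # one sequence: 100 frames, d = 6
# With identity weights, the module reduces to per-sequence z-scoring.
Z = dynamic_normalize(X, np.eye(6), np.eye(6))
```

In training, W_µ and W_σ would be learned jointly with the rest of the network, letting the module deviate from plain z-scoring when the data distribution calls for it.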

B. SLIDING WINDOW DESIGN FOR CM FEATURE
In order to capture the temporal characteristics of each action, an LSTM network is used in this paper. Although LSTM networks are designed to explore long-term temporal dependencies, it is still difficult for the LSTM to memorize the information of an entire sequence with many time steps. Moreover, it is also difficult to construct a deep LSTM to extract high-level features. For most datasets, the skeletal sequences of the different actions are of varying lengths: some are short and others are long. Training on data with these inconsistent lengths may cause the model's behavior to deviate significantly between different actions.
To overcome these problems, a sliding window is used to unify the lengths of the sequences. Given T_i feature vectors x^i_j ∈ R^{3d}, j = 1, 2, ..., T_i, representing a sequence of T_i frames, a descriptor for the action in the sequence can be constructed. It has a fixed length, regardless of the number of frames T_i. To capture the temporal information, which is crucial in action recognition, a temporal pyramid construction is adopted. In particular, a sequence of T_i frames is divided into possibly overlapping sub-sequences of lengths d, d/2, d/4, .... In general, the l-th level of the pyramid contains sub-sequences of length d/2^{l−1} frames, where d is the feature dimension. The consecutive sub-sequences in a level are separated by a constant step s. For both the non-overlapping and the overlapping modes, one issue to be considered is that some datasets have only a few data frames for an action: either fewer than the length of the sub-sequence (sliding window), or the frames remaining after the last full slide are fewer than the length of the sub-sequence. The method used in the proposed framework is circular selection. As shown in Fig. 3, when the length of the last sub-sequence is T_i − (T_i − d + m + 1) + 1 = d − m, the last sub-sequence is completed by picking another m frames from the beginning. Specifically, if the step size is s, the window size is D and the sequence length is T_i, then the number of sub-sequences generated is

W_i = ⌈(T_i − D)/s⌉ + 1. (10)
For example, if the sequence length is 10, the step size is 2 and the window size is 3, then ⌈(10 − 3)/2⌉ + 1 = 4 + 1 = 5 sub-sequences can be generated.
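The circular selection above can be sketched as follows. This is a minimal NumPy sketch; the function name and the exact wrapping policy (only the last window wraps to the beginning) are illustrative assumptions:

```python
import math
import numpy as np

def sliding_windows(seq, D, s):
    """Split a sequence of T frames into sub-sequences of length D with
    step s; when the last window runs past the end of the sequence, the
    missing frames are taken from the beginning (circular selection)."""
    T = len(seq)
    windows, start = [], 0
    while True:
        idx = [(start + k) % T for k in range(D)]  # wrap past the end
        windows.append(seq[idx])
        if start + D >= T:   # last (possibly wrapped) window reached
            break
        start += s
    return np.stack(windows)

seq = np.arange(10)               # T = 10 frames
W = sliding_windows(seq, D=3, s=2)
```

The worked example in the text is reproduced exactly: 10 frames, window 3, step 2 yields 5 sub-sequences, with the last one wrapping around to frame 0, matching eq. (10).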

C. COVARIANCE FUSION MODULE DESIGN
Next, the covariance matrix of the skeleton joint locations over time is used as a discriminative descriptor for a sequence. The descriptor has a fixed length independent of the length of the original sequence. As shown in Fig. 4, the descriptor is constructed by computing the covariance matrix on the coordinates of the body skeleton joints over time. In order to encode the short-term temporal dependency of joint locations from different views, multiple covariance matrices are used, each covering a sliding window of the normalized input sequence on one axis. Let O = [o_1, o_2, ..., o_d], P = [p_1, p_2, ..., p_d] and Q = [q_1, q_2, ..., q_d] represent the sequences of skeletal point values on the three axes at a given frame, respectively. Since the distributions of O, P, Q are usually unknown, the sample covariance is used:

C(S) = (1/(n − 1)) Σ_{t=1}^{n} (s_t − s̄)(s_t − s̄)^T, (11)

where S = [s_1, s_2, ..., s_n] represents the values of the skeleton sequence on the same axis and s̄ is their sample mean. The covariance for the data frames from t_1 + 1 to t_2 can be obtained as

C_{t_1:t_2} = (1/(t_2 − t_1 − 1)) Σ_{t=t_1+1}^{t_2} (s_t − s̄)(s_t − s̄)^T, (12)

where s̄ here denotes the sample mean over the frames t_1 + 1, ..., t_2. A fusion function f : (o_t, p_t, q_t) → z_t fuses the three axis covariance matrices o_t ∈ R^{d×d}, p_t ∈ R^{d×d} and q_t ∈ R^{d×d} at time t to produce an output map z_t ∈ R^{d×d}, where d is the number of skeleton points. When applied to feedforward convolutional network architectures, consisting of convolutional, fully-connected, pooling and nonlinearity layers, f can be applied at different points in the network to implement various fusion functions. In this paper, convolution fusion is used. z_t = f(o_t, p_t, q_t) first stacks the three feature maps at the same spatial location (i, j) across the feature channels, as shown in Fig. 5,
producing z_stack ∈ R^{d×d×3}. Then, the stacked data is convolved with a bank of filters H and biases b_stack as:

z_t = z_stack ∗ H + b_stack. (13)

Here, the filter H is used to model the weighted combinations of the three feature maps o_t, p_t, q_t at the same spatial location. When used as a trainable filter kernel in the network, H can learn the correspondences between the three feature maps that minimize a joint loss function. Given that there should be no significant relationship between most pairs of different skeletal points, most of the values between the skeleton points should be 0 or close to 0. In order to filter out this small-valued information, the filter H is designed to extract the locally available information for integration.
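The per-axis covariance computation of eq. (11) and the convolution fusion can be sketched as follows. This is a minimal NumPy sketch assuming the simplest 1×1 filter case (a weighted sum across the three channels); the actual fusion layer may use larger kernels to extract local information, and the filter weights shown are hypothetical:

```python
import numpy as np

def axis_covariances(window):
    """window: array of shape (n, d, 3) -- n frames, d joints, xyz axes.
    Returns three d x d sample covariance matrices, one per axis."""
    covs = []
    for axis in range(3):
        S = window[:, :, axis]               # (n, d) joint values on one axis
        covs.append(np.cov(S, rowvar=False)) # sample covariance, eq. (11)
    return covs

def fuse(covs, H, b=0.0):
    """Convolution-style fusion with a 1x1 filter H of shape (3,):
    a weighted combination of the three stacked covariance maps at each
    spatial location (i, j), producing one d x d fusion matrix."""
    z_stack = np.stack(covs, axis=-1)        # (d, d, 3)
    return z_stack @ H + b                   # (d, d)

rng = np.random.default_rng(1)
window = rng.normal(size=(16, 5, 3))         # 16 frames, 5 joints
o, p, q = axis_covariances(window)
H = np.array([1/3, 1/3, 1/3])                # hypothetical filter weights
Z = fuse((o, p, q), H)
```

With uniform weights the fusion reduces to averaging the three axis covariances; a trained H would instead learn which axis correlations matter at each location.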
The LSTM represents an improved RNN architecture, which mitigates the vanishing gradient effect of the RNN. An LSTM neuron contains a memory cell c_t and is updated as

f_t = σ_g(W_f x_t + U_f h_{t−1} + b_f),
i_t = σ_g(W_i x_t + U_i h_{t−1} + b_i),
o_t = σ_g(W_o x_t + U_o h_{t−1} + b_o),
c̃_t = σ_c(W_c x_t + U_c h_{t−1} + b_c),
c_t = f_t ⊗ c_{t−1} + i_t ⊗ c̃_t,
h_t = o_t ⊗ σ_h(c_t),

where the initial values are c_0 = 0 and h_0 = 0, and the operator ⊗ denotes the Hadamard product (element-wise product).
The subscript t indexes the time step. The input vector to the LSTM unit is x_t. The activation vectors of the forget, input and output gates are f_t, i_t ∈ R^h and o_t ∈ R^h, respectively. The hidden state vector, also known as the output vector of the LSTM unit, is h_t ∈ R^h. c̃_t ∈ R^h is the cell input activation vector and c_t ∈ R^h is the cell state vector. W ∈ R^{h×d}, U ∈ R^{h×h} and b ∈ R^h refer to the weight matrices and bias vector parameters that need to be learned during training. The superscripts d and h refer to the numbers of input features and hidden units, respectively. σ_g is the sigmoid function; σ_c and σ_h are hyperbolic tangent functions.

D. LOSS FUNCTION DESIGN
A straightforward solution to compensate for the position of the skeleton is to center the coordinate space at one skeleton joint. Considering a skeleton composed of P joints, with G_0 and G_i being the coordinates of the torso and the i-th joint, the i-th joint feature j_i is the distance vector between G_0 and G_i, i.e., j_i = G_i − G_0.
These features may be seen as a set of distance vectors connecting each joint to the torso. A posture feature vector F is created for each skeleton frame and can be expressed as F = [j_1, j_2, ..., j_P]. In order to find the optimal solution, it is essential to measure the quality of a solution. This is done using an objective function. For each frame of the sequence, the distance from the reference skeleton is calculated for each skeleton point.
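The torso-centered posture features described above can be sketched as follows. This is a minimal NumPy sketch; `posture_features` is a hypothetical helper name, and the torso is assumed to be joint 0:

```python
import numpy as np

def posture_features(skeleton, torso=0):
    """skeleton: array of shape (P, 3), joint coordinates for one frame.
    Returns the P distance vectors j_i = G_i - G_0 connecting the torso
    joint G_0 to every joint G_i, flattened into one posture vector F."""
    J = skeleton - skeleton[torso]   # j_i = G_i - G_0
    return J.reshape(-1)             # F = [j_1, ..., j_P]

frame = np.array([[1.0, 1.0, 1.0],   # torso G_0
                  [2.0, 1.0, 1.0],   # joint G_1: offset (1, 0, 0)
                  [1.0, 3.0, 1.0]])  # joint G_2: offset (0, 2, 0)
F = posture_features(frame)
```

Because every vector is taken relative to the torso, F is invariant to the absolute position of the subject in the room, which is the stated purpose of the centering.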
Equation (25) is the objective function of the dynamic normalization part, where j^r_q is the distance between the q-th skeleton joint of the raw data and the reference skeleton, and j^n_q is the distance between the q-th skeleton joint of the normalized data and the reference skeleton. For each classification result, the information entropy is calculated according to (26).
This part of the objective function is defined to ensure the validity of the dynamic normalization. The final objective function is a weighted combination of these two error terms and is defined as (27),
where ω_1 and ω_2 are the weights of the two parts of the objective function.

IV. EXPERIMENTAL RESULTS
In order to examine the effectiveness of the proposed method, the MSR-Action3D, UTKinect-Action3D, Florence 3D Action, UTD-MHAD, NTU RGB+D, SBU Kinect Interaction, PKU-MMD, Kinetics Human Action Video, G3D and Classroom datasets are used. The details of these datasets are given in the next subsection. The experiments are conducted on the PyTorch and TensorFlow platforms with NVIDIA GeForce RTX 2070 GPU cards.

A. DATASETS
The MSR-Action3D dataset (MSR) provides the 3D positions of 20 skeletal joints and includes 20 types of actions. Each action is performed three times by ten subjects, and there are 557 action sequences. The UTKinect-Action3D dataset consists of 20 skeletal joints and includes ten types of actions. Each action is performed twice by ten subjects, and there are a total of 199 action sequences. The Florence 3D Action dataset (Florence) consists of 15 skeletal joints and includes nine types of actions. Each action is performed two to three times by ten subjects, and there are 215 action sequences in total. The UTD-MHAD dataset (UTD) consists of 20 skeletal joints and contains 27 types of actions. Each action is performed four times by eight subjects. The NTU RGB+D dataset (NTU) is so far the largest skeleton-based human action recognition dataset. It contains 56,880 skeleton sequences and 60 annotated action classes. There are two recommended evaluation protocols: Cross-Subject (CS) and Cross-View (CV). In the cross-subject setting, the sequences of 20 subjects are used for training and the sequences of the remaining 20 subjects are used for validation. In the cross-view setting, the samples are split by camera view: the samples from two camera views are used for training and the rest are used for testing. The G3D dataset was recorded with a Kinect, which enabled synchronized video, depth and skeleton data to be captured. It contains 10 subjects performing 20 gaming actions: punch right, punch left, kick right, kick left, defend, golf swing, tennis swing forehand, tennis swing backhand, tennis serve, throw a bowling ball, aim and fire a gun, walk, run, jump, climb, crouch, steer a car, wave, flap and clap. The 20 gaming actions were recorded in 7 action sequences. Most sequences contain multiple actions in a controlled indoor environment with a fixed camera, a typical setup for gesture-based gaming.
In order to test the effect of the proposed method in real scenes, data on the actions of students in a classroom was collected, yielding the Classroom dataset, as shown in Fig. 6. The Classroom dataset provides the 3D positions of 25 skeletal joints and includes 7 actions. Each action was performed 3 times and lasted about 13 to 15 seconds. There were 10 experimental subjects.

B. EFFECTIVENESS OF COVARIANCE FUSION MODULE AND DYNAMIC NORMALIZATION MODULE
A covariance matrix encodes the spatial information of a skeleton sequence, which is important for skeleton-based action recognition. In order to demonstrate the contribution of each component, several model variants (Table 2) are investigated to validate the effectiveness of the covariance fusion module (CFM), the fusion layer (FL) and the dynamic normalization module (DNM). The three main observations are as follows.
1) It is difficult to construct deep LSTM networks to learn high-level features. To learn the spatial and temporal information of a skeleton sequence, the covariance matrix of the skeleton sequence is used. Given the different sequence lengths of different actions, a sliding window is used to split the long sequences of different lengths into fixed-length sub-sequences. Compared with the LSTM network without the covariance matrix and sliding window (Column 1 in Table 2), the accuracy is improved by about 3% to 15% (Column 2 in Table 2). The effectiveness of using the covariance matrix and sliding window is thus verified.
2) In the CFM, the skeletal sequence is decomposed along the 3 axes to compute the covariance matrices. On their own, the matrices on a single axis do not contain much information that can be used for action recognition. To efficiently fuse the locally valid information on the 3 axes, the FL (Column 3 in Table 2) is used. The FL improves the accuracy of the overall model by about 15% to 30% compared with the method without the FL.
3) Data normalization is a fundamental preprocessing step for mining and learning from data. Conventional normalization may cause a loss of usable information within a sliding window. Therefore, to enhance the characteristics of the actions in the window, the DNM is used to adjust the normalization strategy through an adaptive approach. The experiments show that the DNM is effective and improves the accuracy by about 4% to 27%.
In summary, the explicit modeling of the covariance matrix, fusion, dynamic normalization and the LSTM network benefits the learning of spatial and temporal information. This enables the model to efficiently exploit the information of the sequence order.

C. SLIDING WINDOW SIZE
The rank of the covariance matrix obeys the rule rank(C) ≤ min(d, n − 1), where d is the feature dimension and n is the number of samples. When C is used as a region descriptor, n is the number of feature vectors extracted from an image region; n is usually much larger than the dimension d, which ensures that C is nonsingular and can be reliably estimated [44]. To examine the effect of different sliding window lengths on the proposed model, experiments were conducted with different window lengths, and the results are shown in Table 3 and Fig. 8.
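The rank rule can be checked numerically. The sketch below (with illustrative values d = 15 joints per axis and a few window lengths n, both assumptions for demonstration) shows that the sample covariance is rank-deficient, hence singular, whenever the window holds fewer than d + 1 frames:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 15                          # variables per axis (e.g. number of joints)
for n in (5, 10, 16, 40):       # frames in the sliding window
    X = rng.standard_normal((n, d))
    C = np.cov(X, rowvar=False)            # d x d covariance from n samples
    r = np.linalg.matrix_rank(C)
    assert r <= min(d, n - 1)              # the rank rule from the text
    print(f"n={n:3d}  rank(C)={r}")
```

For n ≤ d the matrix cannot reach full rank d, which is why the window must be long enough for the covariance features to be well conditioned.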
The proposed method models the correlations of the joints within a sliding window of size d. To study this choice, the performance of the model is compared across different window sizes in Table 3; the classification accuracy clearly changes with the window size. The models in Columns 1 and 2 of Table 3, whose window sizes are smaller than d, are less accurate than the model with a window size of d. This occurs because when the window is too short, useful information for action recognition is lost. For example, for the action of standing up from a chair, the motion of leaving the chair may be cut into two separate sub-sequences, neither of which can determine the current action on its own, making recognition more difficult. Columns 4 and 5 in Table 3 correspond to window sizes greater than d, and their accuracy also decreases rather than increases. This is presumably because a larger window reduces the number of sub-sequences generated from skeleton sequences of the same length, i.e., it reduces the total amount of training data. At the same time, a longer window can include interfering frames: for the action of eating, the sub-sequence of bringing the food to the mouth is the most informative, while the earlier hand-raising motion acts as interference in its recognition.
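The training-data argument above is easy to quantify: for a fixed sequence length, the number of sub-sequences shrinks as the window grows. The sketch below assumes a half-overlap stride (stride = window // 2), which is an assumption of this illustration and not stated in the paper:

```python
def num_windows(seq_len, win, stride=None):
    """Number of sliding windows over a sequence (stride defaults to win // 2)."""
    stride = stride or max(1, win // 2)
    if seq_len < win:
        return 0
    return (seq_len - win) // stride + 1

# A 160-frame skeleton sequence under increasing window sizes:
for w in (10, 20, 40, 80):
    print(w, num_windows(160, w))   # 31, 15, 7, 3 sub-sequences
```

Quadrupling the window size from 20 to 80 frames cuts the number of training sub-sequences from 15 to 3, consistent with the accuracy drop observed for oversized windows.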

D. COMPARISON WITH THE STATE-OF-THE-ARTS
The performance of the proposed method is compared with other state-of-the-art methods on the NTU RGB+D, Florence, UTD, UTK, MSR, Game, Kinetics-Skeleton, Classroom, PKU-MMD and SBU datasets. The comparative results are shown in Table 4 and Fig. 9.
As shown in Table 4 and Fig. 9, the proposed method brings accuracy improvements of 1.95%, 1.10% and 6.33% on the Florence, MSR and Classroom datasets, respectively. Co-LSTM, STA-LSTM and IndRNN are RNN-based methods; compared with these RNN-based models, the accuracy of the proposed model is improved on almost all datasets. To better exploit the structural information of the skeleton, some models use GCNs, currently the most highly regarded approach in skeleton-data-based action recognition. To further investigate the performance of the model proposed in this paper, experiments were conducted on the large-scale NTU RGB+D 120 dataset. The highest accuracy on this dataset is achieved by the GCN-based model PA-ResGCN-B19, with 87.3% (CROSS-SUBJECT).
The best RNN-based model, Logsig-RNN, achieves an accuracy of 68.3% (CROSS-SUBJECT), while the model in this paper achieves 81.4% (CROSS-SUBJECT) on this dataset. Compared with the GCN-based models, the proposed model does not perform better on all datasets; however, the accuracy gap remains small, which still validates the effectiveness of the proposed model.

V. CONCLUSION
In this work, an end-to-end framework for skeleton-based action recognition is presented. To model the correlations of joints, the covariance fusion module and the LSTM network capture the correlations of joints within the same frame, while a window of frames models the dependencies between sequential frames. To enhance the characteristics of the actions within each window, the dynamic normalization module adjusts the normalization strategy adaptively. The experimental results demonstrate that the proposed method achieves a significant improvement in accuracy compared with the state-of-the-art methods.

ACKNOWLEDGMENT
This work was partially supported by the National Natural Science Foundation of China (Grant No. 61977061 and 51934007).
GUAN HUANG received the B.S. degree in computer science and technology from Nanchang University, Nanchang, China, in 2017. He is currently pursuing the master's degree with the China University of Mining and Technology, Xuzhou, China. His research interests include deep learning for action recognition.
QIUYAN YAN received the Ph.D. degree in computer applications from the China University of Mining and Technology in 2010. She is currently an Associate Professor with the China University of Mining and Technology. Her current research interests include multimodal image action recognition, big data analytics for education, and temporal data mining.
VOLUME 8, 2020