Research on Discriminative Skeleton-Based Action Recognition in Spatiotemporal Fusion and Human-Robot Interaction

A novel posture motion-based spatiotemporal fusion graph convolutional network (PM-STFGCN) is presented for skeleton-based action recognition. Existing methods for skeleton-based action recognition independently compute the joint information within a single frame and the motion information of joints between adjacent frames from the human body skeleton structure and then combine the classification results. However, this does not take into account the complicated temporal and spatial relationships in human action sequences, so such methods are not very effective at distinguishing similar actions. In this work, we enhance the ability to distinguish similar actions by focusing on spatiotemporal fusion and adaptive extraction of highly discriminative features. Firstly, the local posture motion-based temporal attention module (LPM-TAM) is proposed to suppress skeleton sequence data with a low amount of motion in the temporal domain, concentrating the representation on motion posture features. Besides, the local posture motion-based channel attention module (LPM-CAM) is introduced to exploit strongly discriminative representations between similar action classes. Finally, the posture motion-based spatiotemporal fusion (PM-STF) module is constructed, which fuses the spatiotemporal skeleton data by filtering out low-information sequences and adaptively enhances the highly discriminative posture motion features. Extensive experiments demonstrate that the proposed model is superior to commonly used action recognition methods. The designed human-robot interaction system based on action recognition achieves competitive performance compared with a speech interaction system.


Introduction
With the development of artificial intelligence technology, human-robot interaction has become a research hotspot. Compared with speech and image signals, vision-based human-robot interaction technology is more stable, and it attracts a lot of research interest. The key to human-centered visual interaction technology is to understand human activities [1] and human social behaviors [2]. Therefore, action recognition plays an important role in the field of human-robot interaction [3].
The two main approaches to human action recognition are RGB-based and skeleton-based. The RGB-based method makes full use of the image data and can achieve a higher recognition rate. However, it usually needs to process every pixel in the image to extract features, so high-cost computing resources are required and real-time processing can hardly be achieved. It is also vulnerable to poor lighting conditions and background noise. In the skeleton sequence method, human joint positions are expressed as 2D or 3D coordinates. Since the human skeleton has only a few dozen joints, modest computing resources are sufficient for real-time applications. It is also robust to dynamic environments and complex backgrounds. Many widely available tools are suitable for extracting human skeleton features, such as Microsoft Kinect, OpenPose [4], and CPN [5].
Conventional deep learning-based methods convert the skeleton sequence into a set of joint vector sequences and input them to RNNs [6], or extract features by feeding 2D pseudoimages representing skeleton sequences into CNNs [7], and then predict the action classes. However, neither the joint vector sequence nor the 2D pseudoimage can effectively represent the correlation between human joints. Recently, graph convolutional networks (GCNs) have extended the convolution operation from the 2D image structure to graph structures and have shown good performance in many applications. Yan et al. [8] used GCNs for the first time in skeleton-based action recognition and proposed a spatial-temporal graph model. Subsequently, methods for optimizing spatial feature extraction were proposed. Yang et al. [9] presented a finite-time convergence adaptive fuzzy control method for a dual-arm robot with unknown kinematics and dynamics. Shi et al. [10] used an adaptive graph convolutional layer and an attention mechanism to increase the flexibility of the model, taking first-order joint information, second-order bone information, and motion information as inputs to construct multistream networks. Liu et al. [11] proposed multiscale aggregation across the spatial and temporal dimensions to disentangle the importance of neighbor nodes for long-range modeling. Yang et al. proposed personalized variable gain control with tremor attenuation for robot teleoperation [12] and used adaptive parameter estimation and control design for robot manipulators with finite-time convergence [13]. Peng et al. [14] used high-order representations of skeleton adjacency graphs and a dynamic graph modeling mechanism to find implicit joint correlations. Obinata and Yamamoto [15] modeled the spatiotemporal graph by adding extra interframe edges to extract the relevant features of the human joints. However, all these methods ignore the fusion of posture motion and skeleton joint features in the temporal domain.
In existing work, the spatial information and motion information of the spatiotemporal graph are not effectively fused for end-to-end training. The proposed posture motion-based spatiotemporal fusion graph convolutional network (PM-STFGCN) uses the posture motion-based spatiotemporal fusion (PM-STF) module to fuse motion and skeleton representations in the spatiotemporal domain, enhancing skeleton features adaptively. The local posture motion-based temporal attention module (LPM-TAM) is used to constrain disturbance information in the temporal domain and learn the representation of motion posture. The local posture motion-based channel attention module (LPM-CAM) is employed to learn strongly discriminative representations between similar action classes, enhancing the ability to distinguish ambiguous actions. Extensive experiments have been performed on two large-scale skeleton datasets. Combined with methods that optimize only the spatial graph convolution, the proposed method further improves recognition performance over common approaches. In addition, a human action recognition interactive system was designed and compared with speech interaction.
The main contributions of this work are as follows:
(1) A novel local posture motion-based temporal attention module (LPM-TAM) filters out low-motion information in the temporal domain, which helps to improve the extraction of relevant motion features.
(2) The local posture motion-based channel attention module (LPM-CAM) is employed to adaptively learn strongly discriminative representations between different action classes, enhancing the ability to distinguish similar actions.
(3) The posture motion-based spatiotemporal fusion (PM-STF) module integrates LPM-TAM and LPM-CAM to effectively fuse spatiotemporal feature information and extract highly discriminative features, improving the ability to distinguish similar actions.
(4) The effectiveness of the proposed method has been verified through extensive experiments against other common methods, and the method has been successfully applied to a humanoid robot, verifying that action interaction outperforms speech interaction.

Spatial Graph Convolution Networks.
The spatial-temporal graph convolutional network [8] represents the connection relationship of the joints with the self-connected identity matrix I and the adjacency matrix A^S. In the case of a single frame, the convolution operation in the spatial dimension is performed as follows:

f_out = Σ_{k=0}^{K−1} W_k f_in (Λ_k^{−1/2} A_k^S Λ_k^{−1/2} ⊗ A^S),   (1)

where f_in ∈ R^{C_in×T×N} is the feature map with input dimensions (C_in, T, N), N is the number of joints, A_k^S is the N × N adjacency-like matrix, A_k^S(i, j) = 1 denotes that vertex v_i is in the subset of vertex v_j, and Λ_k(i, i) = Σ_j A_k^S(i, j) + λ is the normalized diagonal matrix, with λ = 0.001. W_k is the weight matrix of the 1 × 1 convolution. K is the number of subsets in the spatial dimension given by the spatial distance partition strategy; there are three subsets, i.e., K = 3. A_0^S represents the connection of the vertex itself, A_1^S represents the connections of the centripetal subset, and A_2^S represents the connections of the centrifugal subset.
A^S is an N × N spatial attention map, which denotes the importance of each joint. ⊗ is elementwise multiplication of corresponding matrix entries, so the attention can only affect the vertices connected to the current target.
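As an illustration, the following is a minimal NumPy sketch of this spatial graph convolution. The partition matrices, weights, and the elementwise attention map are toy placeholders, and the normalization follows the Λ definition in the text (a sketch, not the authors' implementation):

```python
import numpy as np

def spatial_graph_conv(f_in, A_subsets, W_subsets, att=None, lam=0.001):
    """Spatial graph convolution over K partition subsets (a sketch of Eq. (1)).

    f_in      : (C_in, T, N) input feature map
    A_subsets : list of K (N, N) adjacency-like matrices A^S_k
    W_subsets : list of K (C_out, C_in) weight matrices (1x1 convolutions)
    att       : optional (N, N) spatial attention map A^S, applied elementwise
    """
    out = 0.0
    for A_k, W_k in zip(A_subsets, W_subsets):
        # normalized diagonal: Lambda_ii = sum_j A_ij + lambda (lambda avoids /0)
        Lam = A_k.sum(axis=1) + lam
        A_norm = A_k / Lam[:, None]          # row-normalized adjacency
        if att is not None:
            A_norm = A_norm * att            # attention touches only connected vertices
        # aggregate features over neighbor joints, then mix channels with W_k
        agg = np.einsum('ctn,nm->ctm', f_in, A_norm)
        out = out + np.einsum('oc,ctm->otm', W_k, agg)
    return out
```

With an all-ones attention map the result is unchanged, which is a quick sanity check that the attention only rescales existing edges.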

Temporal Graph Convolutional Networks.
The literature [10] proposed a temporal attention module; the attention coefficient is calculated as follows:

A_T = σ(θ(AvgPool(f_in))),

where f_in ∈ R^{C_in×T×N} is the input feature map, AvgPool is an average pooling operation, θ is a 1 × 1 convolution with weight matrix M_θ ∈ R^{1×C_in×S}, where S is the size of the convolution kernel, and σ is the sigmoid activation function. The attention feature map A_T ∈ R^{1×T×1} denotes the importance of the skeleton graph along the temporal dimension, where T is the length in time.
The literature [8] defined the temporal graph convolution with a simple strategy: in equation (1), a kernel of size Θ × 1 is used in the temporal dimension to perform the graph convolution. Therefore, the sampling area in the temporal dimension covers the Θ frames neighboring the current frame, where Θ is the temporal kernel size, set to 9 in [8].
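The temporal attention of [10] described above can be sketched as follows; the 1-D kernel `M_theta` and the bias are illustrative placeholders, not trained weights:

```python
import numpy as np

def temporal_attention(f_in, M_theta, bias=0.0):
    """Temporal attention A_T = sigmoid(theta(AvgPool(f_in))) -- a sketch of [10].

    f_in    : (C_in, T, N) input feature map
    M_theta : (S,) 1-D convolution kernel applied along the time axis
    returns : A_T with shape (1, T, 1), one importance weight per frame
    """
    # average-pool over channels and joints -> one value per frame
    pooled = f_in.mean(axis=(0, 2))                       # (T,)
    conv = np.convolve(pooled, M_theta, mode='same') + bias
    A_T = 1.0 / (1.0 + np.exp(-conv))                     # sigmoid
    return A_T.reshape(1, -1, 1)
```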

Posture Motion Representation.
The posture motion represents the motion information of a joint across consecutive frames. For example, given the j-th joint of frame u + 1, v_{u+1,j} = (x_{u+1,j}, y_{u+1,j}, z_{u+1,j}), and the j-th joint of frame u, v_{u,j} = (x_{u,j}, y_{u,j}, z_{u,j}), the posture motion is represented as τ_{u,j} = (x_{u+1,j} − x_{u,j}, y_{u+1,j} − y_{u,j}, z_{u+1,j} − z_{u,j}). τ_{u,j} is the posture motion representation of the j-th joint of frame u.
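The frame-difference definition of τ is straightforward to compute; a minimal sketch over a (T, N, 3) coordinate array:

```python
import numpy as np

def posture_motion(coords):
    """Frame-wise posture motion tau_{u,j} = v_{u+1,j} - v_{u,j}.

    coords  : (T, N, 3) array of joint coordinates over T frames and N joints
    returns : (T-1, N, 3) array of per-joint displacements between frames
    """
    return coords[1:] - coords[:-1]
```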

Local Posture Motion-Based Temporal Attention Module.
A novel local posture motion-based temporal attention module (LPM-TAM) is proposed for suppressing the large amount of disturbance information in the temporal dimension. As shown in Figure 1, the posture motion feature map of each vertex in the spatiotemporal graph is calculated as follows:

Ω = ε_m(f_in),   Φ_L = Γ ⊗ Ω,   Φ_T = σ(AvgPool(Φ_L)),

where Φ_T ∈ R^{1×T×1} refers to the importance of each frame of the spatiotemporal graph with time length T, Ω is the motion feature map of the joints, and Γ ∈ R^{1×1×V} is the local limb mask over the set D. Γ ⊗ Ω yields the attention Φ_L based on local limbs in the temporal dimension. ⊗ is elementwise multiplication of corresponding matrix entries, and σ is the sigmoid activation function. The final output is:

f_out = f_in ⊕ (f_in ⊗ Φ_T),

that is, the input feature map is multiplied by the attention feature map Φ_T in a residual manner to achieve adaptive feature enhancement, where ⊕ refers to elementwise addition of corresponding matrix entries.
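A rough NumPy sketch of the LPM-TAM pipeline (limb-masked motion, per-frame attention, residual enhancement). The limb grouping and the pooling choices are simplifying assumptions, since the exact operators are specified only at the level of Figure 1:

```python
import numpy as np

def lpm_tam(f_in, motion, limb_groups):
    """LPM-TAM sketch: limb-masked motion -> temporal attention -> residual.

    f_in        : (C, T, N) input skeleton features
    motion      : (C, T, N) posture-motion feature map Omega (padded to length T)
    limb_groups : list of joint-index lists for D = {d0..d4} (hands, legs, rest)
    """
    # Phi_L: per-limb motion magnitude over time, shape (T, eta)
    phi_L = np.stack([np.abs(motion[:, :, g]).mean(axis=(0, 2))
                      for g in limb_groups], axis=-1)
    # Phi_T: pool limbs to one score per frame, squash with a sigmoid
    phi_T = 1.0 / (1.0 + np.exp(-phi_L.mean(axis=-1)))     # (T,)
    # residual enhancement: f_out = f_in + f_in * Phi_T
    return f_in + f_in * phi_T[None, :, None]
```

Frames with near-zero limb motion receive attention near 0.5 and contribute little extra signal, while high-motion frames are amplified, matching the suppression behavior described above.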

Local Posture Motion-Based Channel Attention Module.
The local posture motion-based channel attention module (LPM-CAM) is proposed to improve the ability to learn strongly discriminative representations between different postures. As shown in Figure 2, the input includes the posture motion feature map Ω ∈ R^{C×T×V} extracted by the local posture motion-based temporal attention module and the generated temporal attention, which are multiplied together to obtain the spatiotemporal graph after attention. Temporal action segments with rich action semantics thus receive more attention. With the action sequences carrying important semantics on the spatiotemporal graph screened out by the temporal attention module, the channel attention selects strongly discriminative representations between different posture movements for action recognition. The channel attention coefficient is calculated as:

A_C = σ(W_2 ρ(W_1 Concat(AvgPool(Ω_{d_0}), …, AvgPool(Ω_{d_4})))),

where ρ is the ReLU nonlinear activation function, Concat refers to concatenation of the local limb feature maps, and W_1 and W_2 are 1 × 1 convolution weights. The input feature map is multiplied by the channel attention feature map through a residual connection to achieve adaptive feature enhancement.
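Because the precise channel attention formula is not fully recoverable from the text, the following is a hedged squeeze-and-excitation-style sketch consistent with the described components (pooling, ReLU ρ, sigmoid, residual rescaling); `W1` and `W2` are illustrative bottleneck weights, not the paper's exact parameterization:

```python
import numpy as np

def lpm_cam(f_in, W1, W2):
    """LPM-CAM sketch: squeeze-and-excitation style channel attention.

    f_in : (C, T, N) feature map after temporal attention
    W1   : (C_mid, C) bottleneck weight ("squeeze")
    W2   : (C, C_mid) expansion weight ("excite")
    """
    z = f_in.mean(axis=(1, 2))                 # pool over time and joints -> (C,)
    h = np.maximum(W1 @ z, 0.0)                # rho: ReLU nonlinearity
    a = 1.0 / (1.0 + np.exp(-(W2 @ h)))        # channel attention in (0, 1)
    return f_in + f_in * a[:, None, None]      # residual channel rescaling
```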

Posture Motion-Based Spatiotemporal Fusion.
In order to fuse skeleton joint information and motion features in an end-to-end learning manner, the posture motion-based spatiotemporal fusion (PM-STF) module is proposed to fuse spatial and temporal features and enhance the discriminative features adaptively.
The output of the temporal convolution module at the i-th vertex of frame u is:

f_out(v_i) = f_in(v_i) ⊕ Σ_{v_j ∈ B(v_i)} τ(v_j) δ(c_j(v_j)).   (8)

This differs from formula (1): the input is a posture motion feature map extracted from the spatiotemporal graph, and a residual connection is adopted to enhance the motion feature. τ(v_j) is the posture motion feature of the neighborhood vertex v_j, δ is the weighting function, and c_j(v_j) is the mapping label of the subset of the neighborhood vertex v_j, which is divided into three subsets S′_{a0}, S′_{a1}, and S′_{a2} based on the spatial distance partition strategy.
To implement the PM-STF, equation (8) is transformed into:

f_out = f_in ⊕ Σ_{k=0}^{K−1} M_k (W Ω ⊗ A^S) A_k^S,

where Ω ∈ R^{C_in/2×T×N} is the posture motion feature map and W is a 1 × 1 convolution weight matrix that increases the channels of the posture motion feature map back to the input channel count. M_k ∈ R^{C_out×C_in×1×1} is a 1 × 1 convolution weight vector. A^S ∈ R^{1×T×N} is a spatial attention map used to distinguish the importance of vertices, and ⊗ refers to elementwise multiplication of corresponding matrix entries. A_k^S is the adjacency-like matrix, normalized as in equation (1).
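A minimal sketch of the PM-STF fusion step under the stated shapes (channel up-projection of the motion map, attention-weighted graph aggregation, residual addition); the per-vertex attention vector here stands in for the A^S map, and all weights are placeholders:

```python
import numpy as np

def pm_stf(f_in, Omega, A_subsets, W_up, M_subsets, att, lam=0.001):
    """PM-STF sketch: graph-convolve the posture-motion map, add it residually.

    f_in      : (C, T, N) skeleton features
    Omega     : (C//2, T, N) posture motion feature map
    A_subsets : list of K (N, N) adjacency-like matrices A^S_k
    W_up      : (C, C//2) 1x1 conv restoring the channel count
    M_subsets : list of K (C, C) 1x1 conv weights M_k
    att       : (N,) per-vertex importance (stand-in for the A^S map)
    """
    x = np.einsum('oc,ctn->otn', W_up, Omega)          # channel up-projection
    out = np.zeros_like(f_in)
    for A_k, M_k in zip(A_subsets, M_subsets):
        Lam = A_k.sum(axis=1) + lam
        A_norm = (A_k / Lam[:, None]) * att[None, :]   # weight vertices by attention
        out += np.einsum('oc,ctn,nm->otm', M_k, x, A_norm)
    return f_in + out                                   # residual motion enhancement
```

A zero motion map leaves the skeleton features untouched, which reflects the residual design: the module can only add motion evidence, never erase the spatial stream.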

Implementation of PM-STFGCN.
The proposed module is combined with models that optimize only the spatial graph convolution, such as ST-GCN and 2s-AGCN. Taking ST-GCN as an example, as shown in Figure 3, the PM-STFGCN module is added between S-GCN and T-GCN. Each PM-STFGCN layer contains LPM-TAM, LPM-CAM, and PM-STF. S-GCN and T-GCN denote the spatial and temporal graph convolution layers of the original model, GAP is a global average pooling layer, and FCN is a fully connected network layer. In this way, a spatiotemporal fusion graph convolution block is constructed, and the overall architecture of the network consists of several such blocks. A batch normalization layer is applied to normalize the input skeleton data. Finally, a global average pooling layer pools the feature maps to the same size, followed by a SoftMax classifier to obtain the prediction.
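The block layout described above (S-GCN, then the PM-STF module, then T-GCN, followed by GAP and an FCN/softmax head) can be sketched schematically; the layer callables and the classifier weights are placeholders for whatever the host model provides:

```python
import numpy as np

def pm_stfgcn_block(f_in, s_gcn, pm_stf_layer, t_gcn):
    """One PM-STFGCN block: the PM-STF module (LPM-TAM + LPM-CAM + PM-STF)
    is inserted between the spatial (S-GCN) and temporal (T-GCN) layers."""
    return t_gcn(pm_stf_layer(s_gcn(f_in)))

def pm_stfgcn_head(features, W_fc):
    """GAP over time and joints, then a fully connected softmax classifier."""
    pooled = features.mean(axis=(1, 2))              # global average pooling -> (C,)
    logits = W_fc @ pooled                           # FCN layer
    e = np.exp(logits - logits.max())
    return e / e.sum()                               # softmax class probabilities
```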

Implementation in Human-Robot Interaction.
The presented action recognition scheme was applied to a real system consisting of a Pepper robot and an external Kinect v2 depth camera.
The implementation in human-robot interaction was performed as described in Algorithm 1.

Datasets.
NTU-RGB+D [16] is the largest and most widely used multimodality dataset for skeleton-based action recognition. Each action segment was performed by 40 volunteers aged 10 to 35 and captured by three camera sensors at the same height but from different horizontal angles. There are two kinds of training benchmarks recommended by [16]: cross-subject (CS) and cross-view (CV). In the cross-subject (CS) benchmark, the training dataset contains 40320 action samples and the testing dataset contains 16560 action samples. In the cross-view (CV) benchmark, the training dataset contains 37920 action samples taken by camera sensors 2 and 3, and the testing dataset contains 18960 action samples taken by camera sensor 1. In the following experiments, we report top-1 accuracy on both benchmarks.
Kinetics-Skeleton [17] is a large dataset for skeleton-based action recognition. Kinetics contains 300000 action video clips covering a total of 400 classes. [8] used the publicly available OpenPose toolbox [4] to estimate the pose of 18 joints in each frame; each action video clip has 300 frames. In multiperson clips, the two people with the highest average joint confidence are selected in each frame. The training dataset contains 240000 video clips and the testing dataset contains 20000 video clips. We train on the training dataset and report top-1 and top-5 accuracy on the testing dataset.

Ablation Study.
The effectiveness of the proposed method has been verified on two large skeleton datasets, Kinetics-Skeleton and NTU-RGB+D. Two sets of comparisons are made: between ST-GCN [8] and ST-GCN + PM-STFGCN, and between 2s-AGCN [18] and 2s-AGCN + PM-STFGCN.
The results show that performance improves over the original models, verifying the effectiveness of LPM-TAM, LPM-CAM, and PM-STF.
As shown in Tables 1 and 2, for ST-GCN [8] versus ST-GCN + PM-STFGCN, PM-STFGCN improves the top-1 accuracy on the CS and CV benchmarks by 4.2% and 1.6%, respectively, and the top-1 and top-5 accuracy on the Kinetics-Skeleton dataset by 2.5% and 1.9%, respectively. For 2s-AGCN [18] versus 2s-AGCN + PM-STFGCN, the corresponding improvements are 0.8% and 1.3%, respectively. Compared with the original models, the spatiotemporal fusion module makes the greater contribution to the improvement in recognition performance, which verifies the effectiveness and necessity of spatiotemporal fusion.

Comparison with State-of-the-Art Schemes.
The proposed method is compared with some state-of-the-art schemes, and the results are shown in Tables 3 and 4. Among them, 2s-AGCN + PM-STFGCN achieved very good performance on CS and CV. On the Kinetics-Skeleton dataset, the top-1 and top-5 accuracy of 2s-AGCN + PM-STFGCN also showed decent performance.

Human-Robot Interaction Demonstration.
To further evaluate the robustness of the proposed action recognition scheme in distinguishing similar action classes, action recognition is applied to a real system that consists of a Pepper robot and an external Kinect v2. As shown in Table 5, there is a correspondence between action semantics and interactive actions.
The designed correspondence between action semantics and interactive activities ranges from partial limb movements of the hands to more complex whole-body movements. For example, waving the hand, touching the ear, holding the head with the hands, and applauding are all hand movements; among them, the first three are related to the head and highly similar to one another. Squatting and sitting, in contrast, involve the whole body.

Table 2: Ablation study on the skeleton-based dataset Kinetics-Skeleton.
Among them, each similar action has a high recognition accuracy, which means our method can effectively distinguish different actions. Each action sequence can be seen as a combination of many steps. For example, waving the hand can be divided into two steps: first, raise the right hand above the head; second, swing the hand around the head. Similarly, a video can be decomposed into multiple frames of images.

Strong Discrimination Analysis.
As shown in Figures 5-8, there are examples of human-robot interaction with similar actions. An action sequence over a period of time is evaluated, and the class with the highest probability is selected as the recognition result. Skeleton sequences with low motion information are filtered out well by LPM-TAM, which helps to identify the process from raising the hand to the head and swinging it, and to recognize interactive actions more purposefully. The main action of touching the ear is the process of raising the hand to the ear; compared with waving the hand, the main difference is the swinging movement of the hand near the head. LPM-CAM pays more attention to strongly discriminative characteristics and constrains similar movement processes, such as the process of raising the hand, which serves as the basis for action recognition.
The action of holding the head with both hands is similar to touching the ears. However, the main difference is that holding the head involves both the left and right hands, while touching the ears involves only one hand. The main differences between similar movements in the local limb area can be captured effectively by LPM-CAM, which enables the proposed method to extract stronger discriminative representations. The human-robot interaction experiments verified that similar actions did not affect the recognition results, demonstrating strong discrimination between similar actions.

Comparison with Speech Interaction.
In this work, the two indicators of accuracy and real-time performance are compared with speech interaction. Each interaction method was tested 50 times, and the number of correct recognitions was recorded to verify the reliability of action interaction. Figure 9 shows the confusion matrix of the Pepper robot speech interaction recognition. In the testing phase, the user only needs to speak the corresponding action, such as "wave the hand" or "touch the ear." The recognition result is recorded as "jumping" if the speech interaction produces no result within the specified test time.
The recognition result of speech interaction is easily affected by external noise and distance, which cause recognition errors or no recognition result at all. From the experimental results, the average recognition rates of action interaction and speech interaction are 95.7% and 94.8%, respectively. Compared with speech interaction, our scheme is highly competitive, which verifies the reliability of action interaction.

Methods                         CS (%)   CV (%)
STA-LSTM [6]                    73.4     81.2
ST-GCN [8]                      81.5     88.3
CNN-based [7]                   83.2     89.3
GCN-NAS [14]                    89.4     95.7
MS-AAGCN [10]                   90.0     96.2
2s-AGCN + PM-STFGCN (ours)      91.9     96.5

Due to the different durations of each action, using the same time segment as input causes fluctuations in response time. We ran additional experiments to average out the differences in action response time. The results show that the response time of action interaction is shorter than that of speech interaction because of its robustness to external environmental noise. The average response times of speech and action interaction are 2.05 s and 1.86 s, respectively; compared with speech interaction, the proposed scheme reduces the response time by 0.19 s. The main reason is that video frames within a certain time range are used for recognition, and the action recognition network requires a shorter processing time.
In conclusion, through the experimental comparison of the two human-robot interaction modes, action recognition shows its advantages: it is not affected by environmental noise or spatial distance, and it provides a better real-time response during interaction.

Conclusion
Previous works in the literature mostly model motion information and skeleton joint information independently, which cannot fully express the relationship between them. The posture motion-based spatiotemporal fusion graph convolution network (PM-STFGCN) is presented to fuse temporal and spatial features and adaptively enhance the highly discriminative posture motion features. A novel local posture motion-based temporal attention (LPM-TAM) module is introduced to efficiently suppress disturbance information with low motion in the temporal domain and fully learn the representation of posture motion. The local posture motion-based channel attention module (LPM-CAM) is proposed to learn strongly discriminative representations between different motion postures, improving the ability to discriminate action classes, and the posture motion-based spatiotemporal fusion module (PM-STF) is adopted to effectively fuse the motion features and skeleton representations. Extensive experiments were performed on two large skeleton datasets, and the constructed scheme shows substantial improvement over other methods. The proposed action recognition interaction system achieves competitive accuracy and response time compared with speech interaction.
ε_m extracts the posture motion representation from the input feature map. Ω ∈ R^{C_in/2×T×N} is the posture motion feature map, where the channel count of the feature map is half of the input channels. Human motion consists of body movements that involve part or all of the limbs. The attention map Φ_T of the skeleton sequence in the spatiotemporal graph is derived from the attention Φ_L of local limbs in the temporal dimension. The importance of a local limb in the temporal dimension is determined by the motion information in the local perception domain D = {d_0, d_1, d_2, d_3, d_4}, where d_0 denotes the left hand, d_1 the right hand, d_2 the left leg, d_3 the right leg, and d_4 the other limb parts. Φ_L ∈ R^{1×T×η}, where η is the number of limb groups, set to 5 in this work.
PM-STFGCN integrates the local posture motion-based temporal attention module (LPM-TAM), the local posture motion-based channel attention module (LPM-CAM), and the posture motion-based spatiotemporal fusion (PM-STF) module.

Comparison of Response Time.
As shown in Figure 10, the comparison of the response times of speech and action interaction shows the average of 10 test results.

Figure 5: Action interaction with waving the hand as an example.

Figure 6: Action interaction with touching the ear as an example.

Figure 7: Action interaction with holding hands as an example.

Figure 8: Action interaction with applause as an example.

Figure 9: The confusion matrix of speech interaction with the Pepper robot.

Figure 10: Comparison of the response time of speech and action interaction.

Table 1: Ablation study on the benchmark of NTU-RGB+D.

Table 3: Comparison of CS and CV benchmarks with state-of-the-art schemes.

Table 4: Comparison on the Kinetics-Skeleton dataset with state-of-the-art schemes.