Deep Learning for Human Activity Recognition on 3D Human Skeleton: Survey and Comparative Study

Human activity recognition (HAR) is an important research problem in computer vision. This problem is widely applied to building applications in human–machine interactions, monitoring, etc. Especially, HAR based on the human skeleton creates intuitive applications. Therefore, determining the current results of these studies is very important in selecting solutions and developing commercial products. In this paper, we perform a full survey on using deep learning to recognize human activity based on three-dimensional (3D) human skeleton data as input. Our research is based on four types of deep learning networks for activity recognition based on extracted feature vectors: Recurrent Neural Network (RNN) using extracted activity sequence features; Convolutional Neural Network (CNN) uses feature vectors extracted based on the projection of the skeleton into the image space; Graph Convolution Network (GCN) uses features extracted from the skeleton graph and the temporal–spatial function of the skeleton; Hybrid Deep Neural Network (Hybrid–DNN) uses many other types of features in combination. Our survey research is fully implemented from models, databases, metrics, and results from 2019 to March 2023, and they are presented in ascending order of time. In particular, we also carried out a comparative study on HAR based on a 3D human skeleton on the KLHA3D 102 and KLYOGA3D datasets. At the same time, we performed analysis and discussed the obtained results when applying CNN-based, GCN-based, and Hybrid–DNN-based deep learning networks.


Introduction
HAR is an important research problem in computer vision. It is applied in many fields, such as human-machine interaction [1], video surveillance [2,3], and fashion and retail [4]. It has been of research interest for nearly a decade, and studies are often based on human pose to perform the activity recognition process, where especially 3D human pose-based activity recognition provides real-world-like visualization.
Despite much research interest and impressive results, HAR still contains many real challenges in the implementation process. In Islam et al. [5]'s study, the following challenges were presented. Firstly, the skeleton data of commonly seen daily activities such as walking, running, and sitting down are often recognizable and rich in data. However, data on the skeleton of concurrent or aggregate actions on these actions are sorely lacking. Secondly, the datasets of 3D human skeletons currently used to train recognition models are often in a scene with only one person (a skeleton); in real-life operations, there are often many people performing many activities, such as queuing in a store, walking, or jogging. In particular, information about the context of people's activities is still lacking. Since human activity recognition is very closely related to understanding human behavior, context plays a huge role in HAR. Thirdly, human skeleton datasets for HAR can be collected from many different types of sensors, different objects, and different contexts. Although they perform the same activity, some people are taller or shorter, the method of execution is not the same for each person, etc., and has not been standardized. Especially, 3D human skeleton data can be represented in spatial and temporal models; the number of dimensions of the data is too large because the data of the 3D human skeleton has many degrees of freedom, or due to the exploitation of different operations, movement is located on only a small part of the human body.
In the studies of Xing et al. [6], Ren et al. [7], and Arshad et al. [8], they conducted a full survey of HAR based on a 3D human skeleton, in which the research is also surveyed and divided into three approaches: RNN-based, CNN-based, and GCN-based. In these two survey studies, the challenges of HAR based on the 3D human skeleton have not been presented and analyzed. Islam et al. [5] only conducts the survey using CNN-based HAR.
In this paper, we conduct a survey on deep learning-based methods, datasets, and HAR results based on 3D human poses as input data. From there, we propose and analyze the challenges in recognizing human activities based on the 3D human pose.
Previous studies on HAR often applied to datasets with a small number of 3D joints, namely from 20 to 31 points as the HDM05 dataset [9] (31 3D joints), CMU dataset (22 3D joints) [10], NTU RGB-D dataset (25 3D joints) [11,12], MSRA Action3D dataset (20 3D joints) [13], as presented in Figure 1 of Kumar et al. [14]'s research. With a low number of representative points, the low dimensionality of the data is provided, but the lack of enough information to distinguish actions also occurs. As illustrated in Figure 1, the number of joints is large and has many degrees of freedom, which makes the dimensionality of the data very large, especially when the skeleton moves in the 3D space and follows the temporal model. Another problem is that in the KLHA3D-102 [14] dataset, there are many similar actions such as "drink water" and "drink tea" and "golf swing pitch shot" and "golf swing short shot".  [14] and KLYOGA3D [14] datasets.
In the study of Kumar et al. [14], the problem of human activity recognition was researched on the HDM05, CMU, NTU RGB-D, MSRA Action3D, KLHA3D102 [14], and KLYOGA3D [14] datasets. At the same time, we also perform experiments on activity recognition on the DDNet, Res-GCN, and PA Res-GCN models with the KLHA3D102 [14], and KLYOGA3D [14] datasets. To scrutinize the results of HAR based on the latest deep learning models (DLM), we have performed a comparative study on HAR on the KLHA3D102 [14] and KLYOGA3D [14] datasets.
The main contributions of the paper are as follows: • We present an overview of the HAR problem based on the 3D human pose as the input, with four types of DNN to perform the estimation: RNN-based, CNN-based, GCN-based, and Hybrid-DNN-based. • A full survey of HAR based on the 3D human pose is elaborated in detail from methods, datasets, and recognition results. More specifically, our survey provided about 250 results of the HAR across more than 70 valuable studies from 2019 to March 2023. The results are listed in ascending order of the year and are all evaluated on the accuracy measure. • We performed a study comparing HAR on the KLHA3D 102 [14] and KLYOGA3D [14] datasets with the GCN-based and Hybrid-DNN-based neural models. • Analysis of challenges in HAR based on the 3D skeleton of the whole body is presented. The analysis of the challenges of implementing HAR with two main contents is the number of dimensions of the data and the insufficient information to distinguish actions with a limited number of reference points.
The content of this paper is organized as follows. Section 1 introduces the applications and difficulties of HAR based on 3D human skeleton input data. Section 2 discusses related research in HAR. Section 3 presents a full survey of HAR methods based on 3D human skeleton data input. Section 4 presents a comparative study of HAR on the KLHA3D 102 [14] and KLYOGA3D [14] datasets. Section 5 concludes the contributions and presents future works.

Related Works
HAR is based on computer vision with RGB, skeletal, and depth input representation. Wang et al. [15] surveyed on HAR based on input data can be the skeleton, RGB, RGB + D, optical flow, etc. In the study, the authors only present and analyze the ST-GCN (Spatial Temporal Graph Convolutional Networks) [16] and 3D CNNs (3D Convolutional Neural Networks) hybrid with some architecture [17,18]. HAR results were not presented in this study.
Morshed et al. [19] conducted a comprehensive survey of HAR with three methods of input data types: depth-based methods, skeleton-based methods, and hybrid feature-based methods. They showed that the approaches using depth information could use more RGB images to extract features such as Histogram of Gradients (HOG) to generate Depth Motion Maps (DMM) and to train HAR models. With skeleton-based methods, the trajectory-based method used is based on the trajectory that investigates the spatial and temporal movement of the human body's skeleton to extract different features. The human skeleton can be a 2D skeleton or a 3D skeleton. At the same time, the volume-based methods usually compute features like texture, color, pose, histograms of optical flow, histograms of directed gradients, and other features to represent human activity in a spatial-temporal volume. The results in Tables 1 and 2 from [19] are mainly created on RGB and depth videos, with only one result [20] on the MSRAction3D dataset.
In addition, there is a survey by Jobanputra et al. [21] divided into two main directions: using traditional machine learning and deep learning. Within deep learning, they separated dense artificial neural networks, convolutional neural networks (CNN), and recurrent neural networks (RNN).
Gupta et al. [22] surveyed HAR of people from multi-modal information sources, where information can be sensor-based [23], vision-based [24], RFID-based [25,26], WiFibased [27,28], and device free [29]. The recognition model is also based on two methods: machine learning and deep learning. They calculated the ratio of device types to capture the data for HAR or perform HAR right on those devices as follows: vision = 34%, WiFi = 7%, RFID = 7%, sensor = 52%, and the ratio of vision-based input information is video = 65% and skeleton = 35%, as illustrated in Figure 2.

Methods
In the valuable surveys of Ren et al. [7] and Le et al. [30], the problem of HAR using 3D skeleton-based deep learning models can be solved by four types of deep learning models (DLMs): recurrent neural networks, convolutional neural networks, graph convolution network, and hybrid deep neural networks (Hybrid-DNN). The types of deep learning models and methods for HAR are illustrated in Figure 3. In Xing et al. [6] and Ren et al. [7], they conducted a full survey on HAR based on 3D human skeletons, however, the results have only been updated until 2019. In this paper, we update the results of HAR based on a 3D skeleton from 2019 to March 2023.

RNN-Based
RNN-based methods that use vector sequences of joint positions in continuous time, the position of each joint in the human body as it moves over time can be expressed as a vector. The main idea of the RNN is to use some kind of memory to store information from previous computational steps so that based on it can make the most accurate prediction for the current prediction step. Figure 4 illustrates the RNN-based approach for HAR based on a 3D human skeleton. As known, a variation of RNN called Long Short-Term Memory (LSTM) [31,32] has achieved many impressive results in the field of NLP (Natural Language Processing) [33,34], speech recognition [35], and especially in computer vision [36,37]. LSTM is the same as a traditional RNN, except adding computational gates in the hidden layer to decide to retain important information for the next time steps.

Figure 4.
Illustrating the RNN-based approach for HAR based on 3D human skeleton [6]. (a) is the skeleton representation as a function of the temporal-spatial, (b) is the skeleton representation according to the tree structure.
Ye et al. [38] proposed a combination of RNN and LSTM to learn geometric features from 3D human skeleton data. The proposed model selects geometric features based on distances from joints and selected lines as the input of the network. To this end, the model has the first layer of LSTM and the second layer of temporal pooling with the ability to select the most recognizable time period features. To extract the geometric features, the algorithm performs two steps. The first is pre-processing the 3D human skeleton data by converting the 3D human skeleton data from the camera coordinate system to the human coordinate system, with the origin being the center of the hip joints. The X-axis is a 3D vector parallel to the "Right shoulder"| and the "Left shoulder" (red axis), the Y-axis is parallel to the 3D vector from the "Head" to the "Center hip", and the Z-axis is then X × Y. The coordinate system on the body is shown in Figure 5. The second step is to represent the geometric features, unlike other studies that use the coordinates of the joints as the input. In this study, the authors use 30 lines selected on the lines as shown in Figure 5 as the input for geometric feature calculation.
Gaur et al. [37] develops a HAR framework based on LSTM-RNN. The framework is integrated into wearable sensors. The proposed framework includes four modules: the first is the data pre-processing, the second describes the benefits and drawbacks of the RNN model, the third is the LSTM networks model, and the final is the combination module of LSTM and RNN. Li et al. [39] proposed an independently recurrent neural network (IndRNN) to solve five problems of RNN for HAR. The first one can process longer sequences greater than 5000 steps and still solve the problem of gradient vanishing and exploding. The second can construct deeper networks (over 20 layers, much deeper if GPU memory supports). The third can be robustly trained with ReLU. The fourth can be to interpret the behavior of IndRNN neurons independently without the effect of the others. The fifth is reduced computational complexity (over 10 times faster than cuDNN LSTM when the sequence is long).
Liao et al. [40,41] proposed the Logsig-RNN with some advantages over RNN as follows: (1) The sequence of the log signatures at a coarser time partition is transformed from a high-frequency sampled time series by the log-signature layer. The log-signatures transformation reduces training time. (2) When using high frequency and continuous data sampling at a coarser time grid, it is possible to ignore the microscopic character of the streamed data and render lower accuracy. Meanwhile, the Logsig-RNN model can do this well. (3) has a much better performance than RNN when performed on missing data. (4) The Logsig-RNN model can sample the highly oscillatory stream data. The improved model of Logsig-RNN compared to RNN is illustrated in Figure 6.

CNN-Based
This approach utilizes the outstanding advantages of CNN in object recognition in 2D space (image space). It maps a 3D representation of the 3D human skeletons into a 2D array (possibly spatial relations from the skeleton joints) to learn the spatial-temporal skeleton features. The CNN-based approach is illustrated in Figure 7.
Tasnim et al. [42] proposed a DCNN model to train the feature vector transformed from the coordinates of the joints along the X, Y, and Z axes. The joints of each frame ith are represented by F i (X ij , Y ij , Z ij ), where j is the joint number. Li et al. [43] proposed a CNN fusion model for skeletal action recognition. The fusion model was trained from two types of feature vectors: three SPI (skeletal pose image) sequences and three STSIs (skeletal trajectory shape images). A PoseConv3D model was proposed by Duan et al. [44]. This model used the 3D-CNN for capturing the spatio-temporal dynamics of skeleton sequences; there in the input of the 3D-CNN backbone are 3D heatmap volumes. The pseudo heatmaps for joints and limbs are generated and are good inputs for 3D-CNNs. Koniusz et al. [45] propose the sequence compatibility kernel (SCK) and dynamics compatibility kernel (DCK) feature representations. SCK is generated from the spatio-temporal correlations between features as illustrated in Figure 2a,b of Koniusz et al. [45]'s research, and DCK explicitly models the action dynamics of a sequence as illustrated in Figure 4a,b of Koniusz et al. [45]'s research. This research used the ResNet-152 model [46] to train the HAR features.

GCN-Based
GCN-based deep learning uses the natural representation of the 3D human skeleton as a graph, with each joint as a vertex and each segment connecting the human body parts as an edge. This approach often extracts the spatial and temporal features of the skeleton graph series, as illustrated in Figure 8. With the advantages of features that can be extracted from the skeleton graph, this approach has received much research attention in the past four years. Figure 9 shows the number of studies based on the GCN methods. In 2019, Shi et al. [47] proposed a novel MS-AAGCN (multistream attention-enhanced adaptive graph convolutional neural network) with some advantages for HAR based on 3D human skeleton data. The first is that an adaptive GCN is proposed to adaptively learn the topology of the graph in an end-to-end manner. The second is to embed an STC-attention module in each graph convolutional layer, which can help the model learn to focus on discriminative joints, frames, and channels selectively. Third, the combination of information from bones, joints, and information about the movement of bones and joints has created high efficiency for the activity recognition process. Peng et al. [48] proposed the first automatically designed GCN as well as a NAS (Neural Architecture Search). The spatial-temporal correlations between nodes are used to increase the search space of the GCN by building higher-order connections with a Chebyshev polynomial approximation. The NAS helps to increase search efficiency; it both performs sampling and is memoryefficient. Shi et al. [49] proposed a novel directed graph neural network to train features extracted from joints, bones, and their relationships. The skeleton data are represented as a DAG (directed acyclic graph) based on the kinematic dependency between the joints and bones in the human body. A two-stream framework is used to exploit two streams of information, namely the space and time of movement of the joints. 
The AS-GCN (actionalstructural graph convolution network) is proposed by Li [50]. The AS-GCN combines both actional-structural graph convolution and temporal convolution into a basic building block for training both spatial and temporal features. The AS-SCN block is connected to two parallel branches by a future pose prediction head.
A novel end-to-end network AR-GCN (attention-enhanced recurrent graph convolutional network) is proposed by Ding et al. [51]. AR-GCN is an end-to-end network capable of selectively learning discriminative spatial-temporal features and overcoming the disadvantages of learning only using key frames and key joints. The AR-GCN combines the advantages of the GCN and an RNN. Thus, the AR-GCN promotes the spatial feature extraction ability of GCN and improves the discriminative temporal information modeling ability. Gao et al. [52] proposed the BAGCN (Bidirectional Attentive Graph Convolutional Network). A GCN-based focusing and diffusion mechanism is used to learn spatial-temporal context from human skeleton sequences. The features of BAGCN are built based on the representation of the skeleton data in a single frame by two opposite-direction graphs, thereby effectively promoting the way of message passes in the graph. Li et al. [53] proposed the Sym-GNN (Symbiotic Graph Neural Networks) for HAR and predicting motion based on a 3D human skeleton. Sym-GNN consists of two component networks: a prime joint-based network to learn body-joint-based features, and a dual bone-based network to learn body-bone-based features. The backbone of each network is essentially a multi-branch, multi-scale GCN. Wu et al. [54] proposed a dense connection block for ST-GCN to learn global information, and to improve the robustness of features. The proposed method based on the spatial residual layer and the dense connection block produces better results than state-of-art methods resting on the spatial-temporal GCN. A two-stream non-local graph convolutional network is proposed by Shi et al. [55] to solve the problem of mining both the coordinate of joints and the length and information direction of bones. The BP algorithm is used to learn the topology of the graph in each layer. Papadopoulos et al. 
[56] proposed the DH-TCN (Dilated Hierarchical Temporal Graph Convolutional Network) module for modeling short and long-term dependencies. To represent extracted features on a 3D human skeleton, the author proposed a GVFE (Graph Vertex Feature Encoder) module for encoding vertex features.
Kao et al. [57] proposed graph-based motion representations using the skeleton-based graph structure. A skeletal-temporal graph starts with a skeletal-temporal graph such as the Fourier transform graph. The skeletal-temporal graph is transformed into the motion representation. To extract features for the activity recognition process, the authors used temporal pyramid matching [58] to model the dynamics in the sequence of frame-wise representations.
In 2020, Song et al. [59] proposed the PA-ResGCN with the combination of MIBs (Multiple Input Branches), ResGCN (Residual GCN) with a bottleneck structure, and PA (Part-wise Attention) blocks. The authors calculated and characterized spatial-temporal sequence from the joints, velocity, and bone of the human skeleton based on human body parts. These features are represented by a part of the human skeleton and trained by some Residual GCN modules. Next, the branches are concatenated and sent to several PA-ResGCN modules, where each PA-ResGCN module contains a sequential execution of a Residual GCN module.
The Shift-GCN is proposed by Cheng et al. [60]. Other GCNs, such as AS-GCN and Adaptive GCN, use heavy regular graph convolutions. The Shift-GCN uses shift graph operations and lightweight point-wise convolutions. In the shift graph, both the spatial graph and temporal graph are used to compute feature vectors. Thus, the computational complexity is significantly reduced. Song et al. [61] proposed the GCN-based multi-stream model called the RA-GCN (richly activated GCN). The rich discriminative features are extracted from skeleton motion sequences. Especially, the noisy or incomplete skeleton data brings challenges to HAR and training; thus, RA-GCN proposed the problem with the learned redundant. Peng et al. [48] proposed a brand-new ST-GCN to model the graph sequences on the Riemann manifold by Poincare geometry features computed from the spatial-temporal graph convolutional network. A Poincare model is trained on a multidimensional structural embedding for each graph. Mixing the dimensions is used to provide a more distinguished representation of the Poincare model. To obtain effective feature learning, Liu et al. [62] proposed a unified spatial-temporal graph convolution called G3D. This method is based on the multi-scale aggregation scheme to remove the redundant dependencies between node features from different neighborhoods. G3D introduced graph edges across the "3D" spatial-temporal domain as skip connections for the unobstructed information flow. The Dynamic GCN proposed by Ye et al. [63] exploits the advantages of learning-based skeleton topology of CNNs. A CNN named CeN (Context-encoding Network) is introduced to learn skeleton topology automatically. CeN can be embedded into a graph convolutional layer and learned end-to-end. The contextual information of each joint can be monitored globally by CeN and can represent the dynamics of the skeleton system more accurately. Obinata et al. [64] proposed extending the temporal graph of a GCN. 
The authors performed adding connections to neighboring multiple vertices on the inter-frame and extracting additional features based on the extended temporal graph. From this, the extended method can extract correlated features of multiple joints in human movement for training the HAR model. Yang et al. [65] proposed a PGCN-TCA (pseudo graph convolutional network with temporal and channel-wise attention) to solve the three existing problems of the previous GCN-based networks. First, the features of joints are usually only extracted based on the direct connection between bones for which the distant joint information that has no physical connection in a skeleton chain has not been used. The second is the normalized adjacency matrices are directly computed by the predefined graph and kept fixed through the training process. They are used on most GCN-based networks, which makes the model unable to extract diverse features. The third is that different frames and channels are of different importance to action recognition.
Ding et al. [66] proposed a novel SemGCN (Semantics-Guided Graph Convolutional Network) to extract multiple semantic graphs for skeleton sequences adaptively. This method can explore action-specific latent dependencies, and allocate different levels of importance to different skeleton information. The spatial useful and temporal information is extracted based on the different feature fusion strategies of the Sem-GCN block.
Yu et al. [67] proposed PeGCNs (Predictively encoded Graph Convolutional Networks) to train a GCN-based action recognition model with missing and noisy human skeleton data. To learn such representations by predicting the perfect sample from the noisy sample in latent space via an auto-regression model by using a probabilistic contrastive loss to capture the most useful information for predicting a perfect sample.
The PR-GCN (pose refinement graph convolutional network) is proposed by Li et al. [68]. To reduce the impact of errors in the skeleton data, the authors preprocessed the input skeleton sequences via a pose refinement module. Then, the position and motion information is combined through two branches: a motion-flow branch and a position-flow-branch. In addition, the refined skeleton sequences are created based on gradual fusion. Finally, the temporal aggregation module aggregates the information over time and predicts the action class probabilities.
In 2021: Chen et al. [69] proposed a dual-head GNN (graph neural network) framework for HAR based on human skeleton data. This method used two branches of interleaved graph networks to extract features at two different temporal resolutions. The branch with a lower temporal resolution captures motion patterns at a coarse level, and the branch with a higher temporal resolution is encoded time movements on a more sophisticated level. These two branches are processed in parallel, and the output is the dual-granular action classification. Yang et al. [70] proposed a new framework called UNFGEF. This framework is unified with 15 graph embedding features with GCN and model characteristic skeletons. The human skeleton is represented using the adjacent matrix to represent the skeleton graph. The graph features of nodes, edges, and subgraphs are extracted and embedded into GCN and TCN networks. The final prediction is fused from the multi-stream through the softmax classifier for each stream. Chen et al. [71] propose the CTR-GC (Channel-wise Topology Refinement Graph Convolution) to learn different topologies dynamically and effectively aggregate joint features in different channels. This method learns a shared topology and channel-specific correlations simultaneously. To solve the problem of confusion between the activities of the nearly identical human bodies, Qin et al. [72] proposed fusing higherorder features in the form of angular encoding (AGE) into modern architectures to capture the relationships between joints and body parts robustly. To extract relevant information from neighboring nodes effectively while suppressing undesired noises, Zeng et al. [73] suggested a hop-aware hierarchical channels-squeezing fusion layer. The information from distant nodes is extracted and fused in a hierarchical structure. Dynamic skeletal graphs are built upon the fixed human skeleton topology and capture action-specific poses. Song et al. 
[74] have made improvements to the ResGCN to EfficientGCN v1, the authors used additional three types of layers (SepLayer, EpSepLayer, and SGLayer) for skeleton-based action recognition. This study employs a compound scaling strategy to configure the model's width and depth with a scaling coefficient; since then, the number of hyper-parameters is also calculated automatically. EfficientGCN v1 considers spatial attention and distinguishes the most important temporal frames. To extract effective spatial-temporal features from skeleton data in a coarse-to-fine progressive process for action recognition, Yang et al. [75] suggested the FGCN (Feedback Graph Convolutional Network). FGCN builds a local network with lateral connections between two temporal stages by a dense connections-based FGCB (Feedback Graph Convolutional Block) to transmit high-level semantic features to low-level layers.
In 2022: Lee et al. [76] proposed the HD-GCN (hierarchically decomposed graph convolutional network), as illustrated in Figure 10. HD-GCN contains a hierarchically decomposed graph (HD-Graph) to thoroughly identify the distant edges in the same hierarchy subsets and attention-guided hierarchy aggregation (A-HA) module to highlight the key hierarchy edge sets with representative spatial average pooling and hierarchical edge convolution. The DG-STGCN model (Dynamic Group Spatio-Temporal GCN) is proposed by Duan et al. [44]. DGSTGCN has the following advantages. The spatial modeling is built on learning the learnable coefficient matrices. The dynamic spatial-temporal modeling of the skeleton motion diversified groups of graph convolutions and temporal convolutions is designed dynamically group-wise.
The STGAT is proposed by Hu et al. [77] to capture short-term dependencies of spatialtemporal modeling. STGAT uses the three simple modules to reduce local spatial-temporal feature redundancy and further release the potential. STGAT builds local spatial-temporal graphs by connecting nodes in local spatial-temporal neighborhoods and dynamically constructing their relationships.
Duan et al. [78] proposed an open-source toolbox for skeleton-based action recognition based on PyTorch called PYSKL. PYSKL implements six different algorithms under a unified framework with both the latest and original good practices to ease the comparison of efficacy and efficiency. The PYSKL framework is built on top of ST-GCN, and PYSKL is the version of ST-GCN++.
The InfoGCN framework is proposed by Chi et al. [79] and presented in Figure 11. InfoGCN is a learning framework that combines a novel learning objective and an encoding method. The authors used the attention-based graph convolution that captures the contextdependent intrinsic topology of human action to learn the discriminative information for classifying action. A multi-modal representation of the skeleton using the relative position of joints also provides complementary spatial information for joints. Figure 11. The InfoGCN framework [79].
The TCA-GCN (Temporal-Channel Aggregation Graph Convolutional Networks) method is proposed by Wang et al. [80]. The TCA-GCN is used to learn spatial and temporal topologies dynamically and efficiently aggregate topological features in different temporal and channel dimensions for HAR. The TCA-GCN process of learning features is divided into two types: the TA module to learn temporal dimensional features and the channel aggregation module to efficiently combine spatial dynamic channel-wise topological features with temporal dynamic topological features.

Hybrid-DNN
Hybrid-DNN approaches use deep learning networks together to extract features and train recognition models. Here we examine a series of studies from 2019 to 2023 for skeletal data-based activity recognition. Si et al. [81] proposed the AGC-LSTM (Attention Enhanced Graph Convolutional LSTM Network). The AGC-LSTM is capable of combining discriminative features in spatial configuration, temporal dynamics, and exploring the co-occurrence relationship between spatial and temporal domains. The AGC-LSTM also uses the AGC-LSTM layer to learn high-level semantic representation and significantly reduce the computation cost. The end-to-end trainable framework is proposed [82] with a combination of a Bayesian neural network (BNN) model where BNN is again combined from the graph convolution and LSTM. The graph convolution is used to capture the spatial dependency among different body joints and LSTM is used to capture the temporal dependency of pose change over time. A new SSNet (Scale Selection Network) is proposed [83] for online action prediction. SSNet learns the proper temporal window scale at each step to cover the performed part of the current action instance. The network predicts the ongoing action at each frame. Shi et al. [84] proposed a DSTA-Net (decoupled spatial-temporal attention networks). It is built with pure attention modules without manual designs of traversal rules or graph topologies. The spatial-temporal attention decoupling, decoupled position encoding, and spatial global regularization are used to build attention networks. The DSTA-Net model splits the skeleton's data into four streams: spatial-temporal stream, spatial stream, slow-temporal stream, and fast-temporal stream, each focusing on a specific aspect of the skeleton sequence. These data streams are then combined to obtain a feature vector that best describes the skeleton data in space and time, as presented in Figure 12. 
The end-to-end SGN (Semantics-Guided Neural Network) is proposed in [85] based on a combination of GCN and CNN models. It consists of a joint-level module and a frame-level module. SGN learns the dynamics representation of a joint by fusing its position and velocity information. To model the dependencies of joints in the joint-level module and the dependencies of frames, SGN uses three GCN layers and two CNN layers, respectively. Plizzari et al. [86] proposed a novel two-stream Transformer-based model that operates on both the spatial and the temporal dimensions. The first stream is a Spatial Self-Attention (SSA) module that dynamically builds links between skeleton joints, representing the relationships between human body parts conditionally on the action and independently of the natural human body structure. The second stream is a Temporal Self-Attention (TSA) module that studies the dynamics of a joint over time. Xiang et al. [87] employed a large-scale language model as a knowledge engine to provide text descriptions of body-part movements for actions. The authors proposed a multi-modal training scheme that uses the text encoder to generate feature vectors for different body parts and to supervise the skeleton encoder for action representation learning. In other words, the skeleton is divided into parts, every human action is treated as a combination of part movements, and each part is coded with a descriptive text.
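The position-plus-velocity joint representation used by SGN can be illustrated with a minimal sketch. This is an illustrative simplification (finite-difference velocity on a (T, J, 3) array), not the authors' implementation:

```python
import numpy as np

def joint_dynamics(positions: np.ndarray) -> np.ndarray:
    """Fuse per-joint position and velocity, in the spirit of SGN's
    joint-level dynamics representation (a simplified sketch).

    positions: array of shape (T, J, 3) -- T frames, J joints, xyz.
    Returns an array of shape (T, J, 6): [position, velocity] per joint.
    """
    # Velocity as the frame-to-frame difference, padded so the
    # first frame has zero velocity.
    velocity = np.zeros_like(positions)
    velocity[1:] = positions[1:] - positions[:-1]
    return np.concatenate([positions, velocity], axis=-1)

# Example: 4 frames, 25 joints (NTU RGB+D layout)
skel = np.random.rand(4, 25, 3)
feat = joint_dynamics(skel)
print(feat.shape)  # (4, 25, 6)
```

In SGN itself, both components are further embedded before being fed to the GCN layers; the sketch only shows the fused raw input.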
Trivedi et al. [88] proposed PSUMNet (Part Stream Unified Modality Network) for HAR based on human skeleton data. It introduces a combined-modality, part-based streaming approach, in contrast to conventional modality-wise streaming approaches. PSUMNet matches or exceeds state-of-the-art methods across skeleton action recognition datasets while using roughly 2-5x (100-400%) fewer parameters.
Zhou et al. [89] built a hybrid model named Hyperformer. This model incorporates bone connectivity into the Transformer via a graph distance embedding. Unlike a GCN, which only uses the skeleton structure for initialization, Hyperformer retains the skeleton structure during training. Hyperformer also implements a self-attention (SA) mechanism on a hypergraph, termed Hypergraph Self-Attention (HyperSA), to incorporate intrinsic higher-order relations into the model.
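The graph distance embedding mentioned above is built from hop distances between joints on the skeleton graph. A minimal sketch of such a hop-distance table follows; the toy edge list is illustrative, not the paper's exact topology:

```python
from collections import deque

def hop_distances(edges, num_joints):
    """All-pairs hop distance on a skeleton graph via BFS.
    A sketch of the kind of graph-distance table that Hyperformer-style
    models turn into an embedding."""
    adj = [[] for _ in range(num_joints)]
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    dist = [[-1] * num_joints for _ in range(num_joints)]
    for src in range(num_joints):
        dist[src][src] = 0
        q = deque([src])
        while q:  # standard breadth-first search from src
            u = q.popleft()
            for v in adj[u]:
                if dist[src][v] == -1:
                    dist[src][v] = dist[src][u] + 1
                    q.append(v)
    return dist

# Toy 5-joint skeleton: hip -> spine -> neck -> head, spine -> shoulder
edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
d = hop_distances(edges, 5)
print(d[0][3])  # 3 hops from hip to head
```

Each entry d[i][j] would then index a learned embedding added to the attention scores between joints i and j.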
An action capsule network (CapsNet) for skeleton-based action recognition is proposed by Bavil et al. [90]. The temporal features associated with each joint are hierarchically encoded using a ResTCN (Residual Temporal Convolutional Network) and CapsNet to focus dynamically on a set of critical joints. CapsNet learns to dynamically attend to the features of pivotal joints for each action of the human skeleton.

Datasets
To evaluate deep learning models for HAR based on 3D human skeleton data, benchmark datasets are needed to measure performance. Here we introduce several datasets containing 3D human skeleton data.
The UTKinect-Action3D dataset [91] includes three types of data: color images, depth images, and 3D human skeletons. It was captured with a single MS Kinect using the Kinect for Windows SDK Beta and covers 10 human action types: walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands, and clap hands. The skeleton data of each frame include 20 joints, each with (x, y, z) coordinates, for a total of 199 motion sequences.
The Florence 3D Actions dataset [93] was captured with MS Kinect sensors at the University of Florence in 2012. It includes nine activity classes: wave, drink from a bottle, answer the phone, clap, tight lace, sit down, stand up, read watch, and bow. The actions were performed two or three times by 10 subjects, resulting in 215 action sequences. Each skeleton frame includes 15 body joints (with x, y, and z coordinates) captured with MS Kinect.
The J-HMDB dataset [94] is a subset of HMDB [95] with 21 action classes: brush hair, catch, clap, climb stairs, golf, jump, kick ball, pick, pour, pull-up, push, run, shoot ball, shoot bow, shoot gun, sit, stand, swing baseball, throw, walk, and wave. Each skeleton frame includes 15 points in total: 13 joints (left shoulder, right shoulder, left elbow, right elbow, left wrist, right wrist, left hip, right hip, left knee, right knee, left ankle, right ankle, neck) and two landmarks (face and belly). J-HMDB contains 928 samples and uses three train/test splits with a 7:3 ratio (70% training, 30% testing).
Northwestern UCLA Multiview Action 3D (N-UCLA) [96] was captured by MS Kinect v1 sensors from various viewpoints. The training data are captured from views 1 and 2, and the testing data from view 3. This dataset includes 10 action categories: pick up with one hand, pick up with two hands, drop trash, walk around, sit down, stand up, donning, doffing, throwing, and carrying. Each action is performed by 10 subjects.
The SYSU 3D Human-Object Interaction dataset [97] includes 480 skeleton sequences with 12 action classes performed by 40 different subjects. Each human skeleton has 20 joints. In each action, a subject can interact with one of six objects: phone, chair, bag, wallet, mop, and besom. For training and evaluation, the authors used two data split protocols: the first uses half of the samples for training and the other half for testing; the second uses half of the subjects for training and the other half for testing.
The NTU RGB+D dataset [11] was captured by three MS Kinect v2 sensors. The 3D skeletal data contain the 3D locations (x, y, z coordinates) of 25 major body joints at each frame. It contains 56,880 skeleton videos of 60 action classes. This dataset is split in two ways: cross-subject and cross-view. The cross-subject split uses 40,320 videos from 20 subjects for training and the rest for testing. The cross-view split uses 37,920 videos captured from cameras 2 and 3 for training and those from camera 1 for testing.

The Kinetics-Skeleton dataset [98] is derived from the DeepMind Kinetics human action video dataset. It includes 400 human action classes captured from nearly 300,000 videos. Each video is about 10 s long, and the human skeleton is annotated with the OpenPose toolbox [99]. Each human skeleton includes 18 joints. The training and test sets consist of 240,436 and 19,794 samples, respectively.
The NTU RGB+D 120 dataset [12] extends the NTU RGB+D dataset [11]. Most of its characteristics are the same as those of NTU RGB+D [11]; only the number of action classes grows to 120 and the number of samples to 114,480. This dataset is split into an auxiliary set and a one-shot evaluation set. The auxiliary set contains 100 classes, and all samples of these classes can be used for training. The evaluation set consists of 20 novel classes; one sample from each novel class is picked as the exemplar, while all remaining samples of these classes are used to test recognition performance. The two evaluation protocols for NTU RGB+D 120 are set similarly to those of NTU RGB+D: Cross-Subject keeps the same name in both datasets, while Cross-View of NTU RGB+D is renamed Cross-Setting.
The KLHA3D-102 dataset [100] was captured and combined from eight cameras of a MOCAP system. The 3D human skeleton of each frame includes 39 joints, as illustrated in Figure 13. KLHA3D-102 consists of 102 classes, each performed by five subjects (Sub), for a total of 510 video sequences with 299,468 frames. In the KLYOGA3D dataset [14], the joint structure of the 3D human skeleton is similar to that of KLHA3D-102 [100]; the only difference is that it has 39 yoga action classes, for a total of 39 video sequences with 173,060 frames.

Evaluation Metrics
To evaluate and compare the performance of HAR models based on a 3D human skeleton, unified measurements are essential. In machine learning, model evaluation is usually based on the accuracy metric: • Accuracy (Acc): Acc = (TP + TN)/(TP + TN + FP + FN), where TP (True Positive) is the number of positive samples correctly predicted as positive, TN (True Negative) is the number of negative samples correctly predicted as negative, FP (False Positive) is the number of negative samples wrongly predicted as positive, and FN (False Negative) is the number of positive samples wrongly predicted as negative.
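For multi-class activity recognition, this accuracy formula reduces to the fraction of samples whose predicted class matches the ground-truth label, as in this minimal sketch:

```python
def accuracy(y_true, y_pred):
    """Acc = (TP + TN) / (TP + TN + FP + FN); for multi-class
    recognition this is simply the fraction of correct predictions."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

# Four test samples, three predicted correctly
print(accuracy([0, 1, 2, 1], [0, 1, 1, 1]))  # 0.75
```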

Literature Results
The results of HAR based on the 3D human skeleton data of the NTU RGB+D dataset [11] are presented in Table 1. We present the results of 66 valuable studies from 2019 to March 2023, based on the Acc measure and two protocols: Cross-Subject and Cross-View [11]. The results of HAR based on the 3D human skeleton data of the NTU RGB+D 120 dataset [12] are presented in Table 2. We present the results of 37 valuable studies from 2019 to March 2023, based on the Acc measure and two protocols: Cross-Subject and Cross-Setting [12]. In Table 3, we show the number of FLOPs (floating-point operations) required for training and testing of some DNNs on the NTU RGB+D [11] and NTU RGB+D 120 [12] datasets; the larger the number of FLOPs, the slower the processing speed of the DNN. The results of HAR based on the 3D human skeleton data of the Kinetics-Skeleton dataset [98] are presented in Table 4. We present the results of 17 studies over the period from 2019 to March 2023, based on the Acc measure with the training and test split presented in [98]. Table 5 presents the results of HAR based on the 3D human skeleton data of the N-UCLA dataset [96]. We present the results of 17 studies over the period from 2019 to March 2023, based on the Acc measure with the training and test split presented. Table 6 presents the results of HAR based on the 3D human skeleton data of the J-HMDB dataset [94]; studies on this dataset are only available from 2019 onward. Table 7 presents the results of HAR based on the 3D human skeleton data of the SYSU 3D dataset [97]; studies on this dataset are only available from 2019 and 2020. Table 8 presents the results of HAR based on the 3D human skeleton data of the UTKinect-Action3D dataset [91].
For the UTKinect-Action3D dataset [91], the authors used the "Test Two" protocol of [123] for training and testing (2/3 of the samples for training, the rest for testing). Studies on HAR based on a 3D human skeleton on the UTKinect-Action3D dataset are only available in 2019 and 2020. Table 9 presents the results of HAR based on the 3D human skeleton data of the Florence 3D Actions dataset [93], which uses the same training and evaluation protocol as the UTKinect-Action3D dataset [91]. Table 10 presents the results of HAR based on the 3D human skeleton data of the SBU dataset [92]; the training and testing data are described in [92].

Challenges and Discussions
In Tables 1, 2, and 4-10, we presented the results of HAR based on a 3D human skeleton. The 3D human skeleton datasets are presented in Section 3.2; in each scene/frame in 3D space, only one 3D human skeleton is considered. The activities identified are simple, common everyday activities. The datasets presented do not contain data on the context of human activity. Therefore, the recognition models built on them serve only for benchmarking and cannot yet be applied in practice to build real applications.
These tables also show that the GCN/GNN-based approach has attracted the most research interest, because the structure of the human skeleton is naturally represented as a graph, and the temporal-spatial functions are important information for representing and extracting feature vectors. Tables 1 and 2 present the results on NTU RGB+D [11] and NTU RGB+D 120 [12], respectively. Although the evaluation method and measure are the same, the results on NTU RGB+D are much higher than on NTU RGB+D 120. It can be seen that the number of action classes greatly affects the results, as it increases the complexity of the activity recognition problem. The Kinetics-Skeleton dataset [98] has 400 human action classes, so the results in Table 4 are very low: the highest result is 52.3%, obtained by combining the 3D skeleton with RGB images, while results based only on 3D human skeletons reach just 30-40%.

Experiment
In this paper, we perform a comparative study on HAR based on the 3D human skeleton alone. This study was conducted on the two databases presented above, namely the KLHA3D-102 dataset [100] and the KLYOGA3D dataset [14]. We use DDNet [120] and PA-ResGCN [59] to experiment on two datasets.
We divide the KLHA3D-102 dataset [100] into five configurations for training and testing: Configuration 1 (KLHA3D-102_Conf. 1) has Sub #2, Sub #3, Sub #4, and Sub #5 of each action used for training, and Sub #1 for testing; Configuration 2 (KLHA3D-102_Conf. 2) has Sub #1, Sub #3, Sub #4, and Sub #5 of each action used for training and Sub #2 for testing; Configuration 3 (KLHA3D-102_Conf. 3) has Sub #1, Sub #2, Sub #4, and Sub #5 of each action used for training and Sub #3 for testing; Configuration 4 (KLHA3D-102_Conf. 4) has 15% of the first frames of each sub of each action used for testing and the remaining 85% of frames in each sub of each action for training; Configuration 5 (KLHA3D-102_Conf. 5) has 85% of the first frames of each sub of each action used for training and 15% of the remaining frames in each sub of each action for testing.
We also divide the KLYOGA3D dataset [14] into two configurations for training and testing: Configuration 1 (KLYOGA3D_Conf. 1) has 15% of the first frames of each sub of each action used for testing, and the remaining 85% of frames in each sub of each action for training; Configuration 2 (KLYOGA3D_Conf. 2) has 85% of the first frames of each sub of each action used for training and 15% of the remaining frames in each sub of each action for testing.
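The frame-based configurations above can be sketched as a simple per-sequence temporal split. This is an illustration of the protocol, not the authors' exact code:

```python
def frame_split(sequence, train_ratio=0.85, train_first=True):
    """Split one action sequence (list/array of frames) along time.

    train_first=True  -> first 85% of frames train, last 15% test
                         (KLHA3D-102_Conf. 5 / KLYOGA3D_Conf. 2).
    train_first=False -> first 15% of frames test, remaining 85% train
                         (KLHA3D-102_Conf. 4 / KLYOGA3D_Conf. 1).
    Returns (train_frames, test_frames)."""
    cut = int(len(sequence) * (train_ratio if train_first else 1 - train_ratio))
    first, rest = sequence[:cut], sequence[cut:]
    return (first, rest) if train_first else (rest, first)

frames = list(range(100))          # a 100-frame sequence
train, test = frame_split(frames)  # first 85 frames train, last 15 test
print(len(train), len(test))       # 85 15
```

The subject-held-out configurations (Conf. 1-3) instead split by subject index rather than by time.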
In this paper, we used a server with an NVIDIA GeForce RTX 2080 Ti 12 GB GPU for fine-tuning, training, and testing. The programs were written in the Python language (≥3.7 version) with the support of CUDA 11.2/cuDNN 8.1.0 libraries. In addition, there are a number of other libraries such as OpenCV, Numpy, Scipy, Pillow, Cython, Matplotlib, Scikit-image, Tensorflow ≥ 1.3.0, etc.

Results and Discussions
The results of HAR based on the skeleton data of the KLHA3D-102 and KLYOGA3D datasets are shown in Table 11. The results of GCN-based DNNs and Hybrid-DNNs are very low (DDNet evaluated on KLHA3D-102_Conf. 5 reaches 1.96%; PA-ResGCN evaluated on KLHA3D-102_Conf. 5 reaches 8.56%). Meanwhile, the CNN-based results are much higher than those of the GCN/GNN and Hybrid-DNN networks. This can be explained by several reasons. The model is trained using only the 3D human skeleton data, without other data such as the activity context. The human skeleton data are collected from many cameras in different viewing directions without being normalized. The skeleton data of these two datasets have 39 joints, a large number in 3D space, which makes the feature vector large; moreover, the KLHA3D-102 dataset has 102 action classes, many of which are similar, such as "drinking tea", "drinking water", and "eating", as illustrated by the human skeleton data of the KLHA3D-102 and KLYOGA3D datasets in Figure 1. The actions "drinking tea", "drinking water", and "eating" differ only in the coordinates of three joints ("14", "15", "16") or ("21", "22", "23"). Therefore, the share of features that distinguish these three action types is too small (3/39 = 1/13) compared to the full feature set extracted from the 39 joints of the 3D human skeleton. These actions all involve sitting and holding/grasping a bowl or cup; they differ only slightly in the hand activity. The computational complexity is reduced by at least one-third when switching from computing features in 3D space to 2D (image) space. This is also the reason Kumar et al. [14] chose to project the feature-vector representations of the KLHA3D-102 and KLYOGA3D datasets from 3D space into the image space.
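The one-third reduction argument can be seen with a minimal sketch of an orthographic 3D-to-2D projection: dropping one coordinate per joint shrinks the per-frame feature size from J*3 to J*2 (the actual maps of Kumar et al., e.g. QJVM, are more elaborate than this):

```python
import numpy as np

def project_to_image_plane(skeleton_3d: np.ndarray) -> np.ndarray:
    """Orthographic projection of a (T, J, 3) skeleton onto the xy
    image plane: simply drop the z coordinate. A minimal sketch of
    the 3D-to-2D idea, not a full feature-map construction."""
    return skeleton_3d[..., :2]

skel = np.random.rand(10, 39, 3)   # 10 frames, 39 joints as in KLHA3D-102
flat_3d = skel.reshape(10, -1)     # 39 * 3 = 117 features per frame
flat_2d = project_to_image_plane(skel).reshape(10, -1)  # 39 * 2 = 78
print(flat_3d.shape[1], flat_2d.shape[1])  # 117 78
```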
All of this makes the features extracted from the 3D skeleton, based on the skeleton graph and the space-time function, poorly discriminative, so the activity recognition results based on GCNs and Hybrid-DNNs are lower than those of CNNs. CNNs often project representations (joints, coordinates, temporal-spatial information) of the skeleton into the image space. This yields better discrimination between activities, as shown by the differences between the feature vectors of 10 joints in Figure 5 of Kumar et al.'s research [14]: the top of that figure is a representation of JADM (Joint Angular Displacement Maps) [130], the middle of JDM (Joint Distance Maps) [131], and the bottom of QJVM (Quad Joint Volume Maps) [14]. Table 12 presents the processing time for recognizing human activity on the KLHA3D-102 dataset [100]; the computation of DDNet [120] is 100 times faster than that of PA-ResGCN [59]. Figure 14 shows the results on the training and testing sets of the KLHA3D-102 and KLYOGA3D datasets using DDNet [120]. On the KLHA3D-102 dataset, DDNet achieves more than 80% on the training set but just over 50% on the testing set. On the KLYOGA3D dataset, the result is only a little over 50% on the training set and just over 20% on the testing set. This shows that the efficiency of learning features extracted directly from the 3D human pose on the KLYOGA3D dataset is very low. When training DDNet on the KLHA3D-102 dataset, the result on the training set exceeds 90%, while the result on the test set is about 50%, indicating that the trained DDNet model is overfitting. Figure 15 illustrates the 3D human skeletons of the actions "drinking tea" and "drinking water". The skeleton data of the two actions are almost identical; only the activity around the head differs. This is a huge challenge for building a discriminative model of human actions in the KLHA3D-102 dataset.
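The core quantity behind the JDM representation mentioned above is the per-frame matrix of pairwise joint distances. A minimal sketch follows; real JDMs [131] additionally color-encode these distances into an image for the CNN:

```python
import numpy as np

def joint_distance_map(skeleton: np.ndarray) -> np.ndarray:
    """Per-frame pairwise Euclidean joint distances.

    skeleton: (T, J, 3) array of joint coordinates.
    Returns (T, J, J) symmetric distance matrices, one per frame."""
    # Broadcast to (T, J, J, 3) pairwise coordinate differences,
    # then reduce over the xyz axis.
    diff = skeleton[:, :, None, :] - skeleton[:, None, :, :]
    return np.linalg.norm(diff, axis=-1)

skel = np.random.rand(5, 39, 3)  # 5 frames, 39 joints
jdm = joint_distance_map(skel)
print(jdm.shape)  # (5, 39, 39)
```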

Conclusions and Future Works
In this paper, we have carried out a full survey of methods using deep learning to recognize human activities from 3D human skeleton input data. Our survey collected about 250 results from more than 70 different studies on deep-learning-based HAR under four types of networks: RNN-based, CNN-based, GCN/GNN-based, and Hybrid-DNN-based. The results of HAR are reported in terms of methods and processing time. We also discuss the challenges of HAR in terms of data dimensions and the insufficient information for distinguishing actions with a limited number of reference points. At the same time, we have carried out comparative, analytical, and discussion studies based on fine-tuning two DNN methods (DDNet, PA-ResGCN) for HAR on the KLHA3D-102 and KLYOGA3D datasets. Although the training set accounts for up to 85% of the data and the test set for 15%, the recognition results are still very low (on KLHA3D-102_Conf. 5, 1.96% for DDNet and 8.56% for PA-ResGCN). This shows that choosing a method for the HAR problem is very important: for datasets with a large number of joints in the 3D human skeleton, a method that projects the 3D human skeleton into the image space and extracts features there should be chosen.
In the near future, we will combine many types of features extracted from the 3D human skeleton in a deep learning model, or construct new 2D feature sets, to achieve higher HAR results. We will also propose a unified end-to-end model for detecting, segmenting, estimating the 3D human pose, and recognizing human activities in training and learning exercises such as gym or yoga, for training and health protection. Figure 16 illustrates an application that detects, segments, estimates the 3D human pose, recognizes the activity, and calculates the total motion of the joints. From there, it is possible to calculate the total energy consumed during exercise and to make a training plan that helps students protect their health, avoiding exercising too much or too little. This is a very practical application in martial arts teaching, sports analysis, training, and health protection.

Conflicts of Interest:
This paper presents our own research and is not related to any organization or individual.