A Survey on 3D Skeleton-Based Action Recognition Using Learning Methods

Three-dimensional skeleton-based action recognition (3D SAR) has gained considerable attention within the computer vision community, owing to the inherent advantages of skeleton data. As a result, a plethora of impressive works, including those based on conventional handcrafted features and on learned feature extraction, have been conducted over the years. However, prior surveys on action recognition have primarily focused on video or red-green-blue (RGB) data-dominated approaches, with limited coverage of skeleton data. Furthermore, despite the extensive application of deep learning methods in this field, there has been a notable absence of research providing an introductory or comprehensive review from the perspective of deep learning architectures. To address these limitations, this survey first underscores the importance of action recognition and the significance of three-dimensional (3D) skeleton data as a valuable modality. Subsequently, we provide a comprehensive introduction to mainstream action recognition techniques based on four fundamental deep architectures, i.e., recurrent neural networks, convolutional neural networks, graph convolutional networks, and Transformers. Methods under each architecture are then presented in a data-driven manner with detailed discussion. Finally, we offer insights into the largest 3D skeleton dataset to date, NTU-RGB+D, and its newer edition, NTU-RGB+D 120, along with an overview of several top-performing algorithms on these datasets. To the best of our knowledge, this survey represents the first comprehensive discussion of deep learning-based action recognition using 3D skeleton data.


Introduction
Action analysis, a pivotal and vigorously researched topic in the field of computer vision, has been under investigation for several decades [102,116,117,143]. The ability to recognize actions is of paramount importance, as it enables us to understand how humans interact with their surroundings and express their emotions [60,70]. This recognition can be applied across a wide range of domains, including intelligent surveillance systems, human-computer interaction, virtual reality, and robotics [11,136,137]. In recent years, the field of skeleton-based action recognition has made significant strides, surpassing conventional handcrafted methods. This progress has been chiefly driven by substantial advancements in deep learning methodologies [61,71,73,83,84,97,99,118,127,145].
Traditionally, action recognition has relied on various data modalities, such as RGB image sequences [28,58,62,98,101], depth image sequences [1,123], videos, or a fusion of these modalities (e.g., RGB combined with optical flow) [27,30,34,93,110]. These approaches have yielded impressive results through various techniques. However, compared to skeleton data, which offers a detailed topological representation of the human body through joints and bones, these alternative modalities often prove computationally intensive and less robust when confronted with complex backgrounds and variable conditions, including variations in body scales, viewpoints, and motion speeds [38,59].
Furthermore, the availability of sensors like the Microsoft Kinect [144] and of advanced human pose estimation algorithms [8,20,129,146] has facilitated the acquisition of accurate 3D skeleton data [92]. Figure 1 provides a visual representation of human skeleton data; in this case, 25 body joints are captured for a given human body. Skeleton sequences possess several advantages over other modalities, characterized by three notable features: 1) intra-frame spatial information, where strong correlations exist between joints and their adjacent nodes, enabling the extraction of rich structural information; 2) inter-frame temporal information, which captures strong and clear temporal correlations of each body joint across frames, enhancing the potential for action recognition; and 3) a co-occurrence relationship between the spatial and temporal domains when joints and bones are considered jointly, offering a holistic perspective. These unique attributes have catalyzed substantial research in human action recognition and detection, and the integration of skeleton data is anticipated to pervade ever more diverse applications in the field.
Recognizing human actions from skeleton sequences hinges predominantly on the temporal dimension, turning the task into a joint spatial and temporal information modeling challenge. As a result, traditional skeleton-based methods focus on extracting motion patterns from these sequences, prompting extensive research into handcrafted features [34,36,105,113,133,149]. These features often capture the relative 3D rotations and translations among different joints or body parts [71,104]. However, it has become evident that handcrafted features perform well only on specific datasets [111], highlighting that features tailored for one dataset may not transfer to others. This issue hampers the generalization and broader application of action recognition algorithms.
With the remarkable development and outstanding performance of deep learning methods in various computer vision tasks, such as image classification [22,42] and object detection [9,152], the application of deep learning to skeleton data for action recognition has gained prominence. Nowadays, deep learning techniques based on Recurrent Neural Networks (RNNs) [48], Convolutional Neural Networks (CNNs) [17], Graph Convolutional Networks (GCNs), and Transformers have emerged in this field [90,126]. Figure 2 provides an overview of the general pipeline for 3D skeleton-based action recognition (3D SAR) using deep learning, starting from raw RGB sequences or videos and culminating in action category prediction. RNN-based methods treat skeleton sequences as natural time series, using joint coordinates as sequential vectors, which aligns well with the RNN's capacity for processing time series information. To better learn the temporal context of skeleton sequences, variants like Long Short-Term Memory (LSTM) and the Gated Recurrent Unit (GRU) have been employed. Meanwhile, CNNs complement RNN-based techniques, as they excel at capturing the spatial cues that RNNs may miss. Additionally, a relatively recent family of approaches, GCNs, has gained attention for its ability to model skeleton data in its natural topological graph structure, with joints and bones as vertices and edges, respectively, offering advantages over alternative formats such as images or sequences. Transformer-based methods [2,79,109,132,150,153] capture the spatial-temporal relations of the input 3D skeleton data mainly through their core multi-head self-attention (MSA) mechanism.
All four kinds of deep learning-based architectures have achieved unprecedented performance, yet most review works focus either on traditional techniques or on deep learning methods restricted to RGB or RGB-D data. Poppe [81] first addressed the basic challenges and characteristics of this domain and then gave a detailed illustration of basic action classification methods covering direct classification and temporal state-space models. Weinland et al. [119] presented an overall overview of action representation in the spatial and temporal domains only. Though the methods mentioned above provide some inspiration that may be used for input data pre-processing, neither skeleton sequences nor deep learning strategies were taken into account. More recently, Wu et al. [121] and Herath et al. [32] offered summaries of deep learning-based video classification and captioning tasks, in which the fundamental structures of CNNs and RNNs were introduced; the latter also clarified common deep architectures and provided quantitative analysis for action recognition. To the best of our knowledge, [75] is the first recent work to give an in-depth study of 3D SAR, covering this issue from action representation to classification methods while also describing commonly used datasets such as UCF, MHAD, and MSR Daily Activity 3D [25,72,77,107]; however, it does not cover the emerging GCN-based methods. Finally, [111] proposed a review of Kinect-dataset-based action recognition algorithms, organizing a thorough comparison of techniques across input data types including RGB, depth, RGB+depth, and skeleton sequences, and [96] presented an overview of action recognition across all data modalities but without covering Transformer-based methods. In addition, all the works mentioned above ignore the differences and motivations among CNN-based, RNN-based, GCN-based, and Transformer-based methods, especially when 3D skeleton sequences are taken into account.
To address these issues comprehensively, this survey provides a detailed summary of 3D SAR employing four fundamental deep learning architectures: RNNs, CNNs, GCNs, and Transformers. Additionally, we delve into the motivations behind the choice of these models and offer insights into potential future directions for research in this field.
In summary, our study encompasses four key contributions: 1) we underscore the importance of action recognition and the significance of 3D skeleton data as a modality; 2) we comprehensively introduce mainstream techniques organized by four fundamental deep architectures (RNNs, CNNs, GCNs, and Transformers); 3) we present and discuss the methods of each architecture in a data-driven manner; and 4) we review the NTU-RGB+D and NTU-RGB+D 120 datasets together with the top-performing algorithms on them.

3D SAR With Deep Learning
While existing surveys have offered comprehensive comparisons of action recognition techniques based on RGB or skeleton data, they often lack a detailed examination from the perspective of neural networks. To bridge this gap, we first give a concise introduction to the fundamental properties of each architecture (Section 2.1). Our survey then provides an exhaustive discussion and comparison of RNN-based (Section 2.2), CNN-based (Section 2.3), GCN-based (Section 2.4), and Transformer-based (Section 2.5) methods for 3D skeleton-based action recognition. We explore these methods in depth, highlighting their strengths and weaknesses, and introduce several of the latest related works as case studies, focusing on specific limitations or classic spatial-temporal modeling challenges associated with these neural network models.

Preliminaries: Basic Properties of RNNs, CNNs, GCNs, and Transformers
Before delving into the specifics of each method, we provide a brief overview of the fundamental architectures, outlining their respective advantages and disadvantages along with coarse selection criteria under the 3D SAR setting.
RNNs. RNNs are ideal for capturing temporal dependencies in sequences of joint movements over time and are well suited to modeling action sequences thanks to their ability to retain temporal information. However, RNNs struggle with long-term dependencies, potentially missing complex relationships in lengthy sequences, and their sequential processing is computationally inefficient, leading to longer training times on large-scale datasets.

CNNs. CNNs are effective at capturing spatial patterns from joint coordinates, recognizing spatial features within individual frames of the 3D skeleton data, and are well suited to local spatial relationships among joints. However, CNNs are limited in capturing the temporal evolution of sequences, potentially missing the temporal dynamics crucial for action recognition.

GCNs. GCNs are designed to handle graph-structured data such as skeletal joint connections, enabling the learning of relationships between joints and their connectivity while integrating spatial and temporal information. However, GCNs can be sensitive to noisy or irregular connections among joints, potentially impacting recognition accuracy, particularly for complex actions.

Transformers. Transformers are efficient at capturing long-range dependencies without the vanishing/exploding gradient issue and are versatile in handling multiple modalities and learning global relationships. However, they are computationally intensive due to the attention mechanism, potentially requiring substantial computational resources; moreover, compared with RNNs, they lack a built-in notion of sequential locality.

RNN-Based Methods
Recursive connections within the RNN structure are established by feeding the output of the previous time step as the input to the current time step, as demonstrated in prior work [139]. This approach is known to be effective for processing sequential data. Building on the standard RNN, variants such as the LSTM and GRU were introduced to address its limitations, namely gradient-related issues and the difficulty of modeling long-term temporal dependencies. Related RNN-based works can be grouped into three aspects: spatial-temporal modeling, network structure design, and data-driven designs.
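Before turning to these aspects, the following minimal sketch (our illustration, not code from any surveyed paper) shows the baseline RNN treatment of skeleton data: each frame's joints are flattened into a vector, and the resulting time series is classified with an LSTM. The joint count, hidden size, and class count are placeholder values chosen to match NTU-RGB+D conventions.

```python
# A minimal sketch of an LSTM classifier over 3D skeleton sequences,
# assuming input of shape (batch, T frames, J joints, 3 coordinates).
import torch
import torch.nn as nn

class SkeletonLSTM(nn.Module):
    def __init__(self, num_joints=25, hidden=128, num_classes=60):
        super().__init__()
        # Each frame is flattened into a (J * 3)-dimensional vector,
        # so the skeleton sequence becomes a plain time series.
        self.lstm = nn.LSTM(num_joints * 3, hidden, num_layers=2,
                            batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, x):                  # x: (B, T, J, 3)
        b, t, j, c = x.shape
        x = x.view(b, t, j * c)            # (B, T, J*3) per-frame vectors
        out, _ = self.lstm(x)              # hidden states for every frame
        return self.fc(out[:, -1])         # classify from the last state

logits = SkeletonLSTM()(torch.randn(4, 100, 25, 3))  # -> (4, 60)
```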
From the first aspect, spatial-temporal modeling can be seen as the guiding principle of action recognition tasks. Due to the weak spatial modeling ability of RNN-based architectures, some related methods generally could not attain competitive results [54,120,147]. Wang et al. [106] proposed a novel two-stream RNN architecture to model both the temporal dynamics and spatial configurations of skeleton data; Figure 3 shows the framework of their work, in which an exchange of the skeleton axes serves as data-level pre-processing for the spatially dominant stream. Unlike [106], Liu et al. [63] designed a traversal method over a given skeleton sequence to acquire the hidden relationships of both domains. Compared with the general approach of arranging joints in a simple chain, which ignores the kinetic dependency relations between adjacent joints, their tree-structure-based traversal avoids adding false connections between body joints whose relations are not strong enough. An LSTM with a trust gate then treats the input discriminatively: if the tree-structured input unit is reliable, the memory cell is updated by importing the latent spatial information of the input. Inspired by CNNs, which are extremely well suited to spatial modeling, Li et al. [49] incorporated an attention RNN with a CNN model to enhance spatial-temporal modeling. They first introduced a temporal attention module within a residual learning module, recalibrating temporal attention across the frames of a skeleton sequence, and then applied a spatial-temporal convolutional module that treats the calibrated joint sequences as images. Furthermore, Lin et al. [50] employed an attention recurrent relation LSTM network, combining a recurrent relation network for spatial features with a multi-layer LSTM to capture temporal features within skeleton sequences.
The second aspect involves the network structure, serving as a route to address the limitations of standard RNNs. While RNNs are inherently suitable for sequence data, they often suffer from the well-known gradient exploding and vanishing problems. Although the LSTM and GRU alleviate these issues to some extent, their hyperbolic tangent and sigmoid activation functions can still cause gradient decay across layers. In response, new types of RNN architectures have been proposed [5,47,52]. Li et al. [52] introduced the independently recurrent neural network (IndRNN), designed to address gradient exploding and vanishing, making it feasible and more robust to construct longer and deeper RNNs for high-level semantic feature learning. In the IndRNN structure, neurons within a layer are independent of each other, enabling the processing of much longer sequences. This modification is not limited to skeleton-based action recognition and also finds applications in other domains, such as language modeling.
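The independence of neurons within a layer can be made concrete with a short sketch. Below is our illustrative single-layer IndRNN following the recurrence h_t = ReLU(W x_t + u ⊙ h_{t-1}); the shapes and initialization are assumptions, not the configuration of [52].

```python
# A minimal IndRNN layer sketch: the recurrent weight u is a *vector*,
# so each neuron only sees its own previous state, unlike the dense
# recurrent matrix of a standard RNN. This eases gradient flow.
import torch
import torch.nn as nn

class IndRNNLayer(nn.Module):
    def __init__(self, input_size, hidden_size):
        super().__init__()
        self.w = nn.Linear(input_size, hidden_size)       # input weights W
        self.u = nn.Parameter(torch.rand(hidden_size))    # per-neuron recurrent weight u

    def forward(self, x):                                 # x: (B, T, input_size)
        h = x.new_zeros(x.size(0), self.u.numel())
        outputs = []
        for t in range(x.size(1)):
            # Element-wise (Hadamard) recurrence keeps neurons independent.
            h = torch.relu(self.w(x[:, t]) + self.u * h)
            outputs.append(h)
        return torch.stack(outputs, dim=1)                # (B, T, hidden_size)
```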
Finally, the third aspect is data-driven design. Considering that not all joints are informative for analyzing an action, [65] adds global context-aware attention to LSTM networks to selectively focus on the informative joints in a skeleton sequence. Figure 4 visualizes this method: the more informative joints are marked by red circles, indicating that those joints matter more for the given action. In addition, because the skeletons provided by datasets or depth sensors are imperfect, which can degrade action recognition, [44] first transforms skeletons into another coordinate system for robustness to scale, rotation, and translation, and then extracts salient motion features from the transformed data instead of feeding the raw skeletons to the LSTM; Figure 4(b) shows this feature representation process. Numerous valuable works have utilized RNN-based methods to address challenges such as large viewpoint changes and the relationships among joints within a single skeleton frame. However, in certain modeling aspects, RNN-based methods may exhibit limitations compared to CNN-based approaches. In the following section, we turn to an intriguing question: how do CNN-based methods perform temporal modeling, and how can they strike the right balance between spatial and temporal information in action recognition?

CNN-Based Methods
While convolutional neural networks (CNNs) offer efficient and effective high-level semantic cue learning, they are primarily tailored for regular image tasks. Action recognition from skeleton sequences, by contrast, presents a distinct challenge due to its inherently time-dependent nature. Achieving the right balance and maximizing the utilization of both spatial and temporal information within a CNN-based architecture remains a challenging endeavor.
Typically, from the spatial-temporal modeling perspective, most CNN-based methods explore the representation of 3D skeleton sequences. Specifically, to meet the input requirements of CNNs, the skeleton sequence is transformed from a vector sequence into a pseudo-image. However, a suitable representation that effectively combines both spatial and temporal information is hard to achieve, so many researchers encode the skeleton joints into multiple 2D pseudo-images, which are then fed into CNNs to learn informative features [21,125]. Wang et al. [112] proposed Joint Trajectory Maps (JTM), which represent the spatial configuration and dynamics of joint trajectories as three texture images through color encoding. However, this kind of method is somewhat complicated and also loses importance during the mapping procedure. To tackle this shortcoming, Li et al. [4] used a translation-scale invariant image mapping strategy that first divides the human skeleton joints in each frame into five main parts according to the human physical structure and then maps those parts to 2D form, so that the skeleton image carries both temporal and spatial information. Yet, although performance improved, there is no reason to treat skeleton joints as isolated points, since intimate connections exist across the body: when waving the hands, for example, not only the joints of the hand but also other parts such as the shoulders and legs should be taken into account. Li et al. [55] proposed a shape-motion representation from geometric algebra, which addressed the importance of both joints and bones and fully utilized the information provided by the skeleton sequence. Similarly, [71] used enhanced skeleton visualization to represent the skeleton data, and Caetano et al. [7] proposed a representation named SkeleMotion that encodes the temporal dynamics by explicitly computing the magnitude and orientation values of the skeleton joints. Figure 5(a) shows the shape-motion representation of [55], while Figure 5(b) illustrates the SkeleMotion workflow. Following SkeleMotion, [6] uses the same framework but builds the skeleton image representation on a tree structure and reference joints. Commonly, CNN-based methods represent a skeleton sequence as an image by encoding temporal dynamics and skeleton joints as rows and columns, respectively; this simplistic approach may limit the model's ability to capture co-occurrence features, as it considers only the neighboring joints within the convolutional kernel and may overlook latent correlations involving all joints, so CNNs might fail to learn the corresponding useful features. In response, Li et al. [10] introduced an end-to-end framework that learns co-occurrence features through a hierarchical methodology, gradually aggregating different levels of contextual information: point-level information is encoded independently first and then assembled into semantic representations in both the temporal and spatial domains.
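To make the pseudo-image idea concrete, here is a minimal sketch of the common encoding described above; the normalization scheme and the tiny CNN are our illustrative assumptions, not any particular surveyed method.

```python
# A skeleton sequence of shape (T, J, 3) is arranged as a 3-channel
# image with joints as rows and frames as columns, then fed to an
# ordinary 2D CNN.
import torch
import torch.nn as nn

def to_pseudo_image(seq: torch.Tensor) -> torch.Tensor:
    """seq: (T, J, 3) joint coordinates -> (3, J, T) image-like tensor."""
    img = seq.permute(2, 1, 0)               # channels=xyz, rows=joints, cols=frames
    lo, hi = img.min(), img.max()
    return (img - lo) / (hi - lo + 1e-6)     # scale coordinates to [0, 1]

cnn = nn.Sequential(                         # tiny CNN over the pseudo-image
    nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 60),                       # e.g., 60 NTU-RGB+D classes
)
logits = cnn(to_pseudo_image(torch.randn(100, 25, 3)).unsqueeze(0))
```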
Besides explorations into the representation of 3D skeleton sequences, other problems also arise in CNN-based techniques. For example, to balance model size against inference efficiency, DD-Net [127] models double features and double motion via CNNs for efficient solutions. Kim et al. [95] proposed the temporal CNN (TCN) for modeling interpretable spatio-temporal cues [43], whereby the point-level feature of each joint is learned. In addition, heavier two-stream and three-stream CNN-based models have been proposed to improve representation learning for spatial-temporal modeling [85]. Skeleton-based action recognition using CNNs thus remains an open problem for researchers to dig into.
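The temporal-convolution idea can be sketched as follows; this is our illustration with assumed kernel sizes and channel widths, not the configuration of the TCN in [95]. Each frame's flattened joints form the channel dimension, and 1D convolutions slide along time.

```python
# A minimal TCN-style sketch: temporal 1D convolutions over flattened
# skeleton frames, followed by global pooling and a linear classifier.
import torch
import torch.nn as nn

tcn = nn.Sequential(
    nn.Conv1d(25 * 3, 64, kernel_size=9, padding=4),  # 9-frame temporal receptive field
    nn.ReLU(),
    nn.Conv1d(64, 64, kernel_size=9, padding=4),
    nn.ReLU(),
    nn.AdaptiveAvgPool1d(1), nn.Flatten(),
    nn.Linear(64, 60),
)
x = torch.randn(4, 100, 25, 3)                # (B, T, J, 3)
x = x.flatten(2).transpose(1, 2)              # (B, J*3, T) for Conv1d
logits = tcn(x)                               # -> (4, 60)
```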

GCN-Based Methods
Human 3D-skeleton data has an inherent topological graph structure, distinct from the sequential vectors or pseudo-images used in RNN-based or CNN-based methods. Drawing on this insight, the Graph Convolutional Network (GCN) has recently been adopted frequently for this task, owing to its effective representation of graph-structured data. Generally, two kinds of graph-related neural networks exist, namely graph recurrent neural networks and graph convolutional networks (GCNs); in this survey, we mainly pay attention to the latter. This focus has yielded compelling results, as evidenced by the performance of GCN-based methods on the leaderboards. Indeed, merely encoding the skeleton sequence into a vector or 2D grid fails to fully capture the interdependence among correlated joints, whereas GCNs adapt naturally to structures such as the skeleton graph. Nonetheless, the principal challenge of GCN-based approaches remains the handling of skeleton data, particularly structuring the original data into a coherent graph format. Yan et al. [126] first presented a novel model, the spatial-temporal graph convolutional network (ST-GCN), for skeleton-based action recognition. The approach first constructs a spatial-temporal graph in which the joints serve as graph vertices, while the natural connections in the human body structure and the correspondences across time serve as graph edges. The higher-level feature maps computed by the ST-GCN on this graph are then classified into action categories by a standard softmax classifier. This work has notably directed attention towards employing GCNs for skeleton-based action recognition, resulting in a surge of recent related research [15,18,24,88,142,148].
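The following is a minimal sketch in the spirit of ST-GCN [126]: joint features are mixed along a normalized skeleton adjacency, then a temporal convolution mixes along frames. The adjacency here is a placeholder (the real model uses the bone links of the 25-joint skeleton and additionally partitions neighbors into subsets), so this is an illustration, not the authors' implementation.

```python
# One spatial-graph-convolution + temporal-convolution block.
import torch
import torch.nn as nn

class SpatialTemporalGC(nn.Module):
    def __init__(self, in_ch, out_ch, A: torch.Tensor):
        super().__init__()
        deg = A.sum(1)                                        # symmetric normalization
        A_hat = A / torch.sqrt(deg[:, None] * deg[None, :])
        self.register_buffer("A", A_hat)
        self.theta = nn.Conv2d(in_ch, out_ch, 1)              # per-joint feature transform
        self.tconv = nn.Conv2d(out_ch, out_ch, (9, 1), padding=(4, 0))

    def forward(self, x):                                     # x: (B, C, T, J)
        x = torch.einsum("bctj,jk->bctk", self.theta(x), self.A)  # spatial mixing
        return self.tconv(x)                                      # temporal mixing

J = 25
A = torch.eye(J)                       # placeholder: self-loops only; add bone links in practice
layer = SpatialTemporalGC(3, 64, A)
out = layer(torch.randn(2, 3, 100, J))     # -> (2, 64, 100, 25)
```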
Building upon GCNs, two main aspects are commonly explored: more representative constructions of the skeleton data, and more effective designs of the GCN-based model itself [46,51].
From the first aspect, [51] proposed the Actional-Structural Graph Convolutional Network (AS-GCN), which not only recognizes a person's action but also uses a multi-task learning strategy to predict the subject's next possible pose. The graph constructed in this work captures richer dependencies among joints through two modules called Actional Links and Structural Links; Figure 6 shows the feature learning and the generalized skeleton graph of AS-GCN. The multi-task learning strategy used here may be a promising direction, because the target task can be improved by the other task acting as a complement. To capture and enhance richer feature representations, Shi et al. [88] introduced 2s-AGCN, which incorporates an adaptive topology graph that is updated automatically via the network's backpropagation, effectively enhancing the characterization of joint connection strengths. Liu et al. [74] proposed MS-G3D, which constructs a unified spatial-temporal graph: this large graph is composed of several subgraphs, each representing the spatial relationships of the joints in a certain frame. This form of adjacency matrix can effectively model the relationships between different joints in different frames. Similarly, many follow-up methods have been proposed for constructing more representative graphs [31,45,115].
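A minimal sketch of such a unified spatial-temporal adjacency, in the spirit of MS-G3D [74], is shown below: for a window of tau frames, a (tau*J) x (tau*J) block matrix connects every joint to its spatial neighbors in all frames of the window, so one graph convolution can mix information across frames. The tiling scheme is our simplified assumption of the idea, not the paper's exact construction.

```python
# Build a cross-frame block adjacency from a single-frame skeleton graph.
import torch

def unified_st_adjacency(A: torch.Tensor, tau: int) -> torch.Tensor:
    """A: (J, J) single-frame adjacency -> (tau*J, tau*J) window adjacency."""
    J = A.size(0)
    A_loop = A + torch.eye(J)             # include self-connections
    big = A_loop.repeat(tau, tau)         # tile: every frame pair shares the links
    return (big > 0).float()

A = torch.eye(25)                         # placeholder single-frame skeleton graph
A_big = unified_st_adjacency(A, tau=3)    # -> (75, 75) cross-frame graph
```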
From the second aspect, traditional GCNs operate as plain feed-forward networks, limiting low-level layers' access to the semantic information held in higher-level layers. To address this, Yang et al. [128] introduced the Feedback Graph Convolutional Network (FGCN), aimed at incrementally acquiring global spatial-temporal features. Departing from direct manipulation of the complete skeleton sequence, FGCN adopts a multi-stage temporal sampling strategy to sparsely extract a sequence of input clips from the skeleton data. Furthermore, Bian et al. [3] introduced a structural knowledge distillation scheme to mitigate the accuracy loss caused by low-quality data, thereby enhancing the model's resilience to incomplete skeleton sequences. Fang et al. [26] presented the spatial-temporal slow-fast graph convolutional network (STSF-GCN), which, reminiscent of MS-G3D, treats skeleton data as a unified spatial-temporal topology.
From the preceding introduction and discussion, it is evident that the predominant concern revolves around data-driven approaches that seek to uncover the latent information within 3D skeleton sequence data. In the realm of GCN-based action recognition, the central question persists: how do we extract this latent information? This remains an ongoing challenge. Particularly noteworthy is the inherent temporal-spatial correlation within the skeleton data itself; the optimal utilization of these two aspects warrants further exploration, and there remains substantial potential for enhancing their effectiveness through deeper investigation and innovative strategies.

Transformer-Based Methods
Transformers [103] have demonstrated their overwhelming power on a broad range of language tasks (e.g., text classification, machine translation, and question answering [41,103]), and the vision community has followed closely, extending them to vision tasks such as image classification [22,82,100], object detection [9,152], segmentation [131], image restoration [12,56], and point cloud registration [35,76,114]. The emergence of Transformer algorithms marks a pivotal shift in this field as well: Transformer-based methods are gradually challenging the dominance of GCN methods, showcasing promising advancements in computational efficiency and accuracy. Upon analysis, we believe that Transformer-based approaches retain robust potential and are poised to become the mainstream technique in the future.
The core module of the Transformer, multi-head self-attention (MSA) [22,103], aggregates sequential tokens with normalized attention as

$$Z = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$

where $Q$, $K$, and $V$ are the query, key, and value matrices, respectively, $d$ is the dimension of the query and key, and $z_j$, the $j$-th row of $Z$, is the $j$-th output token. This step represents the context-relation computation and update of the overall 3D skeleton features. Building upon the Transformer's MSA, many Transformer-architecture-based solutions to the 3D SAR problem have been proposed. In particular, Cho et al. [19] proposed the Self-Attention Network (SAN), which relies entirely on the self-attention mechanism to model spatial-temporal correlations. Shi et al. [89] proposed a decoupled spatial-temporal attention network (DSTA-Net) that contains spatial-temporal attention decoupling, decoupled position encoding, and global spatial regularization; DSTA-Net decouples the skeleton data into four streams, namely a spatial-temporal stream, a spatial stream, a slow-temporal stream, and a fast-temporal stream, each focusing on a particular aspect of the action. Plizzari et al. [80] proposed a Spatial-Temporal Transformer network (ST-TR) in which a spatial self-attention module and a temporal self-attention module capture, respectively, the correlations between different nodes within a frame and the dynamics of each node across frames. To handle action sequences of varying lengths proficiently, Ibh et al. [37] proposed TemPose, which leaves the padded temporal and interaction tokens out of the attention map; TemPose additionally encodes the positions of the player and of the badminton ball to jointly predict the action class.
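The attention formula above can be sketched directly in code. The following is our minimal single-head illustration applied to skeleton tokens (e.g., one token per joint or per frame); real MSA runs several such heads in parallel and concatenates their outputs.

```python
# Scaled dot-product self-attention over skeleton tokens.
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)      # joint projection to Q, K, V

    def forward(self, x):                        # x: (B, N tokens, dim)
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) / (q.size(-1) ** 0.5)
        return attn.softmax(dim=-1) @ v          # weighted sum of value tokens

tokens = torch.randn(2, 25, 64)                 # e.g., 25 joint tokens of width 64
out = SelfAttention(64)(tokens)                 # -> (2, 25, 64)
```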
The Transformer-based approach effectively mitigates the issue of concentrating solely on local information and excels at capturing dependencies over long sequences. When applied to skeleton-based human behavior recognition, the Transformer architecture is adept at capturing temporal relationships; however, its efficacy in modeling spatial relationships remains constrained by limitations in capturing and encoding the intricate high-dimensional semantic information inherent in skeleton data [122,151]. Meanwhile, numerous approaches have emerged that combine the Transformer with GCNs or CNNs to form hybrid architectures. By uniting the Transformer's capabilities with the specialized strengths of RNNs, CNNs, or GCNs, these hybrid models aim to achieve a more comprehensive and powerful framework for diverse tasks [29,134,138,150].
Datasets and Performance

The NTU-RGB+D dataset, introduced in 2016, stands as a significant resource, comprising 56,880 video samples gathered with the Microsoft Kinect v2. It holds a prominent position as one of the largest collections available for skeleton-based action recognition, furnishing the 3D spatial coordinates of 25 joints for each human depicted in an action, as illustrated in Figure 1(a). Two evaluation protocols are suggested for assessing proposed methods: Cross-Subject and Cross-View. The Cross-Subject setting splits the 40 subjects into training and evaluation groups, yielding 40,320 training and 16,560 evaluation samples. The Cross-View setting comprises 37,920 training samples captured by cameras 2 and 3, while the 18,960 samples from camera 1 are used for evaluation. Recently, an extended version of the original dataset, known as NTU-RGB+D 120, has been introduced. This extended dataset comprises 120 action classes and a total of 114,480 skeleton sequences, significantly expanding the scope; additionally, the number of viewpoints has increased to 155.
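As an illustration of the Cross-Subject protocol, the sketch below partitions samples by performer ID, assuming the official NTU file naming SsssCcccPpppRrrrAaaa (where P marks the performer) and the training-subject list commonly cited from the original dataset paper; both are assumptions to verify against the dataset release.

```python
# Cross-Subject split on NTU-RGB+D by parsing the performer ID
# from a sample's file name.
TRAIN_SUBJECTS = {1, 2, 4, 5, 8, 9, 13, 14, 15, 16, 17, 18, 19,
                  25, 27, 28, 31, 34, 35, 38}

def is_training_sample(filename: str) -> bool:
    """E.g., 'S001C002P003R002A013' -> subject 3 -> evaluation set."""
    subject = int(filename[filename.index("P") + 1:][:3])
    return subject in TRAIN_SUBJECTS
```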
In Table 1 and Table 2, we present the performance of recent skeleton-based techniques on the NTU-RGB+D and NTU-RGB+D 120 datasets, respectively. Note that for NTU-RGB+D, 'CS' stands for Cross-Subject and 'CV' stands for Cross-View. For NTU-RGB+D 120, there are two settings, i.e., Cross-Subject (C-Subject) and Cross-Setup (C-Setup).
Based on the performance observed on these two datasets, it is evident that existing algorithms have achieved impressive results on the original NTU-RGB+D dataset, whereas the newer NTU-RGB+D 120 dataset still leaves room for improvement. Transformer [103]-based methods show promising performance on both datasets, and hybrids of the Transformer with other architectures further boost the overall performance of 3D SAR.

Discussion
Considering the performance and attributes of the aforementioned deep architectures, several critical points warrant further discussion concerning the criteria for architecture selection. In terms of accuracy and robustness, GCNs demonstrate potential excellence by adeptly capturing the spatial and temporal relationships among joints; RNNs exhibit proficiency in capturing temporal dynamics, while CNNs excel at identifying spatial features. Regarding computational efficiency, CNNs boast faster processing owing to their parallel nature, in contrast with the slower sequential processing of RNNs. Additionally, RNNs tend to excel at recognizing fine-grained actions, where temporal dependencies play a crucial role, while CNNs may better suit the recognition of gross motor actions based on spatial configurations. Considering factors like dataset size and hardware resources, the choice becomes more adaptable, contingent on the final model's scale; the size of the dataset and the computational resources available for training are pivotal considerations, as different architectures entail varying requirements. In summary, when recognizing actions reliant on temporal sequences, RNNs prove suitable for capturing the nuanced temporal dynamics of joint movements; CNNs excel at identifying static spatial features and local patterns among joint positions; and for comprehensive action recognition that leverages both the spatial and temporal relationships among joints, GCNs offer a beneficial approach to 3D skeletal data.
A practical solution could also integrate not just one architecture but a combination of them, allowing the final model to absorb the advantages of each fundamental architecture. Furthermore, beyond the choice of deep architectures, the future trajectory of 3D skeleton-based action recognition (SAR) is a crucial consideration. Building upon our earlier discussions, we deduce that long-term action recognition, optimizing 3D-skeleton sequence representations, and achieving real-time operation remain significant open challenges. Moreover, annotating action labels for 3D skeleton data remains exceptionally labor-intensive; exploring avenues such as unsupervised or weakly-supervised strategies, along with zero-shot learning, may pave the way forward.

Conclusion
This paper presents an exploration of action recognition using 3D skeleton sequence data, employing four distinct neural network architectures. It underscores the concept of action recognition, highlights the advantages of skeleton data, and delves into the characteristics of various deep architectures. Unlike prior reviews, our study pioneers a data-driven approach, providing comprehensive insights into deep learning methodologies and encompassing the latest algorithms spanning RNN-based, CNN-based, GCN-based, and Transformer-based techniques. Specifically, our discussion of RNN- and CNN-based methods centers on how they address spatial-temporal information by leveraging skeleton data representations and intricately designed network architectures, while for GCN-based approaches our emphasis lies in harnessing joint and bone correlations to their fullest extent. Furthermore, the burgeoning Transformer architecture has garnered significant attention, often employed in conjunction with other architectures for action recognition tasks. Our analysis reveals that a fundamental challenge across these diverse learning structures lies in effectively extracting pertinent information from 3D skeleton data. The topology graph emerges as the most intuitive representation of human skeleton joints, a notion substantiated by the performance metrics observed on datasets like NTU-RGB+D. However, this does not negate the suitability of CNN- or RNN-based methods for this task; on the contrary, the introduction of innovative strategies, such as multi-task learning, shows promise for substantial improvements, particularly under cross-view or cross-subject evaluation protocols. Nevertheless, further accuracy gains on datasets like NTU-RGB+D are increasingly difficult given the already high performance levels attained; hence, redirecting focus towards more challenging datasets, such as the enhanced NTU-RGB+D 120 dataset, or exploring other fine-grained human action datasets becomes imperative. Finally, we have provided an exhaustive discussion of the selection of foundational deep architectures and explored potential future pathways in 3D skeleton-based action recognition.

Figure 1. Examples of skeleton data in the NTU RGB+D / NTU RGB+D 120 datasets [66,86]. (a) The configuration of the 25 body joints in the dataset. (b) The RGB+joints representation of the human body.

Figure 2. The general pipeline of skeleton-based action recognition using deep learning methods. First, the skeleton data is obtained in one of two ways: directly from depth sensors or from pose estimation algorithms. The skeletons are then fed into RNN-, CNN-, GCN-, or Transformer-based neural networks. Finally, the action category is predicted.

Figure 3. Examples of the mentioned methods for dealing with spatial modeling problems. (a) A two-stream framework that enhances the spatial information by adding a new stream [106]. (b) A data-driven technique that addresses spatial modeling ability by transforming the original skeleton sequence data [63].

Figure 4. Data-driven methods. (a) The differing importance of joints for a given skeleton action [65]. (b) The feature representation process; from left to right: original input skeleton frames, transformed input frames, and extracted salient motion features [44].

Figure 5. Examples of the proposed skeleton image representations. (a) Skeleton sequence shape-motion representations [55] generated from "pick up with one hand" on the Northwestern-UCLA dataset [108]. (b) The SkeleMotion representation workflow [6].

Table 1. The performance of the latest state-of-the-art 3D skeleton-based methods on the NTU-RGB+D dataset.

Table 2. The performance of the latest state-of-the-art 3D skeleton-based methods on the NTU-RGB+D 120 dataset.