MSST-RT: Multi-Stream Spatial-Temporal Relative Transformer for Skeleton-Based Action Recognition

Skeleton-based human action recognition has made great progress, especially with the development of a graph convolution network (GCN). The most important work is ST-GCN, which automatically learns both spatial and temporal patterns from skeleton sequences. However, this method still has some imperfections: only short-range correlations are appreciated, due to the limited receptive field of graph convolution. However, long-range dependence is essential for recognizing human action. In this work, we propose the use of a spatial-temporal relative transformer (ST-RT) to overcome these defects. Through introducing relay nodes, ST-RT avoids the transformer architecture, breaking the inherent skeleton topology in spatial and the order of skeleton sequence in temporal dimensions. Furthermore, we mine the dynamic information contained in motion at different scales. Finally, four ST-RTs, which extract spatial-temporal features from four kinds of skeleton sequence, are fused to form the final model, multi-stream spatial-temporal relative transformer (MSST-RT), to enhance performance. Extensive experiments evaluate the proposed methods on three benchmarks for skeleton-based action recognition: NTU RGB+D, NTU RGB+D 120 and UAV-Human. The results demonstrate that MSST-RT is on par with SOTA in terms of performance.


Introduction
Human action recognition attracts extensive attention due to its important application prospects in many fields, such as healthcare assistance, human-computer interaction and autonomous driving. Many exciting developments have taken place in the study of this research topic in recent years. A consensus has been reached that the key to human action recognition is learning how to capture sufficient spatial-temporal information.
Early research in this field was mostly based on RGB videos. However, RGB videos are often affected by the variations in illuminations, changes in camera viewpoints and background noise. All of these are obstacles in the process of extracting discriminative features. Compared with RGB data, the skeleton is free from these difficulties; it also has fewer restrictions in terms of data size. For example, it provides a compact form to represent dynamic information. These advantages make it possible to design lightweight models based on skeleton data. Meanwhile, with the development of human motion estimation technologies, such as advanced human pose estimation algorithms [1] and multimodal sensors [2], skeleton data are easier and cheaper to obtain than before, which inspires researchers to explore various approaches based on skeleton data for action recognition. In this paper, the action recognition research is based on 3D skeleton data.
To extract information from skeleton data, researchers have mainly explored four aspects: handcrafted-features-based methods, convolution neural-network-based methods, recurrent neural-network-based methods and graph convolution-network-based methods.
In recent years, the graph convolution network (GCN) has become the most widespread method, with the characteristics of low computation and high accuracy. Although these methods have achieved great success in terms of performance, there some defects still exist: (1) Most human actions are performed by a collaborative effort among the joints that not linked in the human skeleton. For example, the left hand and right hand are not linked, but they have a strong correlation in the action of "clapping". Graph convolutions only gather information from the local neighbor joints of two hands, respectively, but neglect the relationship between hands. Applying graph convolution repeatedly would obtain a multi-hop dependence between the two hands; however, this would occur at the expense of increasing computational complexity and would make optimization more difficult. (2) It is hard to achieve long-range dependence utilizing only graph convolution in the model, although it plays an important role in temporal dimension. (3) The rich information derived from different scales of motion can effectively supplement each other, but this is usually neglected in the research.
In this paper, we propose a novel mechanism with a lightweight transformer to overcome the first and second limitations mentioned in the previous paragraph, and call it a relative transformer. The relative transformer mechanism is employed in spatial skeleton-based architecture to build bridges between two distant joints and propagate signals. It is also employed in the temporal dimension to capture long-range interactions between two distant frames. As a consequence, the model is named the spatial-temporal relative transformer (ST-RT). For the third defect, we also propose an effective dynamic representation which fuses three different scales of motion and raw position to obtain richer information from a skeleton sequence in ST-RT.
The overall architecture of the proposed MSST-RT is shown in Figure 1. As well as the joint information, the bone information derived from the joints' positions is also learned by ST-RT. Furthermore, the skeleton sequence, obtained by different sampling strategies, provides supplementary information through model training, e.g., 10 and 20 frames are sampled from the joint sequence and bone sequence. Four ST-RTs are combined to extract features from two joint sequences and two bone sequences. We name this the multi-stream spatial-temporal relative transformer (MSST-RT). Significantly, batch normalization is adopted instead of layer normalization in relative transformer to obtain a faster training time and higher performance.  There are five main contributions of this work, summarized as follows: • We propose an MSST-RT model comprising four ST-RTs. Each of them extracts features from a corresponding skeleton sequence, which complement each other. It is worth noting that MSST-RT eschews recurrence and graph convolution and instead relies entirely on a relative transformer mechanism to learn long-distance dependencies.
• A multi-scale dynamic representation (DR) combines various scale-of-motion features from one skeleton sequence. • A lightweight relative transformer module is designed for spatial and temporal modeling. Regarding the spatial dimension, a spatial relative transformer module (SRT) is introduced to establish long-range dependencies while maintaining the origin skeleton topology. In the temporal dimension, the temporal relative transformer module (TRT) studies long-range interaction between nonadjacent frames, with the order of skeleton sequence remaining unchanged. • Attention responses in SRT and TRT are visualized to show the effects of the attention mechanism. This proves that the proposed ST-RT pays great attention to some nodes and the distribution of attention is consistent with human perception. • Our MSST-RT achieves state-of-the-art on three benchmark datasets, i.e., NTU RGB+D, NTU RGB+D 120 and UAV-Human, in action recognition based on skeleton.

Skeleton-Based Action Recognition
Action recognition based on skeleton can be summarized into two major categories, i.e., traditional methods [3][4][5] that use hand-crafted features and deep learning methods. In recent years, deep neural networks have gained increasing favor due to their remarkable performance in the following: (1) RNN-based methods [6][7][8][9][10] model the contextual information by splicing the coordinates of the key points at each moment into a vector and concatenating all vectors for each frame. For example, H. Wang and W. Liang introduce a two-stream RNN [7] architecture which incorporates both spatial and temporal RNN networks for skeleton based action recognition. The HBRNN-L model [9] decompose the human skeleton into five parts, and then feed them to five bidirectional RNN subnets. (2) CNN-based methods [11][12][13][14][15] usually reconstructed the skeleton sequenceto a series of pseudo-images to obtain the spatial cues. Wang et al. [16] propose a novel CNN network, which represents the spatial configuration and dynamic information of a joint trajectory as three texture images by color encoding. (3) GCNbased methods [17][18][19] preserve the inherent topological graph data struct of the skeleton by treating joints and bones as vertices and edges. Taking advantage of the graph data struct, GCN-based methods build operators in the non-Euclidean space and outperform the other two approaches. In particular, the ST-GCN model proposed by Yan et al. [17] is the first construct, a spatial-temporal graph that offers a new partitioning strategy.A lot of other GCN-based methods regard it as a baseline, or improve on it.

Transformer
Transformer [20] is a novel architecture, which has now become the standard for NLP tasks. It handles long-range dependencies by relying on self-attention rather than sequence-aligned RNNs or convolution. Recently, many transformer-based models have been proposed to improve the original architecture, e.g., Set Transformer [21], Routing Transformer [22] and Star-Transformer [23]. Most of them aim to overcome the computation complexity and large memory overhead. In addition to being developed for NLP, transformers became a research hotspot in the fields of Computer Vision. Carion et al. proposed a Detection Transformer (DETR) [24], which is the first object detection framework that combines the convolutional neural network with a transformer. Vision Transformer (ViT) [25] utilizes transformer architecture without CNN and it outperforms the stateof-the-art convolutional network in various image classification tasks. In our work, the model only consists of relative transformer modules. Inspired by the Star-Transformer, the proposed relative transformer is a variation based on standard transformer architecture designed for skeleton action recognition.

Background
The transformer model mainly includes an encoding component and decoding component. Our work is concerned with the encoding part, which is broken down into a self-attention layer and feed-forward neural network layer. This section will briefly review the relevant knowledge.

Attention in the Encoding
In NLP tasks, each word has a query vector q, a key vector k and a value vector v to calculate a score, which represents the effect of other words on the input sentence on the encoding of the word. The score is calculated by taking the dot product s ij = q i · k T j i, j = 1, · · · , n. The output of the self-attention layer is obtained by multiplying each value vector by the score and summing up the weighted value vectors. While packing all query vectors into a matrix Q, all key vectors into a matrix K and all value vectors into a matrix V, the attention function is defined as: leads to a more stable gradient. Furthermore, the "multi-headed" attention initializes multiple sets of weight matrices randomly and obtains different scores. Concatenating the scores as the result helps the network capture richer features.

Feed-Forward Neural Network in the Encoding
In addition to the attention layer, the encoder also contains a feed-forward neural network, which consists of two linear transformations and an ReLU activation: The network projects the refined vector, which was obtained from multi-head attention, into a larger space to improve the ability to capture information. We also apply it in our relative transformer network.

Multi-Stream Spatial-Temporal Relative Transformer Network (MSST-RT)
In order to solve the limitations of the small receptive field in traditional convolution, a Transformer is introduced to the skeleton-based action recognition models instead of graph convolution. Since the graph is in an unordered sequence, we propose a new transformer architecture, named the relative transformer to keep the topology of the graph with lower complexity. The proposed multi-stream spatial-temporal relative transformer network (MSST-RT) consists of four spatial-temporal relative transformer networks (ST-RT). Four kinds of skeleton sequence, including a joint sequence of 10 sampled frames, a joint sequence of 20 sampled frames, a bone sequence of 10 sampled frames and a bone sequence of 20 sampled frames (shown in Figure 1). They are fed into four ST-RTs for feature extraction, fusion and prediction.
In this section, we will introduce our model ST-RT, where relative transformers are employed in both space and temporal dimensions; the model architecture is illustrated in Figure 2. It consists of three modules: dynamics representation (DR), spatial relative transformer (SRT) and temporal relative transformer (TRT). Meanwhile, each relative transformer module contains three node update blocks and each block is subdivided into two sub-blocks: joint nodes update block and relay node update block. The feedforward neural network (FFN) is connected behind them in both a spatial and temporal relative transformer.  Figure 2. Illustration of the spatial-temporal relative transformer (SR-RT). The skeleton data are processed by three modules and then fed into the fully connected layer to predict the score for each action class.

Dynamics Representation (DR)
Temporal difference operations are always adopted for motion extraction in action recognition, such as TEINet [26] and STM [27]. In action recognition, all the 3D positions of skeleton joints in NTU60 and NTU120 datasets are stored, and UAV-Human stores 2D positions. Figure 3 shows the 3D positions of skeleton joints. A joint in frame a is represented as The temporal difference is the subtraction between the same joint in the two frames, which can be denoted as ( .In view of the significant motion variations in actions, we combine different scale motions to model the temporal information of action, as shown in Figure 3. This operation improves the generalization of our network, as the fixed motion focuses on different ranges of motion while the adaptive motion focuses on different durations. Figure 3. Illustration of dynamics representation (DR). There are four streams of skeleton information embedded into a higher dimension by the embedding block and then concatenated as an input of the spatial relative transformer. Each block consists of two convolution layers and two activation layers.

Convolution layer Relu layer
In more detail, we divide the original sequence I origin = [I 1 , · · · , I F ] into T equal clips and randomly sample one frame from each clip to form a new sequence I new = [I 1 , · · · , I T ] in order. The original sequence I origin is either a joint sequence or bone sequence. The bone extractor proposed by Shi et al. [28] is applied to datasets to obtain bone information, which ensures that the representation of both the bone sequence and joint sequence are exactly the same. I represents the row positions of all joint ponits in single frame. The motion is computed by taking the difference of each joint node Jo t i (Jn t i ) between two frames: Jo t i denotes the i-th joint node in frame t of I origin and Jn t i denotes the i-th joint node in frame t of I new . The adaptive motion I ma is the difference between consecutive frames in I new , which represents different scales of motion information in the unequal video: where I t ma denotes the adaptive motion of frame t in the new sequence. Note that although the difference is between the adjacent frames in I new , the distance between these two frames depends on their location of I origin , which is interconnected with the length of the skeleton sequence. Hence, each skeleton sequence obtains an adaptive scale motion based on length.
Furthermore, there are two types of fixed motion: short-scale I ms and long-scale I ml . The function is expressed in the following: where I t ms denotes the short motion of frame t in the original sequence and I t ml is the long motion of frame t in the original sequence. f represents the frame number in the original video.
Finally, the row position and three different types of motion are embedded into the high-dimension tensor, i.e., F t , F t ma , F t ms and F t ml , and concatenate them.
where Z t i , Z t and Z are the dynamic representations of the i-th joint node of frame t, the t-th frame and the new sequence, respectively.
The embedding block consists of two convolution layers and two activation layers, as shown in Figure 3. These blocks extract features from 2D/3D position and motion tensors. The size of the convolution kernel is explained in Section 5.2.

Architecture
Different from the standard transformer, we prepend a virtual node to the graph of skeleton as the input. The virtual node gathers the global information from each joint node and scatters the information to all joint nodes; we named it the spatial-relay node. For the joint node and sptial-relay node, there are two corresponding connections: the inherent connections and the virtual connections.

Spatial Inherent Connections
As shown in Figure 4a, we establish inherent connections for all adjacent joints that have bone connections to preserve the inherent graph topology in skeletons. Such connections with prior knowledge allow each joint node to gather the local information from its adjacent joint nodes. Meanwhile, they enable joints to obtain more direct information from neighbors than non-adjacent joints, consistent with the general perception: neighbor joints are generally more important. A skeleton graph with n joint nodes has n − 1 inherent connections.

Virtual connection
Inherent connection Spatial relay node update block Spatial joint nodes update block Relay node

Joint node
The computed joint node Figure 4. Illustration of the update blocks in a spatial relative transformer (SRT). The graph structure in SRT is described in (a). Updating operates on each joint node by obtaining local information from adjacent joint nodes and non-local information from the spatial-relay node in (b). Spatial-relay nodes are updated by scoring the contribution of each node, including spatial joint nodes and the spatial-relay node in (c).

Spatial Virtual Connections
The connections between every joint node and spatial-relay node are named after virtual connection (see in Figure 4a). Through the virtual connections, the sptial-relay node captures the global composition relationship; therefore, each joint node can obtain the information between the non-adjacent joint nodes. A skeleton graph with n joint nodes has n virtual connections. The combination of inherent and virtual connections makes the relative transformer obtain both local and global information. Compared with the standard transformer, the number of connections includes inherent and virtual connections. As a consequence, the model establishes a long-range dependency with low computational efficienvy and memory overhead.

Implementation of SRT
In the spatial relative transformer model, each frame has its own relative transformer and we look at the model within one single frame. The model input J graph = {J t 1 , J t 2 , · · · , J t N } is a sequence of joint nodes at time t, where N is the number of nodes in this frame. B J t i is a set which contains the label of the adjacent joint nodes of J t i . Each node J t i (R t ) has a query vector q t i (q t r ), a key vector k t i (k t r ) and a value vector v t i (v t r ).

Spatial Joint Nodes Update Block (SJU)
For each joint node, we calculate the strength of the correlations between them and their adjacent joint nodes (including the neighbor nodes J t B J t i , the relay node R t and itself J t i ) by taking the key dot product with the query vector, as shown in equation: where α t ij represents the importance of node j on node i. Each neighbor node value vector v t j is multiplied by the corresponding score α t ij and added up to update the joint node, as shown below: where J t i is the updated result, aggregating local and global information. d k is the channel dimension of the key value (shown in Figure 4b).
In the model implement, the computations are implemented in matrix form. First, the q t i , k t j and v t j vectors are packed into Q t , K t and V t . Matrix Q t ∈ R C×1×N contains all joints' query vectors for a single skeleton. Both matrix K t ∈ R C×A×N and V t ∈ R C×A×N contain all key vectors and value vectors, which correspond with an adjacent node matrix M a (the adjacent node matrix will be introduced in Section 4.4). C denotes the feature dimension; N is the number of joints in one single skeleton; A represents the maximum number of adjacent nodes. The attention in matrix form is defined as follows: where • is a Hadamard product and the mask operation removes the zeros taken by the padding operation.

Spatial Relay Node Update Block (SRU)
To ensure that the spatial-relay node better aggregates the information of all joint nodes, we also apply a transformer (see in Figure 4c). The importance of each joint node α t rj is computed by the query vector q t r and the key vector k t j , as shown in the following: The relay node R t is updated by: For the matrix key, all key vectors k t j and value vectors v t j are packed into matrix K t ∈ R C×L and V t ∈ R C×L , respectively. The attention in matrix form is defined as follows: where q t r ∈ R 1×C is the spatial relay node, · denotes the matrix product. By alternately updating the joint nodes and the relay node, the spatial relative transformer will capture all the local and non-local information for an input graph. The overall update algorithm of the SRT is shown in the Algorithm 1.

Temporal Relative Transformer (TRT)
After designing a spatial relative transformer for each skeleton frame, we then formulate a temporal relative transformer to the skeleton sequence. Similar to the spatial relative transformer, the temporal relative transformer also consists of inherent connections and virtual connections by introducing a temporal-relay node.

Temporal Inherent Connections
Along the temporal dimension, the same joints across consecutive frames are treated as an input sequence into the model. Aside from the same joint nodes used between the adjacent, the joint nodes in the first and last frame are also connected, constituting a ring-shaped structure, as depicted in Figure 5d. A sequence formed by n nodes contains n inherent connections.

Algorithme 1 :
The update of spatial relative transformer Input : Embedded skeleton feature Z t 1 , Z t 2 , · · · , Z t N Output : The joint nodes (J t 1 ) L , (J t 2 ) L , · · · , (J t N ) L and relay node (R t ) L after L updates · · · , Z t N ) 4 for l ← 1 to L do 5 // Update the spatial joint nodes Temporal joint nodes update block Temporal relay node Update block

Virtual connection Inherent connection
Relay node Joint node The computed joint node Figure 5. Illustration of the update blocks in temporal relative transformer (TRT). The same joint nodes in all sampled skeleton are connected in order and the joint nodes in the first and last frame are also connected in (d). This approach constitutes a ring-shaped structure, as shown in (a). Furthermore, each joint node and the temporal-relay node are updated by TJU in (b) and TRU in (c), respectively, similar to the methods in SRT.

Temporal Virtual Connections
In the temporal relative transformer, each virtual connection links a joint node to the temporal-relay node, similar to the operation in spatial relative transformer. Hence, a sequence which contains n nodes has n virtual connections, as shown in Figure 5a. In a word, temporal relative transformer can capture the relationship between adjacent frames by the inherent connections, and the long-range relationship is captured by the virtual connections. This means that the semantic compositions are divided between the inherent connections and virtual connections, which enables the model to work without heavy pre-training. Furthermore, it reduces the number of connections from n 2 to 2n, where n is the skeleton sequence length.

Implementation of TRT
In the temporal relative transformer model, each node is considered independently. As a result, the model is applied to the sequence J = {J 1 v , J 2 v , · · · , J T v }, which represents the same joint node from all frames in the sampled sequence. Each node Temporal Joint Nodes Update Block (TJU) As shown in Figure 5b, the joint node J i v is updated by the relay node r v , the same joint node in neighbor frames (J i−1 v , J i+1 v ) and itself. Their score α ij v is expressed in the following: where α ij v represents the importance of the node in j-th frame on the same node in i-th frame.The joint nodes J i v are updated by: For the matrix key, all query vectors q i v are packed into matrix Q v ∈ R C×1×T , all key vectors k j v are packed into matrix K v ∈ R C×B×T and all key vectors v j v are packed into matrix V v ∈ R C×B×T . B is the total number of j. The attention in matrix form is defined as follows: where • denotes Hadamard product.

Temporal Relay Node Update Block (TRU)
The information of all the frames is aggregated to the temporal-relay node in Figure 5c by scaled dot-product attention, as expressed in the following: where α rj v is the attention score and 1 √ d k is a scaling factor. R v denotes the updated relay node. For the matrix key, all key vectors k j v and value vectors v j v are packed into matrix K v ∈ R C×T and V v ∈ R C×T , respectively. The attention in matrix form is defined as follows: where q r v ∈ R 1×C is the temporal relay node, · denotes matrix product.
By alternately updating the relay node and the same joint node on all frames, the temporal relative transformer finally captures all the relationships in an input frame sequence. The overall update algorithm of the TRT is shown in Algorithm 2.

Algorithme 2 :
The update of temporal relative transformer

Detail of SRT and TRT
Given an input tensor of shape (B, C, V, T), where B is the number of batch sizes, C is the channel size of dynamic representation, V is the number of joint nodes in a skeleton and T is the number of frames in new sequence. As shown in Figure 6, we move the T dimension into the B dimension to get the new shape (B × T, C, V, 1) in SRT module and move the V dimension into the B dimension to obtain the new shape (B × V, C, T, 1) in the TRT module. The former implements the transformer on each frame independently, while the latter applies the transformer separately on each joint along the time dimension.
In a skeleton, each joint node generally has an unequal number of adjacent nodes. To solve this problem, we pad zeros to the nodes whose adjacent nodes are less than A. A denotes the maximum number of adjacent nodes. By doing this, we obtain an adjacent node matrix M a ∈ R N×A , where N denotes the number of joint nodes in one skeleton. This operation is contained in step "neighbor" only in the SRT module, as shown in Figure 4b. Since these padding nodes are meaningless, we use a mask before the SoftMax operation, so that the attention mechanism avoids them.
Otherwise, multi-head attention with N h heads is applied to obtain richer information. The formula for attention translates into the following form: · · element-wise product mask this operation is only for SRT module X X element-wise product Conv2d&&reshape Figure 6. Illustration of the detail of the SRT module and TRT module. The tensor shapes with black font are for the SRT module and the blue are for the TRT module. It can be seen that the "mask" operation is only employed in SRT. In the SRT (TRT) module, the "neighbor" operation finds the adjacent nodes for each node according to Figures 4b and 5b .

Experience
In this section, we evaluate the performance of the proposed MSST-RT on three large-scale skeleton datasets, namely, NTU RGB+D [29], NTU RGB+D 120 [30] and UAV-Human [31]. Extensive ablation studies are conducted to validate the contributions of different components in our model. Finally, we visualize the attention probabilities in SRU and TRU block to prove the effectiveness of the proposed ST-RT model.

NTU RGB+D
The data in NTU RGB+D are collected in the lab environment by three Microsoft Kinect v2 cameras, which are widely used for skeleton-based action recognition tasks. They contain 56,680 skeleton sequences performed by 40 subjects, covering a total of 60 action classes. Each sequence contains, at most, two subject skeletons and each skeleton is composed of 25 joints. As the authors of this dataset recommended, we used two evaluation protocols, namely cross-subject (CS) and cross-view (CV). In the CS setting, 20 subjects were used for training and the rest for testing. In the CV setting, the sequences captured by camera 2 and camera 3 were used for training while the rest were used for testing.

NTU RGB+D 120
NTU RGB+D 120 is an extension of NTU RGB+D, whose action classes are increased to 120, subjects are increased to 106 and sequences are increased to 114,480. There are also two benchmarks, namely, cross-subject (C-subject) and cross-setup (C-setup). In the C-subject setting, 53 subjects are used for training and the rest for testing. In the C-setup setting, the dataset is divided by the parity of the setup IDs into two group, with one used for training and another used for testing.

UAV-Human
UAV-Human is a new dataset which is of significance fo practical UAV application scenarios. This dataset covers different types of human behavior and is collected by a flying UAV in multiple urban and rural districts in the day and night. It contains 155 activity classes in six different modalities including RGB, depth, IR, fisheye, night-vision, and skeleton sequences. There are 22,476 frames with the 2D positions of 17 major key-points of the human body for skeleton-based recognition, 16,169 frames for training and 6307 frames for testing.

Implementation Details
All experiences are performed on the Pytorch platform. Similar to [20] , we used the Adam optimizer with β = [0.9, 0.98] and = 10 −9 . Moreover, a two-phase training strategy was utilized to make the model convergence faster and more stable. The gradual warmup strategy linearly increased the learning rate from 4 × 10 −7 to 0.0005 for the first training phase (the first 700 steps), while natural exponential decay with the weight decay of 0.9996 proportionally decreased the learning rate for the second phase. During training, the batch sizes for NTU60, UAV-Human and NTU120 were 64, 64 and 100, respectively, and the total training epochs was set to 30. Label smoothing of value ls = 0.1 was employed for all experiences.
In terms of data processing, the original skeleton coordinates of each frame were replaced with its displacement relative to the first frame. The actions, which contained two skeletons, such as hugging, were divided into two sequences so that each sequence contained only one. Moreover we employed data argumentation by randomly rotating the 3D skeletons to create more variational samples, which could improve the generalization ability of the network.
Raw position, short motion, long motion and adaptive motion were embedded from dimension space 2 (for UAV-Human) or 3 (for NTU and NTU120) to a dimension space of (C) 64 by the first convolution in all the experiments. Through the second convolution, they were embedded from a space of dimension (C) 64 to a space of dimension (C 1 , C 2 , C 3 , C 4 ) 256, 256, 128 and 128, respectively. Note that the weights of the convolution layers are not shared among them.

Comparison to State of the Art
The performance of the proposed MSST-RT is compared with other state-of-the-art methods on the NTU60 , NTU120 and UAV-Human datasets in Tables 1-3, respectively. The contents of the "ST-RT()" brackets represent the number of sampled frames (10 or 20) and the type of input skeleton data (joint or bone), respectively. As shown in Table 1, MSST-RT achieves a good performance, with 88.43% and 93.21%, respectively, for Cross Subject (CS) and Cross View (CV) settings of NTU RGB+D 60. It is worth noting that STA-LSTM and TS-SAN both adopt the attention mechanism, which is similar to the model idea that we proposed. The difference is that STA-LSTM uses LSTM in addition to the attention mechanism, while our proposed model only uses the attention mechanism. Moreover, our model outperforms this model by 14.96% with CS and 12.0% with CV. Cmparing our model with TS-SAN, the attention mechanisms in ST-RT are employed not only between video frames, but also between the joint nodes of the skeletons. Our model outperforms it for 1.23% with CS and 0.51% for CV.

Methods
Year CS(%) CV(%) As shown in Table 2, the proposed MSST-RT achieves the best performance, with 79.33% for the C-subject setting and 82.30% for the C-setup setting. Most of the methods we compare are reported in [22]. Table 3. Performance comparison on UAV-Human.

Methods
Year Accuracy (%) As shown in Table 3, the proposed MSST-RT achieves the best performance, with 41.12%, and outperforms the second place by 3.24%. UAV-Human is a new dataset, released in 2021, and we compare the performance of our model with the results reported in [31]. Single-stream ST-RT, namely "ST-RT(Seg = 20, joint)", outperformed all methods.

Ablation Study
Ablation studies of performance were performed by the proposed ST-RT model on the NTU RGB+D dataset with the CS setting. Firstly, two normalization methods were chosen for comparison to provide faster training and better performance. Moreover, we compared dynamic representations with different motion combination methods and different networks with different stream combination methods. Finally, the models were investigated with different hyper-parameters, such as the number of sampled frames, layers and heads. Furthermore, an ablation study regarding computation cost was based on UVA-Human.

The Effect of Different Normalization Methods
Normalization is used in both dynamic representation modules and relative transformer modules. The normalization method in a relative transformer module can either use layer normalization or batch normalization, while the normalization method in the dynamic representation module only can use layer normalization. Table 4 shows that "ST-RT (BatchNorm)" is superior to "ST-RT (LayerNorm)" by 2.03%. The accuracy of the two normalization methods is shown in Figure 7 as a function of the epoch. The accuracy of "ST-RT (BatchNorm)" increases faster than that of "ST-RT (LayerNorm)" and it is stable at a higher value. The experiments demonstrate that batch normalization provides faster training times and a better performance.

The Effect of Dynamics Representation with Different Combination
To demonstrate that each motion provides different information, we compare our ST-TR model with three different models whose DR module only contains two motions, and show the results in Table 5. "ST-RT(w/o X motion)" denotes that the "X" motion is removed from the DR module in the ST-RT model, "X" is "short", "long" or "adaptive". The accuracy of the ST-TR model has a 0.60% reduction, a 1.17% reduction and a 0.69% reduction when it is without short motion, long motion or adaptive motion, respectively. This means that the information captured from each motion can complement the other two motions. According to the reduction in accuracy, we conclude that long motion contains richer dynamic information compared to short motion.  Table 6 shows the effect of sampling frame numbers on model performance. From the results of "ST-RT(Seg = 15)" to "ST-RT(Seg = 10 && Seg = 15)", we observe that the model underperforms in both superfluous and insufficient frame scenarios. The experiments demonstrate that the model achieves the best performance with 86.46% when sampling 20 frames. In addition, we combined networks with different sampling frames (shown in the last four rows of Table 6, i.e., "ST-RT(Seg = 10 && Seg = 15)" denotes the combination of the network with 10 sample frames and the network with 15 sample frames). "ST-RT(Seg = 10 && Seg = 20)" outperforms other combinations with 87.48%, and this optimal parameter was applied in MSST-RT. Table 6. Accuracy comparison of the ST-RT models with different sampling strategies and their combination. "X && Y" means the fusion of the model with X sampling strategy and the model with Y sampling strategy.

CS (%)
ST  Table 7. "LX" represents the SRT (TRT) module with X layers; each layer contains one SJU (TJU) block, one SRU (TRU) block and two FFNs. "HY" represents each attention mechanism in ST-RT model with Y multi-heads. By comparing the number of heads, we observe that more multi-heads always results in better performance, especially when the model's heads increase from 4 to 8. However, the affect of head number peaks when it increases to 12, and excessive heads can make the accuracy of the lightweight model decrease (shown in "ST-RT (L2H8)" and "ST-RT (L2H12)"). From the results of " ST-RT (L2H8)", " ST-RT (L3H8)" and " ST-RT (L4H8)", we conclude that the model underperforms if the number of layers is too high or too low. In sum, number of layers and heads is set as 3 and 8, respectively, for the proposed model, considering the number of parameters and the accuracy. Table 7. Accuracy comparison for the ST-RT models with different sampling strategies, and their combination. "X && Y" means the fusion of the model with X sampling strategy and the model with Y sampling strategy.

CS (%)
ST  Table 8. When using the same number of frames, the joint stream has a similar computation cost to the bone stream. The computation cost increases with the increase in the number of frame. In the variants with two streams, "ST-RT(s3+s4)" achieves the best performance, which is slightly lower than MSST-RT, and the computation cost is half of MSST-RT's. Hence, the stream numbers can be reduced if there are concerns regarding computation cost. Table 8. The effect of number of streams in the ST-RT Model on UAV-Human. "s1" is the stream with 10 frames, joint frames, "s2" is the stream with 10 frames, bone frames, "s1" is the stream with 20 frames, joint frames, "s1" is the stream with 20 frames, bone frames. Furthermore, "ST-RT(X + Y)" is the fusion of the model with X stream and the model with Y stream. "Computation cost" means average second per epoch on the train set.

Visualization of SRU and TRU
Our model applies the attention mechanism when updating joint nodes and virtual nodes in both the spatial and temporal dimensions. The attention response is visualized from the last SRU block in the Spatial Relative Transformer (SRT) and the last TRU block in the Temporal Relative Transformer (TRT).
The action salute is selected to visualize the attention response for eight multi-heads from the last SRU layer in Figure 8. The red circles represent the spatial-relay nodes and the blue circles represent the joint nodes. The blue lines represent the inherent connections and the yellow lines represent the virtual connections. We zoom in on the five nodes with the highest corresponding attention values, and the other nodes are shown by small circles. The attention response of each head is different and head1, head2, head3 and head4 all focus mainly on the left hand. This suggests that the attention mechanism works in a similar way to human perception. Otherwise, the actions clapping and kicking something are visualized in Appendix A, which shows that different actions focus on different nodes. Both left and right hands are important for "clapping", so most heads focus on hands in Figure A1a. For "kicking something" in Figure A1b, only the right foot is of great value. Figure 9 shows the attention response for the action "salute" for eight multi-heads from the last TRU layer. The red circles represent the temporal-relay nodes (the left one in the sequence is the node before updating and the right one is the node after updating) and the 20 blue circles represent the eighth joint nodes from 20 sampled frames. The transparency of lines indicates the intensity of the attention response. The brighter color denotes the higher response. This shows that different frames are noted in different heads, and the information from each frame, including the temporal-relay node itself before being updated, is converged to the temporal-relay node. We can see that the temporal-relay node in last layer (the left red node in the sequence) receives a large amount of attention from head1 and head5.

Conclusions
In this work, transformer architecture is introduced to establish a long-range dependence instead of graph convolution. Significantly, MSST-RT relies on transformer architecture instead of recurrence, LSTM or graph convolution, which is a full attention model. Specifically, we propose a novel architecture based on standard transformer and named a relative transformer. This compensates for the deficiencies of the standard transformer while retaining the inherent topology of the skeleton, and significantly reduces computational complexity. The architecture, meanwhile, makes it possible for this to work without heavy pre-training. Furthermore, the relative transformer module evolves into a spatial relative transformer and temporal relative transformer, respectively, to extract spatial-temporal features. In addition, the DR module combines multi-scale motion information to adaptively recognize actions with different durations and different ranges of motion. Finally four streams with an ST-RT module, with four dynamic data streams, are fused to complement each other, realizing the further enhancement of performance. The final network, MSST-RT, achieves a state-of-the-art performance in skeleton-based action recognition on NTU RGB+D, NTU RGB+D 120 and UAV-Human. It is worth noting that single-stream ST-RT outperformed other methods on UAV-Human. It outperformed TS-SAN [37], which also adopted attention architecture. The results of attention response visualization verify the effectiveness of the proposed model for skeleton-based action recognition tasks.

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript: (b) Figure A1. Visualization of the attention responses of the last SRU block in ST-RT model with respect to two actions, i.e., clapping (a) and kicking something (b).