A human activity recognition method based on Vision Transformer

Human activity recognition has a wide range of applications in fields such as video surveillance, virtual reality and human–computer intelligent interaction, and it has emerged as a significant research area in computer vision. Graph convolutional networks (GCNs) have recently been widely used in these fields and have achieved strong performance. However, challenges remain, including the over-smoothing problem caused by stacked graph convolutions and insufficient semantic correlation to capture large movements between time sequences. Vision Transformer (ViT) has been applied in many 2D and 3D image fields with surprisingly good results. In this work, we propose a novel human activity recognition method based on ViT (HAR-ViT). We integrate an enhanced version (eAGCL) of the AGCL in 2s-AGCN into ViT so that it can process spatio-temporal data (3D skeletons) and make full use of spatial features. The position encoder module orders the non-sequenced information, while the transformer encoder efficiently compresses sequence data features to enhance calculation speed. Human activity recognition is accomplished through a multi-layer perceptron (MLP) classifier. Experimental results demonstrate that the proposed method achieves SOTA performance on three extensively used datasets: NTU RGB+D 60, NTU RGB+D 120 and Kinetics-Skeleton 400.

HD-Graph containing those edges in the same semantic spaces of a human skeleton 18. LKA-GCN enlarges the receptive field and improves channel adaptability without increasing the computational burden too much 19. DeGCN learns deformable sampling locations on both spatial and temporal graphs, enabling the model to perceive discriminative receptive fields 20. In DS-GCN, the joint and edge types are encoded in the skeleton topology in an implicit way, and a joint type-aware adaptive topology and an edge type-aware adaptive topology are proposed 21.
Although GCN-based methods have made significant progress, two challenges remain: (1) how to model remote (long-distance) dependencies between joints more accurately, thereby alleviating the over-smoothing problem caused by stacked graph convolutions; and (2) how to improve robustness and semantic correlation to capture large movements between time sequences. The motivation of our work is therefore to offer a feasible and effective approach to these limitations, which can be summarized as follows:

Motivation 1
The ViT model applies the Transformer architecture to image recognition 22. It interprets an image as a sequence of patches and processes it with a standard Transformer encoder as used in NLP. This simple yet scalable strategy works surprisingly well when coupled with pre-training on large datasets. ViT matches or exceeds SOTA methods on many image classification datasets, whilst being relatively cheap to pre-train. Both the lower and the higher layers of ViT can have a large field of view, and global feature information can be obtained in the initial layer, so the model preserves the integrity of global and local features and can resolve the over-smoothing problem caused by stacked graph convolutions. Skip connections have a large impact on the propagation and representation of features, so ViT can capture semantic correlations of joints between time sequences. ViT also retains location information while transmitting feature information. By leveraging the advantages of ViT to address the shortcomings of GCN-based methods, we apply ViT to skeleton data with spatial and temporal characteristics.

Motivation 2
The inputs of ViT and its derived applications are 2D images, so ViT cannot be directly applied to 3D skeleton data. In 2s-AGCN, an adaptive matrix is integrated into the adjacency matrix of the graph convolution, so that the model can adjust the topology of the graph adaptively and adapt to different input data. This design greatly improves the recognition accuracy on skeleton data. However, its ability to capture hidden information is insufficient, and the product operation easily causes vanishing gradients. We enhance the adaptive graph convolutional layer (AGCL) in 2s-AGCN by replacing the two embedding functions and normalizing the trainable adjacency C_k with the scale factor of Scaled Dot-Product Attention; we call the result eAGCL. It can automatically learn the connection strength between joints in the sample data and solve the vanishing-gradient problem.

Our contributions
In summary, our work makes three major contributions: (1) For the first time, ViT is applied to 3D skeleton data, and a human behavior recognition method, HAR-ViT, is put forward. (2) The position encoder in ViT is rewritten to order the non-sequenced information (skeleton data) and reduce idle spatial coding information. (3) We propose the eAGCL model, based on the AGCL in 2s-AGCN, to improve our network's utilization of spatial features.

ViT
ViT attains excellent image classification results compared to SOTA convolutional networks while requiring substantially fewer computational resources to train. It shows that reliance on CNNs is not necessary and that a pure transformer applied directly to sequences of image patches can perform very well, especially when trained with large-scale training sets. In addition, a model pre-trained on large-scale datasets can also achieve better performance than CNNs when transferred to medium or small datasets.
The lowest level of ViT allows the model to have larger windows through the self-attention mechanism. In the shallow layers, the model gradually acquires local and global characteristics, while in the deep layers the model has had the characteristics of a global view from the very beginning. Skip connections play an important role in the propagation and representation of features; if they are removed, the accuracy of the model decreases by about 4%. The similarity between the input image and the feature map of the last layer in ViT is very high, which indicates that ViT retains location information while propagating feature information.
Pyramid ViT implements a variable self-attention mechanism through spatial-reduction attention, which is applied to ViT models to overcome the quadratic complexity of the attention mechanism 23. Swin Transformer uses a hierarchical ViT model with sliding windows: a local self-attention mechanism is applied to non-overlapping windows based on their position, forming a hierarchical feature representation at the next level, and these features are finally integrated 24. DINO is a self-supervised training framework based on ViT proposed by Meta's AI team; it can be trained on large-scale unlabeled data and obtains robust feature representations, even without a fine-tuned linear layer 25. Scaling ViT, proposed by the Google Brain team by scaling the ViT model up to 2 billion parameters, won first place in ImageNet recognition results 26. SegFormer, proposed by Nvidia, focuses on the componentization of the system and adopts a simple MLP decoding model without requiring a position encoder 27. UNETR (Unet+ViT) utilizes a transformer as the encoder to learn sequence representations of 3D medical images and effectively capture global multi-scale information; the transformer encoder is directly connected to a decoder via skip connections at different resolutions to compute the final semantic segmentation output, and it achieves favorable semantic segmentation results 28. On account of ViT's excellent performance in 2D and 3D image processing, we apply it to 3D skeleton-based human behavior recognition and propose a method named HAR-ViT.

AGCL in 2s-AGCN
The AGCL serves as the fundamental neural network layer of 2s-AGCN, enabling an end-to-end learning approach to optimize skeleton data. The network of AGCL is illustrated in Fig. 2. Its topology and network parameters are designed to enhance flexibility by accommodating unique data graphs for different layers and samples. Additionally, it is constructed as a residual branch to ensure the stability of the original model.
The graph convolution rule of AGCL is shown in Eq. (1), where f_out and f_in denote the output and input of the network respectively. K_v represents the kernel size of the spatial dimension, which is set to 3; k denotes the position in the kernel, and W_k signifies a 1 × 1 convolution. The adjacency matrix is divided into three parts: A_k, B_k and C_k. A_k denotes the fixed connectivity pattern among human joints. B_k represents the connection strength between two joints; its elements are parameterized and optimized together with the other parameters during training. C_k is a data-dependent graph that learns a unique graph for each sample. To determine the existence and strength of a connection between two joints, it uses the normalized embedded Gaussian function to calculate their similarity (Eq. 2), with the dot product measuring similarity in the embedding space. Two embedding functions θ(v_i) and φ(v_j) are applied to the feature map, and the results are rearranged into matrices with dimensions N × T and T × N respectively; multiplying these matrices yields an N × N similarity matrix C_k. Each element of C_k represents the similarity between the corresponding joints v_i and v_j.
The residual connection of B_k and C_k in the model enhances its flexibility and stability without sacrificing the original performance. However, the exponential computation of the Gaussian function ignores hidden information between samples and fails to sufficiently calculate the similarity between joints. Moreover, it is computationally expensive and prone to vanishing gradients. We improve the accuracy of the similarity and decrease the computing cost by replacing the two embedding functions (θ and φ) and normalizing by a scale factor.
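The rule described above can be sketched in NumPy. This is a minimal illustration rather than the 2s-AGCN implementation: it assumes that Eq. (1) has the form f_out = Σ_k W_k f_in (A_k + B_k + C_k) and that Eq. (2) is a row-wise softmax over the dot products θ(v_i)^T φ(v_j), as in the published 2s-AGCN formulation. The function names, the weight arguments `w_theta` and `w_phi`, and the tensor layout (channels, frames, joints) are hypothetical; the embeddings are flattened over channels and frames so that the product is N × N.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def data_dependent_graph(f_in, w_theta, w_phi):
    """Assumed form of Eq. (2): C_k = softmax(theta(f_in)^T phi(f_in)).

    f_in: (C_in, T, N) feature map; w_theta, w_phi: (C_e, C_in) embeddings.
    """
    n_joints = f_in.shape[-1]
    theta = np.einsum("ec,ctn->etn", w_theta, f_in).reshape(-1, n_joints).T  # (N, C_e*T)
    phi = np.einsum("ec,ctn->etn", w_phi, f_in).reshape(-1, n_joints)        # (C_e*T, N)
    return softmax(theta @ phi, axis=-1)  # (N, N), each row sums to 1

def agcl_forward(f_in, a, b, w, w_theta, w_phi):
    """Assumed form of Eq. (1): f_out = sum_k W_k f_in (A_k + B_k + C_k).

    a, b: (K_v, N, N); w: (K_v, C_out, C_in); w_theta, w_phi: (K_v, C_e, C_in).
    """
    k_v, c_out, _ = w.shape
    f_out = np.zeros((c_out,) + f_in.shape[1:])
    for k in range(k_v):
        adj = a[k] + b[k] + data_dependent_graph(f_in, w_theta[k], w_phi[k])
        aggregated = np.einsum("ctn,nm->ctm", f_in, adj)     # mix features across joints
        f_out += np.einsum("oc,ctm->otm", w[k], aggregated)  # 1 x 1 convolution over channels
    return f_out
```

With K_v = 3 the loop runs over the three adjacency subsets; only C_k depends on the input sample, while A_k is fixed and B_k is a learned parameter.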

Materials and methods
In this paper, we propose an end-to-end HAR-ViT algorithm that applies ViT to the field of human activity recognition. The overall framework of the model is illustrated in Fig. 3.
To address the limitation that ViT cannot process 3D skeleton data, this study enhances the AGCL in 2s-AGCN. This enhancement enables the extraction of spatial features encompassing the connection relationships and strengths between joints. These features are then sorted along the time axis before being transformed into feature vectors through linear projection. In order to capture temporal features, learnable embedding vectors are appended to the end of the feature vectors, and a position encoder is integrated with them to provide temporal information. Subsequently, these fused feature vectors serve as input to the transformer encoder, where multiple Transformer layers extract temporal features by means of similarity computation. Finally, all temporal features are compressed for classification by the MLP classifier.

eAGCL
The internal network structure of eAGCL is shown in Fig. 4. We introduce a novel trainable matrix as a replacement for the two Gaussian functions in C_k. The trainable matrix allows C_k to be optimized for each sample through back propagation, thereby enhancing the efficiency and effectiveness of learning from sample data. The similarity between joints is computed using a covariance matrix, presented in Eq. (3), where X is the eigenvector and X^T is the transpose of X. The covariance matrix captures the similarity between different dimensions across multiple elements: the stronger the similarity, the higher the covariance value.
The shape of the skeleton data is x ∈ R^{C×V×T}, where C represents the x, y, z coordinates, V denotes the number of joints and T is the length of the skeleton sequence. To convert the skeleton data into d-dimensional feature vectors, a trainable matrix conv is introduced to transform the skeleton data into x ∈ R^{d×T}; the elements of the trainable matrix can be regarded as initial weight values, which are optimized through back propagation. The dot products of the generated feature vectors are divided by √d, and the final weights are obtained using softmax. The expression for C_k is shown in Eq. (4). The variance of each element in C_k depends on d: for large d the dot products grow large in magnitude, pushing the softmax function into regions with minimal gradient. To counteract this effect, we scale the dot products by 1/√d.
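The scaled similarity described above can be illustrated as follows. This is a sketch under stated assumptions, not the authors' Eq. (4): it assumes the similarity takes the Scaled Dot-Product Attention form softmax(X^T X / √d), where X holds one d-dimensional feature column per element being compared. The projection `w` is a hypothetical stand-in for the trainable matrix conv, and the (c, n) input layout is assumed.

```python
import numpy as np

def row_softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def eagcl_similarity(x, w, scale=True):
    """Similarity matrix from projected features (assumed form of Eq. (4)).

    x: (c, n) raw features, one column per element (assumed layout).
    w: (d, c) trainable projection, a hypothetical stand-in for `conv`.
    """
    feats = w @ x                     # (d, n) d-dimensional feature vectors
    d = feats.shape[0]
    scores = feats.T @ feats          # (n, n) pairwise dot products
    if scale:
        scores = scores / np.sqrt(d)  # keeps softmax away from its saturated region
    return row_softmax(scores)
```

Calling the function with `scale=False` on the same inputs gives rows that are at least as peaked as the scaled ones, which is the saturation effect the 1/√d factor is meant to reduce.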

Positional encoder
Skeleton data is a form of sequential data that requires a position encoder to supply the temporal position information inherent in the skeleton sequence. However, ViT employs learnable embedding vectors as its position encoder, which do not supply timing information, while the position encoder used in NLP generates significant superfluous positional encoding information when processing skeleton data.
To address this issue, this paper proposes a redefined position encoder, given by Eq. (5). PE(pos, 2l) denotes the output of the positional encoder, pos represents the position of the feature vector in the skeleton sequence, 2l and 2l + 1 indicate the parity of pos, and n signifies the dimensionality of the skeleton. The sine function is employed when pos is odd and the cosine function when pos is even; T corresponds to the maximum length of the skeleton sequence. To capture and retrieve the entire sequence information, the model adds a new vector * at the end of the feature vector, which is then used to embed the position encoder value for each frame.
The relative position of the skeleton sequence can be calculated using Eq. (6) when its offset is m (a positive integer). Upon deduction, an elegant dot product formula emerges, which coincides with the standard inner product formula in Euclidean space. This formulation reveals a crucial topological structure that can be described mathematically: the encoded result at any given position can be decomposed into the dot product of two summations.
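The relative-position property can be checked numerically. The sketch below uses the standard sinusoidal encoding of the original Transformer (sine on even dimensions, cosine on odd dimensions, base 10000), which is an assumption: the redefined encoder of Eq. (5) is not reproduced here, and the function name and `base` parameter are illustrative. For this standard form, the identity sin a sin b + cos a cos b = cos(a - b) makes the dot product of the encodings at pos and pos + m depend only on the offset m.

```python
import numpy as np

def sinusoidal_pe(num_pos, n, base=10000.0):
    """Standard Transformer positional encoding (assumed, not Eq. (5)).

    Returns an array of shape (num_pos, n); n must be even.
    """
    assert n % 2 == 0, "encoding dimension must be even"
    pos = np.arange(num_pos)[:, None]      # (num_pos, 1) frame indices
    l = np.arange(n // 2)[None, :]         # (1, n/2) frequency indices
    angle = pos / base ** (2 * l / n)      # (num_pos, n/2)
    pe = np.zeros((num_pos, n))
    pe[:, 0::2] = np.sin(angle)            # even dimensions 2l
    pe[:, 1::2] = np.cos(angle)            # odd dimensions 2l + 1
    return pe

def offset_dot(pe, pos, m):
    """Dot product between the encodings at positions pos and pos + m."""
    return pe[pos] @ pe[pos + m]
```

For a fixed offset m, `offset_dot(pe, pos, m)` returns the same value for every valid pos, namely the sum of cos(m · ω_l) over the frequencies ω_l = base^(-2l/n).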
The topology employed facilitates the acquisition of relative position results effortlessly.We quickly obtain its frame sequence by introducing position embedding.Its visual heat map is shown in Fig. 5, where the horizontal coordinate represents the dimension and the vertical coordinate represents the frame number.
The position encoder structure is illustrated in Fig. 3, where the blue rounded rectangles with frame sequence numbers represent the skeleton data and the pink rounded rectangles represent the positional embeddings corresponding to the skeleton data. We introduce a novel frame embedding vector with "T+1" and a novel positional embedding vector *, aiming to effectively capture and retrieve comprehensive timing sequence information.

Transformer encoder
The Transformer Encoder, comprising MLP and self-attention modules, is employed to transform input sequences into hidden representations. The initial input consists of the human pose embedding vector for each position, which is then encoded into a fixed-length hidden vector representation through multiple layers of self-attention and fully connected layers. Figure 6
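A single encoder layer of the kind described in this section can be sketched as below. This is a generic illustration, not the HAR-ViT configuration: it assumes one attention head, pre-layer normalization without learnable parameters, ReLU in the MLP and residual (skip) connections around both modules. The parameter dictionary keys are hypothetical names.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a (T, d) sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (T, T) frame-to-frame similarity
    return weights @ v

def encoder_layer(x, p):
    """One encoder layer: attention and MLP, each wrapped in a skip connection."""
    x = x + self_attention(layer_norm(x), p["wq"], p["wk"], p["wv"])
    hidden = np.maximum(0.0, layer_norm(x) @ p["w1"] + p["b1"])  # ReLU MLP
    return x + hidden @ p["w2"] + p["b2"]
```

Stacking several such layers keeps the sequence shape unchanged, so the vector appended at the end of the sequence can be read out after the last layer.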

Multi-layer perceptron (MLP) classifier
The multi-layer neural network is represented by Eq. (7), where W_1 and W_2 denote the weights of the first and second layers respectively, and b_1 and b_2 represent the corresponding biases. max(0, x) denotes the ReLU activation function. It should be noted that, through feed-forward propagation, this neural network can approximate any continuous or square-integrable function with arbitrary precision, thereby enabling accurate classification of any finite training sample set. The feed-forward network is what gives the model its classification capability. Specifically, the frame embedding vector * is extracted from the end of the Transformer's compressed coding vector and fed into the MLP classifier, as depicted in Fig. 6.
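The classifier described above can be sketched in a few lines. The sketch assumes that Eq. (7) has the usual two-layer form max(0, x W_1 + b_1) W_2 + b_2 and that the vector * is the last row of the encoder output; the function name and array layout are illustrative.

```python
import numpy as np

def mlp_classifier(tokens, w1, b1, w2, b2):
    """Two-layer ReLU classifier applied to the vector appended at the sequence end.

    tokens: (T + 1, d) encoder output; the last row is assumed to be the vector *.
    Returns one logit per action class.
    """
    cls = tokens[-1]                          # embedding vector * at the sequence end
    hidden = np.maximum(0.0, cls @ w1 + b1)   # first layer followed by ReLU
    return hidden @ w2 + b2                   # second layer produces class logits
```

The predicted class is the index of the largest logit, for example `int(np.argmax(mlp_classifier(tokens, w1, b1, w2, b2)))`.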

Ethical approval and informed consent
Data used in our study are publicly available, and ethical approval and informed consent were obtained in each original study.

Skeleton-based action datasets
To demonstrate the effect of the proposed HAR-ViT, four datasets were utilized in this paper: NTU RGB+D 60, NTU RGB+D 120, Kinetics-Skeleton 400 and our homemade data. A brief introduction follows.
NTU RGB+D. It is a large-scale human action recognition dataset. NTU RGB+D 60 contains 56,880 sequences over 60 classes. It provides the 3D Cartesian coordinates of 25 joints for each human in an action sample, captured by 3 Microsoft Kinect v2 cameras with different viewpoints. The action samples are performed by 40 volunteers in different age groups. NTU RGB+D 120 is an extended version of NTU RGB+D 60 with an additional 60 action classes and a total of 113,945 sequences. The datasets can be accessed publicly at https://rose1.ntu.edu.sg/dataset/actionRecognition/. For a fair comparison with SOTA methods, we use the four benchmarks recommended by the dataset authors: (1) NTU60 cross-subject (NTU60-Xsub): the 40 subjects are divided into training and testing groups. (2) NTU60 cross-view (NTU60-Xview): the data from camera views 2 and 3 are used for training, and the data from camera view 1 are used for testing. (3) NTU120 cross-subject (NTU120-Xsub): the 106 subjects are divided into training and testing groups. (4) NTU120 cross-setup (NTU120-Xset): the samples with even setup IDs are used for training, and the samples with odd setup IDs are used for testing.
Kinetics-Skeleton 400. The skeleton data includes 18 major joints of the human body. It contains more than 300,000 clips covering 400 action categories. Among 260,000 total samples, 240,000 samples are used for training and 20,000 for testing. The dataset can be accessed publicly at https://deepmind.com/research/open-source/kinetics.
Homemade datasets. We use 3 cameras with different views to capture 10 action classes; each action is performed by 10 volunteers (in our research laboratory), giving 100 videos in total.
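The cross-view and cross-setup rules above can be expressed as a small helper. The sketch assumes the NTU RGB+D sample naming scheme `SsssCcccPpppRrrrAaaa` (setup, camera, performer, replication, action), for example `S001C002P003R002A013`; the function names are hypothetical. The cross-subject benchmarks are omitted because the list of training subject IDs is not given in the text.

```python
import re

# Assumed NTU RGB+D sample name: setup, camera, performer, replication, action.
NAME_RE = re.compile(r"S(\d{3})C(\d{3})P(\d{3})R(\d{3})A(\d{3})")

def parse_name(name):
    """Extract the numeric IDs encoded in a sample name."""
    match = NAME_RE.search(name)
    if match is None:
        raise ValueError(f"unrecognised sample name: {name!r}")
    keys = ("setup", "camera", "performer", "replication", "action")
    return dict(zip(keys, map(int, match.groups())))

def xview_split(name):
    """Cross-view: camera views 2 and 3 for training, view 1 for testing."""
    return "test" if parse_name(name)["camera"] == 1 else "train"

def xset_split(name):
    """Cross-setup: even setup IDs for training, odd setup IDs for testing."""
    return "train" if parse_name(name)["setup"] % 2 == 0 else "test"
```

For example, `xview_split("S001C002P003R002A013")` places the sample in the training set because it was recorded by camera 2.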

Experiment description
The baseline and platform details are depicted as:

Training comparison with SOTA methods
We compare the training results with three open-source SOTA methods: ST-GCN, 2s-AGCN and DSTA-Net.
The number of model parameters of each compared algorithm is presented in Table 2. The model proposed in this paper has 57.24% fewer parameters than 2s-AGCN, making it more efficient and lightweight.

Experiments on NTU RGB+D 60
In the NTU60-Xsub experiment, the training set and testing set are divided in a 1:1 ratio, ensuring complete independence between the subjects. This assesses the model's capability of recognizing unfamiliar subjects. The training results of each compared algorithm are illustrated in Fig. 7. After the same number of training epochs, the average accuracy of HAR-ViT is higher than that of the other three methods, and it reaches a high level after only 30 training epochs, indicating that the proposed algorithm fits faster.
In the NTU60-Xview experiment, the ratio between the training set and the test set is 2:1. This experiment allows a more specific analysis of the model's effectiveness in recognizing actions from an unconventional angle (view 1) when trained on the other views (views 2 and 3). The training results of each compared algorithm are depicted in Fig. 8. Our proposed model demonstrates exceptional adaptability to diverse shooting angles, exhibiting a notable 5.9% improvement compared to NTU60-Xsub. Moreover, HAR-ViT shows a more pronounced improvement than the other algorithms.

Experiments on NTU RGB+D 120
In the NTU120-Xsub experiment, the ratio between the training set and the test set is 1:1. The results of the four methods are illustrated in Fig. 9. Compared to NTU RGB+D 60, the average accuracy decreases, which can be attributed to the larger number of categories to be identified. Our model is more effective than the other algorithms.
In the NTU120-Xset experiment, the training results of each model are depicted in Fig. 10. The accuracy of all the models decreases slightly compared to the NTU120-Xsub experiment, but the average accuracy of our model is higher than that of 2s-AGCN and the other models. This demonstrates that our method has more pronounced advantages than the other algorithms.

Experiments on Kinetics-Skeleton 400
The training results of each model on Kinetics-Skeleton 400 are shown in Fig. 11. The accuracy of all the models decreases slightly compared to NTU RGB+D, because the number of action categories covered is 3.3 times that of NTU RGB+D. Nevertheless, the average accuracy of our model is higher than that of 2s-AGCN and the other models, which shows that the algorithm in this paper has stronger generalization ability.
... than 2s-AGCN. DSTA-Net achieves the best result under X-Sub; it effectively extracts temporal and spatial features through spatio-temporal attention. HAR-ViT is 0.44% lower than DSTA-Net under X-Sub, potentially due to the balance of spatio-temporal convolution in DSTA-Net. Notably, HAR-ViT outperforms AAM-GCN and LCK-GCN, which simulate remote features using attention mechanisms as in 2s-AGCN, by 0.66% and 0.36% respectively. Shift-GCN achieves the second-best result under X-view, where HAR-ViT outperforms it by 0.23%. The Shift graph convolution operation and lightweight point convolution in Shift-GCN enhance its spatial feature extraction capability. However, the eAGCL in HAR-ViT adapts effectively to skeleton data through trainable matrices, mitigating the impact of positional errors and yielding better results than Shift-GCN. Under X-view, HAR-ViT achieves improvements of 0.33% and 0.43% over AAM-GCN and DSTA-Net respectively. AAM-GCN exhibits limited generalization ability when confronted with different angles, while DSTA-Net lacks a fixed skeleton graph in its spatial attention mechanism. In contrast, the eAGCL within HAR-ViT maintains a consistent skeleton graph and ensures model stability through residual connections, leading to better performance across different shooting angles than the other algorithms.

Experiments on NTU RGB+D 120
The comparison of 6 methods on NTU RGB+D 120 is shown in Table 4; the best results are in bold and the second-best results are in italics. The recognition accuracy of our method is 5.81% higher under X-Sub and 4.12% higher under X-Set than that of the baseline 2s-AGCN.
The accuracy of our method is 1.01% and 0.02% higher than that of the second-best DSTA-Net under X-Sub and X-Set respectively. DSTA-Net expands the receptive field through the self-attention mechanism and achieves excellent recognition performance. However, the feature decoupling strategy of DSTA-Net involves four streams, which leads to expensive computation for large-scale samples. In contrast, our HAR-ViT achieves similar performance using only one stream.
The classification performance of the four methods on four actions, "drinking water", "touch pocket", "shaking hands" and "punch/clap", on NTU RGB+D is shown in Fig. 12. All four methods correctly identify the four actions. Notably, our method exhibits higher confidence than the baseline 2s-AGCN for all four actions, with an average confidence level of 94% compared to 80% for 2s-AGCN. The confidence of our method is the highest among the four methods, which again illustrates its validity.
The average inference time of the four algorithms on the test data of the standard datasets is presented in Table 5, with the best results highlighted in bold and the second-best results in italics. HAR-ViT achieves an average inference time of 4.75 s, which is 3.5 s faster than both 2s-AGCN and DSTA-Net. This improvement can be attributed to HAR-ViT's reduced parameter count and its substitution of exponential Gaussian function calculations with matrix multiplication, which reduces the complexity of inference computations.

Experiments on Kinetics-Skeleton 400
The comparison on Kinetics-Skeleton 400 is shown in Table 6; the best results are in bold and the second-best results are in italics. Our HAR-ViT also achieves clear performance improvements (+2.0% under Top-1 and +2.2% under Top-5) over the baseline 2s-AGCN. Its recognition accuracy is 0.3% higher than that of LKA-GCN and AGCN under Top-1. Under Top-5, ours is also 0.4% higher than AAM-GCN. The performance of AGCN is better than that of our method; it utilizes a multi-stream branch structure and has strong generalization ability.
The classification performance of the four methods on four actions, "arm wrestling", "bar tending", "bending back" and "book binding", under Top-5 on Kinetics-Skeleton 400 is shown in Fig. 13. All four methods correctly identify the four actions. Notably, our method exhibits higher confidence than the baseline 2s-AGCN for all four actions, with an average confidence level of 54% compared to 39% for 2s-AGCN. Because the dataset covers far more classes than NTU RGB+D, the confidence level drops significantly for all four methods.
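Top-1 and Top-5 are used here in the usual sense: a sample counts as correct under Top-k if its true class is among the k highest-scoring predictions. A minimal sketch of that metric follows; the function name and array layout are illustrative and not taken from the evaluation code of this work.

```python
import numpy as np

def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores.

    scores: (num_samples, num_classes) class scores; labels: (num_samples,) class indices.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    top_k = np.argsort(scores, axis=1)[:, -k:]        # indices of the k largest scores per row
    hits = (top_k == labels[:, None]).any(axis=1)     # True where the label is in the top k
    return float(hits.mean())
```

Top-5 accuracy is never lower than Top-1 accuracy on the same predictions, since the top-1 class is always contained in the top-5 set.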

Experiment on real-world datasets
To demonstrate the generalization ability of our algorithm, we also test it on the homemade datasets in addition to the widely used standard datasets. The classification results for four actions, "clapping", "brush teeth", "sneeze/cough" and "salute", are shown in Fig. 14. All four methods recognize the four actions correctly; however, our method exhibits higher confidence than the other three methods, even for the three similar actions ("brush teeth", "sneeze/cough" and "salute"). Our method reaches an average confidence level of 96%, compared to 86% for 2s-AGCN.
Table 7 presents the average inference time of the four algorithms on the homemade datasets. The best results are highlighted in bold and the second-best results are in italics. HAR-ViT achieves an average inference time that is 4 s and 3 s faster than 2s-AGCN and DSTA-Net respectively. Since the skeleton data in the real environment consists of single actions, the inference time is significantly shorter than that on the standard datasets in Table 5.

Figure 1. Human skeleton joint location information.

Figure 4. The network structure of eAGCL.

Figure 7. Training results of compared methods in NTU60-Xsub experiment.

Figure 8. Training results of compared methods in NTU60-Xview experiment.

Figure 9. Training results of compared methods in NTU120-Xsub experiment.

Figure 10. Training results of compared methods in NTU120-Xset experiment.

Figure 11. Training results of compared methods in Kinetics-Skeleton 400 experiment.
In the experiment, the comparison baseline is the representative 2s-AGCN 13, consisting of 10 layers of TCN-GCN. To demonstrate the effectiveness of our method, the training and test samples in this work are consistent with the baseline. No additional training strategies are applied in this work. Platform details: the experimental setup for this study involves an NVIDIA GeForce RTX 2070 SUPER server running Ubuntu 20.04.5 LTS (CPU: 4 cores, memory: 16 GB, video memory: 8 GB), paddle 2.5.1 and CUDA 11.4. Table 1 presents the training configuration parameters used in HAR-ViT.

Table 1. The training configuration parameters of HAR-ViT.

Table 2. Comparison of parameter number of four methods.

Table 3. Comparison of accuracy of nine methods on NTU-RGB+D 60.

Table 4. Comparison of accuracy of nine methods in cross-view experiment.

Table 5. Mean inference time on the standard datasets.

Table 7. Mean inference time on the homemade datasets.

Table 8. Ablation experiment results of our method.

Table 9. Ablation experiment results of the depth of transformer encoder.