Article

A Novel Two-Stream Transformer-Based Framework for Multi-Modality Human Action Recognition

1 School of Computer Science, Sichuan University, Chengdu 610065, China
2 Chongqing Innovation Center of Industrial Big-Data Co., Ltd., Chongqing 400707, China
3 Institute for Industrial Internet Research, Sichuan University, Chengdu 610065, China
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Appl. Sci. 2023, 13(4), 2058; https://doi.org/10.3390/app13042058
Submission received: 28 December 2022 / Revised: 2 February 2023 / Accepted: 3 February 2023 / Published: 5 February 2023

Abstract: Due to the great success of the Vision Transformer (ViT) in image classification tasks, many pure Transformer architectures for human action recognition have been proposed. However, very few works have attempted to use the Transformer for bimodal action recognition, i.e., using both the skeleton and RGB modalities. As shown in many previous works, the RGB and skeleton modalities are complementary to each other in human action recognition tasks. How to use both RGB and skeleton modalities for action recognition in a Transformer-based framework remains a challenge. In this paper, we propose RGBSformer, a novel two-stream pure Transformer-based framework for human action recognition using both RGB and skeleton modalities. Using only RGB videos, we can acquire skeleton data and generate the corresponding skeleton heatmaps. Then, skeleton heatmaps and RGB frames are input to the Transformer at different temporal and spatial resolutions. Because the skeleton heatmaps are primary features compared to the original RGB frames, we use fewer attention layers in the skeleton stream. In addition, two ways are proposed to fuse the information of the two streams. Experiments demonstrate that the proposed framework achieves state-of-the-art performance on four benchmarks: the three widely used datasets Kinetics400, NTU RGB+D 60, and NTU RGB+D 120, and the fine-grained dataset FineGym99.

1. Introduction

Human action recognition refers to classifying and understanding what humans are doing in the input data. It has a wide range of applications in real life. For example, human action recognition can be used in home monitoring to observe the behavioral activities of the elderly and to detect dangerous actions such as falls in a timely manner [1], and it can help an automatic navigation system analyze and predict the actions of pedestrians [2]. Commonly used inputs for human action recognition algorithms include RGB images and videos [3], skeleton [4], depth [5], point cloud [6], and so on. Among these modalities, the RGB modality is the easiest to acquire and manipulate, requiring only RGB cameras. However, action recognition algorithms based on the RGB modality are often affected by appearance information, such as background, illumination, and color textures [7]. Besides the RGB modality, the skeleton modality is the most popular. As shown in Figure 1, the skeleton modality represents the human body through joint points and limb lines; it is therefore naturally concise yet very precise in expressing motion information. However, for actions involving human–object interaction, the key appearance information cannot be captured, and skeleton-based human action recognition methods perform poorly. By combining the skeleton modality with the RGB modality, both the appearance information of the RGB modality and the precise dynamic information of the skeleton modality can be utilized. Therefore, methods based on both RGB and skeleton modalities are the best choice for action recognition tasks in practical applications.
At present, some works [8,9,10,11] have proposed action recognition algorithms based on both RGB and skeleton modalities. The methods in [8,9] use GCN networks to extract skeleton features, treating human joints as points and human limbs as lines, as shown in Figure 1. However, it has been shown that GCN-based methods still have many limitations in robustness, interoperability, and scalability [10]. Therefore, Duan et al. [10] proposed PoseConv3D, which expresses skeletons in the form of pseudo heatmaps and then feeds the heatmaps into a 3D CNN. However, each convolution operation can only process local information, which is a very small part of the entire input data. The size of the “patch” proposed in ViT [12] greatly increases the “receptive field” of the network. There have been many successful Transformer-based applications in computer vision, such as image classification [13,14,15], video detection [16,17], and action recognition [18,19]. Therefore, it is very meaningful to explore Transformer-based dual-modality action recognition methods. One existing work [11], dubbed TP-ViT, has tried to use the Transformer to handle RGB and skeleton bimodal action recognition. However, TP-ViT still uses absolute coordinates to represent skeleton points, which leads to stability problems and poor performance in multi-person scenes. Hence, how to use the Transformer to make the most of the RGB and skeleton modalities is still a challenge.
In this paper, in order to make the most of skeleton data and RGB data efficiently, we propose a two-stream Transformer framework. In this framework, inspired by SlowFast [26], inputting frames at a high temporal resolution, i.e., using fast refreshing frames, can effectively model potentially fast-changing motion. In action recognition, motion information changes quickly compared to the almost constant appearance information. Therefore, in order to capture action information precisely, the skeleton stream is input to the Transformer with a higher temporal resolution and a lower spatial resolution than the RGB stream. To prevent the differing coordinates produced by different pose extractors from affecting recognition accuracy, and to avoid the large increase in computation in multi-person scenes caused by treating skeleton joints as points, we generate skeleton heatmaps and use them as input to the Transformer. In addition, because the skeleton heatmaps are a primary feature representation, using the same network structure as the RGB stream could lead to overfitting; accordingly, the skeleton stream has fewer attention layers than the RGB stream. We also propose two fusion methods, which integrate RGB and skeleton features conveniently and give full play to the complementary characteristics of the two modalities in action recognition. In summary, the contributions of this paper are as follows:
  • We propose a two-stream pure Transformer-based framework for human action recognition. The framework can use RGB videos to obtain skeleton heatmaps and then utilizes RGB and skeleton modalities as input to complete action recognition;
  • For the Transformer that processes skeleton data, the input has a higher temporal resolution and a lower spatial resolution. Structurally, it has fewer attention layers, which allows the network to process skeleton heatmaps and capture motion information accurately;
  • We propose two fusion methods of the two streams, which can give full play to the complementarity of RGB and skeleton modalities in action recognition.
We organize this paper as follows. We discuss the related works in Section 2. In Section 3, we describe the details of our framework. The experimental results and ablation studies on four popular action recognition benchmarks are shown in Section 4. Finally, the conclusion and future work are presented in Section 5.

2. Related Works

We discuss the works related to ours in this section, covering skeleton-based action recognition methods and recent Transformer-based action recognition methods that operate on the RGB or skeleton modality.

2.1. Skeleton-Based Action Recognition

Skeleton sequences encode the trajectories of human joints and limbs, which accurately represent the motion of the human body. Skeleton data can be extracted from motion capture systems or obtained by applying pose estimation algorithms to RGB videos. However, deploying a motion capture system is cumbersome, and many application scenarios cannot support it. Therefore, applying pose estimation algorithms to RGB videos is the popular choice. Because skeletons express motion information simply yet precisely and are robust to varied clothing textures and noisy backgrounds, many methods have tried to use skeleton data to solve action recognition problems [4,20,21]. Among them, graph convolutional network (GCN) methods, which treat joints as nodes and limbs as edges, are the most popular way to deal with the graph structure of skeletons. However, GCN-based methods have limitations in robustness, interoperability, and scalability [10]. In detail, the joint coordinates obtained by different pose estimators often have deviations, which affect the human action recognition algorithm. Secondly, the input of a GCN is a graph, whereas the information of other modalities usually comes as images, which makes it difficult to fuse the skeleton modality with other modalities. At the same time, the complexity of GCN-based algorithms in multi-person scenarios increases greatly. Therefore, using a more appropriate way to express the skeleton is critical for human action recognition tasks.
Some works have tried to use CNNs instead of GCNs to process skeleton data, treating the skeleton data as an image rather than a graph. Hou et al. [22] encoded skeleton sequences into color texture images containing spatio-temporal information, called skeleton spectra, and used convolutional neural networks (ConvNets) to process the skeleton spectra for action recognition. Wang et al. [23] proposed joint trajectory maps (JTMs), which record the trajectories of joints in the 3D skeleton as images, and applied ConvNets to recognize human actions in real time based on the JTMs. Recently, Duan et al. [10] proposed PoseConv3D based on skeleton data, whose input is a 3D heatmap volume instead of a sequence of graphs. Hence, inspired by [10], our work preprocesses the skeleton data as pseudo heatmaps and then feeds them into the Transformer to obtain the characterized features.

2.2. Transformer for Action Recognition

Since the great success of the Transformer in natural language processing, many works [19,21,24,25] have employed Transformers for human action recognition. Inspired by ViT [12], a pure Transformer architecture that has gained great attention in image classification, Bertasius et al. [19] proposed TimeSformer, a divided spatio-temporal attention mechanism that extracts spatial and temporal features separately. Each video is decomposed into a sequence of non-overlapping frame-level patches, which are then fed into standard Transformer blocks equipped with the proposed attention mechanism. Yan et al. [24] proposed a multi-view transformer for video recognition (MTV). MTV feeds different views of the input to different encoders and uses lateral connections to fuse cross-view information. Some works have also applied the Transformer to skeleton-based action recognition tasks. Zhang et al. [21] designed a Spatial Transformer Block and a Directional Temporal Transformer Block to model skeleton sequences in the spatial and temporal dimensions, respectively. Plizzari et al. [25] introduced a Spatial-Temporal Transformer network, where the spatial and temporal self-attention modules learn intra-frame joint interactions and inter-frame motion dynamics, respectively. The above Transformer-based methods are all based on a single modality. However, for human action recognition, using only RGB data is affected by noisy appearance factors, and using only the skeleton modality lacks appearance information for human–object interaction. Combining the two modalities provides richer and more comprehensive information, so an action recognition algorithm based on both modalities can achieve higher accuracy.
Therefore, we propose a two-stream Transformer architecture; one stream processes RGB data, and the other stream processes skeleton data. At the same time, according to the different characteristics of these two modalities, different network structures are set up, and two fusion methods are tried to fully integrate the RGB and skeleton features.

3. Proposed Framework

We propose RGBSformer, a two-stream Transformer-based approach for RGB video-based human action recognition. An overview of RGBSformer is depicted in Figure 2. In this section, we start with a general overview of the framework and the differences between the skeleton stream and the RGB stream. We then describe the detailed architectures, one processing pseudo skeleton heatmaps and the other processing RGB frames, and introduce how the information of the two streams is fused.
An overview of the framework. Inspired by SlowFast [26], we design the two-stream framework with one slow stream that captures spatial features and one fast stream that works at a fast frame rate and focuses on temporal information. Due to the lack of appearance information and detailed shape information, skeleton data naturally focus more on motion information; we argue that skeleton data provide more temporal information than spatial information. Hence, we take the skeleton heatmap stream as the fast pathway and the RGB frame stream as the slow pathway. As shown in Figure 2, the overall framework can be summarized as follows. Firstly, after acquiring RGB videos, we use HRNet [27] to acquire skeleton data and generate the corresponding skeleton heatmaps. Secondly, the skeleton heatmaps and RGB frames are input to RGBSformer at different temporal and spatial resolutions. Thirdly, after decomposition, they are input to the spatial Transformer encoders; the encoder of the skeleton stream has fewer layers than that of the RGB stream. Fourthly, the frame-level representations obtained from the spatial Transformer encoder of the skeleton stream are fused into the RGB stream, followed by the temporal Transformer encoder and the MLP head. Finally, after the classification heads, the classification scores of the two streams are averaged to obtain the final classification result.
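To make the dataflow concrete, the following sketch walks through the five steps above. The callables (pose_fn, heatmap_fn) and the two stream modules are caller-supplied placeholders, not the authors' actual API; the final averaging corresponds to the score fusion described in Section 3.3.

```python
def rgbsformer_forward(rgb_video, pose_fn, heatmap_fn, skeleton_stream, rgb_stream):
    """A minimal sketch of the RGBSformer pipeline; all callables are caller-supplied.

    rgb_video:       RGB frames of one clip
    pose_fn:         frames -> per-frame 2D joints (e.g., an HRNet wrapper)
    heatmap_fn:      joints -> stacked pseudo heatmaps (Section 3.1)
    skeleton_stream: fast pathway; returns (frame-level tokens, class scores)
    rgb_stream:      slow pathway; consumes the skeleton frame-level tokens (Section 3.3)
    """
    joints = pose_fn(rgb_video)                            # step 1: pose estimation
    heatmaps = heatmap_fn(joints)                          # step 2: pseudo heatmaps
    skel_tokens, skel_scores = skeleton_stream(heatmaps)   # step 3: fast pathway
    rgb_scores = rgb_stream(rgb_video, skel_tokens)        # step 4: slow pathway + token fusion
    return (skel_scores + rgb_scores) / 2                  # step 5: average the two scores
```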
Difference between the two streams. We design the fast pathway in three ways so that it captures temporal information precisely. Compared to the slow pathway (the RGB stream), the skeleton stream has three differences.
  • Higher temporal resolution. The skeleton heatmap stream works at a higher frame rate (temporal resolution) than the RGB stream. The fast refreshing frames can effectively model potentially fast-changing motion;
  • Lower spatial resolution. As presented in Section 3.1, the heatmaps have a smaller size than the original frames; the noisy background and uninteresting regions are discarded during cropping. Meanwhile, we feed the generated heatmaps into the Transformer directly, without any further cropping or resizing operations;
  • Fewer attention layers. Compared to the original RGB frames, skeleton heatmaps are already mid-level features for human action recognition. In the Transformer architecture, attention blocks are stacked repeatedly on top of each other to construct the final model. Compared to the stream processing raw RGB data, the fast pathway with fewer attention layers can extract features more effectively.

3.1. The Skeleton Stream

Large amounts of readily available RGB videos captured by RGB cameras and progressively improving pose estimation algorithms have led many recent works to acquire skeleton data by applying pose estimation algorithms to RGB videos. The authors of [10] reviewed key aspects of pose extraction methods to find the best practice for final recognition accuracy. In this paper, following [10], we obtain skeleton data by adopting HRNet, a 2D top-down pose estimator proposed in [27].
Skeleton joint representation. By applying the pose estimator, we obtain the locations of $K$ human joints from a frame $F$ of size $W \times H \times 3$, where $K$ is the number of skeleton joints, and $W$ and $H$ are the width and height of the frame. For the $k$-th skeleton joint, we obtain the representation $(x_k, y_k, c_k)$, where $(x_k, y_k)$ is the coordinate of the joint in the frame, and $c_k$ is the confidence score of the joint.
Pseudo skeleton heatmaps. We describe a 2D pose estimated from a frame as a heatmap of size $K \times W \times H$. We can directly use the heatmap produced by HRNet and zero-pad it to match the bounding box of the original frame. The joint heatmap $J$, composed of $K$ Gaussian maps centered at the joints, can be described as:
$$J_{kij} = \exp\left(-\frac{(i - x_{a_k})^2 + (j - y_{a_k})^2}{2\sigma^2}\right) \times c_{a_k},$$
where $\sigma$ controls the variance of the Gaussian maps, $(x_{a_k}, y_{a_k})$ is the coordinate of the joint $a_k$, and $c_{a_k}$ is the confidence score of joint $a_k$.
The heatmap $L$ of the limb between two joints $a_k$ and $b_k$ can be described as:
$$L_{kij} = \exp\left(-\frac{\mathcal{D}\big((i,j),\, \mathrm{seg}[a_k, b_k]\big)^2}{2\sigma^2}\right) \times \min(c_{a_k}, c_{b_k}),$$
where $\mathrm{seg}[a_k, b_k]$ is the segment between the two joints, and the function $\mathcal{D}$ calculates the distance from the point $(i, j)$ to the segment $[(x_{a_k}, y_{a_k}), (x_{b_k}, y_{b_k})]$.
From the above formulas, it can be seen that, compared to skeleton-based methods that treat joints and limbs as a graph, heatmaps handle multi-person cases easily and at low cost by accumulating the $k$-th Gaussian map of every person without enlarging the heatmap. Examples of the generated heatmaps are shown in Figure 3.
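The following NumPy sketch implements Equations (1) and (2) directly. The limb list, the value of $\sigma$, and the element-wise maximum used to accumulate multiple people are illustrative choices, not the authors' exact settings.

```python
import numpy as np

# A COCO-style limb list (pairs of joint indices); illustrative only.
COCO_LIMBS = [(5, 7), (7, 9), (6, 8), (8, 10), (11, 13), (13, 15), (12, 14), (14, 16),
              (5, 6), (11, 12), (5, 11), (6, 12)]

def joint_heatmap(h, w, joints, sigma=0.6):
    """Equation (1): one Gaussian map per joint; joints is (K, 3) rows of (x, y, conf)."""
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(joints), h, w), dtype=np.float32)
    for k, (x, y, c) in enumerate(joints):
        maps[k] = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2)) * c
    return maps

def limb_heatmap(h, w, joints, limbs=COCO_LIMBS, sigma=0.6):
    """Equation (2): one Gaussian map per limb, based on point-to-segment distance."""
    ys, xs = np.mgrid[0:h, 0:w]
    maps = np.zeros((len(limbs), h, w), dtype=np.float32)
    for k, (a, b) in enumerate(limbs):
        (xa, ya, ca), (xb, yb, cb) = joints[a], joints[b]
        dx, dy = xb - xa, yb - ya
        t = np.clip(((xs - xa) * dx + (ys - ya) * dy) / (dx * dx + dy * dy + 1e-6), 0.0, 1.0)
        dist2 = (xs - (xa + t * dx)) ** 2 + (ys - (ya + t * dy)) ** 2   # D((i,j), seg)^2
        maps[k] = np.exp(-dist2 / (2 * sigma ** 2)) * min(ca, cb)
    return maps

def multi_person_joint_heatmap(h, w, people, sigma=0.6):
    """Accumulate the k-th map over all people (element-wise max is one simple choice)."""
    return np.maximum.reduce([joint_heatmap(h, w, p, sigma) for p in people])
```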
Reformulating to the input. To feed the 2D heatmaps to RGBSformer, we stack all the heatmaps along the temporal dimension. In this process, the heatmap does not need to be as large as the original frame: most of the time, people occupy only a small part of the video frame, and the zero-padded blank background would only take up more storage space and may hurt the performance of the network. Hence, we crop all the frames to the size of the smallest detection bounding box that envelops all 2D poses. The human-centered cropping also helps the backbone network extract motion features precisely. After cropping, we obtain the 3D heatmap videos that are the input of the Transformer.
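A small sketch of this human-centered cropping and temporal stacking, assuming per-frame heatmaps and detection boxes are already available; the function name and box format are our own.

```python
import numpy as np

def crop_and_stack(heatmaps, boxes):
    """heatmaps: list of (K, H, W) per-frame maps; boxes: per-frame (x0, y0, x1, y1).

    Crop every frame to the smallest box enveloping all detected poses in the clip,
    then stack along time into a (F, K, h, w) volume for the fast pathway."""
    x0 = int(min(b[0] for b in boxes)); y0 = int(min(b[1] for b in boxes))
    x1 = int(max(b[2] for b in boxes)); y1 = int(max(b[3] for b in boxes))
    return np.stack([hm[:, y0:y1, x0:x1] for hm in heatmaps], axis=0)
```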
Video Decomposition. Following [18], we denote a heatmap video as $V \in \mathbb{R}^{F \times H \times W \times 3}$, where $H$ and $W$ are the height and width of one heatmap, and $F$ is the number of heatmaps in $V$. We then decompose the heatmap video into $N$ non-overlapping spatio-temporal “tubes” [18] $x_1, x_2, \ldots, x_N \in \mathbb{R}^{f \times h \times w \times 3}$, where $N = \frac{F}{f} \times \frac{H}{h} \times \frac{W}{w}$. Next, we linearly map each tube $x_i$ into a token $z_i = E x_i$. Finally, we concatenate all tube tokens $z_i$ into a vector $z^{(0)}$. The same as BERT [28], we add a special learnable vector $z_{cls} \in \mathbb{R}^{d}$ in the first position of the token sequence to represent the embedding of the classification token, and a positional embedding $p_{pos} \in \mathbb{R}^{(N+1) \times d}$ is also added to this sequence.
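A PyTorch sketch of this tubelet embedding. A 3D convolution whose kernel equals its stride is a standard equivalent of the linear map $E$ applied to non-overlapping tubes; the default dimensions assume an 8-frame 224 × 224 clip with 2 × 16 × 16 tubes and are only illustrative.

```python
import torch
import torch.nn as nn

class TubeletEmbedding(nn.Module):
    """Tube decomposition and linear token mapping (a sketch).

    Defaults assume an 8-frame 224x224 clip with 2x16x16 tubes, i.e. N = 4*14*14."""
    def __init__(self, tube=(2, 16, 16), in_ch=3, dim=768, num_tubes=4 * 14 * 14):
        super().__init__()
        self.proj = nn.Conv3d(in_ch, dim, kernel_size=tube, stride=tube)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))              # z_cls
        self.pos = nn.Parameter(torch.zeros(1, num_tubes + 1, dim))  # p_pos

    def forward(self, video):                       # video: (B, 3, F, H, W)
        z = self.proj(video)                        # (B, dim, F/f, H/h, W/w)
        z = z.flatten(2).transpose(1, 2)            # (B, N, dim) tube tokens z_i
        cls = self.cls.expand(z.size(0), -1, -1)    # prepend the classification token
        return torch.cat([cls, z], dim=1) + self.pos
```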
In the fast pathway, there are $L'$ self-attention blocks. At block $l$, we compute the query/key/value of each tubelet from the previous layer's embedding vector $z^{(l-1)}$:
$$q_{(p,t)}^{(l,a)} = W_Q^{(l,a)}\,\mathrm{LayerNorm}\left(z_{(p,t)}^{(l-1)}\right) \in \mathbb{R}^{D_h},$$
$$k_{(p,t)}^{(l,a)} = W_K^{(l,a)}\,\mathrm{LayerNorm}\left(z_{(p,t)}^{(l-1)}\right) \in \mathbb{R}^{D_h},$$
$$v_{(p,t)}^{(l,a)} = W_V^{(l,a)}\,\mathrm{LayerNorm}\left(z_{(p,t)}^{(l-1)}\right) \in \mathbb{R}^{D_h},$$
where $a = 1, \ldots, A$ indexes the attention heads, $A$ is the total number of attention heads, $p = 1, \ldots, N$ denotes the spatial location, and $t = 1, \ldots, \frac{F}{f}$ denotes the temporal index.
Spatial Transformer Encoder. In each block, we first compute the attention among tokens with the same temporal index. The self-attention weights are computed as follows:
$$\alpha_{(p)}^{(l,a)\,\mathrm{space}} = \mathrm{softmax}\left(\frac{q_{(p)}^{(l,a)}}{\sqrt{D_h}} \cdot \left[\, k_{(0)}^{(l,a)} \;\; \left\{ k_{(p')}^{(l,a)} \right\}_{p' = 1, \ldots, N} \right]\right),$$
where $D_h = \frac{D}{A}$ and $D = (N+1) \times d$.
The encoding $z_{(p)}^{(l)}$ at block $l$ is obtained as follows:
$$s_{(p)}^{(l,a)} = \alpha_{(p),(0)}^{(l,a)}\, v_{(0)}^{(l,a)} + \sum_{p'=1}^{N} \alpha_{(p),(p')}^{(l,a)}\, v_{(p')}^{(l,a)},$$
$$z_{(p)}^{(l)\,\mathrm{space}} = W_O \begin{bmatrix} s_{(p)}^{(l,1)} \\ \vdots \\ s_{(p)}^{(l,A)} \end{bmatrix} + z_{(p)}^{(l-1)},$$
where $s$ is the concatenation of the vectors from all heads.
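A PyTorch sketch of one divided space-attention block following Equations (3)–(8). nn.MultiheadAttention supplies the per-head projections and the output matrix $W_O$; the token layout and the decision to leave the classification token unchanged in the spatial step are simplifying assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn

class SpatialAttentionBlock(nn.Module):
    """Divided space attention sketch (Eqs. (3)-(8)): every patch token attends only
    to tokens sharing its temporal index, plus the classification token. Assumes the
    token layout [cls, (t=1, p=1..n), (t=2, p=1..n), ...] with n patches per frame."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.norm = nn.LayerNorm(dim)                                     # LayerNorm of Eqs. (3)-(5)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)   # heads + W_O

    def forward(self, z, t_frames, n_space):
        cls, tok = z[:, :1], z[:, 1:]                  # z: (B, 1 + t_frames*n_space, dim)
        z_norm = self.norm(z)
        cls_n, tok_n = z_norm[:, :1], z_norm[:, 1:]
        B, _, D = tok.shape
        q = tok_n.reshape(B * t_frames, n_space, D)    # group patch tokens per frame
        cls_rep = cls_n.repeat_interleave(t_frames, dim=0)
        kv = torch.cat([cls_rep, q], dim=1)            # each frame also attends to cls, Eq. (6)
        out, _ = self.attn(q, kv, kv)                  # weighted sum of values, Eq. (7)
        out = out.reshape(B, t_frames * n_space, D)
        # Residual connection of Eq. (8); the cls token is left unchanged here,
        # a simplification -- the paper does not spell out its spatial update.
        return torch.cat([cls, tok + out], dim=1)
```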
Temporal Transformer Encoder. We obtain the frame-level representation $z_{(p)}^{(l)\,\mathrm{space}}$, which can be seen as the classification token $z_{cls}^{(L')\,\mathrm{space}}$ output by the last spatial attention block. We then denote the frame-level representation of each frame as $h_i \in \mathbb{R}^{d}$ and concatenate all the frame-level tokens:
$$H = \left[ h_1, \ldots, h_{\frac{F}{f}} \right] \in \mathbb{R}^{\frac{F}{f} \times d}.$$
The encoding $H$ is then fed into the temporal Transformer encoder, which consists of $L_t$ layers, to compute attention across different temporal indexes; new key/query/value vectors are obtained using Equations (3)–(5). The temporal attention is computed by the following formula:
$$\alpha_{(p,t)}^{(l,a)\,\mathrm{time}} = \mathrm{softmax}\left(\frac{q_{(p,t)}^{(l,a)}}{\sqrt{D_h}} \cdot \left[\, k_{(0,0)}^{(l,a)} \;\; \left\{ k_{(p,t')}^{(l,a)} \right\}_{t' = 1, \ldots, \frac{F}{f}} \right]\right).$$
As with the spatial attention, Equations (7) and (8) are applied to obtain the encoding, which is then passed to Equation (11). We thus obtain the final classification token, which is sent to an MLP head consisting of two linear projections separated by a GELU [29]:
$$z_{(p)}^{(l)} = \mathrm{MLP}\left(\mathrm{LayerNorm}\left(z_{(p)}^{(l)}\right)\right) + z_{(p)}^{(l)}.$$
Finally, the classification results are obtained by a classification head consisting of a linear projection and a cross-entropy loss function, where the loss can be expressed as:
$$Loss = -\sum_{i=0}^{C-1} y_i \log(p_i),$$
where $C$ is the number of categories, $y_i$ is the $i$-th element of the one-hot label vector, and $p_i$ is the predicted probability of the $i$-th category.
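A sketch of the temporal encoder and heads described above, assuming the frame-level tokens are already available. Mean pooling over frames stands in for the temporal classification token, and the hidden sizes are illustrative; training pairs the logits with nn.CrossEntropyLoss, which implements Equation (12).

```python
import torch.nn as nn

class TemporalHead(nn.Module):
    """Temporal Transformer encoder over frame-level tokens, followed by the
    MLP head of Eq. (11) and a linear classifier trained with cross-entropy (Eq. (12))."""
    def __init__(self, dim=768, heads=12, layers=4, num_classes=99):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4,
                                           batch_first=True, norm_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=layers)
        self.mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim * 4),
                                 nn.GELU(), nn.Linear(dim * 4, dim))
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, frame_tokens):     # frame_tokens: (B, F/f, dim)
        h = self.temporal(frame_tokens)  # attention across temporal indexes
        z = h.mean(dim=1)                # mean pool over frames (stand-in for the cls token)
        z = z + self.mlp(z)              # residual MLP head, Eq. (11)
        return self.fc(z)                # logits; pair with nn.CrossEntropyLoss (Eq. (12))
```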
The detailed structure is shown in Figure 4:

3.2. The RGB Stream

RGB data provide rich appearance information and detailed shape information, but they are sensitive to background, color texture, illumination, and viewpoint. When people interact with objects, in particular, appearance information is very important for action recognition. For these reasons, in our two-stream framework, the RGB stream is regarded as the slow pathway, focusing on extracting spatial and appearance information rather than temporal information.
The slow pathway's structure is similar to that of the fast pathway. The RGB frames are rescaled to a larger spatial resolution than the skeleton heatmaps, the frame rate is lower, and the attention layers are repeated more times. In detail, we conduct the same decomposition on RGB videos as on heatmap videos, then embed the tubelets and the classification token and pass them to the attention blocks. The number of spatial attention layers is $L$, which is larger than $L'$. Unlike the skeleton stream, the RGB stream fuses its frame-level representations with those from the skeleton stream. After that, the same as in the fast pathway, the classification token from the temporal Transformer encoder is sent to the classification head. Both streams have the same number of temporal Transformer encoder layers, $L_t$.
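A small configuration sketch contrasting the two pathways, holding the values reported in Section 4.2; the dataclass is just an illustrative container, not part of the authors' code.

```python
from dataclasses import dataclass

@dataclass
class StreamConfig:
    frames: int           # temporal resolution of the clip fed to the stream
    frame_interval: int   # sampling interval between the selected frames
    spatial_layers: int   # L for the RGB stream, L' for the skeleton stream
    temporal_layers: int  # L_t, shared by both streams

rgb_stream_cfg = StreamConfig(frames=8, frame_interval=32, spatial_layers=12, temporal_layers=4)
skeleton_stream_cfg = StreamConfig(frames=32, frame_interval=10, spatial_layers=10, temporal_layers=4)
```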

3.3. The Fusion Methods

To make the most of the RGB and skeleton modalities, we propose two ways to fuse the results of the two pathways. The simplest way is score fusion: after the prediction scores are obtained from the two pathways, we directly average the scores of one video to produce the final result. However, score fusion does not fully integrate the characteristics of the two modalities. Inspired by the lateral connections in SlowFast [26], it is beneficial to also connect the two streams inside the network. Hence, we propose the classification token fusion method to fuse the skeleton information into the RGB stream.
Classification token fusion. Recall the process in the fast pathway: after $L'$ spatial attention layers, the frame-level representations are obtained. We represent one frame as the frame-level classification token $h_i$, $i = 1, \ldots, f$, where $f$ is the number of frames in the fast pathway. Likewise, the frame-level classification tokens in the slow pathway can be described as $h'_i$, $i = 1, \ldots, f'$, where $f'$ is the number of frame-level representations. Suppose that $f = t \times f'$. We average every $t$ frames in the fast pathway into one frame and thus obtain the classification tokens $\hat{h}_i$, $i = 1, \ldots, f'$. After that, according to the frame index, the fast pathway's frame-level representations are concatenated to those of the slow pathway. The fused tokens obtained by concatenating the frame-level tokens from the two pathways are sent to the temporal Transformer encoder. At the end, in the classification head, score fusion is used again. The detailed fusion structure can be seen in Figure 5.
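A minimal sketch of this fusion step, under the assumptions that $f = t \times f'$, that the frames of both pathways are temporally aligned and ordered, and that the concatenated $2d$-dimensional tokens are consumed by a temporal encoder sized accordingly (or projected back to $d$, which is not shown).

```python
import torch

def fuse_classification_tokens(fast_tokens, slow_tokens):
    """Classification token fusion sketch.

    fast_tokens: (B, f, d)   frame-level tokens from the skeleton (fast) pathway
    slow_tokens: (B, f_s, d) frame-level tokens from the RGB (slow) pathway, f = t * f_s
    Every t consecutive fast-pathway tokens are averaged and concatenated to the
    matching slow-pathway token along the channel dimension."""
    B, f, d = fast_tokens.shape
    f_s = slow_tokens.shape[1]
    t = f // f_s
    pooled = fast_tokens.reshape(B, f_s, t, d).mean(dim=2)  # average t fast frames per slow frame
    return torch.cat([slow_tokens, pooled], dim=-1)         # (B, f_s, 2d) fused tokens
```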

4. Results

In this section, we evaluate RGBSformer and the skeleton pathway on four benchmarks and compare them with state-of-the-art approaches. Ablation studies are presented in the last subsection.

4.1. Datasets

We experiment on four popular action recognition datasets: the widely used benchmark Kinetics400 [30], the widely used skeleton-based benchmarks NTU RGB+D 60 [31] and NTU RGB+D 120 [32], and the fine-grained dataset FineGym99 [33].
In Kinetics400, there are some videos that are not human-centric, so it is impossible or difficult to extract skeleton data from these videos. After weeding out some invalid videos, Kinetics400 consists of 239K training videos and 20K validation videos of 400 action classes. The videos are sampled by stratified sampling to ensure a uniform distribution between the training and test sets.
NTU RGB+D is a large-scale human action recognition dataset collected indoors. NTU RGB+D 60 contains 57K videos of 60 action classes, and NTU RGB+D 120 contains 114K videos of 120 human actions performed by 106 subjects. Both datasets provide 3D skeleton sequences collected by Microsoft Kinect v2 sensors [34]. To be consistent with the other datasets, we still use the top-down pose estimator HRNet to obtain skeleton data. We use the cross-subject (X-sub) split, as defined by the dataset authors, in all experiments. Similarly, we weed out the videos where skeleton data are missing.
FineGym99 has 29K videos of 99 fine-grained gymnastic action categories. The same as [10], for the skeleton stream, we use the ground truth bounding boxes for the athletes in all frames during skeleton extraction. After stratified sampling, we used 23K videos for training and 6K videos for testing.

4.2. Implementation Details

We train the two streams separately. For the slow pathway's training, we sample clips of 8 frames at a frame rate of 1/32 (one frame every 32 frames) from the RGB videos. We first re-scale the original frames so that the short side is 320 pixels, then randomly crop 224 × 224 pixels from the reshaped frames during training. We perform center cropping in validation and three spatial crops (top-left, center, bottom-right) at the same temporal index in testing. The tubelet size is $16 \times 16 \times 2$, and we use the central frame method to initialize the tubelet embeddings. For the fast pathway's training, we sample clips of 32 heatmaps at a frame interval of 10. Different from the slow pathway, we generate the pseudo heatmaps at 224 × 224 pixels and feed them to the Transformer blocks directly, without scaling, resizing, or cropping. Importantly, since skeleton heatmaps are mid-level features, the number of Transformer layers for the fast pathway is reduced to 10, fewer than in the slow pathway, where the block is repeated 12 times. The number of temporal Transformer encoder layers is the same for both streams: four in all experiments.
Both pathways use the ViT model pretrained on ImageNet-21K [35] and the SGD optimizer [36]. For Kinetics400, we train our model for 15 epochs with an initial learning rate of 0.005/8, which is divided by 10 at epochs 5 and 10. For the other three datasets, we train our model for 10 epochs with an initial learning rate of 0.005/8, which is divided by 10 at epoch 5. Following TimeSformer [19], we average the scores of the three crops to obtain the final prediction at test time. We use the Transformer implementation in MMAction2 [37].
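A sketch of the training setup just described; the momentum and weight decay values are assumptions (the paper does not report them), and train_one_epoch is a placeholder for the actual training loop.

```python
import torch

def train(model, train_one_epoch, epochs=15, milestones=(5, 10)):
    """Optimizer and step schedule following Section 4.2 (Kinetics400 settings)."""
    optimizer = torch.optim.SGD(model.parameters(), lr=0.005 / 8,
                                momentum=0.9, weight_decay=1e-4)  # momentum/decay assumed
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=list(milestones), gamma=0.1)        # divide the lr by 10
    for _ in range(epochs):
        train_one_epoch(model, optimizer)   # caller-supplied loop over the dataset
        scheduler.step()
```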

4.3. Performance of RGBSformer

To prove that the RGB and skeleton modalities are complementary for human action recognition, we compare RGBSformer and each single pathway to three kinds of models: skeleton-based models, RGB video-based models, and multi-modality models. The detailed results are shown in Table 1. On the fine-grained dataset FineGym99, our fast pathway outperforms all methods except RSANet-R50 [38], which shows that highly distinguishable features of human actions can be captured by pseudo heatmaps. After fusing with the RGB modality, our results improve further and outperform all methods, showing that RGB is indeed complementary to the skeleton modality. The two-stream framework also performs well on the Kinetics400 dataset. However, the fast pathway alone performs poorly on Kinetics400. The reason may be that skeleton data are too difficult to extract from many Kinetics400 videos; Figure 6 presents two examples of such videos. Figure 7 shows four examples of the three models' predictions on FineGym99. From the top five predicted categories, it can be seen that the fusion result obtained by RGBSformer is the best. For the fourth example, the RGB stream gives a wrong prediction, but after fusion with the skeleton stream, RGBSformer obtains the right answer.

4.4. Performance of Skeleton Stream

To show the effectiveness of the skeleton stream, which treats skeleton data as heatmaps instead of graphs and uses fewer attention layers, we compare the fast pathway alone to skeleton-based action recognition methods on the NTU RGB+D 60 and NTU RGB+D 120 datasets. Table 2 shows the results. Most skeleton-based methods use the 3D skeleton, which contains 25 points collected by Microsoft Kinect v2 sensors. In contrast, RGBSformer takes 2D skeletons estimated by HRNet [27] pretrained on COCO-keypoint [44], which consist of 17 points. On the NTU RGB+D 60 dataset, our framework outperforms the other models by at least 0.9%, which shows that the skeleton heatmaps are helpful for human action recognition, and RGBSformer also performs well on the NTU RGB+D 120 dataset. It is worth noting that, compared to the Transformer-based ST-TR-agcn [45], treating the skeleton as heatmaps performs better than treating the skeleton as graphs. Figure 8 shows the normalized confusion matrices of the skeleton stream on both datasets. On the NTU RGB+D 60 dataset, the skeleton stream extracts features that clearly distinguish the classes. However, on the extended dataset, the skeleton stream performs poorly on gesture-related categories, such as “thumb up”, “thumb down”, “make ok sign”, and “make victory sign”. This is understandable, because the COCO keypoint set does not include fine-grained hand joints, so the skeleton stream can only judge by the direction of the arm swing.

4.5. Ablation Studies

Different fusion ways. We compare the two fusion methods of RGBSformer. Compared with the simple score fusion method, classification token fusion greatly improves the results of the fast pathway: score fusion improves the mean-class accuracy of the skeleton stream by only 0.9%, whereas classification token fusion improves it by 3%. Figure 9 shows the improvement and reduction brought by the different fusion methods to the skeleton stream over the 99 action classes: the red bars indicate classes whose accuracy improves after fusing the RGB modality, and the green bars indicate classes whose accuracy decreases. Table 3 shows the results. Compared with simple score fusion, the classification token fusion method improves the accuracy of more categories and reduces the accuracy of fewer categories. Unless otherwise specified, we use classification token fusion in all experiments in this article.
Layers of skeleton stream. According to [10], skeleton heatmaps are a mid-level feature for action recognition. To verify this, we change the number of attention layers of the skeleton stream. We first set $L' = L$, which means the skeleton stream and the RGB stream have the same number of attention layers. Then, we remove two layers at a time until $L' = \frac{1}{2}L$. We report the mean-class accuracy and Top-1 accuracy on FineGym99. To simplify the experiment, we sample 16 frames from the heatmaps. Except for the number of layers, all parameters remain the same. Figure 10 shows the final results on the test set; the right subfigure shows the four loss curves of these experiments. It can be seen that more layers do not mean higher accuracy, because the features may “overfit”. Therefore, we use $L' = 10$ in all other experiments in this paper.
Heatmaps for limbs and joints. Referring to [50], GCN-based methods for skeleton action recognition usually assemble results of joint stream and limb stream to obtain better recognition performance. We generate pseudo heatmaps of joints and joint–limbs on the FineGym99 dataset, then feed these two kinds of heatmaps to the skeleton stream only. The examples of the heatmaps are shown in Figure 3. The results are in Table 4. It can be seen that both joint-only heatmaps and joint–limb heatmaps can capture distinguishable features of human actions, and the joint–limb heatmaps can lead to an obvious performance boost.
3D poses vs. 2D poses. The NTU RGB+D datasets provide 3D skeletons consisting of 25 points, which are captured by Microsoft Kinect v2 sensors. In our experiments, the skeletons extracted by HRNet [27] pretrained on COCO-keypoint [44] consist of 17 points. We generate heatmaps for both sorts of skeletons and conduct experiments on NTU RGB+D 60 to explore whether different skeleton data sources affect the skeleton stream's performance. We reduce the dimensionality of the 3D skeleton points to 2D by splitting each 3D joint $(x, y, z)$ into three 2D points $(x, y)$, $(y, z)$, and $(x, z)$. Examples of the 2D skeleton heatmap and the dimensionality-reduced 3D skeleton heatmap are shown in Figure 11. After obtaining the three sets of 2D skeleton joints, we input them into the skeleton stream separately and average the resulting classification scores to derive the final classification results. The experiments above confirmed that both joint-only heatmaps and joint–limb heatmaps can extract distinguishing action features; since generating joint-only heatmaps costs less than generating joint–limb heatmaps, we use joint-only heatmaps here. Table 5 shows the results: the 3D keypoints do not help action recognition. The reason could be that the dimensionality reduction loses key distinguishing information of the 3D skeletons, or that the 2D points reduced from the 3D skeletons introduce noise that hinders recognition accuracy.
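A NumPy sketch of the dimensionality reduction and score averaging used in this ablation; heatmap_fn and skeleton_stream are placeholders for the heatmap generation of Section 3.1 and the trained fast pathway.

```python
import numpy as np

def project_3d_to_2d(joints_3d):
    """Split every 3D joint (x, y, z) into the three 2D views (x, y), (y, z), (x, z)."""
    x, y, z = joints_3d[..., 0], joints_3d[..., 1], joints_3d[..., 2]
    return [np.stack([x, y], -1), np.stack([y, z], -1), np.stack([x, z], -1)]

def classify_3d_skeleton(joints_3d, heatmap_fn, skeleton_stream):
    """Run the skeleton stream on each 2D projection and average the class scores."""
    scores = [skeleton_stream(heatmap_fn(view)) for view in project_3d_to_2d(joints_3d)]
    return sum(scores) / len(scores)
```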

5. Conclusions

In this work, we propose RGBSformer, a two-stream pure Transformer-based approach for human action recognition that takes RGB frames and skeleton heatmaps as input. We achieve state-of-the-art performance on four widely used action recognition benchmarks. Compared to most skeleton-based action recognition algorithms, which use absolute coordinates to represent the skeleton, we use skeleton heatmaps to extract the motion information contained in skeletons. The skeleton heatmaps avoid the accuracy loss caused by the biases of different skeleton acquisition methods and solve the action recognition problem in multi-person scenarios at lower computational cost. From a practical point of view, RGBSformer does not require motion capture systems or other sensors and can perform RGB and skeleton-based bimodal action recognition using only RGB cameras. It offers a framework for Transformer-based bimodal action recognition and a reference for applying Transformers to human action recognition tasks.
However, there are still two aspects of our framework that can be enhanced, and they are the directions of our subsequent research. Firstly, the ablation studies show that RGBSformer still has room for improvement in 3D skeleton-based action recognition. In the future, we will explore more ways for Transformer-based frameworks to fuse the RGB modality with other modalities, such as 3D skeleton, depth, point cloud, and Wi-Fi, for human action recognition. Secondly, RGBSformer, like all Transformer-based frameworks, requires a very large number of training samples and a large amount of computation. How to obtain a good model with few samples and low overhead remains an open problem.

Author Contributions

Conceptualization, J.S. and Y.Z.; methodology, J.S., Y.Z. and B.X.; software, J.S., Y.Z. and W.W.; validation, J.S., Y.Z. and W.W.; writing—original draft preparation, J.S. and Y.Z.; writing—review and editing, J.S., Y.Z., W.W. and B.X.; visualization, J.S.; supervision, D.H. and L.C. All authors have read and agreed to the published version of the manuscript.

Funding

This paper was funded in part by the National Natural Science Foundation of China under Grant 62072319, 62262074; the Sichuan Science and Technology Program under 2022YFG0041 and 2022YFG0159; the Luzhou Science and Technology Innovation R&D Program (No. 2022CDLZ-6).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The FineGym99 dataset can be found at https://sdolivia.github.io/FineGym/, accessed on 1 November 2022; the raw RGB videos can be downloaded from https://utexas.box.com/s/oq2ce7i1mdgoihdwy9o7moolxzxsib42, accessed on 1 November 2022 (access provided by the authors, https://github.com/SDOlivia/FineGym/issues/11, accessed on 1 November 2022). However, due to limitations of storage and network speed, we downloaded the RGB data clipped as events from https://github.com/minghchen/CARL_code, accessed on 1 November 2022. The NTU RGB+D datasets can be found at https://rose1.ntu.edu.sg/dataset/actionRecognition/, accessed on 1 November 2022; to download the datasets, one can register an account, submit a request, accept the Release Agreement, and wait for the authors' approval. Kinetics400 can be found at https://www.deepmind.com/open-source/kinetics, accessed on 1 November 2022. The pretrained ViT model can be found at https://github.com/google-research/scenic/tree/main/scenic/projects/baselines, accessed on 1 November 2022. The pretrained HRNet model can be downloaded from https://github.com/HRNet/HRNet-Human-Pose-Estimation, accessed on 1 November 2022.

Acknowledgments

We would like to express our heartfelt thanks to Junhua Liao and Shengjie Liu, who made valuable suggestions for this article.

Conflicts of Interest

The authors declare that they have no known competing financial interest or personal relationship that could have appeared to influence the work reported in this paper.

References

  1. Shu, X.; Yang, J.; Yan, R.; Song, Y. Expansion-squeeze-excitation fusion network for elderly activity recognition. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 5281–5292. [Google Scholar] [CrossRef]
  2. Park, S.K.; Chung, J.H.; Pae, D.S.; Lim, M.T. Binary Dense SIFT Flow Based Position-Information Added Two-Stream CNN for Pedestrian Action Recognition. Appl. Sci. 2022, 12, 10445. [Google Scholar] [CrossRef]
  3. Yue, R.; Tian, Z.; Du, S. Action recognition based on RGB and skeleton data sets: A survey. Neurocomputing 2022, 512, 287–306. [Google Scholar] [CrossRef]
  4. Liu, J.; Shahroudy, A.; Xu, D.; Kot, A.C.; Wang, G. Skeleton-based action recognition using spatio-temporal LSTM network with trust gates. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 40, 3007–3021. [Google Scholar] [CrossRef]
  5. Imran, J.; Kumar, P. Human action recognition using RGB-D sensor and deep convolutional neural networks. In Proceedings of the 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), Jaipur, India, 21–24 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 144–148. [Google Scholar]
  6. Chen, X.; Liu, W.; Liu, X.; Zhang, Y.; Han, J.; Mei, T. MAPLE: Masked Pseudo-Labeling autoEncoder for Semi-supervised Point Cloud Action Recognition. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 708–718. [Google Scholar]
  7. Sun, Z.; Ke, Q.; Rahmani, H.; Bennamoun, M.; Wang, G.; Liu, J. Human action recognition from various data modalities: A review. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3200–3225. [Google Scholar] [CrossRef]
  8. Li, J.; Xie, X.; Pan, Q.; Cao, Y.; Zhao, Z.; Shi, G. SGM-Net: Skeleton-guided multimodal network for action recognition. Pattern Recognit. 2020, 104, 107356. [Google Scholar] [CrossRef]
  9. Cai, J.; Jiang, N.; Han, X.; Jia, K.; Lu, J. JOLO-GCN: Mining joint-centered light-weight information for skeleton-based action recognition. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 2735–2744. [Google Scholar]
  10. Duan, H.; Zhao, Y.; Chen, K.; Lin, D.; Dai, B. Revisiting skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 2969–2978. [Google Scholar]
  11. Jing, Y.; Wang, F. TP-VIT: A Two-Pathway Vision Transformer for Video Action Recognition. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 2185–2189. [Google Scholar]
  12. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  13. Cai, J.; Zhang, Y.; Guo, J.; Zhao, X.; Lv, J.; Hu, Y. St-pn: A spatial transformed prototypical network for few-shot sar image classification. Remote Sens. 2022, 14, 2019. [Google Scholar] [CrossRef]
  14. Zhou, X.; Bai, X.; Wang, L.; Zhou, F. Robust ISAR Target Recognition Based on ADRISAR-Net. IEEE Trans. Aerosp. Electron. Syst. 2022, 58, 5494–5505. [Google Scholar] [CrossRef]
  15. Zhao, X.; Lv, X.; Cai, J.; Guo, J.; Zhang, Y.; Qiu, X.; Wu, Y. Few-Shot SAR-ATR Based on Instance-Aware Transformer. Remote Sens. 2022, 14, 1884. [Google Scholar] [CrossRef]
  16. Liao, J.; Duan, H.; Li, X.; Xu, H.; Yang, Y.; Cai, W.; Chen, Y.; Chen, L. Occlusion detection for automatic video editing. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2255–2263. [Google Scholar]
  17. Liao, J.; Duan, H.; Zhao, W.; Yang, Y.; Chen, L. A Light Weight Model for Video Shot Occlusion Detection. In Proceedings of the ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 3154–3158. [Google Scholar]
  18. Arnab, A.; Dehghani, M.; Heigold, G.; Sun, C.; Lučić, M.; Schmid, C. Vivit: A video vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 6836–6846. [Google Scholar]
  19. Bertasius, G.; Wang, H.; Torresani, L. Is space-time attention all you need for video understanding? In Proceedings of the ICML, Virtual, 18–24 July 2021; Volume 2, p. 4. [Google Scholar]
  20. Hu, L.; Liu, S.; Feng, W. Spatial Temporal Graph Attention Network for Skeleton-Based Action Recognition. arXiv 2022, arXiv:2208.08599. [Google Scholar]
  21. Zhang, Y.; Wu, B.; Li, W.; Duan, L.; Gan, C. STST: Spatial-temporal specialized transformer for skeleton-based action recognition. In Proceedings of the 29th ACM International Conference on Multimedia, Nice, France, 21–25 October 2021; pp. 3229–3237. [Google Scholar]
  22. Hou, Y.; Li, Z.; Wang, P.; Li, W. Skeleton optical spectra-based action recognition using convolutional neural networks. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 807–811. [Google Scholar] [CrossRef]
  23. Wang, P.; Li, Z.; Hou, Y.; Li, W. Action recognition based on joint trajectory maps using convolutional neural networks. In Proceedings of the 24th ACM international conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 102–106. [Google Scholar]
  24. Yan, S.; Xiong, X.; Arnab, A.; Lu, Z.; Zhang, M.; Sun, C.; Schmid, C. Multiview transformers for video recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 3333–3343. [Google Scholar]
  25. Plizzari, C.; Cannici, M.; Matteucci, M. Spatial temporal transformer network for skeleton-based action recognition. In Proceedings of the International Conference on Pattern Recognition, Shanghai, China, 15–17 October 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 694–701. [Google Scholar]
  26. Feichtenhofer, C.; Fan, H.; Malik, J.; He, K. Slowfast networks for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 6202–6211. [Google Scholar]
  27. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703. [Google Scholar]
  28. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  29. Hendrycks, D.; Gimpel, K. Gaussian error linear units (gelus). arXiv 2016, arXiv:1606.08415. [Google Scholar]
  30. Kay, W.; Carreira, J.; Simonyan, K.; Zhang, B.; Hillier, C.; Vijayanarasimhan, S.; Viola, F.; Green, T.; Back, T.; Natsev, P.; et al. The kinetics human action video dataset. arXiv 2017, arXiv:1705.06950. [Google Scholar]
  31. Shahroudy, A.; Liu, J.; Ng, T.T.; Wang, G. Ntu rgb+ d: A large scale dataset for 3d human activity analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1010–1019. [Google Scholar]
  32. Liu, J.; Shahroudy, A.; Perez, M.; Wang, G.; Duan, L.Y.; Kot, A.C. Ntu rgb+ d 120: A large-scale benchmark for 3d human activity understanding. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2684–2701. [Google Scholar] [CrossRef]
  33. Shao, D.; Zhao, Y.; Dai, B.; Lin, D. Finegym: A hierarchical video dataset for fine-grained action understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 2616–2625. [Google Scholar]
  34. Zhang, Z. Microsoft kinect sensor and its effect. IEEE Multimed. 2012, 19, 4–10. [Google Scholar] [CrossRef]
  35. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 20–25 June 2009; IEEE: Piscataway, NJ, USA, 2009; pp. 248–255. [Google Scholar]
  36. Goyal, P.; Dollár, P.; Girshick, R.; Noordhuis, P.; Wesolowski, L.; Kyrola, A.; Tulloch, A.; Jia, Y.; He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv 2017, arXiv:1706.02677. [Google Scholar]
  37. MMAction2 Contributors Openmmlab’s Next Generation Video Understanding Toolbox and Benchmark. 2020. Available online: https://github.com/open-mmlab/mmaction2 (accessed on 1 November 2022).
  38. Kim, M.; Kwon, H.; Wang, C.; Kwak, S.; Cho, M. Relational Self-Attention: What’s Missing in Attention for Video Understanding. Adv. Neural Inf. Process. Syst. 2021, 34, 8046–8059. [Google Scholar]
  39. Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6299–6308. [Google Scholar]
  40. Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6450–6459. [Google Scholar]
  41. Zhou, B.; Andonian, A.; Oliva, A.; Torralba, A. Temporal relational reasoning in videos. In Proceedings of the European conference on computer vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 803–818. [Google Scholar]
  42. Lin, J.; Gan, C.; Han, S. Tsm: Temporal shift module for efficient video understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27 October–2 November 2019; pp. 7083–7093. [Google Scholar]
  43. Crasto, N.; Weinzaepfel, P.; Alahari, K.; Schmid, C. Mars: Motion-augmented rgb stream for action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 7882–7891. [Google Scholar]
  44. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  45. Plizzari, C.; Cannici, M.; Matteucci, M. Skeleton-based action recognition via spatial and temporal transformer networks. Comput. Vis. Image Underst. 2021, 208, 103219. [Google Scholar] [CrossRef]
  46. Papadopoulos, K.; Ghorbel, E.; Aouada, D.; Ottersten, B. Vertex feature encoding and hierarchical temporal modeling in a spatial-temporal graph convolutional network for action recognition. arXiv 2019, arXiv:1912.09745. [Google Scholar]
  47. Peng, W.; Shi, J.; Xia, Z.; Zhao, G. Mix dimension in poincaré geometry for 3d skeleton-based action recognition. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 1432–1440. [Google Scholar]
  48. Yang, H.; Gu, Y.; Zhu, J.; Hu, K.; Zhang, X. PGCN-TCA: Pseudo graph convolutional network with temporal and channel-wise attention for skeleton-based action recognition. IEEE Access 2020, 8, 10040–10047. [Google Scholar] [CrossRef]
  49. Song, Y.F.; Zhang, Z.; Shan, C.; Wang, L. Constructing stronger and faster baselines for skeleton-based action recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 1474–1488. [Google Scholar] [CrossRef] [PubMed]
  50. Shi, L.; Zhang, Y.; Cheng, J.; Lu, H. Two-stream adaptive graph convolutional networks for skeleton-based action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12026–12035. [Google Scholar]
Figure 1. Most skeleton-based methods represent skeletons in the form of graphs, taking human joints as nodes and human limbs as edges.
Figure 2. The overview of our proposed framework, RGBSformer.
Figure 3. Examples of the generated joint heatmap, limb heatmap, and joint–limb heatmap.
Figure 4. The Transformer blocks of the skeleton stream to process skeleton heatmaps.
Figure 5. The details of the classification token fusion method.
Figure 6. Examples from Kinetics400 for which it is difficult for skeleton-based methods to handle human action recognition: the people occupy small regions, or there are no people in the frames.
Figure 7. Examples of four videos from FineGym99 (left) and the top five classes predicted by the skeleton stream, the RGB stream, and RGBSformer (right).
Figure 8. The normalized confusion matrix of the skeleton stream on NTU RGB+D 60 (left) and NTU RGB+D 120 (right).
Figure 9. The improvement and reduction to the skeleton stream by different fusion methods.
Figure 10. The performance of the skeleton stream on FineGym99 as the number of layers changes (left), and the loss curves of these experiments during training (right).
Figure 11. Examples of 2D skeleton consisting of 17 points and 3D skeletons reduced to 2D in dimensionality.
Table 1. The performance of RGBSformer and two single pathways on FineGym99 and Kinetics400 compared to state-of-the-art models. We report the mean-class accuracy for FineGym99 and Top-1 accuracy for Kinetics400. Units are in %.
Methods | Modality | FineGym99 | Kinetics400
I3D [39] | RGB | 63.2 | 71.1
R(2+1)D-TwoStream [40] | RGB, Flow | - | 75.4
TRN [41] | RGB | 68.7 | -
TSM [42] | RGB | 70.6 | -
TSM Two-stream [42] | RGB, Flow | 81.2 | -
RSANet-R50 [38] | RGB | 86.4 | -
MARS [43] | RGB, Flow | - | 74.9
TP-ViT [11] | RGB, Skeleton | - | 80.8
STGAT [20] | Skeleton | - | 39.2
RGB stream | RGB | 69.3 | 78.8
Skeleton stream | Skeleton | 83.7 | 39.7
RGBSformer | RGB, Skeleton | 86.7 | 80.9
Table 2. The performance of the skeleton stream only compared to the state-of-the-art skeleton-based action recognition methods on NTU RGB+D 60 and NTU RGB+D 120 (cross-subject). Units are in %.
Models | Method | NTU RGB+D 60 | NTU RGB+D 120
GVFE + AS-GCN with DH-TCN [46] | GCN | 85.3 | 78.3
Mix-Dimension [47] | GCN | 89.7 | 80.5
PGCN-TCA [48] | GCN | 88.0 | -
ST-TR-agcn [45] | Transformer | 90.3 | 82.7
EfficientGCN-B0 [49] | GCN | 90.2 | 85.9
Ours | Transformer | 91.1 | 85.7
Table 3. The different fusion methods for RGBSformer. We report the mean-class accuracy on the FineGym99 dataset. Units are in %.
Fusion Method | Mean-Class Accuracy
skeleton stream | 83.7
score fusion | 84.6
classification token fusion | 86.7
Table 4. The different heatmaps for skeleton-based human action recognition. We report the mean-class and Top-1 accuracy on the FineGym99 dataset. Units are in %.
Heatmaps | Mean-Class Accuracy | Top-1 Accuracy
joint-only | 79.2 | 87.2
joint–limb | 83.7 | 88.8
Table 5. The results of 2D and 3D skeletons from the skeleton stream only. The 3D skeletons from NTU RGB+D 60 are reduced to 2D points. Units are in %.
Skeleton Source | Top-1 Accuracy | Top-5 Accuracy
2D skeleton | 90.0 | 99.4
3D skeleton | 82.9 | 98.3
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
