Adaptive Feature Selection With Reinforcement Learning for Skeleton-Based Action Recognition

Skeleton-based action recognition has recently attracted extensive attention in the computer vision community. Previous studies, especially GCN-based methods, have achieved remarkable improvements on this task. However, existing GCN-based methods apply global average pooling to the extracted features before the classifier. This may hurt recognition performance, since it neglects the fact that not all features are equally important in the temporal dimension. To tackle this issue, in this article, we propose a feature selection network (FSN) with actor-critic reinforcement learning. Given the extracted feature sequence, the FSN learns to adaptively select the most representative features and discard ambiguous features for action recognition. In addition, conventional graph convolution is a local operation, so it cannot fully capture the non-local joint dependencies that can be vital for recognizing an action. Thus, we also propose a generalized graph generation module to capture latent dependencies and further propose a generalized graph convolution network (GGCN). The GGCN and FSN are combined in a three-stream recognition framework, in which different types of information from the skeleton data are fused to further improve recognition accuracy. Extensive experiments demonstrate that the proposed FSN is a flexible and effective module that can cooperate with any existing GCN-based framework to enhance recognition accuracy, that the proposed GGCN extracts richer skeleton features for skeleton-based action recognition, and that our method achieves superior performance on several public datasets, e.g., 95.7% top-1 accuracy on NTU-RGB+D and 86.7% top-1 accuracy on NTU-RGB+D 120.


I. INTRODUCTION
Human action recognition is a fundamental task with wide applications in various fields, such as visual surveillance, video retrieval and human-computer interaction [1]-[4], and is generally based on RGB videos [5] or depth videos [6], [7]. Recently, skeleton-based action recognition has attracted considerable attention since it is compact in action representation, robust to variations in surrounding distractions, and efficient in terms of computation and storage [8]-[10]. Moreover, with the rapid development of depth sensors (e.g., Microsoft Kinect) and human pose estimation algorithms [11], [12], capturing skeleton data has become easier than ever before.
Previous deep-learning-based methods directly adopted recurrent neural networks (RNNs) or convolutional neural networks (CNNs) for feature learning, in which the skeletons are simply represented as vector sequences [13]-[19] or pseudo-images [20]-[23], respectively. However, these methods ignore the dependencies between correlated joints. Inspired by the observation that the human skeleton is a natural graph structure, recent methods [24]-[29] construct a skeleton graph to capture joint dependencies. Specifically, the joints and bones are formulated as vertices and edges, respectively, and graph convolution networks (GCNs) are applied to extract correlated features.
Although these GCN-based methods have achieved significant improvements, some problems remain to be solved. Generally, in GCN-based methods [24], [26], [27], [29], global average pooling is applied to the extracted features before the classifier. This operation treats all the extracted features in the temporal dimension equally and thus fails to focus on the most informative features. Taking the action 'throw' as an example, as shown in Fig. 1, the features extracted around the time when the subject throws are obviously more discriminative for recognition than features from other times. To address this concern, in this article, we propose a novel feature selection network (FSN) to select the most representative features for action recognition. Specifically, since the feature selection may differ from one sample to another and has a sequential relationship in the temporal dimension, we model the FSN as an actor-critic reinforcement learning process. The FSN adaptively selects representative features and discards ambiguous features. The selected features are then fed into a classifier for action recognition. In particular, the FSN can cooperate with any GCN-based architecture to improve action recognition performance.

The FSN proposed above is a feature selector: it only selects the representative parts of the features extracted by some network. Thus, an efficient feature extractor is necessary, since it can provide more effective features for the FSN. However, conventional GCNs, such as ST-GCN [24], only focus on the joint relations within a local part and ignore the long-range dependencies among joints, because features are only extracted from a small neighborhood. To move beyond this limitation, several methods [26], [27] have attempted to extract features within a larger receptive field. However, these methods often capture joint dependencies only through the appearance similarity between joints in feature space. In this article, we propose a generalized graph convolution network (GGCN). It constructs a generalized graph by considering both the appearance and structural similarities among joints in a data-driven manner to capture the latent dependencies between arbitrary joints. Graph convolution is then operated on this generalized graph, which increases the flexibility of the graph convolution. Subsequently, the GGCN and FSN are combined in a three-stream action recognition framework, in which multiple types of skeleton information are further fused to improve the recognition performance.
The main contributions of this article are summarized as follows:
• We propose a feature selection network (FSN) with actor-critic reinforcement learning to select representative features and discard ambiguous features, which can cooperate with any GCN-based method to improve recognition performance.
• A generalized graph convolution network (GGCN) is proposed, which can capture latent dependencies among joints for improved feature extraction.
• Three different types of information from skeleton data are fused in a three-stream framework, notably improving recognition performance.
• Extensive experiments on the NTU-RGB+D, NTU-RGB+D 120 and Kinetics datasets show that our approach outperforms state-of-the-art methods.

The rest of the paper is organized as follows. Section II provides an overview of related work. In Section III, we first describe the pipeline of our proposed method and then describe the generalized graph convolution network and the feature selection network in detail. Extensive experiments in Section IV validate the effectiveness of our proposed method. Finally, the conclusion is given in Section V.

II. RELATED WORK

A. SKELETON-BASED ACTION RECOGNITION
Skeleton-based action recognition has received extensive attention recently due to its robustness to illumination changes and scene variation. Conventional approaches usually design handcrafted features to capture the dynamics of joint motion [30], [31]. With the development of deep learning, deep-learning-based methods have proved superior to conventional methods, where the most widely used models are RNNs and CNNs. In RNN-based methods, skeleton data are represented as a sequence of coordinate vectors and then modeled by RNNs to capture the temporal dependencies between consecutive frames [13]-[19], [32]. CNN-based methods usually represent the skeleton data as a pseudo-image and achieve remarkable results due to their ability to extract high-level features [20]-[23]. However, these methods fail to exploit the relations between body joints. Recently, as graph convolutional networks (GCNs) have extended CNNs to non-Euclidean space [33], GCN-based approaches [24], [34] have attracted much attention due to their flexibility in capturing joint dependencies. Yan et al. [24] construct a skeleton graph with joints as vertices and bones as edges, and further propose a spatial-temporal graph convolution network (ST-GCN) to learn spatial and temporal features simultaneously. Although ST-GCN achieves better performance than previous methods, it only extracts the features of joints within a local part and ignores the long-range dependencies among joints. Several recent methods, such as AS-GCN [27] and AGCN [26], have attempted to solve this problem. AS-GCN first infers A-links from the input data to capture actional dependencies and then refines them during training, while AGCN proposes an adaptive graph convolution for extracting long-range dependencies. In this article, we generate a generalized graph by considering both the appearance and structural similarities among joints, and further construct a generalized graph convolution network to capture latent dependencies among all joints, which extracts more useful information for action recognition.

FIGURE 2. The pipeline of our proposed method. In each GGCN block, the graph convolution operation is based on the generalized graph G_i and the predefined graph A_i. The FSN takes the extracted features as input and outputs whether to select those features. The outputs of the three streams are finally fused before the softmax layer.

B. GRAPH NEURAL NETWORKS
Recently, graph neural networks have attracted much attention due to their effective representation of non-Euclidean data (e.g., graph-structured data) [35] and have many applications. For example, graphs are used to construct structured attention modules for depth estimation [36] and contour prediction [37], Schlichtkrull et al. [38] introduced relational graph convolution networks to handle the highly multi-relational data characteristic of realistic knowledge bases, Zhao et al. [39] extended the graph convolutional LSTM to a probabilistic model under a Bayesian framework, and [40], [41] apply graph convolution to remote sensing image classification. In particular, one of the most powerful architectures among graph neural networks is the graph convolutional network, which is generally constructed from one of two perspectives: the spectral perspective and the spatial perspective. Spectral methods consider the locality of the graph convolution in the form of spectral analysis [33], [42], [43]. Spatial methods apply the convolutional filters directly to graph nodes and their neighbors [44], [45]. In this work, we follow the spatial perspective in applying GCNs to skeleton-based action recognition.

C. DEEP REINFORCEMENT LEARNING
Reinforcement learning (RL) [46] is an area of machine learning concerned with how agents ought to take actions in an environment so as to maximize some notion of cumulative reward; the problem can be mathematically formulated as a Markov decision process (MDP) [47]. Recently, traditional RL algorithms have been incorporated into deep learning frameworks and applied in various domains. In the field of computer vision, RL has also drawn much attention, for example in image captioning [48], [49], image restoration [50], object tracking [51], action detection [52] and skeleton-based action recognition [25]. More specifically, [52] applied RL to decide both where to look next and when to emit a prediction when observing video frames, and [25] proposed a frame distillation network modeled by RL to select keyframes in skeleton sequences. In this article, unlike previous studies, we apply an actor-critic [53] reinforcement learning method to select representative features in the temporal dimension to improve action recognition performance.

III. METHOD

A. PIPELINE OVERVIEW
The pipeline of our proposed method for skeleton-based action recognition is shown in Fig. 2. Given the skeleton sequences of body joints in the form of 2D or 3D coordinates, we follow [24] to construct a spatial-temporal graph with joints as vertices and natural connectivity in both human body structures and time as edges. As a GCN-based feature extractor, our proposed generalized graph convolution network (GGCN) is applied to extract high-level features from the input data. Then, the extracted features are fed into our proposed feature selection network (FSN), where the representative features are selected and others are discarded. The selected features are finally fed into the classifier to obtain an output. Moreover, a three-stream framework is applied to three different types of input data, and the outputs are fused for classification. We will now introduce each component separately.

B. GENERALIZED GRAPH CONVOLUTION NETWORK 1) GCN-BASED FEATURE LEARNING
In existing GCN-based methods, such as ST-GCN [24], a skeleton graph is constructed with the joints as vertices and the bones as edges. Then, multiple layers of graph convolution and temporal convolution are applied to the graph to extract high-level features. In detail, the graph convolution operation is formulated as follows:

$$\mathbf{f}_{out} = \sum_{k=1}^{K_v} \mathbf{W}_k\, \mathbf{f}_{in}\,(\mathbf{A}_k \odot \mathbf{M}_k), \tag{1}$$

where $K_v$ denotes the kernel size of the spatial dimension and is set to 3 based on the partition strategy (see details in [24]), $\mathbf{A}_k$ is the predefined graph for each partition group, $\mathbf{W}_k$ is the weight matrix of the corresponding $1\times1$ convolution, $\mathbf{M}_k$ is a learnable edge-importance weight matrix initialized as an all-one matrix, and $\odot$ denotes the Hadamard product.
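For illustration, the following is a minimal PyTorch sketch of this spatial graph convolution; the tensor layout (N, C, T, V), the module name, and the use of a 1×1 convolution to realize $W_k$ are assumptions for the example rather than an exact released implementation.

```python
import torch
import torch.nn as nn

class SpatialGraphConv(nn.Module):
    """Sketch of Eq. 1: f_out = sum_k W_k * f_in * (A_k ⊙ M_k)."""
    def __init__(self, in_channels, out_channels, A):
        super().__init__()
        # A: (K_v, V, V) predefined adjacency matrix for each partition group
        self.register_buffer('A', A)
        K_v = A.size(0)
        # M_k: learnable edge-importance weights, initialized to all ones
        self.M = nn.Parameter(torch.ones_like(A))
        # W_k implemented as one 1x1 convolution producing K_v * out_channels maps
        self.conv = nn.Conv2d(in_channels, out_channels * K_v, kernel_size=1)
        self.out_channels = out_channels

    def forward(self, x):
        # x: (N, C, T, V) = batch, channels, frames, joints
        N, C, T, V = x.shape
        K_v = self.A.size(0)
        y = self.conv(x).view(N, K_v, self.out_channels, T, V)
        # aggregate over joints with the weighted adjacency of each partition
        A = self.A * self.M                      # A_k ⊙ M_k (Hadamard product)
        return torch.einsum('nkctv,kvw->nctw', y, A)
```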

2) GENERALIZED GRAPH GENERATION
Performing activities requires not only the movement of human joints in small local groups but also the collaborative movement of far-apart joints. To adaptively capture latent dependencies among joints for various actions, we propose a generalized graph generation module that adaptively generates generalized graphs by computing the relations among joints. Let a joint be described by its appearance feature $f_A$ and its structural feature $f_S$. In this work, $f_A$ is the feature extracted by the graph convolutional network and $f_S$ is taken from the adjacency matrix. Given an input set of $N$ joints $\{f^n_A, f^n_S\}_{n=1}^{N}$, we compute the similarity among these features to describe the relations among joints. Similar to [54], two matrices are applied to embed the features. Then we centralize the embedded features and use the dot product to measure the similarity of two joints in the embedding space. In detail, the relation $R(i, j)$ between joints $i$ and $j$ is computed as follows:

$$R(i, j) = \alpha\, R_A(i, j) + \beta\, R_S(i, j), \quad R_A(i, j) = \big(E^i_{\theta} - \bar{E}_{\theta}\big)^{\top}\big(E^j_{\phi} - \bar{E}_{\phi}\big), \quad R_S(i, j) = \big(E^i - \bar{E}\big)^{\top}\big(E^j - \bar{E}\big), \tag{2}$$

where $f^i_A \in \mathbb{R}^{C\times T}$ is the appearance feature of joint $i$, which is embedded with a learnable matrix $W_{\theta_A}$/$W_{\phi_A}$ and then reshaped ($\mathcal{R}$) to a vector $E^i_{\theta}$/$E^i_{\phi}$, $\bar{E}$ denotes the mean of the embedded features over all joints, and $\alpha$ and $\beta$ are learnable parameters that balance the appearance and structural terms. $f^i_S$ is the structural feature, which is embedded into a high-dimensional representation by Node2Vec [55], denoted as $E^i$; this embedding considers both homophily and structural equivalence between joints.
Finally, a softmax function is applied to normalize $R(i, j)$ into a matrix $G$, whose elements $G(i, j) \in [0, 1]$ satisfy $\sum_{i} G(i, j) = 1$. Therefore, the latent dependencies among joints can be represented as a normalized generalized graph:

$$G(i, j) = \frac{\exp\big(R(i, j)\big)}{\sum_{i=1}^{N} \exp\big(R(i, j)\big)}.$$
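As a rough illustration of this relation computation and normalization, a sketch of the generation module might look as follows; the embedding dimension, the use of the per-joint mean for centralization, and feeding a precomputed Node2Vec embedding as a constant buffer are assumptions based on the description above, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeneralizedGraph(nn.Module):
    """Sketch: combine appearance and structural similarity into one graph G."""
    def __init__(self, in_channels, T, embed_dim, struct_embed):
        super().__init__()
        # W_theta / W_phi embed the appearance feature f_A of each joint
        self.theta = nn.Linear(in_channels * T, embed_dim)
        self.phi = nn.Linear(in_channels * T, embed_dim)
        # struct_embed: (V, D) precomputed Node2Vec embedding of the adjacency structure
        self.register_buffer('E_s', struct_embed)
        # alpha, beta balance appearance vs. structural similarity
        self.alpha = nn.Parameter(torch.zeros(1))
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        # x: (N, C, T, V) -> per-joint appearance feature (N, V, C*T)
        N, C, T, V = x.shape
        f_A = x.permute(0, 3, 1, 2).reshape(N, V, C * T)
        e_theta = self.theta(f_A)                             # (N, V, d)
        e_phi = self.phi(f_A)                                 # (N, V, d)
        # centralize the embedded features over the joints
        e_theta = e_theta - e_theta.mean(dim=1, keepdim=True)
        e_phi = e_phi - e_phi.mean(dim=1, keepdim=True)
        R_A = torch.einsum('nid,njd->nij', e_theta, e_phi)    # appearance relation
        E_s = self.E_s - self.E_s.mean(dim=0, keepdim=True)
        R_S = (E_s @ E_s.t()).unsqueeze(0)                    # structural relation
        # normalized exponential weights, as in the implementation details (Sec. IV)
        a, b = torch.softmax(torch.cat([self.alpha, self.beta]), dim=0)
        R = a * R_A + b * R_S
        return F.softmax(R, dim=1)   # column-normalized so that sum_i G(i, j) = 1
```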

3) GENERALIZED GRAPH CONVOLUTIONAL BLOCK
Referring to the architecture of ST-GCN, our proposed generalized graph convolutional block consists of two parts: the generalized graph convolution (GGC) and the temporal convolution (TCN). In the first part, the graph convolution operation is based on both the predefined graph and the generalized graph. In detail, similar to the multi-head mechanism in [54], we generate $K_v$ generalized graphs in each block, which allows the model to jointly attend to information from different representation subspaces at different positions. Specifically, to avoid introducing additional parameters, the weight matrix $W_k$ is shared between the predefined graph $A_k$ and the generalized graph $G_k$. Moreover, a residual connection is added for each layer, as in [56]. Then, in the generalized graph convolutional layer, Eq. 1 is transformed into the following form:

$$\mathbf{f}_{out} = \sum_{k=1}^{K_v} \mathbf{W}_k\, \mathbf{f}_{in}\,\big(\mathbf{A}_k \odot \mathbf{M}_k + \mathbf{G}_k\big) + H_{res}(\mathbf{f}_{in}),$$

where $G_k$ is the normalized generalized graph defined above, and $H_{res}$ is a $1\times1$ convolution if the number of input channels differs from the number of output channels and an identity mapping if they are equal. The second part is the temporal convolutional layer, which performs a $K_t \times 1$ convolution on the feature map output by the first part. The proposed generalized graph convolution network (GGCN) is then built by stacking 10 such blocks (see details in Appendix B).
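The following sketch shows how one such block might combine the predefined and generalized graphs with a shared weight matrix, a residual branch, and a temporal convolution; the kernel sizes, batch-norm placement, and the interface of the graph-generation submodule (returning one generalized graph per partition) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GGCBlock(nn.Module):
    """Sketch of one generalized graph convolutional block (GGC + TCN)."""
    def __init__(self, in_c, out_c, A, graph_gen, K_t=9, stride=1):
        super().__init__()
        self.register_buffer('A', A)                 # (K_v, V, V) predefined graphs
        self.M = nn.Parameter(torch.ones_like(A))    # edge-importance weights
        self.graph_gen = graph_gen                   # assumed to return (N, K_v, V, V)
        K_v = A.size(0)
        self.conv = nn.Conv2d(in_c, out_c * K_v, 1)  # shared W_k for A_k and G_k
        self.tcn = nn.Sequential(
            nn.BatchNorm2d(out_c), nn.ReLU(),
            nn.Conv2d(out_c, out_c, (K_t, 1), (stride, 1), ((K_t - 1) // 2, 0)),
            nn.BatchNorm2d(out_c))
        # H_res: identity if channels and stride match, otherwise a 1x1 convolution
        self.res = (nn.Identity() if in_c == out_c and stride == 1
                    else nn.Conv2d(in_c, out_c, 1, (stride, 1)))
        self.relu = nn.ReLU()
        self.out_c = out_c

    def forward(self, x):
        N, C, T, V = x.shape
        K_v = self.A.size(0)
        G = self.graph_gen(x)                        # generalized graphs G_k
        y = self.conv(x).view(N, K_v, self.out_c, T, V)
        A = self.A * self.M                          # A_k ⊙ M_k
        out = (torch.einsum('nkctv,kvw->nctw', y, A)
               + torch.einsum('nkctv,nkvw->nctw', y, G))
        return self.relu(self.tcn(out) + self.res(x))
```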

C. FEATURE SELECTION NETWORK
The features $f \in \mathbb{R}^{d\times T\times N}$ extracted by the GGCN can be represented as a sequence $F = \{f_1, \cdots, f_T\}$ by spatial average pooling, where $f_t \in \mathbb{R}^{d}$ and $f_t = \frac{1}{N}\sum_{n=1}^{N} f^n_t$. As mentioned in Section I, not all of the extracted features in the temporal dimension are equally important for action recognition. To adaptively select the representative features, we propose a feature selection network (FSN) modeled by an actor-critic [53] reinforcement learning method, which contains a policy network (actor) and a value network (critic). Both the policy network and the value network are based on an LSTM [57] for sequential action or value generation.

1) POLICY NETWORK
The policy network $\pi$ is parameterized by $\phi$ and receives a state $s_t$ at time $t$ to generate a distribution over actions (select or discard), i.e., $a_t \sim \pi_{\phi}(s_t)$. Given the extracted feature sequence $F$, the state $s_t$ of the MDP consists of two parts: $s^1_t = \{f_t, f_g\}$, where $f_g = \frac{1}{T}\sum_{t=1}^{T} f_t$ denotes the global information of the features; and $s^2_t = \{a_1, \cdots, a_{t-1}\}$ is the set of actions generated so far. To aggregate the information of state $s_t$, we first employ multiple fully-connected layers on $f_t$, $f_g$ and $a_{t-1}$ to embed them into a vector. Then, the vector is fed into the LSTM to obtain the hidden state $h_t$. Finally, we add a stochastic output layer $y$, consisting of several fully-connected layers and a softmax activation, to build a probabilistic model for feature selection, which generates the output $a_t \in \{0, 1\}$:

$$p(a_t \mid s_t) = \mathrm{softmax}\big(y(h_t)\big), \quad a_t \sim p(a_t \mid s_t).$$

Thus, the policy network defines a probability distribution $p(a_t \mid s_t)$ of action $a_t$ given the current state $s_t$. The overall architecture of the policy network is shown in Fig. 4, and detailed parameter settings are given in Appendix B.
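A minimal sketch of such a policy network is given below, assuming the features have already been pooled to a (T, d) sequence; the layer widths, the two-dimensional one-hot encoding of the previous action, and the unrolled LSTM cell are illustrative choices, not the exact configuration in Appendix B.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetwork(nn.Module):
    """Sketch of the FSN actor: embed (f_t, f_g, a_{t-1}), run an LSTM,
    and output select/discard probabilities."""
    def __init__(self, feat_dim=256, action_dim=2, hidden=128):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(2 * feat_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU())
        self.lstm = nn.LSTMCell(128, hidden)
        self.head = nn.Sequential(nn.Linear(hidden, 64), nn.ReLU(),
                                  nn.Linear(64, action_dim))

    def forward(self, F_seq):
        # F_seq: (T, feat_dim) feature sequence after spatial average pooling
        T, d = F_seq.shape
        f_g = F_seq.mean(dim=0)                       # global information
        h = torch.zeros(1, self.lstm.hidden_size)
        c = torch.zeros(1, self.lstm.hidden_size)
        a_prev = torch.zeros(1, 2)                    # encoding of the previous action
        actions, log_probs = [], []
        for t in range(T):
            s = torch.cat([F_seq[t], f_g, a_prev.squeeze(0)]).unsqueeze(0)
            h, c = self.lstm(self.embed(s), (h, c))
            dist = torch.distributions.Categorical(logits=self.head(h))
            a_t = dist.sample()                       # 1 = select, 0 = discard
            actions.append(a_t)
            log_probs.append(dist.log_prob(a_t))
            a_prev = F.one_hot(a_t, 2).float()
        return torch.cat(actions), torch.cat(log_probs)
```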

2) REWARD FUNCTION
The reward reflects how good the taken action $a_t$ is at state $s_t$. Given a fixed GGCN $\theta$, we can obtain a classification score by feeding the selected features into the classifier once the entire selection process is finished. Therefore, we define the reward $r_T$ based on this final classification result, where $c$ is the ground-truth category, $c_p$ is the predicted category, and $\mathrm{score}_c$ is the prediction score.

3) VALUE NETWORK
The value network predicts the value of each state, which is defined as the expected reward that the network will receive if the actor continues to sample outputs according to its probability distribution. Given the policy $\pi$, the extracted features, the sampled actions and the reward function, the value, i.e., the expected future return, is defined as a function of the observed state $s_t$:

$$V^{\pi}(s_t) = \mathbb{E}_{\pi}\big[\gamma^{T-t} r_T \mid s_t\big], \tag{7}$$

where $\gamma$ is a discount factor that allows us to focus more on the current reward. In particular, we build the value network in the same manner as the policy network, with an LSTM and fully-connected layers; the only difference is that the network produces a single output value.

Algorithm 1 Actor-Critic Training for the FSN
Require: Actor $\pi_{\phi}$, critic $V_{\psi}$ and a pretrained GGCN $\theta$;
1: Randomly initialize actor $\pi_{\phi}$ and critic $V_{\psi}$;
2: for epoch = 1 to max epoch do
3:   for each random example skeleton sequence do
4:     Extract feature sequence $F$ by the GGCN;
5:     Generate a sequence of actions $\{a_1, \ldots, a_T\}$ according to the current policy $\pi_{\phi}$;
6:     Compute $Q^{\pi}(s_t, a_t) = \gamma^{T-t} r_T$;
7:     Update critic weights $\psi$ by minimizing Eq. (9);
8:     Update actor weights $\phi$ using the gradient in Eq. (8);
9:   end for
10: end for

4) POLICY GRADIENT TRAINING AND VALUE FUNCTION ESTIMATION
Typically, the policy network is trained by the policy gradient method, which maximizes the expected cumulative reward by following the gradient

$$\nabla_{\phi} J(\phi) = \mathbb{E}_{\pi}\Big[\sum_{t=1}^{T} \nabla_{\phi} \log \pi_{\phi}(a_t \mid s_t)\, A^{\pi}(s_t, a_t)\Big], \tag{8}$$

where $A^{\pi}(s_t, a_t)$ is an advantage function defined as $A^{\pi}(s_t, a_t) = Q^{\pi}(s_t, a_t) - V^{\pi}(s_t)$. In detail, $V^{\pi}(s_t)$ is the value function defined in Eq. 7, and $Q^{\pi}(s_t, a_t)$ is also a value function, defined as the expected reward given the current state $s_t$ and the current action $a_t$. Clearly, $A^{\pi}(s_t, a_t)$ measures whether the sampled action is better or worse than the policy's default behavior. Thus, the gradient increases the probability of better-than-average actions and decreases the probability of worse-than-average actions. Based on the definitions of $V^{\pi}(s_t)$ and $Q^{\pi}(s_t, a_t)$, it is easy to see that $V^{\pi}(s_t) = \mathbb{E}_{a_t \sim \pi}\, Q^{\pi}(s_t, a_t)$. Thus, the most common approach to estimating the value function is to solve a nonlinear regression problem:

$$\min_{\psi} \sum_{t=1}^{T} \big\| V_{\psi}(s_t) - Q^{\pi}(s_t, a_t) \big\|^{2}. \tag{9}$$

Specifically, gradient descent is applied to train the value network by minimizing this mean squared error (MSE) loss. In practice, we use a Monte Carlo approach to estimate $Q^{\pi}(s_t, a_t)$. According to the reward setting above, we have

$$Q^{\pi}(s_t, a_t) \approx \frac{1}{M}\sum_{m=1}^{M} \gamma^{T-t} r^{m}_{T}.$$

In our experiments, we use $M = 1$, so $Q^{\pi}(s_t, a_t)$ is replaced by $\gamma^{T-t} r^{m}_{T}$. The value network is then estimated by minimizing Eq. 9, and the policy network is trained by gradient ascent according to Eq. 8. The overall algorithm for training the FSN is summarized in Algorithm 1.
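A compact sketch of one such update is shown below, assuming the per-step log-probabilities from the actor and the per-step values from the critic have already been collected for one sequence; the optimizer objects and the function name are placeholders.

```python
import torch
import torch.nn.functional as F

def actor_critic_update(log_probs, values, r_T, gamma, actor_opt, critic_opt):
    """One update with M = 1: the only reward is the terminal reward r_T,
    so Q(s_t, a_t) is estimated as gamma^(T-t) * r_T."""
    T = log_probs.shape[0]
    steps = torch.arange(T, dtype=torch.float32)
    q = (gamma ** (T - 1 - steps)) * r_T            # Monte Carlo estimate of Q
    advantage = q - values.detach()                 # A = Q - V (critic detached here)
    actor_loss = -(log_probs * advantage).sum()     # negative of Eq. 8 (gradient ascent)
    critic_loss = F.mse_loss(values, q)             # Eq. 9

    actor_opt.zero_grad()
    critic_opt.zero_grad()
    (actor_loss + critic_loss).backward()
    actor_opt.step()
    critic_opt.step()
    return actor_loss.item(), critic_loss.item()
```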
For the complete training, we apply an iterative method that alternately optimizes the FSN $\pi_{\phi}$ and the GGCN $\theta$ and consists of two steps at each stage (see the proof in Appendix A):
(1) Train the FSN $\pi_{\phi}$ with the GGCN $\theta$ fixed, according to Algorithm 1.
(2) Refine the GGCN $\theta$ with the FSN $\pi_{\phi}$ fixed, by minimizing the cross-entropy loss.
These two models promote each other: the GGCN $\theta$ provides extracted features for training the FSN, and the FSN selects informative features to refine the GGCN $\theta$, which significantly improves the performance of action recognition.
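A sketch of this alternating schedule is given below; the helper routines train_fsn_actor_critic and train_ggcn_cross_entropy, the number of stages, and the epochs per stage are hypothetical placeholders standing in for the two steps described above.

```python
def alternate_training(ggcn, fsn, critic, train_loader,
                       train_fsn_actor_critic, train_ggcn_cross_entropy,
                       num_stages=3, fsn_epochs=20, ggcn_epochs=10):
    """Sketch of the two-step alternating optimization of the FSN and GGCN."""
    for stage in range(num_stages):
        # (1) Train the FSN with the GGCN fixed (Algorithm 1).
        for p in ggcn.parameters():
            p.requires_grad_(False)
        train_fsn_actor_critic(fsn, critic, ggcn, train_loader, epochs=fsn_epochs)

        # (2) Refine the GGCN with the FSN fixed, using the cross-entropy loss.
        for p in ggcn.parameters():
            p.requires_grad_(True)
        for p in fsn.parameters():
            p.requires_grad_(False)
        train_ggcn_cross_entropy(ggcn, fsn, train_loader, epochs=ggcn_epochs)
        for p in fsn.parameters():
            p.requires_grad_(True)
```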

D. THREE-STREAM FRAMEWORK
Besides the coordinates of joints, other types of information from skeleton data, i.e., the temporal differences of joints and the vectors of bones, have also proved effective for skeleton-based action recognition [22], [23], [58]. Thus, we apply a three-stream framework that fuses this information to enhance recognition performance.

1) J-STREAM
The coordinates of the original 3D skeleton joints are the most intuitive representation of the spatial information. Hence, the input of the J-stream is simply

$$J^{t}_{i} = v^{t}_{i} = (x^{t}_{i}, y^{t}_{i}, z^{t}_{i}),$$

where $v^{t}_{i} = (x^{t}_{i}, y^{t}_{i}, z^{t}_{i}) \in \mathbb{R}^{3}$ is the coordinate of the $i$-th joint in the $t$-th skeleton frame.

2) T-STREAM
Temporal information also provides crucial cues for recognizing the underlying action, since it represents the temporal movements of the joints [22]. Similar to optical flow, we use the temporal difference of the joints as the input of the T-stream:

$$T^{t}_{i} = v^{t+1}_{i} - v^{t}_{i}.$$

3) B-STREAM
In a skeleton, each bone connects two joints. We use the vectors of such bones as the input of the B-stream, since they represent the lengths and directions of the bones, which are naturally more informative and discriminative for action recognition [26]. Hence, the input of this stream can be written as

$$B^{t}_{e} = v^{t}_{i} - v^{t}_{j},$$

where $e = (i, j) \in E$ is the bone between joints $i$ and $j$, and $E$ is the set of joint pairs connected in a skeleton.

In our framework, each stream has the same architecture but a different input. Finally, the outputs of the three streams are fused via weighted averaging before the softmax layer, where the weights are obtained via grid search. In detail, assuming the probability distributions output by the three streams are $P_1$, $P_2$, $P_3$, we search for $a$, $b$, $c$ by grid search at an interval of 0.01, where $a, b, c \in [0, 1]$, $a + b + c = 1$, and the weighted probability $P = aP_1 + bP_2 + cP_3$ achieves the highest classification accuracy on the validation set.

IV. EXPERIMENTS

A. DATASETS AND IMPLEMENTATION DETAILS
1) NTU-RGB+D
NTU-RGB+D [14] is the most widely used large-scale action recognition dataset, which contains 56880 action clips in 60 action categories. It provides the 3D coordinates of 25 joints for each human in an action. To evaluate models, two benchmarks are recommended: cross-subject (X-Sub) and cross-view (X-View). In the cross-subject benchmark, 40320 videos form the training set and 16560 videos form the test set, where the subjects in these two subsets are different. In the cross-view benchmark, the training set contains 37920 videos and the test set contains 18960 videos, where the horizontal camera angles are different.

2) NTU-RGB+D 120
NTU-RGB+D 120 [59] extends NTU-RGB+D with another 60 classes and another 57600 video samples, and a new evaluation benchmark is recommended: cross-setup (X-Set). In this benchmark, all samples with even collection-setup IDs form the training set, and the other setups are reserved for testing.

3) KINETICS
Kinetics [60] is a large-scale human action dataset that contains around 300000 videos retrieved from YouTube. There are 400 classes, ranging from daily activities and sports scenes to complex actions with interactions. The skeleton data are extracted with the publicly available OpenPose toolbox [11], which predicts the 2D coordinates of 18 joints for each person. We follow [24] in selecting two people for multi-person cases based on the average joint confidence. In this dataset, the training set contains 240000 video clips, and the test set contains 20000 video clips.

4) IMPLEMENTATION DETAILS
In our experiments, all models are trained with the same batch size (64). To train the GGCN, stochastic gradient descent (SGD) with Nesterov momentum (0.9) is applied as the optimization strategy; the learning rate is initialized to 0.1 and is divided by 10 at epochs 30 and 40.
To make training stable, in the GGCN block we use $1 + M_k$ in place of $M_k$ and use the normalized $e^{\alpha}$, $e^{\beta}$ in place of $\alpha$, $\beta$. To train the FSN, Adam is applied as the optimization strategy; the learning rate is initialized to 1e-5 and multiplied by 0.1 every 20 epochs. Additionally, we follow [26] to perform some data preprocessing.
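A sketch of this optimization setup (assuming ggcn and fsn are the instantiated networks) could look as follows; any settings not stated above are intentionally omitted.

```python
import torch

def build_optimizers(ggcn, fsn):
    """Sketch of the training setup described above (batch size 64 is set
    in the data loader, not here)."""
    ggcn_opt = torch.optim.SGD(ggcn.parameters(), lr=0.1,
                               momentum=0.9, nesterov=True)
    # learning rate divided by 10 at epochs 30 and 40
    ggcn_sched = torch.optim.lr_scheduler.MultiStepLR(
        ggcn_opt, milestones=[30, 40], gamma=0.1)

    fsn_opt = torch.optim.Adam(fsn.parameters(), lr=1e-5)
    # learning rate multiplied by 0.1 every 20 epochs
    fsn_sched = torch.optim.lr_scheduler.StepLR(fsn_opt, step_size=20, gamma=0.1)
    return ggcn_opt, ggcn_sched, fsn_opt, fsn_sched
```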

B. ABLATION STUDY
To analyze the contributions of the components of the proposed method, we conduct extensive experiments on the NTU-RGB+D [14] and NTU-RGB+D 120 [59] datasets.

1) EFFECT OF THE FSN
Here we focus on validating the effectiveness of the proposed FSN. In the experiments, we consider the combination of the FSN with three GCN-based feature extractors: ST-GCN [24], AGCN [26] and our proposed GGCN. Specifically, we follow [26] to modify ST-GCN, which improves its performance. Table 1 shows the recognition performance on the NTU-RGB+D and NTU-RGB+D 120 datasets. We see that the FSN significantly improves the recognition accuracies of the different graph convolutional networks on both datasets. Moreover, compared with these GCN-based feature extractors, the FSN is a lightweight network, as shown in Table 3. Hence, the FSN can cooperate with any GCN-based method to bring a significant improvement while adding only a few parameters. Intuitively, Fig. 6 shows the change in accuracy for each action after using the FSN. Improvements are observed for most actions, while a few actions exhibit lower performance. Moreover, Fig. 11 is the confusion matrix of these actions. The results show that some actions, i.e., 'reading', 'writing', 'typing on a keyboard', and 'playing with phone', have the largest chances of misclassification. Coincidentally, these actions are exactly the ones with the greatest fluctuations in accuracy in Fig. 6. We argue that this is because the FSN has difficulty distinguishing the importance of features in such similar actions. For example, the actions 'writing' and 'reading' are very similar, which results in the features extracted from these two actions also being very similar. Thus, the FSN may confuse the representative features of these two actions such that the accuracy of one improves while that of the other decreases.
To analyze why the FSN yields a notable improvement, we conduct an additional experiment. In the temporal dimension, considering that some features contribute positively to classification and others negatively, we feed each feature into the classifier to obtain a classification result. Then, the feature $f_i$ is defined as positive if the result is correct and negative otherwise. Table 2 shows the proportion of positive features among the extracted features and among the selected features. The results show that the FSN increases the proportion of positive features, which indirectly reflects its effectiveness. On the other hand, since our FSN can be regarded as a hard attention method, we compare it with a soft attention module whose architecture is the same as that of the FSN but whose outputs lie between 0 and 1. We train our FSN with the reinforcement learning algorithm and train the soft attention module with the conventional SGD algorithm. The results in Table 4 show that our FSN achieves better performance than the soft attention module. We argue that the FSN can better eliminate interference since it discards ambiguous features instead of giving them a small weight. We also replace the LSTM in the FSN with a Bi-LSTM; the experimental results show that the Bi-LSTM does not bring a performance improvement. We argue that this may be because the FSN focuses more on temporal order when selecting features.
Moreover, Fig. 7 shows an example of the selection result for the action 'throw'. As shown in the figure, the features extracted around the time when the subject throws are selected, and the features extracted around the time when the subject is just standing are discarded. This result verifies that the significance of the features differs along the temporal dimension and that the FSN can adaptively select representative features and discard ambiguous ones.

2) GENERALIZED GRAPH CONVOLUTION
As introduced in Section III-B, the generalized graph $G_k$ and the predefined graph $A_k$ are combined for graph convolution in the GGCN. In detail, we construct generalized graphs by simultaneously considering appearance features and structural features, where the structural features are formed by embedding adjacency vectors into a high-dimensional representation. To validate the effectiveness of each component, we manually remove the generalized graph from the GGCN, and we also experiment with two simpler variants of the generalized graph generation module to validate the effectiveness of the structural features. The first variant does not use structural features, i.e., $R(i, j) = R_A(i, j)$ in Eq. 2. The second directly uses the adjacency matrix (one-hot vectors) instead of embedding it with the Node2Vec method [55]. The results are shown in Table 5: the proposed GGCN is beneficial for skeleton-based action recognition, the generalized graph is important for feature learning, and the structural features are important for generalized graph generation. Additionally, Fig. 8 visualizes the structural features embedded by the Node2Vec method, where the structural features are reduced to 2 dimensions by t-SNE. The results show that not only adjacent joints (e.g., chest 21 and waist 1) but also joints that are similar in structure (e.g., the two hands 22 and 24) are close in the embedding space. Thus, the embedding keeps a balance between homophily and structural equivalence.

Moreover, Table 6 reports the values of the learnable parameters α and β introduced in Section III-B. As we can see, in lower layers the generation of the generalized graph relies more on the structural features, while the appearance features dominate in higher layers. We argue that this is because higher layers contain more semantic information. Hence, our proposed GGCN can adaptively combine appearance features and structural features to construct a more informative generalized graph. Fig. 9 visualizes the generalized graph generated in block 5 for different actions. Since every pair of joints in the generalized graph is connected, we choose the 10 edges with the largest weights in each generalized graph, excluding self-connections, for visualization. It shows that the generalized graph can capture long-range action-specific correlations, e.g., between the shoulders and hands in the action 'punch'. It also verifies our point that different samples need different dependencies between joints. Fig. 10 shows an example of the whole generalized graph for the action 'throw'. The blue scale of each element represents the strength of the dependency between joints. It shows that the predefined graph focuses on the joint relations within a local part, whereas the generalized graph can capture long-range dependencies between arbitrary joints.

TABLE 6. α and β of the First Generalized Graph in Several Blocks in GGCN, Which is Trained on NTU-RGB+D (X-Sub).

FIGURE 9. Visualization of the generalized graph for different actions. In each action, (left) is the predefined graph, and (right) is a graph consisting of the ten edges with the largest weights in the generalized graph.

3) THREE-STREAM FRAMEWORK
To verify the necessity of the additional types of information, we compare the performance of separate and fused information on both the NTU-RGB+D and NTU-RGB+D 120 datasets. The results are summarized in Table 7. As indicated in Table 7, all combinations achieve considerable improvements. In detail, every stream performs well on its own, where the lowest accuracy reaches 85.8%/92.2% on the X-Sub/X-View benchmark of the NTU-RGB+D dataset. Adding any second stream brings considerable improvements, with a maximum accuracy improvement of 3.6%/2.4%. Moreover, fusing all three streams further improves the recognition performance; every stream is helpful for action recognition, especially the J-stream and B-stream. These results verify the superiority of the proposed three-stream framework.

C. COMPARISONS WITH STATE-OF-THE-ART METHODS
To verify the superiority of our proposed method, we compared our final model with the state-of-the-art skeleton-based action recognition methods on the NTU-RGB+D [14], NTU-RGB+D 120 [59] and Kinetics [60] datasets.
1) NTU-RGB+D
To analyze the recognition results for each class of actions, the confusion matrix on the X-Sub benchmark is shown in Fig. 11. More than half of the values on the diagonal of the matrix exceed 90%, which shows that our method is accurate for most categories. Values elsewhere in the matrix are mostly blank, indicating that they are less than 0.01; that is, there is almost no misclassification between the corresponding pairs of actions. Although our method achieves strong performance on most actions, it performs poorly on some difficult actions, e.g., reading, writing, playing with a phone/tablet, and typing on a keyboard. As also discussed in Section IV-B1, we argue that this is because those actions are so similar that they are easily misclassified.

2) NTU-RGB+D 120
The NTU-RGB+D 120 dataset is a new dataset that extends NTU-RGB+D, and many methods have not yet been tested on it. We compare our method with three approaches reported in [59]. The results in Table 9 show that our method outperforms these methods by a large margin.

3) KINETICS
On the Kinetics dataset, we compare our proposed method with seven state-of-the-art approaches, including a traditional method, Feature Enc [31]; an RNN-based method, Deep LSTM [14]; a CNN-based method, TCN [20]; three GCN-based methods, ST-GCN [24], AS-GCN [27] and 2s-AGCN [26]; and a graph neural network, DGNN [29]. We also list a method [60] that uses RGB images and optical flow for action recognition, which can achieve better performance than skeleton-based methods since these modalities contain more information. Table 10 reports the top-1 and top-5 classification performance. Compared with the other GCN-based methods, our method improves accuracy by about 0.6%/1.0% on the top-1/top-5 metrics. These results are consistent with the experiments on NTU-RGB+D, which confirms the generalization capability of our method on large-scale datasets.

V. CONCLUSION
In this work, we first propose a novel feature selection network (FSN) to adaptively select the most representative features and discard ambiguous features in the temporal dimension; it can cooperate with any GCN-based approach to improve recognition performance while adding only a few additional parameters. Moreover, we propose a generalized graph convolutional network (GGCN) for feature extraction. It generates a generalized graph from the appearance similarity and structural similarity between joints in a high-dimensional feature space. This data-driven method can adaptively capture the potential dependencies between any joints and increases the flexibility of the graph convolution. Furthermore, the GGCN and FSN are combined in a three-stream framework to fuse different types of information, which further enhances the recognition performance. The method is evaluated on the NTU-RGB+D, NTU-RGB+D 120 and Kinetics datasets. The experimental results verify the effectiveness of our method and show that it achieves state-of-the-art performance.

APPENDIX A THEOREM PROOF
Theorem 1: The feature selection network $\pi_{\phi}$ and the generalized graph convolution network $\theta$ (with classifier) can be optimized by an iterative method.

APPENDIX B MODEL ARCHITECTURES
In this section, we show the detailed architectures of the proposed GGCN and FSN.

4) GGCN
Our proposed GGCN is a stack of 10 generalized graph convolutional blocks; the whole architecture is shown in Table 11. The output dimensions are denoted by $C \times T \times N$ for the channel, temporal, and spatial sizes. The green color marks the kernel size ($K_v$) of the spatial dimension and the kernel size ($K_t$) and stride of the temporal dimension, and the orange color marks the output channels of each block. * indicates that the generalized graph is not used in that block.

5) FSN
In each state $s_t = \{f_t, f_g, h_{t-1}, c_{t-1}, a_{t-1}\}$, we first use an MLP consisting of three fully-connected layers to aggregate the information of $f_t$, $f_g$ and $a_{t-1}$. The output of the MLP is fed into the LSTM together with the hidden state $h_{t-1}$ and cell state $c_{t-1}$ from the previous time step. We then obtain the hidden state $h_t$ and feed it into another MLP followed by a softmax to output the probability of selection. The architecture of the FSN is shown in Table 12; specifically, the inputs $f_t$ and $f_g$ are 256-dimensional vectors, and $a_{t-1} \in \mathbb{R}^{N}$ is a one-hot vector, where $N$ is the number of joints. In the training stage, the value network has the same architecture as the policy network (FSN), except that the last layer outputs a single scalar.
LIANG LI received the degree in applied mathematics from Wuhan University in 1989 and the M.Sc. degree in 2000. He then spent eight years at the Jianghan Petroleum Institute, engaged in teaching and research in mathematics and applied mathematics. From 1997 to 2000, he returned to Wuhan University and was involved in computational mathematics. He then joined Shenzhen Polytechnic as a teacher. Since 2000, he has served as the Professional Director and the Head of the Teaching and Research Section at Shenzhen Polytechnic, as an independent director for listed companies, as a technical consultant for many technology companies, and as an evaluation expert for science and technology plan projects and for municipal cultural industry projects in Shenzhen. He is currently the Vice President of the Digital Creativity and Animation College, engaged in teaching, scientific research, and research management in computer software and applications. He has published more than 10 articles in international academic journals, obtained more than 10 software copyrights, utility model inventions, and other patents, and undertaken more than 30 research and development projects. The software he developed has been applied in many enterprises with sound results. His research interests include optimization algorithms, the development of computer application software systems, interaction design, and graphic image vision research.