Adaptive cross-fusion learning for multi-modal gesture recognition

Background Gesture recognition has attracted significant attention because of its wide range of potential applications. Although multi-modal gesture recognition has made significant progress in recent years, a popular approach is still to simply fuse the prediction scores at the end of each branch, which ignores complementary features among different modalities in the early stages and does not fuse them into a more discriminative representation. Methods This paper proposes an Adaptive Cross-modal Weighting (ACmW) scheme to exploit complementary features from RGB-D data. The scheme learns relations among different modalities by combining the features of different data streams. The proposed ACmW module contains two key functions: (1) fusing complementary features from multiple streams through an adaptive one-dimensional convolution; and (2) modeling the correlation of multi-stream complementary features in the temporal dimension. Through the effective combination of these two functions, ACmW automatically analyzes the relationship between the complementary features from different streams and fuses them in the spatial and temporal dimensions. Results Extensive experiments validate the effectiveness of the proposed method and show that it outperforms state-of-the-art methods on IsoGD and NVGesture.


Introduction
Gesture recognition is attracting increasing attention in both the research and industrial communities because of its vast range of applications [1][2][3][4], such as human-computer interaction [5] and video surveillance. Because complementary feature learning can benefit from the different aspects captured by different data modalities, multi-modal gesture recognition technology [6][7][8][9] has been proposed. For example, the depth modality makes it easy to distinguish foregrounds (i.e., face, hands, and arms) from backgrounds, whereas RGB data provide richer texture and color appearance. However, a central problem of multi-modal gesture recognition is how to effectively fuse the feature representations extracted from the different data modalities, which is not a trivial task.
In multi-modal gesture recognition, the effective fusion of complementary features learned from such data benefits recognition. Previous fusion methods have attempted to directly add, multiply, or average the final Softmax scores [6,8,10-12]. However, gesture recognition often hinges on subtle changes in hand/arm movements, which are typically captured at intermediate stages. These methods are equivalent to fusing only the global representation and do not consider the subtle changes in hand/arm movements at intermediate stages. Thus, the features of different data modalities should be fused within the network, rather than trained separately and combined by late fusion.
Motivated by the above observations, to exploit the spatio-temporal correlation among different data modalities, we propose an adaptive cross-modal weighting (ACmW) module that is employed at different stages throughout the network. Unlike one-off fusion strategies [10,13,14], and inspired by the message passing mechanism of FishNet [15], we use the ACmW module to generate a "spatio-temporal correlation message" across modalities and combine it with the original data stream for subsequent feature learning. This strategy avoids losing details of a single data modality during the early stages. Our ACmW module can accept two inputs of different sizes, and its outputs have the same size as the original feature maps. With this design, it can be embedded in any network architecture. The network is end-to-end trainable and focuses on gesture-related features, even with multi-modal inputs. Our contributions can be summarized as follows: (1) We propose an adaptive fusion module called ACmW. Unlike previous offline multi-modal fusion schemes [6,16], ACmW enables the network to train on different data modalities in an end-to-end manner. It also leverages the strengths of the different modalities by generating a "spatio-temporal correlation message" and combining messages from different streams, instead of simply late-fusing the scores of the different data.
(2) Extensive experiments prove that the integration of our designs ultimately improves the performance of gesture recognition. The experimental results demonstrate that our method strikes a balance between good performance and a low computational burden, and that it outperforms the top techniques on two large-scale benchmark gesture datasets: IsoGD and NVGesture.
The rest of this paper is organized as follows. Related work is introduced in Section 2. Section 3 introduces the proposed ACmW module. Section 4 provides the details of our experiments, including a performance evaluation of the ACmW module on two benchmark gesture datasets and a comparison with other state-of-the-art methods. At the end of this section, we visualize the neural activations.
Finally, we provide some concluding remarks in Section 5.

Related work
In this section, we first introduce recent developments in multi-modal gesture recognition. Then, recent progress in feature fusion strategies is reviewed.
With the release of RGB-D sensors in recent years, simultaneously captured RGB and depth data have become easily available, which has promoted the development of multi-modal gesture recognition technology. Miao et al. proposed a multi-modal gesture recognition method based on a ResC3D network to overcome the barriers of gesture-irrelevant factors [6]. Tran et al. proposed a multi-modal continuous gesture recognition method consisting of two modules: segmentation and recognition [30]. In addition to methods based on convolutional neural networks (CNNs), the LSTM variants AttenConvLSTM [31] and PreRNN [32] have been utilized for RGB-D gesture recognition. All of these methods first train different branches of the network on different data modalities and then combine the predicted Softmax scores. An advantage of these methods is that the errors in the fused scores come from different branches; thus, they do not affect each other and do not accumulate further. As noted by Roitberg et al. [8], the disadvantage of this approach is that it ignores many intermediate representations that have a significant impact on classification performance. Therefore, fusion strategies for multi-modal features have attracted the attention of researchers.

Multi-modal fusion strategy
The leveraging of multi-modal data can be found in many previous studies. Decision-level and feature-level fusion are two common strategies in multi-modal gesture recognition. Decision-level fusion techniques [14,33-35] are easily implemented but only concern the majority vote, so the other types of data cannot help in the final recognition. Feature-level fusion [6,8-10,16,36,37] retains sufficient information from all features and avoids complicated registration pre-processing, owing to its uniform dimension. Among these methods, the fusion strategy of Roitberg et al. [8] is the most similar to our own. However, instead of directly using a convolutional layer for fusion, we designed a more comprehensive fusion scheme, which models the temporal correlation among different data modalities and fuses the extracted multi-stream features at an early stage, thereby enhancing the spatio-temporal representation. Specifically, the features of different modalities are first unfolded into 1-D vectors, and we then use different kernels of the convolutional layers, which learn a more adaptive fusion feature for each data modality. The network can then exploit the complementary spatio-temporal information of the other data modality according to the properties of the current data stream, rather than simply adding the blended features to it. At the same time, inspired by the study of Hu et al. [38], which proposed a channel-based attention mechanism, we model the correlation of multi-stream features over the temporal series to enhance the temporal representation and achieve a fusion of multi-stream features. See Section 3 for more details.

Methodology
In this section, we first formulate the structure of the proposed ACmW module in Section 3.1. We then introduce the implementation of the ACmW module in Section 3.2. Finally, the details of the ACmW network architecture are provided in Section 3.3.

Adaptive cross-modal fusion scheme
As mentioned above, the different data modalities can be complementary to each other. To improve the recognition accuracy, a comprehensive scheme should be carefully designed to exploit and combine the advantages of the different modalities. As shown in Figure 1, the ACmW module takes the features from the RGB and depth branches as inputs and conducts an adaptive convolution to derive two groups of weighted feature maps, rather than simply blending the inputs into a single fused stream. This guarantees that the fusion process can learn complementary multi-modal features, from low-level visual features to high-level semantic features. ACmW mainly contains two sub-structures: the spatial feature fusion mechanism and the temporal series-based adaptive fusion mechanism.
For the spatial feature fusion mechanism, as described in Equation 1, both the RGB and depth features are initially unfolded. The unfolding is achieved by stretching the features into a one-dimensional vector, which makes the fusion more efficient.
where x_i and y_i indicate the feature maps at a specific stage i of the RGB and depth branches, respectively. The function map(·) performs the unfolding process, and C indicates the convolution operation.
Later, as described in Equation 2, the unfolded features are concatenated along the unfolded dimension by the function F(·) and sent to the adaptive convolution, which has two convolution kernels. Note that the number of kernels accords with the number of data types. The adaptive convolution is implemented as a 1 × 1 × 1 convolution with weights W. We then obtain two weighted features of the same shape as the originals through feature unfolding, where x'_i and y'_i indicate the fused feature maps.
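The unfold-concatenate-convolve pipeline of Equations 1 and 2 can be sketched in PyTorch as follows. The class name, the use of `Conv1d` over the stream dimension (standing in for the paper's 1 × 1 × 1 adaptive convolution over unfolded vectors), and all tensor sizes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SpatialFusion(nn.Module):
    """Illustrative spatial feature fusion (cf. Equations 1 and 2).

    Each stream is unfolded (map) into a 1-D vector, the vectors are
    concatenated (F) along a new stream dimension, and an adaptive
    convolution with one kernel per modality produces two weighted
    vectors that are folded back to the original feature-map shape.
    """

    def __init__(self, num_streams: int = 2):
        super().__init__()
        # Adaptive convolution with learned weights W: one output kernel
        # per data modality, kernel size 1.
        self.adaptive_conv = nn.Conv1d(num_streams, num_streams, kernel_size=1)

    def forward(self, x: torch.Tensor, y: torch.Tensor):
        shape = x.shape                         # (N, C, T, H, W)
        n = shape[0]
        x_flat = x.reshape(n, 1, -1)            # map(x_i)
        y_flat = y.reshape(n, 1, -1)            # map(y_i)
        z = torch.cat([x_flat, y_flat], dim=1)  # F(.): (N, 2, C*T*H*W)
        z = self.adaptive_conv(z)               # two weighted 1-D features
        x_fused = z[:, 0].reshape(shape)        # x'_i, same shape as x_i
        y_fused = z[:, 1].reshape(shape)        # y'_i, same shape as y_i
        return x_fused, y_fused
```

Because the outputs keep the input shape (N, C, T, H, W), such a module can be dropped between any two backbone stages.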
For the temporal series-based adaptive fusion mechanism, inspired by a previous study [39], we model the temporal series, instead of the channels, to learn the correlation of the multi-modal features. Specifically, both the RGB and depth feature maps are mapped into a one-dimensional vector in the temporal dimension using the transformation function F_tr, which is formulated in Equation 3.
They are then concatenated by column. As described in Equation 4, after passing through fully connected (FC) layers, we obtain a weight vector with the same shape as the temporal dimension of a single branch input, and we finally expand this weight vector to the same number of dimensions as the raw input features. Moreover, an element-wise product is employed to enhance the temporal representation of the multi-modal features and achieve a deep fusion of the multi-stream features within the time dimension. By enhancing the temporal representation and fusing the spatial information, the ACmW module can effectively aggregate spatio-temporal features.
where δ refers to the ReLU function and σ refers to the sigmoid function. In addition, W_1 ∈ R^((r×C)×C) and W_2 ∈ R^(C×C), where r is the number of branches and E(·) indicates the expansion function.
To preserve the original information of each data modality, an element-wise sum combining the weighted results and the original features is used.
where x_i^o and y_i^o indicate the outputs of the ACmW module, and ⊙ indicates the element-wise product. Because the parameters of the adaptive convolution are learned by the network itself, the fusion of the features is adaptive. Meanwhile, because the ACmW module does not directly derive one fused output stream, each branch can still learn the features specific to its own data, and the complementarity of the different data can be exploited from low-level to high-level features.
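A minimal sketch of the temporal weighting path (cf. Equations 3 to 5): it assumes average pooling for F_tr and a single weight vector shared by both streams; the layer sizes, the pooling choice, and all names are our assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class TemporalWeighting(nn.Module):
    """Illustrative temporal series-based adaptive fusion (cf. Eqs. 3-5).

    F_tr pools each stream over channels and space to a length-T vector,
    the r = 2 vectors are concatenated, two FC layers (ReLU, then
    sigmoid) produce a length-T weight vector, which is expanded (E) and
    applied by an element-wise product, with a residual element-wise sum
    preserving the original features.
    """

    def __init__(self, t: int, num_streams: int = 2):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(num_streams * t, t),  # W1, followed by delta (ReLU)
            nn.ReLU(inplace=True),
            nn.Linear(t, t),                # W2, followed by sigma (sigmoid)
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor, y: torch.Tensor):
        # F_tr: (N, C, T, H, W) -> (N, T), averaging over C, H, and W.
        vx = x.mean(dim=(1, 3, 4))
        vy = y.mean(dim=(1, 3, 4))
        w = self.fc(torch.cat([vx, vy], dim=1))   # (N, T) weight vector
        w = w[:, None, :, None, None]             # E(.): broadcastable shape
        # Weighted result combined with the original by element-wise sum.
        return x + x * w, y + y * w
```

The residual sum keeps each branch's original features intact even when the learned weights are small, matching the identity-preservation argument above.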

Implementation of ACmW
As shown in Figure 2, both branches adopt the same backbone, such as C3D or 3D ResNet-50 (Res3D). We fuse the multi-stream features at different stages of the network. Meanwhile, to avoid losing the original feature information, an element-wise sum operation is utilized to combine the original and fused features. In addition, for the final prediction, instead of discarding the prediction scores of the individual branches, we combine them with the scores of the fusion layer, which significantly improves the performance of the multi-branch network. For the C3D network structure, we extract the two feature streams after the pooling layer of each branch and input them into the ACmW module for fusion.
The fused features are then taken as the input of the next layer. Thus, a total of five ACmW modules are embedded in a cascaded manner in C3D. For the Res3D network, the ACmW modules are embedded behind each residual block for feature fusion; a total of four ACmW modules are therefore embedded in a cascaded manner in Res3D.
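The cascaded embedding described above can be sketched as a generic dual-stream wrapper. The stage lists and the `AvgFusion` placeholder below are stand-ins for the real backbone stages and the ACmW module; only the wiring pattern reflects the text.

```python
import torch
import torch.nn as nn

class AvgFusion(nn.Module):
    """Placeholder for the fusion module: nudges each branch toward the
    cross-stream mean while keeping the input shape."""
    def forward(self, x, y):
        m = (x + y) / 2
        return x + m, y + m

class DualStreamWithFusion(nn.Module):
    """Two parallel backbones with a fusion module after every stage.

    Because each fusion output matches its stage's output shape, the
    fused features feed directly into the next stage of each branch.
    """
    def __init__(self, rgb_stages, depth_stages, fusion_modules):
        super().__init__()
        self.rgb_stages = nn.ModuleList(rgb_stages)
        self.depth_stages = nn.ModuleList(depth_stages)
        self.fusions = nn.ModuleList(fusion_modules)

    def forward(self, rgb, depth):
        for stage_r, stage_d, fuse in zip(
                self.rgb_stages, self.depth_stages, self.fusions):
            rgb, depth = stage_r(rgb), stage_d(depth)
            rgb, depth = fuse(rgb, depth)  # cascaded fusion after the stage
        return rgb, depth
```

For C3D one would pass five stage/fusion pairs, and for Res3D four, matching the counts stated above.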

Details of the ACmW network architecture
In this section, we present the details of the proposed ACmW module. Taking C3D as the backbone, as shown in Table 1, we give the feature sizes, floating point operations (FLOPs), and parameters of each stage. Taking stage 1 as an example, we first conduct the temporal-based fusion, which maps the multi-stream features into a vector with the shape N × 1 × 32 × 1 × 1, and then expands it to the shape N × 64 × 32 × 56 × 56 to match the spatial-based features. For the spatial-based fusion, we first stretch the features into a one-dimensional vector with the shape N × 1 × 6422528 (in each channel, we stretch the 32 × 56 × 56 features into a one-dimensional vector of size 1 × 100352, and the 64-channel features are then concatenated into a one-dimensional feature of size 1 × 6422528). This step makes the fusion more efficient. Then, the unfolded features are concatenated along the unfolded dimension, giving the shape N × 2 × 6422528, and sent to the adaptive convolution layer, which has two convolution kernels. Next, we obtain two weighted features. These two features are reshaped to the same size (N × 64 × 32 × 56 × 56) as the originals through the feature unfolding part. In addition, the total FLOPs and parameters of ACmW are about 204.9 M and 4.1 K, whereas those of a single C3D branch are about 308.7 G and 79.0 M, which clearly shows that ACmW is an extremely lightweight module.
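The stage-1 shape arithmetic above can be checked directly. The snippet assumes one sample (N = 1) and uses torch only to illustrate the unfold/fold round trip; it asserts nothing beyond the sizes stated in the text.

```python
import torch

# Stage-1 sizes for one sample: C channels, T frames, H x W spatial grid.
C, T, H, W = 64, 32, 56, 56

per_channel = T * H * W        # each channel stretches to 1 x 100352
unfolded = C * per_channel     # the full unfolded vector per stream
assert per_channel == 100352
assert unfolded == 6422528     # matches the N x 1 x 6422528 shape above

# Unfold/fold round trip: flattening and reshaping back is lossless.
x = torch.zeros(1, C, T, H, W)
flat = x.reshape(1, 1, -1)
assert flat.shape == (1, 1, 6422528)
assert flat.reshape(1, C, T, H, W).shape == x.shape
```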

Experiments
In this section, we first present the details of the benchmark datasets in Section 4.1, which are used to evaluate our method. Then, the implementation details of the experimental setup are given in Section 4.2.
Finally, we thoroughly evaluate the impacts of the ACmW module by embedding it into different backbones on two benchmark datasets in Section 4.3.

Datasets
We evaluate our method on two RGB-D gesture datasets: the ChaLearn IsoGD dataset [40] and the NVGesture dataset [7]. As shown in Figure 3, NVGesture comprises constrained driving gestures, whereas IsoGD contains multiple types of gestures, e.g., mudra and diving gestures, in an unconstrained setting.
ChaLearn IsoGD dataset. The ChaLearn IsoGD dataset was proposed by Wan et al. [40]. It contains 47,933 RGB-D gesture videos divided into 249 types of gestures performed by 21 individuals. The dataset is divided into training, validation, and test subsets.
NVGesture dataset. NVGesture [7] focuses on touchless driver control. It contains 1,532 dynamic gestures separated into 25 classes, involving RGB and depth videos and a pair of stereo-IR streams. The dataset is divided into training and testing subsets at a ratio of 7:3, namely, 1,050 samples for training and 482 for testing. Unlike the work of Molchanov et al. [7], which used all modalities to obtain its result, we consider RGB-D gesture recognition and therefore experiment using only the RGB-D data.

Experimental setup
Our experiments are all conducted using PyTorch [41] on an RTX 2080 Ti GPU. During the training stage, the input frames are spatially resized to 256 × 256 and then randomly cropped to 224 × 224, whereas during the inference stage they are center-cropped. We randomly sample 32 frames from each video, train the network with a mini-batch of 32 samples, and utilize the SGD optimizer with a weight decay of 0.0005 and momentum of 0.9. The initial learning rate is 0.01, which is reduced ten-fold when the accuracy on the validation set does not improve for three epochs. Training is stopped after the learning rate falls below 1e-5.
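Under the standard PyTorch APIs, the reported schedule can be sketched as follows. The stand-in model and the pairing with `ReduceLROnPlateau` are our assumptions about how the plateau-based tenfold decay might be implemented; the hyperparameter values come from the text.

```python
import torch
from torch import nn, optim

# A stand-in model; any parameterized module works for this sketch.
model = nn.Conv3d(3, 8, kernel_size=3, padding=1)

# SGD with the reported weight decay, momentum, and initial LR.
optimizer = optim.SGD(model.parameters(), lr=0.01,
                      momentum=0.9, weight_decay=0.0005)

# Tenfold decay when validation accuracy plateaus for three epochs.
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=3)

def should_stop(opt: optim.Optimizer) -> bool:
    """Stop once every parameter group's LR has fallen below 1e-5."""
    return all(group["lr"] < 1e-5 for group in opt.param_groups)
```

In the training loop, one would call `scheduler.step(val_accuracy)` once per epoch and break when `should_stop(optimizer)` returns True.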

Impact of ACmW module on C3D network.
In these experiments, we use C3D as the backbone to study the impact of the ACmW module. C3D can simultaneously model appearance and motion information, and is more suitable for spatio-temporal feature learning than 2D ConvNets. The training process is divided into two stages: (1) training the two C3D network branches on the RGB and depth data, and (2) training the fused network as shown in Figure 2. First, the ACmW module is embedded between the RGB and depth network branches in a cascaded manner. Then, the RGB and depth branches are fine-tuned using the weights from the first training stage. Finally, we use a small learning rate (0.001 in this experiment) and the Adam optimizer to train the entire network. After a few (about 7 to 10) epochs, the network converges.
In addition, to reduce the number of parameters, based on a previous study [43], we use a 1 × 1 × 1 convolution layer instead of a fully connected layer to predict the final classification probability. Figure 4 shows the accuracy of the different fusion results on the IsoGD and NVGesture datasets. Notably, compared with other common fusion schemes, i.e., score fusion and element-wise multiplication fusion, our fusion strategy significantly improves the recognition accuracy.
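After global average pooling, a kernel-size-1 `Conv3d` acts like a per-class linear layer, which is one way to realize the 1 × 1 × 1 convolutional classifier mentioned above. The channel width (512) is an assumed value; `num_classes = 249` matches IsoGD.

```python
import torch
from torch import nn

num_classes, channels = 249, 512   # 249 classes matches IsoGD

# Global average pooling reduces the spatio-temporal grid to 1x1x1, so a
# kernel-size-1 Conv3d behaves like a linear classifier while keeping
# the parameter count at channels * num_classes + num_classes.
head = nn.Sequential(
    nn.AdaptiveAvgPool3d(1),                      # (N, C, 1, 1, 1)
    nn.Conv3d(channels, num_classes, kernel_size=1),
    nn.Flatten(),                                 # (N, num_classes)
)

logits = head(torch.randn(2, channels, 4, 7, 7))
```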
The accuracies of the different fusion methods are shown in Table 2. The score fusion method conducts the fusion at the latest stage, making the decision based only on the maximum probability of the predictions from the different modalities. Element-wise multiplication fusion involves multiplying the predicted probability values from the different modalities to obtain a new probability distribution. These two fusion methods cannot sufficiently exploit the advantages of the different data modalities, and they do not consider the correlation of the temporal series in video-based classification tasks. Therefore, they clearly cannot improve much on the performance of any single data modality. By comparison, ACmW can exploit the complementarity spatially and temporally throughout the network, which helps the features of the different modalities focus on the gesture. Consequently, it achieves the best performance. Specifically, its performance on these two benchmark datasets is about 1% higher than score fusion and higher still than element-wise multiplication fusion.
For the 3D ResNet-50 backbone, the ACmW module is embedded after each residual block, and the features fused after the last residual block are input into the 1 × 1 × 1 convolution to conduct the final fusion score prediction. Figure 5 shows the performance improvement of the ACmW module embedded in the dual-branch 3D ResNet-50 network.
Compared with the other fusion strategies, ACmW has a significant improvement on these two datasets.
Table 3 demonstrates this more clearly, where the performance on both the IsoGD and NVGesture datasets is about 1% higher than the score fusion and about 2% higher than multiplicative fusion.
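For reference, the two baseline schemes can be illustrated on toy probability vectors. The max-based decision rule is one reading of the score-fusion description above, and all numbers are made up for illustration.

```python
import torch

# Toy per-class probabilities from the two branches (3 classes).
p_rgb = torch.softmax(torch.tensor([2.0, 1.0, 0.1]), dim=0)
p_depth = torch.softmax(torch.tensor([0.2, 1.0, 0.3]), dim=0)

# Score fusion: decide by the largest probability produced by either
# modality (one reading of the description above).
score_pred = int(torch.maximum(p_rgb, p_depth).argmax())

# Element-wise multiplication fusion: multiply the distributions and
# renormalise to obtain a new probability distribution.
mult = p_rgb * p_depth
mult_fusion = mult / mult.sum()
mult_pred = int(mult_fusion.argmax())
```

Neither rule looks at intermediate features or temporal structure, which is exactly the limitation the ACmW module is designed to address.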

Comparison with state-of-the-art methods
After studying the components described in Section 4.3, we evaluate the performance of the ACmW module on the two benchmark datasets. Our method is compared with recent state-of-the-art methods on the IsoGD and NVGesture datasets.
For the IsoGD dataset, because most methods release their results on the validation subset, we also conduct our experiments on it for a fair comparison. As shown in Table 4 and Table 5, existing video-based classification methods [34] adopt 3D CNNs to first learn RGB- and depth-based network branches, respectively, and then give the final classification result by combining their prediction results.

Feature visualization
The neural activations are shown in Figure 6. It can be seen that the proposed ACmW module effectively fuses the spatio-temporal representations to drive the model to focus more on the movement of the arms and hands. Clearly, ACmW has a significant effect on the appearance of the feature maps. Integrating the advantages of the RGB and depth modalities reveals the contextual information of the movement path. The ACmW module not only marks the regions related to the gesture, such as the performer's arm, but also distinguishes the ranges of motion at different positions in the video sequence. It effectively avoids the impact of noise on the features, which is present without an attention mechanism, particularly when a drastic movement, such as the raising or dropping of an arm, occurs. Therefore, our ACmW module can better guide the network to focus on the hand and arm, and provide a more accurate prediction.

Conclusion
In this study, we developed an ACmW scheme to exploit the complementary features from RGB-D data throughout the network.

[Table: comparison on the NVGesture dataset; methods include HOG + HOG2 [50], I3D [51], and ACmW (Ours).]
Future directions include exploring the fusion performance of the ACmW module on more than two feature streams, and proving the applicability of the ACmW module in 2D convolutional networks.

Declaration of competing interest
We declare that we have no conflict of interest.
© Copyright 2021 Beijing Zhongke Journal Publishing Co. Ltd. Publishing services by Elsevier B.V. on behalf of KeAi Communication Co. Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by/4.0/).

Figure 1 Structure of the ACmW module. Here, ⊙ indicates an element-wise product, ⊕ indicates an element-wise sum, and the different colors of the feature maps indicate different weight values. We apply two fusion strategies to the multiple feature streams. The first is temporal-based fusion (the left part), which produces a temporal descriptor by aggregating the feature maps across their spatial dimensions (C × H × W) to learn the correlation of the multi-stream features on the temporal series through linear layers. The second is spatial-based fusion (the right part), which produces more representative features using an adaptive convolution layer (a 3D convolution with a kernel and step size of 1). By combining these two fusion strategies, we make full use of the semantic information of the high-level features and the fine-grained information of the low-level features.

Figure 2 An overview of the multi-stream classification model. The ACmW module is embedded between the two network branches (RGB and depth) from the early to the late stage for feature fusion, where ⊕ indicates the element-wise sum. The RGB branch carries visual information about the scenes and objects in the video, and the depth branch significantly eliminates background noise.
The IsoGD dataset comprises three subsets, i.e., training, validation, and test sets, which contain 35,878, 5,784, and 6,271 samples, respectively. The performers in the three subsets are mutually exclusive. The dataset has also served as the benchmark for two rounds of the ChaLearn LAP large-scale isolated gesture recognition challenge.

Figure 3 Example images from the different benchmark datasets: (a), (b) RGB frames and the corresponding depth frames from the ChaLearn IsoGD dataset; (c), (d) RGB frames and the corresponding depth frames from the NVGesture dataset.

Figure 4 Impact of the ACmW module on 3D ResNet-50. (a) Fusion results on the NVGesture test set; (b) fusion results on the IsoGD validation set.
The main functions of the ACmW module are exploring the correlation of multi-stream features in the temporal dimension and fusing the spatial representations of the multi-stream features. Through an effective combination of these two functions, the multi-stream features from different data modalities are deeply fused in the temporal and spatial dimensions. Extensive experiments show the effectiveness of our approach.

Figure 6 Feature visualization from the dual-stream C3D network embedded with the ACmW module on the IsoGD validation set.

Impact of ACmW module on 3D ResNet-50 network.
In this experiment, we use 3D ResNet-50 as the backbone to study the impact of the ACmW module. Similar to C3D, 3D ResNet-50 also uses 3D convolution kernels to extract spatio-temporal representations. However, Res3D outperforms networks such as C3D on large datasets. The training process is the same as that of C3D, except that in 3D ResNet-50 the ACmW module is embedded after each residual block.