Dynamic Gesture Recognition Based on Three-Stream Coordinate Attention Network and Knowledge Distillation

Gesture recognition has always been one of the important research directions in the field of computer vision. Dynamic gestures involve complex backgrounds and many interference factors, and gesture recognition models based on deep learning usually have a high computational cost and poor real-time performance. In addition, deep learning models are limited to recognizing the categories present in the training set, and their performance largely depends on the amount of labeled data. To address these problems, this paper presents a dynamic gesture recognition method named 3SCKI based on a three-stream coordinate attention (CA) network, knowledge distillation, and image-text contrastive learning. Specifically, 1) CA is used for feature fusion so that the model focuses more on the target gestures and background interference is reduced; 2) the traditional knowledge distillation loss is improved to reduce the amount of computation and improve real-time performance: a guidance function is added so that the student network learns only the classification probabilities that the teacher network identifies correctly; and 3) a multi-granularity context prompt template integration method is proposed to construct MG-CLIP, an improved CLIP visual language model that aligns text and visual concepts from the image level to the object level to the part level. Gesture classification is performed through contrastive learning of image features and text features, enabling the model to identify image categories that did not appear during the training phase. The proposed method is evaluated on the ChaLearn LAP large-scale isolated gesture dataset (IsoGD). The results show that our method achieves a recognition rate of 65.87% on the validation set of IsoGD. For single-modality data, 3SCKI obtains state-of-the-art recognition accuracy on RGB, Depth, and Optical Flow data (61.22%, 58.84%, and 50.30% on the validation set of IsoGD, respectively).

I. INTRODUCTION

Gesture recognition methods can be divided into static gesture recognition and dynamic gesture recognition based on whether the dataset consists of pictures or videos. The research object of static gesture recognition is a single image, which only considers the spatial information at a certain moment and ignores the information in the time series. The research object of dynamic gesture recognition is gesture images over a continuous time series, which adds temporal information and motion features. Compared to static gesture recognition, dynamic gesture recognition covers more types of gestures, has stronger expressive ability, and is more applicable and practical. In recent years, because vision-based dynamic gesture recognition technology enables natural human-computer interaction (HCI) [6], dynamic gesture recognition has been widely used in many scenarios such as target detection, video retrieval, virtual reality, intelligent transportation, and sign language recognition [7], [8], [9], [10].
In dynamic gesture videos, the proportion of effective information such as hand movements and motion trajectories is very small, and there is a large amount of interfering information such as skin-color differences and overly bright or dark lighting, which affects the gesture recognition results. Although many methods have been proposed to solve this problem, such as traditional gesture segmentation algorithms, neural network techniques, and improved convolutional neural network models, recognition accuracy is still insufficient.
In addition, large-scale training sets not only require large memory but also need powerful computing equipment to speed up training and testing. However, high-performance deep learning networks are usually computation- and parameter-intensive, which makes them difficult to apply to resource-limited edge devices. Given limited resource budgets, efficient small-scale networks need to be developed in order to run deep learning models on resource-constrained devices. Currently, there are five main methods for obtaining efficient deep learning models: direct manual design of lightweight network models, pruning, quantization, automated network design based on neural architecture search, and knowledge distillation (KD) [11]. Among them, knowledge distillation is an emerging method for obtaining efficient small-scale networks, whose main idea is to transfer ''knowledge'' from complex teacher models with strong learning abilities to simple student models. In recent years, knowledge distillation with model compression has gradually been applied to task-specific recognition. At the same time, it also significantly enhances model performance through optimization strategies such as mutual learning and self-learning of neural networks, and through data resources such as unlabeled and cross-modal data. Drawing on the use of knowledge distillation in other specific tasks, this paper improves the traditional knowledge distillation loss by adding a guidance function, so that the student network learns only the classification probabilities that the teacher network recognizes correctly, and applies it to the dynamic gesture recognition task to compress the model, obtain a lightweight network, and improve the recognition rate.
Finally, models based on deep learning rely on supervised learning, and their performance largely depends on labeled training data. In reality, not all categories have a large amount of labeled instance data available, so these models are not necessarily practical. In addition, training cannot be performed on all types of images. Therefore, how a model can recognize instance images of categories that have not been seen during the training phase is also an important issue. Previous self-supervised or unsupervised methods mainly focused on learning features that generalize well. However, even when good features have been learned, applying them to downstream tasks still requires labeled data for fine-tuning.
In 2021, the OpenAI team [12] proposed a contrastive learning method called CLIP for training multimodal visual-language models. Once a large, well-performing model has been trained with text supervision, the text can guide zero-shot reasoning through image-text pairing. The features of images seen during training are used to judge the categories of unseen images, so there is no need to fine-tune on the training set of the downstream task. In some pre-training datasets, the text paired with an image is only a single word, but usually the text is a complete sentence that describes the image in some way. To help close this distribution gap, CLIP uses prompt templates to specify that the text is about the image content, and combines multiple prompt templates to improve zero-shot performance [12]. Based on the research of CLIP, this paper proposes a multi-granularity context prompt template integration method to construct an improved CLIP visual language model, MG-CLIP, and improve the generalization performance of the model. MG-CLIP is used to align text descriptions with the corresponding visual concepts in the image for multi-granularity visual-language pre-training. Multi-granularity refers to alignment at the image level, the object level, and the part level. We reconstruct the existing dataset into image-text pairs, region-text pairs, and object-text pairs. For an image, we have the following training data: 1) an image caption describing the whole image; 2) region markers, such as ''Gesture man'', each associated with a region in the image, whereas previous methods roughly aligned the region description with the entire image; 3) object labels, such as ''gesture'', which previous methods used to train object detectors. We redefine the data so that an image can have multiple bounding boxes, with the text directly associated with the visual concept in each box. A ''visual concept'' can be an object, a region, or the image itself. MG-CLIP learns visual concepts related to a variety of text descriptions that are not limited to the object or image level, resulting in improved model generalization performance.
Most deep learning methods for isolated gesture recognition are designed based on Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs). In recent years, 3D CNNs have replaced the 2D CNNs of traditional convolutional neural networks for spatiotemporal modeling, taking both spatial and temporal information into account. Considering that a single original image does not provide sufficient features, many researchers have begun to use convolutional neural networks with multi-stream input to obtain more information, such as optical flow information and depth information, which helps improve recognition accuracy. With the development of deep learning, more and more novel and efficient network architectures have been proposed, among which a classic and representative method is the Inflated 3D ConvNet (I3D) proposed by the DeepMind team [13]. Drawing on the advantages of two-stream networks and 3D CNNs, I3D extends 2D convolutions into 3D structures and then takes the optical flow and RGB images as separate branches to form a two-stream network. The RGB and optical flow branches are trained separately to learn spatial information and temporal motion information from visual frames and optical flow, respectively, so that joint spatio-temporal features can be learned directly.
Based on the I3D network, we propose a dynamic gesture recognition method, 3SCKI, which is built on a three-stream coordinate attention network and knowledge distillation and utilizes image-text contrastive learning.
The main contributions of this work are summarized as follows: (1) After the three modalities of data are input to the backbone network, the coordinate attention (CA) module is used for feature fusion. This operation makes the network pay more attention to the local gesture region and further reduces background interference unrelated to the gesture.
(2) The traditional knowledge distillation loss is improved by adding a guidance function so that the student network learns only the correct soft targets output by the teacher network, reducing computational cost and improving real-time performance. At the same time, cross-modal distillation is used to embed the characteristics of different modalities into a student network that takes single-modality data as input, improving its prediction performance.
(3) A multi-granularity contextual prompt template integration method is proposed to construct an improved CLIP visual language model, MG-CLIP, and perform zero-shot reasoning through contrastive learning of image-text pairs to improve generalization performance.
The rest of this paper is arranged as follows. Section II reviews the work on gesture recognition and then implements the proposed architecture in Section III. In Section IV, experiments are conducted to compare the proposed method with other methods to verify the effectiveness of the proposed strategy. In the last section, conclusions are drawn and future research work is summarized.

II. RELATED WORK

A. GESTURE RECOGNITION METHODS
Early sign language recognition mainly relied on wearable devices and manual experience to extract features, but wearable devices require people to wear specific equipment, which affects freedom of movement and has certain limitations [14]. Parcheta and Martínez-Hinarejos [15] used gesture recognition based on the Hidden Markov Model (HMM) to achieve relatively high accuracy on a dataset containing 91 sign language words. However, manual feature extraction is time-consuming and labor-intensive and requires domain experts to design good classification features. With the continuous development of convolutional neural networks, researchers have focused on using deep learning to achieve video-based dynamic gesture recognition. Video processing requires consideration of temporal connectivity, and the Recurrent Neural Network (RNN) has shown great advantages in this regard. However, RNNs are not satisfactory for feature extraction. For this reason, some researchers first use a Convolutional Neural Network (CNN) to extract image features and generate feature vectors, which are then input into an RNN for computation [16]. Considering the spatiotemporal information of video, 3D convolutional neural networks (3D-CNN) emerged as the times required. For example, the C3D network proposed by Tran et al. [17] uses 3D-CNN instead of the 2D-CNN in traditional convolutional neural networks, taking temporal information into account. Chen et al. [18] proposed MFNet. There are also new 3D CNN structures generated from 2D CNN networks such as ResNet, which have achieved good results in behavior recognition. There are many similarities between gesture recognition and behavior recognition, and these methods can also be applied to gesture recognition [19].
To fuse temporal and spatial information, video feature extraction architectures based on two-stream methods, 3D convolution (conv3D) structures, and Long Short-Term Memory (LSTM) networks have been proposed in succession. Simonyan and Zisserman [20] first proposed the two-stream method, which processes the spatial stream of RGB images and the temporal motion stream of dense optical flow maps separately and then fuses the results to perform behavior recognition. To address the problem that two-stream networks depend heavily on temporal information, Hongying and Zhang [21] proposed a human behavior recognition algorithm based on an improved two-stream spatiotemporal network to enhance the network's feature expression ability and improve the recognition of temporally dominated behaviors. Donahue et al. [22] combined 2D convolution and LSTM structures, first using 2D convolution to extract features from images to obtain a sequence of visual features, and then feeding the feature sequence into an LSTM to further mine context information. Ji et al. [23] used 3D convolution in the field of video analysis, but only in shallow layers, and relied on manual methods to extract information such as grayscale and gradients from images, making real-time video processing impossible. I3D introduces 3D convolution into a two-stream framework, using dense optical flow to match all points in the image one by one to form an optical flow field. This method has good video target detection accuracy, but it requires too much computation [13]. Wang et al. [24] integrated the channel and spatial attention mechanism CBAM into the I3D network, which effectively improved the accuracy of dynamic gesture recognition. CBAM introduces large-scale convolution kernels to extract spatial features, but it ignores the long-range dependence problem.
This paper integrates coordinate attention into the I3D network. CA considers not only the relationship between space and channels but also long-range dependence, which improves accuracy while reducing the number of parameters and computations. Instead of converting the input into a single feature vector through two-dimensional global pooling, as channel attention does, CA decomposes channel attention into two one-dimensional feature encoding processes that aggregate features along different directions. This has the advantage of capturing long-range dependencies along one spatial direction while retaining accurate location information along the other. The generated feature maps are then encoded separately to form a pair of direction-aware and position-sensitive feature maps, which complement the input feature maps to enhance the representation of the objects of interest, thereby addressing the problems of complex dynamic gesture backgrounds and multiple interference factors.

B. CURRENT SITUATION OF CLIP
Work on learning supervisory signals from natural language uses varied and even contradictory terminology. Zhang et al. [25], Joulin et al. [26], and Desai and Johnson [27] all learn visual representations from image-text contrast, but they describe it as unsupervised, weakly supervised, and supervised, respectively. Contrastive Language-Image Pre-training (CLIP), proposed by OpenAI in 2021, is a pre-training method based on text-image contrastive learning whose core is also learning supervisory signals from natural language. It has transformed natural language supervision from a rare method into an excellent zero-shot image classification method [12]. What this series of approaches has in common is the use of natural language as a training signal to improve visual representations, rather than the details of any particular approach. Although earlier works encountered the problem of natural language complexity when using topic models and n-gram [28] representations, the development of deep contextual representation learning makes it possible to exploit such resources more effectively.
The CLIP model aims to learn the visual concepts in the image from the text description of the relevant image, and then map the image into the category of the text description.
There are also many studies based on the CLIP model. For example, in ActionCLIP, the authors regard the video action recognition task as video-text retrieval. For text labels, they propose a prompt module that generates text sentences from the labels. The CLIP text encoder is then used to encode the generated text, and the CLIP image encoder is used to encode multiple frames of the video. ActionCLIP proposes several methods to convert the information of multiple frames into a single frame representation and calculates the similarity between the text and that frame representation. In this way, the semantic information of the labels can be fully considered to perform zero-shot knowledge transfer, while also exploiting CLIP's pre-training on image and text knowledge [29]. Work prior to CLIP4Caption fine-tuned directly on the captioning task, neglecting to learn visual features with strong textual semantic information. Therefore, CLIP4Caption initializes the model with the pre-trained parameters of CLIP and then performs pre-training on a video-text retrieval task. After pre-training, visual representations aligned with text semantics can be learned, and video captioning is fine-tuned on top of this visual representation [30].
Drawing on the powerful application of CLIP, this paper develops multi-granularity prompt templates for gesture recognition tasks, integrates different context prompt templates, proposes a multi-granularity context prompt template integration method, constructs an improved visual language model MG-CLIP, and applies it to dynamic gesture recognition tasks to improve the generalization performance of the model.

C. CURRENT SITUATION OF KNOWLEDGE DISTILLATION
The initial purpose of knowledge distillation is to compress deep learning network models and obtain simple and efficient network models, which have wide applications in resource-constrained terminal devices.
It uses the supervision information of a larger model with better performance to train a lightweight small model so that the small model achieves better performance and accuracy. The large model is called the teacher model, the small model is called the student model, the supervision information output by the teacher model is called knowledge, and the process by which the student network learns the supervision information from the teacher network is called distillation.
The core idea of monocular depth estimation based on knowledge distillation is to reconstruct the depth of a single RGB image using the knowledge of a powerful teacher model [31], transferring the knowledge of an experienced teacher model into scene depth estimation from a single RGB image. By means of knowledge distillation, the knowledge learned by a complex, high-resolution teacher model can be transferred to an efficient, low-resolution student model. For example, Fu et al. [32] transferred the spatial and temporal knowledge learned by the teacher model to a low-resolution lightweight spatiotemporal network to perform the video attention prediction task.
In subsequent studies, academia and industry have expanded the application scope of knowledge distillation and proposed using it to enhance model performance. Both model compression and model enhancement transfer the knowledge of the teacher model into the student model. The difference is that in model compression the teacher network guides the training of the student network on the same labeled dataset to obtain a simple and efficient network model, whereas model enhancement emphasizes the use of other resources (such as unlabeled or cross-modal data) or knowledge distillation optimization strategies (such as mutual learning and self-learning) to improve the performance of the student model. In cross-modal distillation, the synchronously aligned modal information of the teacher can be used to make up for information streams that the student network does not have, and knowledge distillation training then further enhances the performance of the student network; this is an important application direction of knowledge distillation for model enhancement.
In many practical applications, data typically exist in multiple modalities, and some data in different modalities describe the same thing or event. Cross-modal distillation can therefore be implemented with synchronized modal information. For example, the cross-modal emotion recognition method proposed by Albanie et al. [33] uses such synchronized and aligned modal information to train on unlabeled video: the facial images in the video are fed into a pre-trained face teacher model to generate soft targets that guide the training of the student speech model. Dou et al. [34] use shared cross-modal information to improve image segmentation performance in both computed tomography and magnetic resonance imaging. Vielzeuf et al. [35] transfer the network features of different modalities into a single student network through knowledge distillation. Wang et al. [36] independently train teacher networks on each modality to provide multiple sources of complementary modal information for the student network.
Drawing on the use of knowledge distillation in other specific tasks and aiming at the characteristics of the gesture recognition task, this study optimizes and improves the traditional knowledge distillation method by adding a guidance function, so that the student network learns only the correctly recognized classification probabilities from the teacher network.

III. PROPOSED METHOD
We propose a dynamic gesture recognition method, 3SCKI, based on three-stream coordinate attention networks and knowledge distillation, while using image-text pairing. The implementation process of this method and the relationships between its related technical modules are described in Fig. 1.
First, optical flow video is generated from the RGB video to further extract motion information. By analyzing the data distribution characteristics, all videos with different numbers of frames are preprocessed to obtain uniform 32-frame videos. Then, RGB, Depth, and Optical Flow data are input into the I3D network for feature extraction, and the CA module is used to fuse the three extracted features, which helps the network pay more attention to the gesture area in the input image, avoid interference from the performer's clothing, skin color, and other factors, and extract the image features of the video frames. Secondly, in order to run the model on resource-limited devices, the I3D network is used as the teacher model and the ResNet network as the student model for knowledge distillation training, compressing the model and shortening the network training time. Specifically, the more accurate predictions generated by the trained large I3D model are transferred to the ResNet network through knowledge distillation, better guiding the training of the student model and improving the classification accuracy of the ResNet student network. In addition, the same I3D model structure is used for both teacher and student to conduct cross-modal distillation training experiments on different modalities to achieve model enhancement. Finally, combined with the text labels, the transformer text encoder of CLIP is used to extract text features, which are contrasted with the image features of the teacher model and the student model; the similarity between the image features and all text features is calculated, and the text label corresponding to the maximum similarity is taken as the label category of the given image, after which the gesture category information is output.

A. CA MODULE
Coordinate Attention (CA) Block encodes channel relationships and long-term dependencies through accurate location information. The specific operations are divided into two steps: coordinate information embedding and coordinate attention generation.
Coordinate information embedding: global pooling is commonly used in channel attention to globally encode spatial information into channel descriptors, which makes it difficult to preserve location information. To enable the attention module to capture spatial long-range dependencies with precise location information, the global pooling is decomposed into a pair of one-dimensional feature encoding operations. Coordinate attention generation: this operation makes better use of the global receptive field and the precise position representation produced by the coordinate information embedding module and finally generates an attention map to obtain the attention weights. It effectively captures the relationships between channels and makes full use of the captured location information to accurately locate regions of interest.
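A minimal 2-D PyTorch sketch of these two steps is given below; the class name, reduction ratio r, and use of mean pooling are illustrative assumptions rather than implementation details reported here, and for the 3-D feature maps used in this paper the same idea can be applied frame by frame or extended with an additional temporal direction.

import torch
import torch.nn as nn


class CoordinateAttention(nn.Module):
    """Minimal 2-D coordinate attention sketch (assumed hyperparameters).

    Coordinate information embedding: global pooling is split into two
    1-D poolings (along H and along W), so long-range dependencies are
    captured in one direction while exact positions are kept in the other.
    Coordinate attention generation: the two encoded maps are turned into
    direction-aware attention weights and multiplied back onto the input.
    """

    def __init__(self, channels: int, r: int = 32):
        super().__init__()
        mid = max(8, channels // r)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        # Coordinate information embedding: 1-D pooling along each spatial direction
        x_h = x.mean(dim=3, keepdim=True)                        # (b, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # (b, c, w, 1)
        y = torch.cat([x_h, x_w], dim=2)                         # (b, c, h+w, 1)
        y = self.act(self.bn1(self.conv1(y)))
        # Coordinate attention generation: split back and build attention maps
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                    # (b, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (b, c, 1, w)
        return x * a_h * a_w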

B. THREE-STREAM CA-I3D NETWORK
We propose a three-stream CA-I3D network, based on the classic two-stream idea from behavior recognition, that integrates the coordinate attention (CA) and I3D models. The minimal structure is shown in Fig. 2; the CA module is added after the concatenation layer of the 3D Inception v1 module of I3D. Inception is a network module that assembles multiple convolution or pooling operations together. Building on the stacked network structure, it introduces the idea of numerous parallel branches, performs multiple convolution or pooling operations on the input in parallel, and concatenates all output results into a very deep feature map. It designs a sparse network structure that can still produce dense data, which increases the neural network's performance while keeping computing resources efficient. An Inception module can be regarded as a convolutional layer that outputs feature maps containing complex features at different scales. The 3D Inception v1 structure is extended from 2D Inception v1. Specifically, the convolution kernel parameters of the 2D model are repeated along the time dimension to form the parameters of the 3D convolution kernel and then divided by N (the temporal kernel size) so that the network output matches that of the 2D convolution. The other nonlinear layers are the same as in the original 2D convolution model, while the convolution kernels and pooling operations gain a time dimension.
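A short sketch of this kernel inflation step is shown below; the function name and the example shapes are illustrative assumptions.

import torch

def inflate_conv2d_to_3d(weight_2d: torch.Tensor, time_dim: int) -> torch.Tensor:
    """Inflate a 2-D convolution kernel to 3-D, I3D style.

    The 2-D weights (out_c, in_c, kH, kW) are repeated `time_dim` times along
    a new temporal axis and divided by `time_dim`, so the inflated 3-D
    convolution gives the same response as the 2-D one on a video made of
    identical frames.
    """
    weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
    return weight_3d / time_dim


# Example: inflate a 3x3 kernel over 7 time steps.
w2d = torch.randn(64, 3, 3, 3)        # (out_c, in_c, kH, kW)
w3d = inflate_conv2d_to_3d(w2d, 7)    # (64, 3, 7, 3, 3)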
CA not only considers channel information but also direction-related location information. It embeds location information into channel attention and decomposes channel attention into two one-dimensional feature encoding processes that aggregate features along different directions. This allows long-range dependencies to be captured along one spatial direction while accurate location information is retained along the other. The generated feature maps are then encoded separately to form a pair of direction-aware and position-sensitive feature maps, which complement the input feature maps to enhance the representation of the objects of interest, thereby addressing the problems of complex dynamic gesture backgrounds and multiple interference factors. Adding the CA attention mechanism has little impact on the network structure, but it enables the network to learn more important channel features and spatial locations in the image.
In addition to adding the coordinate attention CA module, we have also improved the structure of I3D from the following aspects: (1) Remove the first two maximum pooling layers to prevent the loss of low-level features of the image caused by pooling operations; (2) Remove the final average pooling operation and only preserve the convolution layer. This is to preserve the global information of the image while reducing a large number of parameters, increasing the robustness of the network.
The specific components of the CA-I3D network are shown in Fig. 3. The six numbers of the 3D CA-Inception v1 module are the numbers of feature maps output by each convolutional layer, and their positions correspond one-to-one with the convolutional layers in Fig. 2. All convolution layers use the ReLU activation function. The network weights are randomly initialized using a standard normal distribution (mean 0, variance 1).
The three inputs of the proposed three-stream CA-I3D network are RGB videos, Depth videos, and Optical Flow videos. The advantage of using a three-stream network is that it can comprehensively consider the rich spatial feature information in the original RGB images, the spatial distance information in the depth images, and the motion feature information in the optical flow images, so as to extract more features for classification and improve recognition accuracy.

C. KNOWLEDGE DISTILLATION OPTIMIZATION
To reduce computational complexity while improving real-time performance, we use the complex I3D network as the teacher model and the ResNet18 network as the student model for knowledge distillation training, compressing the model to obtain a simple and efficient network. As shown in Fig. 1, the input and feature fusion methods of the ResNet18 network are the same as those of the teacher I3D network.
The distillation training process is divided into two steps: (1) use the training data to train a teacher model and save the parameters of the model; (2) approximate the student network to the teacher network by minimizing the difference between the predicted distributions of the teacher network and the student network. A neural network usually converts the computed logits of each category into classification probabilities using a Softmax output layer, as shown in Equation (1):

$$ p_i = \frac{\exp(z_i / T)}{\sum_{j} \exp(z_j / T)} \qquad (1) $$

where $z_i$ is the $i$th component of the logits and $T$ is the temperature parameter; the higher the temperature, the softer the inter-class classification probabilities. The knowledge distillation loss consists of two parts: the cross-entropy between the classification probabilities of the student network and the teacher network, where both use the same temperature $T$, and the cross-entropy between the classification prediction of the student network and the ground truth label, where the temperature is 1, as shown in Equation (2):

$$ L_{KD} = \frac{1}{N} \sum_{i=1}^{N} \left[ L_{CE}\!\left( \sigma\!\left( \frac{z_i^S}{T} \right), \sigma\!\left( \frac{z_i^T}{T} \right) \right) + L_{CE}\!\left( \sigma\!\left( z_i^S \right), y_i \right) \right] \qquad (2) $$

where $N$ is the size of a mini-batch, $L_{CE}$ denotes cross-entropy, $\sigma(\cdot)$ denotes the Softmax function, $T$ is the distillation temperature, $y_i$ is the ground truth label of sample $i$, and $z^S \in \mathbb{R}^C$ and $z^T \in \mathbb{R}^C$ are the logits output by the student network and the teacher network for a $C$-class classification task, respectively.
Although the teacher network is more accurate than the student network at the beginning of training, the teacher still makes some prediction errors. When the teacher network predicts incorrectly, this erroneous knowledge is also transferred to the student network, which degrades the performance of the student network. Therefore, the traditional knowledge distillation loss is improved to ignore the erroneous prediction distribution of the teacher network and pass only the correct prediction distribution to the student network. The final objective function used to train the student network is shown in Equation (3):

$$ L = \frac{1}{N} \sum_{i=1}^{N} \left[ I\!\left( \hat{y}_i^T = y_i \right) L_{CE}\!\left( \sigma\!\left( \frac{z_i^S}{T} \right), \sigma\!\left( \frac{z_i^T}{T} \right) \right) + L_{CE}\!\left( y_i^S, y_i \right) \right] \qquad (3) $$

where $I(\cdot)$ is the guidance function, $\hat{y}_i^T$ is the label predicted by the teacher network, and $y_i^S = \sigma(z_i^S)$ is the classification prediction of the student network. When the teacher network correctly predicts the classification of an input sample, the guidance function is 1, and the student network learns both the ground truth label of the sample and the soft target output by the teacher network. When the teacher network classifies incorrectly, the guidance function is 0, and only the cross-entropy between the classification of the student network and the actual label is calculated.
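A minimal PyTorch sketch of this guided distillation loss is shown below; the temperature value and the function name are illustrative assumptions, and the soft-target cross-entropy is written out directly to mirror Equation (3).

import torch
import torch.nn.functional as F

def guided_kd_loss(student_logits, teacher_logits, labels, T: float = 4.0):
    """Guided knowledge-distillation loss, a sketch of Equation (3).

    The soft-target term is kept only for samples that the teacher classifies
    correctly (guidance function I = 1); otherwise only the hard-label
    cross-entropy of the student is used. T = 4 is an assumed temperature,
    not a value reported in this paper.
    """
    # Hard-label term, temperature 1: L_CE(sigma(z_S), y)
    hard_ce = F.cross_entropy(student_logits, labels, reduction="none")

    # Soft-target term with temperature T: L_CE(sigma(z_S/T), sigma(z_T/T))
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    soft_ce = -(p_t * log_p_s).sum(dim=1)

    # Guidance function: 1 where the teacher predicts the ground truth label
    guide = (teacher_logits.argmax(dim=1) == labels).float()

    return (hard_ce + guide * soft_ce).mean()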

D. CROSS MODAL DISTILLATION
Knowledge distillation can not only be used for model compression but can also improve the performance of a model through the characteristics of cross-modal data, with a significant effect on model enhancement. Cross-modal distillation is an important application direction in model enhancement: the synchronized modal information of the teacher can be used to make up for information that the student network originally does not have, and knowledge distillation training then further enhances the performance of the student network. This paper uses information from different modalities (RGB, Depth, and Optical Flow) to provide complementary cues for the gesture recognition task in cross-modal distillation, thereby improving the performance of the student network. As shown in Fig. 4, the backbones of both the teacher and student networks are the CA-I3D networks proposed in this paper. In the first group of cross-modal distillation training, the network with RGB and Flow data as input serves as the teacher to guide the training of the student network with Depth data as input. In the second group, the teacher network takes RGB data as input and the student network takes Depth data as input. In the third group, the teacher network uses Depth data as input to guide the student network with RGB data as input.
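The three training groups can be summarized as the following teacher-to-student modality pairs; this is a configuration sketch with illustrative modality identifiers, and the guided distillation loss from the previous subsection is reused with the teacher weights frozen.

# Three groups of cross-modal distillation, expressed as
# (teacher input modalities, student input modality) pairs.
DISTILLATION_GROUPS = [
    (("rgb", "flow"), "depth"),   # group 1: RGB + Flow teacher -> Depth student
    (("rgb",), "depth"),          # group 2: RGB teacher        -> Depth student
    (("depth",), "rgb"),          # group 3: Depth teacher      -> RGB student
]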

E. MG-CLIP
Contrastive learning of image and text aims to achieve a better correspondence between images and text. The method is to select, from all text features, the one corresponding to a given image feature. A ''positive example'' refers to a matched image-text pair in the dataset. The similarity between two features is calculated by the vector dot product.
The training method is shown in Fig. 1. The MG-CLIP architecture is divided into two parts: an image encoder and a text encoder. The image encoders in this paper are I3D and ResNet18 (for the teacher and student networks, respectively), and the text encoder is CLIP's text encoder.
In the training phase, for a batch of data, the text features and image features are obtained through the text encoder and the image encoder, and the dot product between all text features and all image features is computed, yielding an N×N similarity matrix. From the image perspective, each row acts as a classifier; from the text perspective, each column acts as a classifier. Since the matching relationship between texts and images in a batch is known, the objective is to maximize the inner product of matched image and text features, that is, the elements on the diagonal of the matrix, and to minimize the inner product with unrelated features. The cross-entropy loss is computed along each image-text row and each text-image column, and the sum of the two losses is optimized so that visual information and text information correspond to each other. A standard image model jointly trains an image feature extractor and a linear classifier to predict fixed labels, whereas CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training samples. In this paper, at prediction time, a zero-shot linear classifier is synthesized by embedding the descriptive information about the categories of the gesture recognition dataset into the text encoder.
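The symmetric loss over the N×N similarity matrix can be sketched in PyTorch as follows; the fixed temperature value and the function name are assumptions (CLIP itself uses a learnable logit scale).

import torch
import torch.nn.functional as F

def image_text_contrastive_loss(image_feats, text_feats, temperature: float = 0.07):
    """Symmetric image-text contrastive loss sketch (CLIP style).

    Features are L2-normalised, an N x N similarity matrix is built by dot
    product, and cross-entropy is taken along both the image axis and the
    text axis so that matched (diagonal) pairs get the highest similarity.
    """
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.t() / temperature      # (N, N)
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)                # image -> text
    loss_t = F.cross_entropy(logits.t(), targets)            # text -> image
    return (loss_i + loss_t) / 2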
For the gesture recognition dataset, we propose a multi-granularity context prompt template integration method to construct an improved CLIP visual language model, MG-CLIP. Specifically, MG-CLIP constructs specific text sentence templates for the gesture category labels in the IsoGD dataset [37]. The sentence template contains multi-granularity context information for images in the IsoGD dataset, for example: ''A photo of the {OK} gesture, a type of hand gesture, one hand.'' The template specifies that the image category is a type of gesture and that the gesture is performed with one hand. These text templates are then integrated. During the testing phase, the trained CLIP model is used directly on the dynamic gesture IsoGD dataset without fine-tuning. As shown in Fig. 5, the constructed text templates are passed through the text encoder to obtain text features. During prediction, any given image is passed through the image encoder to obtain image features. Then, the dot product between the image features and all text features is computed to obtain the similarities, and the label corresponding to the maximum similarity is the classification result of the gesture image. The prediction process is a zero-shot reasoning process, which enables the model to identify image categories that have not been seen during the training phase, thereby improving the generalization performance of the MG-CLIP model.
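The template integration and zero-shot prediction steps can be sketched as follows; only the first template is the one quoted above, while the additional templates, the text_encoder and tokenizer callables, and the function names are hypothetical placeholders standing in for CLIP's transformer text encoder and the prompts actually used.

import torch
import torch.nn.functional as F

# Multi-granularity context prompt templates (only the first is quoted above;
# the others are assumed variants for illustration).
TEMPLATES = [
    "A photo of the {} gesture, a type of hand gesture, one hand.",
    "A video frame showing a person performing the {} gesture.",
    "A close-up of a hand performing the {} gesture.",
]

@torch.no_grad()
def build_zero_shot_classifier(class_names, text_encoder, tokenizer):
    """Integrate the templates per class: encode every prompt, average the
    normalised text embeddings, and stack them into a classifier matrix."""
    weights = []
    for name in class_names:
        prompts = [t.format(name) for t in TEMPLATES]
        emb = text_encoder(tokenizer(prompts))       # (n_templates, d), assumed API
        emb = F.normalize(emb, dim=-1).mean(dim=0)
        weights.append(F.normalize(emb, dim=0))
    return torch.stack(weights, dim=1)                # (d, n_classes)

@torch.no_grad()
def zero_shot_predict(image_feats, classifier):
    """Dot product of image features with every class embedding; the label
    with maximum similarity is the predicted gesture category."""
    image_feats = F.normalize(image_feats, dim=-1)
    return (image_feats @ classifier).argmax(dim=-1)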

IV. EXPERIMENT AND RESULT DISCUSSION
In this section, the proposed method will be evaluated systematically on the public dataset ChaLearn LAP large-scale isolated gesture dataset (IsoGD) [37]. First, the dataset will be described briefly. And then, the data preprocessing processes and experiment settings will be described in detail. In addition, ablation studies are performed on each module in our proposed method. Finally, the evaluation result will be reported.
For the dynamic gesture recognition task, this experiment uses the IsoGD dataset [37]. It originates from the ChaLearn Gesture Dataset (CGD) established by Guyon et al. [38], which was expanded by Wan et al. [37] using Kinect devices. This is a large-scale RGB-D dynamic gesture recognition dataset with high complexity, complex backgrounds, diverse gesture types, and close-to-life scenes. It is one of the most commonly used datasets in the field of dynamic gesture recognition and is currently the largest gesture recognition dataset. The whole dataset contains dynamic gesture video data in RGB and Depth modalities, each containing 47,933 labeled video sequences covering a total of 249 gesture actions. The videos contain only single-person gestures, with no interaction between gestures and people or objects. Because the durations of different gestures are inconsistent, the video sequences range from 9 to 405 frames, each frame with an image size of 240 × 320. Each video contains only one gesture, so classifying a single gesture does not need to consider continuous gestures. The performer starts with hands hanging naturally, performs the corresponding gesture, and lets the hands hang naturally again after completing it. All samples are collected in natural scenes, and to improve the robustness of algorithms, objective conditions such as background, lighting, environment, and clothing are varied.
The IsoGD dataset consists of three parts: a training set, a validation set, and a testing set, each containing 249 gesture classes. To avoid interference with the experimental results caused by performers' movement habits and personal characteristics, different performers are used in the training, validation, and testing sets; that is, people who appear in the training set do not appear in the validation or testing sets. The training set includes 35,878 gesture video sequences completed by 17 performers, the validation set contains 5,784 video sequences completed by two performers, and the testing set contains 6,271 videos completed by two performers. The detailed information is shown in Table 1.

A. DATA PREPROCESSING
Optical flow is an important method for video motion recognition and analysis. It is a model that represents the motion of objects, surfaces, and edges in the visual field. It is generated by the relative motion between the observer and the scene, representing the instantaneous velocity of each pixel of a three-dimensional object moving on the image plane. Generally speaking, optical flow also reflects the motion changes of objects between two adjacent frames. Liu et al. [39] pointed out that from optical flow information we can obtain not only the motion direction and speed of an object but also its distance and angle relative to the observer. Therefore, optical flow can express the motion process of objects well. We use the RGB video to extract optical flow features, which on the one hand capture the motion trajectory information and on the other hand remove the background, the performer's skin color, and other information unrelated to the motion. The optical flow features are calculated using the energy equation proposed by Brox et al. [40], based on the assumptions of brightness constancy, gradient constancy, and spatiotemporal smoothness constraints [39], [40].
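As an illustration of the flow extraction step, the sketch below computes dense optical flow between consecutive frames with OpenCV's Farneback method, used here only as an easily available stand-in for the Brox et al. energy-based formulation adopted in this paper; the parameter values and function name are assumptions.

import cv2
import numpy as np

def extract_flow(rgb_frames):
    """Dense optical flow for consecutive frame pairs (stand-in sketch).

    This paper uses the Brox et al. energy-based method; Farneback flow is
    used here only because it is readily available in OpenCV. Returns one
    (H, W, 2) flow field per consecutive pair of frames.
    """
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in rgb_frames]
    flows = []
    for prev, curr in zip(grays[:-1], grays[1:]):
        # Positional arguments: pyr_scale=0.5, levels=3, winsize=15,
        # iterations=3, poly_n=5, poly_sigma=1.2, flags=0
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
    return np.stack(flows)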
Due to the limitations of the fully connected layers of CNNs, the input data must have the same size, so we need to normalize the data, that is, unify the number of frames and make the width and height of each frame identical. In the IsoGD dataset, the width and height of every video are consistent, so the main thing that needs normalization is the time domain, that is, the number of frames. For ease of processing, we choose 32 frames as the reference number of frames and unify all videos to 32 frames. Videos with more than 32 frames are sub-sampled, while videos with fewer than 32 frames are interpolated by copying frames selected in a certain proportion. Through this preprocessing, more than 98% of the videos are sampled at least once every 3 frames, and most of the motion trajectory information is preserved.
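One simple reading of this temporal normalization is sketched below; the evenly spaced index selection is an assumption about how frames are sampled or duplicated, not the exact scheme used here.

import numpy as np

def normalize_to_32_frames(frames, target: int = 32):
    """Unify a video to exactly `target` frames (sketch).

    Longer videos are uniformly sub-sampled; shorter ones are padded by
    repeating frames at evenly spaced positions, one way of "copying frames
    selected in a certain proportion".
    """
    n = len(frames)
    idx = np.linspace(0, n - 1, num=target)   # evenly spaced positions
    idx = np.round(idx).astype(int)           # repeats indices when n < target
    return [frames[i] for i in idx]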

B. EXPERIMENTAL ENVIRONMENT AND PARAMETER SETTING
The experimental hardware is configured with an Intel Xeon E5-2660 v4 CPU, 64 GB of memory, and an NVIDIA GeForce RTX 2080Ti GPU. The software environment is a 64-bit Ubuntu 16.04 operating system with CUDA 9.0 and cuDNN 5.1.10; the deep learning framework is PyTorch 1.1.0 with Python 3.6.
The number of frames of each video in the IsoGD dataset is inconsistent, which makes network processing difficult. Therefore, the video sequence frames are uniformly processed and sampled according to the Rank Pooling [41] method. In the experiment, 32 consecutive frames of RGB images with a size of 224 × 224 were input to the network for training, and 16 samples were randomly selected from each batch of videos for one iteration. The parameters are tuned on the IsoGD dataset used in the experiment. The cross-entropy loss function was used as the loss function, and stochastic gradient descent (SGD) was used as the optimizer for the parameters of the convolutional neural network. The batch size was set to 32, the initial learning rate was 0.001, and the learning rate decay was exponential with a decay coefficient of 0.96 and 100 decay steps. There are 50,000 iterations in total; the training weights are saved every 100 iterations for the first 1,000 iterations and every 1,000 iterations thereafter. A model on the depth maps and a model on the optical flow maps were trained and their weights saved in the same way. Then, the three models were loaded into the three-stream network, and all parameters were fine-tuned at a lower learning rate to complete the training of the network. For the dynamic gesture recognition task, the adopted dataset is balanced, so the accuracy rate is used as the evaluation metric.
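The optimizer and learning-rate schedule described above can be expressed as the following configuration sketch; the momentum value and the placeholder model and data are assumptions added only to make the snippet self-contained and are not part of the actual network.

import torch

# SGD with initial learning rate 0.001, exponential decay by 0.96 every
# 100 steps, cross-entropy loss, 50,000 iterations, as reported above.
model = torch.nn.Linear(1024, 249)   # placeholder for the fused-feature classifier
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.96)
criterion = torch.nn.CrossEntropyLoss()

for step in range(50000):
    feats = torch.randn(16, 1024)                 # stand-in for fused video features
    labels = torch.randint(0, 249, (16,))         # stand-in gesture labels
    loss = criterion(model(feats), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                              # lr *= 0.96 every 100 steps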

C. ABLATION STUDIES
We analyze the contribution of each component of our method to the final result, including the effect of the three-stream input, the effect of the CA attention scheme, the model compression performance of knowledge distillation, the cross-modal enhancement performance, and the effect of MG-CLIP contrastive learning. All experiments were conducted on the validation set of the IsoGD dataset.
The performance of each module is shown in Fig. 6. We input RGB, Depth, and Flow data into the original I3D model as a baseline, labeled ''baseline''. Each of our modules improves recognition accuracy. Adding CA improves the accuracy by about 1% compared to the baseline, because the CA-based feature fusion effectively exploits the complementarity between different modalities, pays more attention to the gesture area, and reduces background interference. Model compression based on knowledge distillation reduces computational complexity without reducing recognition performance. At the same time, cross-modal distillation and contrastive learning are also beneficial compared to the baseline, and our method ultimately achieves a total improvement of about 3.5%.

D. EVALUATION ON ISOGD
To verify the effectiveness of the proposed method, its performance is compared with that of traditional handcrafted feature methods and other deep learning methods on the validation set of the IsoGD dataset. As can be seen from Fig. 7, CNNs have great advantages in feature extraction from images and videos. Compared with traditional handcrafted feature methods, our method improves the recognition rate by about 48%. Table 2 shows the comparison between our method and other methods on single RGB, Depth, and Flow modality data, and Table 3 shows the comparison for multimodal fusion.
Given that 3D CNNs can directly learn spatiotemporal features and achieve excellent performance in various video analysis tasks, researchers have utilized C3D [42] and Res3D [1] networks for dynamic gesture recognition. Res3D+GateConvLSTM+Pyramid [43] explores the redundancy of spatial convolution and designs a new LSTM variant in which the convolution structure is embedded only in the input-to-state transition of the LSTM. SeST [44] dynamically combines 3D CNN and ConvLSTM in parallel. These modeling methods illustrate that deeper network structures have advantages in fitting complex data. However, a future research direction of behavior recognition is to achieve high accuracy while keeping the network's consumption low. CA is a lightweight module that can be integrated well with other networks. As can be seen in Fig. 6, after adding CA, the network performance improves to a certain extent. In addition, the three-stream network, knowledge distillation, and contrastive learning also bring certain improvements in accuracy. Table 2 shows that our method achieves the best results among single-stream networks, with a recognition rate 0.95% higher than the best existing method in the RGB domain, 1.82% higher in the Depth domain, and 3.79% higher in the Flow domain. This fully demonstrates that the proposed architecture can learn more distinctive features for gesture recognition.
Referring to the final multimodal results in Table 3, we achieve a balance between performance and computational burden, whereas previous methods often relied heavily on combinations of a large number of models [27] or intermediate fusion schemes [29] to achieve higher performance. Taking the method of Narayana et al. [45] (which achieves the best results on IsoGD multimodal fusion) as an example, it requires not only the additional Flow (optical flow) data modality but also a combination of 12 independent models with an AdaBoost-like ensemble strategy, and most of these models require pre-trained hand detectors. This leads to a higher computational burden and makes the network harder to apply to different situations. In contrast, our method achieves the second-best result of 65.87%, using knowledge distillation for model compression and end-to-end training. The results of our method on single-modality and multimodal data are shown in Fig. 8. They show that, compared with using single-modality data as network input, our method performs better on multimodal data, and the three-modality combination performs best. In addition, using optical flow data for background removal further improves the recognition rate: compared with using only the RGB modality, adding optical flow data increases the accuracy by 1.51%, and compared with using the RGB-D modalities, adding optical flow data increases the accuracy by 2.85%.
Experimental results show that our proposed method can recognize dynamic gestures well, with a higher recognition rate than other dynamic gesture recognition methods, and achieves a good balance between performance and computational complexity.

V. CONCLUSION
In this paper, we present a dynamic gesture recognition method, 3SCKI, based on three-stream coordinate attention networks and knowledge distillation. First, everything except the hand area in a dynamic gesture is regarded as background interference, and the coordinate attention (CA) module is used to focus on the hand region, eliminating interference factors unrelated to the gesture. Then, to reduce the amount of computation, a guidance function is added to the traditional knowledge distillation loss to improve computation speed. In addition, cross-modal distillation training is carried out to enhance the performance of single-modality network models. Finally, multi-granularity contextual prompt templates are designed for the task-specific dataset to enhance zero-shot reasoning performance. Experimental results show that the proposed method achieves good results for dynamic gesture classification and recognition. For future work, since different gestures may differ only subtly, we will attempt to learn more complex features to distinguish gestures with subtle differences; for example, a feature pyramid can be used to extract features at different scales and further improve the recognition accuracy of such gestures.