STFE-Net: A Spatial-Temporal Feature Extraction Network for Continuous Sign Language Translation

The main challenge of continuous sign language translation (CSLT) lies in the extraction of both discriminative spatial features and temporal features. In this paper, a spatial-temporal feature extraction network (STFE-Net) is proposed for CSLT, which optimally fuses the spatial and temporal features extracted by the spatial feature extraction network (SFE-Net) and the temporal feature extraction network (TFE-Net), respectively. SFE-Net performs pose estimation for the presenters in sign-language videos. Based on COCO-WholeBody, the 133 key points are reduced to 53 key points according to the characteristics of sign language. High-resolution pose estimation is performed on the hands, alongside the whole-body pose estimation, to obtain finer-grained hand features. The spatial features extracted by SFE-Net and the sign language words are then fed to TFE-Net, which is based on a Transformer with relative position encoding. In this paper, a dataset for Chinese continuous sign language was created and used for evaluation. STFE-Net achieves Bilingual Evaluation Understudy (BLEU-1, BLEU-2, BLEU-3, BLEU-4) scores of 77.59, 75.62, 74.25 and 72.14, respectively. Furthermore, our proposed STFE-Net was also evaluated on two public datasets, RWTH-Phoenix-Weather 2014T and CSL. The BLEU-1, BLEU-2, BLEU-3 and BLEU-4 scores achieved by our method on the former dataset are 48.22, 33.59, 26.41 and 22.45, respectively, and the corresponding scores on the latter dataset are 61.54, 58.76, 57.93 and 57.52. Experimental results show that our model achieves promising performance. If any reader needs the code or dataset, please email lunfee@whut.edu.cn.


I. INTRODUCTION
As the main communication channel between people having total or partial hearing loss and hearing people, sign languages play a very important role in daily life. The rapid development of computer vision and deep learning has opened up new opportunities for sign language recognition. Sign language uses different parts of the body, such as the fingers, arms, hand movement trajectories, head and facial expressions, to convey information [1]. In sign languages, each gesture has a specific meaning, and strong contextual information and grammatical rules are factors that should be considered in continuous sign language translation (CSLT). (The associate editor coordinating the review of this manuscript and approving it for publication was Mehul S. Raval.)
In recent years, research on video-based sign language recognition, translation, and generation has received increasing attention. Sign language recognition (SLR) refers to the use of algorithms and techniques to recognize gesture sequences and express their meaning in the form of text or speech [2]. SLR is a typical interdisciplinary problem involving several fields, such as computer vision, natural language processing, image recognition, human-computer interaction, and pattern recognition [3]. Challenges and difficulties in sign language recognition still exist because of the large vocabularies, rich and diverse expressions, and complex semantic and grammatical structures.
With the availability of large-scale CSLT datasets, research on end-to-end sign language translation (SLT) has been emerging. Forster et al. [4] constructed the RWTH-PHOENIX-Weather German sign language dataset. Camgoz et al. [5] then expanded it to form the RWTH-PHOENIX-Weather 2014T dataset and introduced the first end-to-end CSLT model, which uses convolutional neural networks (CNNs) to extract the spatial features of sign language actions and an attention-based architecture to learn the mapping between the sign language in a video and the reference translated text. Huang et al. [6] constructed CSL, a large-scale Chinese continuous sign language dataset, and proposed the Hierarchical Attention Network with Latent Space (LS-HAN).
CSLT is a generalized sequence-to-sequence problem, and one of its difficulties is the recognition of visual information from videos. CSLT not only considers the information in the current image frame, but also relates it to the complex dynamic relationships between consecutive frames. The information of the current frame is represented as spatial features, and the key points of the signer can be extracted by pose estimation methods. For CSLT, the key points on the body, face and hands need to be located. A popular method for pose estimation is OpenPose [7], which combines multiple deep neural networks (DNNs) trained on different datasets: one DNN for body pose estimation on COCO [8], one DNN for locating facial key points trained with multiple datasets (i.e., FRGC [9] and i-bug [10]), and one DNN for locating hand key points based on the Panoptic dataset [11]. The COCO-WholeBody dataset [12] provides 133 key points for the human body, but some of these key points are redundant or unnecessary for CSLT.
With the emergence of Transformer [13], the performance for CSLT has been further improved. Camgoz et al. [14] proposed an advanced architecture based on Transformer, namely Sign Language Transformer (SLT). SLT was trained using the Connectionist Temporal Classification (CTC) loss function and cross-entropy loss function. In a recent study, Camgoz et al. [10] used additional modal and cross-modal attention to synchronize the flow of different information. Kim et al. [15] proposed a key point normalization method and built a Korean sign language translation framework based on Transformer. Yin et al. [11] proposed an STMC-Transformer model, where the SMC module decomposes the input video into spatial features of multiple visual cues (face, hands, full frame, and pose), and the TMC module computes temporal correlations for different time steps.
In this paper, because of the scarcity of Chinese continuous sign language datasets, we construct a Chinese continuous sign language teaching dataset for real translation scenarios. In addition, the selection of the sign-language key points, pose estimation method, and the Transformer network will be studied in depth for CSLT. A spatial-temporal feature extraction network (STFE-Net) is proposed, which combines the spatial features and temporal features for the CSLT task.
The main contributions of this work are summarized as follows: 1. In this work, a Chinese continuous sign language dataset was constructed for real translation scenarios. The dataset provides useful data to support the study of Chinese continuous sign language translation.
2. Based on the fused spatial-temporal information, an end-to-end Chinese continuous sign language translation network (STFE-Net) is proposed, which is composed of the spatial feature extraction network (SFE-Net) and the temporal feature extraction network (TFE-Net).
3. For SFE-Net, 53 key points related to a sign language are selected from the 133 key points in the COCO-WholeBody dataset. The selected key points can result in achieving better pose estimation performance than using all the 133 key points. In addition, high-resolution pose estimation is performed on the hands so as to obtain fine-grained sign language information.
4. For TFE-Net, Transformer is used to implement temporal feature extraction, in which relative position encoding and position-aware self-attention optimization are adopted.
5. Combining SFE-Net and TFE-Net realizes an end-to-end network, i.e., STFE-Net, for CSLT, which achieves excellent performance on our created dataset and multiple public datasets. Moreover, STFE-Net outperforms many state-of-the-art methods.

II. RELATED WORKS
This section focuses on the work related to the main techniques used in STFE-Net, i.e., the whole-body pose estimation method and the Transformer network.

A. POSE ESTIMATION
The COCO-WholeBody dataset contains 133 key points for whole-body pose estimation. However, for sign language recognition, 133 key points are more than necessary and many of them are redundant. Selecting appropriate key points for sign language recognition can lead to better performance.
OpenPose [16], [17], [18], Single-Network (SN) [19], HPRNet [20], and HRNet [21] are current state-of-the-art methods for pose estimation. OpenPose is a two-step pose estimation network, which requires separate training for the hands and body at different scales. SN, HPRNet and HRNet are one-step pose estimation networks. HPRNet is a hierarchical point regression network for whole-body pose estimation. HRNet is a high-resolution network, which iteratively exchanges information between parallel multi-resolution sub-networks to perform repeated multi-scale fusion. In this paper, we employ HRNet to construct a parallel structure for performing global body pose estimation and fine-grained hand pose estimation.

VOLUME 11, 2023 46205 Authorized licensed use limited to the terms of the applicable license agreement with IEEE. Restrictions apply.

FIGURE 1. The entire structure of STFE-Net. STFE-Net contains SFE-Net and TFE-Net. SFE-Net is used to extract key points from the video. First, the input image is processed by Feature-Net to extract the global features (F1 and F2). Using these features, Body-Net predicts the body key points and the hand bounding boxes at the same time. Guided by the predicted hand bounding boxes, the hand features (H1 and H2) are then fed to Hand-Net to predict the heatmaps of the hand key points. Next, TFE-Net processes the key points obtained by SFE-Net, using a Transformer to extract the sign-language sequence features. To extract the temporal features, the embeddings of the spatial features and the sign-language words are first computed, and then input to the encoders and decoders of the Transformer, respectively. In the feature embedding, GRUs are used to achieve position awareness of the sequences. The position-aware spatial embedding uses a Bi-GRU, while the position-aware word embedding uses a unidirectional GRU to mask future inputs during prediction.

For hand detection in sign language recognition, we use CornerNet [22], which represents the target to be detected as a bounding box defined by two corner points. Earlier detection models, such as Faster R-CNN [23], use RoI Pooling for coordinate mapping and scale transformation. RoI Align, proposed in Mask R-CNN [24], uses bilinear interpolation to avoid the quantization bias introduced in coordinate mapping and pooling. Thus, in our model, RoI Align is adopted for extracting hand-box features.

B. TRANSFORMER WITH RELATIVE POSITION ENCODING
The Transformer, proposed by Vaswani et al. [13], utilizes the self-attention mechanism; each encoding layer consists of two sub-layers, i.e., the self-attention layer and the fully connected feed-forward layer. In contrast to RNN structure-based encoders, the Transformer has no structure that directly captures absolute or relative position information [13], [25]. Therefore, in the Transformer, in addition to the input embedding, a position vector based on sinusoidal functions of different frequencies is produced to embed the position information. The input embedding and the position embedding are then added to form the input to the self-attention layer. To further improve the performance on long sequences, Shaw et al. [26] proposed relative position embedding, which takes into account the positional relationships of the words in the sequence and allows better modelling of the semantic information contained in the sequence.

1) SELF-ATTENTION LAYER
The encoder in the Transformer mainly relies on the Multi-Head Self-Attention mechanism. For each self-attention head, denote the input and output sequences as $H = \{h_1, \dots, h_n\}$, $h_i \in \mathbb{R}^{d_H}$, and $Z = \{z_1, \dots, z_n\}$, $z_i \in \mathbb{R}^{d_Z}$, respectively, where $n$ is the length of the sequences, $d_H$ is the dimension of the input vectors, and $d_Z$ is the dimension of the output vectors. The computations in the self-attention layer are described in the following.
First, each input $h_i$ is mapped to three different spaces to obtain the query, key and value vectors:

$$q_i = W^Q h_i, \quad k_i = W^K h_i, \quad v_i = W^V h_i \tag{1}$$

Then, for inputs $h_j$ and $h_l$, the attention score $A_{jl}$ between them is computed as follows:

$$A_{jl} = q_j^{\top} k_l \tag{2}$$

After that, $A_{jl}$ is scaled and normalized to $a_{jl}$, as follows:

$$a_{jl} = \frac{\exp\left(A_{jl} / \sqrt{d_Z}\right)}{\sum_{l'=1}^{n} \exp\left(A_{jl'} / \sqrt{d_Z}\right)} \tag{3}$$

Finally, the output $z_j$ is obtained by weighted summation as follows:

$$z_j = \sum_{l=1}^{n} a_{jl}\, v_l \tag{4}$$

where $j, l \in [1, n]$ are the positions of the vectors in the input sequence.
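The per-head computation just described can be sketched in a few lines (a minimal NumPy illustration, not the paper's implementation; the random projection matrices and sizes are ours):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, Wq, Wk, Wv):
    """One attention head: H is (n, d_H); returns Z of shape (n, d_Z)."""
    Q, K, V = H @ Wq, H @ Wk, H @ Wv        # map each h_i to query/key/value spaces
    A = Q @ K.T                              # raw attention scores A_{jl}
    a = softmax(A / np.sqrt(K.shape[-1]))    # scale by sqrt(d) and normalize per row
    return a @ V                             # weighted sum over the value vectors

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))                  # n = 5 frames, d_H = 8
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
Z = self_attention(H, Wq, Wk, Wv)
print(Z.shape)                               # (5, 4)
```

In the multi-head case, this computation is repeated with separate projection matrices per head and the outputs are concatenated.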

2) RELATIVE POSITION ENCODING
Absolute positional encoding does not perceive the sequential information of the sequences. To better learn the positional information of the sign language sequences and the reference translation word sequences, relative position encoding is adopted in our model. The relative position information is incorporated into the Multi-Head Self-Attention layers of the encoder and decoder. In this work, a bidirectional GRU is used for relative position encoding in the encoder, while a unidirectional GRU is used in the decoder.
In [27], recurrent neural networks (RNNs) [28] were used for relative position encoding. Instead of adding position information, RNNs can directly output feature embeddings that carry position information. In [29], the sinusoidal position embeddings are replaced by learned two-dimensional convolutional layers. Recent Transformer variants, such as the Reformer [30] and Transformer-XL [31], also employ relative position schemes.

III. THE PROPOSED MODEL
In this paper, a spatial-temporal feature extraction network (STFE-Net) is proposed for CSLT, which is implemented and trained in an end-to-end manner, jointly with Continuous Sign Language Recognition (CSLR). The detailed architecture of our proposed model is shown in Figure 1. This deep model is mainly composed of two modules, namely the spatial feature extraction network (SFE-Net) and the temporal feature extraction network (TFE-Net). The model first learns the spatial features of the 53 key points, selected for sign language recognition, from the sign language videos. A global pose-estimation network is used to generate the spatial features. In order to obtain fine-grained hand pose information, high-resolution pose estimation is performed on the hands. After that, temporal features are extracted from the consecutive frames. TFE-Net is based on the Transformer, which includes an encoding module and a decoding module. Firstly, the spatial features and the corresponding words are converted into embeddings. The feature embeddings are then combined with GRU-based relative position embeddings. The encoded feature and word embeddings are then input to the encoders and decoders of the Transformer, respectively, for learning. The details of our method are described in the following.

A. SPATIAL FEATURE EXTRACTION NETWORK
1) KEY POINT SELECTION
COCO-WholeBody [12] is the first dataset for whole-body pose estimation. Each whole body is represented by four bounding boxes: the person box, face box, left-hand box, and right-hand box. In addition, 133 key points are located over the whole body: 17 for the body, 6 for the feet, 68 for the face, and 42 for the two hands. The locations of these key points are illustrated in Figure 2. However, such a large number of key points results in a lot of redundancy for sign language recognition, and using all of them reduces the learning efficiency of the model. Therefore, in this paper, only 53 of the key points are selected, according to the amount of motion in sign language, to simplify feature learning and extraction.

The key points are selected as follows. All the body movements expressed in sign language occur in the upper body, so key points 12 to 23 are ignored. Likewise, the face movements in sign language do not vary significantly, so key points 24 to 91 on the face are also ignored. This leaves 53 key points: 11 for the upper body and 21 each for the left and right hands. The selected key points are connected to form a human skeleton, as illustrated in Figure 3.
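The selection rule above amounts to simple index filtering. A minimal sketch (the 1-based index ranges follow the paper's numbering; the helper names are ours):

```python
# 1-based indices over the 133 COCO-WholeBody key points, following the paper:
# points 12-23 (lower body) and 24-91 (face) are dropped.
UPPER_BODY = list(range(1, 12))    # 11 upper-body points
HANDS = list(range(92, 134))       # 2 x 21 hand points
SELECTED = UPPER_BODY + HANDS      # the 53 points used by SFE-Net

def select_keypoints(kps_133):
    """kps_133: sequence of 133 (x, y) coordinates; returns the 53 selected."""
    return [kps_133[i - 1] for i in SELECTED]

print(len(SELECTED))  # 53
```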

2) STRUCTURE FOR GLOBAL POSE ESTIMATION
Sign language movements are characterized by the interaction of the hands and the body to collaboratively express semantic information. However, the body and hand regions have significant scale differences. Considering the human body hierarchy, this work performs feature extraction for the global body, while performing high-resolution feature extraction for the hands. As shown in Figure 1 (SFE-Net), this structure, which contains three parts, Feature-Net, Body-Net and Hand-Net, can efficiently acquire fine-grained hand pose information. Furthermore, it enables multi-scale sign language action feature extraction by fusing the body and hand posture features.

a: FEATURE-NET
This network includes two convolutional layers. Each convolutional layer reduces the input to half of its resolution, and the corresponding features are denoted as F1 and F2, which are used as low-level features for Body-Net and Hand-Net.
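The effect of these two layers on resolution can be sketched as simple shape arithmetic (the paper does not specify kernel sizes or channel counts, so only the halving is modeled here):

```python
def featurenet_shapes(h, w):
    """Each of Feature-Net's two convolutional layers halves the input
    resolution, giving F1 at 1/2 scale and F2 at 1/4 scale."""
    f1 = (h // 2, w // 2)   # resolution of feature map F1
    f2 = (h // 4, w // 4)   # resolution of feature map F2
    return f1, f2

print(featurenet_shapes(256, 256))  # ((128, 128), (64, 64))
```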

b: BODY-NET
Inspired by CornerNet [22], Body-Net predicts both the body box and the hand box based on the shared feature maps F1 and F2, from which fine-grained hand and facial features are extracted. The BottleNeck and BasicBlock modules proposed in ResNet [32] are utilized, so that shallow features can be preserved in the forward path. At the end of the network, the highest-resolution feature maps obtained are used for representation prediction. HRNet-w32 [21] was chosen as the body-pose estimation network in this paper.

c: HAND-NET
Using the hand boxes predicted by Body-Net, the features corresponding to the hand regions in F1 and F2 are cropped. To obtain finer hand features, HRNetV2p-w18 [21] was used; compared to the HRNet-w32 used in Body-Net, the number of channels in its highest-resolution feature map is halved. The cropping is based on RoI Align, proposed in Mask R-CNN [24]. Hand-Net performs high-resolution pose estimation for the hands, which facilitates learning fine-grained sign-language information.
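The cropping step can be illustrated with a simplified stand-in for RoI Align (nearest-neighbor sampling here, whereas RoI Align uses bilinear interpolation; the box format and output size are assumptions for illustration):

```python
import numpy as np

def crop_hand_features(feature_map, box, out_size=8):
    """Crude stand-in for RoI Align: crop the hand box (x0, y0, x1, y1)
    from a (C, H, W) feature map and resample it to out_size x out_size
    with nearest neighbors (real RoI Align uses bilinear interpolation)."""
    x0, y0, x1, y1 = box
    ys = np.linspace(y0, y1 - 1, out_size).round().astype(int)
    xs = np.linspace(x0, x1 - 1, out_size).round().astype(int)
    return feature_map[:, ys][:, :, xs]

F1 = np.arange(2 * 16 * 16, dtype=float).reshape(2, 16, 16)
hand = crop_hand_features(F1, (4, 4, 12, 12))
print(hand.shape)  # (2, 8, 8)
```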

B. TEMPORAL FEATURE EXTRACTION NETWORK
At this point, the spatial features have been extracted from the current frame. To perform continuous sign-language translation, it is necessary to handle the temporal relationships between the spatial features from consecutive frames. In TFE-Net, Transformer is used to extract the sign-language sequence features. To extract the temporal features, the embeddings of the spatial features and the sign-language words are first computed, and then input to the encoders and decoders of the Transformer, respectively. GRU, which is a variant of RNN, is used to perform relative position encoding for the embeddings. Figure 1 shows the structure of the entire Transformer network, i.e., TFE-Net, which is described below in detail.

1) SPATIAL FEATURE AND WORD EMBEDDINGS
Transformer networks are based on self-attention mechanisms and lack sequential or positional information about the sequences. Similar translation results can be obtained even if the order of the utterances is disrupted, and such translations may fail to convey the true semantics. Therefore, sequential Position Embedding (PE) is introduced into the input embeddings. Both the spatial features and the word vectors need to be fused with the position embedding, which is generated using sine and cosine functions of different frequencies in the original Transformer. The input embeddings are added to the corresponding position embeddings, so the dimension of the position vector must be the same as that of the spatial feature/word embedding. The position embedding can be computed as follows:

$$PE(t, 2i) = \sin\left(t / 10000^{2i/d}\right) \tag{5}$$

$$PE(t, 2i + 1) = \cos\left(t / 10000^{2i/d}\right) \tag{6}$$

where $t$ denotes the temporal order (absolute position) of the spatial features/words in a sequence, $d$ denotes the dimension of the corresponding embedding, and $i$ refers to the position within an embedding. The position embeddings at even and odd positions, i.e., $2i$ and $2i+1$, are calculated using (5) and (6), respectively.

A Bi-GRU is used in the relative position encoder to generate position-aware spatial embeddings, while a GRU is used to generate position-aware word embeddings in the decoder. As the decoder stack needs to mask future inputs during prediction, a GRU, instead of a Bi-GRU, is used for decoding. For the spatial feature embedding, the Bi-GRU calculates the state $h_k$ of the current layer as follows:

$$\hat{s}_k, h_k = \text{Bi-GRU}(s_k, h_{k-1}) \tag{7}$$

where $s_k$ and $\hat{s}_k$ are the spatial embedding and the position-aware spatial embedding, respectively, and $h_{k-1}$ is the state of the previous hidden layer. Similarly, the position-aware word embedding is calculated as follows:

$$\hat{w}_k, h_k = \text{GRU}(w_k, h_{k-1}) \tag{8}$$

The position-aware spatial embeddings $\hat{s}_k$ are input to the encoder stack, while the position-aware word embeddings $\hat{w}_k$ are input to the decoder stack of the Transformer.
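The sinusoidal embedding of equations (5) and (6) can be sketched as follows (a minimal NumPy version of the standard formulation; an even dimension d is assumed):

```python
import numpy as np

def sinusoidal_pe(t, d):
    """PE(t, 2i) = sin(t / 10000^(2i/d)); PE(t, 2i+1) = cos(t / 10000^(2i/d)).
    Returns the d-dimensional position embedding for temporal position t."""
    pe = np.zeros(d)
    i = np.arange(0, d, 2)                 # even dimension indices 0, 2, 4, ...
    angle = t / np.power(10000.0, i / d)   # one frequency per sin/cos pair
    pe[0::2] = np.sin(angle)
    pe[1::2] = np.cos(angle)
    return pe

pe = sinusoidal_pe(t=3, d=8)
print(pe.shape)  # (8,)
```

Because each pair of dimensions uses a different frequency, nearby positions get similar embeddings while distant ones diverge, which is what lets the attention layers distinguish frame order.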

2) ENCODER STACK
The Transformer with relative position encoding optimizes the self-attention layer in Multi-Head Attention. The input sequence is treated as a directed, fully connected graph. Denote the edges between input elements $x_i$ and $x_j$ as $r^V_{ij}$ and $r^K_{ij}$, which represent the relative position information between the two vectors; here $Q$, $K$, $V$ are the query, key and value matrices generated from the input set of vectors. Then, by incorporating the relative position terms into the attention computation, the output $z_i$ is obtained as:

$$z_i = \sum_{j=1}^{n} \alpha_{ij}\left(x_j W^V + r^V_{ij}\right) \tag{11}$$

where $\alpha_{ij}$ is calculated as follows:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{t=1}^{n} \exp(e_{it})} \tag{12}$$

and $e_{ij}$ is calculated as follows:

$$e_{ij} = \frac{x_i W^Q \left(x_j W^K + r^K_{ij}\right)^{\top}}{\sqrt{d_Z}} \tag{13}$$

From the above equations, it can be seen that, for Multi-Head Attention, the values of $r^V_{ij}$ and $r^K_{ij}$ are shared among the multiple attention heads. Furthermore, no extra linear transformation is applied to them, so the positional relationship is not diluted. The distance $|j - i|$ between the inputs $x_i$ and $x_j$ is restricted to a fixed range $k$: if the distance is greater than $k$, it is truncated to $k$.
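The distance truncation described above can be sketched as follows (a small NumPy helper in the style of Shaw et al.'s clipping; the name relative_ids is ours):

```python
import numpy as np

def relative_ids(n, k):
    """Clipped relative distances for a length-n sequence: entry (i, j) is
    clip(j - i, -k, k) + k, an index into the 2k + 1 learned
    relative-position embeddings r^K and r^V."""
    idx = np.arange(n)
    return np.clip(idx[None, :] - idx[:, None], -k, k) + k

R = relative_ids(5, 2)
print(R[0])  # [2 3 4 4 4] -- distances beyond k = 2 are truncated to k
```

Each of the 2k + 1 possible indices selects one learned embedding vector, so the number of relative-position parameters is independent of the sequence length.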
To train the encoders, weakly supervised learning is adopted with the Connectionist Temporal Classification (CTC) loss function [33]. The sequence of spatial features extracted from $T$ consecutive frames is denoted as $S = \{s_1, s_2, \dots, s_T\}$, and the annotated sequence of $N$ isolated sign-language words as $G = \{g_1, g_2, \dots, g_N\}$. Then, the conditional probability $p(G \mid S)$ is modeled as follows:

$$p(G \mid S) = \sum_{\delta \in \beta(G)} p(\delta \mid S) \tag{14}$$

where $\delta$ represents a path and $\beta(G)$ represents the set of all allowed paths that match $G$.
The CSLR loss is calculated as follows:

$$L_{CSLR} = -\ln p(G \mid S) \tag{15}$$

3) DECODER STACK
Unlike the Multi-Head Attention in the encoder, the decoding module uses a sequence mask, which is designed to mask the future inputs during prediction. In addition, different from the Multi-Head Self-Attention layer in the encoding module, the decoding module incorporates encoder-decoder attention rather than self-attention.
Denote the translated word of the $m$-th decoding step as $t_m$, and the translation representation learned by the decoding module at the $(m-1)$-th step as $r_{m-1}$. The conditional probability of each translated word, $p(t_m \mid r_{m-1})$, is predicted by the decoder, and the CSLT loss is the negative log-likelihood of the reference translation of $M$ words:

$$L_{CSLT} = -\sum_{m=1}^{M} \ln p(t_m \mid r_{m-1}) \tag{17}$$

The CSLR loss and the CSLT loss are combined to train the encoder and decoder jointly as follows:

$$L = w_1 L_{CSLR} + w_2 L_{CSLT} \tag{18}$$
where w 1 and w 2 are the weights assigned for the CSLR loss and CSLT loss, respectively.
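The joint objective can be sketched as follows (a toy NumPy illustration with made-up probabilities; the CTC term is passed in as a precomputed scalar for brevity):

```python
import numpy as np

def cross_entropy(step_probs, targets):
    """CSLT term: mean negative log-likelihood of the reference words under
    the decoder's per-step output distributions."""
    return -float(np.mean([np.log(p[t]) for p, t in zip(step_probs, targets)]))

def joint_loss(l_cslr, l_cslt, w1, w2):
    """L = w1 * L_CSLR + w2 * L_CSLT; zeroing either weight drops the
    corresponding supervision signal."""
    return w1 * l_cslr + w2 * l_cslt

probs = [np.array([0.7, 0.2, 0.1]), np.array([0.1, 0.8, 0.1])]
l_cslt = cross_entropy(probs, targets=[0, 1])
total = joint_loss(l_cslr=0.5, l_cslt=l_cslt, w1=1.0, w2=1.0)
print(round(total, 4))  # 0.7899
```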

C. SPATIAL-TEMPORAL FEATURE EXTRACTION NETWORK
TFE-Net receives the spatial features from SFE-Net to form STFE-Net. STFE-Net incorporates spatial and temporal features to achieve continuous sign language translation in an end-to-end manner. The entire workflow is shown in Figure 4.

IV. EXPERIMENT

A. OUR DATASET
In this paper, we construct an RGB-camera-based Chinese continuous sign language dataset. It provides data support for the study of Chinese continuous sign language translation in practical scenarios. The dataset contains 60 utterances categorized by scene units and 70 sign language presenters. Each utterance from each presenter was recorded 5 times, so 21,000 sign language videos were recorded in total. The number of Chinese characters per sign language utterance in each video ranges from 2 to 13, and each video is labeled with the corresponding translated text. The dataset was annotated by 15 professional sign language interpreters and 55 school students having total or partial hearing loss. The recording process ensures that the sign language movements were carried out in strict accordance with the standards of the ''List of Words Commonly Used in National Sign Language''. The recordings were made using a monocular camera with a frame rate of 30 frames per second, at a resolution of 1280 × 720. The total size of the video data is 144 GB. The annotated reference sign language utterances were separated by spaces to obtain words with Chinese semantics. Then, manual checking and adjustment were performed. Finally, the word database was constructed based on the frequency of occurrence of the words; a total of 171 meaningful words were obtained and used for model learning.
An example of the key actions and annotations of the recorded sign language utterance is illustrated in Figure 5.
Our dataset was divided into training set, validation set, and test set with the number of presenters in the ratio of 5:1:1. This means that 15,000 videos from 50 sign language presenters are used for training, 3,000 videos from 10 sign language presenters are used for validation, and 3,000 videos from the remaining 10 sign language presenters are used for testing.

B. PUBLIC DATASETS 1) DATASETS FOR SFE-NET
The COCO-WholeBody dataset is the first large-scale benchmark dataset for whole-body pose estimation. In this paper, the 133 key points in the COCO-WholeBody dataset are reduced to 53 key points. Experiments are conducted on this dataset to evaluate the performance of our deep models with different numbers of key points. In addition, other whole-body pose estimation methods are also compared on this dataset.
The CSL-500 dataset [34] consists of 50 sign language presenters, 25,000 labeled video samples, and 500 sign language vocabularies. Each vocabulary contains 50 corresponding sign language videos, depth videos, and 21 skeleton key-points coordinate sequences. The dataset was divided according to the sign language demonstrators, with 36 presenters in the training set and 14 presenters in the test set. Only the RGB video data is used. The depth information and key point annotation are ignored, replaced by the 53 key points as in COCO-WholeBody.

2) DATASETS FOR TFE-NET
RWTH-PHOENIX-Weather 2014T [5] was used for training, validation and testing according to the official division of 7096, 519 and 642 video samples, respectively.

3) DATASETS FOR STFE-NET
To validate our proposed model, the continuous sign language dataset RWTH-Phoenix-Weather 2014T [5] and the Chinese continuous sign language dataset CSL [6] were used. The CSL dataset was divided into training set, validation set, and test set according to the number of sign language presenters in the ratio of 36:7:7. Only the RGB data in CSL was used, ignoring the depth information and key point annotation. A statistical comparison of our dataset with the public datasets is shown in TABLE 1.

C. EVALUATION METRICS
The evaluation metric used for the COCO-WholeBody dataset is Object Keypoint Similarity (OKS), which measures the similarity between the ground-truth and predicted key points. This metric is derived from Intersection over Union (IoU) and is calculated as follows:

$$OKS = \frac{\sum_i \exp\left(-d_i^2 / 2s^2 k_i^2\right)\, \delta(v_i > 0)}{\sum_i \delta(v_i > 0)} \tag{19}$$

where $i$ indexes the key points in the range (1, 133); key points (1, 11) and (92, 133) are the ones selected and used in our model. $d_i$ is the Euclidean distance between the predicted key point and the corresponding ground-truth key point, $s$ is the object scale, $k_i$ is the per-key-point normalization constant, and $v_i$ is the visibility flag of the ground-truth key point. The closer OKS is to 1, the closer the predicted value is to the true value.

The quantitative analysis of the pose estimation results is based on the Average Precision (AP) and Average Recall (AR) at an OKS threshold $T$, computed over the $n$ key points used and the $m$ evaluated samples.
To measure AP and AR, the OKS threshold is set in the range of 0.5-0.95, with a step size of 0.05, to obtain 10 AP and AR values, which are then averaged to compute the mean Average Precision (mAP) and mean Average Recall (mAR), respectively.
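The threshold sweep can be sketched as follows (a simplified per-sample AP that treats each sample's OKS as a single detection; the real COCO evaluation is considerably more involved):

```python
import numpy as np

def mean_ap(oks_per_sample):
    """Average the per-threshold AP over OKS thresholds 0.50, 0.55, ..., 0.95;
    here AP at threshold T is simply the fraction of samples with OKS > T."""
    oks = np.asarray(oks_per_sample)
    thresholds = 0.5 + 0.05 * np.arange(10)   # 0.50 .. 0.95 in steps of 0.05
    return float(np.mean([(oks > t).mean() for t in thresholds]))

print(mean_ap([0.96, 0.81, 0.62, 0.40]))  # 0.5
```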
In addition, the highest probability accuracy (Top-1 Accuracy) is used as the accuracy metric for sign language recognition. That is, for an input sign language word, the sign language word is considered to be correctly recognized if the output word with the highest probability agrees with the reference word.
The Word Error Rate (WER) is used to evaluate the sign language recognition performance, based on the accuracy of the isolated-word annotations. WER is calculated as follows:

$$WER = \frac{O_s + O_i + O_d}{N_{word}} \times 100\% \tag{22}$$

where $N_{word}$ is the number of words contained in the reference text, and $O_s$, $O_i$, and $O_d$ represent the numbers of substitution, insertion, and deletion operations, respectively, required to transform the recognized text into the reference text. The smaller the value of WER, the better the recognition performance.
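The WER computation described above is a standard edit-distance dynamic program (a minimal sketch; whitespace tokenization is assumed):

```python
def wer(reference, hypothesis):
    """Word Error Rate: minimum (substitutions + insertions + deletions)
    to turn the hypothesis into the reference, divided by N_word."""
    r, h = reference.split(), hypothesis.split()
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution (or match)
    return d[len(r)][len(h)] / len(r)

print(wer("today the weather is cold", "today weather is very cold"))  # 0.4
```

The example needs one deletion ("the") and one insertion ("very"), giving 2/5 = 0.4.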
The performance on the sign language translation task is measured using the Bilingual Evaluation Understudy (BLEU) [35], the most commonly used metric in machine translation. BLEU is calculated as follows:

$$BLEU = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) \tag{23}$$

where $w_n$ and $p_n$ are the weight and precision of the $n$-grams, respectively; an $n$-gram matches all sub-sequences of length $n$ in a sentence. $BP$ is the brevity penalty: if the translation is shorter than the reference, $BP$ is set to less than 1 to penalize the loss of information, as follows:

$$BP = \begin{cases} 1, & l_t > l_r \\ \exp\left(1 - l_r / l_t\right), & l_t \le l_r \end{cases} \tag{24}$$

where $l_t$ and $l_r$ represent the lengths of the actual translated text and the reference text, respectively. In our experiments, we set $N = 1, 2, 3, 4$ and evaluate the translation quality using BLEU-1, BLEU-2, BLEU-3 and BLEU-4.
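The BLEU computation described above can be sketched for a single sentence pair (uniform n-gram weights, clipped precisions, single reference; real evaluations typically use corpus-level statistics and standard tools):

```python
import math
from collections import Counter

def bleu(hypothesis, reference, max_n=4):
    """Single-sentence BLEU: brevity penalty times the geometric mean of
    clipped n-gram precisions p_1 .. p_max_n (weights w_n = 1/max_n)."""
    hyp, ref = hypothesis.split(), reference.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        if matches == 0:          # any empty precision zeroes the score
            return 0.0
        log_p += math.log(matches / total) / max_n
    # brevity penalty: < 1 only when the hypothesis is not longer than the reference
    bp = 1.0 if len(hyp) > len(ref) else math.exp(1 - len(ref) / max(len(hyp), 1))
    return bp * math.exp(log_p)

score = bleu("the weather is cold today", "the weather is cold today")
print(round(score, 2))  # 1.0
```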

D. EXPERIMENT ENVIRONMENT
All experiments were conducted on a server with an NVIDIA Tesla V100 32GB GPU and an Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz. PyTorch is the deep learning framework used in this paper.

E. EXPERIMENT RESULTS
Our spatial-temporal model consists of two main modules: the spatial feature extraction module and the temporal feature extraction module. We have conducted experiments on both modules and on the whole network.

1) SFE-NET RESULTS
TABLE 3 shows the pose-estimation results on the COCO-WholeBody dataset with 53 key points and with 133 key points. It can be seen that the mAP and mAR of body pose, hand pose, and global pose are all improved by reducing the 133 key points to 53 key points. This demonstrates the positive effect of key-point reduction on the extraction of sign-language spatial features. Moreover, OpenPose [16], [17], [18], Single-Network (SN) [19], HPRNet [20], and HRNet [21] are compared with our pose estimation method, SFE-Net, in TABLE 4. These methods are compared on the COCO-WholeBody dataset with the original 133 key points and with the 53 selected key points. SFE-Net achieves the best results for body-pose and global-pose estimation. HPRNet outperforms SFE-Net in hand-pose estimation, where SFE-Net is the second best among the methods compared. HRNet performs better than HPRNet on global estimation, which verifies the superiority of its feature sharing. SN, HPRNet, and HRNet are all one-step pose estimation networks, which predict the given 53 key points simultaneously. HPRNet performs well for small-scale estimation, such as hand pose. Both OpenPose and SFE-Net are two-step pose estimation networks, meaning that separate training is required for hand and body pose estimation. From the results, SFE-Net outperforms OpenPose on all pose estimations.

In addition, we also evaluate the different methods on the public dataset CSL-500, with the same temporal feature extraction model. The proposed SFE-Net achieves an accuracy of 95.7%, which is better than 3D-CNN and HPRNet. This proves that using the 53 key points leads to better learning of spatial features. Figure 6 shows some results of SFE-Net on the CSL-500 dataset. It can be seen that the proposed SFE-Net can accurately extract the pose features of sign language actions.

2) TFE-NET RESULTS
In order to evaluate the performance of TFE-Net, the RWTH-Phoenix-Weather 2014T dataset was used. TABLE 7 tabulates the parameters of the Transformer for this dataset. TABLE 8 presents the comparison results for different position embedding methods. The Transformer network based on absolute position encoding is used as the benchmark in the experiments. The methods based on LSTM and GRU perform relative position encoding for temporal feature embedding. Since the encoder input is fully context-aware, bi-directional Bi-LSTM and Bi-GRU are used for feature embedding. On the RWTH-Phoenix-Weather 2014T dataset, compared to absolute position encoding, using Bi-GRU for relative position embedding achieves a smaller WER, reduced by 0.76 and 0.21 on the validation and test sets, respectively. Furthermore, BLEU-4 is improved by 0.45 and 0.51 on the validation and test sets, respectively. Bi-GRU also outperforms Bi-LSTM for relative position encoding. This means that using Bi-GRU for relative position encoding can enrich the position information for the sign language recognition and translation task. We also present the results of optimizing the self-attention layer with position awareness in TABLE 9. Compared to no position awareness, using position awareness reduces WER by 1.41 and 0.13, and improves BLEU-4 by 0.23 and 0.60, on the validation and test sets, respectively.
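To make the position-aware attention idea concrete, the following is a minimal NumPy sketch of single-head self-attention with an additive relative-position bias, in the spirit of Shaw-style relative encoding. Note that TFE-Net derives its position information from a Bi-GRU, whereas the lookup table `rel_bias` here is a simplified stand-in; all names, shapes, and the clipping distance are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def rel_pos_attention(q, k, v, rel_bias, max_dist=8):
    """Single-head self-attention with an additive relative-position bias.
    rel_bias[d] holds a scalar bias for the clipped signed distance d.
    Simplified sketch: the paper's TFE-Net obtains its relative position
    information from a Bi-GRU; a lookup table stands in for it here."""
    T, dk = q.shape
    logits = q @ k.T / np.sqrt(dk)                   # (T, T) content scores
    idx = np.arange(T)
    dist = np.clip(idx[None, :] - idx[:, None], -max_dist, max_dist) + max_dist
    logits = logits + rel_bias[dist]                 # position-dependent bias
    return softmax(logits) @ v

T, d = 10, 16
rng = np.random.default_rng(0)
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
bias = rng.normal(size=17)                           # 2 * max_dist + 1 entries
out = rel_pos_attention(q, k, v, bias)
print(out.shape)                                     # (10, 16)
```

Because the bias depends only on the signed distance between frames, the attention pattern captures ordering information that absolute encodings express less directly.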
In addition, TABLE 10 tabulates the results of our method and of other Transformer models for sign language recognition and translation on the RWTH-Phoenix-Weather 2014T dataset. Compared with the Sign Language Transformer [14], our method achieves a lower WER on the validation and test sets, by 1.61 and 1.21, respectively, and a higher BLEU-4, by 1.05 and 0.94, respectively. NSLT [5] is used as the benchmark for this dataset; it achieves BLEU-4 scores of 18.40 and 18.13 on the validation and test sets, respectively. The temporal model of NSLT employs GRU-based attention mechanisms. All the Transformer-based models, such as Sign Language Transformer [14], Multi-channel [10], and STMC-Transformer [11], achieve better BLEU-4 scores. Our proposed method achieves the highest BLEU-4 scores, which are 23.17 and 22.74 on the validation and test sets, respectively. The direct CSLT task, i.e., with no intermediate supervision, can be achieved by setting the weight w2 in equation (18) to 0. The corresponding results are tabulated in TABLE 11. Our network still outperforms all the other models.
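The role of the weight w2 can be sketched as follows. Equation (18) is not reproduced in this section, so the variable names (`w1`, `w2`, `rec_loss`, `trans_loss`) and the linear combination below are assumptions for illustration, with w2 scaling the intermediate recognition (gloss) supervision:

```python
def joint_loss(rec_loss: float, trans_loss: float,
               w1: float = 1.0, w2: float = 1.0) -> float:
    """Weighted sum of the translation loss and the recognition (gloss)
    loss, mirroring the weighting described for equation (18). Setting
    w2 = 0 removes the intermediate gloss supervision, i.e. direct CSLT.
    Names and the exact form are illustrative, not the authors' notation."""
    return w1 * trans_loss + w2 * rec_loss

# With intermediate supervision vs. direct translation (w2 = 0):
print(joint_loss(0.8, 1.2))           # both terms contribute
print(joint_loss(0.8, 1.2, w2=0.0))   # only the translation term remains
```

In other words, the same network covers both the supervised and the direct setting; only the loss weighting changes.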

3) STFE-NET RESULTS
In this section, we evaluate the performance of STFE-Net, which combines the spatial and temporal models for continuous sign language translation. Figure 7 illustrates some pose estimation results of STFE-Net on our developed dataset.
After extracting the 53 key points, these spatial features are encoded and input to the Transformer. The parameters of the Transformer on our dataset are shown in TABLE 12. TABLE 13 shows the results of STFE-Net on our dataset and on the public datasets RWTH-Phoenix-Weather 2014T and CSL. The results indicate that our network performs better for Chinese continuous sign language translation. We also compare our method with other methods on the RWTH-Phoenix-Weather 2014T dataset, and the results are tabulated in TABLE 14. Sign Language Transformer [14] performs better than NSLT [5]. Multi-channel [10] was evaluated only for BLEU-4, and its result is inferior to that of Sign Language Transformer [14]. Our method achieves 22.57 and 22.45, in terms of BLEU-4, on the validation and test sets, respectively, and outperforms Sign Language Transformer [14].

V. DISCUSSION
In this paper, we design a CSLT network based on spatial-temporal feature fusion. The model mainly consists of two networks, the spatial feature extraction network and the temporal feature extraction network, and the whole network is optimized for the CSLT task.
The spatial feature extraction network is an improvement on an existing pose estimation method. On the one hand, we reduce the 133 key points in COCO-WholeBody to 53 key points. This reduction allows our model to learn better sign language spatial features. The original numbers of key points for body pose, hand pose, and global pose estimation are 17, 42, and 133, respectively; after the reduction, the corresponding numbers are 11, 42, and 53. As shown in TABLE 3, the mAP and mAR for the key point detection of the body, hand, and whole body are improved by using the reduced key point set. This demonstrates the positive impact of the key point reduction on learning sign language action features. However, since the reduction is designed to increase attention to the hands, it may not be applicable to scenarios involving lower-limb posture, which is a limitation of this method. On the other hand, the hand pose conveys a large amount of sign language information, so a parallel structure, based on HRNet [21], is adopted: high-resolution pose estimation is performed on the hands along with whole-body pose estimation. The proposed pose estimation model was evaluated on the sign language recognition dataset CSL-500, and the recognition results were tabulated in TABLE 4. The results show that our pose estimation model can extract better sign language features than 3D-CNN [36] and HPRNet [20].
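The two-branch idea, a whole-body estimate plus a higher-resolution pass over the hands, can be caricatured as cropping a hand region and processing it at a finer scale. The sketch below is purely illustrative of that design, not the actual HRNet-based implementation; the crop size, nearest-neighbor upsampling, and scale factor are arbitrary choices:

```python
import numpy as np

def crop_and_upsample(frame: np.ndarray, center, half=16, scale=4):
    """Crop a square patch around a hand center and upsample it by
    nearest-neighbor repetition, mimicking the idea of feeding the hands
    to a higher-resolution estimator. Illustrative sketch only."""
    h, w = frame.shape[:2]
    y0, y1 = max(0, center[0] - half), min(h, center[0] + half)
    x0, x1 = max(0, center[1] - half), min(w, center[1] + half)
    patch = frame[y0:y1, x0:x1]
    return patch.repeat(scale, axis=0).repeat(scale, axis=1)

frame = np.random.rand(128, 128)
hi_res_hand = crop_and_upsample(frame, center=(64, 64))
print(hi_res_hand.shape)  # (128, 128): a 32x32 patch upsampled 4x
```

The whole-body branch would consume `frame` directly, while the hand branch consumes `hi_res_hand`, so fine finger articulation is estimated at a resolution the global branch cannot afford.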
After extracting the continuous spatial features, the Transformer network is utilized for temporal feature learning.
To learn the sequential ordering information of sign language in video sequences, relative position encoding is introduced into the self-attention layer. The experimental results in TABLE 8 and TABLE 9 demonstrate that relative position encoding and position-aware self-attention positively affect continuous sign language recognition. In addition, TABLE 10 compares the results of different temporal models on RWTH-Phoenix-Weather 2014T, and our approach outperforms all four compared methods: NSLT [5], Sign Language Transformer [14], Multi-channel [10], and STMC-Transformer [11].
The spatial feature extraction model and the temporal feature extraction model are combined to achieve end-to-end video sign language translation. TABLE 14 shows the results of our method on RWTH-Phoenix-Weather 2014T, as well as the results of other methods, i.e., NSLT [5], Sign Language Transformer [14], and Multi-channel [10]. Our proposed network outperforms the other methods. In addition, as can be seen in TABLE 13, the proposed model achieves BLEU-1 = 77.59, BLEU-2 = 75.62, BLEU-3 = 74.25, and BLEU-4 = 72.14 on our Chinese continuous sign language dataset. The results on the CSL dataset are also promising. However, our results on RWTH-Phoenix-Weather 2014T are worse. This may be due to the fact that the RWTH-Phoenix-Weather 2014T dataset has a larger number of sentences but a smaller number of samples per sentence, which weakens the generalization of the model.
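As a reference for the reported scores, a simplified sentence-level BLEU can be computed as the brevity-penalized geometric mean of clipped n-gram precisions. Published results use corpus-level BLEU, typically with smoothing (e.g. sacreBLEU), so this sketch only makes the metric concrete:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU-N: geometric mean of clipped n-gram
    precisions times a brevity penalty. Real evaluations use corpus-level
    BLEU with smoothing; this is only to illustrate the reported metric."""
    precisions = []
    for n in range(1, max_n + 1):
        cand, ref = Counter(ngrams(candidate, n)), Counter(ngrams(reference, n))
        overlap = sum((cand & ref).values())      # clipped n-gram matches
        total = max(sum(cand.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0                                 # unsmoothed: any zero kills it
    log_p = sum(math.log(p) for p in precisions) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_p)

cand = "the weather will be cold tomorrow".split()
ref = "the weather will be cold tomorrow".split()
print(round(bleu(cand, ref), 2))  # 1.0 for an exact match
```

Because higher-order n-grams are harder to match, BLEU-4 is always at most BLEU-1, which is the pattern visible across the reported tables.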

VI. CONCLUSION AND FUTURE WORKS
In this paper, a spatial-temporal feature extraction network was proposed for continuous sign language translation (CSLT) in real-world scenes. In the spatial feature extraction module, 53 key points related to sign language characteristics are selected, and a parallel structure is designed to perform whole-body pose estimation along with high-resolution hand pose estimation, to obtain finer-grained sign language features. In the temporal feature extraction module, temporal features are learned with a Transformer, which employs relative position encoding for the spatial features and sign language words, and incorporates relative position information into Multi-Head Attention (MHA). Furthermore, a Chinese sign language dataset was created for this research, which enriches the research on Chinese sign language translation. Our proposed method achieves promising performance on our dataset and on multiple public datasets, which shows that it is effective for continuous sign language recognition in practical scenarios.
The key points used in this paper are migrated from the whole-body pose estimation dataset COCO-WholeBody. Research on key points selected specifically for CSLT could further improve the translation accuracy. Furthermore, if a smaller number of key points were used, the model size and computational complexity could be further reduced. In addition, the dataset created in this work can be extended for the study of sign language recognition and translation with intermediate supervision. For a deep learning model, especially in real-time processing scenarios, accuracy and real-time performance (or availability) are two important evaluation metrics. The approach in this paper focuses on providing a feasible method for continuous sign language translation; more attention is paid to the accuracy of pose estimation and sign language translation, while the real-time performance of the model is not quantified. In practice, key frame extraction can improve the real-time performance of video processing. Future work will center on efficient key frame extraction algorithms and real-time performance.
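As a pointer to that future direction, the simplest form of key frame extraction keeps a frame only when it differs sufficiently from the last kept frame. The thresholded frame-difference rule below is an assumption for illustration; practical systems would use motion or pose-change cues:

```python
import numpy as np

def select_key_frames(frames: np.ndarray, threshold: float) -> list:
    """Keep a frame only if its mean absolute difference from the last kept
    frame exceeds `threshold`. A deliberately simple sketch of the key-frame
    extraction direction mentioned as future work."""
    kept = [0]
    for i in range(1, len(frames)):
        diff = np.abs(frames[i].astype(float)
                      - frames[kept[-1]].astype(float)).mean()
        if diff > threshold:
            kept.append(i)
    return kept

# Toy video: 6 identical frames with one abrupt change at index 3.
video = np.zeros((6, 8, 8), dtype=np.uint8)
video[3:] = 200
print(select_key_frames(video, threshold=10.0))  # [0, 3]
```

Feeding only the kept frames to the pose and translation models would trade a small amount of accuracy for a proportional reduction in per-video computation.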