1 Introduction

Unlike hearing people, hearing-impaired people rely primarily on sign language for communication. Sign language is a gesture system with linguistic properties that has gradually been influenced by spoken and written language [1]. As with spoken languages, each country has its own sign language, and similarities exist between them [2]. Research on Chinese sign language recognition has therefore been receiving increasing attention. Researchers in computer vision and natural language processing process and analyze the gestures performed by different signers in sign language videos and translate a series of signs into the corresponding sign gloss sentences, thereby facilitating daily communication between hearing-impaired and hearing people [3].

Many recent studies on continuous Chinese sign language recognition (CCSLR) have focused on training models from scratch on a small-scale dataset [4–7]. However, it is well known that deep neural network models [8] trained from scratch are prone to overfitting when the training dataset is small [9]. Moreover, the widely used Chinese sign language dataset, USTC-CCSL, is insufficient for the CCSLR task [4]. Therefore, it is crucial to investigate how to alleviate data insufficiency for Chinese sign language recognition.

There are many ways to alleviate data insufficiency, such as data augmentation [10, 11] and pre-training [12]. An intuitive solution to the lack of Chinese sign language data is to learn from the sign language datasets of other countries. Research has shown that CCSLR performance can be improved by pre-training on cross-lingual datasets [13–15]. Sign language linguistics has established that similarities exist between the sign languages of different countries, such as shared gestures and similar changes in facial expression [2]. Hence, we propose using transfer learning to share knowledge between the sign languages of different countries. Leveraging non-Chinese sign language datasets of larger scale and better recognition performance, we propose a novel scheme for fine-tuning pre-trained models, termed FTP, for CCSLR. As illustrated in Fig. 1, the method focuses on feature processing in the CCSLR task. It consists of two parts: a spatial feature module, initialized from a pre-trained model and then fine-tuned, that extracts the spatial features of each frame, and a temporal feature module, obtained by transfer learning, that extracts the temporal features of the frame sequence. To better extract the feature representation of each frame, we use ResNet18 [16] as the backbone. Considering the differences in shooting background and signer appearance across videos, we adopt the pre-trained ResNet18 model as the initialization of the FTP backbone and then fine-tune it.

Figure 1

The pipeline of the proposed scheme for fine-tuning of pre-trained models termed FTP to enhance the spatial feature module and temporal feature module for addressing data insufficiency for CCSLR. Our main idea is to share useful information from other non-Chinese sign language datasets through transfer learning

To extract the temporal features of the frame sequence, we consider not only single-frame information but also contextual information, which is essential for the CCSLR task [17]. A transformer encoder [18, 19], which currently performs well on sequence modeling, is used as the temporal feature module. It has been shown that the representation knowledge of a transformer encoder is relatively easy to share with other models [20], which is conducive to transferring knowledge between sign language recognition tasks. Considering the similarities between sign languages in different countries and the limited size of USTC-CCSL, we freeze the transformer encoder pre-trained on a better-performing non-Chinese dataset in the FTP scheme. In this way, the model can avoid overfitting on the USTC-CCSL dataset. The main contributions of this paper are as follows:

  1)

    A novel CCSLR method is proposed that avoids overfitting when training on a smaller dataset. The ablation study shows that the proposed FTP scheme, pre-trained on foreign sign language datasets, brings a significant improvement in recognition performance compared with training on USTC-CCSL alone.

  2)

    For feature processing, the proposed fine-tuning scheme initializes the backbone parameters from the pre-trained model and then updates them to obtain a more robust spatial feature module. For the temporal feature module, a frozen pre-trained transformer encoder is used to share more reliable contextual information.

2 Related work

Since the focus of this paper is feature processing in CCSLR, we primarily review recent progress in CCSLR from two aspects: spatial feature processing and temporal feature processing. Another essential part of CCSLR, the alignment model, is introduced in Sect. 2.3.

2.1 Spatial feature processing in CCSLR

Feature extraction can be divided into single-cue processing and multi-cue fusion. The former focuses on a single visual cue, such as the hands or the body, while the latter combines multiple visual cues.

Some researchers [21, 22] utilized hand-crafted features for single-cue feature extraction. For example, Zheng and Liang [23] extended the histogram of oriented gradient (HOG) descriptor to the 3D motion map based pyramid histogram of oriented gradient (M-PHOG) by adding a pyramid representation for CCSLR. Pu et al. [24] developed a curve feature descriptor that depends on the shape of the hand-movement trajectory. At present, convolutional neural networks (CNNs) are widely applied to image recognition tasks such as CCSLR [25]. Li et al. [26] employed a ResNet-152 network with its final fully connected layer removed to extract the features of each frame.

Considering the coordination of visual factors in sign language, the idea of implicitly cooperating multiple cues has gradually been adopted [3]. Wang et al. [27] proposed a representation model based on the covariance matrix to fuse information from multimodal sources. Benefiting from conclusions in sign language linguistics, it has been established that similarities exist between the sign languages of different countries, such as shared gestures and similar facial expressions [28]. At the same time, considering the differences in shooting background and signer appearance across videos, we adopt the pre-trained ResNet18 model as the initialization of the FTP backbone and then fine-tune it.

Guo et al. [29] proposed a hierarchical deep recursive fusion (HRF) model for automatic recognition from red-green-blue (RGB) images and skeleton data. Zhang et al. [30] introduced a method that fuses trajectory probability and hand shape probability, using a new enhanced shape context (eSC) feature to represent the spatial and temporal information of the trajectory. Zhou et al. [3] treated different body parts as multiple semantic cues whose features are extracted and merged; their work adopted the VGG-11 model as the backbone network to generate multi-cue features of full frames, hands, faces, and poses.

Hu et al. [12] investigated cross-language continuous sign language recognition (CSLR) by introducing an additional shared sequential module to learn knowledge shared across languages. In contrast to their work, our scheme achieves a significant performance improvement without introducing an additional module.

2.2 Temporal feature processing in CCSLR

Temporal features are primarily used to model contextual information, which is important for CCSLR [3, 29, 31]. Cheng et al. [32] established a fully convolutional network (FCN) for online sign language recognition that simultaneously learns the spatial and temporal characteristics of sequences. A common CSLR framework consists of a pre-trained visual module, a contextual module such as a recurrent neural network (RNN), and a connectionist temporal classification (CTC) based alignment module for gloss generation; this framework has also been used for CCSLR. The RNN learns long-term information from the spatial-temporal feature sequence, and the CTC algorithm maps the video sequence to the ordered sign gloss sequence. For instance, Wei et al. [7] introduced a multi-scale perception (MSP) strategy based on the CNN-RNN-CTC framework to learn discriminative representations of video clips. In addition, Liao et al. [33] proposed a feature extraction module based on B3D ResNet, which extracts spatial-temporal features from segmented video frames. To improve recognition accuracy, Zhao et al. [34] put forward a 3D-CNN method combined with optical flow processing. Yang et al. [35] combined LSTM and 3D convolution at the frame level to create an effective temporal modeling architecture. Pu et al. [36] used soft-DTW to constrain the long short-term memory (LSTM) results to obtain better annotations and fine-tuned a 3D-ResNet to extract better temporal features. Niu and Mak [28] employed a transformer encoder to extract the temporal information between video frames. An advantage of the transformer encoder is that the residual connections between layers facilitate backpropagation.

2.3 Feature sequence alignment in CCSLR

In CCSLR, the alignment module is used to find the correspondence between the video feature sequence and the gloss annotation sequence [37]. At present, most sign language recognition methods adopt the CTC algorithm [38] for alignment [6, 32, 36, 39–42]. Aiming at the inconsistency between the CTC objective and the evaluation metric, Pu et al. [39] proposed cross-modality augmentation for their architecture. Zhou et al. [6] introduced a dynamic pseudo-label decoding method, in which a reasonable alignment path is found through dynamic programming. Their model introduces a blank class through the CTC algorithm to filter out wrong labels and generates pseudo-labels that conform to the natural word order of sign language.

Different from the above research, our FTP scheme focuses on mitigating the model overfitting caused by the small-scale dataset and the training method. In this paper, an FTP scheme is proposed that transfers knowledge to CCSLR to achieve better feature processing.

3 Methodology

The pipeline of the proposed framework is illustrated in Fig. 2.

Figure 2

An illustration of our fine-tuning pre-trained scheme with two modules

3.1 Overview of the proposed framework

Consider a processed sign language video \(\mathcal{X}=\{\boldsymbol{x}_{t}\}_{t=1}^{T}\), where \(\boldsymbol{x}_{t}\) and T denote the t-th video frame and the number of frames in the video, respectively. First, in the spatial feature module, the ResNet18 network, initialized from a pre-trained model and updated during fine-tuning, is adopted to extract the corresponding spatial features \(\mathcal{Z}=\{\boldsymbol{z}_{t}\}_{t=1}^{T}\). Then, in the temporal feature module, the transformer encoder, pre-trained and kept frozen during fine-tuning, is used to extract the corresponding spatial-temporal features \(\mathcal{C}=\{\boldsymbol{c}_{t}\}_{t=1}^{T}\). Our alignment module is developed from the stochastic fine-grained label (SFL) method [28]. The gloss sequence y corresponding to the video \(\mathcal{X}\) is first input into the SFL model [28], and y is adjusted to the multi-state \(\tilde{\mathbf{y}}\) by SFL’s reinforcement learning algorithm. Next, in the alignment module, the gloss corresponding to the video frame x is determined by the probability \(p(\tilde{\mathbf{y}}|\boldsymbol{x})\), which is calculated by the CTC algorithm [38]. Because our basic framework is developed on top of SFL, we use SFL as the baseline method in this work.
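As a rough illustration of this pipeline, the following PyTorch sketch wires a ResNet18 spatial module, a transformer-encoder temporal module, and a per-frame classifier for CTC-based alignment. The class and attribute names (`FTPPipeline`, `spatial`, `temporal`) and the 512-dimensional feature size are our own assumptions for illustration and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn
import torchvision


class FTPPipeline(nn.Module):
    """Minimal sketch: spatial module -> temporal module -> per-frame gloss scores."""

    def __init__(self, vocab_size, d_model=512):
        super().__init__()
        # Spatial feature module: ResNet18 backbone with its classification layer removed
        backbone = torchvision.models.resnet18(weights=None)
        backbone.fc = nn.Identity()              # per-frame features of size 512
        self.spatial = backbone
        # Temporal feature module: transformer encoder over the frame sequence
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           dim_feedforward=2048, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=2)
        # Per-frame gloss classifier; the extra class is the CTC blank
        self.classifier = nn.Linear(d_model, vocab_size + 1)

    def forward(self, frames):                   # frames: (B, T, 3, H, W)
        b, t = frames.shape[:2]
        z = self.spatial(frames.flatten(0, 1))   # (B*T, 512) spatial features
        z = z.view(b, t, -1)                     # (B, T, 512)
        c = self.temporal(z)                     # (B, T, 512) spatial-temporal features
        return self.classifier(c).log_softmax(-1)  # log-probs for a CTC-based aligner
```

The SFL-based alignment and its reinforcement learning step are omitted here; the sketch only shows how the two feature modules feed a CTC-based aligner.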

3.2 Fine-tuning pre-trained scheme

To exploit pre-training on non-Chinese datasets, we take the relevant parameters of the converged pre-trained model, covering both the spatial and the temporal feature modules, and transfer them to the CCSLR task. In addition, we refer to the transfer learning method of Ref. [13] and make some adjustments on that basis to verify the effectiveness of our method.

3.2.1 Spatial feature extractor module

The goal of CCSLR is to predict the corresponding gloss sequence \(\mathbf{y}=\{\boldsymbol{y}_{i}\}_{i=1}^{K}\) \((K \leq T)\) based on the video sequence \(\mathcal{X}=\{\boldsymbol{x}_{t}\}_{t=1}^{T}\). When transferring the parameters of the spatial feature model, we transfer the weights w and biases b (see Eq. (1) for details) pre-trained on the PHOENIX-2014 dataset to the FTP backbone for initialization and update them in the subsequent CCSLR training phase.

$$\begin{aligned} \boldsymbol{z}=\sum_{c=1}^{\mathrm{chan}} \sum _{\delta u, \delta v} \boldsymbol{x}(u+\delta u, v+\delta v, c) \times \boldsymbol{w}(\delta u, \delta v)+\boldsymbol{b}, \end{aligned}$$
(1)

where u and v denote the pixel coordinates of the video frame x, which has chan channels indexed by c, and b is the bias term. The spatial neighbourhood spanned by \(\delta u\) and \(\delta v\) is defined by the kernel size of the convolution layer [43], and w is the weight of the convolution kernel.
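A minimal sketch of this initialization step is given below, continuing the hypothetical `FTPPipeline` model from Sect. 3.1; the checkpoint path, the `"backbone"` key, and the vocabulary size are assumptions about how the PHOENIX-2014 pre-trained weights might be stored and used.

```python
import torch

# Hypothetical checkpoint produced by pre-training on PHOENIX-2014
state = torch.load("phoenix2014_pretrained.pth", map_location="cpu")

# Initialize the FTP backbone (the w and b of every convolution) from the pre-trained model
model = FTPPipeline(vocab_size=1000)              # placeholder vocabulary size
model.spatial.load_state_dict(state["backbone"], strict=False)

# The backbone stays trainable, so w and b are updated during CCSLR training
for p in model.spatial.parameters():
    p.requires_grad = True
```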

3.2.2 Temporal feature extractor module

After obtaining the spatial features \(\mathcal{Z}=\{\boldsymbol{z}_{t}\}_{t=1}^{T}\) of the video sequence \(\mathcal{X}=\{\boldsymbol{x}_{t}\}_{t=1}^{T}\), we further extract the temporal features \(\mathcal{C}=\{\boldsymbol{c}_{t}\}_{t=1}^{T}\) from \(\mathcal{Z}\). In this work, a transformer encoder is adopted as the temporal feature extractor. It updates the features by computing the weights between features at different time steps. First, the queries \(\boldsymbol{Q}\), keys \(\boldsymbol{K}\), and values \(\boldsymbol{V}\) are obtained by mapping \(\mathcal{Z}=\{\boldsymbol{z}_{t}\}_{t=1}^{T}\) through three linear transformation matrices \(\boldsymbol{W}_{Q}\), \(\boldsymbol{W}_{K}\), and \(\boldsymbol{W}_{V}\).

We obtain \(\mathrm{Attention} (\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V} )\) according to Eq. (2):

$$\begin{aligned} \mathrm{Attention} (\boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V} )= \mathrm{softmax} \biggl( \frac{\boldsymbol{Q} \boldsymbol{K}^{T}}{\sqrt{d_{k}}} \biggr) \boldsymbol{V}, \end{aligned}$$
(2)

where \(d_{k}\) denotes the dimension of the keys, which serves as a scaling factor.

The attention output is added to \(\mathcal{Z}\) and normalized to give \(\mathcal{L}\), as in Eq. (3). Then, \(\mathcal{L}\) enters a fully connected feed-forward network, whose output is added to \(\mathcal{L}\) and normalized, as in Eq. (4), yielding the spatial-temporal feature \(\mathcal{C}\):

$$\begin{aligned} &\mathcal{L}=\mathrm{LayerNorm} \bigl(\mathcal{Z}+\mathrm{Attention} ( \boldsymbol{Q}, \boldsymbol{K}, \boldsymbol{V} ) \bigr), \end{aligned}$$
(3)
$$\begin{aligned} &\mathcal{C}=\mathrm{LayerNorm} \bigl(\mathcal{L}+\max (0, \mathcal{L} \boldsymbol{W}_{1}+\boldsymbol{b}_{1} ) \boldsymbol{W}_{2}+\boldsymbol{b}_{2} \bigr). \end{aligned}$$
(4)
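For concreteness, Eqs. (2)-(4) correspond to the following single-head computation. This is a simplified sketch that omits the multi-head splitting and dropout of a full transformer encoder layer; d_model = 512 and the feed-forward dimension of 2048 follow Sect. 4.1.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class EncoderLayerSketch(nn.Module):
    """Single-head sketch of Eqs. (2)-(4)."""

    def __init__(self, d_model=512, d_ff=2048):
        super().__init__()
        self.W_Q = nn.Linear(d_model, d_model, bias=False)
        self.W_K = nn.Linear(d_model, d_model, bias=False)
        self.W_V = nn.Linear(d_model, d_model, bias=False)
        self.W_1 = nn.Linear(d_model, d_ff)       # W_1 and b_1
        self.W_2 = nn.Linear(d_ff, d_model)       # W_2 and b_2
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, Z):                         # Z: (T, d_model) spatial features
        Q, K, V = self.W_Q(Z), self.W_K(Z), self.W_V(Z)
        d_k = Q.size(-1)
        attn = F.softmax(Q @ K.transpose(-2, -1) / d_k ** 0.5, dim=-1) @ V  # Eq. (2)
        L = self.norm1(Z + attn)                                            # Eq. (3)
        C = self.norm2(L + self.W_2(F.relu(self.W_1(L))))                   # Eq. (4)
        return C
```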

We transfer the seven parameters, \(\boldsymbol{W}_{Q}\), \(\boldsymbol{W}_{K}\), \(\boldsymbol{W}_{V}\), \(\boldsymbol{W}_{1}\), \(\boldsymbol{b}_{1}\), \(\boldsymbol{W}_{2}\), and \(\boldsymbol{b}_{2}\), pre-trained on the PHOENIX-2014 dataset to the CCSLR task. In particular, these parameters are kept frozen during fine-tuning.
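A minimal sketch of this transfer-and-freeze step, continuing the hypothetical `model` and checkpoint from the previous sketches (the `"encoder"` key is again an assumption):

```python
# Transfer the pre-trained encoder parameters (W_Q, W_K, W_V, W_1, b_1, W_2, b_2)
model.temporal.load_state_dict(state["encoder"])

# Freeze them: they receive no gradient updates during CCSLR training
for p in model.temporal.parameters():
    p.requires_grad = False
```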

3.3 Discussion

At present, few datasets are available for the CCSLR task, and training a deep neural network on a small-scale dataset is prone to overfitting. To alleviate data insufficiency for CCSLR at the method level, we draw on the mature fine-tuning pre-trained scheme in transfer learning, focusing on feature processing in CCSLR. The scheme alleviates overfitting by learning “knowledge” from better non-Chinese datasets, covering both the spatial and the temporal feature descriptions. Taking into account the image-level differences between signers in different datasets (such as clothing and background), we initialize the backbone from the pre-trained model and further update it on the Chinese sign language dataset; at the same time, based on the premise that the temporal features have a certain generalization ability, we keep the transformer encoder parameters pre-trained on PHOENIX-2014 frozen in the CCSLR model.

4 Experiments

4.1 Implementation details

In the pre-training stage, we remove the fully connected layer of ResNet18 so that its output size is 512. For the transformer encoder, we adopt a 2-layer encoder with four heads, set the dimension of the position-wise feed-forward layer to 2048, and set the model dimension to 512.

In the model optimization stage, the Adam optimizer is used with a learning rate of \(1 \times 10^{-4}\) and a weight decay of \(1 \times 10^{-4}\). In addition, we set the batch size to 4. The model is trained for 100 epochs and saved after each epoch. After testing, the checkpoint with the best test result is chosen as our pre-trained model.
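The training loop of the pre-training stage can be sketched as follows; `train_one_epoch` and `evaluate` are hypothetical helpers standing in for the usual CTC training and WER evaluation routines, which are not specified in the paper.

```python
import torch

# Optimizer settings from this section; the dataloader (not shown) uses a batch size of 4
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=1e-4)

best_wer = float("inf")
for epoch in range(100):                     # 100 epochs, saving the model after each one
    train_one_epoch(model, optimizer)        # hypothetical CTC training routine
    wer = evaluate(model)                    # hypothetical evaluation returning WER
    torch.save(model.state_dict(), f"checkpoint_epoch{epoch}.pth")
    if wer < best_wer:                       # keep the checkpoint with the best test result
        best_wer = wer
        torch.save(model.state_dict(), "pretrained_best.pth")
```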

In the fine-tuning stage, we transfer the seven parameters of the pre-trained model (see the text above for their specific meaning) to the CCSLR task. In particular, because of the frozen fine-tuning method, we set the learning rate of the corresponding frozen transformer encoder parameters to 0 in the CCSLR model. The model stops training after the 100th epoch. After training, we test the saved models and select the one with the best performance as our FTP model.
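One way to realize the zero learning rate for the frozen parameters is through optimizer parameter groups, sketched below for the hypothetical `FTPPipeline` model in which the transformer encoder is stored under the `temporal` attribute.

```python
# Give the transferred encoder parameters a learning rate of 0 so they stay frozen,
# while the backbone and classifier are fine-tuned normally.
frozen = list(model.temporal.parameters())
trainable = [p for n, p in model.named_parameters() if not n.startswith("temporal")]

optimizer = torch.optim.Adam(
    [{"params": trainable, "lr": 1e-4},
     {"params": frozen, "lr": 0.0}],     # frozen transformer encoder parameters
    weight_decay=1e-4,
)
```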

The proposed CCSLR method is implemented using the PyTorch framework on an NVIDIA RTX2080Ti GPU.

4.1.1 Datasets and evaluation

The PHOENIX-Weather-2014 dataset [44] is a German sign language dataset captured from sign language interpretations of weather forecasts. The video resolution is 210 × 260 pixels, the vocabulary size is 1081, and the dataset contains 6841 videos in total, performed by nine signers.

The PHOENIX-Weather-2014-T dataset [45] is an extension of the PHOENIX-2014 dataset. It is designed for sign language translation but can also be used to evaluate CSLR tasks. Similar to the PHOENIX-2014 dataset, its video resolution is 210 × 260 pixels, and there are also nine signers in the video samples of PHOENIX-2014-T. The vocabulary size of this dataset is 1066 words, and there are 8257 video samples in total.

The RWTH-Boston-104 dataset [46] was collected by the National Center for Sign Language and Gesture Resources at Boston University. It was captured simultaneously by three black-and-white cameras and one color camera from multiple perspectives. The video resolution is 312 × 242 pixels. The dataset consists of 201 annotated ASL videos, divided into a training set of 161 sentences and a test set of 40 sentences.

The GSL dataset [47] contains 40,785 video samples in total, performed by seven signers. The video resolution of GSL is 848 × 480 pixels, and the vocabulary contains 310 words.

The USTC-CCSL dataset [4] was recorded using Kinect devices. The resolution of the video is 1280 × 720 pixels. A total of 50 signers participated in the video recording. The dataset has 25,000 videos in total.

4.1.2 Evaluation indicator

Similar to most CCSLR studies in the literature, we choose the word error rate (WER) as the evaluation metric. WER counts the minimum number of substitution, insertion, and deletion operations required to convert the recognized sequence into the corresponding sign language annotation, normalized by the reference length, as follows:

$$\begin{aligned} \mathrm{WER}= \frac{\#\mathrm{sub}+\#\mathrm{ins}+\#\mathrm{del}}{\#\mathrm{reference}}, \end{aligned}$$
(5)

where #sub, #ins, and #del represent the minimum numbers of substitution, insertion, and deletion operations, respectively, required to convert the hypothesis sentence into the reference sentence, and #reference is the number of words in the reference sentence.
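As a small illustration of Eq. (5), WER can be computed from the minimum edit distance between the hypothesis and the reference gloss sequence; the sketch below is a generic implementation, not the evaluation script used in the experiments.

```python
def wer(reference, hypothesis):
    """Word error rate: (#sub + #ins + #del) / #reference, via Levenshtein distance."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                                   # i deletions
    for j in range(len(h) + 1):
        d[0][j] = j                                   # j insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1   # substitution cost
            d[i][j] = min(d[i - 1][j] + 1,            # deletion
                          d[i][j - 1] + 1,            # insertion
                          d[i - 1][j - 1] + cost)     # substitution or match
    return d[len(r)][len(h)] / len(r)


# Example with made-up glosses: one substitution in a four-word reference gives WER = 0.25
print(wer("I TOMORROW GO SCHOOL", "I TODAY GO SCHOOL"))
```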

It is worth noting that we use the SPLIT-II setting [48] for the recognition task, which is more realistic. The specific reasons are as follows:

1) The SPLIT-I task focuses more on recognition across different signers, which mainly reflects the robustness of the model’s backbone. When the dataset is small, models are prone to overfitting, and it is difficult to distinguish different recognition methods by their WER.

2) In the SPLIT-II task, sentences in the test set do not appear in the training set, but the gestures in the test set are included in the training set. This setting is similar to that of the most influential dataset, PHOENIX-2014.

4.2 Comparisons with the state-of-the-art CCSLR methods

We evaluate our FTP method on the commonly used public dataset USTC-CCSL and compare it with a large number of state-of-the-art CCSLR methods, such as CMA [39], STMC [3], and SFL [28]. For fairness, we directly use the experimental results reported in these papers, except for the baseline (among the methods listed in Table 1, only the baseline (SFL) has open-source code; however, it has not previously been tested on the USTC-CCSL dataset).

Table 1 Comparison of our FTP method with the state-of-the-art CCSLR methods on the USTC-CCSL SPLIT-II task. WER represents word error rate

For the baseline (SFL) [28] and our FTP method, we set the parameters of these two methods on the USTC-CCSL dataset and report the results in terms of the official evaluation measure recommended by USTC-CCSL. Table 1 lists the comparison results, where the evaluated CCSLR methods are ranked in descending order of WER. We can observe that our FTP method achieves the same performance as the CMA [39] method.

To vividly illustrate the superiority of our FTP method, we have selected three representative videos from the USTC-CCSL test dataset. By analyzing the challenging video sequences processed by FTP, as demonstrated in Fig. 3, we found that our FTP method performs well in the following three types of typical problems:

Figure 3

Recognition results of the baseline method and our method. The red rectangle indicates an error in the recognition. It can be seen from the 96th video (top) that our method corrects an insertion error, from the 7th video (middle) that our method corrects a substitution error, and from the 15th video (bottom) that our method corrects two errors even though it also makes an incorrect prediction

In the 96th video, FTP recognizes auxiliary words better, because in Chinese sign language the gestures for auxiliary words are usually simple and short. The 7th video shows that FTP’s backbone is more robust: although the baseline captured the gesture, it generated an incorrect gloss, and the gestures for “policeman” and “nurse” are indeed similar. In the 15th video, neither FTP nor the baseline produces a fully correct prediction; however, our method only confuses the two gestures for “ruler” and “country” (these two words share a common gesture of raising both hands in Chinese sign language), whereas the baseline makes more errors.

It can be observed that our method achieves a significant performance improvement on the SPLIT-II task of the USTC-CCSL dataset, which indicates that the FTP method is effective. Note that the FTP model listed in Table 1 was pre-trained on the PHOENIX-2014 dataset before fine-tuning.

4.3 Comparison with different pre-trained datasets

To further analyze the strengths and weaknesses of the proposed FTP method with different pre-training datasets, we evaluate it with several non-Chinese sign language datasets by fine-tuning the different pre-trained models on the USTC-CCSL SPLIT-II task. The results are displayed in Table 2. We can observe that pre-training on the PHOENIX-2014 dataset performs best, while PHOENIX-2014-T, which focuses on the translation task, brings the smallest performance improvement. These results indicate that the recognition task requires datasets annotated specifically for recognition, and that PHOENIX-2014 is the most suitable dataset for transferring knowledge to Chinese sign language.

Table 2 Comparison of the transfer effects of different non-Chinese sign language datasets. WER represents word error rate

Therefore, PHOENIX-2014 is selected as the pre-trained dataset in our proposed FTP method.

4.4 Ablation study of FTP

After determining the pre-training dataset, the next step is to select the appropriate fine-tuning strategy to obtain suitable spatial and temporal features. The ablation experiments in Table 3 and Table 4 explain why we choose the respective fine-tuning strategies for the backbone and the transformer encoder modules of FTP.

Table 3 Ablation results of our FTP regarding ResNet18’s weight state on the USTC-CCSL SPLIT-II task. For each different pre-trained dataset, the best results are highlighted in bold
Table 4 Ablation results of our FTP regarding the transformer encoder’s weight state on the USTC-CCSL SPLIT-II task. For each different pre-trained dataset, the best results are highlighted in bold

4.4.1 Effect of the spatial feature extractor

For fairness, we first pre-train the backbone on four typical non-Chinese sign language datasets and then transfer the resulting backbones to the baseline method. Afterward, the initialized backbones are fine-tuned on the USTC-CCSL training set either frozen or updated, while the transformer encoder weights are kept updated. As presented in Table 3, updating the transferred backbone parameters is more conducive to the CCSLR task. This is because the backbone focuses on the appearance of video frames, and different datasets often exhibit different appearances due to different signers and environments, so the backbone needs to be updated on the target dataset. It is worth noting that using the pre-trained backbone for initialization not only alleviates overfitting but also shortens the training time.

In this ablation experiment, PHOENIX-2014-T obtains the best result, mainly for the following reasons. Compared with the other datasets, PHOENIX-2014-T is not only of better quality but also contains more samples. Although RWTH-Boston-104 is not large in scale, it still achieves a good WER; the main reason is that RWTH-Boston-104 was captured from multiple views, which makes it easier to obtain representative frames.

4.4.2 Effect of the temporal feature extractor

Table 4 shows that frozen weights benefit the temporal feature extractor; in these experiments, the ResNet18 weights are kept updated. Experiments on the four pre-training datasets confirm that freezing the transformer encoder improves the recognition results more significantly. Compared with GSL and RWTH-Boston-104, PHOENIX-2014 has a larger gesture vocabulary and more videos, and its gloss annotations are more suitable for recognition tasks. PHOENIX-2014-T, which mainly serves the translation task, differs from the recognition task in annotation form and dataset division, which prevents its pre-trained temporal feature extractor from performing well on the recognition task.

Note that our FTP is primarily composed of the spatial feature module and the temporal feature module, and the complete FTP method with both modules is pre-trained on the PHOENIX-2014 dataset. The selection of PHOENIX-2014 as the source dataset for transfer learning therefore also reflects the importance of temporal feature processing for the CCSLR task.

5 Conclusion

This paper presents a novel scheme for fine-tuning pre-trained models, named FTP, for the CCSLR task, with novel designs of the spatial feature module and the temporal feature module. Our method aims to alleviate the problem of data insufficiency for Chinese sign language recognition. First, we adopt a fine-tuning strategy that updates the backbone after initializing it from a pre-trained model. Second, we freeze the pre-trained temporal feature extractor module. These two designs ensure that high-quality features can be obtained.