Deep learning-based sign language recognition system using both manual and non-manual components fusion



Introduction
Sign language [1] is a visual, silent language conveyed through the kinetic movement of hand gestures, facial expressions and body posture. It is an efficient and useful means of communication both for deaf individuals and for individuals who have difficulty speaking in a regular tone of voice. Learning and using sign language demands a considerable amount of time, instruction and practice, which is not feasible for everyone. Furthermore, sign language is strongly rooted in culture [1,2], which also limits its potential for standardization. Even though computer vision and machine learning have advanced greatly in the past decade, it is still difficult to deploy sign language recognition (SLR), which automatically interprets sign language and helps deaf-mute individuals communicate with hearing individuals in their daily lives.
Compared with conventional action recognition, sign language recognition is a more demanding task. First, sign language requires both fine-grained hand motions and whole-body gestures to express its meaning clearly and precisely. Moreover, facial expressions are used to convey emotion, and similar signs can even carry different meanings depending on the number of repetitions. Second, different signers may perform the same sign differently, which makes recognition more challenging; gathering data from as many signers as possible is desirable yet expensive. Classical SLR systems typically rely on hand-crafted features, such as SIFT [3] and HOG [4], combined with conventional classifiers such as SVM and KNN [5]. As deep learning advanced, general methods for learning video and temporal-series representations (e.g., LSTMs, RNNs) and efficient video-based action recognition systems (e.g., 3D convolutional neural networks (CNNs)) were first applied to SLR tasks in [6,7]. Attention modules have been combined with other modules to better capture local motion information and improve precision [8]. Additionally, [9] employs semantic segmentation and detection models to explicitly guide the recognition network in a two-phase pipeline. Recently, body-based approaches have become popular in gesture recognition tasks [10,11], reflecting growing interest in their robustness to dynamic conditions and complex backgrounds. Since body-based approaches supply information complementary to the RGB modality, their fused results further improve overall performance. Nonetheless, some shortcomings prevent their direct use in SLR. Body-based action recognition approaches depend on ground-truth skeleton annotations provided by motion capture systems, thereby limiting themselves to publicly available datasets filmed in controlled environments. In addition, most motion capture systems record only the coordinates of the main body joints and do not provide reliable observations of the hands. Such data is therefore insufficient for SLR, since signs are based on dynamic hand gestures and motions involving different body parts. In [12], the authors tried to obtain hand pose and skeletal information using separate models and proposed an RNN-based model for SLR; however, the acquired hand poses were unreliable, and the model could not correctly capture the skeleton dynamics.
Head pose estimation is an effective way to convey additional information. With this in mind, the main contributions of this paper are as follows: 1) Two features are introduced: the anisotropic feature and the unsmooth-variation feature. Inspired by the work of [13], a learning model of anisotropic angle distribution for head pose estimation is proposed. By employing a covariance pooling layer to capture second-order frame features, the model is trained end-to-end as a CNN.
2) The proposed end-to-end model, which combines manual and non-manual features for SLR, yields a substantial improvement in accuracy on two publicly available datasets.
3) The proposed multimodal temporal representation (MTR) unit uses temporal receptive fields of various scales and delivers a considerable improvement in the final recognition performance.

Related work
Over the past few years, SLR has made important advances and achieved high recognition rates through the development of suitable deep learning architectures and the growth of computational power [14][15][16][17][18]. Some challenges remain in SLR, chiefly the simultaneous capture of overall body motion, facial expressions and hand gestures. The authors in [19] proposed a multi-modal, multi-scale system that exploits spatial features at specific spatial ranges. An auto-encoder framework with a connectionist-based recognition component was proposed in [20] to model the sequence. The authors in [21] presented an end-to-end integration of a convolutional module within a hidden Markov model and expressed the approximation results in a Bayesian network. The authors in [8] proposed a CNN coupled with an attention component, which learns spatio-temporal attributes from raw video. In [22], the authors combined temporal convolutions with bidirectional recurrence, demonstrating the value of temporal information in gesture-based methods. In [23], the authors designed a hierarchical attention network (HAN) with a latent space to eliminate the temporal segmentation preprocessing step. Nonetheless, these methods mainly consume raw visual features and could exploit the various hand gestures and body movements more explicitly. The authors in [24] presented a pose-based temporal graph convolution network (GCN) that models spatio-temporal dependencies in human pose trajectories. The authors in [25] adopted a hierarchical-LSTM auto-encoder with visual content and gloss embeddings for translation, tackling multiple granularities by transmitting spatio-temporal transitions between frames. However, these methods do not fully exploit the available motion information. Non-manual gesture recognition mainly concentrates on analyzing characteristic motion patterns and human joint positions. Non-manual data can be used on its own for effective gesture recognition [26,27]. It can also be combined with other cues in multi-cue learning to reach higher recognition rates [28]. Recurrent neural networks are commonly used to model non-manual data, as in [26,27]. More recently, [29] was the first study to design a graph-based framework, named ST-GCN, for modeling the dynamic patterns in non-manual data through a GCN. This method has attracted considerable interest, and several improvements have followed, such as [30]. In particular, the authors in [31] proposed AS-GCN to explore latent joint connections and strengthen recognition performance. The authors in [32] proposed ResGCN, which adapts the bottleneck structure of ResNet [33] to reduce parameters while increasing the model's capacity. Nonetheless, non-manual SLR systems remain under-explored. In [34], the authors tried to apply ST-GCN directly to SLR; however, the results were unsatisfactory, reaching only about 60% recognition on 20 classes of sign language, which is below traditional approaches. Multi-cue methods aim to combine gesture data received from various devices or sources to boost the final performance. They rest on the hypothesis that different cues contain distinct motion information that can complement each other and ultimately yield precise and comprehensive action representations. To obtain robust representations for downstream tasks, a view-invariant representation learning framework was proposed in [35]. The authors in [36] deployed a shared-weights network in a multi-cue setting to obtain cue-specific views for image classification. In [37], the authors proposed DA-Net, which combines view-independent and view-specific modules for feature acquisition and successfully fuses the prediction scores. In [22], the authors proposed a feature factorization framework that investigates view-shared and view-specific information for RGB-D gesture recognition. A cascaded residual auto-encoder was designed in [38] to handle incomplete-view classification settings. Inspired by the success of these multi-cue approaches, we intend to explore visual, gestural and hand cues together, acquiring features from all appearances and combining them within a common framework to reach a better performance.

Proposed approach
The global structure of the end-to-end continuous SLR system proposed in this work is shown in Figure 1. The model comprises two-stream convolutional networks. The first network detects the head pose, while the second network includes the following three components: a spatial feature extraction component, a temporal feature extraction component and a multi-stage connectionist temporal classification (CTC) loss training component. In our model, there are five stages of gloss features. In the first step, maximum a posteriori (MAP) estimation is used to design the head pose estimation network, which comprises a convolutional layer, a covariance pooling layer and an output layer. In the second step, we apply a ResNet and two fully connected layers to the input sign language video to obtain the first-stage gloss features (spatial features). The temporal features are then extracted by the proposed MTR unit. Specifically, the spatial features pass through the first MTR unit to obtain the second-stage gloss features; these are in turn fed to the second MTR unit to obtain the third-stage gloss features, which are passed to the transformer temporal encoding to produce the fourth-stage gloss features. Lastly, the five gloss features are combined and trained for model optimization using the multi-stage CTC loss, and the final SLR results are obtained from the fifth-stage gloss features.

Head pose estimation
Head pose estimation means that the computer resolves the attitude parameters and the position of the head in 3D space by analyzing either a video sequence or input images. Usually, the head is treated as a rigid body part. Head pose estimation measures two Euler angles: yaw and pitch. Given a head pose angle Y and an input face image X, the task of the head pose estimation network is to recover the correct label Y from image X.
Two vectors taken from the last fully connected layers are used to compute the cosine similarity. Given two frames, X1 and X2, the neural network is considered as a function f(·) that generates a feature vector, and the similarity is given by

sim(X1, X2) = f(X1) · f(X2) / (||f(X1)|| ||f(X2)||).

The feature similarities are computed between X1, which represents the central position, and its neighboring positions X2, X3, X4 and X5, respectively. All similarity matrices are plotted and can be fitted with a 2D Gaussian distribution (Figures 2(b) and 4(c)) [13]. Next, the map scale can be obtained by computing all similarity matrices. Figure 3 shows that all head poses are arranged in a matrix with nine columns and 13 rows. Given a head pose frame X, its axial pose angle is denoted y_mn = (m, n), where m and n are the row and column numbers of the pose image, respectively. The angle distribution ŷ is defined as ŷ = g(y_mn). Equation (3.3) shows that the Gaussian distribution is isotropic when the diagonal elements are equal; when they are unequal, the distribution becomes anisotropic. A factor η is introduced to obtain the 2D anisotropic Gaussian distribution, which captures the anisotropic characteristic of the head pose estimation task. Based on the quantitative computation shown in Figure 3, the values of η lie in the range (0.6, 1). In Figure 4, the unsmooth-variation property (i.e., as the angle range increases, the image variation first grows and then declines along the yaw direction) is reflected in the different standard deviation values σ of the matrix M. Figure 4(a) depicts the angle distribution when yaw = 0° and pitch = 0°; Figure 4(b) depicts the angle distribution when yaw = -45° and pitch = 0°. Note that the value of σ3 is less than σ1 and greater than σ2.
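As an illustration, the cosine similarity and the anisotropic angle distribution described above can be sketched in NumPy. This is a minimal sketch, not the authors' implementation: the 13 × 9 pose grid and the role of η come from the text, while the function names, the σ value and the placement of η on the column (yaw) axis are assumptions.

```python
import numpy as np

def cosine_similarity(f1, f2):
    # Cosine similarity between two feature vectors from the last FC layer.
    return float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2)))

def anisotropic_gaussian_label(m, n, rows=13, cols=9, sigma=1.0, eta=0.8):
    # 2D Gaussian label distribution centred on the ground-truth pose cell (m, n).
    # eta scales the column-axis std relative to the row axis, making the
    # distribution anisotropic (eta = 1 recovers the isotropic case).
    r = np.arange(rows)[:, None]
    c = np.arange(cols)[None, :]
    g = np.exp(-(((r - m) ** 2) / (2 * sigma ** 2)
                 + ((c - n) ** 2) / (2 * (eta * sigma) ** 2)))
    return g / g.sum()  # normalise to a probability distribution

label = anisotropic_gaussian_label(6, 4)  # distribution peaked at cell (6, 4)
```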

Spatial feature representation
The spatial feature extraction module incorporates a backbone feature extraction network and two fully connected layers.
An input video sequence VS = (vs_1, vs_2, ..., vs_T) ∈ R^(T×c×h×w) consists of T frames, where vs_t represents the t-th frame in the sequence, c is the number of channels (c = 3) and h × w is the spatial size of vs_t. VS is fed into the ResNet network R_n to obtain the feature map fc_1 = R_n(VS) ∈ R^(T×c_1); next, two fully connected layers are used to obtain fc_2 = R_fc(fc_1) ∈ R^(T×c_2), which is the final spatial feature vector of fixed size. In this work, we refer to this vector as the first-stage gloss feature. The dimensions c_1 and c_2 are 512 and 1,024, respectively. The main role of the two fully connected layers after the backbone is to aggregate the frame feature maps that have passed through numerous convolutional and pooling layers into high-level frame features. We apply stochastic gradient stopping [39] between the ResNet and the fully connected layers to decrease memory usage and speed up model training.
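The shape flow of the spatial head (T × 512 backbone features into T × 1,024 gloss features) can be made concrete with a plain NumPy sketch. The dimensions follow the text; the ReLU activations, the random stand-in for the learned ResNet features and all names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def fc(x, w, b):
    # One fully connected layer with a ReLU activation (assumed).
    return np.maximum(x @ w + b, 0.0)

T, c1, c2 = 16, 512, 1024                     # frames, backbone width, gloss width
backbone_out = rng.standard_normal((T, c1))   # stand-in for ResNet features fc_1
w1, b1 = rng.standard_normal((c1, c2)) * 0.01, np.zeros(c2)
w2, b2 = rng.standard_normal((c2, c2)) * 0.01, np.zeros(c2)

# In training, backbone_out would be stochastically detached here (gradient
# stopping) so gradients from the FC head do not always flow into the ResNet.
fc2 = fc(fc(backbone_out, w1, b1), w2, b2)    # first-stage gloss feature, T x c2
```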

Temporal feature representation
The temporal feature extraction module proposed in this work contains an MTR unit and transformers. After passing through the MTR units, the gloss features of the second and third stages are obtained. Finally, the fourth-stage gloss features are obtained after applying the transformers.
1) MTR unit: In recent years, excellent continuous sign language recognition systems have been developed, though most of them extract local features with temporal receptive fields of a fixed scale. In sign language data, the lengths of the video sequences representing different glosses vary. Additionally, differences in signer proficiency and other interference during filming produce inconsistent lengths even for the same word. Fixed-scale temporal modeling is therefore imprecise, which degrades the temporal representation. Figure 5 shows the MTR unit proposed in this work, which employs temporal receptive fields of various scales to improve the temporal representation. The MTR unit consists of a multi-scale feature extraction element and a feature merging element. Several one-dimensional CNNs with different convolution kernel sizes are connected in parallel to form the multi-scale feature extraction element, where S denotes the kernel size and T the temporal length. The first-stage gloss feature is first transposed from fc_2 ∈ R^(T×c_2) to fc_2^T ∈ R^(c_2×T) and then passed through the multi-scale feature extraction module. The multi-scale 1D-CNNs have the same number of channels but different kernel sizes, so neither the number of features nor the temporal size is changed by the feature extraction. The kernel size of the first convolution layer is 2 × 2, the maximum kernel size is S, and the stride is two. Let the output of the multi-scale network be denoted by the merged feature, with n the number of parallel 1D-CNNs. Afterwards, feature merging and sub-sampling are applied twice using a 2D-CNN to obtain the second-stage gloss feature fc_3 ∈ R^(c_2×T_1), where T_1 = T/2. We repeat this operation to obtain the third-stage gloss feature fc_4 ∈ R^(c_2×T_2), where T_2 = T_1/2.
Figure 5. The input and output of each entity inside the multimodal temporal representation unit.
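A simplified sketch of the MTR computation follows. Averaging filters stand in for the learned 1D kernels and mean-pooling for the learned 2D merging convolution; only the shape behavior (channels preserved by the multi-scale branches, T halved per unit) is taken from the text.

```python
import numpy as np

def temporal_conv_same(x, k):
    # 1D convolution along time with 'same' padding on an array of shape (C, T).
    # An averaging kernel stands in for a learned kernel of size k.
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, k - 1 - pad)), mode="edge")
    kernel = np.ones(k) / k
    return np.stack([np.convolve(row, kernel, mode="valid") for row in xp])

def mtr_unit(x, kernel_sizes=(3, 5, 7)):
    # Multi-scale extraction: parallel 1D convs with different kernel sizes,
    # merged by averaging, then sub-sampled by 2 along time (T -> T/2).
    branches = [temporal_conv_same(x, k) for k in kernel_sizes]
    merged = np.mean(branches, axis=0)
    return merged[:, ::2]

x = np.random.default_rng(1).standard_normal((1024, 32))  # (c2, T)
f3 = mtr_unit(x)    # second-stage gloss feature, (c2, T/2)
f4 = mtr_unit(f3)   # third-stage gloss feature,  (c2, T/4)
```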
2) Transformer encoding: The transformer is a classic natural language processing model introduced by Google in 2017. Instead of the sequential structure of RNNs, it employs the self-attention mechanism, permitting parallel training of the model and the capture of global information. After the MTR unit produces the temporal feature vector, the temporal sequence is encoded by the transformer encoding component, which yields more precise temporal features. In our model, two identical transformer encoding components are employed in the second CNN stream. Each transformer encoding component is composed of a multi-head self-attention element and a fully connected feed-forward element. The third-stage gloss feature fc_4 ∈ R^(c_2×T_2), together with the corresponding positional information, is fed to the multi-head self-attention element. The same process is then iterated to obtain the final temporal feature fc_5 ∈ R^(c_2×T_2) from fc_4 via the fully connected feed-forward element; this gives the fourth-stage gloss feature. In addition to allowing the model to attend to different positions, multi-head self-attention improves the capacity of the attention structure to capture relations among tokens within the sequence. In contrast to single-head self-attention, every head in multi-head self-attention maintains its own matrices (i.e., M_1, M_2, M_3) to perform distinct linear transformations, so every head carries its own specific information. Furthermore, the fully connected feed-forward element transforms the representation non-linearly, making the features more expressive.
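A minimal NumPy sketch of multi-head self-attention as used in the transformer encoder is shown below. The head count and the random projection matrices (standing in for the learned per-head matrices M_1, M_2, M_3) are illustrative; only the mechanism itself follows the text.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, heads=4):
    # x: (T, d). Each head applies its own query/key/value projections
    # (random here, standing in for learned matrices), then scaled
    # dot-product attention; head outputs are concatenated.
    T, d = x.shape
    dh = d // heads
    rng = np.random.default_rng(0)
    outs = []
    for _ in range(heads):
        wq, wk, wv = (rng.standard_normal((d, dh)) / np.sqrt(d) for _ in range(3))
        q, k, v = x @ wq, x @ wk, x @ wv
        att = softmax(q @ k.T / np.sqrt(dh))  # (T, T) attention weights
        outs.append(att @ v)
    return np.concatenate(outs, axis=-1)      # (T, d)

x = np.random.default_rng(1).standard_normal((8, 64))
y = multi_head_self_attention(x)
```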
3) Multi-stage CTC loss: Continuous SLR is a weakly supervised learning problem. The input is an unsegmented video sequence, and there is no strict correspondence between labeled sequences and video frames. After the input sequence has been encoded, it is natural to employ CTC as the decoder. CTC was originally conceived for speech recognition, principally to carry out end-to-end temporal classification of unsegmented signals and thereby resolve the mismatch between input and output sequence lengths. In recent years, CTC has been frequently employed in continuous SLR. It introduces a blank label {−} to mark frames that are not classified during decoding (i.e., frames of the input video sequence that do not correspond to any word in the sign language vocabulary). Thereby, the input and output sequences can be aligned, and a dynamic programming algorithm can be employed for decoding [40]. Given an input video sequence VS of T frames, a frame-level labeling is denoted π = (π_1, π_2, ..., π_T), where π_t ∈ ν ∪ {−} and ν is the sign language vocabulary. The posterior probability of such a labeling is

p(π | VS) = ∏_{t=1}^{T} p(π_t | VS).

Let s = (s_1, s_2, ..., s_L) be a sentence-level label, where L is the number of words in the sequence.
CTC defines a many-to-one mapping B that removes the duplicate and blank labels in a path. Therefore, the conditional probability of a label s is defined as the sum of the probabilities of all corresponding paths:

p(s | VS) = Σ_{π ∈ B^{-1}(s)} p(π | VS),

where B^{-1}(s) = {π | B(π) = s} is the inverse mapping. The CTC loss is defined as the negative log-likelihood of this conditional probability, L_CTC = −ln p(s | VS).
Therefore, the multi-stage CTC loss can be written as

L = Σ_{n=1}^{5} L_CTCn,

where n indexes the CTC losses. The softmax function is applied for normalization right after a gloss feature is obtained, and the normalized outcome is decoded by CTC; L_CTC1 through L_CTC5 correspond to the first- through fifth-stage gloss features, respectively. Finally, these five CTC losses are summed to obtain the final loss for training.
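The collapse mapping B that CTC uses to turn frame-level paths into gloss sequences can be sketched directly. The blank symbol follows the text; the sample paths are a generic illustration.

```python
def ctc_collapse(path, blank="-"):
    # The CTC mapping B: merge consecutive repeated labels, then drop blanks.
    out = []
    prev = None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# "-AA-B-BB" and "AAB-B" both collapse to the gloss sequence [A, B, B]:
assert ctc_collapse(list("-AA-B-BB")) == ["A", "B", "B"]
assert ctc_collapse(list("AAB-B")) == ["A", "B", "B"]
```

The blank between the two B labels is what lets CTC represent repeated glosses, which is why B first merges repeats and only then removes blanks.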

Experimental result and analysis
In this work, the proposed CNN model (MNM-SLR) and another CNN variant (VGG16) [27] were evaluated for SLR on two large-scale sign language benchmarks: SIGNUM and RWTH-PHOENIX-Weather 2014. In this section, the experimental results for these two models are discussed; a comparison with different state-of-the-art methods is presented in Section 4.2. Several metrics, such as processing time, loss, accuracy and recognition prediction results, are employed to evaluate the performance of the two models.

Accuracy
To measure classifier efficiency, classification accuracy is the most commonly used metric. It is defined as the proportion of correctly predicted instances to the overall number of instances in the dataset:

Accuracy = (TP + TN) / (TP + FP + TN + FN),

where TP, FP, TN, and FN are the true positives, false positives, true negatives, and false negatives, respectively. The classification accuracy for SIGNUM using MNM-SLR and VGG-16 is presented in Table 1. The final accuracy reached by the MNM-SLR model for continuous signs is 88.96% and 94.37% for SIGNUM and RWTH-PHOENIX, respectively. The classification accuracy reached by the VGG-16 model is 88.17% and 92.45% for SIGNUM and RWTH-PHOENIX, respectively. These results show that MNM-SLR outperforms VGG-16. Moreover, the performance of the two models was also tested on the expanded dataset to assess the generalization of the trained models. Data expansion is the operation of producing further data by transforming the initial dataset. In this work, two supplementary instances per sample were produced by scaling and rotation: a random inward and outward scaling in [0.7, 1.4] and a random rotation in the interval [-15°, +15°] were employed.

Loss

In this work, the cross-entropy loss function is adopted to compute the loss for the multi-gesture classification of sign language:

L = − Σ_{i=1}^{n} Ô_i log(O_i),

where O_i represents the i-th value in the model output, Ô_i the corresponding target value, and n the number of scalar values in the model output. The loss values for the two models are shown in Table 1. The computed loss for both CNN models decreases steadily as the iterations increase and then settles at a stable value. For the SIGNUM dataset, the average loss for MNM-SLR and VGG-16 decreases to 0.72 and 0.87, respectively. For the RWTH-PHOENIX-Weather 2014 dataset, the loss for MNM-SLR and VGG-16 decreases to 0.53 and 0.64, respectively.
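The two metrics above can be sketched as small functions; the confusion-count example values below are invented for illustration.

```python
import numpy as np

def accuracy(tp, fp, tn, fn):
    # Proportion of correctly predicted instances over all instances.
    return (tp + tn) / (tp + fp + tn + fn)

def cross_entropy(targets, probs):
    # Multi-class cross-entropy between one-hot targets and predicted
    # probabilities (targets = Ô, probs = O in the text's notation).
    return -float(np.sum(targets * np.log(probs)))

# 93 of 100 hypothetical predictions correct:
assert accuracy(45, 3, 48, 4) == 0.93

ce = cross_entropy(np.array([0.0, 1.0, 0.0]), np.array([0.1, 0.8, 0.1]))
```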

Confusion matrix
To better assess the suggested framework, an additional performance metric, the confusion matrix, is also computed in this work. This matrix summarizes the correctly and wrongly predicted words of every class; therefore, the recognition accuracy of every class can be extracted from it. Figures 6 and 7 show the confusion matrices of the results obtained with our MNM-SLR system applied to 26 classes of the RWTH-PHOENIX-Weather 2014 dataset. A qualitative analysis of the manual and non-manual confusion matrices (Figures 6 and 7) demonstrates that employing non-manual features makes it possible to accurately determine more classes that were classified incorrectly when employing manual features alone. We note that non-manual features help differentiate various signs from each other.
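A confusion matrix and the per-class accuracies read off its diagonal can be computed as follows; the class labels are invented for illustration.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    # cm[i, j] counts samples of true class i predicted as class j;
    # per-class accuracy is the diagonal divided by the row sums.
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

cm = confusion_matrix([0, 0, 1, 2, 2], [0, 1, 1, 2, 2], 3)
per_class_acc = cm.diagonal() / cm.sum(axis=1)
```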

Other parameters
Computational time is a crucial criterion for sign language recognition in real-time applications. The total number of parameters for each convolutional layer (P_conv) can be expressed as

P_conv = (w × h × pf + 1) × f,

where w and h represent the width and height of the filter, respectively, pf is the number of filters in the previous layer and f is the number of filters. The total number of parameters for each fully connected layer (P_fc) can be expressed as

P_fc = cl × pl + cl,

where cl is the number of units in the current layer and pl the number of units in the previous layer.
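The two parameter-count formulas can be checked numerically; the example layer sizes are illustrative.

```python
def conv_params(w, h, pf, f):
    # (w*h*pf + 1) weights per filter (the +1 is the bias), times f filters.
    return (w * h * pf + 1) * f

def fc_params(cl, pl):
    # Each of the cl units connects to all pl previous units, plus one bias.
    return cl * pl + cl

# e.g., a 3x3 convolution with 64 input and 128 output channels:
assert conv_params(3, 3, 64, 128) == 73856
```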
It is clear from this comparison that the proposed MNM-SLR model requires less computational time and fewer parameters than the other CNN architectures.

Cross-validation

K-fold cross-validation is used to verify the stability of model performance. Therefore, to assess performance over the whole data range, 10-fold cross-validation was employed for MNM-SLR. The assessment results for the 10 folds are shown in Tables 3 and 4 for MNM-SLR and VGG-16, respectively.

Comparison with state-of-the-art methods

DenseTCN is a dense temporal convolution network introduced by [41] that captures actions in hierarchical views. To learn short-term correlations, a temporal convolution (TC) is applied between neighboring features and extended to a dense hierarchical configuration. In [42], the authors proposed a CTM framework that combines a temporal convolution pyramid module with a connectionist decoding scheme to model short-term and long-term sequence learning. The authors in [43] proposed a cross-modal learning model that leverages text information to improve vision-based continuous SLR: two encoding networks first produce text and video embeddings, which are then aligned and mapped within a joint latent representation. The authors in [44] proposed ST-GCNs, an innovative deep-learning method based on spatio-temporal GCNs that operates on several appropriately fused feature streams capturing the signer's pose, motion information, appearance and shape. The work in [45] is sub-divided into three components: the first module is a feature extractor in a multi-view spatio-temporal network (MSTN) that accurately extracts the spatio-temporal features of the RGB data and skeleton; the second is a transformer-based SL encoder network that can resolve long-term dependencies; the last is a CTC decoder network. Table 5 shows that our proposed method achieves encouraging performance, reducing the WER to 30.7% on the RWTH-PHOENIX-Weather 2014 dataset. These results indicate that modeling the dynamic spatial correlation and the long-term temporal correlation of SL sequences improves the learning of its visual features.

Ablation study
To evaluate the contributions of the designed model, we performed an ablation study on the RWTH-PHOENIX-Weather 2014 dataset. An ablation study analyzes the different components that influence the performance of the system. As shown in Table 6, the impact of each proposed training configuration can be evaluated. To this end, the proposed model was trained either (i) with non-manual features or (ii) without non-manual features. As Table 6 shows, the proposed model achieves its best performance when employing all modalities together, yielding a 30.7% WER.

Conclusions
In this work, we propose an innovative training approach to produce a dedicated feature extraction module, thoroughly employed to better recognize sign language glosses in video sequences while continuing to benefit from iteratively refined alignment proposals. We develop a multi-modal method that integrates the head pose and motion gestures from sign language video sequences, supplying superior spatio-temporal representations of the gestures. The substantial contribution of the proposed work is its capacity to recognize complex signs: employing non-manual features makes it possible to accurately determine more classes that were classified incorrectly when employing manual features alone. Experiments confirm that our MNM-SLR framework achieves state-of-the-art performance on continuous sign language recognition, with an accuracy of 90.12% on the SIGNUM dataset and 94.87% on the RWTH-PHOENIX-Weather 2014 dataset.

Use of AI tools declaration
The authors declare they have not used Artificial Intelligence (AI) tools in the creation of this article.

Figure 1 .
Figure 1. An overview of the proposed framework.

Figure 3 .
Figure 3. Head pose generation from Gaussian distribution.

Figure 6 .
Figure 6. Confusion matrix with manual features only.

Figure 7 .
Figure 7. Confusion matrix with both manual and non-manual features.

Table 1 .
The results of the classification for the expanded dataset are presented in Table 2; these results are sufficiently persuasive to demonstrate the generalization capability of the trained models. Accuracy and loss performance.

Table 2 .
Results of classification for the expanded dataset.

Table 5 .
Analysis of performance refinement on RWTH-PHOENIX-Weather 2014.

Table 6 .
Ablation study on RWTH-PHOENIX-Weather 2014.W/O NMF means without non-manual features and W NMF means with non-manual features.