Chinese fingerspelling sign language recognition using a nine-layer convolutional neural network

INTRODUCTION: Sign language is a form of communication and exchange of ideas used by people who are hearing-impaired or unable to speak. Chinese fingerspelling is an important component of Chinese sign language; it is suitable for denoting terminology and serves as the basis of gesture sign language learning. OBJECTIVES: We propose a nine-layer convolutional neural network (CNN) for the classification of Chinese sign language. METHODS: With self-learning and self-organization abilities, CNN is well suited to processing data with a grid-like structure. CNN has good application prospects in image classification and therefore plays an important role in the classification of Chinese sign language. RESULTS: Experiments on 1320 data samples across 30 categories show that the classification accuracy of the nine-layer convolutional neural network reaches 89.69 ± 2.10%, demonstrating that this method can effectively classify Chinese gestures. CONCLUSION: We proposed a nine-layer convolutional neural network (CNN) that can classify Chinese sign language.


Introduction
In daily life, communication is essential. However, there are thousands of people in the world suffering from hearing impairment [1]. This special group of people is called deaf-mutes, and they cannot communicate through spoken language as hearing people can. As a way of expressing information, sign language (SL) has become the main way for deaf people to communicate with the outside world. With the development of artificial intelligence technology, novel, natural and convenient human-computer interaction has become a new trend in various industries. But sign language is not international, and it is about more than just hand movements. Signers take advantage of manual features (hand gestures, movements, position and direction), facial features colloquially called "non-manual" (eye gaze, mouth gestures and facial expressions), and upper body posture (nodding/shaking the head and shoulder direction) [2]. Chinese Sign Language (CSL) can generally be divided into gesture sign language and fingerspelling language [3].
Gesture sign language emphasizes form and is relatively complex. Fingerspelling language, by contrast, is built on 30 basic finger signs (the 26 basic pinyin letters and 4 tongue-rolling sounds), which can be combined to express pinyin or some special meanings.
Sign language recognition (SLR) uses computer technology to translate sign language into text, natural language, audio and other forms of information for easy understanding and communication [4]. SLR is divided into two parts: hand feature extraction and hand feature model training. Since 1980, scientists and researchers around the world have been working on sign language recognition technology. In 1983, G. J. Grimes in the United States invented the data-input glove [5], which could help users analyze and recognize 72 English letters; Grimes is therefore regarded as the first person to recognize sign language. With the rapid development of computer technology and artificial intelligence in recent years, more and more high-tech methods have been applied to the study of sign language recognition. Today, sign language recognition research falls into two kinds: recognition technology using data gloves as the sign language input, and sign language recognition based on machine learning. There are many methods for sign language recognition. To solve the problems of serious background interference, incomplete feature extraction and low recognition accuracy common in traditional sign language recognition, Si Yang et al. proposed a sign language recognition algorithm based on RGB-D images.
Compared with traditional methods, its performance is greatly improved [6]. Peng Ping et al. proposed a sign language recognition method based on SURF-BOW features: images are collected through a camera, gestures are first located and tracked, SURF features are extracted, a SURF-BOW representation is constructed as the sign language feature, and an SVM is used for recognition [7]. Shen Juan et al. proposed a hidden Markov model (HMM) sign language recognition method based on Kinect 3D skeleton nodes [8]. In the field of sign language recognition, many traditional machine learning methods still struggle with accuracy, extensibility and robustness.
In recent years, the booming development of deep learning has brought new opportunities for more accurate and real-time sign language recognition [9]. Deep learning is based on the idea that, using hierarchical concepts, computers can learn complex concepts by building them out of simpler ones.
Convolutional neural network (CNN) is an important form of deep learning, well suited to processing data with a grid-like structure, such as time series and image data. In addition, thanks to its self-learning and self-organization capabilities, CNN has good application prospects in many fields, especially image classification and computer-aided clinical diagnosis [10].
In our paper, we propose a nine-layer convolutional neural network (CNN) for the classification of Chinese sign language, with every layer fully optimized. The rest of this paper is organized as follows: the second part describes the data set; the third part describes the convolutional layer and the other methods used in this paper; the fourth part provides the implementation results; the fifth part compares and discusses our method against the latest methods; and the sixth part gives the conclusion of this paper.

Data Collection
Establishing a data set with sufficient data is a necessary condition for a complete experiment. We therefore used a camera to take 1,320 Chinese fingerspelling pictures from 44 different samples (each image is 1080 × 1080); each sample covers the 26 basic letters and 4 tongue-rolling pronunciation words, for a total of 30 categories, including the most commonly used pronunciation words. Using Adobe Photoshop CS, we manually segmented each fingerspelling picture, extracted the hand region, and resized the image to 256 × 256. At the same time, we normalized the background color to RGB (0, 0, 0), making the data more uniform and accurate and reducing the contingency and uncertainty in the experimental data. Finally, we saved these images in TIF compression format, which reduces the chance of image information loss (see Figure 1). Two groups were randomly selected from the 44 samples, and different pronunciation words were randomly selected from the two groups of samples; in this way, the accuracy and completeness of sign language recognition can be better tested.
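The segmentation and background normalization were done by hand in Photoshop; the resizing and lossless TIF export, however, are easy to script. Below is a minimal Python/Pillow sketch of that part of the pipeline (the folder names and the .jpg capture format are hypothetical):

```python
from pathlib import Path
from PIL import Image

RAW_DIR = Path("raw_gestures")   # hypothetical folder of 1080x1080 captures
OUT_DIR = Path("dataset_256")    # hypothetical output folder
OUT_DIR.mkdir(exist_ok=True)

for img_path in RAW_DIR.glob("*.jpg"):
    img = Image.open(img_path).convert("RGB")
    img = img.resize((256, 256), Image.LANCZOS)  # 1080x1080 -> 256x256
    # Save losslessly as TIF to reduce the chance of information loss
    img.save(OUT_DIR / (img_path.stem + ".tif"), format="TIFF")
```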

Convolutional layer
The standard convolutional neural network covers all important phases of traditional computer vision methods: feature extraction and reduction, and classification.Its architecture can be thought of as a fusion of a trainable multi-level depth feature extractor and classifier.Among them, the function of the convolutional layer is to perform feature extraction [11], the pooling layer implements dimensionality reduction [12], and the fully connected layer completes classification.
The convolution layer uses multiple filters to perform discrete convolution operations on the input image to extract and combine local features to form a feature map.
The convolutional layer is the core of the convolutional neural network, and most of the computation is performed there. Each convolutional layer holds a set of filters, and the activation maps produced by these filters are stacked along the depth dimension to form the layer's output [13]. The convolutional layer has two main advantages: local connectivity and weight sharing. A convolution operation is shown in Figure 5. Because all neurons of a feature map share parameters, the number of weights is greatly reduced, learning efficiency is improved, and network training is greatly helped. Given a two-dimensional image $I(x, y)$, called the "input", and a kernel $K(m, n)$, the convolution can be expressed as

$$S(x, y) = (I * K)(x, y) = \sum_{m}\sum_{n} I(m, n)\, K(x - m,\, y - n),$$

where $(m, n)$ ranges over the size of the kernel. As convolution is commutative, this is equivalent to

$$S(x, y) = (K * I)(x, y) = \sum_{m}\sum_{n} I(x - m,\, y - n)\, K(m, n),$$

which is the output of this layer.
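For illustration, here is a minimal NumPy sketch of the "valid" discrete convolution defined above; it is a naive double loop, not the optimized routine a deep learning framework would use:

```python
import numpy as np

def conv2d(I, K):
    """'Valid' 2-D discrete convolution of image I with kernel K."""
    m, n = K.shape
    Kf = K[::-1, ::-1]                   # flip the kernel: true convolution,
    H, W = I.shape                       # not cross-correlation
    S = np.zeros((H - m + 1, W - n + 1))
    for x in range(S.shape[0]):
        for y in range(S.shape[1]):
            S[x, y] = np.sum(I[x:x + m, y:y + n] * Kf)
    return S

I = np.arange(16.0).reshape(4, 4)        # toy 4x4 "input"
K = np.array([[1.0, 0.0],
              [0.0, -1.0]])              # toy 2x2 kernel
print(conv2d(I, K))                      # 3x3 feature map
```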
Usually a pooling layer is periodically inserted between convolutional layers [14]. Its function is to gradually reduce the spatial size of the data volume, which reduces the number of parameters in the network, decreases the consumption of computing resources, and also effectively controls overfitting.
Using a pooling function, the output of the network at a given position is replaced with an overall statistic of the adjacent outputs. Max pooling and average pooling are the two most commonly used pooling methods: the former takes the maximum value of an adjacent rectangular region, and the latter calculates the average value of an adjacent rectangular region. Figure 6 shows an example of max pooling and average pooling.
Given a pooling region $R$ and the set of activations $U = \{u_1, u_2, \dots, u_{|U|}\}$ contained in $R$, the max pooling $P_M$ can be expressed as

$$P_M = \max_{1 \le i \le |U|} u_i.$$

Another pooling strategy, the average pooling $P_A$, is defined as

$$P_A = \frac{1}{|U|} \sum_{i=1}^{|U|} u_i,$$

where $|U|$ is the number of elements in the set $U$.
Pooling helps make the input representation approximately invariant to small translations, and it also helps reduce the computational burden. But max pooling and average pooling each have shortcomings (discussed in the comparison of pooling methods below), while stochastic pooling, which draws the output of each pooling region from the multinomial distribution formed by its activations, provides excellent performance. The pooling layer can therefore continuously reduce the size of the data space, reducing the number of parameters and the amount of calculation while also controlling overfitting.
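The three pooling rules can be compared on a single region. A small NumPy sketch (the stochastic rule follows the multinomial sampling described above; the values and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def pool(region, mode="max"):
    """Pool one region U: max, average, or stochastic pooling."""
    u = region.ravel()
    if mode == "max":
        return u.max()                   # P_M = max_i u_i
    if mode == "avg":
        return u.mean()                  # P_A = (1/|U|) * sum_i u_i
    # Stochastic pooling: sample an activation with probability
    # proportional to its (non-negative) value.
    p = u / u.sum() if u.sum() > 0 else np.full(u.size, 1.0 / u.size)
    return rng.choice(u, p=p)

U = np.array([[1.0, 2.0],
              [3.0, 2.0]])               # a 2x2 pooling region
print(pool(U, "max"), pool(U, "avg"), pool(U, "stochastic"))
# e.g. 3.0, 2.0, and a value sampled from {1, 2, 3, 2} with p = u / sum(u)
```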

Fully connected layer
The fully connected layer plays the role of "classifier" in the whole convolutional neural network. If the convolutional and pooling layers map the original data into a hidden-layer feature space, the fully connected layer maps the learned "distributed feature representation" into the sample label space. In practice, each node of the fully connected layer is connected to all nodes of the previous layer and is used to combine the features extracted before it; in implementation, the fully connected layer can be realized by convolution kernels connecting the front layer and the back layer. From Figure 8, we can understand the function of the fully connected layer: given input $x$, weights $W$ and bias $b$, the forward computation can be written in matrix form as

$$y = Wx + b.$$
Then the partial derivative of the loss with respect to $x$ is obtained by the chain rule:

$$\frac{\partial L}{\partial x} = W^{T}\,\frac{\partial L}{\partial y}.$$

If we train 16 images at a time, that is, batch_size = 16, then we can express this calculation in matrix form over the batch (as shown in Figure 11).

Differentiate the weight coefficient W
From the forward calculation $y = Wx + b$ above, differentiating the loss with respect to the weight coefficient gives

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\, x^{T},$$

and for a batch the contributions of the individual samples are summed.

Find the derivative of the bias coefficient b
From the previous derivation, we know that

$$\frac{\partial L}{\partial b} = \frac{\partial L}{\partial y},$$

that is, the partial derivative of the loss with respect to the bias coefficient equals the partial derivative of the loss with respect to this layer's output.
When batch_size = 16, we simply sum the partial derivatives of the same $b$ over the different samples of the batch, which can be written in matrix form as a multiplication by an all-ones vector (as shown in Figure 13).
Each node of the fully connected layer is connected to all nodes of the previous layer and integrates the features extracted before it. Because of this fully connected nature, the fully connected layer generally holds the most parameters. The core operation of full connection is the matrix-vector product $y = Wx$: in essence, a linear transformation from one feature space to another. Every dimension of the target space (a unit in the hidden layer) is assumed to be affected by every dimension of the source space.
Loosely speaking, the target vector is a weighted sum of the source vector.
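A minimal NumPy sketch of the forward pass and the three derivatives above, using our first fully connected layer (800 inputs, 500 outputs) with batch_size = 16; random tensors stand in for the real activations and upstream gradients:

```python
import numpy as np

rng = np.random.default_rng(0)

# First fully connected layer: 800 inputs (50*4*4), 500 outputs, batch_size = 16
batch_size, n_in, n_out = 16, 800, 500
X = rng.standard_normal((n_in, batch_size))   # inputs, one column per sample
W = rng.standard_normal((n_out, n_in)) * 0.01
b = np.zeros((n_out, 1))

# Forward pass: y = W x + b
Y = W @ X + b

# Suppose dL/dY arrives from the next layer during backpropagation
dY = rng.standard_normal(Y.shape)

dX = W.T @ dY                         # dL/dx = W^T dL/dy (chain rule)
dW = dY @ X.T                         # dL/dW = dL/dy x^T, summed over the batch
db = dY @ np.ones((batch_size, 1))    # dL/db: sum dL/dy over the batch
```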
In CNN, fully connected layers often appear in the last few layers, where they weight and sum the previously extracted features. For example, on the MNIST data set, the earlier convolution and pooling stages are equivalent to feature engineering, and the final full connection is equivalent to feature weighting. Convolution itself can be viewed as a deliberately weakened form of full connection: inspired by local receptive fields, influences outside the local region are zeroed out, and different locations are forced to share the same parameters. This weakening reduces the number of parameters and saves computation.

The activation function we use is ReLU, defined as $f(x) = \max(0, x)$. In traditional neural networks, the sigmoid family (logistic sigmoid, tanh) is regarded as the core of the network; sigmoid resembles the neural response of humans and plays a large role in many shallow models. However, in recent years of convolutional network research, the ReLU function is usually chosen instead of sigmoid or tanh. First, sigmoid and tanh require computing exponentials in the activation function, which is relatively expensive.
ReLU is not only free of these exponentials, but its sparse activity also reduces the overall computing cost of the neural network [15]. Second, for deep networks, the sigmoid derivative is at most 0.25, so each layer shrinks the gradient at least fourfold even in the best case, and the gradient reaching the lower layers becomes significantly smaller. As backpropagation accumulates these factors, gradient vanishing and gradient explosion occur, resulting in a poorly trained model. ReLU avoids this problem. At the same time, ReLU sets the output of some neurons to 0, which makes the network sparse, reduces the interdependence between parameters, and alleviates overfitting [16]. Another advantage of ReLU is that it accelerates the convergence of stochastic gradient descent.
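A small numerical check of the gradient argument above (the sigmoid derivative never exceeds 0.25, while the ReLU derivative is exactly 1 on the active side):

```python
import numpy as np

x = np.linspace(-4, 4, 9)
sigmoid = 1.0 / (1.0 + np.exp(-x))
d_sigmoid = sigmoid * (1.0 - sigmoid)   # derivative peaks at 0.25 (at x = 0)
relu = np.maximum(0.0, x)               # f(x) = max(0, x)
d_relu = (x > 0).astype(float)          # derivative is exactly 1 for x > 0

print(d_sigmoid.max())  # 0.25: each sigmoid layer shrinks gradients at least 4x
print(d_relu)           # 0 or 1: no shrinkage on the active path, plus sparsity
```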

Batch normalization
Batch normalization (BN) is a very simple but useful technique that speeds up training convergence. In shallow models, we "whiten" the input data so that its mean is 0 and variance is 1; the influence of changes in the input distribution on the model is then small during training and testing [17]. For deep neural networks, however, even when the input data has been whitened, parameter updates during training can still cause drastic changes in the outputs near the output layer, and this numerical instability makes it difficult to train an effective deep model. BN addresses this by pulling the distribution of each neuron's input back toward a standard normal distribution with mean 0 and variance 1, so that activations fall in the region where the nonlinearity is sensitive to its input; small changes in the input then lead to larger changes in the loss, the gradient grows, gradient vanishing is avoided, and learning converges faster. The batch normalization transform can be written as:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2,$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \gamma}}, \qquad y_i = \alpha\,\hat{x}_i + \beta,$$

where $\{x_1, \dots, x_m\}$ denotes the input set and $\{y_1, \dots, y_m\}$ the output in a mini-batch [18]. In our study, we set the mini-batch size to 256. Here $\alpha$ and $\beta$ are learnable parameters, which realize the reconstruction of the transform, and a tiny number $\gamma$ is added to prevent division by zero (as shown in Figure 14). BN is generally applied before the nonlinearity, and the output of BN becomes the input of the next layer.
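A minimal NumPy sketch of the transform above, keeping the paper's notation (α and β learnable, γ the small constant). Applied to a mini-batch of 256, the output has mean ≈ 0 and variance ≈ 1 per feature:

```python
import numpy as np

def batch_norm(x, alpha, beta, gamma=1e-5):
    """Batch normalization over a mini-batch x of shape (batch, features)."""
    mu = x.mean(axis=0)                       # mini-batch mean
    var = x.var(axis=0)                       # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + gamma)   # normalize; gamma avoids div-by-zero
    return alpha * x_hat + beta               # learnable scale (alpha), shift (beta)

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 8)) * 5 + 3     # mini-batch of 256, as in our setup
y = batch_norm(x, alpha=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(6), y.std(axis=0).round(3))  # ~0 mean, ~1 std
```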

Experiment setting
To test the proposed method, some training parameters were set. We set MaxEpochs = 30, InitialLearnRate = 0.01, and MiniBatchSize = 256. At the same time, we set parameters for the different training layers. The specific experimental settings are shown in Table 1.
After iterating over the training samples, we obtain the accuracy and loss shown in Figure 16.
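These parameter names match MATLAB's trainingOptions; for readers working in Python, a rough PyTorch equivalent of the same settings might look like the sketch below (the stand-in model and the random batch are hypothetical; the real network is given in the next section):

```python
import torch
from torch import nn

# Stand-in model (the real nine-layer CNN is sketched in the next section).
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 256 * 256, 30))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # InitialLearnRate = 0.01
criterion = nn.CrossEntropyLoss()

images = torch.randn(256, 3, 256, 256)   # one mini-batch: MiniBatchSize = 256
labels = torch.randint(0, 30, (256,))    # 30 sign categories

for epoch in range(30):                  # MaxEpochs = 30
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```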

Structure of proposed CNN
A grid searching approach was used to choose the structure. We create a nine-layer CNN with the structure shown in Table 2, where the input layer and softmax layer are not counted. Each convolutional layer is combined with batch normalization and a ReLU layer.
Moreover, the first layer shown in the table also includes stochastic pooling (SP), and the fully connected layer includes dropout. Thus, the first layer, six combined layers, the special fully connected layer and the final fully connected layer constitute the whole nine-layer CNN.
The other hyperparameters are described as follows: the size of the image input layer is 256×256, the stride is set to 2, and the filter sizes are set to 7×7 and 3×3, respectively. The numbers of filters change correspondingly with the feature maps. The dropout rate is set to 0.4. From the activations shown in the table, we can see that the height and width of each output layer shrink regularly as the network deepens.
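As a concrete reading of Table 2, here is one possible PyTorch rendering of the nine layers. The filter counts are hypothetical (the text fixes only the 7×7/3×3 sizes, stride 2 and dropout 0.4), and since PyTorch has no built-in stochastic pooling, max pooling stands in for SP here:

```python
import torch
from torch import nn

def conv_block(c_in, c_out, k, stride=2):
    """Conv + BatchNorm + ReLU, the combined layer used throughout the network."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=k, stride=stride, padding=k // 2),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

model = nn.Sequential(
    conv_block(3, 16, k=7),        # layer 1: 7x7 conv, stride 2 ...
    nn.MaxPool2d(2),               # ... followed by pooling (SP in the paper)
    conv_block(16, 32, k=3),       # layers 2-7: six combined 3x3 conv blocks
    conv_block(32, 32, k=3),
    conv_block(32, 64, k=3),
    conv_block(64, 64, k=3),
    conv_block(64, 128, k=3),
    conv_block(128, 128, k=3),
    nn.Flatten(),
    nn.Linear(128 * 1 * 1, 256),   # layer 8: fully connected, with dropout
    nn.ReLU(inplace=True),
    nn.Dropout(p=0.4),             # dropout rate 0.4
    nn.Linear(256, 30),            # layer 9: fully connected, 30 classes
)

x = torch.randn(1, 3, 256, 256)    # one 256x256 input image
print(model(x).shape)              # torch.Size([1, 30])
```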

Statistical analysis
Based on the data set formed by reading the 1320 sign language pictures one by one, we ran the training and test procedure 10 times, with a random split of the data in each run.
According to the results shown in Table 3, the accuracies of our method are 87.50%, 88.28%, 87.11%, 89.45%, 88.67%, 89.06%, 91.41%, 89.45%, 93.36% and 92.58%. From the table, we can also see that the accuracy exceeds 90% in the seventh, ninth and tenth runs, while the lowest accuracies occur in the first and third runs. In general, the accuracy of our method is higher than 87%.
Moreover, the spread between runs is small (standard deviation 2.10%), which demonstrates the stability of the experimental results.
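The reported figure of 89.69 ± 2.10% is simply the mean and sample standard deviation of the ten runs, which can be verified directly:

```python
import numpy as np

runs = [87.50, 88.28, 87.11, 89.45, 88.67,
        89.06, 91.41, 89.45, 93.36, 92.58]
print(f"{np.mean(runs):.2f} ± {np.std(runs, ddof=1):.2f}")  # 89.69 ± 2.10
```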

Comparison of pooling methods
In this paper, we use a pooling function at the pooling layer to replace the network's output at a location with an overall statistic of the adjacent outputs. We compare max pooling with average pooling [21]. Max pooling takes the maximum value of an adjacent rectangular region, and average pooling calculates the average value of an adjacent rectangular region. Both have their own shortcomings: max pooling easily overfits the training data, and it only reduces the bias of the estimated mean caused by convolutional layer parameter errors [22], so it may not generalize to the test set; average pooling may dilute strong activations when many elements in the pooling region are close to zero, and it only reduces the estimation variance caused by the limited size of the neighborhood.
To overcome the shortcomings of max pooling and average pooling, the stochastic pooling (SP) method has attracted the attention of many researchers. Stochastic pooling draws the output of each pooling region from the multinomial distribution formed by its activations; essentially, it is subsampling based on pixel probabilities. Stochastic pooling combines the advantages of average pooling and max pooling and provides excellent performance. The pooling layer can thus continuously reduce the size of the data space [23], so the number of parameters and the amount of computation are reduced, and overfitting is controlled at the same time.

Comparison with the latest methods
In this experiment, four algorithms are studied and compared. The first is the support vector machine (SVM) [24]; here the gesture recognition stage focuses on the SVM classifier. SVM is a pattern recognition method based on structural risk minimization, with unique advantages in small-sample, nonlinear and high-dimensional pattern recognition problems. In the end, an SVM with a radial basis kernel function was chosen for signal classification, with the optimal parameters selected by calculation. In this SVM-based gesture recognition algorithm, adopting an SVM classifier with a radial basis kernel improves both the accuracy and the efficiency of gesture recognition.
The second is the hidden Markov model (HMM) [25], a pattern recognition method based on statistical theory. In the HMM-based Chinese sign language recognition method, the hand-shape region is separated from the background using histogram features, after gesture image capture, image processing and dimensionality reduction. The regions other than the hand are then removed from the resulting image to obtain the hand contour. Static, simple sign language recognition can thus be achieved without gloves or other tools, with an accuracy above 85%. The traditional training method for HMM is maximum likelihood estimation (MLE), which in theory yields the optimal result when the number of training samples is large enough; in sign language recognition research, however, it is very difficult to collect enough training samples. The third is dynamic time warping (DTW) [26]. The idea of DTW is to perform a rough global search first, narrowing the gesture word to a small candidate set, and then recognize the word with a more precise local search using HMM. Compared with single-layer recognition using HMM alone, the recognition time per word decreases from 2.364 seconds to 0.137 seconds, a reduction of 94.2%, and the recognition accuracy also increases by 4.66%. The last is Kinect [27]: based on Kinect technology, a dynamic sign language recognition method using a finite state machine and DTW is proposed. First, Kinect is used to obtain the depth image and skeletal feature information of the human body; then the hand depth image is obtained by a hand segmentation algorithm, and HOG feature operators with high recognition accuracy are selected to extract hand features. This method can recognize commonly used sign language words and sentences with an accuracy of up to 95%. By comparison, the Chinese sign language recognition algorithm based on a nine-layer convolutional neural network in this paper reaches 89.69 ± 2.10%, which is better than some traditional convolutional neural networks and some modern algorithms (as shown in Table 4).

Conclusion
In this paper, a nine-layer convolutional neural network is proposed to classify Chinese sign language.
Our proposed CNN structure fully optimizes each layer.
The structure utilizes the excellent performance of SP, which not only reduces the number of parameters and the amount of computation, but also controls the risk of overfitting the training data. In addition, the batch normalization technique we adopted normalizes each layer's input over a mini-batch, handling the continual shift in the training distribution triggered by parameter updates in the previous layer, which improves the gradient and accelerates learning convergence. The proposed nine-layer CNN is optimized, and the dropout method is adopted to address overfitting and training time, achieving better optimization performance. Future work will further optimize the network structure and improve the training speed and recognition rate; some popular deep neural networks such as ResNet and DenseNet [28] will also be tried to achieve excellent performance.
The main contributions of this paper are as follows: 1) Convolution and pooling are applied to the input sample data, which not only reduces the number of parameters and the amount of computation but also controls the risk of overfitting the training data. 2) Batch normalization is adopted to normalize each layer's input over a mini-batch, solving the problem of continual training shift caused by parameter updates in the previous layer, improving the gradient and accelerating learning convergence. 3) The dropout method is adopted to reduce training time. 4) The proposed nine-layer CNN improves on the previous eight-layer CNN.

Figure 1. Images saved in TIF compression format (left) and images in the L1 folder (right)

Figure 2. Flow chart for manipulating the pictures. Here we present two sets of images: two samples are selected from the 44 samples, and two images are drawn from each set; one image in the first group represents the letter T.

Figure 5. Illustration of the convolution operation

From Figure 10, we can understand the backpropagation of the fully connected layer, taking our first fully connected layer as an example. This layer has 50 × 4 × 4 = 800 input nodes and 500 output nodes.

Figure 9. Forward calculation process of the fully connected layer
Figure 10. Backpropagation of the fully connected layer

The result of the derivation above also confirms the earlier statement: in backpropagation, if node a of layer x contributes to node b of layer x+1 through weight W, then the gradient propagates from node b back to node a through the same weight W.
We write it in matrix form (as shown in Figure 12).

Figure 11. Taking the derivative with respect to the output of the previous layer (the input of the current layer)
Figure 12. The derivative written in matrix form

Figure 13. Finding the derivative of the bias coefficient b

Figure 15. Network with the application of dropout

Table 2. Structure of the proposed CNN

Table 3. Statistical results of our method