Arabic Sign Language Characters Recognition Based on Deep Learning Approach and a Simple Linear Classifier

Sign language, based on hand gestures, is one of the best means of communication between deaf and hearing people. In Arab society, only deaf people and specialists can use Arabic sign language, which keeps the deaf community small and makes communication with hearing people difficult. In addition, the problem of Arabic sign language recognition (ArSLR) has only recently received attention, which underlines the need to investigate other approaches to it. This paper proposes a novel ArSLR scheme based on an unsupervised deep learning algorithm, a deep belief network (DBN), coupled with the direct use of tiny images, to recognize and classify the letters of the Arabic alphabet. Deep learning extracts the most important features, which are sparsely represented, and this plays an important role in simplifying the overall recognition task. In total, around 6,000 samples of the 28 Arabic alphabetic signs were used for feature extraction after resizing and normalization. Classification was performed using softmax regression and achieved an overall accuracy of 83.32%, which indicates that the DBN-based model for Arabic alphabetical character recognition is reliable. The model also achieved a sensitivity of 70.5% and a specificity of 96.2%.


INTRODUCTION
The most natural ways for human beings to communicate with each other are voice, gestures and human-machine interfaces. The last of these is still very primitive and forces us to adapt to the requirements of the machine. Moreover, voice signals are impossible or undesirable as a means of communicating with hearing-impaired people. Sign language gestures, in contrast, are well suited to communicating with deaf people in daily life. It is well known that sign language is the language of deaf people and that it relies on body movements, or any visual-manual channel, particularly the hands and arms, to convey meaning. Sign languages typically differ from one language to another and from one country to another.
In Arab society, the community of deaf people is small and very limited because only specialists deal with them. Statistics show that over 3% of the Palestinian population are hearing-impaired [1]. Also, according to the Palestinian Central Bureau of Statistics, 19% of disabled Palestinian people are deaf and mute [2]. Helping these people is very important, and technical systems capable of translating sign languages into text or spoken language are therefore highly needed. Such systems would facilitate communication between hearing-impaired and hearing people. In addition, it has been reported that Arabic sign language (ArSL) is the most difficult recognition task among sign languages because of its unique structure and complex grammar [3]; researchers in the Middle East and Arab countries started to pay attention to this problem in the early 1990s. Developing recognition systems for ArSL therefore remains an open and challenging task, which involves three phases of recognition: first, alphabet recognition; second, recognition of an isolated word consisting of one sign; and finally, recognition of a word that contains continuous signs. However, most of the existing ArSLR models have performed satisfactorily on Arabic alphabet recognition, with excellent accuracy. Figure 1 shows the 30 letters of the Arabic alphabet.
In the detection phase, regions of interest (ROI) are identified using a segmentation algorithm. The output of the segmentation process can then be used to perform the classification. The accuracy and speed of detection therefore play an important role in obtaining an accurate and fast recognition process. In the recognition phase, a set of features is extracted from each segmented hand sign and then used to perform the recognition. These features can therefore serve as a reference for understanding the differences among the signs.
Figure 1. Left: Arabic sign language alphabet images [4]. Right: The corresponding 32×32 small images.
As mentioned earlier, ArSLR systems have only recently received attention [5]-[7]. Some of these attempts are vision-based and use the K-nearest neighbor rule [5] or the hidden Markov model (HMM) [6], with promising results. Others are sensor-based and use the CyberGlove coupled with principal component analysis for feature extraction, followed by a support vector machine (SVM) [7]. Investigating and developing new ArSLR models is therefore important, and alternative approaches have to be considered. This paper thus proposes a simpler alternative Arabic sign recognition system based on deep feature extraction followed by a simple linear classifier. Deep learning models have recently shown significant success in different applications, such as robotics [8], neuroscience [9], traffic sign recognition [10], object detection and recognition [11], audio recognition [12], bib number detection and recognition [13], Arabic handwriting recognition [10], [14] and image compression and information retrieval [15]-[16].
The rest of the paper is organized as follows. Section 2 presents an overview of related work. Section 3 presents the proposed model for the recognition of the 28 Arabic alphabet signs. Section 4 details the experimental results. Finally, the discussion and conclusions are presented in Section 5.

RELATED WORKS
Generally, sign language recognition systems for American, British, Indian, Chinese, Turkish and many other sign languages have received much more attention than those for Arabic sign language. Reviews of recent developments in sign language recognition for other languages can be found in [17]-[20]. Most of the approaches proposed for ArSLR have relied on sensor-based techniques [7], [21]-[22]. A sensor-based model usually employs sensors attached to a hand glove, and look-up table software is usually provided with the glove for hand gesture recognition. However, image-based ArSLR techniques have recently emerged and been investigated [23]-[29]. Image-based ArSLR typically requires first producing an appropriate code for the initial data and then using this code to learn and classify the alphabet quickly and accurately. An image-based model usually uses video cameras to capture the movements of the hand. Image-based techniques face a number of challenges, however, including lighting conditions, image background, face and hand segmentation and different types of noise.
Different classification approaches (generative and discriminative) have recently been developed to address the problem of ArSLR, and most of them focus on recognizing the signs of the Arabic alphabet [23], [26]-[27], [30]-[33]. In particular, the authors in [23] developed a neuro-fuzzy system with five main stages: image acquisition, filtering, segmentation and hand outline detection, followed by feature extraction. The experiment was conducted with the bare hand and achieved a hit rate of 93.6%. The author in [32] also introduced a system for the automatic recognition of Arabic sign language letters. Hu's moments are used for feature extraction, and the moment invariants are fed to an SVM for classification. A correct classification rate of 87% was achieved.
The authors in [26] proposed an automatic ArSLR system that translates isolated Arabic word signs into text and involves four main stages: hand segmentation, tracking, feature extraction and classification. This model achieved a correct recognition rate of 97% in signer-independent mode. Another recent ArSLR model, based on optical flow-based features and an HMM, was proposed in [30]. In this model, the signs were transformed using four techniques: Modified Fourier Transform (MFT), Local Binary Pattern, Histogram of Oriented Gradients (HOG) and a combination of HOG and Histogram of Optical Flow. The best classification result, an accuracy of 99.11%, was achieved using an HMM with MFT features.
Furthermore, the authors in [31] proposed an isolated sign language recognition system that extracts geometric features of the hand gesture from camera images and builds a geometric model of the gesture. Using the extracted geometric features, a specific gesture is then recognized with a rule-based classifier. The model was tested on seven Arabic words and achieved an overall classification rate of 95.3%. Another work [32] described two dynamic sign language recognition systems, one real-time (online) and one offline. A comparison of the two showed that the recognition rate of the online system is lower than that of the offline system, which underlines that further enhancements are still needed to improve real-time recognition performance.
Moreover, several recent attempts have used recurrent neural networks (RNNs) [34] and convolutional neural networks (CNNs) [27], [31] for ArSLR. More specifically, the authors in [34] used RNNs together with colored gloves in their experiments. Their model achieved an accuracy of 89.7%, while a fully recurrent network improved the accuracy to 95.1% at the cost of a more complicated learning process. The authors in [31] used 3D CNNs to recognize 25 gestures from an Arabic sign dictionary. In this model, the features were learned by the deep network, and the model achieved a correct classification rate of 85% on the testing data. Finally, a similar ArSLR approach based on CNNs and the direct use of softmax regression was proposed in [27]. When 50% of the dataset was used for training, this approach achieved a correct classification rate of 83.27%; when the training share was increased to 80%, the rate rose to 90.02%.
Although most of the current approaches have achieved excellent classification results, some of them rely on sophisticated classifiers, such as [32], or on complex learning methods, such as [26]-[27], [31], [33]. In contrast, this paper proposes a deep learning approach that uses a fast learning technique, a Restricted Boltzmann Machine (RBM), and a simple linear classification method. The hypothesis is that the overall classification of Arabic sign letters can be simplified while still achieving accurate results, provided that the linear separation between the signs of the Arabic alphabet is improved in the learning phase. This can be achieved by using a powerful machine learning method capable of extracting appropriate features, which are later used to generate an appropriate code for the classification phase. The proposed model is described in the following section. It was recently put forward as a prospective study in our previous publication [35] and is mainly based on Deep Belief Networks (DBNs) and softmax regression.

ArSL Dataset and Data Preprocessing
The ArSL dataset used in this paper to test the proposed model was collected by Suez Canal University and used by [4]. In brief, this dataset consists of 210 gray-scale images representing the gestures of 30 Arabic letters, i.e., 7 images for each letter gesture. As stated in [4], the images were captured with different rotations, under different illumination conditions and from various volunteers with different hand sizes. In our experiments, we used the first 28 signs of the Arabic letters, as shown in Figure 1. 50% of these images were repeated to create a sufficiently large dataset of 6000 images in total, which were randomly distributed into groups to make them suitable for epoch-based training. The other subset was used for testing the proposed model.
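The replication and grouping step described above can be sketched as follows. This is only one possible reading of the procedure: the function name `replicate_and_batch`, the number of training images per sign and the random stand-in images are hypothetical, and the group size of 200 is borrowed from the mini-batch size quoted in the training protocol.

```python
import numpy as np

def replicate_and_batch(images, labels, target_size, batch_size, rng):
    """Repeat samples up to target_size, shuffle them and split into groups."""
    n = images.shape[0]
    # np.resize cycles through the original indices until target_size is reached.
    idx = np.resize(np.arange(n), target_size)
    rng.shuffle(idx)
    return [(images[idx[s:s + batch_size]], labels[idx[s:s + batch_size]])
            for s in range(0, target_size, batch_size)]

rng = np.random.default_rng(0)
n_classes, per_class = 28, 3  # hypothetical number of training images per sign
labels = np.repeat(np.arange(n_classes), per_class)
images = rng.standard_normal((labels.size, 1024))  # stand-in for 32x32 images
batches = replicate_and_batch(images, labels, target_size=6000,
                              batch_size=200, rng=rng)
print(len(batches), batches[0][0].shape)
```

Shuffling the repeated indices before slicing means that every group mixes several signs, which is the usual motivation for random grouping before epoch-based training.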
ArSL images contain many hand features, borders, corners and edges, which help the learning network to extract localized and sparse features with a deep learning approach. It has been shown that the input layer for DBN training should contain approximately 1000 pixels, which requires a significant reduction of the original images [8]. Therefore, all images were cropped and reduced to 32×32 = 1024 pixels with a fixed scale, as shown in Figure 1 (right). Despite this large reduction, the reduced images remain fully recognizable. It has also been shown that DBNs are still capable of extracting interesting features from tiny images [8], [36]. To ensure that the network learns sparse and localized features with higher-order statistics, the tiny images were locally normalized to zero mean and unit variance. Local normalization has recently been shown to achieve better results in terms of feature extraction and classification [8]. Consequently, the normalized tiny images were used as the input vectors to train the network and build the model.
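A minimal sketch of this preprocessing is given below, under two stated assumptions: the reduction to 32×32 is done by simple block averaging (the resizing method is not specified in the text), and the normalization is applied per image as a simple stand-in for the local normalization of [8]. The function names and the random test image are hypothetical.

```python
import numpy as np

def downsample_to_32(image):
    """Reduce a square gray-scale image to 32x32 by averaging equal blocks.

    Assumes the side length is a multiple of 32 (a simplification).
    """
    side = image.shape[0]
    if image.shape != (side, side) or side % 32 != 0:
        raise ValueError("expected a square image whose side is a multiple of 32")
    block = side // 32
    # Index (i, a, j, b) maps to pixel (i*block + a, j*block + b).
    return image.reshape(32, block, 32, block).mean(axis=(1, 3))

def standardize(image, eps=1e-8):
    """Shift to zero mean and scale to unit variance (per image)."""
    centered = image - image.mean()
    return centered / (centered.std() + eps)

def preprocess(image):
    """Return a 1024-dimensional input vector for the network."""
    tiny = downsample_to_32(image.astype(np.float64))
    return standardize(tiny).reshape(-1)

rng = np.random.default_rng(0)
fake_image = rng.integers(0, 256, size=(128, 128))  # stand-in for a cropped hand image
x = preprocess(fake_image)
print(x.shape)
```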

Deep Learning Model
We developed a novel deep learning approach specifically for the classification of ArSL. The general workflow of the proposed model includes three main stages (see Figure 2): 1) image pre-processing, 2) unsupervised feature space construction and 3) Arabic sign language recognition. The first two steps of the proposed model have recently been carried out and published [35].
Typically, the input layer of an RBM corresponds to the normalized input data. Thus, given the visible vector $\mathbf{v}$, the activation probability of the hidden layer can be computed as follows:
$$P(h_j = 1 \mid \mathbf{v}; \theta) = \sigma\Big(c_j + \sum_i w_{ij} v_i\Big) \qquad (1)$$
where $\theta$ represents the model parameters, $\sigma(x)$ is the sigmoid function, $w_{ij}$ is the weight matrix between the visible layer and the hidden layer and $c_j$ is the bias of the hidden layer. Similarly, once the hidden units are computed, the zero-mean Gaussian activation of the visible layer can be recomputed as follows:
$$P(v_i \mid \mathbf{h}; \theta) = \mathcal{N}\Big(b_i + \sum_j w_{ij} h_j,\ \sigma^2\Big) \qquad (2)$$
where $b_i$ is the bias of the visible layer and $\mathcal{N}(\mu, \sigma^2)$ denotes a Gaussian distribution with mean $\mu$ and variance $\sigma^2$.
The first RBM layer was trained with a contrastive divergence (CD) learning technique, using the normalized tiny images as stated before. As illustrated in Figure 3 (left), CD learning starts by setting the states of the visible units to a training vector. Then, the binary states of the hidden layer are computed in parallel using Equation 1. Once binary states have been sampled for the hidden units, a "reconstruction" is produced by setting each $v_i$ to 1 with a probability given by Equation 2. Therefore, the model parameters can be updated using the following equations:
$$\Delta w_{ij} = \epsilon \big( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \big) \qquad (3)$$
$$\Delta b_i = \epsilon \big( \langle v_i \rangle_{\text{data}} - \langle v_i \rangle_{\text{recon}} \big) \qquad (4)$$
$$\Delta c_j = \epsilon \big( \langle h_j \rangle_{\text{data}} - \langle h_j \rangle_{\text{recon}} \big) \qquad (5)$$
where $\epsilon$ is the learning rate and $\langle \cdot \rangle_{\text{data}}$ and $\langle \cdot \rangle_{\text{recon}}$ denote averages over the training data and over the reconstructions, respectively. The convergence of the network is achieved once the difference between the data statistics and the statistics of its representation generated by Gibbs sampling approaches zero, where the training dataset is fed to the network over the epochs.
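The CD learning procedure described above can be sketched as a single update step on one mini-batch. This is a minimal illustration rather than the training code of [35]: it assumes one Gibbs step (CD-1), unit-variance Gaussian visible units whose reconstruction is taken as the conditional mean, and a reduced hidden layer of 64 units to keep the demonstration small. The function name `cd1_step` and all variable names are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_step(v0, W, b, c, lr, rng):
    """One CD-1 update on a mini-batch v0 of shape (batch, n_visible).

    W: (n_visible, n_hidden) weights, b: visible bias, c: hidden bias.
    Visible units are assumed Gaussian with unit variance.
    """
    # Positive phase: hidden probabilities and sampled binary states.
    h0_prob = sigmoid(v0 @ W + c)
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(v0.dtype)
    # Reconstruction: conditional mean of the visible units given h0.
    v1 = h0 @ W.T + b
    # Negative phase: hidden probabilities driven by the reconstruction.
    h1_prob = sigmoid(v1 @ W + c)
    n = v0.shape[0]
    # Data statistics minus reconstruction statistics.
    dW = (v0.T @ h0_prob - v1.T @ h1_prob) / n
    db = (v0 - v1).mean(axis=0)
    dc = (h0_prob - h1_prob).mean(axis=0)
    recon_error = np.mean((v0 - v1) ** 2)
    return W + lr * dW, b + lr * db, c + lr * dc, recon_error

rng = np.random.default_rng(0)
n_visible, n_hidden = 1024, 64  # 64 hidden units only to keep the demo small
W = 0.01 * rng.standard_normal((n_visible, n_hidden))
b = np.zeros(n_visible)
c = np.zeros(n_hidden)
batch = rng.standard_normal((20, n_visible))  # stand-in for normalized tiny images
W, b, c, err = cd1_step(batch, W, b, c, lr=0.02, rng=rng)
print(W.shape, err)
```

In practice the step would be repeated over all mini-batches and epochs, and the reconstruction error can be monitored as a rough indicator of convergence.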
After the convergence of the network, a simple linear classifier, such as softmax regression, is finally used to perform the classification in the feature space, using new samples from the validation dataset. It has been shown in many recent studies [8]-[9], [38] that using a deep learning approach plays important roles in:
(1) Reducing the dimensionality of the data by using the most significant features to represent an object, thus speeding up further tasks, such as the classification process.
(2) Improving the linear separability of the 28 signs of Arabic letters, thus simplifying the overall classification process among them.
Therefore, the use of softmax regression was based on the assumption that the data become linearly separable in the feature space constructed by DBNs. To test this hypothesis, a non-linear classification algorithm, such as an SVM, is also used in the classification phase for comparison.
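To make the role of the linear classifier concrete, the following sketch trains softmax regression by batch gradient descent on toy data that are linearly separable by construction. The toy clusters stand in for coded images in the feature space; the function names, learning rate and number of epochs are illustrative assumptions and not the settings used in the experiments.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(X, y, n_classes, lr=0.1, epochs=200):
    """Fit softmax regression by batch gradient descent on cross-entropy."""
    n, d = X.shape
    W = np.zeros((d, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        P = softmax(X @ W + b)
        grad = (P - Y) / n  # gradient of the mean cross-entropy w.r.t. the logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(X, W, b):
    return softmax(X @ W + b).argmax(axis=1)

# Toy "features": three well-separated clusters instead of 28 signs.
rng = np.random.default_rng(0)
centers = np.array([[4.0, 0.0], [-4.0, 0.0], [0.0, 4.0]])
y = np.repeat(np.arange(3), 50)
X = centers[y] + 0.5 * rng.standard_normal((150, 2))
W, b = train_softmax(X, y, n_classes=3)
accuracy = (predict(X, W, b) == y).mean()
print(accuracy)
```

When the classes are linearly separable in the feature space, as assumed above, such a classifier is expected to do about as well as a non-linear one, which is the comparison the SVM baseline is meant to provide.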

Feature Extraction
Preliminary experiments have shown that the best structure for DBN training in terms of the final classification rate is the complete one (1024-1024). The training protocol is similar to the one proposed in [35] (300 epochs, a mini-batch size of 200, a learning rate of 0.02, an initial momentum of 0.5, a final momentum of 0.9, a weight decay of 0.0002, a sparsity target of 0.02 and a sparsity cost of 0.02).
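The hyperparameters listed above can be collected in a small configuration, together with a sketch of how momentum, weight decay and a sparsity term typically enter the parameter update. The update forms are common conventions and are assumptions here, since the exact rules of [35] are not restated in this section. In particular, the epoch at which the momentum switches from its initial to its final value (`MOMENTUM_SWITCH_EPOCH`) is hypothetical, as are the function names.

```python
import numpy as np

# Values quoted in the training protocol above.
CONFIG = {
    "n_visible": 1024, "n_hidden": 1024,
    "epochs": 300, "batch_size": 200, "learning_rate": 0.02,
    "initial_momentum": 0.5, "final_momentum": 0.9,
    "weight_decay": 0.0002, "sparsity_target": 0.02, "sparsity_cost": 0.02,
}
MOMENTUM_SWITCH_EPOCH = 5  # hypothetical: the switch point is not given in the text

def momentum_for(epoch, cfg=CONFIG, switch=MOMENTUM_SWITCH_EPOCH):
    """Use the initial momentum early in training and the final one afterwards."""
    return cfg["initial_momentum"] if epoch < switch else cfg["final_momentum"]

def update_weights(W, velocity, grad, epoch, cfg=CONFIG):
    """Momentum update; grad is the CD estimate (data minus reconstruction)."""
    m = momentum_for(epoch, cfg)
    step = cfg["learning_rate"] * (grad - cfg["weight_decay"] * W)
    velocity = m * velocity + step
    return W + velocity, velocity

def sparsity_bias_grad(hidden_probs, cfg=CONFIG):
    """One simple form of sparsity term: push mean activations toward the target."""
    mean_activation = hidden_probs.mean(axis=0)
    return cfg["sparsity_cost"] * (cfg["sparsity_target"] - mean_activation)

W0 = np.ones((2, 2))
vel0 = np.zeros_like(W0)
grad = np.full((2, 2), 0.5)
W1, vel1 = update_weights(W0, vel0, grad, epoch=0)
bias_push = sparsity_bias_grad(np.full((4, 3), 0.5))
print(W1[0, 0], bias_push[0])
```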
After training of the first RBM layer has converged, a set of sparse localized features is learned and extracted, as shown in Figure 4. One can see that these features represent most of the gestures of the Arabic letters. Some of the extracted features are very localized and represent small parts of the original hands, such as finger edges, hand borders and shapes. Another observation is that the thumb and/or index finger can be seen in most of the extracted features, which can later be used as reference features to code the testing images and perform the classification. We have also found that training a second RBM layer reduces the classification rate, possibly because the second layer suppresses some important features. Therefore, we assume that there is no need to train a second RBM layer, as the features extracted by the first RBM layer represent the shapes of most of the 28 signs of Arabic letters [35]. It can also be seen that some signs are overrepresented among the extracted features, for example those of the letters "NOON", "GHAYN" and "LAM". Others are underrepresented, such as the signs of the letters "HA", "JIEM", "QAF", "FA" and "SAD". This might be because the overrepresented signs have very sharp shapes, which forced the learning network to extract them multiple times in preference to less sharp signs. The over-representation of high frequencies in the obtained features is expected to play an important role in improving the linear separation of the initial data in the feature space, thus enhancing the classification rate.

Arabic Gesture Recognition
As mentioned before, after the learning phase has built an appropriate model from the extracted features, the next phase is the classification of the testing dataset, which was created from the original dataset. The best classification results were achieved using a network with a single RBM layer, as stated before. Therefore, the real-valued output of the units of the first RBM is used as input to a softmax regression that performs the classification. Figure 5 shows the classification results obtained using DBNs coupled with tiny images and followed by softmax regression. The use of a simple classifier was based on the assumption that, once the images have been coded appropriately using the extracted features shown in Figure 4, the coded images become linearly separable.
For each image from the testing dataset, the softmax network uses the coded sign units to compute the probability of the image belonging to each of the 28 Arabic letters. The system identifies the character with the maximum probability. As illustrated in Figure 5, the correct classification rate for the individual characters ranged from 70% to 98% with the proposed model. The overall average correct classification rate for the 28 characters was 83.32% with softmax regression and 83.5% with an SVM. These results remain consistent and quite comparable when a more sophisticated classification algorithm, such as an SVM, is used instead of softmax. This indicates that the DBN has contributed significantly to improving the linear separation between the gestures of the Arabic letters, so that a simple linear classifier was sufficient in the classification stage. These results are also quite comparable to the best recently published image-based approach [27] when the dataset is divided equally between training and testing.
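The decision rule and the per-character rates discussed above can be illustrated with a small example. The probabilities below are made up and use three classes instead of 28; the function names are hypothetical.

```python
import numpy as np

def predict_letters(probabilities):
    """Pick the class with the maximum probability for each test image."""
    return probabilities.argmax(axis=1)

def per_class_rate(y_true, y_pred, n_classes):
    """Fraction of test images of each class that were labeled correctly."""
    rates = np.full(n_classes, np.nan)  # NaN marks classes absent from y_true
    for k in range(n_classes):
        mask = y_true == k
        if mask.any():
            rates[k] = (y_pred[mask] == k).mean()
    return rates

# Toy softmax outputs for four test images.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.3, 0.4, 0.3],
                  [0.2, 0.2, 0.6]])
y_true = np.array([0, 1, 2, 2])
y_pred = predict_letters(probs)
rates = per_class_rate(y_true, y_pred, n_classes=3)
overall = (y_pred == y_true).mean()
print(y_pred, rates, overall)  # [0 1 1 2] [1.  1.  0.5] 0.75
```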

DISCUSSION AND CONCLUSIONS
In this paper, a simple alternative approach based on deep learning and a linear classifier was proposed to classify the signs of Arabic letters accurately. The overall results of the proposed model are comparable to those of some existing approaches, for instance [23], [27], which are based on more complicated learning techniques and sophisticated classifiers, and they outperform the results obtained in [29] with various visual descriptors followed by an SVM. However, these results are still somewhat lower than those obtained with hand-engineered signatures, such as SIFT descriptors, followed by a sophisticated classifier, such as an SVM [38]. Several observations can be made from the classification results shown in Figure 5. The correct classification rate for some gestures, for example the letters "NOON", "GHAYN" and "LAM", is high and reached 98%, while the rate for other gestures, for example the letters "HA", "JIEM", "QAF", "FA" and "SAD", ranged from 70% to 75%. This confirms that how well gestures are represented in feature extraction plays an important role in achieving accurate results. It can also be seen that the correct classification rate for the gesture pairs ("DAL", "THAL"), ("TAH", "THAH") and ("RA", "ZAY") ranged from 70% to 80%. These misclassifications are attributed to the strong similarity between the gestures in each pair, which has probably forced the learning network to extract similar features, thus complicating their classification.
The final results are also reported in Table 1, where the evaluation measures of the recognition system have been calculated for both models: DBNs followed by softmax regression and DBNs followed by an SVM. The values of these measures are quite similar, which indicates that the linear separation of the initial data is provided by the DBNs and that the subsequent choice of classification algorithm does not significantly affect the result. In other words, a simple classifier such as softmax regression was sufficient to obtain comparable classification results, and a sophisticated classifier such as an SVM only slightly improved the classification rate, from 83.32% to 83.5%. The mean specificity of the proposed model was lower with softmax regression than with an SVM, which indicates a lower true negative rate. In contrast, the mean sensitivity was higher with softmax regression than with an SVM, which indicates a higher true positive rate. Finally, DBNs followed by softmax regression achieved an F1 score of 81.1%, while DBNs followed by an SVM achieved an F1 score of 80.4%, which underlines that the proposed model is simpler while remaining accurate.
The major advantages of the proposed model can be summarized as follows. First, the obtained results demonstrate that a small image-based model followed by an appropriate feature space projection is capable of achieving results comparable to those of recent methods such as [27], which is based on a complex learning technique (a convolutional neural network), and [38], which is based on a more sophisticated algorithm (SIFT descriptors followed by an SVM classifier). Second, this paper presents a simple alternative to the existing approaches to ArSLR, which improves the linear separability of the initial data in the feature space and thus simplifies the overall classification process. Third, after the gestures are projected onto an appropriate feature space that increases the linear separation between them, the use of a nonlinear classifier, such as an SVM, did not significantly change or improve the results.
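For reference, the evaluation measures discussed above (sensitivity, specificity and F1 score) can be computed from a confusion matrix as sketched below. The sketch assumes one-vs-rest counts averaged over classes, which is one common reading of "mean" sensitivity and specificity and is not confirmed by the text. It also assumes that every class appears at least once among both the true and the predicted labels. The toy labels are made up.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    cm = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(cm, (y_true, y_pred), 1)  # rows: true class, columns: predicted
    return cm

def macro_metrics(cm):
    """Class-averaged sensitivity, specificity and F1 from one-vs-rest counts."""
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp
    fp = cm.sum(axis=0) - tp
    tn = cm.sum() - tp - fn - fp
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity.mean(), specificity.mean(), f1.mean()

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
cm = confusion_matrix(y_true, y_pred, n_classes=3)
sens, spec, f1 = macro_metrics(cm)
print(cm)
print(round(sens, 4), round(spec, 4), round(f1, 4))  # 0.6667 0.8333 0.6556
```

With many classes, each one-vs-rest problem is dominated by negatives, which is one reason specificity tends to be much higher than sensitivity in multi-class settings such as this one.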
Different ways of improving the results will be investigated in the future, including:
(1) Examining the assumption that sharp signs force the learning network to extract them multiple times. Studying the effect of normalization and whitening on feature extraction thus remains an open question.
(2) If extracting sparse features plays an important role in improving the classification results, then studying the sparsity factor with different parameters also needs to be considered.
(3) Increasing the size of the dataset to ensure scalability and to include new gestures, for instance those that represent the Arabic digits.
(4) Increasing the number of images for signs that were underrepresented in feature extraction or that have similar gestures, and studying the effect on the learning and classification phases.