Arabic Sign Language Recognition and Generating Arabic Speech Using Convolutional Neural Network

Sign language encompasses the movement of the arms and hands as a means of communication for people with hearing disabilities. An automated sign recognition system requires two main steps: the detection of particular features and the categorization of the input data. Many approaches for detecting and classifying sign languages have been put forward to improve system performance. Recent progress in the computer vision field has encouraged further exploration of hand sign and gesture recognition with the aid of deep neural networks. Arabic sign language has witnessed unprecedented research activity on recognizing hand signs and gestures using deep learning models. This paper proposes a vision-based system that applies a CNN to recognize Arabic hand sign-based letters and translate them into Arabic speech. The proposed system automatically detects hand sign letters and speaks the result in Arabic. The system achieves 90% accuracy in recognizing the Arabic hand sign-based letters, which indicates that it is dependable. The accuracy could be further improved by using more advanced hand gesture recognition devices such as Leap Motion or Xbox Kinect. After the Arabic hand sign-based letters are recognized, the outcome is fed to a text-to-speech engine, which produces Arabic audio as output.


Introduction
Language is perceived as a system comprising formal signs, symbols, sounds, or gestures that are used for daily communication. Communication can be broadly categorized into four forms: verbal, nonverbal, visual, and written. Verbal communication means transferring information either by speaking or through sign language. Nonverbal communication, by contrast, transfers information through body language, facial expressions, and gestures. Written communication conveys information through writing, printing, or typing symbols such as numbers and letters, while visual communication conveys information through means such as art, photographs, drawings, charts, sketches, and graphs.
The movement of the arms and hands to communicate, especially by people with hearing disabilities, is referred to as sign language. However, it differs according to the people and the region they come from. There is therefore no single standard sign language; for instance, American, British, Chinese, and Saudi sign languages all differ. Since sign language has become an established means of communication for people who are deaf and mute, an automated system can be developed to help them communicate with people who are not deaf and mute.
Sign language is made up of four major manual components: the hands' configuration, movement, orientation, and location in relation to the body [1]. An automated sign recognition system involves two main procedures, namely detecting the features and classifying the input data. Many approaches have been put forward for the detection and classification of sign languages to improve the performance of automated sign language systems. American Sign Language (ASL) is widely used in countries such as the USA, Canada, and some parts of Mexico, and with little modification it is also used in a few other countries in Asia, Africa, and Central America. Research has also been conducted extensively on English, Asian, and Latin sign languages, while little attention has been paid to Arabic. This may be because no generally accepted database for Arabic sign language is available to researchers, who have therefore had to develop datasets themselves, which is a tedious task. In particular, there is no Arabic sign language recognition system that uses comparatively new techniques such as cognitive computing, Convolutional Neural Networks (CNN), IoT, and cyberphysical systems, which are used extensively in many automated systems [2][3][4][5][6][7]. Cognitive computing, which is inspired by the human brain, enables systems to reason in a humanlike way without human operational assistance [8][9][10]. Deep learning, also known as deep neural learning or deep neural networks, is a subset of machine learning in artificial intelligence (AI) whose networks are capable of learning from unstructured or unlabeled data, including without supervision [11][12][13][14][15]. Within deep learning, CNN is a class of deep neural networks most commonly applied in the field of computer vision.
Vision-based approaches mainly focus on the captured image of a gesture and extract its primary features to identify it. This method has been applied in many tasks, including super resolution, image classification and semantic segmentation, multimedia systems, and emotion recognition [16][17][18][19][20]. Oyedotun and Khashman [21] used CNN along with a Stacked Denoising Autoencoder (SDAE) to recognize 24 hand gestures of American Sign Language (ASL) obtained from a public database. Pigou et al. [22] proposed using a CNN to recognize Italian sign language. Hu et al. proposed a hybrid CNN and RNN architecture that captures the temporal properties of the electromyogram signal for gesture recognition [23]. A CNN model that automatically recognizes digits from hand signs and speaks the result in Bangla is described in [24], and this work follows that approach. In [25], transfer learning is applied to data collected from several users, exploiting a deep learning algorithm to learn discriminant characteristics from large datasets.
Several other techniques have been used to recognize Arabic sign language. Tubaiz et al. [26] proposed a continuous recognition system for Arabic sign language using the K-nearest neighbor classifier and a statistical feature extraction method. The main drawback of this approach is that users must wear instrumented gloves to capture the gesture information, which often causes considerable discomfort. Similarly, [27] proposes an instrumented glove for the development of an Arabic sign language recognition system. Continuous recognition of Arabic sign language using hidden Markov models and spatiotemporal features was proposed in [28]. Halawani [29] studied translation from Arabic sign language to text, which can be used on mobile devices. In [30], automatic recognition of Arabic sign language using sensor-based and image-based approaches is presented. [31] uses two depth sensors to recognize the hand gestures of Arabic Sign Language (ArSL) words, and [32] introduces a dynamic Arabic Sign Language recognition system using Microsoft Kinect that relies on two machine learning algorithms. However, the recent CNN approach has not previously been applied to Arabic sign language. Therefore, this work aims to develop a vision-based system that applies CNN to recognize Arabic hand sign-based letters and translate them into Arabic speech. A dataset is also created for the 31 letters of Arabic sign language, with 100 images in the training set and 25 images in the test set for each hand sign. The proposed system is tested with different combinations of hyperparameters to obtain the best outcome with the least training time.

Data Preprocessing
Data preprocessing is the first step toward building a working deep learning model. It is used to transform the raw data into a useful and efficient format. Figure 1 shows the flow diagram of data preprocessing.

Raw Images.
Hand sign images captured using a camera are called raw images and are used to implement the proposed system. The images are taken under the following conditions: (i) from different angles, (ii) under changing lighting conditions, (iii) with good quality and in focus, and (iv) with varying object size and distance. The objective of capturing raw images is to create the dataset for training and testing. Figure 2 shows 31 images for the 31 letters of the Arabic alphabet from the dataset of the proposed system.

Classifying Images.
The proposed system classifies the images into 31 categories for the 31 letters of the Arabic alphabet. The images of each category are stored in their own subfolder. All subfolders, which represent the classes, are kept together in one main folder named "dataset".
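The folder layout described above can be sketched as follows. This is a minimal illustration that assumes one subfolder per letter under a main folder; the function names build_label_map and list_samples, and the choice of sorting the class names alphabetically, are hypothetical and are not taken from the proposed system.

```python
import os


def build_label_map(root):
    """Map each class subfolder name under `root` to an integer label."""
    classes = sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )
    return {name: index for index, name in enumerate(classes)}


def list_samples(root):
    """Return (file path, integer label) pairs for every file in every class subfolder."""
    label_map = build_label_map(root)
    samples = []
    for name, label in label_map.items():
        folder = os.path.join(root, name)
        for filename in sorted(os.listdir(folder)):
            samples.append((os.path.join(folder, filename), label))
    return samples
```

Calling list_samples("dataset") on such a layout would yield one labeled entry per stored image, which is the form most training loops expect.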

Formatting Image.
Usually, the hand sign images are of unequal size and have different backgrounds. It is therefore necessary to remove the unnecessary elements from the images to extract the hand region. The extracted images are resized to 128 × 128 pixels and converted to RGB. Figure 3 shows the formatted images of the 31 letters of the Arabic alphabet.

Augmentation.
Real-time data is always inconsistent and unpredictable because of transformations such as rotation and translation. Image augmentation is used to improve deep network performance. It creates images artificially through various processing methods, such as shifts, flips, shear, and rotation. The images of the proposed system are rotated randomly from 0 to 360 degrees using this technique. A few images were also sheared randomly within a 0.2-degree range, and a few images were flipped horizontally. Figure 4 shows a snapshot of the augmented images of the proposed system.

Architecture
Figure 5 shows the architecture of the Arabic sign language recognition system using CNN. A CNN is built from perceptrons, a machine learning (ML) algorithm, and uses them to analyze data. It falls in the category of artificial neural networks (ANN). CNN is mostly applied in the field of computer vision, where it mainly helps in image classification and recognition. The two components of CNN are feature extraction and classification. Each component has its own characteristics, which are explained in the following sections.
Feature Extraction Part.
A CNN has various building blocks, of which the convolution layer is the most important. Convolution refers to the mathematical combination of a pair of functions to yield a third function. In a CNN, convolution is applied to the input using a filter (or kernel) to produce a feature map. A convolution is executed by sliding each filter over the input. At each position, the filter and the underlying input patch are multiplied element by element, and the sum of the products is written to the feature map.
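The sliding-filter operation can be illustrated with a small sketch in plain Python. It assumes a single-channel input, no padding, and the unflipped-kernel form (cross-correlation) that deep learning libraries commonly call convolution; the function name convolve2d and the toy values are illustrative only.

```python
def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image` (no padding) and return the feature map."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the kernel with the patch under it element by element, then sum.
            total = 0
            for a in range(kh):
                for b in range(kw):
                    total += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(total)
        feature_map.append(row)
    return feature_map


image = [
    [1, 2, 0],
    [0, 1, 3],
    [2, 1, 0],
]
kernel = [
    [1, 0],
    [0, 1],
]
feature_map = convolve2d(image, kernel)  # [[2, 5], [1, 1]]
```

A 3 × 3 input and a 2 × 2 filter give a 2 × 2 feature map, which shows how the map shrinks when no padding is used.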
Every image is represented as a 3D matrix with a specified width, height, and depth. Depth is included as a dimension because an RGB image contains color channels. Numerous convolutions can be performed on the input with different filters, each generating a different feature map. These feature maps are combined to form the output of the convolution layer. The output is then passed through an activation function to produce a nonlinear output.
One of the most popular activation functions is the Rectified Linear Unit (ReLU), which computes the function $f(\kappa) = \max(0, \kappa)$; that is, the activation is thresholded at zero. ReLU is reported to be more reliable and to speed up convergence by a factor of six compared with sigmoid and tanh, but it can be fragile during training. This disadvantage can, however, be overcome by setting an appropriate learning rate.
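As a brief illustration, ReLU applied to a handful of example values behaves as follows. The function name relu and the sample inputs are chosen only for this sketch.

```python
def relu(x):
    """Rectified Linear Unit: pass positive values through, clamp the rest to zero."""
    return max(0.0, x)


activations = [relu(value) for value in [-2.0, -0.5, 0.0, 1.5]]  # [0.0, 0.0, 0.0, 1.5]
```

Negative inputs are all mapped to zero, which is the thresholding behavior described above.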
Stride refers to the size of the step by which the convolution filter moves each time. The stride is usually 1, meaning that the filter moves pixel by pixel. If the stride is increased, the filter slides over the input at a larger interval and therefore has less overlap between the cells.
Because, without padding, the feature map is smaller than the input, something must be done to stop the feature map from shrinking. Padding is used for this purpose: a layer of zero-valued pixels is added around the input so that the feature map does not shrink. Padding also keeps the spatial dimensions constant after convolution so that the kernel and stride sizes match the input, which enhances the performance of the system.
Three main parameters need to be adjusted to modify the behavior of a convolutional layer: filter size, stride, and padding. The output size of any given convolution layer can be calculated as

$$\text{output size} = \frac{\text{input size} - \text{filter size} + 2 \times \text{padding size}}{\text{stride size}} + 1,$$

where output size is the size of the output of the convolution layer, input size is the size of the input image, and filter size is the size of the filter.
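The effect of the three parameters can be checked numerically. The sketch below uses the standard relation (input size − filter size + 2 × padding size) / stride size + 1, with integer division for strides that do not divide evenly; the function name conv_output_size and the padding and stride values in the examples are assumptions chosen for illustration, since the paper does not state which values were used.

```python
def conv_output_size(input_size, filter_size, padding=0, stride=1):
    """Spatial output size of a convolution layer along one dimension."""
    return (input_size - filter_size + 2 * padding) // stride + 1


no_padding = conv_output_size(128, 3)                       # 126
same_padding = conv_output_size(128, 3, padding=1)          # 128
strided = conv_output_size(128, 3, padding=1, stride=2)     # 64
```

With a 128 × 128 input and a 3 × 3 filter, one pixel of zero padding keeps the size at 128, while a stride of 2 roughly halves it.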

Pooling Layer.
A pooling layer is typically added between convolution layers. Its main purpose is to progressively reduce the dimensionality and so lessen the computation by using fewer parameters. It also helps control overfitting and reduces training time. There are several forms of pooling; the most common is max pooling, which takes the highest value in each window and hence reduces the size of the feature map while keeping the vital information. The window size must be specified in advance; the size of the output volume of the pooling layer can then be determined with the following formula:

$$\text{output size} = \frac{\text{input size} - \text{window size}}{\text{stride size}} + 1.$$
In all cases, the pooling layer provides some translation invariance, which means that a particular object can be identified regardless of where it appears in the frame.
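Max pooling can be sketched in a few lines of plain Python. The example assumes a 2 × 2 window with a stride of 2 on a single-channel feature map; the function name max_pool2d and the toy values are illustrative.

```python
def max_pool2d(feature_map, window=2, stride=2):
    """Keep the largest value in each window of a 2D feature map."""
    out_h = (len(feature_map) - window) // stride + 1
    out_w = (len(feature_map[0]) - window) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            patch = [
                feature_map[i * stride + a][j * stride + b]
                for a in range(window)
                for b in range(window)
            ]
            row.append(max(patch))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
pooled = max_pool2d(feature_map)  # [[6, 4], [8, 9]]
```

The 4 × 4 map is reduced to 2 × 2 while the strongest response in each region is kept.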

Classification.
The second important component of CNN is classification. The classification part consists of a few fully connected (FC) layers. Neurons in an FC layer have full connections to all activations of the previous layer. The FC layer helps map the representation between the input and the output, and it operates on the same principles as a regular neural network. However, an FC layer accepts only one-dimensional data. To transform three-dimensional data into one-dimensional data, the flatten function in Python is used in the proposed system.
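The classification steps can be sketched as flattening, a fully connected layer, and a softmax that turns scores into class probabilities. This is a toy illustration in plain Python, not the implementation used in the proposed system; the function names, the tiny 1 × 2 × 2 volume, and the weights are all made up for the example.

```python
import math


def flatten(volume):
    """Flatten a height x width x depth nested list into a one-dimensional list."""
    return [value for row in volume for pixel in row for value in pixel]


def dense(inputs, weights, biases):
    """Fully connected layer: every output neuron is connected to every input."""
    return [
        sum(w * x for w, x in zip(neuron_weights, inputs)) + b
        for neuron_weights, b in zip(weights, biases)
    ]


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


volume = [[[1.0, 2.0], [3.0, 4.0]]]          # 1 x 2 x 2 toy feature volume
vector = flatten(volume)                     # [1.0, 2.0, 3.0, 4.0]
weights = [[0.1, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.1]]
biases = [0.0, 0.0]
probabilities = softmax(dense(vector, weights, biases))
```

The class with the larger score receives the larger probability, and the probabilities always sum to 1.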

Experimental Result and Discussion
The proposed system is tested with 2 convolution layers, each followed by a 2 × 2 max pooling layer. The two convolution layers differ in structure: the first has 32 kernels and the second has 64, but the kernel size in both layers is the same, 3 × 3. Each pair of convolution and pooling layers was checked with two different dropout regularization values, 25% and 50%. These settings drop one input in every four (25%) and two inputs in every four (50%) from each pair of convolution and pooling layers. The fully connected layers use the ReLU and Softmax activation functions to decide whether a neuron fires. The experimental setting of the proposed model is given in Figure 5.
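To make the layer sizes concrete, the sketch below traces the spatial shape through two convolution and pooling pairs. It assumes a 128 × 128 × 3 input, 3 × 3 convolutions with stride 1 and no padding, and 2 × 2 pooling with stride 2; the padding and stride values are assumptions, since they are not stated in the text, and the function names are hypothetical.

```python
def conv_out(size, kernel=3, padding=0, stride=1):
    """Output size of a convolution along one spatial dimension."""
    return (size - kernel + 2 * padding) // stride + 1


def pool_out(size, window=2, stride=2):
    """Output size of a pooling layer along one spatial dimension."""
    return (size - window) // stride + 1


def trace_shapes(input_size=128, channels=3, filters=(32, 64)):
    """List the tensor shape after each layer of the assumed network."""
    shapes = [("input", input_size, input_size, channels)]
    size = input_size
    for index, count in enumerate(filters, start=1):
        size = conv_out(size)
        shapes.append((f"conv{index}", size, size, count))
        size = pool_out(size)
        shapes.append((f"pool{index}", size, size, count))
    shapes.append(("flatten", size * size * filters[-1]))
    return shapes


shapes = trace_shapes()
```

Under these assumptions the shapes run 128 → 126 → 63 → 61 → 30, so the flattened vector passed to the fully connected layers has 30 × 30 × 64 = 57,600 values.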
The system was trained for one hundred epochs with the RMSProp optimizer and a cost function based on categorical cross entropy. Because it converged well before 100 epochs, the weights were stored with the system for use in the next phase.
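The categorical cross-entropy cost can be illustrated with a small sketch. It assumes a one-hot target vector and a vector of predicted class probabilities; the function name, the three-class example, and the clipping constant eps are illustrative choices.

```python
import math


def categorical_cross_entropy(target, predicted, eps=1e-12):
    """Cross entropy between a one-hot target and predicted class probabilities."""
    # Clip probabilities away from zero so that log() is always defined.
    return -sum(t * math.log(max(p, eps)) for t, p in zip(target, predicted))


target = [0.0, 1.0, 0.0]
loss_confident = categorical_cross_entropy(target, [0.05, 0.90, 0.05])
loss_unsure = categorical_cross_entropy(target, [0.40, 0.30, 0.30])
```

The loss is small when the probability assigned to the correct class is high and grows as that probability falls, which is what the optimizer minimizes during training.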
The system shows promising test accuracy with minimal loss rates in the next phase (the testing phase). The loss rate decreased further when augmented images were used, while the accuracy remained almost the same. Each new image in the testing phase was preprocessed before being fed to the model. The size of the vector generated by the proposed system is 10, where one of the values is 1 and all other values are 0, denoting the predicted class of the given data. The system is then linked to its signature step, in which a hand sign is converted to Arabic speech. This process is completed in two phases. In the first phase, the hand sign is translated to an Arabic letter with the help of a translation API (Google Translator). In the second phase, the generated Arabic text is converted into Arabic speech using Google Text To Speech (GTTS).
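The step from the output vector to a letter can be sketched as picking the position of the largest value and looking it up in a label list. The function name decode_prediction and the three transliterated labels are hypothetical stand-ins; the translation and speech services are not called here.

```python
def decode_prediction(output_vector, labels):
    """Return the label whose position holds the largest value in the output vector."""
    if len(output_vector) != len(labels):
        raise ValueError("output vector and label list must have the same length")
    best_index = max(range(len(output_vector)), key=lambda i: output_vector[i])
    return labels[best_index]


labels = ["alif", "ba", "ta"]  # hypothetical subset of class labels
letter = decode_prediction([0, 1, 0], labels)  # "ba"
```

The decoded label is what would then be handed to the translation and text-to-speech phases described above.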
The system was built with different combinations of hyperparameters in order to achieve the best results, which are presented in Table 1. It was also found that adding further convolution layers was not suitable, so this was avoided. Figure 6 presents the loss and accuracy curves for training and validation without and with image augmentation for a batch size of 128. Prior to augmentation, the validation accuracy curve was below the training accuracy curve; after augmentation, both the training accuracy and the validation loss decreased. The graph shows that the model is neither overfitted nor underfitted.
The confusion matrix (CM) presents the performance of the system in terms of correct and incorrect classifications. The CMs of the test predictions without and with image augmentation (IA) are shown in Table 2 and Table 3, respectively.
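A confusion matrix can be computed from true and predicted class indices as in the sketch below. The three-class toy labels are invented for illustration and are not results from the paper; rows correspond to true classes and columns to predicted classes.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Count how often each true class (row) is predicted as each class (column)."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def accuracy(matrix):
    """Fraction of samples on the diagonal, i.e. correctly classified."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(true_labels, predicted_labels, num_classes=3)
# matrix == [[1, 1, 0], [0, 2, 0], [1, 0, 1]]
```

Off-diagonal entries show which classes are confused with each other, and the diagonal sum divided by the total gives the overall accuracy.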

Conclusion
The main objective of this work was to propose a model that helps people with speech disorders enhance their communication using Arabic sign language and minimizes the implications of sign languages. The model can also be used effectively in hand gesture recognition for human-computer interaction. Although the model is at an early stage, it is efficient at correctly identifying the hand signs and converting them into Arabic speech, with 90% accuracy. To further increase the accuracy and quality of the model, more advanced hand gesture recognition devices such as Leap Motion or Xbox Kinect can be considered; increasing the size of the dataset and publishing it are also planned as future work. After recognizing the Arabic hand sign-based letters, the proposed system also produces Arabic audio as output. The proposed tool is found to be successful in addressing an essential and undervalued social issue and presents an efficient solution for people with hearing disabilities.

Data Availability
The data used to support the findings of this study are included within the article.

Conflicts of Interest
The authors declare that they have no conflicts of interest.