Machine Learning Approach for Gesture Based Arabic Sign Language Recognition for Impaired People

Mahyudin Ritonga (  mahyuritonga@gmail.com ) Muhammadiyah University of Sumatera Utara: Universitas Muhammadiyah Sumatera Utara Rasha M.Abd El-Aziz Jouf University Varsha Dr. Ct. institute of Engineering Maulik Bader Alazzam Ajloun National Private University Fawaz Alassery Taif University NagaJyothi A. Vignan's Institute of Information Technology Yousef Methkal Abd Algani College of Sakhnin for Teacher Education Kiran Bala B. K Ramakrishnan College of Engineering Balaji S Panimalar Engineering College


Introduction
Sign language refers to the hand gestures or signs that help a person express the expressions, words, and letters of a particular language.
Communication could be made easier by systems that recognize hand gestures and signs (Shahin and Sultan 2019). The interaction between machines and humans provides huge benefits for improving sign language recognition techniques. Building sign language datasets is highly important for constructing automated systems. Sign language recognition is a challenging process because primary features such as hand configuration and movement are difficult to represent (Escobedo et al. 2019). The approaches used in sign language recognition fall into three types: the hybrid-based approach, the vision-based approach, and the sensor-based approach. The difficulties in these approaches arise from the integration of hardware sensors. Traditional machine learning techniques are employed in computer vision systems for recognition, detection, feature extraction, segmentation, and preprocessing; however, these systems relied on limited datasets, which restricts generalization and robustness. Deep learning algorithms have been found to be highly efficient, and the Convolutional Neural Network (CNN) is among the best tools implemented for various recognition processes. The proposed system employs a CNN-based machine learning technique using wearable sensors for recognizing sign gestures of the Arabic sign language.
Literature Review
Bencherif et al. (2021) built a new Arabic Sign Language (ArSL) video-based sign database as part of an online sign recognition system, where the inverse efficiency score was used to determine the optimal successive frames used in recognition decisions. The best results were obtained in the dependent mode for both static and dynamic signs. Hassan et al. (2019) presented an efficient comparison between two recognition methods, the Hidden Markov Model (HMM) and the modified k-th Nearest Neighbor (m-KNN), contemplated on two unique toolkits. Glove-based datasets and ArSL datasets were introduced in their study. The classification accuracy obtained for the collected sign sentences was found to be equivalent to the accuracy obtained using the sensor gloves, and only a minor difference was observed between the two classification techniques.
Al-Hammadi et al. (2020) proposed a novel system for dynamic hand gesture recognition using various deep learning architectures for sequence feature recognition and globalization, global and local feature representations, and hand segmentation. Their results indicated that the system outperformed other sign language recognition approaches, demonstrating its efficiency. Mustafa (2021) reviewed sign language recognition systems built on various classifier approaches, focusing mainly on deep learning methods and on Arabic sign language recognition systems, with the objective of finding the best classifier model for sign language recognition; the review found deep-learning classifiers to be the best models for this task. Kamruzzaman (2020) developed a system that automatically detects hand sign letters and converts them into Arabic speech, obtaining 90% accuracy. His system was found efficient in recognizing Arabic sign letters with the application of deep learning models; however, more advanced systems are recommended for improving its accuracy.

The Proposed System
The proposed system comprises the data gloves for acquiring the dataset of the Arabic Sign Language (ArSL) sign letters, a Convolutional Neural Network for classification of the features extracted from the dataset, and an evaluation stage, which involves the conversion of the ArSL hand gestures into speech, the output of the proposed system. The following figure indicates the block diagram of the proposed system.
Wearable sensors are devices embedded into materials that can be worn on the body comfortably. Wearable sensors take measurements in direct contact with the human hands. These sensors come in the form of gloves, position-tracking instruments, etc. In the proposed system, DG5-V Hand data gloves are used for capturing the hand movements in the dataset. The glove comprises a software module, a microcontroller unit, and a communication unit.

ArSL Dataset
The Arabic Sign Language dataset comprises 30 sign letters of the Arabic sign language and is used for training the proposed system. The 30 sign letters of the Arabic Sign Language (ArSL) used in the proposed system are shown below.

CNN Classification
The proposed system uses a CNN-based machine learning technique. The CNN classifier extracts the features of the ArSL dataset by processing the input images through convolutional layers, which are designed to produce a feature map. A Convolutional Neural Network (CNN) is a system that uses perceptron-style machine learning algorithms to implement its data analysis functions.

Evaluation
The proposed model is evaluated on 30 sign letters of the Arabic sign language. The images of these sign letters were used for training the Convolutional Neural Network. The dataset is divided in an 80:20 ratio, with 80% for the training phase and 20% for the testing phase. The proposed system achieves an accuracy of 90%, which is high compared to previous studies.

Materials And Methods
In the proposed system, a wearable sensor-based approach is employed, which uses a Convolutional Neural Network (CNN)-based machine learning technique for recognition of the Arabic Sign Language (ArSL). The wearable sensor-based recognition method processes the data gathered from the hand data gloves embedded with the wearable sensors. Cyber gloves, data gloves, and power gloves are commonly employed for recognizing the Arabic Sign Language; in the proposed method, data gloves are used. These data gloves provide information on finger bending, hand orientation, movement, rotation, and position. Multiple features can be extracted from the data acquired from the gloves, and these features can be used with a suitable classifier for recognizing the ArSL. The classifier used for this purpose in the proposed system is the Convolutional Neural Network (CNN).
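The glove-to-feature flow described above can be sketched as a small data structure. This is a minimal illustration only: the paper does not specify the glove's channel names, units, or counts, so every field below (five bend values, roll/pitch/yaw, xyz position) is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class GloveSample:
    """One reading from a data glove (hypothetical field layout).
    The paper states the gloves report finger bending, hand
    orientation, movement, rotation and position; the channel
    names and units here are assumptions for illustration."""
    finger_bend: List[float]   # one flex value per finger, e.g. 0.0-1.0
    orientation: List[float]   # roll, pitch, yaw in degrees (assumed)
    position: List[float]      # x, y, z of the hand (assumed units)

def feature_vector(sample: GloveSample) -> List[float]:
    """Concatenate all channels into one flat feature vector
    that a downstream classifier could consume."""
    return sample.finger_bend + sample.orientation + sample.position

sample = GloveSample(
    finger_bend=[0.1, 0.8, 0.9, 0.85, 0.2],
    orientation=[10.0, -5.0, 90.0],
    position=[12.0, 3.5, 40.0],
)
vec = feature_vector(sample)
print(len(vec))  # 5 + 3 + 3 = 11 features
```

In practice the glove's SDK would populate such a record; the sketch only shows how heterogeneous channels become a single feature vector.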

Data Preprocessing
The initial step in building an efficient deep learning model is data preprocessing. This step is used for transforming the raw data into an efficient and useful format. The following figure indicates the flow of processes in the data preprocessing step.

Raw images
The raw images of the proposed system are the hand sign images obtained using wearable sensors. These images are captured by altering the distance and size of the object, with good focus and quality, under varied lighting conditions, and from various angles. These raw images are created to form the dataset for the training and testing phases. The raw images of the Arabic Sign Language (ArSL) used in the proposed system are indicated in Figure (1).

Classification of images
In the proposed system, the images are classified into the 30 letters of the Arabic alphabet, which form 30 categories. One subfolder is created to store the images of each category. All the subfolders that encapsulate the classes are placed together in one main folder, called the dataset, in the system presented here.
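The one-subfolder-per-class layout above means the label of every image can be read straight from its path. A minimal sketch, assuming hypothetical folder and file names (the paper does not list the actual letter-folder names):

```python
from pathlib import PurePosixPath

# Hypothetical dataset layout mirroring the description:
# one subfolder per letter class inside a single "dataset" folder.
paths = [
    "dataset/alef/img_001.png",
    "dataset/alef/img_002.png",
    "dataset/baa/img_001.png",
    "dataset/taa/img_007.png",
]

def label_of(path: str) -> str:
    """The class label is simply the name of the subfolder
    that holds the image."""
    return PurePosixPath(path).parent.name

labels = [label_of(p) for p in paths]
print(labels)  # ['alef', 'alef', 'baa', 'taa']
```

This path-derived label list is also what the later file-name-and-label list for training/testing would be built from.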

Formatting of images
The hand sign images are usually of unequal size and have various backgrounds. It is therefore essential to erase the unnecessary data from these images to extract the hand section. The extracted images are then resized to a pixel size of 128*128 and converted into RGB.
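The resizing step can be illustrated with a tiny nearest-neighbour implementation. This is a stand-in sketch only: a real pipeline would use an image library such as Pillow or OpenCV, and the interpolation method used by the paper is not stated.

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2D image given as a list of
    rows. Illustrates the idea behind the 128x128 resizing step;
    production code would use an image library instead."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]

# A tiny 2x2 "image" scaled up to 4x4:
img = [[0, 1],
       [2, 3]]
big = resize_nearest(img, 4, 4)
print(big[0])               # [0, 0, 1, 1]
print(len(big), len(big[0]))  # 4 4
```

Calling `resize_nearest(hand_crop, 128, 128)` on each channel would bring every extracted hand image to the uniform size the CNN expects.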

Dividing images for training and testing
Each of the 30 Arabic letters has 125 images. The dataset of these Arabic letters is divided into two parts: a dataset for learning (training) and a dataset for testing. The datasets are divided in the ratio 80:20, where 80% forms the training dataset and 20% the testing dataset. For every hand sign, 100 images are allotted to the training set and 25 images to the testing set.
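The per-letter 80:20 split can be sketched as follows. The file names and the fixed seed are assumptions for reproducibility of the example; the paper does not describe how (or whether) the images were shuffled before splitting.

```python
import random

def split_80_20(items, seed=0):
    """Shuffle one class's images and split them 80:20, matching
    the text's 100 train / 25 test out of 125 images per letter."""
    rng = random.Random(seed)   # fixed seed only to keep the sketch reproducible
    items = items[:]
    rng.shuffle(items)
    cut = int(len(items) * 0.8)
    return items[:cut], items[cut:]

images = [f"sign_{i:03d}.png" for i in range(125)]  # hypothetical names
train, test = split_80_20(images)
print(len(train), len(test))  # 100 25
```

Running this once per letter folder yields the 30 x 100 training images and 30 x 25 testing images described above.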
Real-time data is usually unpredictable and inconsistent because of the existence of many transformations such as moving, rotating, etc. Image augmentation is used to enhance the performance of the deep network. Artificial images are created with multiple processing methods such as rotation, shear, flips, and shifts. With the image augmentation technique, the proposed system's images are rotated randomly by up to 360 degrees.
A few images are randomly sheared in the range of 0.2 and horizontally flipped. The following table indicates the accuracy and loss with and without image augmentation. A list encapsulating all the images placed in a unique folder must be created to obtain the file names and labels; this is useful for generating the training and testing data needed for implementing the proposed system.

Convolution layer
At the threshold condition, the activation function equals 0. This activation function is highly reliable but fragile in its functioning; the learning rate can be adjusted to overcome this drawback. The stride is the step performed by the convolution filter at each move. The stride size is usually equal to 1, which indicates that the convolution filter moves pixel by pixel. When the stride size is increased, the convolution filter moves over the input at larger intervals, so less overlapping occurs in the cells. The input size is always greater than the size of the feature map, and the shrinking of the feature map should be prevented; thus, a padding technique is used. Layers of zero-valued pixels are added around the inputs to prevent the feature map from shrinking. The padding process helps keep the spatial dimension constant after convolution, so that for suitable stride and kernel settings the output size equals the input size, which improves the overall system performance. Three essential parameters within the CNN require adjustment to modify the convolution layer's behavior: padding, stride, and filter size. The output size of the convolution layer is calculated as follows:

Output size = 1 + (input size - filter size + 2 * padding amount) / stride size .......... (1)

Here, output size indicates the output size of the convolution layer, input size the size of the input image, filter size the size of the filter, padding amount the amount of padding, and stride size the size of the stride.
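As a quick sanity check, the convolution output-size relation can be evaluated in code. This sketch assumes the standard form of the formula (with the +1 term, consistent with the pooling formula in equation (2)); the 128-pixel input and 3x3 kernel match values stated elsewhere in the paper.

```python
def conv_output_size(input_size, filter_size, padding, stride):
    """Standard convolution output-size relation:
    output = 1 + (input - filter + 2 * padding) / stride."""
    return 1 + (input_size - filter_size + 2 * padding) // stride

# 128x128 input, 3x3 kernel (as used in the proposed system),
# no padding, stride 1:
print(conv_output_size(128, 3, 0, 1))  # 126

# "Same" padding of 1 keeps the spatial dimension constant:
print(conv_output_size(128, 3, 1, 1))  # 128
```

The second call illustrates the paper's point that padding retains the spatial dimension after convolution.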

Pooling layer
The pooling layer is inserted between convolution layers. The important purpose of the pooling layer is to steadily decrease the dimensions and computations using fewer parameters. It helps in regulating the training time and the over-fitting of the layers. Max pooling is a commonly employed pooling technique and is therefore included in the proposed system as well. It takes the largest value in each window, thus minimizing the feature map size while retaining the needed information. The window sizes must be specified in advance to determine the output size of the pooling layer, which is calculated as follows:

Output size = 1 + (input size - filter size) / stride size .......... (2)

In all situations, the pooling layer offers translation invariance, which preserves the identity of a specific object with respect to its visibility within the frame.
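The max-pooling operation and equation (2) can be illustrated with a minimal pure-Python sketch on a list-of-rows image (the paper's actual implementation would use a deep learning framework's pooling layer):

```python
def max_pool_2x2(img):
    """2x2 max pooling with stride 2 on a 2D list-of-rows image.
    By equation (2): output = 1 + (input - 2) / 2, i.e. half the
    input size for even inputs."""
    h, w = len(img), len(img[0])
    return [
        [max(img[r][c], img[r][c + 1], img[r + 1][c], img[r + 1][c + 1])
         for c in range(0, w, 2)]
        for r in range(0, h, 2)
    ]

img = [[1, 3, 2, 0],
       [4, 2, 1, 1],
       [0, 1, 5, 6],
       [2, 2, 7, 8]]
pooled = max_pool_2x2(img)
print(pooled)  # [[4, 2], [2, 8]]
```

Each output cell keeps only the largest activation of its 2x2 window, halving both spatial dimensions while retaining the strongest responses.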

Classification of features
The next eminent CNN component is classification. The classification part comprises a few layers that are connected with each other, called fully connected (FC) layers. The neurons within these layers are connected to all the activations of the previous layer. The FC layer assists with the mapping between the input and the output, and its function is similar to that of a neural network. The fully connected layer accepts only one-dimensional data; to convert the 3D data into one dimension, Python's flattening function is utilized in the proposed system.
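The 3D-to-1D flattening step can be shown with a small pure-Python equivalent. The paper presumably uses a framework's flatten utility; this sketch only demonstrates the transformation the FC layer requires.

```python
def flatten(volume):
    """Flatten a 3D feature volume (channels x height x width,
    given as nested lists) into the 1D vector an FC layer expects."""
    return [v for channel in volume for row in channel for v in row]

# A tiny 2-channel 2x2 feature volume:
volume = [[[1, 2],
           [3, 4]],
          [[5, 6],
           [7, 8]]]
flat = flatten(volume)
print(flat)       # [1, 2, 3, 4, 5, 6, 7, 8]
print(len(flat))  # 2 * 2 * 2 = 8
```

The flattened vector length is simply the product of the volume's three dimensions, which fixes the input width of the first FC layer.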

Results And Discussion
The testing of the proposed system is done using two convolution layers, each followed by a 2×2 maximum pooling layer. The convolution layers exhibit different structures: the first layer has 32 kernels and the second layer has 64 kernels, but the kernel size in both layers is 3×3. Each pair of convolution and pooling layers is regularized with two different dropout values: 25% for the convolution layer and 50% for the pooling layer. This erases one input out of four in the convolution layer and two inputs out of four in the pooling layer. The FC layers use the ReLU and Softmax activation functions to determine whether a neuron fires or not. Training was done for 100 epochs using an RMSprop optimizer with a cost function based on categorical cross-entropy; since convergence occurred before 100 epochs, the system stored the weights to be used in the following phase. In the testing phase, the proposed system presented optimistic test accuracy with low loss rates; these loss rates could be further reduced using augmented images while keeping the accuracy constant. Every new image is processed in the testing phase before it is fed into the proposed system. The vector obtained from the proposed system has size 10; one of the ten values equals one, and its position indicates the predicted class of the data. The system is then connected with the signature part, which involves the conversion of hand signs into speech in the Arabic language. This process has two phases. The initial phase is the translation phase, which translates the hand signs into Arabic letters; this is done with the Google Translator API.
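The two activation functions named above can be written out directly. This is a minimal pure-Python sketch of ReLU and Softmax as commonly defined, not the paper's framework code; the input scores are arbitrary example values.

```python
import math

def relu(x):
    """ReLU: passes positive activations through and outputs 0 at or
    below the threshold, i.e. it decides whether a neuron 'fires'."""
    return max(0.0, x)

def softmax(scores):
    """Softmax: turns the FC layer's raw scores into class
    probabilities. Subtracting the max keeps exp() numerically stable."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), relu(3.0))   # 0.0 3.0
probs = softmax([1.0, 2.0, 3.0])
print(round(sum(probs), 6))    # 1.0
print(probs.index(max(probs))) # 2  (the highest score wins)
```

The index of the largest softmax probability plays the role of the "1" position in the one-hot output vector the system produces.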
Better results are achieved in the system by building different combinations of the convolution and pooling layers. For the convolution layers with dropout values of 50% and 25% and with 64 and 32 kernels, 83% accuracy is attained and the validation loss is 0.84. Convolution layers with batch sizes of 128 and 64 are also tested in the proposed system. The accuracy rate could be raised with the help of image augmentation; for the convolution layer with batch size 128, the validation loss was found to be 0.50. These results are indicated in Table 1. No additional convolutional layers were added to the system. The confusion matrix indicates the proposed system's performance based on right or wrong classifications. The confusion matrices in the presence and absence of image augmentation are indicated in Tables 2 and 3 respectively. [Tables 2 and 3: 25-class confusion matrices (predicted vs. actual), predominantly diagonal, e.g. 18 correct predictions for class 1 and 20 for class 2; the full matrices could not be recovered from the extracted text.] Hand gestures are used for computer-human interactions. The proposed model is in its early development stage, yet it is highly efficient in correctly identifying the gestures and hand digits, which are converted into Arabic speech with accuracy higher than 90%.
The quality and accuracy of the proposed model could be further improved by using highly advanced recognition devices such as the Xbox Kinect or Leap Motion. The dataset size could also be increased in future research studies. The output of the proposed system is speech in the Arabic language, obtained from the recognition of the Arabic sign language. The system proposed here would thus be a good hearing remedy for hearing-impaired people.
Declarations

Figure 1: Block Diagram of the Proposed System Architecture of ArSL Recognition with CNN