Sign Language Recognition

Abstract: Sign language is the primary means by which deaf and hard-of-hearing people share information within their community, and communicating with a hearing-impaired person remains a major challenge. Sign language recognition has been studied for years as a way to bridge this gap. Unfortunately, existing research has limitations and is not yet usable commercially; some systems recognize sign language successfully but are too expensive to deploy for commercial purposes. Electronic sign language recognition covers the whole pipeline, from signal acquisition to the production of text or speech. Gestures can be classified as static and dynamic; static gesture detection is easier than dynamic gesture recognition, but both kinds of recognition system matter for human society. Indian sign language has 26 symbols corresponding to the alphabet. The proposed algorithm gives about 95% accurate results on these alphabets, with images captured at all feasible angles and distances and roughly five pictures per letter, and it works accurately for 45 input types. This paper reviews sign language recognition methods, identifies the most effective methods used by researchers, and suggests directions for future research in this area.


I. INTRODUCTION
Communication skills are essential for every human to share thoughts and ideas. The word communication derives from the Latin communicare, meaning to share. Communication includes the sharing of thoughts, knowledge, messages, or any other kind of information through speech, writing, visual signs, or behavior, and it is complete when the receiver receives and understands the sender's message. Ordinary people can communicate their thoughts through speech, but for hearing-impaired people communication is a major problem, and sign language is their principal means of communication. In daily life, around 500,000 to 2,000,000 hearing-impaired people express their thoughts through sign language. Many sign languages have been invented for hearing-impaired people, and among them American Sign Language (ASL) is the most widely used and most often referenced.

II. CONVOLUTIONAL NEURAL NETWORKS
Convolutional neural networks (CNNs) are artificial-intelligence models for image processing that use deep learning to perform both generative and descriptive tasks, frequently in machine-vision applications such as image and video recognition, as well as in recommender systems and natural language processing (NLP). A neural network is a system of hardware and/or software patterned after the operation of neurons in the human brain. Conventional neural networks are not well suited to image processing and require the image resolution to be reduced. A CNN arranges its "neurons" like the frontal lobe, the region responsible for processing visual stimuli in humans and other animals; the layers of neurons are arranged to cover the entire visual field, avoiding the piecemeal image processing of conventional networks. A CNN uses a system much like a multilayer perceptron that has been designed for reduced processing requirements. The layers of a CNN comprise an input layer, an output layer, and hidden layers that include convolutional layers, pooling layers, fully connected layers, and normalization layers. Removing these constraints and increasing efficiency for image processing yields a system that is considerably more effective and simpler to train for image processing and natural language processing [1].

III. LITERATURE REVIEW
This survey describes existing theory and research on sign language recognition, focused on American Sign Language. ASL has thirty-six gestures used for communication, and different algorithms have been proposed to recognize them. Communication plays a vital role in human-to-human interaction, allowing people to express themselves, and different methodologies have been used to recognize these gestures.
Zafer Ahmed Ansari carried out research on Indian sign language categorization. By including fingerspelled numbers and common phrases as well as alphabets, he classified 150 classes. A Kinect sensor was used to collect the dataset, capturing RGB images at 640x480 resolution along with depth data; in the depth data file, each pixel of the image stores a depth value. Because the collected images show subjects with outstretched hands, Ansari identified the hand as the region of least depth, masking out pixels whose depth exceeds a certain threshold. He encountered problems during gesture identification and was unable to train the data using neural networks, so he instead used unsupervised learning with the k-means algorithm; points of local maxima were used to initialize k-means so as to detect body parts such as the torso and hands. The SIFT algorithm was used to extract features and train the data set.
Divya Deora also recognized sign symbols using principal component analysis (PCA) combined with neural networks. The data were collected with a 3-megapixel camera, and the results were unsatisfactory owing to the poor camera quality and the small number of samples: only 15 images were collected and stored for each sign. After collection, the images were segmented, and a simple boundary-pixel analysis was performed by separating the image into its RGB components. They obtained an accuracy of 60%. Although combining PCA with fingertip analysis produced a result, the accuracy could have been higher had neural networks been used more fully.
To recognize sign language, Lionel et al. carried out the process in two steps: feature extraction and gesture classification, using neural networks and a Microsoft Kinect. They collected 20 different Italian gestures signed by 27 people from their surroundings. The Kinect was used to record the videos, giving a total of 6,600 images, of which 4,000 were used for training and the remainder for testing and validation. Only one side was used for learning the model, and max pooling was used for feature extraction. The model produces two outputs, which are combined and fed to an ANN. Data preparation was handled by the CPU while the model was trained on the GPU. The disadvantage of this approach is the relatively low accuracy of the trained model, about 72.5%.
Emil M. P. et al. proposed using Haar-like features for image description and AdaBoost for classification. This approach has several disadvantages: the AdaBoost classifier requires a large number of features, and skin-colour detection is difficult. Its only advantage is fast computation, so this process was not preferred.
FACS, the Facial Action Coding System, was developed by Ekman and Friesen to code facial expression movements, which are described using action units (AUs). Each AU is given some muscular basis, and the coding is carried out with a set of prescribed rules; the main disadvantage of this process is that it is very time-consuming. Another researcher, Mase, continued this work and used optical flow to recognize facial movements, but the resulting accuracy was very low.
Based on this survey, we decided to identify sign gestures using neural networks. First, the data set is collected and skin filtering is applied. After skin filtering, the hand is cropped, and the cropped image is converted into a binary image. Features are then extracted from the binary image and classified into different classes. This whole part constitutes the training of the data set, performed using recurrent neural networks (RNNs), after which the testing of the data set is completed.

IV. SIGN LANGUAGE DATA SET
We created a data set of the American Sign Language alphabet plus three extra gestures: Space, Nothing, and Delete. We developed code to capture images of each gesture under different lighting conditions and at different distances from the camera, using OpenCV to capture the images and save them to a prescribed path. The images were captured with a laptop camera at 200 x 200 pixels (width and height of 200 pixels each), with horizontal and vertical resolutions of 96 dpi and a bit depth of 24. For each of the 29 gestures, we captured 3,000 images, for a total of 87,000 images.
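The capture step described above can be sketched as follows. This is a minimal sketch, not the authors' exact code: the folder layout, file naming, and the `image_path`/`capture_gesture` helpers are illustrative assumptions, and OpenCV is imported lazily so the path helpers remain usable without a camera.

```python
import os

# 26 letters plus the three extra gestures described in the paper.
GESTURES = [chr(c) for c in range(ord("A"), ord("Z") + 1)] + ["space", "nothing", "delete"]
IMAGES_PER_GESTURE = 3000

def image_path(root, gesture, index):
    """Build the save path for one captured frame, e.g. data/A/A_0.jpg (naming is hypothetical)."""
    return os.path.join(root, gesture, f"{gesture}_{index}.jpg")

def capture_gesture(root, gesture, count=IMAGES_PER_GESTURE, size=(200, 200)):
    """Capture `count` frames of one gesture from the default camera.
    cv2 is imported lazily so the helpers above stay testable without OpenCV."""
    import cv2
    os.makedirs(os.path.join(root, gesture), exist_ok=True)
    cam = cv2.VideoCapture(0)
    saved = 0
    while saved < count:
        ok, frame = cam.read()
        if not ok:
            break
        frame = cv2.resize(frame, size)  # 200 x 200 pixels, as in the data set
        cv2.imwrite(image_path(root, gesture, saved), frame)
        saved += 1
    cam.release()
    return saved
```

With 29 gestures at 3,000 frames each, this loop yields the 87,000-image data set described above.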

V. IMPLEMENTATION
With the ideas gathered above, we now describe the implementation. We execute two separate programs: one for the RGB data set and one for the thresholded data set.

A. Loading of the Data set
The previously created data set is loaded into the system, with each image resized to 64 x 64 pixels. In one program, code converts the resized data set from BGR to RGB and splits it into training and testing data. In the other program, code converts the images from RGB to thresholded (binary) form.
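The two conversions can be sketched in NumPy (the real pipeline would use the equivalent OpenCV calls, noted in the comments); the threshold value of 128 and the unshuffled split are illustrative assumptions.

```python
import numpy as np

def bgr_to_rgb(img):
    """Reverse the channel axis: OpenCV loads images as BGR.
    (Equivalent to cv2.cvtColor(img, cv2.COLOR_BGR2RGB).)"""
    return img[..., ::-1]

def to_threshold(img_rgb, thresh=128):
    """Grayscale via the standard luminance weights, then binarize.
    (cv2.cvtColor + cv2.threshold would do the same in the real pipeline.)"""
    gray = 0.299 * img_rgb[..., 0] + 0.587 * img_rgb[..., 1] + 0.114 * img_rgb[..., 2]
    return (gray > thresh).astype(np.uint8)

def split_data(images, labels, test_fraction=0.05):
    """Split into training and testing sets (95% / 5%, as in the paper)."""
    n_test = int(len(images) * test_fraction)
    return (images[n_test:], labels[n_test:]), (images[:n_test], labels[:n_test])
```

Note that the RGB branch keeps the shape (64, 64, 3) while the thresholded branch reduces each image to a (64, 64) binary array, which is why the two branches end up with different model input dimensions.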

B. Creation of Model
From these two programs, the training data has dimensions (82650, 64, 64, 3) and the testing data (4350, 64, 64, 3) for the RGB-converted images, and dimensions (82650, 64, 64) and (4350, 64, 64) respectively for the thresholded images.
Because the dimensions differ, we use Conv2D [2] to create a model for the RGB-converted images and Conv1D [3] to create a model for the thresholded data set.
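A Keras sketch of the two models is shown below. The paper does not give the exact layer configuration, so the filter counts, kernel sizes, and dense-layer widths here are assumptions; only the input shapes, the Conv2D/Conv1D choice, and the 29 output classes come from the text.

```python
from tensorflow.keras import layers, models

def create_rgb_model(num_classes=29):
    """Conv2D model for (64, 64, 3) RGB inputs; layer sizes are illustrative."""
    return models.Sequential([
        layers.Input(shape=(64, 64, 3)),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])

def create_threshold_model(num_classes=29):
    """Conv1D model: each (64, 64) thresholded image is treated as
    64 time steps of 64 features, so no channel axis is needed."""
    return models.Sequential([
        layers.Input(shape=(64, 64)),
        layers.Conv1D(32, 3, activation="relu"),
        layers.MaxPooling1D(),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
```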

C. Training and Validation of the Created Models
We now have a training set of 82,650 images in both cases. Of these, we use 10% for validation and the rest for training, i.e. 74,385 images for training and 8,265 samples for validation in both cases. We use a batch size of 64 and 5 epochs.
For the RGB-converted images, the whole process took 56 minutes, with each epoch taking 11.2 minutes on average. For the thresholded images, the whole process took 3.7 minutes, with each epoch taking about 44.6 seconds on average. Comparing the accuracy plots of the two data sets, the thresholded plot is more stable: its accuracy increases roughly linearly, whereas the accuracy curve of the RGB-converted data set is not linear but zig-zags, with much larger deflections in its values.

E. Feeding the testing data set to the created model
We have 4,350 test images. All of them are fed into the model to predict the output for the corresponding input gesture. The evaluation metrics for the test data set are evaluation accuracy and evaluation loss.
The greater the evaluation accuracy, the better the model predicts the input given to it. The evaluation metrics for both the RGB-converted and the thresholded data sets are given below. Evaluating the test data takes much less time for the thresholded data set than for the RGB-converted one. The evaluation accuracy for RGB is 100%, even though its accuracy plot shows high fluctuations, whereas the thresholded data set has an evaluation accuracy of 98.69% without large fluctuations in its plot.
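The evaluation step and the mapping from a predicted class index back to a gesture name can be sketched as below; the gesture ordering (A–Z, then Space, Nothing, Delete) is an assumption for illustration.

```python
# Assumed class ordering: 26 letters, then the three extra gestures.
GESTURES = [chr(c) for c in range(ord("A"), ord("Z") + 1)] + ["space", "nothing", "delete"]

def gesture_name(class_index):
    """Map a class index (e.g. the argmax of the softmax output) to its gesture."""
    return GESTURES[class_index]

def evaluate_model(model, x_test, y_test):
    """Return the two metrics used above: evaluation loss and evaluation accuracy."""
    loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
    return loss, accuracy
```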

F. Saving the model
The models created in both cases are saved in JSON format, with the weights stored in an H5 file. Saving the model and weights means that, from the next run onwards, there is no need to recreate the model before giving it input: we can load the model and its weights directly and feed it input to get the output, which saves time on subsequent use.
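The save/load round trip described here can be sketched with the standard Keras JSON + HDF5 calls; the file names are illustrative (newer Keras versions require weight files to end in `.weights.h5`, which older versions also accept).

```python
from tensorflow.keras.models import model_from_json

def save_model(model, json_path="model.json", weights_path="model.weights.h5"):
    """Serialize the architecture to JSON and the weights to an H5 file."""
    with open(json_path, "w") as f:
        f.write(model.to_json())
    model.save_weights(weights_path)

def load_model(json_path="model.json", weights_path="model.weights.h5"):
    """Rebuild the model from the JSON file and restore its saved weights."""
    with open(json_path) as f:
        model = model_from_json(f.read())
    model.load_weights(weights_path)
    return model
```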

G. Accessing the model with live input
The saved model and weights are loaded back into the system. We again use OpenCV to capture images: the region of interest (ROI) is cropped, the image is resized to 64 x 64 pixels and fed to the model using both the RGB method and the thresholding method, and the output for the corresponding input gesture is shown on the screen.

VII. RESULTS
These are the outputs we obtained when executing the program with live input to the model. With the thresholded data set, we got the exact output for the given input gestures, whereas with the RGB-converted data set we did not get accurate output.

VIII. CONCLUSION
We used two feature extraction methods. The model trained on the thresholded data achieved an accuracy of 98.69% and predicted the gestures placed in the ROI accurately, whereas the RGB-converted model did not give accurate results for the given input gestures. The thresholding feature extraction technique is therefore the one to use for accurate gesture recognition.

Fig: American Sign Language Gestures

Fig: A CNN sequence to classify handwritten digits

Fig: Screenshot of creation of the data set and saving it to a specified folder

Fig: Report for splitting the converted RGB data set into training and testing sets.

Fig: Report for splitting the converted threshold data set into training and testing sets.

Fig: Summary of the models created using Conv1D and Conv2D, i.e. the thresholding data set and the RGB data set.

Fig: Training and validation report for RGB-converted images

Fig: Training and validation report for thresholding-converted images

Fig: Accuracy plot and loss plot for the thresholding-converted data set

Fig: Testing report for the RGB-converted data set.

VI. ALGORITHM
Algorithm to create and save the model:
STEP 1: Create the data set of the required gestures and save it.
STEP 2: Convert all the data sets from BGR to RGB.
STEP 3: Convert all the data sets from RGB to gray scale.
STEP 4: Convert all the data sets from gray scale to binary.
STEP 5: Create an images list and a labels list, and add all the binary images to the images list.
STEP 6: Convert the images list into arrays.
STEP 7: Divide the data into training (95%) and testing (5%) sets.
STEP 8: Create the Keras model using Sequential by adding the layers.
STEP 9: Create the model by calling the create_model() function.
STEP 10: While training the model, keep some data aside for validation.
STEP 11: Use 10% of the total training data as validation data.
STEP 12: Plot the accuracy and loss graphs for training and validation.
STEP 13: Obtain the evaluation accuracy and loss.
STEP 14: Load the testing data.
STEP 15: Repeat steps 2 to 6 on the testing data.
STEP 16: Feed the test data to the model to predict the output.
STEP 17: Save the model and the weights.
Thus, the model is created and saved.

Fig: Creation of the data set

Fig: Outputs for a few of the inputs given for the thresholding data set.