Automated sign language detection and classification using reptile search algorithm with hybrid deep learning

Sign language recognition (SLR) refers to the ability to convert sign language gestures into spoken or written language. This technology helps deaf and hard-of-hearing persons by providing them with a way to interact with people who do not know sign language. It can also be utilized for automatic captioning in live events and videos. There are distinct methods of SLR comprising deep learning (DL), computer vision (CV), and machine learning (ML). One general approach utilizes cameras for capturing the signer's hand and body movements and processes the video data to recognize the gestures. The challenges of SLR comprise the variability of sign language across cultures and individuals, the difficulty of certain signs, and the requirement for real-time processing. This study introduces an Automated Sign Language Detection and Classification using Reptile Search Algorithm with Hybrid Deep Learning (SLDC-RSAHDL) technique. The presented SLDC-RSAHDL technique detects and classifies different types of signs using DL and metaheuristic optimizers. In the SLDC-RSAHDL technique, the MobileNet feature extractor is utilized to produce feature vectors, and its hyperparameters are adjusted by the manta ray foraging optimization (MRFO) technique. For sign language classification, the SLDC-RSAHDL technique applies a hybrid deep learning (HDL) model, which incorporates the design of a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). At last, the RSA is exploited for the optimal hyperparameter selection of the HDL model, which results in an improved detection rate. The experimental result analysis of the SLDC-RSAHDL technique on a sign language dataset demonstrates the improved performance of the SLDC-RSAHDL system over other existing DL techniques.


Introduction
Sign language is a comprehensive and complex language that comprises signs formed by the actions of the hands in association with facial expressions [1]. It is a natural language employed by individuals with little or no hearing ability for communication. Sign language can be implemented for communicating words, letters, or sentences by employing diverse hand gestures [2]. This kind of communication makes it simple for hearing-challenged individuals to express their opinions and assists in bridging the communication gap between normal and hearing-challenged individuals. People have adapted sign language for communication since ancient times [3]. Hand signs are as old as human civilization itself. Hand gestures are specifically advantageous in expressing any emotion or word to communicate. Hence, humans around the globe regularly employ hand gestures to express themselves despite the creation of writing conventions [4]. Recently, much study has been devoted to emerging systems that are able to classify gestures of diverse sign languages into the provided classes. Such systems have found applications in robot control, natural language communication, virtual reality environments, and games [5]. The automated identification of human gestures is a convoluted multi-disciplinary issue that has not yet been totally resolved. In recent years, a number of methods have been employed that involve the implementation of ML procedures for sign language identification [6]. Since the advent of Deep Learning (DL) methods, there have been attempts to identify human gestures with them.
To identify gestures, diverse aspects like articulated models and hand-crafted spatio-temporal descriptors have been employed together with gesture classifiers; conditional random fields [7], hidden Markov models, and Support Vector Machines (SVM) have been extensively employed. But the categorization of signs is unpredictable under changing illumination conditions, and categorization across diverse subjects is still a challenging issue [8]. An intuitive approach for producing interfaces is to look at the user's muscle activity. A device can record this activity by employing a camera [9]. The recorded imagery can be processed by DL algorithms to determine the gesture. In recent times, categorization with deep convolutional neural network (DCNN) models has been efficient in several identification challenges [10]. Multi-column DCNNs that use several similar networks have been demonstrated to enhance the recognition rates of single networks. This study introduces an Automated Sign Language Detection and Classification using Reptile Search Algorithm with Hybrid Deep Learning (SLDC-RSAHDL) technique. In the SLDC-RSAHDL technique, the MobileNet feature extractor is utilized to produce feature vectors, and its hyperparameters are adjusted by the manta ray foraging optimization (MRFO) system. For sign language classification, the SLDC-RSAHDL technique applies the HDL model, which incorporates the design of a Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM). At last, the RSA is exploited for the optimal hyperparameter selection of the HDL model, which results in an improved detection rate. The experimental result examination of the SLDC-RSAHDL algorithm was executed on a sign language database.

Fig. 1. Overall flow of SLDC-RSAHDL approach.

Literature review
Pandey et al. [11] proposed a novel Feed Forward Neural Network (FFNN) model that can automatically identify sign language to help normal humans communicate more efficiently with visually, hearing, or speech impaired individuals. This scheme recognized the hand gesture via aspect point extraction fed to the FFNN. A hand gesture recognition with voice processing scheme implementing a Hidden Markov Model (HMM) is employed to deliver communication between normal and mute individuals. In Ref. [12], a new outline is suggested for gesture-autonomous sign language identification by employing several DL constructions containing hand semantic segmentation, a Deep Recurrent Neural Network (DRNN), and hand shape factor depiction. Abstracting hand shape aspects is performed by implementing a single-layer Convolutional Self-Organizing Map (CSOM) rather than depending on transfer learning (TL) of pre-trained deep CNNs (DCNNs). The series of abstracted aspect vectors is later identified by implementing a BiLSTM-RNN.
In [13], a two-stream CNN (2S-CNN) framework was suggested to identify American Sign Language (ASL) hand signs founded on multi-modal (RGB and depth) data fusion. Initially, the hand sign information was improved to eliminate the impact of noise and background. Next, hand sign RGB and depth features are abstracted for hand sign detection by corresponding CNNs on the two streams. Lee et al. [14] suggest an ASL learning application model. This application is a whack-a-mole game with an embedded real-time gesture identification scheme. As both dynamic and static gestures (J, Z) are present in the ASL alphabetical system, an LSTM-RNN with a KNN technique is accepted as the categorization technique, founded on the management of a series of inputs. Features like angles amongst fingers, distances amongst finger positions, and sphere radius are abstracted as input for the categorization prototype.
Rastgoo et al. [15] suggest a new DL-founded pipeline construction for effective automatic hand gesture language identification by implementing a 2DCNN, Single Shot Detector (SSD), 3DCNN, and LSTM from RGB input videos. The authors employ a CNN-founded prototype that evaluates the 3D hand keypoints from 2D input segments. Das et al. [16] suggested a fusion prototype comprising deep TL founded on a CNN with an RF categorizer for the automatic identification of Bangla Sign Language (BSL) (numeric and alphabetical symbols). 'Ishara-Bochon' and 'Ishara-Lipi' are both datasets of isolated numeric and alphabetical symbols, corresponding to the initial comprehensive multi-purpose open-access datasets for BSL. Also, the authors suggested a background elimination protocol that eliminates needless aspects from the gesture imageries. The authors of [17] suggest a Fully Convolutional Network (FCN) for online SLR to simultaneously learn temporal and spatial aspects from weakly annotated video series, with only sentence-level explanations provided. A Gloss Feature Enhancement (GFE) segment is presented in the suggested networks to apply better series orientation learning.

The proposed model
In this article, we have introduced a new SLDC-RSAHDL technique for the automated detection and classification of sign language using DL and metaheuristic optimization algorithms. It follows a four-stage process: MobileNet feature extraction, MRFO based hyperparameter tuning, HDL based sign language recognition (SLR), and RSA based parameter tuning. Fig. 1 signifies the overall flow of the SLDC-RSAHDL approach.

Feature extraction using MobileNet
The basic principle of a lightweight model is to develop effective network computation for convolution models that can minimize the number of parameters and the computation time while guaranteeing the detection performance. Sifre first proposed, in 2014, the depth-separable convolution underlying the MobileNet model, which splits the typical convolutional layer into point-wise and depth-wise convolutional layers. This implies that the summation and convolution in the classical convolutional model are divided, such that the computation speed is increased and the number of weight parameters evaluated by the network can be decreased considerably [18].
Consider that the length and width of the output and input are constant: the input is a feature map of length $D_F$, width $D_F$, and $M$ channels, and, given a convolutional kernel of height $D_K$ and width $D_K$, the typical convolution outputs a feature map of length $D_F$, width $D_F$, and $N$ channels. Denoting the output by $G$, the typical convolution can be written mathematically as:

$$G_{k,l,n} = \sum_{i,j,m} K_{i,j,m,n} \cdot F_{k+i-1,\,l+j-1,\,m}$$

The computation of every output element requires a sum over each input channel $m$; depth-separable convolution instead treats each $m$ alone.

The depth-separable convolutional layer splits the classical convolution kernel into summation (point-wise) and convolution (depth-wise) parts. In such cases, the point-wise convolution map has a single parameter, the number of resultant features $N$, whereas the depth convolutional map has three variables: the number of input features $M$, the length $D_K$, and the width $D_K$. The original four parameters are thus split into one and three parameters, and the mathematical model changes to:

$$\hat{G}_{k,l,m} = \sum_{i,j} \hat{K}_{i,j,m} \cdot F_{k+i-1,\,l+j-1,\,m}$$

where $\hat{K}$ represents the convolutional kernel of the depth-wise convolution, while the point-wise convolution applies $1 \times 1 \times M$ kernels across channels.

For these reasons, the number of convolution executions of the depth-separable convolution is evaluated in two stages: initially, $M$ matrices of size $D_K \times D_K$ are moved $D_F \times D_F$ times; next, $N$ convolutional kernels of size $1 \times 1 \times M$ are moved $D_F \times D_F$ times. The overall amount of computation is obtained by adding the two:

$$D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F$$

compared with $D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F$ for the typical convolution. The ratio of the computation of the depth-separable convolutional layer to the typical convolution can be given as follows:

$$\frac{D_K \cdot D_K \cdot M \cdot D_F \cdot D_F + M \cdot N \cdot D_F \cdot D_F}{D_K \cdot D_K \cdot M \cdot N \cdot D_F \cdot D_F} = \frac{1}{N} + \frac{1}{D_K^2}$$

This formula demonstrates that the computation reduction grows with $D_K$ and $N$. Furthermore, the convolution kernels of the depth convolutional layer in MobileNet are $3 \times 3$, so the computation of the depth-separable convolutional layer is 1/8 to 1/9 of that of the typical convolution, thereby accomplishing the goal of enhancing the computational rate of the network structure.
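As a small illustrative sketch (not the authors' code), the operation counts and the ratio above can be computed directly; the concrete values of $D_F$, $D_K$, $M$, and $N$ below are arbitrary assumptions for demonstration.

```python
# Counts multiply-accumulate operations for a standard convolution versus a
# depthwise separable one, using the symbols from the text: D_F (feature-map
# side), D_K (kernel side), M input channels, N output channels.

def standard_conv_cost(d_f: int, d_k: int, m: int, n: int) -> int:
    # Standard convolution: D_K * D_K * M * N * D_F * D_F operations.
    return d_k * d_k * m * n * d_f * d_f

def separable_conv_cost(d_f: int, d_k: int, m: int, n: int) -> int:
    # Depthwise part: D_K * D_K * M * D_F * D_F,
    # pointwise part: M * N * D_F * D_F.
    return d_k * d_k * m * d_f * d_f + m * n * d_f * d_f

def cost_ratio(d_f: int, d_k: int, m: int, n: int) -> float:
    # The ratio simplifies to 1/N + 1/D_K**2, independent of M and D_F.
    return separable_conv_cost(d_f, d_k, m, n) / standard_conv_cost(d_f, d_k, m, n)

# With 3x3 kernels the separable layer needs roughly 1/9 of the standard
# layer's computation, as stated in the text.
print(cost_ratio(d_f=14, d_k=3, m=128, n=256))
```

Because the ratio depends only on $N$ and $D_K$, the saving holds for any feature-map size, which is why the 1/8 to 1/9 figure applies throughout the network.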

Hyperparameter tuning using MRFO algorithm
For the hyperparameter tuning process of the MobileNet algorithm, the MRFO technique is employed. The MRFO algorithm simulates three foraging behaviors for updating the solution position [19]: chain, cyclone, and somersault foraging. The mathematical process for every foraging behavior is described below.

Chain foraging: A foraging chain is developed when manta rays arrange head-to-tail. In each iteration, the optimum solution is utilized for updating every individual, as the following model demonstrates:

$$x_i^d(t+1) = \begin{cases} x_i^d(t) + r\,(x_{best}^d(t) - x_i^d(t)) + \alpha\,(x_{best}^d(t) - x_i^d(t)), & i = 1 \\ x_i^d(t) + r\,(x_{i-1}^d(t) - x_i^d(t)) + \alpha\,(x_{best}^d(t) - x_i^d(t)), & i = 2, \dots, N \end{cases}$$

$$\alpha = 2r\sqrt{|\log(r)|}$$

where $N$ signifies the population size, $r$ denotes a random vector between 0 and 1, $x_i^d(t)$ refers to the $i$th individual's position in the $d$th dimension at the $t$th iteration, $\alpha$ implies the weighting coefficient, and $x_{best}^d(t)$ stands for the plankton with maximal concentration (the optimum solution gained so far).

Cyclone foraging: When the manta rays spot food, they generate a long foraging chain and swim in a spiral toward the food. The following formula defines the cyclone foraging behavior:

$$x_i^d(t+1) = x_{best}^d(t) + r\,(x_{i-1}^d(t) - x_i^d(t)) + \beta\,(x_{best}^d(t) - x_i^d(t))$$

$$\beta = 2e^{r_1 \frac{T-t+1}{T}} \sin(2\pi r_1)$$

in which $\beta$ and $T$ signify the weighting factor and the maximal iteration count correspondingly, and $r_1$ denotes a random value between zero and one. The exploration capability of the algorithm is improved by spiraling around a random position instead of the best one:

$$x_i^d(t+1) = x_{rand}^d + r\,(x_{rand}^d - x_i^d(t)) + \beta\,(x_{rand}^d - x_i^d(t)), \qquad x_{rand}^d = Lb^d + r\,(Ub^d - Lb^d)$$

where $x_{rand}^d$ denotes a random position in the search space, and $Lb^d$ and $Ub^d$ imply the lower and upper limits of the $d$th dimension correspondingly.

Somersault foraging: The food position is considered as a pivot, and every individual swims around the pivot and afterwards somersaults to a novel position:

$$x_i^d(t+1) = x_i^d(t) + S\,(r_2\,x_{best}^d - r_3\,x_i^d(t))$$

where the somersault factor is denoted by $S$, and $r_2$ and $r_3$ signify random numbers between zero and one.
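The three foraging moves above can be sketched as a compact NumPy routine. This is an illustrative implementation on a toy sphere function, not the authors' tuner; the population size, iteration budget, the 0.5 move-selection probability, and $S = 2$ are assumptions for demonstration.

```python
# MRFO sketch: chain, cyclone, and somersault foraging with an elitist best.
import numpy as np

def mrfo(fitness, dim=5, n_pop=20, max_iter=100, lb=-5.0, ub=5.0, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.uniform(lb, ub, (n_pop, dim))
    fit = np.array([fitness(p) for p in x])
    best = x[fit.argmin()].copy()
    for t in range(1, max_iter + 1):
        for i in range(n_pop):
            r = rng.random(dim)
            alpha = 2 * r * np.sqrt(np.abs(np.log(r)))       # chain weight
            ref = best if i == 0 else x[i - 1]               # head-to-tail chain
            if rng.random() < 0.5:                           # cyclone foraging
                r1 = rng.random(dim)
                beta = 2 * np.exp(r1 * (max_iter - t + 1) / max_iter) * np.sin(2 * np.pi * r1)
                if t / max_iter < rng.random():              # explore: random pivot
                    x_rand = rng.uniform(lb, ub, dim)
                    x[i] = x_rand + r * (x_rand - x[i]) + beta * (x_rand - x[i])
                else:                                        # exploit: spiral to best
                    x[i] = best + r * (ref - x[i]) + beta * (best - x[i])
            else:                                            # chain foraging
                x[i] = x[i] + r * (ref - x[i]) + alpha * (best - x[i])
            x[i] = np.clip(x[i], lb, ub)
        for i in range(n_pop):                               # somersault (S = 2)
            r2, r3 = rng.random(dim), rng.random(dim)
            x[i] = np.clip(x[i] + 2 * (r2 * best - r3 * x[i]), lb, ub)
        fit = np.array([fitness(p) for p in x])
        if fit.min() < fitness(best):                        # keep best-so-far
            best = x[fit.argmin()].copy()
    return best, fitness(best)

best, val = mrfo(lambda p: float(np.sum(p ** 2)))
print(val)
```

In the SLDC-RSAHDL pipeline the fitness would be the validation error of MobileNet under a candidate hyperparameter vector rather than the sphere function used here.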

Sign language classification using optimal HDL model
In this work, the classification of signs takes place by the HDL model. CNN provides better accuracy in pattern recognition and classification for two major reasons: primarily, its structure is highly relevant for determining local connections amongst data points; next, it decreases the number of network parameters [20], thus resulting in a lower computational difficulty than a traditional plain neural network architecture. Fig. 2 displays the structure of the CNN. The equation of one standard convolution layer is formulated by Eq. (11):

$$X^{conv} = conv1D(W^{conv}, X) \quad (11)$$

where $X^{conv}$ and $W^{conv}$ correspondingly denote the output vector and weight matrix of the convolutional layer, $X$ indicates the sensor input, and $conv1D$ indicates the 1D convolution operator. The hyperparameters of the convolutional layer are the kernel length $L_k$, representing the count of neighboring data points aggregated, and the number of kernels $N_k$, representing the number of local features extracted. Then, $X^{conv}$ is fed into the LSTM layer, which exploits data at many preceding time steps to gain insight into the current time step, represented as "long-term dependency". Introduce $L$, a classical linear conversion of the concatenation of $X_t^{conv}$ with $N_k$ features at time step $t$ and the hidden state $h_{t-1}$ with $N_h$ features at the prior step:

$$L = W[h_{t-1}; X_t^{conv}] + b \quad (12)$$

In Eq. (12), $W$ and $b$ denote the weight matrix and bias vector; it is noteworthy that the number of features of $L$ is equivalent to that of the hidden output $h$. Each LSTM cell includes three gates, the forget gate $f_f$, input gate $f_i$, and output gate $f_o$, each applying the nonlinear sigmoid function $\sigma$ to a linear conversion $L$:

$$f_f = \sigma(L_f), \qquad f_i = \sigma(L_i), \qquad f_o = \sigma(L_o) \quad (13)$$

At the same time, a novel candidate of data produced at time step $t$ is evaluated by the tanh activation function applied to a linear conversion of the concatenation $[h_{t-1}; X_t^{conv}]$:

$$\tilde{C}_t = \tanh(W_C[h_{t-1}; X_t^{conv}] + b_C) \quad (14)$$

Next, the candidate enters the LSTM cell state:

$$C_t = f_f \odot C_{t-1} \oplus f_i \odot \tilde{C}_t \quad (15)$$

and the hidden output of the LSTM cell at time step $t$ is evaluated at the output gate:

$$h_t = f_o \odot \tanh(C_t) \quad (16)$$

where $\oplus$ and $\odot$ correspondingly represent component-wise addition and multiplication of two vectors. As soon as input data enter the network, they are split into fixed-length segments; the 1DCNN layer then extracts local connections amongst neighboring data points before feeding the memory cells of the LSTM, where long-term dependency is recognized and preserved over time. In this hybrid DL structure, the hyperparameters that need to be further defined are the size of the hidden output $N_h$, the number of kernels $N_k$, and the kernel length $L_k$ of the convolutional layer at all the LSTM cells. Finally, the RSA adjusts the hyperparameter values of the HDL model. The highly coordinated and cooperative hunting method demonstrated by crocodiles, which includes encircling and then attacking the target, has been the inspiration for the reptile search algorithm (RSA) [21].

Fig. 2. Structure of CNN.
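As a minimal sketch (not the authors' implementation), one LSTM cell step following the gate equations above can be written with NumPy; the weight shapes, the packing of the four linear terms into one matrix, and the random inputs are assumptions for demonstration.

```python
# One LSTM cell step: forget/input/output gates, candidate memory, cell and
# hidden state updates, following Eqs. (12)-(16).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_conv_t, h_prev, c_prev, W, b):
    """x_conv_t has N_k features; h_prev and c_prev have N_h features.

    W maps the concatenation [h_{t-1}; x_t] to the four stacked linear terms
    L_f, L_i, L_o, L_c (shape (4*N_h, N_h + N_k)); b has shape (4*N_h,).
    """
    n_h = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_conv_t]) + b   # linear conversion L
    f = sigmoid(z[0 * n_h:1 * n_h])                  # forget gate f_f
    i = sigmoid(z[1 * n_h:2 * n_h])                  # input gate f_i
    o = sigmoid(z[2 * n_h:3 * n_h])                  # output gate f_o
    c_cand = np.tanh(z[3 * n_h:4 * n_h])             # candidate memory
    c = f * c_prev + i * c_cand                      # cell state update
    h = o * np.tanh(c)                               # hidden output
    return h, c

rng = np.random.default_rng(1)
n_k, n_h = 8, 4                                      # illustrative N_k, N_h
W = rng.standard_normal((4 * n_h, n_h + n_k)) * 0.1
b = np.zeros(4 * n_h)
h, c = lstm_step(rng.standard_normal(n_k), np.zeros(n_h), np.zeros(n_h), W, b)
print(h.shape, c.shape)
```

In the hybrid model, `x_conv_t` would be the $N_k$-dimensional output of the 1D convolutional layer at time step $t$ rather than random noise.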
The initialization stage begins with generating a matrix $X$ of random solutions $x_{i,j}$ based on Eq. (17), where $n$ denotes the dimensionality of the specific problem, $i$ represents the index of the individual, $j$ shows its current dimension, and $N$ represents the overall number of individuals.
Eq. (18) produces the random individuals:

$$x_{i,j} = rand \times (UB - LB) + LB \quad (18)$$

Here, $rand$ represents an arbitrary number within the range $[0, 1]$, and $LB$ and $UB$ represent the lower and upper bounds of the search space. The search process is split into two major procedures (encircling the prey, afterwards the attack), accompanied by four distinct behaviors for emphasizing exploration and exploitation. Exploration exploits two walking strategies demonstrated by crocodiles: the stomach walk and the elevated walk. The key objective of the crocodile is to extend the search region and help the subsequent hunting stage. The elevated walk method is used if $t \leq \frac{T}{4}$, whereas the stomach walk is triggered if $t > \frac{T}{4}$ and $t \leq \frac{2T}{4}$. Eq. (19) is accountable for updating the position of the crocodile:

$$x_{i,j}(t+1) = \begin{cases} Best_j(t) \times (-\eta_{i,j}(t)) \times \beta - R_{i,j}(t) \times rand, & t \leq \frac{T}{4} \\ Best_j(t) \times x_{r_1,j} \times ES(t) \times rand, & \frac{T}{4} < t \leq \frac{2T}{4} \end{cases} \quad (19)$$

In Eq. (19), $T$ shows the maximal number of iterations, $Best_j$ represents the present optimum individual at the $j$th position, and $t$ denotes the ongoing iteration. The hunting operator $\eta_{i,j}$ is determined by Eq. (20), where $\beta$ is a sensitive parameter fixed at 0.1, which governs the exploration performance:

$$\eta_{i,j} = Best_j(t) \times P_{i,j} \quad (20)$$
The search space is shrunk by using the reduction function, determined by Eq. (21), where $r_1$ denotes a random integer ranging from 1 to $N$, $x_{r_1,j}$ signifies the location of a random solution at the $j$th position, and $\epsilon$ represents a small value:

$$R_{i,j} = \frac{Best_j(t) - x_{r_1,j}}{Best_j(t) + \epsilon} \quad (21)$$
Eq. (22) evaluates the probability ratio, named "Evolutionary Sense", which arbitrarily alternates in $[-2, 2]$ as the rounds pass by:

$$ES(t) = 2 \times r_2 \times \left(1 - \frac{t}{T}\right) \quad (22)$$

where $r_2$ indicates an arbitrary value inside $[-1, 1]$. Eq. (23) defines the percentage difference between the position of the observed and the best-obtained individual:

$$P_{i,j} = \alpha + \frac{x_{i,j} - M(x_i)}{Best_j(t) \times (UB_{(j)} - LB_{(j)}) + \epsilon} \quad (23)$$

In Eq. (23), $\alpha$ denotes a sensitive variable with the predetermined value 0.1, which controls the fluctuations amongst possible individuals appropriate for cooperative hunting. The corresponding upper and lower boundaries of the $j$th position are indicated as $UB_{(j)}$ and $LB_{(j)}$.

The average location $M(x_i)$ of the $i$th individual is expressed as follows:

$$M(x_i) = \frac{1}{n} \sum_{j=1}^{n} x_{i,j} \quad (24)$$

The RSA exploitation process is divided into hunting coordination (if $t > \frac{T}{2}$ and $t \leq \frac{3T}{4}$) and hunting cooperation (if $t > \frac{3T}{4}$ and $t \leq T$) techniques, which aim to strengthen the local investigation of the search realm closer to the optimum individual. The hunting behavior shown by the crocodile is expressed as:

$$x_{i,j}(t+1) = \begin{cases} Best_j(t) \times P_{i,j}(t) \times rand, & \frac{T}{2} < t \leq \frac{3T}{4} \\ Best_j(t) - \eta_{i,j}(t) \times \epsilon - R_{i,j}(t) \times rand, & \frac{3T}{4} < t \leq T \end{cases} \quad (25)$$

The basic RSA shows a time complexity of $O(N \times (T \times D + 1))$, where $N$ indicates the candidate count, $T$ represents the round count, and $D$ denotes the dimensionality of the solution space. The RSA method uses a fitness function (FF) to obtain a superior classifier result: it returns a positive value whose minimization exemplifies the better performance of a candidate outcome. In this work, the FF is the classifier error rate to be minimized, formulated in Eq. (26):

$$fitness(x_i) = \frac{\text{number of misclassified samples}}{\text{total number of samples}} \times 100 \quad (26)$$
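The four RSA movement phases and an error-rate-style fitness can be sketched as follows. This is an illustrative NumPy version on a toy sphere objective, not the authors' tuner; the population size, iteration budget, bounds, and $\epsilon = 10^{-10}$ are assumptions, while $\alpha = \beta = 0.1$ follow the text.

```python
# RSA sketch: high walking, stomach (belly) walking, hunting coordination,
# and hunting cooperation, with an elitist best-so-far individual.
import numpy as np

def rsa(fitness, dim=5, n_pop=20, max_iter=80, lb=-5.0, ub=5.0,
        alpha=0.1, beta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    eps = 1e-10
    x = rng.uniform(lb, ub, (n_pop, dim))                # Eq. (18)
    fit = np.array([fitness(p) for p in x])
    best = x[fit.argmin()].copy()
    for t in range(1, max_iter + 1):
        es = 2.0 * rng.uniform(-1, 1) * (1 - t / max_iter)   # Eq. (22)
        for i in range(n_pop):
            for j in range(dim):
                r1 = rng.integers(n_pop)
                R = (best[j] - x[r1, j]) / (best[j] + eps)   # Eq. (21)
                M = x[i].mean()                              # Eq. (24)
                P = alpha + (x[i, j] - M) / (best[j] * (ub - lb) + eps)  # Eq. (23)
                eta = best[j] * P                            # Eq. (20)
                rand = rng.random()
                if t <= max_iter / 4:                        # high walking
                    x[i, j] = best[j] * (-eta) * beta - R * rand
                elif t <= max_iter / 2:                      # stomach walking
                    x[i, j] = best[j] * x[r1, j] * es * rand
                elif t <= 3 * max_iter / 4:                  # hunting coordination
                    x[i, j] = best[j] * P * rand
                else:                                        # hunting cooperation
                    x[i, j] = best[j] - eta * eps - R * rand
            x[i] = np.clip(x[i], lb, ub)
        fit = np.array([fitness(p) for p in x])
        if fit.min() < fitness(best):                        # keep best-so-far
            best = x[fit.argmin()].copy()
    return best, fitness(best)

best, val = rsa(lambda p: float(np.sum(p ** 2)))
print(val)
```

In the SLDC-RSAHDL pipeline, `fitness` would be Eq. (26), i.e. the percentage of misclassified validation samples produced by the HDL model under a candidate hyperparameter vector.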

Experimental evaluation
In this section, the SLR performance of the SLDC-RSAHDL technique is studied using the ASL alphabet dataset from the Kaggle repository [22]. The database has a group of images of alphabets in American Sign Language, divided into 29 folders that represent the several classes. Table 1 and Fig. 3 offer a detailed recognition result of the SLDC-RSAHDL technique under the 29 classes. The results indicate that the SLDC-RSAHDL technique performs proficiently in each class. At the same time, it is noticed that the SLDC-RSAHDL technique accomplishes effectual outcomes with an average precision of 99.42 %, recall of 99.43 %, accuracy of 99.51 %, and F-score of 99.43 %.
Table 2 and Figs. 4 and 5 report a brief recognition outcome of the SLDC-RSAHDL approach with other optimizers. The experimental values highlight that the RMSProp and Adam optimizers reached almost equal performance with accuracies of 98.95 % and 98.93 %, respectively. Along with that, the SGD optimizer gains considerable outcomes with an accuracy of 99.28 %, precision of 99.19 %, recall of 99.24 %, and F-score of 99.11 %. However, the SLDC-RSAHDL technique resulted in enhanced performance with an accuracy of 99.51 %, precision of 99.42 %, recall of 99.43 %, and F-score of 99.43 %. Table 3 reports an overall comparison analysis of the SLDC-RSAHDL technique in terms of recognition rate (RR) and computation time (CT) [23]. In Fig. 8, a comparative RR investigation of the SLDC-RSAHDL technique with other models was performed. The results imply that the KNN model resulted in ineffective outcomes with a minimal RR of 97.29 %. At the same time, the SVM and ANN models accomplished considerably enhanced performance with close RRs of 98.31 % and 98.54 %, respectively. Concurrently, the CNN model accomplishes a reasonable RR of 99.12 %. But the SLDC-RSAHDL technique reaches higher performance with an RR of 99.43 %. In Fig. 9, a comparative CT examination of the SLDC-RSAHDL approach with other techniques was performed. The outcomes infer that the KNN system resulted in ineffective outcomes with a maximal CT of 16.84 min. Besides, the SVM and ANN algorithms obtained considerably superior performance with close CTs of 15.10 min and 14.36 min. Finally, the CNN method reaches a reasonable CT of 11.26 min. However, the SLDC-RSAHDL system attains effectual performance with a CT of 6.14 min.
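As a hedged illustration of how such averaged scores are obtained, the following sketch computes macro-averaged precision, recall, accuracy, and F-score from a per-class confusion matrix; the small 3-class matrix below is a made-up example, not the paper's 29-class results.

```python
# Macro-averaged classification metrics from a confusion matrix.
import numpy as np

def macro_metrics(cm):
    """cm[i, j] = number of samples of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    precision = tp / np.maximum(cm.sum(axis=0), 1)   # per-class precision
    recall = tp / np.maximum(cm.sum(axis=1), 1)      # per-class recall
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    accuracy = tp.sum() / cm.sum()
    return {"precision": precision.mean(), "recall": recall.mean(),
            "accuracy": accuracy, "f_score": f1.mean()}

cm = np.array([[9, 1, 0],
               [0, 10, 0],
               [1, 0, 9]])
print(macro_metrics(cm))
```

The same routine applied to the 29-class confusion matrix of the SLDC-RSAHDL classifier would yield the averaged figures reported in Table 1.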
From the detailed results and discussion, it can be concluded that the SLDC-RSAHDL algorithm reaches effectual performance on the SLR process.

Conclusion
In this study, we have introduced a novel SLDC-RSAHDL technique for the automated detection and classification of sign language using DL and metaheuristic optimization algorithms. It follows a four-stage process: MobileNet feature extraction, MRFO based hyperparameter tuning, HDL based SLR, and RSA based parameter tuning. The design of the MRFO and RSA algorithms assists in the effectual selection of the hyperparameters related to the MobileNet and HDL models, which results in an improved detection rate. The experimental result analysis of the SLDC-RSAHDL technique on the sign language dataset demonstrates the improved performance of the SLDC-RSAHDL system over other existing DL techniques.

Data Availability Statement
The data used in this article was not collected from any public repository. The data collected as responses for this study was collected from individuals working in the case organization.

Fig. 6 inspects the accuracy of the compared techniques during the training and validation process on the test dataset. The figure states that the techniques reach increasing accuracy values over increasing epochs. Moreover, the validation accuracy tracking the training accuracy shows that the methods learn effectively on the test dataset. The loss investigation of the compared systems during training and validation on the test dataset is exhibited in Fig. 7. The outcomes infer that the methods gain close values of training and validation loss, which makes it clear that they learn effectively on the test dataset.

Table 2
Recognition outcome of SLDC-RSAHDL approach with distinct measures.

Table 3
Comparative outcome of SLDC-RSAHDL system with other techniques.