Leveraging the Lion Swarm Optimizer with a deep convolutional neural network for gesture recognition and classification

Vision-based human gesture detection is the task of predicting a gesture, such as clapping, a sign language gesture, or waving hello, from a sequence of video frames. One of the attractive features of gesture detection is that it enables humans to interact with devices and computers without the need for an external input tool such as a remote control or a mouse. Gesture detection from videos has various applications, such as robot learning, control of consumer electronics, computer games, and mechanical systems. This study leverages the Lion Swarm optimizer with a deep convolutional neural network (LSO-DCNN) for gesture recognition and classification. The purpose of the LSO-DCNN technique lies in the proper identification and categorization of the various categories of gestures present in the input images. The presented LSO-DCNN model follows a three-step procedure. In the first step, the 1D-convolutional neural network (1D-CNN) method derives a collection of feature vectors. In the second step, the LSO algorithm optimally chooses the hyperparameter values of the 1D-CNN model. In the final step, the extreme gradient boosting (XGBoost) classifier allocates proper classes, i.e., recognizes the gestures efficaciously.


Introduction
Noncontact gesture recognition has made a significant contribution to human-computer interaction (HCI) applications with the enormous growth of artificial intelligence (AI) and computer technology [1]. Hand gesture detection systems, with their natural human-computer interaction features, enable effective and intuitive communication through a computer interface. Furthermore, gesture detection depends on vision and can be broadly implemented in AI, natural language communication, virtual reality, and multimedia [2]. Daily, the demand for and the level of services essential to people is increasing. Hand gestures are a main component of face-to-face communication [3]. Hence, human body language, including the making of hand gestures, plays a significant part in face-to-face communication. In interaction, many things are expressed with hand gestures, and this study presents a few insights into communication itself [4]. Yet, recent automation in this area does not concentrate on using hand gestures in everyday actions. The emerging technology eases the complexity of the different user interfaces and computer programs presented to the user. To make this mechanism less complex and easier to understand, image processing is nowadays utilized [5].
When communication has to take place between a deaf person and a hearing person, there is a strong necessity for hand gestures. To make the system smarter, hand gesture images must be entered into the mechanism and examined further to determine their meaning [6]. Still, conventional hand gesture detection based on image processing methods was not broadly implemented in HCI due to its complex algorithms, poor real-time capability, and low recognition accuracy [7]. Currently, gesture detection based on machine learning (ML) has advanced quickly in HCI owing to the application of AI and image processing on the graphics processing unit (GPU) [8]. ML methods like neural networks, local orientation histograms, elastic graph matching, and support vector machines (SVM) were broadly utilized. Due to its learning capability, the neural network (NN) does not require manual feature setting: by simulating human learning processes, it can be trained on gesture instances to form a network classification detection map [9]. Currently, DL is a frequently utilized approach for hand gesture recognition (HGR). Recurrent neural networks (RNN), CNNs, and stacked denoising autoencoders (SDAE) are usually utilized in HGR applications [10].
This study leverages the Lion Swarm optimizer with a deep convolutional neural network (LSO-DCNN) for gesture recognition and classification. The aim of the LSO-DCNN technique lies in the proper identification and categorization of the various categories of gestures that exist in the input images. Primarily, the 1D-convolutional neural network (1D-CNN) method derives a collection of feature vectors. In the second step, the LSO algorithm optimally chooses the hyperparameter values of the 1D-CNN model. At the final step, the extreme gradient boosting (XGBoost) classifier allocates proper classes, i.e., recognizes the gestures efficaciously. To portray the enhanced gesture classification results of the LSO-DCNN algorithm, a wide range of experimental results are investigated. A brief comparative study reports the improvements of the LSO-DCNN technique in the gesture recognition process.

Literature survey
Sun et al. [11] suggested a model based on multi-level feature fusion of a two-stream convolutional neural network (MFF-TSCNN), which comprises three major phases. Initially, the Kinect sensor acquires red, green, blue, and depth (RGB-D) images to establish a gesture dataset. Simultaneously, data augmentation is accomplished on the testing and training datasets. Later, the MFF-TSCNN model is built and trained. Barioul and Kanoun [12] proposed a new classification model based on an extreme learning machine (ELM) reinforced by an enhanced grasshopper optimization algorithm (GOA) as a foundation for a weight-pruning procedure. Myographic modalities like force myography (FMG) present stimulating signals that can form the foundation for recognizing hand signs. FMG was examined to limit the number of sensors to appropriate locations and to provide the necessary signal processing techniques for feasible employment in wearable embedded schemes. Gadekallu et al. [13] presented a crow search-based CNN (CS-CNN) method for recognizing gestures in the HCI field. The hand gesture database utilized in the research is an open database obtained from Kaggle. Also, a one-hot encoding method was employed to convert the categorical values of the data into binary form. After this, a crow search algorithm (CSA) was employed for choosing the optimum tuning for data training using the CNNs.
Yu et al. [14] employed a particle swarm optimization (PSO) technique for optimizing the width and center values of the radial basis function neural network (RBFNN). Also, the authors utilized an electromyography (EMG) signal acquisition device and an electrode sleeve to gather the four-channel continuous EMG signals produced by 8 serial gestures. In [15], the authors presented an ensemble of CNN-based techniques. First, the gesture segment is identified by employing a background separation model based on binary thresholding. Then, the contour section is extracted and the segmentation of the hand area takes place. Later, the images are resized and given to three distinct CNN methods for similar training.
Gao et al. [16] developed an effective hand gesture detection model based on deep learning. First, an RGB-D early-fusion technique based on the HSV space was suggested, efficiently mitigating background interference and enhancing hand gesture data. Second, a hand gesture classification network (HandClasNet) was suggested for realizing hand gesture localization and recognition by identifying the center and corner hand points, employing an EfficientNet-like backbone. In [17], the authors utilized the CNN approach for the recognition and identification of human hand gestures. The procedure workflow comprises segmenting the hand region of interest by employing finger segmentation and mask images, normalizing the segmented finger image, and detection by utilizing the CNN classifier. Segmentation extracts the hand area from the whole image by applying mask images.

Materials and methods
This study has developed a new LSO-DCNN method for automated gesture recognition and classification. The major intention of the LSO-DCNN method lies in the proper identification and categorization of the various categories of gestures that exist in the input images. The presented LSO-DCNN model follows a three-step procedure:

Step 1: The 1D-CNN method derives a collection of feature vectors.

Step 2: The LSO method optimally chooses the hyperparameter values of the 1D-CNN model.

Step 3: The XGBoost classifier allocates proper classes, i.e., recognizes the gestures.
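At a glance, the three steps compose into a single pipeline. The following is a minimal, illustrative Python sketch of how the stages hand data to one another; all components are simplified stand-ins (a moving-average feature extractor for the 1D-CNN, an exhaustive scoring loop for LSO, a thresholded scorer for XGBoost), not the paper's implementation:

```python
import numpy as np

def extract_features(x, width):
    # Stand-in for the 1D-CNN: moving-average features of a given width.
    n = len(x) // width * width
    return x[:n].reshape(-1, width).mean(axis=1)

def tune_width(signals, labels, candidates):
    # Stand-in for LSO: pick the hyperparameter (window width) whose
    # features yield the best accuracy-based fitness.
    def score(w):
        feats = np.array([extract_features(s, w).mean() for s in signals])
        preds = (feats > feats.mean()).astype(int)
        return (preds == labels).mean()
    return max(candidates, key=score)

def classify(signal, width, threshold=0.5):
    # Stand-in for XGBoost: threshold the pooled feature value.
    return int(extract_features(signal, width).mean() > threshold)

signals = [np.ones(8) * 0.2, np.ones(8) * 0.9]   # two toy "gesture" signals
labels = np.array([0, 1])
best_w = tune_width(signals, labels, candidates=[1, 2, 4])
print(best_w, [classify(s, best_w) for s in signals])
```

The point of the sketch is only the data flow: features are extracted, a hyperparameter is tuned against an accuracy fitness, and a final classifier assigns labels.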

Stage I: 1D-CNN based feature extraction
First, the 1D-CNN model derives a collection of feature vectors. A CNN is a neural network that exploits convolutional operations in at least one layer of the network instead of ordinary matrix multiplication operations [18]. Convolution is a special linear operation; each stage of a convolutional network generally consists of three layers: convolutional, activation, and pooling layers. In the image detection domain, the 2D-CNN is commonly utilized for extracting features from images. Classical CNN models include AlexNet, LeNet, ResNet, VGG, GoogleNet, and so on. The 1D-CNN is used for extracting appropriate features from the data. The input of the 1D-CNN is 1D data; hence its convolutional kernel adopts a 1D architecture. The output of every convolutional, activation, and pooling layer corresponds to a 1D feature vector. In this section, the fundamental structure of the 1D-CNN is introduced.

Convolution layer
The convolution layer implements the convolution function on the 1D input signals and the 1D convolution filter, and later extracts local features using the activation layer.The data is inputted to the convolution layer of the 1D-CNN to implement the convolutional function.
$$y_j^{l} = \sum_{i} x_i^{l-1} * k_{ij}^{l-1} + b_j^{l}$$

Here, $y_j^{l}$ and $b_j^{l}$ respectively characterize the output and offset (bias) of the $j$-th neuron in layer $l$; $x_i^{l-1}$ characterizes the output of the $i$-th neuron in layer $l-1$; $k_{ij}^{l-1}$ characterizes the convolutional kernel connecting the $i$-th neuron in layer $l-1$ and the $j$-th neuron in layer $l$; $j = 1, 2, \ldots, N$, where $N$ denotes the number of neurons.
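The layer equation above can be illustrated with a short NumPy sketch (the input signal, kernel values, and bias below are arbitrary illustrative numbers, not trained parameters from the paper):

```python
import numpy as np

def conv1d_layer(x, kernels, biases):
    """Valid-mode 1D convolution: one output channel per (kernel, bias) pair."""
    outputs = []
    for k, b in zip(kernels, biases):
        # np.convolve flips its second argument, so pre-flipping the kernel
        # yields the correlation-style sliding product used in CNN layers.
        y = np.convolve(x, k[::-1], mode="valid") + b
        outputs.append(y)
    return np.stack(outputs)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])    # 1D input signal
kernels = [np.array([1.0, 0.0, -1.0])]      # one edge-detector-like filter
biases = [0.5]
print(conv1d_layer(x, kernels, biases))     # one feature vector per kernel
```

Each kernel produces one 1D feature vector, matching the statement that every convolutional layer output is a 1D feature vector.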

Activation layer
The activation layer implements a non-linear conversion of the input signal through a non-linear function to improve the CNN's expressive power. Currently, the typical activation functions are ReLU, Sigmoid, and tanh. Since the ReLU function can overcome gradient dispersion and converges quickly, it is extensively applied. Thus, the ReLU function was applied as the activation function, and its equation can be represented as

$$a^{l} = \max(0, y^{l})$$

where $y^{l}$ denotes the activation value of layer $l$.
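The element-wise ReLU described above can be written in one line of NumPy (a minimal sketch with arbitrary sample values):

```python
import numpy as np

def relu(y):
    # Element-wise max(0, y): negative activations are zeroed, positives pass.
    return np.maximum(0.0, y)

print(relu(np.array([-2.0, -0.5, 0.0, 1.5])))
```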

Pooling layer
The pooling layer is generally employed after the convolution layer. Downsampling avoids over-fitting, decreases the spatial size of the network features and the number of parameters, and reduces the computation count. The typical pooling operations are maximum and average pooling.
$$p^{l}(j) = \max_{(j-1)w < t \le jw} a^{l}(t)$$

where $p^{l}(j)$ signifies the $j$-th pooled value in layer $l$; $a^{l}(t)$ characterizes the $t$-th activation value in layer $l$; and $w$ denotes the pooling area's width.
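Max pooling amounts to taking the maximum over each non-overlapping window of width $w$. A minimal NumPy sketch (illustrative data only):

```python
import numpy as np

def max_pool1d(a, w):
    """Non-overlapping max pooling with window width w over a 1D activation."""
    n = len(a) // w * w                 # drop any trailing remainder
    return a[:n].reshape(-1, w).max(axis=1)

a = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0])
print(max_pool1d(a, 2))                 # each pair collapses to its maximum
```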

Stage II: LSO-based hyperparameter tuning
In this work, the LSO approach optimally chooses the hyperparameter values of the 1D-CNN model. This approach is selected for its capacity to effectively navigate the parameter space, adapt to local characteristics, and converge toward optimum settings, which makes it well suited to fine-tuning intricate models. In the LSO algorithm, the lion king conducts a range search around the historical optimum solution to find better solutions [19]. The equation for updating his location is

$$x_i^{k+1} = g^k \left( 1 + \gamma \left\| p_i^k - g^k \right\| \right)$$

A lioness arbitrarily chooses another lioness to cooperate with, and the equation for her location update can be represented as

$$x_i^{k+1} = \frac{p_i^k + p_c^k}{2} \left( 1 + \alpha_f \gamma \right)$$

A young lion updates its position in one of three ways: follow the lion king, follow a lioness, or leave the group:

$$x_i^{k+1} = \begin{cases} \dfrac{g^k + p_i^k}{2}\left(1 + \alpha_c \gamma\right), & q \le \frac{1}{3} \\[4pt] \dfrac{p_m^k + p_i^k}{2}\left(1 + \alpha_c \gamma\right), & \frac{1}{3} < q < \frac{2}{3} \\[4pt] \dfrac{\bar{g}^k + p_i^k}{2}\left(1 + \alpha_c \gamma\right), & q \ge \frac{2}{3} \end{cases} \qquad (6)$$

In Eq (6), $x_i^k$ denotes the $i$-th individual of the $k$-th generation population; $p_i^k$ represents the prior optimum location of the $i$-th individual from the 1st to the $k$-th generation; $\gamma$ is a uniformly distributed random number in $(0,1)$; $p_c^k$ is randomly chosen from the $k$-th generation lioness group; $g^k$ is the optimum location of the $k$-th generation population; $q$ is a uniformly distributed random number in $[0,1]$; $\bar{g}^k = \mathrm{low} + \mathrm{high} - g^k$; $p_m^k$ is arbitrarily chosen from the $k$-th generation lion group; $\alpha_f$ and $\alpha_c$ denote the disturbance factors, where low and high indicate the minimal and maximal values of all dimensions within the range of the lion activity space, $\alpha_f = 0.1(\mathrm{high} - \mathrm{low}) \cdot \exp\left(-30t/T\right)^{10}$ and $\alpha_c = 0.1(\mathrm{high} - \mathrm{low}) \cdot \frac{T - t}{T}$; $T$ shows the maximal number of iterations and $t$ denotes the current iteration.
The fitness selection is a vital component of the LSO method. Solution encoding is used to evaluate the aptitude of candidate solutions. Here, the accuracy value is the main condition used to design the fitness function:

$$fitness = \max(P), \qquad P = \frac{TP}{TP + FP}$$

In this expression, $TP$ denotes the true positive value and $FP$ means the false positive value.
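To make the role-based search concrete, below is a deliberately simplified LSO-style hyperparameter search in Python. Everything here is an illustrative assumption rather than the paper's implementation: the fitness is a synthetic surface standing in for validation accuracy of a 1D-CNN, the population size and bounds are arbitrary, and only the lion-king and lioness updates are shown (the cub rules of Eq (6) would slot in analogously):

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(x):
    # Synthetic stand-in for validation accuracy of a model trained with
    # hyperparameters x (e.g., learning rate, kernel size); peaks at (0.3, 0.7).
    return -((x[0] - 0.3) ** 2 + (x[1] - 0.7) ** 2)

low, high, T = 0.0, 1.0, 300
pop = rng.uniform(low, high, size=(10, 2))       # lion positions
best_pos = pop.copy()                             # historical bests p_i
best_fit = np.array([fitness(p) for p in pop])

for t in range(T):
    king = int(np.argmax(best_fit))
    g = best_pos[king]                            # lion king g^k
    alpha_f = 0.1 * (high - low) * np.exp(-30.0 * t / T) ** 10
    for i in range(len(pop)):
        gamma = rng.uniform(0.0, 1.0)
        if i == king:
            # Lion-king update: range search around the historical optimum.
            cand = g * (1 + gamma * np.linalg.norm(best_pos[i] - g))
        else:
            # Lioness-style update: cooperate with a randomly chosen lioness.
            mate = best_pos[rng.integers(len(pop))]
            cand = (best_pos[i] + mate) / 2 * (1 + alpha_f * gamma)
        cand = np.clip(cand, low, high)
        f = fitness(cand)
        if f > best_fit[i]:                       # keep personal bests
            best_fit[i], best_pos[i] = f, cand

print(best_pos[np.argmax(best_fit)])              # best hyperparameters found
```

The personal-best bookkeeping mirrors the $p_i^k$ term of Eq (6), and the decaying $\alpha_f$ shrinks the lioness disturbance as the search converges.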

Stage III: XGBoost classification
Finally, the XGBoost classifier allocates proper classes, i.e., recognizes the gestures efficaciously. XGBoost is an ensemble ML technique based on gradient boosting, utilized to improve the efficiency of a predictive model by integrating a series of weak learners into a strong learning approach [20].
Ensemble methods offer better outcomes than a single model. Figure 2 depicts the architecture of XGBoost. The steps involved are given as follows.

Step 1: Initialize
Consider a binary classification problem where $y_i$ is the actual label, denoted as 1 or 0. Consequently, the commonly exploited log loss function is assumed in this case and is demonstrated as

$$L(y_i, \hat{y}_i) = -\left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right) \qquad (11)$$

Based on the $y_i$ and $\hat{y}_i$ values, the first- and second-order gradients $g_i$ and $h_i$ are evaluated. For instance $x_i$, the predicted value evaluated from the $(t-1)$-th tree is $\hat{y}_i^{(t-1)}$, where the actual value of $x_i$ is $y_i$. The predictive value of the 0-th tree is 0, which implies $\hat{y}_i^{(0)} = 0$.
Step 2: The Gain value of each feature to be traversed is computed to determine the splitting mode of the current root node. The Gain value helps to identify the feature node with the maximal Gain score.
Step 3: In this step, the current binary leaf-node setup is established. Based on the feature with the maximal Gain, the sample set is split into two parts, yielding two leaf nodes. Step 2 is then repeated on the two leaf nodes until a negative Gain score or the end criterion is reached, respectively. This step establishes the entire tree.
Step 4: The prediction values of all leaf nodes are computed in this step. The prediction value of leaf node $j$ is computed as

$$w_j = -\frac{G_j}{H_j + \lambda} \qquad (14)$$

where $G_j$ and $H_j$ are the sums of $g_i$ and $h_i$ over the instances in the leaf and $\lambda$ is the regularization coefficient. The prediction outcome of the second tree is expressed as $\hat{y}_i^{(2)} = \hat{y}_i^{(1)} + f_2(x_i)$, which establishes the second tree.
Step 5: The preceding steps are repeated to set up further trees until a sufficient number of trees has been built. The predictive value of the model is expressed as $\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t(x_i)$, where $\hat{y}_i^{(t)}$ refers to the predictive value of $t$ trees on instance $x_i$. This procedure creates the $t$-th tree.
Step 6: The classification outcome of an instance is determined by converting its final prediction value $\hat{y}_i$ into a probability. If $\hat{y}_i \ge 0.5$, the instance is classified as class 1; otherwise, it is classified as class 0.
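Steps 1 and 4 can be illustrated with a short NumPy sketch of the log-loss gradients and the standard XGBoost leaf-weight formula $w = -G/(H + \lambda)$. This is a sketch of the standard derivation, not the paper's code; the four toy instances and $\lambda = 1$ are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Raw scores from the previous trees; the 0-th tree predicts 0 for everyone.
raw = np.zeros(4)
y = np.array([1.0, 1.0, 1.0, 0.0])     # actual binary labels (toy data)

p = sigmoid(raw)                        # current probability estimates
g = p - y                               # first-order gradients of log loss
h = p * (1.0 - p)                       # second-order gradients (hessians)

# Prediction weight of a leaf containing all four instances.
lam = 1.0                               # regularization coefficient lambda
G, H = g.sum(), h.sum()
w = -G / (H + lam)
print(g, h, w)
```

Since three of the four labels are 1, the leaf weight comes out positive, pushing the raw score (and hence the sigmoid probability used in Step 6) toward class 1.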

Results and discussion
In this section, the results of the LSO-DCNN technique are validated using two benchmark datasets: the sign language digital (SLD) dataset and the sign language gesture image (SLGI) dataset.

Figure 3. Comparative outcome of LSO-DCNN approach on the SLD dataset.

Figure 4 inspects the accuracy of the LSO-DCNN method during training and validation on the SLD dataset. The figure shows that the LSO-DCNN method attains higher accuracy values over increasing epochs. Furthermore, the validation accuracy exceeding the training accuracy portrays that the LSO-DCNN approach learns productively on the SLD dataset.

Figure 4. Accuracy curve of LSO-DCNN approach on the SLD dataset.

The loss analysis of the LSO-DCNN technique during training and validation on the SLD dataset is given in Figure 5. The results indicate that the LSO-DCNN approach attained close values of training and validation loss. The LSO-DCNN approach learns productively on the SLD database.

Figure 5. Loss curve of LSO-DCNN approach on the SLD dataset.

Figure 7 portrays the accuracy of the LSO-DCNN method during training and validation on the SLGI database. The result shows that the LSO-DCNN technique attains higher accuracy values over increasing epochs. Moreover, the validation accuracy exceeding the training accuracy shows that the LSO-DCNN technique learns productively on the SLGI database.
Conclusion

This study developed a new LSO-DCNN technique for automated gesture recognition and classification. The major intention of the LSO-DCNN approach lies in the proper identification and categorization of the various categories of gestures that exist in the input images. The presented LSO-DCNN model follows a three-step procedure, namely 1D-CNN based feature extraction, LSO-based hyperparameter tuning, and XGBoost classification. In this work, the LSO method optimally chooses the hyperparameter values of the 1D-CNN model, which helps to recognize the gestures efficaciously. To prove the enhanced gesture classification results of the LSO-DCNN approach, a wide range of experimental results were investigated. The brief comparative study reported the improvements of the LSO-DCNN technique in the gesture recognition process. In the future, multimodality concepts can enhance the performance of the LSO-DCNN technique.

Table 1. Comparative analysis of the LSO-DCNN approach with other systems on the SLD dataset.

Table 2. Comparative analysis of the LSO-DCNN approach with other methods on the SLGI dataset.