Machine learning methods for sign language recognition: a critical review and analysis.

Sign language is an essential tool to bridge the communication gap between hearing and hearing-impaired people. However, the diversity of over 7000 present-day sign languages, with variability in motion, hand shape, and position of body parts, makes automatic sign language recognition (ASLR) a complex problem. In order to overcome such complexity, researchers are investigating better ways of developing ASLR systems, seeking intelligent solutions, and have demonstrated remarkable success. This paper aims to analyse the research published on intelligent systems in sign language recognition over the past two decades. A total of 649 publications related to decision support and intelligent systems on sign language recognition (SLR) are extracted from the Scopus database and analysed. The extracted publications are analysed using the bibliometric VOSViewer software to (1) obtain the publications' temporal and regional distributions and (2) create cooperation networks between affiliations and authors and identify productive institutions in this context. Moreover, a review of techniques for vision-based sign language recognition is presented, and the various feature extraction and classification techniques used in SLR to achieve good results are discussed. The literature review presented in this paper shows the importance of incorporating intelligent solutions into sign language recognition systems and reveals that a perfect intelligent system for sign language recognition is still an open problem. Overall, it is expected that this study will facilitate knowledge accumulation and creation on intelligent-based SLR and provide readers, researchers, and practitioners with a roadmap to guide future directions. © 2021 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY license ( http://creativecommons.org/licenses/by/4.0/ )


Introduction
Communication is an essential tool in human existence. It is a fundamental and effective way of sharing thoughts, feelings and opinions. However, a substantial fraction of the world's population lacks this ability ( El-Din & El-Ghany, 2020 ). Many people suffer from hearing loss, speech impairment or both. A partial or complete inability to hear in one or both ears is known as hearing loss, while muteness is a disability that impairs speaking and leaves the affected people unable to speak. If deaf-muteness occurs during childhood, language learning can be hindered, resulting in language impairment, also known as hearing mutism. These ailments are among the most common disabilities worldwide ( Hasan et al., 2020 ). Statistical reports on physically challenged children over the past decade reveal an increase in the number of neonates born with hearing impairment, which creates a communication barrier between them and the rest of the world ( Krishnaveni et al., 2019 ).
According to the World Health Organization (WHO) report, the number of people affected by hearing disability in 2005 was approximately 278 million worldwide ( Savur & Sahin, 2016 ). Ten years later, this number had jumped to 360 million, roughly a 30% increase ( Savur & Sahin, 2016 ). Since then, the number has been increasing exponentially. The latest WHO report revealed that 466 million people were suffering from hearing loss in 2019, which amounts to 5% of the world population, with 432 million (83%) of them adults and 34 million (17%) children ( Bin et al., 2019 ; Hisham & Hamouda, 2019 ; Saleh & Issa, 2020 ). The WHO also estimated that the number would double (i.e. 900 million people) by 2050 ( El-Din & El-Ghany, 2020 ). Given this fast-growing deaf-mute population, there is a need to break the communication barrier that adversely affects the lives and social relationships of deaf-mute people.
https://doi.org/10.1016/j.iswa.2021.200056
I.A. Adeyanju, O.O. Bello and M.A. Adegboye, Intelligent Systems with Applications 12 (2021) 200056
Sign languages are used as a primary means of communication by deaf and hard-of-hearing people worldwide ( Izzah & Suciati, 2014 ). They are the most potent and effective way to bridge the communication gap and enable social interaction between them and hearing people. Sign language interpreters help close this communication gap by translating sign language into spoken words and vice versa. However, the challenges of employing interpreters are the flexible structure of sign languages combined with insufficient numbers of expert sign language interpreters across the globe ( Kudrinko et al., 2021 ). According to the World Federation of the Deaf, more than 300 sign languages are used by more than 70 million people worldwide ( Rastgoo et al., 2021b ). Hence the need for a technology-based system that can complement conventional sign language interpreters.
Sign language involves the use of the upper part of the body, such as hand gestures ( Gupta & Rajan, 2020 ), facial expressions ( Chowdhry et al., 2013 ), lip-reading ( Cheok et al., 2019 ), head nodding and body postures, to disseminate information ( Butt et al., 2019 ; Rastgoo et al., 2021b ; Lee et al., 2021a ). The key techniques for sign language recognition are vision-based and wearable sensing modalities such as sensory gloves. Sign language recognition systems based on these approaches have been proposed by several researchers ( Ionescu et al., 2005 ; Yu et al., 2010 ; Li et al., 2015 ; Sonkusare et al., 2015 ; Bobic et al., 2016 ; Islam et al., 2017 ; Saha et al., 2018 ; Rastgoo et al., 2021a ; Xu et al., 2021 ). The glove-based approach employs mechanical or optical sensors attached to a glove worn by the user and converts finger movements into electrical signals for determining the hand posture. In the vision-based approach, features corresponding to the palm, finger positions and joint angles are estimated and then used to perform recognition. This method requires acquiring images or videos of the signs through a camera and processing them using image processing techniques.
Recent advances in Artificial Intelligence (AI) for sign language recognition have paved the way for research communities to apply AI to sign interpreting operations. There are several excellent references on intelligent systems for sign language recognition ( Admasu & Raimond, 2010 ; Deriche et al., 2019 ; Zapata et al., 2019 ; Song et al., 2021 ; Lee et al., 2021a, 2021b ; Gao et al., 2021 ). More recently, further attention has been given to intelligent-based SLR systems as they are now being applied in many applications. These include robotics ( Ryumin et al., 2019 ), interpreting services, real-time multi-person recognition systems, games, virtual reality environments, natural language communications, online hand tracking of human communication in desktop environments, and human-computer interaction ( Deng et al., 2017 ; Supančič et al., 2018 ; Wadhawan & Kumar, 2020 ; Rastgoo et al., 2021b ). Despite historical research progress and remarkable achievements made in intelligent sign language recognition systems, there is still great potential for creating intelligent solutions for sign language recognition. Therefore, this study aims to provide a systematic and comprehensive review of research papers published in the field of intelligent sign language recognition systems to gain insights into the application of decision support and intelligent systems in this context. The major objectives of this study are to:
1. Analyse the research published on intelligent systems in sign language recognition using bibliometric analysis of 649 publications extracted from the Scopus database over a period of two decades.
2. Provide a comprehensive review in the context of sign language recognition systems with an explicit focus on decision support and expert system technologies over the past two decades.
3. Highlight open issues and possible research areas for future consideration.
The remainder of the paper is organised as follows. Section 2 presents the methodology for the systematic review and bibliometric analysis of the selected articles. Section 3 provides a review of the techniques used in sign language recognition systems, including techniques for data acquisition, preprocessing, segmentation, feature extraction, and classification. In Section 4, a comprehensive review of decision support and intelligent system algorithms used in sign language recognition is presented. The conclusions and a discussion of future needs are presented in Section 5.

Bibliometric analysis
This section employs a bibliometric analysis methodology to analyse the research published on intelligent systems in sign language recognition over the past two decades. This is done to discern and understand historical trends and publication patterns over time by journal, region and cooperation between institutions and organisations. Over the years, there has been increasing research interest in ASLR, so investigating the overall research trend and new research directions in this field is compelling. A comprehensive bibliometric analysis of research trends related to decision support and intelligent systems in sign language recognition from 2001 to 2021 was carried out to identify potential research gaps and to highlight the boundaries of knowledge during this time frame. The choice of two decades of research was made to give a broader view of how research has progressed during this period and to enable researchers to understand research patterns and their characteristics. This information is helpful as it elucidates scientific activity regarding publication trends across countries, institutions, journals and authors.
The keywords used for data collection are "Sign Language recognition", "Intelligent Systems", "Artificial intelligence", "Decision support system", "Machine learning", "Neural network", "Expert system", "Fuzzy system", and "Knowledge-based systems". The search strategy for retrieving publications during the selected time frame is as follows: (("sign AND language AND recognition") AND (machine AND learning) OR (artificial AND intelligent)) AND DOCUMENT TYPE: (research article) AND PUBLISHED YEAR: (> 2000) INDEX: (Scopus database). Quotation marks (" ") were introduced in the search field to retrieve exact keyword or phrase matches and thereby identify the articles focusing on the subject of the study. The analysis was based on countries, institutions, publication distribution sources, citations and co-occurrence of search keywords. These indicators were selected as a benchmark to highlight the countries, institutions, and leading experts at the forefront of ASLR, and to guide the exploration of collaboration networks across these indicators. The two leading tools for carrying out bibliometric analyses are the VOSviewer and CitNetExplorer software. VOSviewer was employed in this study because it focuses on analysis at the level of aggregate publications, whereas CitNetExplorer emphasises individual publications. VOSviewer enables the creation of maps from network datasets to explore and visualise links between scientific publications, journals, citations, countries, institutions, and authors.

Bibliometric analysis procedure
The procedure for bibliometric analysis is shown in Fig. 1. According to the process, the first two steps involve collecting extensive literature from the Scopus database on March 20, 2021. The rationale for selecting the Scopus database was that its indexed journals are globally recognised as influential and as top venues for modern and eminent research output on sign language recognition systems. Therefore, it is possible to search for and obtain a significant proportion of the published articles in this context. Different keywords were used collectively to extract research articles published on intelligent sign language recognition systems throughout the search process. The authors first searched the Scopus database using the keyword "Sign Language recognition", and the search returned 1312 journal articles published between 2001 and 2021.
Since the focus of this study is on intelligent systems in sign language recognition, after conducting the first search step on general sign language recognition, the authors repeated the process by refining the search using the keywords in step 2 ("Intelligent Systems" AND "Sign Language recognition"). This search resulted in 26 journal articles focused on intelligent-based sign recognition systems. However, 26 articles are likely not sufficient to determine research trends in intelligent-based sign language recognition systems over the last two decades. Therefore, the authors further explored the literature and identified the different intelligent system types employed in sign language recognition systems: Artificial intelligence, Decision support system, Machine learning, Neural network, Expert system, Fuzzy system, and Knowledge-based systems. These search terms were combined into a search string using the conjunction AND and disjunction OR operators. The query string returned 652 papers, which were combined with the initial 26 articles for further analysis. To avoid any confusion and duplicates in the final articles selected for analysis, the authors filtered out identical papers in step 4. Finally, 649 published articles were considered for bibliometric analysis to determine research trends in intelligent-based sign language recognition systems. VOSViewer, a software tool for constructing and visualising bibliometric networks, is used for step 5. The obtained results are presented in Section 2.2.
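The combination of search terms with AND/OR operators described above can be sketched programmatically. The snippet below is a minimal illustration of assembling a base phrase with OR-joined intelligent-system keywords; the TITLE-ABS-KEY field code and exact formatting are assumptions for illustration, not the literal string submitted to Scopus.

```python
def build_scopus_query(base, systems, year=2000):
    """Assemble a boolean Scopus-style query: base phrase AND (OR-joined keywords)."""
    # Disjunction (OR) over the intelligent-system keyword list
    systems_clause = " OR ".join(f'"{s}"' for s in systems)
    # Conjunction (AND) with the base phrase and the year filter
    return f'TITLE-ABS-KEY("{base}" AND ({systems_clause})) AND PUBYEAR > {year}'

keywords = ["machine learning", "artificial intelligence", "neural network",
            "expert system", "fuzzy system", "decision support system",
            "knowledge-based systems"]
query = build_scopus_query("sign language recognition", keywords)
```

A query built this way retrieves only records that mention the base phrase together with at least one of the listed intelligent-system terms, mirroring steps 2 and 3 of the procedure.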

Findings and discussion
The findings of the bibliometric evaluation of the research published on intelligent systems in sign language recognition over the past two decades are presented in different sections, each of which presents the findings in connection with a specific variable. Fig. 2 shows the number of publications indexed in the Scopus database on intelligent systems for sign language recognition over two decades (2001 to 2021). As shown in Fig. 2, the largest number of publications appeared in 2020, with 114 articles, followed by 2019 with 91 papers. The smallest number of papers was published in 2003. The annual rate of research output in 2019 and 2020 is more than double the yearly publication rate since 2001. Please note that this research was conducted on March 20, 2021; only 28 papers were recorded in the Scopus database for 2021 at that time, so manuscripts submitted in previous years or in early 2021 may still appear. As can be observed in Fig. 2, the publication trend highlights an increase in research on intelligent sign language recognition systems from 2011 onwards, with 495 papers, whereas the first decade (2001 to 2010) featured 154 articles out of the 649 recorded for the entire two decades. This trend evidently indicates that intelligent systems are gaining attention in the research area of sign language recognition, especially from 2011 onwards. The publication trend is consistent with the rapid increase in the number of people affected by hearing disability ( Bin et al., 2019 ). Besides, the growth rate of the cumulative publications signifies that research on intelligent sign language recognition systems is still a hot research area. Therefore, it is expected that the annual number of publications will continue to increase.

Publications distribution sources
Our findings showed that 367 different journals published the 649 articles recorded in the Scopus database from 2001 to 2021 in the area of intelligent-based sign recognition systems. To avoid a lengthy table, the most active journals in the domain of this study are presented in Table 1.
The topmost journal is published by the Multidisciplinary Digital Publishing Institute (MDPI), whereas IEEE accounts for the highest number of journals on the list, with four in total. The remaining journals are published by Springer Nature, Elsevier Ltd and Little Lion Scientific; Springer and Elsevier have three journals each, and Little Lion Scientific one, making up the 10 journals in Table 1. The Sensors (Switzerland) journal indubitably plays a dominant role in the domain, with 15 publications and 113,885 total journal citations. It is followed by Multimedia Tools and Applications (14 publications). The Journal of Ambient Intelligence and the Journal of Theoretical and Applied Information Technology have seven publications each, while IEEE Transactions on Fuzzy Systems and the Journal of Biomedical Informatics share an equal number of papers (6 publications each). According to the CiteScore 2020 report, eight journals had a CiteScore of 5 and above. The scope of the topmost journal is broad, covering different disciplines of science and technology related to sensors and their applications. Its publication frequency and open-access model may also explain why Sensors publishes the highest number of articles.

Geographical distribution of publications
We analysed the countries where the research was carried out based on each country's unique occurrence in the affiliations of each article, using the VOSViewer tool. The occurrences of each country are then summed to give the total number of publications per country; note that a single publication might count towards more than one country. This leads to 797 research outputs across 78 countries that published at least one full-length research article on intelligent systems in sign language recognition over the past two decades. Fig. 3 shows the geographical map of the top 16 countries contributing to the growth of SLR systems research over the past two decades. India, China and the U.S. contributed about 38.14% of the global publications.
These three countries play a key role in advancing sign language recognition research, with India leading worldwide. India led with 123 publications over the past two decades, accounting for 15.4% of the total global publications, while China and the U.S. contributed 13.93% and 8.8%, respectively. In addition to Fig. 3, we also list the most active institutions for the top 16 countries in Table 2. Collaboration between institutions across different countries leads to a paper having more than one affiliation or country. A visualization network was developed using the co-authorship analysis of VOSViewer to explore the research cooperation between different affiliations. The distribution of countries or territories per region is illustrated in Fig. 4. The nodes (squares) in the network symbolise countries, and their size depends on the country's level of cooperation. The colour in the visualization network characterises the collaboration network: the closer two countries are located to each other in the network, the stronger the link, the thicker the line and the stronger their relatedness. As shown in Fig. 4, the greatest thickness is observed between the United States and the United Kingdom. The results of the co-authorship collaboration analysis showed that the United States is the most connected country, linked to 41 countries/territories. It is followed by China (35 links), the United Kingdom (27 links), France (16 links), Saudi Arabia (13 links), Singapore (13 links), India (12 links), Turkey (12 links), Germany (11 links), Italy (11 links), Switzerland (10 links), and 55 other countries connected by between nine links and one link. It was observed that about 17% of the listed countries had international collaborative publications with no fewer than ten countries.
Possible contributory factors to the dynamics of international collaboration among these countries include the diversity of research partners, substantial research funding, and many visiting or postgraduate scholars from abroad. Flexible and stable research policies can also be credited for the international collaboration in most highly linked countries.

Review of vision-based sign language recognition techniques
Vision-based sign language recognition (SLR) can be categorised into five stages: image acquisition, image pre-processing, segmentation, feature extraction, and classification, as shown in Fig. 5. Image acquisition is the first stage of sign language recognition; images can be acquired through self-created or publicly available datasets. The second stage is preprocessing, which eliminates unwanted noise and enhances the quality of the image. The next step after preprocessing is to segment and extract the region of interest from the entire image. The fourth stage is feature extraction, which transforms the input image region into feature vectors for recognition. The last stage in vision-based SLR is classification, which involves matching the features of a new sign image with the features stored in the database to recognise the given sign ( Raj & Jasuja, 2018 ; Cheok et al., 2019 ).
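The five stages above can be sketched as a simple pipeline. This is a minimal illustration with hypothetical placeholder stage functions; the classifier is a nearest-neighbour match against stored feature vectors, in the spirit of the matching step described above, not a definitive implementation of any particular system.

```python
import numpy as np

def preprocess(image):
    # Placeholder: resizing, grayscale conversion and noise filtering would go here.
    return image

def segment(image):
    # Placeholder: e.g. skin-colour or background subtraction to isolate the hand.
    return image

def extract_features(region):
    # Placeholder: flatten the region into a feature vector.
    return region.ravel().astype(float)

def classify(features, database):
    # Nearest-neighbour match of the feature vector against stored templates.
    labels = list(database)
    distances = [np.linalg.norm(features - database[label]) for label in labels]
    return labels[int(np.argmin(distances))]

def recognise_sign(image, database):
    # Chain the stages; image acquisition is assumed done (image is the input).
    region = segment(preprocess(image))
    return classify(extract_features(region), database)
```

In a real system each placeholder would be replaced by one of the techniques discussed in the following subsections.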

Image acquisition devices
Researchers have used different image acquisition devices to acquire sign images for classification. These devices include the camera or webcam, data glove, Kinect and leap motion controller ( Kamal et al., 2019 ). Among these devices, the camera or webcam is the most widely used because it provides better and more natural human-computer interaction without additional devices, unlike glove-based approaches. The data glove has proven to be more accurate for data acquisition but is very costly and inconvenient for users. Kinect is widely used and effective: it provides both a colour video and a depth video stream simultaneously, easily separates the background from the actual sign image, and extracts 3D trajectories of hand motions ( Kamal et al., 2019 ). The shortcoming of Kinect, however, is its high cost.
The leap motion controller operates only within a limited range, but it is a low-cost device with better accuracy than Kinect ( Suharjito et al., 2017 ). The use of a camera for sign image acquisition can be found in the literature ( Mekala et al., 2011 ; Kumarage et al., 2011 ; Pansare & Ingle, 2016 ; Athira et al., 2019 ; Sharma et al., 2021 ). The data glove was used to acquire sign data in hand gesture recognition studies by Mehdi and Khan (2002) , Gao et al. (2004) , Phi et al. (2015) and Pan et al. (2020) . Kinect was used to acquire sign images by Jiang et al. (2015) , Wang et al. (2015a, 2015b, 2016) , Raheja et al. (2016) , Carneiro et al. (2017) and Escobedo and Camara (2017) . Similar studies employed a leap motion controller for sign image acquisition ( Kiselev et al., 2019 ; Alnahhas et al., 2020 ; Enikeev & Mustafina, 2021 ). Table 3 shows the advantages and disadvantages of the different data acquisition devices. These studies revealed that the acquired images were either static or dynamic, captured as frames under different positions, backgrounds and lighting conditions.

Image preprocessing techniques
Preprocessing techniques are applied to an input image to remove unwanted noise and enhance its quality. This can be accomplished by resizing, colour conversion, removal of unwanted noise, or a combination of several of these techniques applied to the original image. A good selection of preprocessing techniques can greatly improve the accuracy of the overall system. Image preprocessing techniques can be broadly classified into image enhancement and image restoration. Image enhancement techniques include Histogram Equalization (HE), Adaptive Histogram Equalization (AHE), Contrast Limited Adaptive Histogram Equalization (CLAHE) and logarithmic transformation. Image restoration includes the median filter, mean filter, Gaussian filter, adaptive filter and Wiener filter. Fig. 6 shows a detailed outline of the image preprocessing techniques.


Image enhancement techniques
Image enhancement is one of the most challenging problems in image processing. It is an important process for restoring an image's visual appearance. The main goal of image enhancement is to improve the quality of the input image so that it is better suited for human or machine analysis. The choice of technique depends on the area of application. These techniques can be used to refine boundaries and improve the accuracy of an input image ( Majeed & Isa, 2021 ). Table 4 summarises the advantages and disadvantages of various image enhancement techniques.
Histogram equalization (HE) : Histogram equalization is used as an image preprocessing technique to strengthen the colour and increase the contrast ( Gonzalez, 2002 ). As part of its operation, the histogram equalization technique remaps the grey levels of a particular image based on their probability distribution. The method works by redistributing the grey-value levels in an image: it stretches the lower limit of the range of colours towards the darkest point and the upper limit towards the brightest point. Using this technique enhances the edges and boundaries of images but reduces their local detail ( Verma & Dutta, 2017 ). To determine the histogram equalization, the probability density function (pdf) and cumulative distribution function (cdf) are computed using Eqs. (1) and (2):

p(A_k) = n_k / n (1)

c(A_k) = sum_{j=0}^{k} p(A_j) (2)

where n is the total number of pixels in a sample, k is the grey-level index, and n_k is the number of pixels with grey level A_k within the image A. Histogram equalization operates on an image in three steps, as stated by Bagade and Shandilya (2011):
i Form the histogram.
ii Calculate new intensity values for each intensity level.
iii Replace the previous intensity values with the new intensity values.
Histogram equalization is the prevailing method for image enhancement: it increases the contrast of the image and produces an approximately uniform histogram.
It has been used to improve the contrast of input images at different locations and to make the brightness and illumination of the image uniform ( Mahmud et al., 2019 ; Nelson et al., 2019 ; Sethi et al., 2012 ; Suharjito et al., 2019 ).
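A minimal sketch of histogram equalization following Eqs. (1) and (2) and the three steps above, assuming an 8-bit grayscale image stored as a NumPy array:

```python
import numpy as np

def histogram_equalization(image):
    """Equalize an 8-bit grayscale image via its pdf (Eq. 1) and cdf (Eq. 2)."""
    n = image.size                                     # total number of pixels
    hist = np.bincount(image.ravel(), minlength=256)   # n_k for each grey level (step i)
    pdf = hist / n                                     # Eq. (1): p(A_k) = n_k / n
    cdf = np.cumsum(pdf)                               # Eq. (2): running sum of the pdf
    lut = np.round(cdf * 255).astype(np.uint8)         # new intensity values (step ii)
    return lut[image]                                  # replace old intensities (step iii)
```

The look-up table maps each original grey level to a new one proportional to its cumulative probability, which is what flattens the histogram.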
Adaptive Histogram Equalization (AHE) : Adaptive Histogram Equalization (AHE) is a well-known and efficient algorithm for improving image contrast, although it is time-consuming and computationally expensive. It has been widely applied in various image processing applications. Adaptive histogram equalization differs from ordinary histogram equalization: the adaptive approach computes many histograms, each corresponding to a different section of the image, and uses these to redistribute the image's lightness values ( Sund & Møystad, 2016 ). It is suitable for improving local contrast and edges in every region of an image but tends to amplify noise in relatively homogeneous regions ( Sudhakar, 2017 ). Adaptive histogram equalization is a contrast enhancement method applicable to grayscale and colour images with very good efficiency. The steps of the AHE algorithm are given by Longkumer et al. (2014) as:
Step 1: Start the program.
Step 2: Obtain the input image together with the number of regions, dynamic range and clip limit.
Step 3: Pre-process the input image.
Step 4: Process each contextual region, producing a grey-level mapping.
Step 5: Interpolate the grey-level mappings to assemble the final image.
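The steps above can be sketched as follows. This naive illustration equalizes each contextual region independently and omits the grey-level interpolation of Step 5 (so tile boundaries would be visible in practice); it also assumes the image dimensions divide evenly into the tile grid.

```python
import numpy as np

def adaptive_hist_equalization(image, tiles=(2, 2)):
    """Naive AHE sketch: equalize each contextual region (tile) independently."""
    out = np.empty_like(image)
    th = image.shape[0] // tiles[0]          # tile height (assumes even division)
    tw = image.shape[1] // tiles[1]          # tile width
    for r in range(tiles[0]):
        for c in range(tiles[1]):
            tile = image[r*th:(r+1)*th, c*tw:(c+1)*tw]
            hist = np.bincount(tile.ravel(), minlength=256)  # per-tile histogram
            cdf = np.cumsum(hist) / tile.size                # per-tile grey-level mapping
            lut = np.round(cdf * 255).astype(np.uint8)
            out[r*th:(r+1)*th, c*tw:(c+1)*tw] = lut[tile]    # apply mapping (Step 4)
    return out
```

Full AHE would blend neighbouring tile mappings with bilinear interpolation (Step 5) to avoid blocking artefacts at tile borders.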
Contrast limited adaptive histogram equalization (CLAHE) : Contrast Limited Adaptive Histogram Equalization (CLAHE) provides an improvement over adaptive histogram equalization by operating on small regions of the image rather than on the entire image. The enhancement function is applied over all neighbourhood pixels, and the transformation function is derived. To apply the CLAHE technique, the image is divided into contextual regions known as tiles, and histogram equalization is applied to each tile to obtain the desired output histogram distribution ( Aurangzeb et al., 2021 ). The CLAHE algorithm is given by Rubini and Pavithra (2019) as:
Step 1: Read the input image.
Step 2: Apply mean and median filter on the input image.
Step 3: Find the frequency counts for each pixel value.
Step 4: Determine the probability of each occurrence using the probability function.
Step 5: Calculate the cumulative distribution probability for each pixel value.
Step 6: Perform equalization mapping for all pixels.
Step 7: Display the enhanced image.
Suharjito et al. (2019) used contrast limited adaptive histogram equalization to enhance image edges and detect skin regions in the images. The contrast and brightness are enhanced such that the original information is not lost and the brightness is preserved ( Sykora et al., 2014 ). Three image enhancement techniques, namely CLAHE, HE and AHE, were compared by Nelson et al. (2019) .
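The contrast-limiting idea behind CLAHE, clipping histogram bins at a limit and redistributing the excess before equalizing, can be sketched as below. For simplicity this version applies the clipped equalization to the whole image rather than per tile with interpolation, so it illustrates only the clipping step, not full CLAHE.

```python
import numpy as np

def clipped_hist_equalization(image, clip_limit=0.02):
    """Contrast-limited equalization sketch (whole image, no tiling)."""
    n = image.size
    hist = np.bincount(image.ravel(), minlength=256).astype(float)
    limit = clip_limit * n                      # clip limit expressed in pixel counts
    excess = np.maximum(hist - limit, 0).sum()  # mass removed by clipping
    hist = np.minimum(hist, limit)              # clip the histogram bins
    hist += excess / 256                        # redistribute the excess uniformly
    cdf = np.cumsum(hist) / hist.sum()          # equalize with the clipped histogram
    lut = np.round(cdf * 255).astype(np.uint8)
    return lut[image]
```

Clipping caps how steep the mapping can become, which is what limits contrast amplification (and hence noise) in near-homogeneous regions.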
Logarithmic transformation : Logarithmic transformation is used when the grey-level input values are very large. The technique reduces the skewness of highly skewed intensity distributions by spreading out the dark pixels of the image while compressing the higher values ( Mahmud et al., 2019 ). To apply the transformation, an input image is first converted to grayscale before performing the logarithmic transformation so that the transformation's effects can be seen ( Chourasiya et al., 2019 ; Maini & Aggarwal, 2010 ). The logarithmic transformation equation is given as:

s = c log(1 + r)

where s is the transformed pixel value, c is the factor by which the image is enhanced (usually set to 1), and r is the current pixel value in the image. Applying the technique to images that already have high pixel values over-enhances them and causes loss of the actual information, so it does not apply to all kinds of images. The comparison of image enhancement techniques in terms of merits and demerits is presented in Table 4.
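A minimal sketch of the transformation s = c log(1 + r). Here c is chosen so that the maximum 8-bit value maps back to 255, a common display convention; the text notes c is often simply set to 1, in which case the output must be rescaled separately.

```python
import numpy as np

def log_transform(image):
    """Apply s = c * log(1 + r), with c scaled to preserve the 8-bit range."""
    c = 255.0 / np.log1p(255.0)             # maps r = 255 to s = 255
    s = c * np.log1p(image.astype(float))   # log1p(r) = log(1 + r)
    return np.round(s).astype(np.uint8)
```

Because the logarithm is steep near zero, dark pixels are spread over a wider output range while bright pixels are compressed, which is the skewness reduction described above.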
Other preprocessing techniques : Other preprocessing techniques include image scaling (also referred to as image resizing) and grayscale conversion. In image preprocessing, scaling plays an important role in enlarging or reducing the size of a given image in pixel terms ( Perumal & Velmurugan, 2018 ). Image resizing is one of the basic operations of image preprocessing; it is important for improving low-resolution images and can be used to resample an image to decrease or increase its resolution ( Fadnavis, 2014 ). Input images are scaled or cropped to a uniform size, since images collected from different sources might have different sizes. The output of this technique is either a reduced or an enlarged version of the input image. The research community has used this technique to reduce computational time and storage size ( Jin et al., 2016 ; Islam et al., 2017 ; Ramos et al., 2019 ).
Grayscale conversion is one of the simplest enhancement techniques used in image processing. It is done by converting a colour-space image, such as Red, Green and Blue (RGB), to a grayscale image. This colour model consists of grey tones only, with 256 grey levels, composed exclusively of shades of grey varying from black at the weakest intensity to white at the strongest (Buyuksahin, 2014). The advantages of using a grayscale image over an RGB colour image include simpler algorithms and reduced computational requirements while preserving the salient features of the colour image. However, a grayscale image has the disadvantage of losing the colour information in an image (Güneş et al., 2016). The equation for converting the RGB colour model into a weighted grayscale image is given (Biswas et al., 2011) in Eq. (4):

GY = 0.299R + 0.587G + 0.114B

where GY denotes the resulting grey level (grayscale) for the computed pixel, and R, G and B are the red, green and blue components of the given image. In Mekala et al. (2011), Karami et al. (2011), Oyewumi Jimoh et al. (2018) and Sharma et al. (2020a, 2020b), the colour image was converted into a two-dimensional grayscale image and, in some cases, further binarised to the two intensity values of black and white.
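The weighted conversion of Eq. (4) can be sketched in a few lines of numpy (the 0.299/0.587/0.114 weights are the commonly used luminance coefficients; the exact coefficients in Biswas et al. (2011) may differ slightly):

```python
import numpy as np

def rgb_to_gray(rgb):
    """Weighted grayscale conversion, GY = 0.299R + 0.587G + 0.114B."""
    weights = np.array([0.299, 0.587, 0.114])
    # Dot each (R, G, B) pixel with the weight vector.
    return (rgb.astype(np.float64) @ weights).round().astype(np.uint8)

white = np.array([[[255, 255, 255]]], dtype=np.uint8)
red = np.array([[[255, 0, 0]]], dtype=np.uint8)
gray_white = rgb_to_gray(white)
gray_red = rgb_to_gray(red)
```

Because the weights sum to 1, a pure white pixel maps to 255, while a pure red pixel maps to roughly 0.299 × 255 ≈ 76.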

Image restoration
Image restoration is the process of restoring a degraded image corrupted by noise and blur (Reeves, 2014). The restoration process is determined by the type of noise and corruption present in the image, and the denoising technique must be chosen based on how much noise the image contains. Various image restoration techniques are available to remove unwanted noise and blur, including the median filter, mean filter, Gaussian filter, adaptive filter and Wiener filter.
Mean filter: The mean filter is a spatial filtering method based on sliding windows. It is used for image smoothing and for reducing or eliminating noise in an image. The technique computes the value of the centre pixel of the window by averaging the values of the neighbouring pixels (Aksoy & Salman, 2020). This filter performs well on salt-and-pepper noise and Gaussian noise (Singh & Shree, 2016). The equation used to compute the mean filter is given as:

A[i, j] = (1/M) Σ_{(k,l)∈N(i,j)} p[k, l]

where M is the number of pixels used in the calculation, k and l index the locations of these pixels within the neighbourhood N(i, j), p[k, l] is the input pixel value, and A[i, j] is the mean-filtered value at any point of the image, obtained by shifting the window over the neighbouring pixels. The mean filter was used by Kasmin et al. (2020) to remove noise from sign images. Only a limited number of researchers have used mean filters in sign language recognition; this review finds that most papers used the median filter or the Gaussian filter in the preprocessing stage to remove image noise, as reported in the review work of Cheok et al. (2019).
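A straightforward numpy sketch of the windowed averaging above (assuming a 3 × 3 window and edge-replication at the borders, both our own choices):

```python
import numpy as np

def mean_filter(image, k=3):
    """k x k mean filter: each output pixel is the average of the
    k*k window centred on it (image edges handled by replication)."""
    img = image.astype(np.float64)
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    out = np.zeros_like(img)
    # Sum the k*k shifted views of the padded image, then average.
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out / (k * k)

noisy = np.full((5, 5), 10.0)
noisy[2, 2] = 100.0            # a single salt-noise pixel
smoothed = mean_filter(noisy)
```

The single outlier of 100 is pulled down to (8 × 10 + 100) / 9 = 20, illustrating both the noise reduction and the blurring the table below attributes to this filter.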
Median filter: The median filter is a non-linear method most commonly used as a simple way to reduce noise in an image while preserving edges (Dhanushree et al., 2019). It achieves particularly good results in removing salt-and-pepper noise. The pixel under consideration is replaced by the median of its neighbourhood: the values of all surrounding pixels are sorted in numerical order and the middle value is taken (Ahmad et al., 2019). When the neighbourhood contains an even number of pixels, the average of the two middle values is used. The median filter performs better than the mean filter at preserving and sharpening edges while removing noise, and it is easy to implement.
The median filter equation is given as (Gonzalez, 2002):

f̂(x, y) = median_{(s,t)∈S_xy} g(s, t)

where S_xy represents the set of coordinates in a rectangular subimage window (kernel) of size m × n centred at any point (x, y) in the original image, and s and t are the row and column coordinates of the pixels belonging to S_xy. The technique is very good mainly for removing salt-and-pepper noise. It was used to reduce image noise while preserving edges by various researchers (Islam et al., 2017; Lahiani et al., 2016; Pansare et al., 2012; Pansare & Ingle, 2016; Zhang et al., 2011).
Gaussian filter: A Gaussian filter is a linear, non-uniform low-pass filter that blurs an image with a Gaussian function. It is most often used to reduce noise and smooth edges in an image (Basu, 2002), acting as a smoothing operator, and it is widely employed in image preprocessing for sign language recognition (Umamaheswari & Karthikeyan, 2019). A Gaussian filter was used to remove noise and smooth images in Oliveira et al. (2017), Pansare et al. (2012) and Yusnita et al. (2017). The equation of the two-dimensional (2D) Gaussian filter is given by:

G(x, y) = (1 / (2πσ²)) exp(−(x² + y²) / (2σ²))

where G(x, y) denotes the Gaussian filter value, x and y denote the row and column offsets from the kernel centre, and σ is the standard deviation of the Gaussian distribution.
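In practice the 2D Gaussian above is sampled on a small grid and normalised to form a convolution kernel; a minimal numpy sketch (the 5 × 5 size and σ = 1 are illustrative defaults):

```python
import numpy as np

def gaussian_kernel(size=5, sigma=1.0):
    """Sample G(x, y) on a size x size grid centred at the origin and
    normalise the weights to sum to 1 (the 1/(2*pi*sigma^2) constant
    cancels in the normalisation)."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    g = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return g / g.sum()

kernel = gaussian_kernel(5, sigma=1.0)
```

The resulting kernel is symmetric, peaks at the centre, and can be slid over the image exactly like the mean-filter window, giving weighted rather than uniform averaging.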
Adaptive Filter : An adaptive filter is applied to the noisy image to remove noise from the image while detailed information about the image is retained. It preserves edges and other high-frequency parts of an image more than a similar linear filter. The mean and variance are the two statistical measures used to determine adaptive filters. The algorithm used to achieve adaptive filter is given as ( Kaluri and Reddy, 2016a , 2016b ): Step 1 : Read the input colour image.
Step 2 : Convert the input colour image to a grayscale image.
Step 3: Add Gaussian noise to the image.
Step 4: Eliminate the noise from the image using the wiener2 function. An adaptive filter was used to remove noise from the input sign image in Kaluri and Reddy (2016a, 2016b).
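The adaptive filtering step can be sketched with the commonly used wiener2-style local formulation (local mean and variance per window, with the noise power estimated as the average local variance). This is a generic sketch of that formulation, not the exact implementation of Kaluri and Reddy (2016a, 2016b):

```python
import numpy as np

def adaptive_wiener(image, k=3):
    """Locally adaptive (wiener2-style) filter: smooth strongly where
    the local variance is close to the noise level, weakly near edges."""
    img = image.astype(np.float64)
    pad = k // 2
    padded = np.pad(img, pad, mode="edge")
    mean = np.zeros_like(img)
    sq = np.zeros_like(img)
    # Local mean and mean-of-squares over the k*k window.
    for dy in range(k):
        for dx in range(k):
            win = padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
            mean += win
            sq += win**2
    mean /= k * k
    var = sq / (k * k) - mean**2
    noise = var.mean()   # noise power estimated as the average local variance
    gain = np.maximum(var - noise, 0) / np.maximum(np.maximum(var, noise), 1e-12)
    return mean + gain * (img - mean)

flat = np.full((6, 6), 50.0)
out = adaptive_wiener(flat)
```

On a flat (noise-free) region the gain is zero and the output equals the local mean, while near edges the high local variance drives the gain toward 1, leaving the detail intact.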
Wiener filter: The Wiener filter is primarily used to remove noise from an image; it minimizes the mean square error (MSE) between the estimated random process and the desired process, optimizing the trade-off between smoothing image discontinuities and removing noise (Tania & Rowaida, 2016). The technique can blur the image significantly because a fixed filter is used throughout the entire image. The restoration function of the Wiener filter incorporates both the degradation function and the statistical characteristics of the noise. It uses a high-pass filter for deconvolution and a low-pass filter for noise reduction during compression (Maru & Parikh, 2017). The technique can be used to remove various kinds of noise, such as salt-and-pepper, Gaussian and speckle noise, from an image (Kaur, 2015). The frequency-domain equation to compute the Wiener filter is given as:

W(u, v) = H*(u, v) / (|H(u, v)|² + Sη(u, v)/Sf(u, v))

where u and v are the frequency-domain coordinates, H(u, v) is the degradation function (H* denotes its complex conjugate), and Sη and Sf are the power spectra of the noise and of the original image, respectively. Various filtering techniques have been used in sign language recognition. A Wiener filter was used to eliminate noise from the sign image (Kaluri & Reddy, 2017). Some researchers combined two filtering techniques in order to remove noise from the image: Pansare et al. (2012) combined the median and Gaussian filters to remove noise and to smooth the input image, respectively. Table 5 summarizes different image filtering techniques with their advantages and disadvantages.

Image segmentation techniques
Image segmentation is the process of partitioning an image into meaningful regions called segments (Egmont-Petersen et al., 2002). Images are segmented to obtain the region of interest. There are two basic approaches to segmentation: contextual and non-contextual (Enikeev & Mustafina, 2020; Jin et al., 2016; Al-Shamayleh et al., 2020). Contextual segmentation exploits the relationships between image features, such as edges, similar intensities and spatial proximity. Non-contextual segmentation ignores spatial relationships between image features and instead groups pixels based on global attribute values (Pal & Pal, 1993; Sharma et al., 2021). Image segmentation techniques are classified into edge detection-based, thresholding, region-based, clustering-based, and artificial neural network-based methods. A detailed outline of the image segmentation techniques is shown in Fig. 7.

Thresholding techniques
Thresholding is the simplest and most commonly used segmentation technique for separating objects from the background (Lee et al., 1990; Cheng et al., 2002; Dong et al., 2008; Xu et al., 2013). The techniques separate the image pixels according to their intensity and the range of values in which each pixel lies. Examples of thresholding techniques are global thresholding, local adaptive thresholding and multilevel thresholding.
Global thresholding : The global thresholding technique uses a single threshold value for the whole image to separate foreground from background. This technique assumes that the image has a bimodal histogram. Thus the image can be extracted from the background using a simple operation that compares image values with a threshold T ( Rogowska, 2009 ).
The threshold technique is defined by Kaur and Chand (2018) as:

g(x, y) = 1 if f(x, y) > T, and g(x, y) = 0 otherwise

where g(x, y) is the output image of the original image f(x, y), and T is the threshold value, constant for the entire image. The Otsu thresholding technique is the most widely used global thresholding technique. It converts a multi-level image into a binary image through a threshold value that separates the foreground from the image's background. It is based on discriminant analysis, maximizing the between-class variance of the grey levels in the object and background portions. The weighted sum of the variances of the two classes is given as:

σ_w²(t) = w₀(t) σ₀²(t) + w₁(t) σ₁²(t)

where w₀ and w₁ are the probabilities of the two classes separated by a threshold t (with a value range from 0 to 255), and σ₀² and σ₁² are the variances of these two classes. The class probabilities w₀ and w₁ are computed from the bins of the histogram p(i) as:

w₀(t) = Σ_{i=0}^{t-1} p(i),  w₁(t) = Σ_{i=t}^{255} p(i)

where 255 is the maximum pixel value. The total variance is given as:

σ² = σ_w²(t) + σ_b²(t)

where the between-class variance is determined by:

σ_b²(t) = w₀(t) w₁(t) [μ₀(t) − μ₁(t)]²

with μ₀(t) and μ₁(t) the mean grey levels of the two classes. Research presented in Rahim et al. (2020), Pansare et al. (2015), Islam et al. (2017) and Tan et al. (2021) used Otsu's thresholding algorithm, based on global thresholding, to segment the hand region from its background using the computed threshold value. Otsu's thresholding was fused with the Canny edge detector and the Discrete Wavelet Transform (DWT) to segment the region of interest from the video sequence (Kishore & Kumar, 2012). Due to the use of a single threshold value, this technique does not produce effective segmentation under uneven illumination over an entire image.

Table 5. Advantages and disadvantages of image filtering techniques.

Mean filter. Advantages: easy to implement. Disadvantages: a single wrongly represented pixel value can significantly impact the mean of all pixels in its neighbourhood; it blurs an edge when the filter neighbourhood crosses a boundary.

Median filter. Advantages: it preserves thin edges and sharpness in the input image; both of the problems of the mean filter are tackled by the median filter. Disadvantages: it is relatively expensive and complex to compute; it is good mainly for removing salt-and-pepper noise and is less effective at removing Gaussian noise.

Gaussian filter. Advantages: it is effective for removing Gaussian noise; the weights give higher significance to pixels near the centre of the kernel. Disadvantages: it has high computational time and sometimes removes edge details in an image.

Adaptive filter. Advantages: it preserves edges and other high-frequency parts better than a similar linear filter. Disadvantages: it is computationally complex, and some visible distortions may remain in the filtered image.

Wiener filter. Advantages: it is a popular filter for image restoration; it is not sensitive to noise; it exploits the statistical properties of the image; a small window size can be used to prevent blurring of edges. Disadvantages: prior knowledge of the power spectral density of the original image is unavailable in practice; it is comparatively slow to apply because it works in the frequency domain; the output image can be very blurred.
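Otsu's criterion can be implemented directly by exhaustively searching the threshold that maximises the between-class variance w₀w₁(μ₀ − μ₁)²; a compact numpy sketch (the function name and the toy bimodal image are our own):

```python
import numpy as np

def otsu_threshold(image):
    """Return the threshold t maximising the between-class variance
    w0 * w1 * (mu0 - mu1)^2 over an 8-bit grayscale image."""
    hist = np.bincount(image.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()               # normalised histogram
    best_t, best_var = 0, -1.0
    for t in range(1, 256):
        w0, w1 = p[:t].sum(), p[t:].sum()
        if w0 == 0 or w1 == 0:
            continue                    # one class empty: skip
        mu0 = (np.arange(t) * p[:t]).sum() / w0
        mu1 = (np.arange(t, 256) * p[t:]).sum() / w1
        var_b = w0 * w1 * (mu0 - mu1) ** 2
        if var_b > best_var:
            best_t, best_var = t, var_b
    return best_t

# Bimodal image: background around 30, hand region around 200.
img = np.array([30] * 50 + [200] * 50, dtype=np.uint8).reshape(10, 10)
t = otsu_threshold(img)
mask = img > t          # foreground (hand) pixels
```

For a cleanly bimodal histogram like this, any threshold between the two modes maximises the criterion, and the foreground mask recovers exactly the bright "hand" pixels.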
Local adaptive thresholding: Local adaptive thresholding is employed to address the problem of global thresholding-based techniques by dividing an image into sub-images and calculating a threshold for each sub-image (Korzynska et al., 2013). This thresholding technique uses the mean value of the local intensity distribution, or other statistics such as the mean plus the standard deviation, to separate foreground from background in each sub-image (Senthilkumaran & Vaithegi, 2016). The most basic adaptive thresholding approach was proposed by Niblack (1986), in which the local threshold is calculated from the mean (m) and standard deviation (s) of the local neighbourhood of window size w. The adaptive threshold value based on Niblack's technique is given as:

T_niblack = m + k · s

where T_niblack is the adaptive threshold, m is the mean of the window of size w, s is the standard deviation, and k is a fixed value chosen depending on the remaining noise level in the image's background. The adaptive threshold technique helps separate objects from varying backgrounds and extract tiny and sparse regions. Its major drawback is that it is more computationally expensive than global thresholding. Adaptive thresholding was used to segment the region of interest in sign language recognition for further processing (Dudhal et al., 2019; Rao & Kishore, 2018).
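Niblack's per-pixel threshold T = m + k·s can be sketched in numpy as follows (the 3 × 3 window and k = −0.2 are common defaults; the paper leaves both image-dependent):

```python
import numpy as np

def niblack_threshold(image, w=3, k=-0.2):
    """Per-pixel Niblack threshold T = m + k*s over a w x w window."""
    img = image.astype(np.float64)
    pad = w // 2
    padded = np.pad(img, pad, mode="edge")
    mean = np.zeros_like(img)
    sq = np.zeros_like(img)
    # Local mean and mean-of-squares over the sliding window.
    for dy in range(w):
        for dx in range(w):
            win = padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
            mean += win
            sq += win**2
    mean /= w * w
    std = np.sqrt(np.maximum(sq / (w * w) - mean**2, 0))
    return mean + k * std

img = np.full((5, 5), 100.0)
T = niblack_threshold(img)
binary = img > T
```

Unlike the global case, T is a full image-sized array, so the decision surface adapts to local illumination; on a perfectly flat image the threshold simply equals the local mean.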
Multilevel thresholding: The multilevel thresholding technique is employed to extract homogeneous regions in an image. This technique determines multiple thresholds for the given image and divides the image into distinct regions. The method yields adequate results for images with coloured or complex backgrounds, where bilevel thresholding fails. Skin colour segmentation is one approach to multilevel thresholding and is widely used in different areas of application, including human-computer interaction (HCI), image recognition, traffic control systems, video surveillance and hand segmentation (Sallam et al., 2021; Lee et al., 2021a, 2021b; Razmjooy et al., 2021). In hand segmentation, the technique uses a colour model to separate the skin region from the image. The colour models used include the Red, Green and Blue (RGB) colour space, Hue Saturation Value (HSV), and Luminance (Y) and Chromaticity (Cb, Cr) (YCbCr) colour space models; these colour models were extensively discussed in Garcia-Lamont et al. (2018) and Zarit et al. (1999). YCbCr and HSV are the most widely used skin colour segmentation techniques in sign language. The equations to transform an input RGB image into YCbCr (Eqs. (16) to (18)) are:

Y = 0.299R + 0.587G + 0.114B
Cb = 128 − 0.169R − 0.331G + 0.500B
Cr = 128 + 0.500R − 0.419G − 0.081B

and that of the HSV colour model (Eq. (19)):

V = max(R, G, B),  S = (V − min(R, G, B)) / V (with S = 0 when V = 0),

where H is computed piecewise from whichever channel attains the maximum, e.g. H = 60 · (G − B) / (V − min(R, G, B)) when V = R.

The values of H, S, V, Y, Cb and Cr are determined and used as threshold values to obtain the required colour model.
The RGB colour space is the least preferred for colour-based detection and colour analysis because it is challenging to identify and establish human skin from RGB colour tone due to the variation in human skin ( Shaik et al., 2015 ). The HSV model is an effective mechanism for determining human skin based on hue and saturation. The other efficient models are YUV and YIQ colour space models ( Tabassum et al., 2010 ).
Research in Hartanto et al. (2014) and Huong et al. (2016) converted the RGB colour space into the HSV colour space to identify skin regions. Tariq et al. (2012) transformed the input RGB video into the YCbCr colour model; the value obtained for each frame pixel is compared with a specific threshold value that isolates the hand region from the whole image. The YCbCr colour model was hybridised with a Gaussian Mixture Model (GMM) and morphological operations to obtain the skin colour region in the research of Pan et al. (2016). Athira et al. (2019) converted single-handed dynamic gestures from the input video to the YCbCr model and then eliminated the face region, retaining only the hand region. The efficiency of the CIELAB colour space was studied by Mahmud et al. (2019). Similar research conducted by Shaik et al. (2015) and Suharjito et al. (2019) found that YCbCr is more robust and gives more accurate results than HSV and other skin colour segmentation techniques under different illumination conditions.
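Putting Eqs. (16)-(18) together with Cb/Cr thresholding gives a simple skin-colour segmenter. The sketch below uses the BT.601 conversion and the frequently quoted skin bounds Cb ∈ [77, 127], Cr ∈ [133, 173]; these ranges are common defaults in the literature, not values taken from this survey:

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """BT.601 RGB -> YCbCr conversion (the usual form of Eqs. 16-18)."""
    r, g, b = [rgb[..., i].astype(np.float64) for i in range(3)]
    y  =  0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.169 * r - 0.331 * g + 0.500 * b
    cr = 128.0 + 0.500 * r - 0.419 * g - 0.081 * b
    return np.stack([y, cb, cr], axis=-1)

def skin_mask(rgb, cb_range=(77, 127), cr_range=(133, 173)):
    """Threshold the chromaticity channels to flag likely skin pixels."""
    ycbcr = rgb_to_ycbcr(rgb)
    cb, cr = ycbcr[..., 1], ycbcr[..., 2]
    return ((cb >= cb_range[0]) & (cb <= cb_range[1]) &
            (cr >= cr_range[0]) & (cr <= cr_range[1]))

skin_pixel = np.array([[[200, 140, 120]]], dtype=np.uint8)   # skin-like tone
blue_pixel = np.array([[[0, 0, 255]]], dtype=np.uint8)
```

Because luminance Y is ignored, the same Cb/Cr window tolerates moderate brightness changes, which is the robustness to illumination attributed to YCbCr above.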

Edge detection techniques
The edge detection approach is one of the essential image processing techniques. The technique relies on rapid changes of intensity values in an image: edge detection algorithms detect edges where either the first derivative of intensity is larger than a certain threshold or the second derivative contains zero crossings. Good edge-based segmentation requires three critical steps: detecting edges, eliminating irrelevant edges, and joining the edges. The edge detection techniques reviewed in this paper are the Roberts, Sobel, Prewitt, Laplacian of Gaussian and Canny edge detectors. Various edge detector techniques were discussed extensively in Bhardwaj and Mittal (2012), Chinu and Chhabra (2014), Maini and Aggarwal (2009), Rashmi and Saxena (2013), and Shrivakshan and Chandrasekar (2012).
Roberts edge detector: The Roberts edge detector is a gradient-based operator that computes the sum of the squares of the differences between diagonally adjacent pixels through discrete differentiation and then calculates an approximate gradient of the image. The input image is convolved with 2 × 2 kernels, and the gradient magnitude and direction are computed (Maini & Aggarwal, 2009). To obtain the gradient components Gx and Gy, the two 2 × 2 kernels are applied independently to the input image. The equations used to compute the gradient magnitude |G| and direction θ are given as:

|G| = √(Gx² + Gy²),  θ = tan⁻¹(Gy / Gx)

where Gx and Gy are the gradients in the x and y directions, respectively.

Sobel edge detector: The Sobel edge detector is a discrete differentiation operator of the first-order derivative used to approximate the gradient of the image intensity function for edge detection. The technique convolves the input image with 3 × 3 kernels to obtain the gradient magnitude and emphasises pixels closer to the kernel's centre. The technique is less sensitive to noise than the Roberts edge detector while remaining computationally fast (Rashmi & Saxena, 2013; Sujatha & Sudha, 2015). The direction of the gradient θ for the Sobel edge detector is given as:

θ = tan⁻¹(Gy / Gx)

Prewitt edge detector: The Prewitt edge detector operates very similarly to the Sobel edge detector, with 3 × 3 kernels, and is widely used to detect an image's vertical and horizontal edges (Maini & Aggarwal, 2009). In the Prewitt edge detector, the maximum response of all eight kernels at a pixel location is used to calculate the local edge gradient magnitude, and pixels closer to the kernel's centre are not given extra emphasis (Shrivakshan & Chandrasekar, 2012). The technique is less computationally expensive but detects many false edges, which results in a noisy output. It is susceptible to noise and more effective on noiseless images (Muthukrishnan & Radha, 2011).
The equation to compute the gradient magnitude of the Prewitt edge detector is given as (Sujatha & Sudha, 2015):

|G| = max_{i=1,…,n} |G|ᵢ

where |G|ᵢ is the response of kernel i at the particular pixel position, and n is the number of convolution kernels.
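The Sobel gradient magnitude |G| = √(Gx² + Gy²) can be computed directly in numpy (the edge-replication padding is our own choice; a Prewitt version would simply swap the 2s in the kernel for 1s):

```python
import numpy as np

def sobel_magnitude(image):
    """Apply the two 3x3 Sobel kernels and return |G| = sqrt(Gx^2 + Gy^2)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    img = image.astype(np.float64)
    padded = np.pad(img, 1, mode="edge")
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    # Accumulate the kernel responses over the 9 shifted views.
    for dy in range(3):
        for dx in range(3):
            win = padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
            gx += kx[dy, dx] * win
            gy += ky[dy, dx] * win
    return np.hypot(gx, gy)

# Vertical step edge: left half dark, right half bright.
img = np.zeros((5, 6))
img[:, 3:] = 100.0
mag = sobel_magnitude(img)
```

The magnitude peaks on the two columns flanking the step and is zero in the flat regions; the direction θ = tan⁻¹(Gy/Gx) would be obtained from the same gx and gy arrays.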
Laplacian of Gaussian (LoG) Edge Detector : This technique uses second-order derivatives of pixel intensities to locate edges in an image. An image's Laplacian is used to highlight areas of rapid intensity change in order to detect edges. Before applying the Laplacian function, the image is subjected to a Gaussian smoothing filter to reduce noise levels. It takes a single grey level image as input and produces another grey level image as output ( Maini & Aggarwal, 2009 ).
The Laplacian L(x, y) of an image with pixel intensity values I(x, y) is given as:

L(x, y) = ∂²I/∂x² + ∂²I/∂y²

Combining the Gaussian smoothing filter, used to reduce noise before the Laplacian is applied, with the Laplacian gives the LoG operator:

LoG(x, y) = −(1 / (πσ⁴)) [1 − (x² + y²) / (2σ²)] exp(−(x² + y²) / (2σ²))

The standard deviation σ of the Gaussian smoothing filter strongly influences the behaviour of the LoG edge detector: an increase in the value of σ results in a wider Gaussian filter and more smoothing, which may make edges in an image harder to distinguish.
Canny edge detector: The Canny edge detection technique (Canny, 1986) has been suggested as an optimal edge detector. It is one of the standard edge detection techniques and was first created by John Canny at MIT in 1983 (Muthukrishnan & Radha, 2011). The goal of this technique is to detect edges in an image while simultaneously suppressing noise: the noise is removed from the image before the edges are found. The basic Canny edge detection algorithm is given as: Step 1: Read the input image.
Step 2: Smooth the image with a Gaussian filter of chosen kernel size to reduce noise and unwanted details (sometimes median filter can be used because it preserved edges more than gaussian filter).
Step 3: The gradient of the image is used to determine edge strength. For this, a Roberts mask or a Sobel mask can be utilized. The equation used to compute the magnitude of the gradient |G| is given as:

|G| = √(Gx² + Gy²)

where Gx and Gy are the gradients in the x and y directions, respectively.
Step 4: Find the edge direction by using the gradient in the x and y directions.
The direction of the gradient θ is given as:

θ = tan⁻¹(Gy / Gx)

Step 5: The computed edge directions are resolved into one of four directions: horizontal, vertical, and the two (positive and negative) diagonals.
Step 6: Apply non-maxima suppression to trace the edge in the edge direction and suppress any pixel value below the set value (that is not considered to be an edge). This gives a thin line in the output image.
Step 7: Hysteresis thresholding is applied to eliminate streaking. Different researchers have used various edge detection techniques in sign language recognition. Canny edge detection performs better than many of the edge detection techniques that have been developed: it is an essential method that locates edges without disturbing their features. However, the method faces the challenge of noise being extracted along with the edges, and disjoint edges are sometimes obtained when edges must be selected manually across multiple images. Lionnie et al. (2012) employed Sobel edge detection in their study and compared its performance with skin colour segmentation in HSI colour space, low-pass filtering, histogram equalization, and desaturation. The desaturation technique, which converts the image into grayscale to remove the chromatic channels and preserve only the intensity channel, obtained the best performance. Jayashree et al. (2012) used Sobel edge detection to extract the image's region of interest. Other research by Thepade et al. (2013) performed Sobel edge detection on image datasets; the result of the proposed system was compared with four other edge detection techniques, namely Canny, Roberts, Prewitt and Laplace. The results show that Sobel performed well, but Canny edge detection had the best overall result compared with the others. Also, Prasad et al. (2016a, 2016b) used the Canny edge detector with the discrete wavelet transform to detect boundary pixels of the sign video image; the technique helps extract satisfactory edges and preserve the hand region in the video frames. In recent research, Canny edge detection was used to obtain the object of interest from the image and achieved optimal overall accuracy (Jin et al., 2016; Mahmud et al., 2019; Singh et al., 2019).
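Steps 3-7 above can be condensed into a compact, simplified numpy sketch (the Gaussian smoothing stage is omitted, so the input is assumed already smoothed; the direction quantisation, iterative hysteresis and the low/high thresholds are illustrative simplifications, not Canny's exact formulation):

```python
import numpy as np

def canny_lite(image, low=50, high=100):
    """Sobel gradients, direction quantisation, non-maximum
    suppression, then hysteresis thresholding."""
    img = image.astype(np.float64)
    h, w = img.shape
    p = np.pad(img, 1, mode="edge")
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    ky = kx.T
    gx = sum(kx[dy, dx] * p[dy:dy+h, dx:dx+w] for dy in range(3) for dx in range(3))
    gy = sum(ky[dy, dx] * p[dy:dy+h, dx:dx+w] for dy in range(3) for dx in range(3))
    mag = np.hypot(gx, gy)
    ang = (np.rad2deg(np.arctan2(gy, gx)) + 180.0) % 180.0

    # Non-maximum suppression: keep a pixel only if it is a local
    # maximum along its (quantised) gradient direction.
    offsets = {0: (0, 1), 45: (1, 1), 90: (1, 0), 135: (1, -1)}
    nms = np.zeros_like(mag)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            d = min(offsets, key=lambda a: min(abs(ang[y, x] - a),
                                               180 - abs(ang[y, x] - a)))
            dy, dx = offsets[d]
            if mag[y, x] >= mag[y+dy, x+dx] and mag[y, x] >= mag[y-dy, x-dx]:
                nms[y, x] = mag[y, x]

    # Hysteresis: keep weak pixels only when connected to a strong pixel.
    strong = nms >= high
    weak = nms >= low
    edges = strong.copy()
    for _ in range(h * w):
        grown = np.pad(edges, 1)
        neigh = np.zeros_like(edges)
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                neigh |= grown[1+dy:1+dy+h, 1+dx:1+dx+w]
        new = edges | (weak & neigh)
        if (new == edges).all():
            break
        edges = new
    return edges

img = np.zeros((7, 8))
img[:, 4:] = 100.0            # vertical step edge
edges = canny_lite(img)
```

Even this stripped-down version shows the key behaviours: the flat regions produce no edges, and only pixels along the step survive the suppression and hysteresis stages.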

Region-based techniques
Region-based techniques group pixels exhibiting similarity within closed boundaries based on predefined criteria (Garcia-Lamont et al., 2018). This technique is also known as similarity-based segmentation (Yogamangalam & Karthikeyan, 2013) and requires appropriate thresholding techniques to group similar pixels. The similarity between pixels can be in the form of intensity, colour, shape or texture. Region-based techniques are further classified into region growing and region splitting and merging methods (Divya & Ganesh Babu, 2020; Kaur & Kaur, 2014; Yogamangalam & Karthikeyan, 2013).
Region growing technique: The region growing technique clusters pixels that represent similar areas in an image. This is done by grouping pixels whose properties, such as intensity, colour and shape, differ by less than some specified amount. Each grown region is assigned a unique integer label in the output image. Region growing is capable of correctly segmenting regions that have the same properties and are spatially separated. However, it is time-consuming and sometimes produces undesirable results: the output depends strongly on the selection of the similarity criterion, and if it is not chosen correctly, the regions leak into adjoining areas or merge with areas that do not belong to the object of interest (Rogowska, 2009). The region growing technique was extensively reviewed by Ikonomakis et al. (2000), Mehnert and Jackway (1997) and Verma et al. (2011). Assume p(x, y) is the original image to be segmented, s(x, y) is a binary image in which the seeds are located, and T is a predicate to be tested at each location (x, y). The region growing algorithm based on 8-connectivity is given by Kaur and Kaur (2014) as follows: Step 1: All the connected components of s are eroded to seed points.
Step 2: Compute a binary image q whose value is 1 at each location (x, y) where the predicate T is satisfied. The connected components in q are the segmented regions output by the region growing technique.
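A seeded region growing pass over an 8-connected grid can be sketched with a breadth-first search (the intensity-difference predicate and the tolerance value are illustrative choices):

```python
import numpy as np
from collections import deque

def region_grow(image, seed, tol=10):
    """Grow a region from `seed` using 8-connectivity: a neighbour joins
    the region if its intensity is within `tol` of the seed intensity."""
    h, w = image.shape
    visited = np.zeros((h, w), dtype=bool)
    region = np.zeros((h, w), dtype=bool)
    seed_val = float(image[seed])
    queue = deque([seed])
    visited[seed] = True
    while queue:
        y, x = queue.popleft()
        if abs(float(image[y, x]) - seed_val) <= tol:   # predicate T
            region[y, x] = True
            for dy in (-1, 0, 1):                       # 8-connected neighbours
                for dx in (-1, 0, 1):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w and not visited[ny, nx]:
                        visited[ny, nx] = True
                        queue.append((ny, nx))
    return region

img = np.zeros((6, 6), dtype=np.uint8)
img[1:4, 1:4] = 200          # a bright 3x3 "hand" region
region = region_grow(img, seed=(2, 2))
```

The grown region stops exactly at the intensity discontinuity, which also illustrates the leakage risk noted above: a tolerance that is too generous would let the region spill into the background.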
Region splitting and merging technique: The region splitting and merging segmentation technique uses two basic operations, splitting and merging, to segment an image into regions. Splitting iteratively divides the image into maximal regions having similar characteristics, and merging combines similar adjacent regions to form a well-segmented version of the original image (Kaur & Kaur, 2014). A combination of these two operations gives better performance than either alone. The basic algorithm for region splitting and merging is given as follows. Let p be the original image, Rᵢ represent a subdivided region, and T be the chosen predicate.
Step 1: Start with the entire image as a single region, R₁ = p.
Step 2: Split into four quadrants any region Rᵢ for which T(Rᵢ) = FALSE.
Step 3: Merge any adjacent regions Rᵢ and Rⱼ for which T(Rᵢ ∪ Rⱼ) = TRUE.
Step 4: Repeat Step 3 until no further merging is possible.
Ikonomakis et al. (2000) summarise the procedure for region splitting and merging as follows: (i) split into four disjoint quadrants any region where a homogeneity criterion does not hold; (ii) merge any adjacent regions that satisfy the homogeneity criterion; (iii) stop when no further merging or splitting is possible.

Clustering based segmentation techniques
Clustering-based segmentation is an unsupervised learning technique that divides a set of elements into uniform groups. It is widely used to segment images into clusters of pixels with similar characteristics. Several clustering-based segmentation techniques exist; the two most widely used are K-means and Fuzzy C-means clustering (Cebeci & Yildiz, 2015; Ghosh & Dubey, 2013; Khan, 2013).
K-means technique: The K-means algorithm is a clustering-based technique that has been extensively reviewed in Duggirala (2020) and Panwar et al. (2016). It takes the distance from each data point to a cluster prototype as the primary function to optimize, and the iterative adjustment rules are obtained by finding the extreme values of this function. The K-means algorithm uses the Euclidean distance as the similarity measure and seeks the optimal classification of an initial cluster-centre vector so that the evaluation index is minimized; the error sum of squares is used as the clustering criterion function. Although the K-means algorithm is efficient, the value of K must be given in advance, and an appropriate value of K is difficult to estimate (Zheng et al., 2018).
K means algorithm by Ghosh and Dubey (2013) is given as follows: Step 1 : Set desired clusters, K and Initialize to choose k starting points which are used as initial estimates of the cluster centroids.
Step 2: To classify all the data or images: (a) Calculate the distance between clusters centroids and the data.
(b) Move the data closer to the cluster that has less distance as compared to others.
Step 3: Centroid calculation -When each point in the dataset is assigned to a cluster, it is needed to recalculate the new k centroids.
Step 4: Convergence criteria - Steps 2 and 3 are repeated until no point changes its cluster assignment or until the centroids no longer move.
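The four steps above map directly onto a short numpy implementation; the sketch below clusters 1-D grayscale intensities (the function name, the fixed random seed and the toy image are our own choices):

```python
import numpy as np

def kmeans_1d(pixels, k=2, iters=20, seed=0):
    """K-means on grayscale intensities (Euclidean distance in 1-D)."""
    rng = np.random.default_rng(seed)
    data = pixels.astype(np.float64).ravel()
    # Step 1: k starting points as initial centroid estimates.
    centroids = rng.choice(data, size=k, replace=False)
    for _ in range(iters):
        # Step 2: assign every pixel to its nearest centroid.
        labels = np.argmin(np.abs(data[:, None] - centroids[None, :]), axis=1)
        # Step 3: recompute each centroid as the mean of its pixels.
        new = np.array([data[labels == j].mean() if (labels == j).any()
                        else centroids[j] for j in range(k)])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new, centroids):
            break
        centroids = new
    return centroids, labels.reshape(pixels.shape)

img = np.array([[10, 12, 11], [200, 205, 198]], dtype=np.uint8)
centroids, labels = kmeans_1d(img, k=2)
```

For this well-separated toy image the centroids converge to roughly 11 and 201, splitting the pixels into a dark and a bright cluster regardless of which two values the initialisation happened to pick.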
Fuzzy C-means (FCM) technique: The Fuzzy C-means (FCM) technique assigns membership levels and uses them to assign data elements to one or more clusters. It provides a more precise computation of cluster membership and has been used successfully in many image clustering applications (Dhanachandra & Chanu, 2017). FCM is an iterative approach that generates a fuzzy partition matrix and requires cluster centres and an objective function. The cluster centres and the objective function value are updated at every iteration, and the process terminates when the difference between two successive objective function values is smaller than a predetermined threshold (Khan, 2013). The major difference from the K-means algorithm is that, instead of making a hard decision about which cluster each element belongs to, FCM assigns each element a value between 0 and 1 describing its membership of every cluster. It requires much more computation time than K-means clustering, which produces nearly the same results in less time. The FCM algorithm is given as follows (Khan, 2013): Step 1: Assign the values for c (the number of clusters; 2 ≤ c < n), q (the weighting exponent of each fuzzy member) and the threshold value ε. Step 2: Initialize the partition matrix U = [uᵢₖ], the degree of membership of xₖ in the i-th cluster.
Step 3: Initialize the cluster centers and a counter p .
Step 4: Calculate the membership values and store them in an array.
Step 5: For each iteration, calculate the parameters aᵢᵖ and bᵢᵖ until all pixels are processed. Step 6: After each iteration, update the cluster centres and compare them with the previous values. Step 7: If the difference is less than the defined threshold value, stop the iteration; otherwise, repeat the procedure.
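A minimal FCM loop on 1-D intensities, following the standard fuzzifier-q formulation (the membership-update formula below is the textbook one; the exact intermediate parameters aᵢᵖ and bᵢᵖ used by Khan (2013) are not reproduced here):

```python
import numpy as np

def fuzzy_c_means(data, c=2, q=2.0, eps=1e-5, max_iter=100, seed=0):
    """Minimal FCM on 1-D intensities: soft memberships in [0, 1]
    instead of K-means' hard assignments."""
    rng = np.random.default_rng(seed)
    x = data.astype(np.float64).ravel()
    u = rng.random((c, x.size))
    u /= u.sum(axis=0)                        # memberships sum to 1 per pixel
    prev_obj = np.inf
    for _ in range(max_iter):
        um = u ** q
        centers = (um @ x) / um.sum(axis=1)   # membership-weighted centres
        d = np.abs(x[None, :] - centers[:, None]) + 1e-12
        obj = (um * d ** 2).sum()             # objective function value
        # u_ik = 1 / sum_j (d_ik / d_jk)^(2 / (q - 1))
        u = 1.0 / ((d[:, None, :] / d[None, :, :]) ** (2.0 / (q - 1.0))).sum(axis=1)
        if abs(prev_obj - obj) < eps:         # objective change below threshold
            break
        prev_obj = obj
    return centers, u

pixels = np.array([10, 11, 12, 200, 201, 202], dtype=np.uint8)
centers, u = fuzzy_c_means(pixels, c=2)
```

Each column of `u` sums to 1, so every pixel carries a graded membership of both clusters; hardening it with an argmax would recover the K-means-style partition mentioned above.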

Artificial neural network based segmentation technique
Artificial neural network-based segmentation techniques replicate the learning mechanisms of the human brain. Such a network is made up of a large number of connected nodes, each with its own weight, and each neuron corresponds to a pixel in the image (Khan, 2014). The network is trained on sample images, creating the connections between neurons and pixels, and new images are then segmented using the trained network. The technique involves two essential steps: extracting features relevant to the image and segmenting with the neural network, and it has performed very well on difficult image segmentation problems. Backpropagation neural networks (BPNN), feedforward neural networks (FFNN), multilayer feedforward networks (MLFF), multilayer perceptrons (MLP), convolutional neural networks (CNN) (Kanezaki, 2018) and self-organizing maps (SOM) are among the most often used neural networks for image segmentation (Amza, 2012; Kanezaki, 2018; Moghaddam & Soltanian-Zadeh, 2011). An algorithm used to separate the region of interest from its background was proposed in An and Liu (2019) and Zhao et al. (2010). In a neural network, an image can be segmented based on pixel classification or edge detection. Sections 3.3.1 to 3.3.5 explained the various segmentation techniques, including thresholding, edge-based, region-based, clustering-based, and artificial neural network-based methods. Table 6 summarizes the advantages and disadvantages of the segmentation techniques. Table 7 illustrates a summary of the vision-based SLR systems, presenting the data acquisition devices, data preprocessing techniques, and segmentation techniques used.

Table 6. Advantages and disadvantages of image segmentation techniques.

Edge detection-based method. Disadvantages: it is not suitable for images with too much noise or too many edges.

Region-based method. Advantages: it is less susceptible to noise and more useful when defining similarity criteria is easy. Disadvantages: it is quite expensive in terms of computation time and memory consumption.

Clustering method. Advantages: it is more useful for real-world challenges due to the fuzzy partial membership employed. Disadvantages: determining membership functions is not easy.

Artificial neural network-based method. Advantages: it does not require a complex program to work and is less prone to noise. Disadvantages: computational time in training is higher.

Feature extraction techniques
Feature extraction is a technique used to obtain the most relevant features from the input image. It aims at finding the most distinctive features in the acquired image ( Patil & Sinha, 2017 ). It is a form of dimensionality reduction that effectively represents the interesting parts of an image. A compact feature vector is extracted by removing irrelevant parts to increase learning accuracy and enhance the visibility of the result ( Khalid et al., 2014 ; Kumar & Bhatia, 2014 ). The feature extraction output supports the classification stage by providing features that can effectively distinguish between classes and help achieve high recognition accuracy. The features extracted from the region of interest are characterised into colour, texture and shape features ( Patel & Gamit, 2016 ). The important feature extraction techniques used in SLR that have achieved good performance include principal component analysis (PCA), Fourier descriptor (FD), histogram of oriented gradient (HOG), scale-invariant feature transform (SIFT), and speed up robust feature (SURF).

Principal component analysis (PCA)
Principal Component Analysis (PCA) is a technique widely used to extract features in image processing. PCA is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of uncorrelated variables ( Kumar & Bhatia, 2014 ). It operates by calculating new variables, called principal components, which are linear combinations of the initial variables. The PCA operation is as follows: first, the principal component capturing the greatest possible variance is extracted. This is followed by the computation of the eigenvectors and their corresponding eigenvalues. The computed eigenvectors are sorted in decreasing order of their eigenvalues, which forms the basis of the dimensionality reduction in PCA. The algorithm used to compute PCA is given as follows ( Cheok et al., 2019 ; Karamizadeh et al., 2013 ):
Step 1: A training set of M images is given, each represented as an S -dimensional column or row vector.
Step 2: The mean μ of all images in the training set is computed using Eq. (30) , with x i as the i th image with its columns concatenated into a vector:
μ = (1/M) Σ_{i=1}^{M} x_i (30)
Step 3: The PCA basis vectors, which are the eigenvectors of the scatter matrix S_T, are computed using Eq. (31) :
S_T = Σ_{i=1}^{M} (x_i − μ)(x_i − μ)^T (31)
Step 4: The eigenvectors and corresponding eigenvalues are calculated, and the eigenvectors are sorted in decreasing order of eigenvalue. Eigenvectors with lower eigenvalues carry less information about the data distribution; these are filtered out to reduce the dimensionality of the data.
Huong et al. (2016) employed PCA to extract the features needed to recognise 25 Vietnamese sign language (VSL) alphabets under a uniform background. Research in Zaki and Shaheen (2011) introduced PCA with Kurtosis position and Motion chain code (MCC). Kurtosis position was fused with PCA to capture the edges that denote articulation, and MCC extracted the vectors representing hand movement. The research findings show that combining these three feature extraction techniques improved recognition accuracy to 89.90% compared to using the methods separately or in pairs. In a similar study presented by Li et al. (2016) , PCA was combined with an entropy-based K-means algorithm and applied to a Hidden Markov Model (HMM). The input used for feature extraction was acquired from glove-based data using an Attitude Heading Reference System (AHRS) sensor, while noise was removed from unclear signals using a low pass filter (LPF). The developed model demonstrated better performance than the result from the Kalman filter. Prasad et al. (2016a, 2016b) proposed a study that combined PCA and Elliptical Fourier Descriptors for feature extraction on a video dataset. The Elliptical Fourier Descriptors were used to optimise and preserve shape details without any changes in rotation, while PCA combined the feature vector outputs for a given sign from multiple frames into a single vector.
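The steps above (mean-centring, scatter matrix, eigen-decomposition, sorting by eigenvalue) can be sketched for the simplest case of two-dimensional feature vectors, where the eigenvalues of the 2 × 2 scatter matrix have a closed form. This is an illustrative sketch with hypothetical names, not the cited algorithm's code.

```python
import math

def pca_2d(points):
    """Minimal PCA for 2-D feature vectors: mean-centre, build the scatter
    matrix, and return (eigenvalue, eigenvector) pairs sorted by decreasing
    eigenvalue."""
    n = len(points)
    # Step 2: mean of all samples
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    centred = [(x - mx, y - my) for x, y in points]
    # Step 3: scatter matrix S_T = sum (x - mu)(x - mu)^T  (2x2 here)
    sxx = sum(x * x for x, _ in centred)
    sxy = sum(x * y for x, y in centred)
    syy = sum(y * y for _, y in centred)
    # Step 4: closed-form eigenvalues of the symmetric 2x2 matrix
    tr, det = sxx + syy, sxx * syy - sxy * sxy
    disc = math.sqrt(max(tr * tr / 4 - det, 0.0))
    l1, l2 = tr / 2 + disc, tr / 2 - disc          # sorted: l1 >= l2

    def eigvec(lam):
        if abs(sxy) > 1e-12:
            vx, vy = sxy, lam - sxx                # (S - lam I) v = 0
        else:  # diagonal scatter matrix: the axes are the eigenvectors
            vx, vy = (1.0, 0.0) if abs(lam - sxx) <= abs(lam - syy) else (0.0, 1.0)
        norm = math.hypot(vx, vy)
        return (vx / norm, vy / norm)

    return (l1, eigvec(l1)), (l2, eigvec(l2))
```

For points lying near the line y = x, the first principal component comes out close to the diagonal direction (0.707, 0.707) and carries almost all of the variance, illustrating why the low-eigenvalue components can be discarded.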

Fourier descriptor (FD)
Fourier descriptors are used to characterise shape complexity and have been used for identifying different sign shapes ( Agrawal et al., 2012 ; Kishore et al., 2015 ). The Fourier-transformed coefficients form the Fourier descriptors of the image and represent the shape in the frequency domain through lower- and higher-frequency descriptors ( Cheok et al., 2019 ). The lower-frequency descriptors contain information about the general features of the shape, while the higher-frequency descriptors contain information about its finer details.
The complex coordinate of the boundary pixels is given as:
s(k) = x(k) + j y(k)
The complex coefficients of the Fourier descriptors of the boundary coordinates, a(u), are given as:
a(u) = (1/K) Σ_{k=0}^{K−1} s(k) e^{−j2πuk/K}
where K is the total number of boundary pixels in the image, k = 0, 1, 2, …, K−1, u = 0, 1, 2, …, K−1 and ( x , y ) are the coordinates of the point.
The Fourier descriptor coefficients can be normalised to achieve rotation, scale and translation invariance. For shape recognition, Fourier descriptors are particularly effective due to this invariance to scale, rotation, and translation. In research introduced by Kumar, Fourier descriptors were used to obtain the features needed for gesture recognition of 26 Indian sign language (ISL) alphabets; the system performed better than other studies ( Rekha et al., 2011a , 2011b ; Agrawal et al., 2012 ; Kishore et al., 2015 ). A combination of SIFT, Hu moments and FD was employed by Pan et al. (2016) to extract features from given images; the features were reduced by applying PCA and LDA techniques, and recognition was performed for both Chinese Sign Language (CSL) and ASL alphabets.
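The descriptor equations and invariance normalisation above can be sketched directly: boundary points become complex numbers s(k), a(u) is their discrete Fourier transform, a(0) is dropped (it only encodes position), magnitudes discard rotation, and dividing by |a(1)| removes scale. An illustrative sketch with hypothetical names, not a production implementation.

```python
import cmath
import math

def fourier_descriptors(boundary, num=8):
    """Invariant Fourier descriptors of a closed boundary given as (x, y)
    points: compute a(u) = (1/K) sum_k s(k) e^{-j2*pi*u*k/K}, skip a(0)
    (translation), take magnitudes (rotation), divide by |a(1)| (scale)."""
    K = len(boundary)
    s = [complex(x, y) for x, y in boundary]
    a = [sum(s[k] * cmath.exp(-2j * math.pi * u * k / K) for k in range(K)) / K
         for u in range(K)]
    scale = abs(a[1]) or 1.0          # guard against a degenerate boundary
    return [abs(a[u]) / scale for u in range(1, min(num + 1, K))]
```

A quick check of the invariance: the descriptors of a square boundary and of the same square scaled by 3 and translated by (5, 7) coincide, and the first descriptor is 1 by construction.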

Histogram of oriented gradient
Histogram of Oriented Gradient (HOG) is a feature descriptor used to identify objects in image processing. The features obtained from HOG offer a concise and efficient image representation for image classification. It is one of the most robust feature extraction techniques used in recent times to identify shapes or structures within an image ( Torrione et al., 2014 ). HOG features measure the gradient magnitude of the input image and its gradient direction. The central concept behind HOG is that object appearance and outline can be characterised by the distribution of edge directions ( Mahmud et al., 2019 ). Tavari and Deorankar (2014) introduced HOG as a feature extraction technique; in their study, the features are computed by counting gradient orientation occurrences in localised portions of an image. The algorithm used for the HOG descriptor is given as follows:
Step 1 : The gradient of the image I is computed by filtering it with horizontal and vertical one-dimensional derivative masks:
D_x = [−1 0 1] and D_y = [−1 0 1]^T
where D_x and D_y are the horizontal and vertical masks, respectively. The X and Y derivatives are obtained using the convolution operations:
I_x = I * D_x and I_y = I * D_y
The magnitude of the gradient is obtained as:
|G| = sqrt(I_x² + I_y²)
The orientation of the gradient is given as:
θ = arctan(I_y / I_x)
Step 2 : Create the cell histograms. Each pixel casts a weighted vote for an orientation-based histogram channel based on the values found in the gradient computation. The cells themselves are rectangular, and the histogram channels are evenly spread over 0° to 180° or 0° to 360°, depending on whether the gradient is unsigned or signed.
Step 3 : To account for changes in illumination and contrast, the gradient strengths are locally normalised, which requires grouping the cells into larger, spatially connected blocks.
Step 4 : The final feature vectors are obtained.
Mahmud et al. (2019) used HOG to segment the image into 64 blocks, where every block constitutes 2 × 2 cells. The histogram edge orientation is computed for each pixel in a cell to obtain the gradient direction, orientation binning and descriptor blocks, which are the extracted features stored in the feature matrix. Their system with a K-NN classifier performed better than the result obtained using a bag of features with a support vector machine in the same experiment. Raj and Jasuja (2018) used HOG as a feature extraction technique for British sign language alphabets; the features are obtained with an 8 × 8 pixel window that slides over the image. Similar research ( Butt et al., 2019 ) used HOG with Local Binary Pattern (LBP) and statistical features to extract essential image features, with LBP used to enhance the output of the extracted features. Despite the efficiency of HOG in many studies on sign language recognition, it was observed that the choice of HOG parameters affects the feature vector size; this results in computation time and accuracy issues. Joshi et al. (2020) proposed a multi-level HOG feature vector to address the HOG parameter selection challenge. The study adopts a combined Taguchi and Technique for Order of Preference by Similarity to Ideal Solution (TOPSIS) decision-making method to determine the optimal set of multi-level HOG feature vector parameters. The proposed system demonstrates better performance than state-of-the-art methods on a complex-background ISL dataset.
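Steps 1 and 2 above can be sketched for a single cell: compute the gradient with [−1, 0, 1] masks, then accumulate magnitude-weighted votes into an unsigned (0°-180°) orientation histogram. This is an illustrative, hypothetical sketch of one cell of a HOG descriptor, omitting Steps 3-4 (block grouping and normalisation).

```python
import math

def hog_cell_histogram(img, bins=9):
    """Unsigned orientation histogram for one cell of a HOG descriptor.
    `img` is a 2-D list of intensities; border pixels are skipped because
    the central-difference masks need both neighbours."""
    h, w = len(img), len(img[0])
    hist = [0.0] * bins
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = img[y][x + 1] - img[y][x - 1]      # horizontal mask D_x
            gy = img[y + 1][x] - img[y - 1][x]      # vertical mask D_y
            mag = math.hypot(gx, gy)                # |G| = sqrt(gx^2 + gy^2)
            ang = math.degrees(math.atan2(gy, gx)) % 180.0   # unsigned
            b = int(ang / (180.0 / bins)) % bins
            hist[b] += mag                          # magnitude-weighted vote
    return hist
```

For an image containing a single vertical edge, the gradient is purely horizontal, so all votes land in the 0° bin, which is an easy sanity check on the binning.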

Scale-invariant feature transform (SIFT)
SIFT extracts features from an image that are invariant to transformations such as translation, scaling, and rotation. It offers a robust framework for detecting unique invariant image features that can be matched reliably across different views of an image under translation, rotation and scaling, and partially under changes in illumination and camera perspective. The four SIFT computational steps ( Lowe, 2004 ) are as follows:
Step 1: Determine the approximate location and scale of salient feature points. A Difference-of-Gaussian (DoG) function is used to identify potential interest points that are invariant to location and scale while improving computation speed ( Mistry & Banerjee, 2017 ).
The scale space of an image, L ( x, y, σ ), is computed by convolving a Gaussian function G ( x, y, σ ) with the input image I ( x, y ):
L(x, y, σ) = G(x, y, σ) * I(x, y)
The Gaussian function is given as:
G(x, y, σ) = (1 / (2πσ²)) e^(−(x² + y²) / (2σ²)) (41)
D ( x, y, σ ) can be computed from the difference of two nearby scales separated by a constant multiplicative factor k . The difference-of-Gaussian equation is given as:
D(x, y, σ) = L(x, y, kσ) − L(x, y, σ) (42)
Step 2: Keypoint localization. Keypoints are localised by eliminating low-contrast points and edge responses and selecting points based on their stability measures. The gradient magnitude m ( x, y ) and orientation θ ( x, y ) at a keypoint are computed as given in Eqs. (43) and (44) , respectively.
Step 3: Determine the orientation(s) for each keypoint. Orientations are assigned based on the image gradient at each keypoint location.
Step 4: Obtain a descriptor for each keypoint. The keypoints are transformed into a representation that allows for significant levels of local shape distortion and change in illumination.
Several studies on sign recognition systems in which SIFT is employed for feature extraction can be found in the literature ( Agrawal et al., 2012 ; Tharwat et al., 2015 ; Yasir et al., 2016 ; Patil & Sinha, 2017 ; Shanta et al., 2018 ). Patil and Sinha (2017) developed a framework that focused on the time required to implement the various phases of the SIFT algorithm. A system that hybridised SIFT with adaptive thresholding and Gaussian-blur image smoothing for feature extraction was proposed by Dudhal et al. (2019) . Agrawal et al. (2012) combined a shape descriptor with HOG and SIFT methods: the shape descriptor was used to analyse the overall shape of the segmented image, while the HOG descriptors provided invariance to illumination change, orientation and the articulated or occluded gestures associated with two-handed gestures. The key points for each sign image were determined using the SIFT technique and stored as a feature vector for classification. Their findings show better performance with the combination of the three techniques.
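Step 1 above (the DoG response) can be illustrated with a toy sketch: the Gaussian G(x, y, σ) is evaluated directly and the response D at one pixel is taken as the difference of two blurs at scales σ and kσ. This is a hypothetical illustration of the DoG formula only, not the full SIFT scale-space pyramid; all names are ours.

```python
import math

def gaussian(x, y, sigma):
    """G(x, y, sigma) = (1 / (2*pi*sigma^2)) * exp(-(x^2 + y^2) / (2*sigma^2))."""
    return math.exp(-(x * x + y * y) / (2 * sigma * sigma)) / (2 * math.pi * sigma * sigma)

def dog_response(img, cx, cy, sigma, k=1.6, radius=3):
    """Difference-of-Gaussian response at pixel (cx, cy):
    D = L(keep sigma scaled by k) - L(sigma), with each L computed by direct
    convolution over a small window (weights renormalised at the border)."""
    def blur(s):
        num = den = 0.0
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                y, x = cy + dy, cx + dx
                if 0 <= y < len(img) and 0 <= x < len(img[0]):
                    w = gaussian(dx, dy, s)
                    num += w * img[y][x]
                    den += w
        return num / den
    return blur(k * sigma) - blur(sigma)
```

On a flat region the response is zero, while an isolated bright spot gives a strong negative extremum (the wider blur flattens the peak more), which is exactly the behaviour the extrema detection in Step 1 exploits.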

Speed Up Robust Feature (SURF)
Speed Up Robust Feature (SURF) is an effective technique for feature point extraction ( Raj & Joseph, 2016 ). It is a newly developed framework designed to improve the performance of object recognition systems. It was designed as an efficient alternative to SIFT that is faster and more robust ( Sykora et al., 2014 ). The descriptors are derived from the pixels surrounding an interest point. SURF can detect objects in images taken under different extrinsic and intrinsic settings (Bhosale et al., 2014). SURF uses integral images instead of the Difference of Gaussians (DoG) employed in SIFT. The integral image at a location holds the sum of the intensity values of all points in the image with locations less than or equal to that location ( Cheok et al., 2019 ).

The integral image equation presented by Mistry and Banerjee (2017) is given as:
I_Σ(X) = Σ_{i=0}^{x} Σ_{j=0}^{y} I(i, j)
To obtain points of interest, SURF uses a Hessian blob detector, where the determinant of the Hessian matrix defines the magnitude of the response. The integral image is convolved with box filters, which approximate Gaussian filters ( Mistry & Banerjee, 2017 ). The Hessian matrix is given as:
H(X, σ) = [ L_xx(X, σ)  L_xy(X, σ) ; L_xy(X, σ)  L_yy(X, σ) ]
where L_xx(X, σ) is the convolution of the Gaussian second-order derivative ∂²g(σ)/∂x² with the image I at point X , and similarly for L_xy(X, σ) and L_yy(X, σ) .
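The integral image described above is what makes SURF's box filters fast: after one pass over the image, the sum of any axis-aligned box can be read off from four lookups. A minimal sketch with hypothetical names follows.

```python
def integral_image(img):
    """Build the integral image: ii[y][x] = sum of all pixels at
    locations (i, j) with i <= x and j <= y, computed in one pass."""
    h, w = len(img), len(img[0])
    ii = [[0] * w for _ in range(h)]
    for y in range(h):
        row = 0                       # running sum of the current row
        for x in range(w):
            row += img[y][x]
            ii[y][x] = row + (ii[y - 1][x] if y > 0 else 0)
    return ii

def box_sum(ii, x0, y0, x1, y1):
    """Sum of the inclusive box (x0, y0)-(x1, y1) in constant time from
    four integral-image lookups: A - B - C + D."""
    total = ii[y1][x1]
    if x0 > 0:
        total -= ii[y1][x0 - 1]
    if y0 > 0:
        total -= ii[y0 - 1][x1]
    if x0 > 0 and y0 > 0:
        total += ii[y0 - 1][x0 - 1]
    return total
```

Because every box sum costs four lookups regardless of box size, the box-filter approximations of the Gaussian second-order derivatives in the Hessian can be evaluated at any scale for the same cost, which is the key speed advantage over SIFT's explicit convolutions.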
SURF is a well-suited feature extraction technique for images because of its efficient attributes, including scale, translation, and rotation invariance. SURF has attracted a lot of research on sign language recognition systems ( Rekha et al., 2011a ; Yang & Peng, 2014 ; Fang Hu et al., 2014 ; Wang et al., 2015a , 2015b ; Jin et al., 2016 ; Ravi et al., 2018 ; Kaluri & Reddy, 2017 ; Narayanan et al., 2018 ). Hartanto et al. (2014) conducted a study on alphabet recognition using hand gesture images, with the SURF algorithm employed to detect and extract keypoint features. Sykora et al. (2014) compared the SIFT and SURF methods on ten gestures; the findings show that the overall accuracy of SURF is higher than that of SIFT. In the study of Merin Joy and Rajesh (2017) , feature detectors and descriptors were extracted from surveillance video using the SURF method, and the descriptors obtained from different images were later matched in pairs for recognition. Rekha et al. (2011a , 2011b) hybridised SURF and Hu Moment Invariant methods to achieve a good recognition rate along with low time complexity. Elouariachi et al. (2021) proposed a new feature extraction technique called Quaternion Krawtchouk Moments (QKMs) for sign language recognition. The authors compared the QKM technique with a KNN classifier against various traditional feature extraction techniques (HOG, LBP, SIFT, SURF, and Gabor). Their findings show better performance for the proposed method compared with the conventional techniques.
The type of features extracted determines the performance of the recognition. Therefore, it is necessary to choose a good feature extraction technique to achieve good performance. Table 8 summarises the advantages and disadvantages of various feature extraction techniques used in vision-based sign language recognition systems.

Sign language categories
Sign languages are used to accomplish the same functions as spoken languages while simultaneously interpreting spoken words into sign language. The structure of sign language differs from that of spoken language: it has its own phonology, syntax, morphology, vocabulary, and grammar distinct from spoken languages ( Agrawal et al., 2016 ; Sahoo et al., 2014 ). Signs are made in or near the signer's body with either one or two hands in a particular hand configuration ( Schembri et al., 2013 ; Wadhawan & Kumar, 2021 ). The five basic parameters used in sign language communication are handshape, location of the hand, palm orientation, movement of the hand, and facial expression; these influence the meaning of a sign, and the meaning changes when any of these parameters changes ( Wilcox & Occhino, 2016 ).
Sign languages are not universal like spoken languages; they vary across countries due to different geographical locations, nationalities, social boundaries and vocabularies ( Lemaster & Monaghan, 2007 ), and many have evolved independently in hearing-impaired communities ( Agrawal et al., 2016 ; Cheok et al., 2019 ; Kadhim & Khamees, 2020 ; Sahoo et al., 2014 ; Wadhawan & Kumar, 2021 ). It is important to note that most countries that share a spoken language do not share the same sign language. This happens because the deaf and hearing-impaired communities of two or more countries using the same spoken language were not in contact, so their sign languages developed independently. The challenges in developing a sign language recognition system include signing speed, which varies per signer; segmentation of the region of interest from entire images, which is problematic due to the different environments in which the images are taken; illumination; computational time; gesture tracking; and the nature of the parameters used for communication, such as hand shape, facial expression and orientation ( Agrawal et al., 2016 ).

Benchmark datasets
There are various publicly accessible benchmark datasets for evaluating the performance of static, isolated, and continuous sign language recognition systems. For the American sign language (ASL) dataset, the Purdue RVL-SLLL ( Martínez et al., 2002 ) consists of gestures, movements, words and sentences signed by fourteen (14) signers. It comprises 2576 videos, 184 per signer; thirty-nine (39) of the videos are isolated motion primitives, along with 62 hand shapes and sentences. The American Sign Language Lexicon Video Dataset (ASLLVD) encompasses high-quality video sequences of about 3800 ASL signs corresponding to about 3000 signals signed by four native signers. The RWTH-BOSTON-104 ( Zahedi et al., 2006 ) and RWTH-BOSTON-400 ( Dreuw et al., 2008 ) datasets consist of isolated and continuous ASL. The RWTH-BOSTON-104 dataset contains isolated sign language with a vocabulary of 104 signs and 201 sentences signed by three signers. The RWTH-BOSTON-400 dataset was created for developing continuous ASL recognition; it comprises 843 sentences with a vocabulary size of 406 words signed by four (4) signers. Also, the Massey University dataset ( Barczak et al., 2011 ) consists of 36 classes of alphabets (A-Z) and numbers (0-9), with 2160 images in total. For the Arabic Sign Language (ArSL) dataset, the Sign Language Database ( Assaleh et al., 2012 ) contains 40 sentences, with each sentence repeated 19 times; it was acquired from eighty signers. The Signs World Atlas ( Shohieb et al., 2015 ) has about 500 static gestures (finger spelling and hand motions) together with dynamic gestures (non-manual signs) involving body language, lip reading and facial expressions. The Brazilian Sign Language dataset (LIBRAS-HCRGBDS) consists of sixty-one hand configurations of Libras ( Porfirio et al., 2013 ); the Kinect sensor was used to collect about 610 video clips obtained from five different signers. The British sign language dataset was created by the British Sign Language (BSL) Corpus ( Schembri et al., 2013 ) and is made up of videos containing 249 people conversing in BSL, with annotations of 6330 gestures from the conversations.
RWTH-PHOENIX-Weather 2014 ( Forster et al., 2014 ) and the SIGNUM Database ( Agris & Kraiss, 2007 ) were created for German sign language recognition. RWTH-PHOENIX-Weather 2014 contains continuous sign language comprising 6861 sentences and 1558 vocabulary items. The SIGNUM Database contains a vocabulary of 450 basic gestures and 780 sentences signed by 25 signers. Table 9 summarises benchmark datasets from different countries used by various researchers. These datasets are used as the basis for comparison or performance evaluation of developed models.

Review of intelligent classification architectures employed in sign language recognition
After the pre-processing, segmentation, and extraction of features from the images have been completed, it is necessary to use a predictor algorithm to give meaning to the extracted features. Just as humans learn by doing tasks repeatedly, machines are also trained to learn, and machine learning improves their performance. Machine learning is a subfield of computer science and is also classified as an artificial intelligence method ( Voyant et al., 2017 ). The artificial intelligence techniques used for sign language recognition are either supervised or unsupervised. Supervised machine learning takes in a set of known training data and uses it to infer a function from labelled training data, whereas unsupervised machine learning is used to draw inferences from datasets whose input data have no labelled response. Following a comprehensive literature review, the intelligent predictors commonly utilised for recognition of sign language are k-nearest neighbour (KNN), artificial neural network (ANN), support vector machine (SVM), hidden Markov model (HMM), convolutional neural network (CNN), fuzzy logic and ensemble learning. This section briefly describes the machine learning techniques used for the recognition of sign language. Many papers that employed machine learning to recognise or classify sign language are reviewed and presented in the subsequent sections.

K-Nearest Neighbour algorithm (K-NN)
The K-Nearest Neighbour algorithm is also referred to as lazy learning. It is based on the principle that instances within a dataset will generally exist in close proximity to other instances with similar properties ( Selim et al., 2019 ). It predicts the class of a new object from the classes of its k nearest neighbours by performing a simple majority vote to decide the class of the test instance ( Alamelu et al., 2013 ). The 'k' parameter in K-NN refers to the number of nearest neighbours of a test data point to include in the majority voting process ( Kotsiantis et al., 2006 ). The procedure for designing the K-NN classification algorithm presented in Mahmud et al. (2019) is given as follows:
Step 1 : The training dataset and the new input image dataset for testing are loaded.
Step 2 : The value of K, the number of nearest neighbours, is chosen.
Step 3 : For each point in the testing data, do the following:
i. The distance between each row of test data and each row of training data is computed using a distance metric. The Euclidean distance equation is given as:
d(a, b) = sqrt( Σ_{i=1}^{n} (a_i − b_i)² )
where a_i is the test data, b_i is the training data and n is the number of features. The distance values obtained are sorted in ascending order.
ii. The top K rows from the sorted array are chosen.
iii. The class is assigned to the test data point by majority vote, i.e., the most frequent class among these rows.
Step 4 : Repeat Steps 1 to 3 for all testing images in the dataset.
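The steps above can be sketched directly: compute Euclidean distances to every training sample, sort them in ascending order, take the top K, and return the majority-vote class. An illustrative sketch with hypothetical names, not the cited authors' implementation.

```python
import math
from collections import Counter

def knn_predict(train, test_point, k=3):
    """Minimal K-NN classifier. `train` is a list of
    (feature_vector, label) pairs; returns the majority-vote label of the
    k nearest training samples under Euclidean distance."""
    dists = []
    for features, label in train:
        # Step 3(i): Euclidean distance d(a, b) = sqrt(sum (a_i - b_i)^2)
        d = math.sqrt(sum((a - b) ** 2 for a, b in zip(test_point, features)))
        dists.append((d, label))
    dists.sort(key=lambda t: t[0])               # sort ascending
    top_k = [label for _, label in dists[:k]]    # Step 3(ii): top K rows
    return Counter(top_k).most_common(1)[0][0]   # Step 3(iii): majority vote
```

For example, with three training points labelled 'A' near the origin and three labelled 'B' near (5, 5), a query at (0.5, 0.5) returns 'A' and one at (5.5, 5.5) returns 'B'.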
K-NN is one of the machine learning algorithms that use distance measures as their core feature. Some of the distance-based classification methods used for sign language recognition include the Mahalanobis distance ( Huong et al., 2016 ), Manhattan distance ( YanuTara et al., 2012 ) and Euclidean distance ( Pansare et al., 2012 ; Tara et al., 2012 ; Hartanto et al., 2014 ; Huong et al., 2016 ). Tara et al. (2012) used the Manhattan distance method to achieve a recognition accuracy of 95% with small computational latency. A similar study by Huong et al. (2016) investigated and compared the Euclidean and Mahalanobis distance methods; the system achieved 90.4% and 91.5% recognition accuracy for the Euclidean and Mahalanobis distances, respectively. Izzah and Suciati (2014) applied KNN to recognise 120 stored images of ASL and 120 images captured in real time using a webcam. The system achieved an accuracy of 86% with images stored in the database and 69% recognition accuracy with the real-time captured images; it performed better than PCA with KNN. K-NN has also been used as a classifier for static sign language recognition ( Jasim & Hasanuzzaman, 2015 ; Tharwat et al., 2015 ; Gupta et al., 2016 ). Anand et al. (2016) developed a system for recognition of 13 alphabets of Indian sign language; the features extracted using the Discrete Wavelet Transform (DWT) were fed into a KNN classifier, achieving a recognition accuracy of 99.23%. Also, Mahmud et al. (2019) employed K-NN for the recognition of ASL alphabets. The system used the L*a*b colour space to segment the region of interest and canny edge detection with HOG to extract features, achieving a recognition accuracy of 94.23%.
In the same research, the system's performance was tested using a combination of a bag of features (BoF) with k-means for feature extraction and SVM for classification, with a recognition accuracy of 86%, which is lower than the result with K-NN. Butt et al. (2019) used K-NN for sign language recognition and compared their model's performance with a generalised linear model (GLM) and deep learning algorithms. The obtained accuracies for GLM, K-NN and deep learning are 100%, 98.03% and 99.9%, respectively. The authors reported that the performance of Naive Bayes, decision trees and the other considered algorithms was much lower than that of GLM, K-NN and deep learning in the same experiment. Table 10 summarises some of the vision-based SLR studies using various feature extraction techniques with KNN.

Artificial neural network (ANN)
Artificial Neural Network (ANN) is a computational analytical tool inspired by the biological nervous system of the brain in a bid to mimic human reasoning. It consists of highly interconnected networks that can compute input values and perform parallel computations for data processing and knowledge representation. It is a branch of artificial intelligence (AI) that helps build predictive models from large databases. Due to its robust and adaptive nature, ANN has been used to perform computations such as pattern recognition, matching and classification ( Jielai et al., 2015 ; Adegboye et al., 2020 ). It is generally defined by three parameters: the interconnection pattern between different layers of neurons, the weights of the interconnections, and the activation function. A neuron has inputs x 1 , x 2 , x 3 , …, x p , each labelled with a weight w 1 , w 2 , w 3 , …, w p that measures the strength of its connection, and K is the activation function. Fig. 8 shows the structure of the neural network layers.
The output function ( y ) is given as:
y = K( Σ_{i=1}^{p} w_i x_i )
Neural network algorithms used for gesture recognition include the feed-forward neural network, backpropagation neural network and multilayer perceptron (MLP). Kishore et al. (2015) employed an ANN with a backpropagation algorithm to recognise hand gestures of Indian sign language. The system obtained a feature vector using elliptical Fourier descriptors, and four cameras were used to enhance the result, with a recognition accuracy of 95.10%. Prasad et al. (2016a, 2016b) employed the backpropagation NN algorithm to recognise static and dynamic gestures of ISL alphabets and numbers; the selected vocabulary consists of 59 sign gestures, recognised with an accuracy of 92.34%. The dataset used is video-acquired data with features extracted using a combination of PCA and Elliptical Fourier Descriptors. A similar approach using a feedforward backpropagation NN algorithm ( Islam et al., 2017 ) was proposed for ASL alphabets and numbers and attained a recognition accuracy of 94.32%. For recognising 20 Bangla sign language alphabets, a feedforward backpropagation NN was employed by Hasan et al. (2017) ; the features extracted from a combination of canny edge detection and FCC are fed into the neural network, and the system achieved a recognition accuracy of 96.5%. Also, Raj and Jasuja (2018) proposed an ANN-based system to recognise British sign language alphabets. The testing dataset contained 780 sign images, and the system attained 99.01% accuracy, which is better than similar research ( Liwicki & Everingham, 2009 ) on British sign language alphabets with a recognition accuracy of 98.9%. Shaik et al. (2015) hybridised a backpropagation neural network with a genetic algorithm model and reported that their system performed better than existing backpropagation-based hand gesture recognition.
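The single-neuron output y = K(Σ w_i x_i) can be written in a few lines. This is an illustrative sketch with hypothetical names; a sigmoid is assumed as the default activation K, though any function can be passed in.

```python
import math

def neuron_output(x, w, activation=None):
    """Output of one neuron: y = K(sum_i w_i * x_i).
    `x` are the inputs, `w` the corresponding weights; `activation` is the
    function K (defaults to a sigmoid, a common choice in BPNN/MLP layers)."""
    K = activation or (lambda s: 1.0 / (1.0 + math.exp(-s)))
    return K(sum(wi * xi for wi, xi in zip(w, x)))
```

With a linear activation and inputs [1, 2, 3] weighted by [0.5, 0.5, 0.5] the output is simply the weighted sum 3.0; with the sigmoid default, a zero weighted sum yields 0.5, the midpoint of the activation.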
The extensive review we performed revealed that several researchers ( Tariq et al., 2012 ; Adithya et al., 2013 ; Yasir & Khan, 2014 ) have used ANN for sign language recognition, and the proposed systems have demonstrated good performance. Recent research by Sharma et al. (2020a, 2020b) used several machine learning algorithms to recognise features extracted from a combination of ORB with K-means and bag of words (BoW); their findings show the best recognition accuracy of 96.96% using a multilayer perceptron model. A summary of SLR studies using various feature extraction techniques with ANN is presented in Table 11 .

Support vector machine (SVM)
Support Vector Machine (SVM) is a supervised learning model with associated learning algorithms that are non-probabilistic. It is a popular pattern recognition learning technique for classification and regression analysis ( Xanthopoulos & Razzaghi, 2014 ). SVM can be used to solve both pattern-classification and nonlinear-regression problems ( Adeyanju et al., 2015 ), but it is most useful in solving difficult pattern-classification problems ( Martiskainen et al., 2009 ). Classification is performed in SVM by differentiating between two or more data classes. This is done by defining an optimal hyperplane that separates all categories, as shown in Fig. 9 (a). However, if it is impossible to find a single line to separate the two classes in the input space, classification can be performed using the kernel trick, as shown in Fig. 9 (b). SVM operates according to the concept of margin calculation: boundaries are drawn between classes such that the distance between the margin and the classes is maximised, and hence the classification error is reduced. Optimisation techniques are employed to find the optimal hyperplane ( Cheok et al., 2019 ). The hyperplane equation that performs the separation is given as:
w · x + b = 0
where w is the weight vector; for training data ( x 1 , y 1 ), …, ( x n , y n ), each y i is either 1 or −1, indicating the class to which the data point x i belongs (target output). The weight vector decides the orientation of the decision boundary, whereas the bias b decides its location.
I.A. Adeyanju, O.O. Bello and M.A. Adegboye / Intelligent Systems with Applications 12 (2021) 200056
The distance D1 between support vector 1 and the plane is D1 = |w · x + b| / ‖w‖ = 1/‖w‖, and likewise the distance D2 between support vector 2 and the plane is D2 = 1/‖w‖. The margin M is the sum of the distances of the two support vectors: M = D1 + D2 = 2/‖w‖. SVM has been proposed for the recognition of sign language by many researchers ( Rashid et al., 2009 ; Rekha et al., 2011b ; Mohandes, 2013 ; Moreira Almeida et al., 2014 ; Kong & Ranganath, 2014 ; Dahmani & Larabi, 2014 ; Sun et al., 2015 ; Pansare & Ingle, 2016 ; Raheja et al., 2016 ; Lee et al., 2016 ; Chong & Lee, 2018 ; Rahim et al., 2019 ; Abiyev et al., 2020 ). Agrawal et al. (2012) developed a model for recognition of Indian sign language in real-time using SVM. The system combined shape descriptors with HOG and SIFT to extract features. The model achieved a recognition accuracy of 93%. The performance of SVM has been compared with K-NN in ( Tharwat et al., 2015 ; Yasir et al., 2016 ), where SVM was reported to perform better than the K-NN classifier. Also, Tharwat et al. (2015) used SVM to recognise Arabic sign language (ArSL). The model achieved a recognition accuracy of 99%. A similar research also used Kinect with the SVM classifier to develop an Indian Sign Language Recognition (SLR) system.

The system demonstrated a recognition accuracy of 97.5% ( Raheja et al., 2016 ). In Yasir et al. (2016) , SVM was employed to predict Bangla sign language; a bag of words with K-means clustering was used to reduce the feature vectors obtained from the video sequence of signs, and SVM achieved better results than K-NN, particularly on large datasets. SVM was used as a classifier, and Zernike moments were employed to find the keyframe in a video sequence in Athira et al. (2019) . The system attained an accuracy of 91% and 89% for static and dynamic signs, respectively. However, the performance of both systems is low under complex backgrounds and poor lighting. SVM was implemented in a mobile application to recognise 16 ASL alphabets and achieved a recognition accuracy of 97.13% in Jin et al. (2016) . Other studies that employed SVM as a classifier with good performance include ( Joshi et al., 2020 ; Barbhuiya et al., 2021 ; Sharma et al., 2020a, 2020b ). The summary of SLR using various feature extraction techniques coupled with SVM is presented in Table 12 .

Hidden Markov model (HMM)
One of the most effective sequence detection and recognition techniques is the Hidden Markov Model (HMM). It is a statistical model assumed to be a Markov process with an unknown set of hidden parameters. The hidden parameters can be obtained from related observation parameters ( Lan et al., 2017 ). Several researchers have used this technique in sign language recognition, where the images to be processed typically form a video sequence, and have achieved good recognition accuracy. In HMM, a new state is generated when an input is applied. The "hidden" in the Hidden Markov Model implies that shifts from the old state to the new state are not directly measurable, and the transition probabilities depend on how the model is trained with the training sets. The training has to be valid and cover all classes, because the model only learns from what it is trained on. The HMM variants used as classifiers for sign language recognition include the continuous Hidden Markov Model (using Gaussians) and the discrete Hidden Markov Model (using multinomials) ( Suharjito et al., 2019 ). Geetha (2020a, 2020b) gave procedural steps for continuous sign language recognition (CSLR) using HMM. The HMM classifier is structured so that each sign label is modelled with a predefined number of hidden states s, as shown in Fig. 10. Given a video clip as a sequence of images X1, X2, ..., XN, a CSLR system finds the sequence of sign words Gloss1, ..., Glossm for which the image sequence best fits the learned models. For decoding purposes, a sliding window approach was considered for each video segment from the continuous utterance, while the Viterbi algorithm ( Forney, 1973 ) was carried out for each HMM to find the most likely gesture based on their scores. The model with the best score was considered as the sign (gloss or label).
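The per-model Viterbi scoring step described above can be sketched in a few lines. The two-state HMM, its transition and emission probabilities, and the observation alphabet below are invented toy values; a real CSLR system would run one such scoring pass per sign model and pick the gloss with the best score.

```python
# Minimal Viterbi decoding for one discrete HMM. All probabilities
# here are made-up toy numbers, not values from any real SLR system.
states = ("s0", "s1")
start_p = {"s0": 0.6, "s1": 0.4}
trans_p = {"s0": {"s0": 0.7, "s1": 0.3}, "s1": {"s0": 0.4, "s1": 0.6}}
emit_p = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"a": 0.1, "b": 0.9}}

def viterbi(obs):
    """Probability of the single best hidden-state path for obs."""
    # Initialise with the first observation.
    v = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    # Recurse: best predecessor score times transition, times emission.
    for o in obs[1:]:
        v = {s: max(v[p] * trans_p[p][s] for p in states) * emit_p[s][o]
             for s in states}
    return max(v.values())
```

In a CSLR system one such score is computed for every sign model over the same observation window, and the gloss whose model yields the maximum score is emitted.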
The decoded sign is the label whose HMM maximises the observation likelihood: Gloss* = argmax over i = 1, ..., m of P(x | Gloss_i), where, for m labels and an observation sequence x of length n, x_j is the observation of the input sequence at position j and s_{i,j} represents the state of the HMM for sign label i (Gloss_i) at x_j. Zaki and Shaheen (2011) proposed an HMM-based sign language model using an appearance-based feature extraction technique. Hybridization of three feature extraction techniques was used to obtain features that describe the point of articulation, hand orientation and movement. The system used skin colour thresholding and connected component identification to extract only the dominant hand and face and track them. Their system achieved an overall error rate of 10.9% on the RWTH-BOSTON-50 database. Also, Li et al. (2016) used HMM as a classifier to recognise 11 home-service-related Taiwan sign language words. The system combined PCA with an entropy-based K-means algorithm for data transformation, eliminating redundant dimensions and estimating the number of clusters in the dataset without losing the main information. The ABC algorithm and the Baum-Welch algorithm were integrated to resolve optimization problems, achieving a recognition accuracy of 91.30%. Some researchers hybridize HMM with other techniques to improve SLR accuracy, such as CNN-HMM ( Koller et al., 2018 ), HMM-SVM ( Lee et al., 2016 ), which used Kinect to create a 3D model of the acquired image and achieved a recognition accuracy of 85.14%, and Coupled-HMM ( Kumar et al., 2017 ), which used Kinect and Leap sensors to acquire data, with a recognition accuracy of 90.80%.
Kaluri and Reddy (2017) introduced a genetic algorithm to improve HMM for gesture recognition and compared the performance of the proposed model against SVM and a neural network. The recognition accuracies for the proposed method, the neural network and SVM are 83%, 72% and 79%, respectively. Suharjito et al. (2019) proposed models for predicting ten signs of Argentine sign language using Gaussian HMM and Multinomial HMM. The models were compared with edge detection and skin detection techniques. Their results showed that Gaussian HMM performed better when using edge detection, with a recognition accuracy of 83%. HMM-based algorithms for sign language recognition have been developed by other researchers, including Coupled HMM (CHMM) ( Brand et al., 1997 ; Kumar et al., 2017 ), Tied-Mixture Density HMM ( Zhang et al., 2004 ), Parallel HMM (PaHMM) ( Vogler & Metaxas, 1999 ) and Parametric HMM (PHMM) ( Wilson & Bobick, 1999 ).

Convolutional neural network (CNN)
Convolutional neural networks offer a wide range of applications, including face recognition, scene labelling, image classification, voice recognition and natural language processing ( Gopika et al., 2020 ). A CNN is a type of deep learning algorithm that takes in an input image, assigns values to various features in the image, and uses these values to differentiate the features ( Saha, 2018 ). It normally has input, convolution, pooling, and fully connected layers with an output ( Nisha & Meeral, 2021 ). Fig. 11 shows the operation of a convolutional neural network. The convolution layer extracts features from the input data by a convolution operation. The pooling layer reduces feature dimensionality and controls the computational burden by selecting the most significant features, which also helps prevent overfitting. The extracted features are fed to the fully connected layer, which includes an activation function ( Jiang et al., 2020 ). Instead of building complex handmade features, CNNs can automate the process of feature extraction and perform better compared to other traditional image processing techniques ( Ameen & Vadera, 2017 ).
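To make the convolution and pooling operations concrete, the toy sketch below applies a hand-written valid 2-D convolution followed by 2x2 max pooling to a small binary image. The image and kernel values are arbitrary illustrative choices, not part of any reviewed system.

```python
# Valid 2-D convolution: slide the kernel over the image and sum
# the elementwise products at each position (feature extraction).
def conv2d(img, kernel):
    k = len(kernel)
    n = len(img) - k + 1
    return [[sum(img[i + a][j + b] * kernel[a][b]
                 for a in range(k) for b in range(k))
             for j in range(n)] for i in range(n)]

# 2x2 max pooling: keep the strongest response in each 2x2 block
# (dimensionality reduction, as described above).
def max_pool2x2(fmap):
    return [[max(fmap[i][j], fmap[i][j + 1],
                 fmap[i + 1][j], fmap[i + 1][j + 1])
             for j in range(0, len(fmap), 2)]
            for i in range(0, len(fmap), 2)]

image = [[1, 0, 1, 0, 1],
         [0, 1, 0, 1, 0],
         [1, 0, 1, 0, 1],
         [0, 1, 0, 1, 0],
         [1, 0, 1, 0, 1]]
# Kernel responding to vertical intensity changes (difference
# between vertically adjacent pixels).
edge_kernel = [[1, 0], [-1, 0]]
features = max_pool2x2(conv2d(image, edge_kernel))
```

The 5x5 input yields a 4x4 feature map of alternating +1/−1 responses, and pooling reduces it to a 2x2 map of maxima; in a real CNN many such kernels are learned from data rather than fixed by hand.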
A real-time ASL recognition system using CNN was proposed by Taskiran et al. (2018) . The input image is segmented using a convex hull algorithm to obtain the hand region after YCbCr skin colour segmentation. The proposed CNN model consists of an input layer, two 2D convolution layers, pooling, flattening and two dense layers. An accuracy of 98.05% was attained when tested with real-time data. Tolentino et al. (2019) developed a real-time system to help beginners learn sign language through hand recognition. This system is based on a skin-colour modelling technique to extract the hand region from the background, integrated with a Convolutional Neural Network (CNN) to classify images. The system achieved an accuracy of 90.04% for ASL alphabets, 93.44% for number recognition and 97.52% for static word recognition. Dudhal et al. (2019) proposed a CNN-based approach for ISL recognition using SIFT to extract features from images with different variants. The output of the SIFT algorithm was fed into a CNN for classification, with a recognition accuracy of 92.78%. The research also applied adaptive thresholding to the sign image with CNN and achieved an accuracy of 91.84%. The study shows that better performance is achieved with the hybridization of SIFT, while the hybridization of adaptive thresholding and Gaussian blur is better than adaptive thresholding alone. Similarly, Barbhuiya et al. (2021) proposed an approach for robust modelling of static sign language recognition using deep learning-based convolutional neural networks (CNN). The proposed sign language recognition system includes four major phases: image acquisition, image preprocessing, training and testing of the CNN classifier. The developed system achieved the highest training accuracies of 99.72% and 99.90% on coloured and grayscale images, respectively. More recently, Barbhuiya et al. (2021) designed a CNN framework based on modified pre-trained AlexNet and VGG16 models for feature extraction, with a multiclass support vector machine (SVM) as the classifier. The system achieved a recognition accuracy of 99.82%.
Another deep learning approach using a fine-tuned VGG16-based CNN model ( Nelson et al., 2019 ) achieved a recognition accuracy of 97%, compared to a fine-tuned VGG19 ( Khari et al., 2019 ) with a recognition accuracy of 94.8%. Adithya and Rajesh (2020) introduced a deep learning model based on CNN to recognise static ASL alphabets. The model was tested on the National University of Singapore (NUS) hand posture dataset and the American fingerspelling A dataset, which contains 24 letters of the ASL alphabet. The developed model performed well on both datasets, with 94.7% and 99.96% recognition accuracy on the NUS and American fingerspelling A datasets, respectively.

Fuzzy logic systems
Fuzzy logic is a form of multi-valued logic that uses the mathematical theory of fuzzy sets for approximate rather than exact reasoning ( Ngai et al., 2014 ). Zadeh presented the fuzzy logic formalism in 1965 to model logical reasoning that is imprecise or vague ( Song & Chissom, 1993 ). The approach mimics how humans make decisions involving all intermediate possibilities between the digital values 0 and 1 ( Scott et al., 2003 ; Nuhu et al., 2021 ). This classification technique is based on fuzzy set theory and is used in different areas to solve problems. It operates in three stages: fuzzification, inference and defuzzification. In the fuzzification stage, the input feature vectors are converted into fuzzy values with the aid of fuzzy membership functions, which assign a score to each fuzzy value. The second stage is the fuzzy inference engine, where the mapping between input and output membership functions is done based on the fuzzy rules. Defuzzification is the final phase, which combines the outputs into a single numerical value for prediction ( Kaluri & Reddy, 2016a, 2016b ). Fuzzy logic systems, including the Fuzzy Inference System (FIS) and the Adaptive Neuro-Fuzzy Inference System (ANFIS) ( Nedeljkovic et al., 2004 ), have been applied in sign language recognition systems.

Fuzzy inference system (FIS)
Fuzzy inference uses fuzzy logic to create a mapping from a given input to an output. It entails assigning each rule a weight between 0 and 1, then multiplying it by the membership value assigned to the outcome of the rule ( Lenard et al., 1999 ). The process of fuzzy inference involves membership functions, fuzzy logic operators and if-then rules. Fuzzy Inference System (FIS) implementations can be further classified into Mamdani and Sugeno types ( Wang, 2001 ). Mamdani fuzzy inference system: The Mamdani fuzzy inference system was first introduced by Mamdani and Assilian (1975) . The Mamdani technique is one of the most extensively utilized fuzzy inference systems. It is employed because the design combines expert knowledge with intuition in the form of IF-THEN rules expressed in natural language ( Izquierdo & Izquierdo, 2018 ).
In Mamdani, the rules of the fuzzy system take the form Q_i: IF x_1 is A_1^j and ... and x_n is A_n^j THEN y is B^j, where Q_i denotes the i-th rule, i = 1, ..., N_r, N_r is the number of rules, x_n is the n-th input to the fuzzy system, and A_n^j and B^j are fuzzy sets. The propositions in the IF part of the rule are combined by applying the minimum operator; sometimes the product is calculated instead, depending on the situation. The number of propositions in the consequent part of the rule depends on the number of outputs of the fuzzy system ( Garcia-Diaz et al., 2013 ).
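A minimal sketch of the Mamdani pipeline described above: fuzzification with triangular membership functions, minimum for the IF-part, max aggregation of the clipped output sets, and centroid defuzzification over a sampled output universe. The two rules, membership functions and universes below are entirely hypothetical.

```python
def tri(x, a, b, c):
    """Triangular membership function rising from a, peaking at b, falling to c."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def mamdani(x):
    # Two hypothetical rules: IF x is LOW THEN y is SMALL;
    #                         IF x is HIGH THEN y is LARGE.
    w_low = tri(x, -5, 0, 5)    # firing strength of "x is LOW"
    w_high = tri(x, 0, 5, 10)   # firing strength of "x is HIGH"
    ys = [i * 0.1 for i in range(101)]  # sampled output universe [0, 10]
    # Clip each rule's output set at its firing strength, aggregate by max.
    agg = [max(min(w_low, tri(y, -5, 0, 5)), min(w_high, tri(y, 0, 5, 10)))
           for y in ys]
    # Centroid defuzzification over the sampled universe.
    den = sum(agg)
    return sum(y * m for y, m in zip(ys, agg)) / den if den else 0.0
```

For an input that fires only the second rule (for example x = 5), the centroid of the symmetric LARGE set is returned, i.e. y = 5; intermediate inputs blend the two clipped sets before the centroid is taken.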

Sugeno fuzzy inference system
The Sugeno technique of fuzzy inference is similar to the Mamdani technique. The first steps of the fuzzy inference process, fuzzifying the inputs and applying the fuzzy operator, are exactly the same. In contrast, Sugeno fuzzy inference employs singleton output membership functions that are either constant or linear functions of the input values. Compared to a Mamdani system, the Sugeno defuzzification procedure is more computationally efficient.
The rules in the functional Sugeno fuzzy inference system take the form Q_j: IF x_1 is A_1^j and ... and x_n is A_n^j THEN y = f_j(x), where f_j(x) is a crisp function of the input variables rather than a fuzzy proposition. The major difference between Mamdani-type FIS and Sugeno-type FIS is that Mamdani-type FIS applies a defuzzification technique to a fuzzy output, whereas Sugeno-type FIS uses a weighted average to compute the crisp output, thereby bypassing the defuzzification stage. Mamdani FIS is well suited to human input and has a more interpretable rule base ( Kaur & Kaur, 2012 ; Shleeg & Ellabib, 2013 ).
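The weighted-average output of a Sugeno system can be sketched in a few lines. The membership functions and the first-order (linear) rule consequents below are hypothetical values chosen for illustration.

```python
def sugeno(x, rules):
    """rules: list of (membership_fn, consequent_fn) pairs.

    The crisp output is the firing-strength weighted average of the
    crisp rule outputs f_j(x), so no defuzzification stage is needed.
    """
    ws = [mu(x) for mu, _ in rules]
    fs = [f(x) for _, f in rules]
    return sum(w * f for w, f in zip(ws, fs)) / sum(ws)

# Hypothetical first-order rules over x in [0, 10]:
#   IF x is LOW  THEN y = 2x + 1
#   IF x is HIGH THEN y = 0.5x + 4
rules = [
    (lambda x: max(0.0, min(1.0, 1 - x / 10)), lambda x: 2 * x + 1),
    (lambda x: max(0.0, min(1.0, x / 10)),     lambda x: 0.5 * x + 4),
]
```

At x = 5 both rules fire with strength 0.5, so the output is simply the mean of the two linear consequents, (11 + 6.5) / 2 = 8.75.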

Adaptive neuro-fuzzy inference system (ANFIS)
The Adaptive Neuro-Fuzzy Inference System (ANFIS), proposed by Jang (1993) , is a neural network-based integration system that optimizes the fuzzy inference system. It creates a set of fuzzy if-then rules with appropriate membership functions. Human expertise regarding the outputs to be modelled is used to set the initial fuzzy rules and membership functions. ANFIS can modify these fuzzy if-then rules and membership functions based on an interconnected neural network to map the numerical inputs to outputs with reduced error ( Ahmed & Shah, 2017 ; Al-Hmouz et al., 2012 ), as shown in the two-input ANFIS architecture in Fig. 12. This technique refines fuzzy IF-THEN rules to describe the behaviour of a complex system and enables a fast and accurate learning mechanism. It successfully combines the advantages of fuzzy logic and neural network techniques into a single approach ( Jang, 1993 ; Kassem et al., 2017 ).
Given Sugeno if-then rules as: Rule 1: IF x is A1 and y is B1, THEN f1 = p1 x + q1 y + r1; Rule 2: IF x is A2 and y is B2, THEN f2 = p2 x + q2 y + r2, the overall output of the model is computed as f = sum over i of w̄_i f_i = (w1 f1 + w2 f2) / (w1 + w2), where w̄_i f_i is the output of node i in Layer 4 as illustrated in Fig. 12, x and y are the inputs, A_i and B_i are the fuzzy sets, f_i are the outputs within the fuzzy region specified by the fuzzy rule, and p_i, q_i and r_i are the design parameters determined during the training process. Al-Jarrah and Halawani (2001) developed an adaptive neuro-fuzzy inference system to recognise Arabic sign language. The system was applied to sign images with a bare hand and achieved a recognition accuracy of 93.55%. Al-Jarrah and Al-Omari (2007) also presented a similar model using an improved algorithm. The system achieved recognition accuracies of 97.5% and 100% for 10 and 19 rules, respectively. Kausar et al. (2008) used colour-marked gloves to extract fingertip and finger-joint features from the Pakistani sign language alphabets. The angle between fingertip and joint was distinguished using a Fuzzy Inference System (FIS). Lech and Kostek (2012) presented a fuzzy rule-based inference system for dynamic gesture recognition. Their system showed better performance than the fixed speed thresholds method, using Sugeno fuzzy inference as a classifier for the video dataset.
In the study of Kishore et al. (2016) , a Fuzzy Inference Engine was proposed for Indian sign language recognition. Continuous gestures of 50 words were segmented and extracted using the Horn-Schunck Optical Flow (HSOF) algorithm and active contours, respectively. The system combined the most dominant features of the data to form the features used for recognition. Mufarroha and Utaminingrum (2017) used an adaptive network-based fuzzy inference system to group features and a KNN classifier to speed up the recognition process. The system achieved a recognition accuracy of 80.77% within ten epochs. Elatawy et al. (2020) proposed a predictive model to recognise Arabic sign language alphabets. The dataset was first converted into the neutrosophic domain to add more information. The system used the fuzzy c-means technique and attained a recognition accuracy of 91%. Table 13 summarises the reviewed papers on fuzzy logic systems used for sign language recognition.

Ensemble learning
Ensemble learning (EL) is a hybrid learning model that combines multiple classification algorithms to improve the model's prediction performance ( Simske, 2019 ). It can build a strong classifier with low training error from a combination of multiple weak classifiers. An ensemble learning method is a meta-algorithm that combines several machine learning techniques to solve many real-life problems and create a predictive model with improved accuracy ( Polikar, 2012 ; Huang & Chen, 2020 ; Zhao et al., 2021 ). The approach has been widely used in various machine learning applications, including face and object recognition, error correction, object tracking, and feature selection. Examples of ensemble learning techniques are Adaptive Boosting (AdaBoost) and XGBoost (Extreme Gradient Boosting).
The commonly adopted ensemble learning approaches are bagging and boosting. Bagging combines the predicted classifications (predictions) from multiple models, or from the same type of model trained on different learning data. Bagging also addresses the inherent instability of results when applying complex models to relatively small datasets. Boosting, like bagging, combines the decisions of different models by amalgamating the various outputs into a single prediction, but it derives the individual models in different ways. In the bagging approach, the models receive equal weight, whereas boosting weights models according to how successful they are.
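A toy sketch of the bagging procedure just described: bootstrap resamples of a synthetic 1-D dataset each train a simple threshold stump, and the stumps' predictions are combined by majority vote. The data, the stump learner and the model count are all invented for illustration.

```python
import random

def bootstrap(data, rng):
    """Resample the data with replacement, keeping both classes present."""
    while True:
        sample = [rng.choice(data) for _ in data]
        if len({y for _, y in sample}) == 2:
            return sample

def train_stump(sample):
    """Weak learner: threshold midway between the two class means."""
    lo = [x for x, y in sample if y == 0]
    hi = [x for x, y in sample if y == 1]
    t = (sum(lo) / len(lo) + sum(hi) / len(hi)) / 2
    return lambda x: int(x > t)

def bagging(data, n_models, rng):
    """Train n_models stumps on bootstrap samples; predict by majority vote."""
    models = [train_stump(bootstrap(data, rng)) for _ in range(n_models)]
    return lambda x: int(sum(m(x) for m in models) * 2 > n_models)

# Synthetic, well-separated 1-D data: class 0 near 2, class 1 near 8.
data = [(x, 0) for x in (1, 2, 3)] + [(x, 1) for x in (7, 8, 9)]
predict = bagging(data, n_models=5, rng=random.Random(0))
```

Because each stump's threshold always falls between the two class clusters, the vote is unanimous here; on noisier data the averaging over bootstrap samples is what stabilises the prediction, which is the instability-reduction property noted above.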
Aloysius and Geetha (2020a, 2020b) presented a weighted average ensemble of CNNs, which includes a low-resolution network (LRN), an intermediate resolution network (IRN) and a high-resolution network (HRN). Their system demonstrated improved performance, with a high accuracy of 91.76% for static hand gestures compared with the single trained CNN model. Zhang et al. (2005a) proposed an ensemble classifier based on continuous HMM (CHMM) and AdaBoost to boost the recognition performance of Chinese sign language (CSL) subwords. Zhang et al. (2005b) developed a hierarchical voting classification (HVC) based on a combination of continuous hidden Markov models (CHMM), with an accuracy of 95.4%. Simon et al. (2007) proposed an ensemble system for sign language and human behaviour recognition. Also, Huang et al. (2012) developed an ensemble system based on the hybridization of HMM and DWT for hand gesture recognition; their method outperforms state-of-the-art approaches using HMM or DWT alone. In other research ( Savur & Sahin, 2017 ; Mustafa & Lahsasna, 2016 ), American sign language recognition systems were developed using ensemble learning and achieved good recognition accuracy. Yang et al. (2016) presented a system based on a combination of dynamic time warping based Level Building (LB-DTW) and Fast HMM to improve the recognition accuracy of continuous signs, with Fast HMM used to handle the computational complexity. The system achieved a recognition accuracy of 87.80% on 100 Chinese sign language (CSL) sentences with a vocabulary of only 21 signs. Ji et al. (2017) proposed a sign language interactive robot command system using a hybrid CNN-SVM classifier. Their system attained a recognition accuracy of 97.72% on the NUS hand posture dataset II. Kim et al. (2018) presented an ensemble artificial neural network. The system used an 8-channel electromyography armband sensor and eight ANN classifiers. An optimum recognition accuracy of 97.40% was achieved with their proposed method. Nguen et al. (2019) ensembled ResNet-based CNN and ResNet quaternion CNN models to recognise Japanese sign language. Gupta and Jha (2020) proposed a real-time recognition system for continuously signed sentences of Indian sign language. The proposed system used an ensemble technique based on multiple SVM classifiers on features extracted from fixed window durations and achieved an accuracy of 98.7% for 11 sentences. Raghuveera et al. (2020) used an ensemble of three features (histogram of oriented gradients (HOG), speeded-up robust features (SURF) and local binary patterns (LBP)) to train an SVM. Their system achieved a recognition accuracy of 71.85% with a response time of 35 s. Sharma et al. (2020a, 2020b) introduced a novel ensemble-based transfer learning algorithm called Trbaggboost. Its performance was compared with transfer learning algorithms including TrAdaboost ( Yao & Doretto, 2010 ), TrResampling ( Liu et al., 2017 ), TrBagg ( Kamishima et al., 2009 ), and Random forest ( Yuan et al., 2020 ); Trbaggboost outperformed these state-of-the-art methods with a recognition accuracy of 97.04%. Yuan et al. (2020) proposed an ensemble technique based on random forest (RF) for offline Chinese sign language alphabet recognition. Four features were extracted from the subjects' forearms, and the system achieved a recognition accuracy of 95.48%, better than SVM and ANN. The summary of the reviewed ensemble learning methods in sign language recognition is presented in Table 14 .

Artificial Neural Network (ANN)
It is useful where fast evaluation of the learned target function is required.
It is quite robust to noise in the training dataset. It has fault tolerance.
It is computationally expensive and has difficulties in finding a proper network structure.

Support Vector Machine (SVM)
It performs well when dealing with multidimensional and continuous features. It is applicable in numerous domains. It is tolerant of irrelevant attributes.
It requires a large sample dataset to achieve its maximum prediction accuracy. Its hyperparameters are often challenging to tune, and their impact is difficult to interpret.

Hidden Markov Model (HMM)
It performs relatively well in recognition. It is easier to implement and analyse. It eliminates the label bias problem.
HMMs often have a large number of unstructured parameters. They require a large amount of training data to obtain good results.

Convolution Neural Network (CNN)
It automatically detects important features without any human supervision. It handles image classification successfully with high accuracy.
High computational cost. It requires a lot of training data to achieve good accuracy. It lacks the ability to be spatially invariant to the input data. It does not encode the position and orientation of the object.

Fuzzy logic
It is a robust system in which no precise inputs are required. It is flexible and allows modifications. It is an expert-based technique that provides solutions to complex problems. It deals with complicated problems in a simple way.
It is completely dependent on human intelligence and expertise. It has low accuracy, and its predictions are not always correct.
Ensemble Learning It improves the average prediction performance. It provides high accuracy and a more stable model. It reduces the variance of predictive errors.
It can be more difficult to interpret. Sometimes the model can be overfitted or underfitted using the ensemble learning method.
Various sign language classifiers have been reviewed in this paper. Each technique has merits and demerits relative to the others. Table 15 shows the various advantages and disadvantages of the classifiers reviewed in this study.

Other classification methods
This section briefly reviews the less popular machine learning algorithms employed in sign language recognition. These algorithms include the modified Long Short-Term Memory, Bayesian classifier, finite state machine with fuzzy logic, MultiLayered Perceptron, and Self-Organizing Maps. Wong and Cipolla (2005) used a relevance vector machine (RVM) with the probabilistic nature of the Bayesian classifier to improve the recognition accuracy of ten gestures in complex motion analysis. The system achieved an accuracy of 91.8%. Verma and Dev (2009) proposed a finite state machine (FSM) and fuzzy logic-based method for hand gesture recognition. Their system extracted features comprising 2D hand positions from the images using the Harris corner detector and clustered them with the fuzzy c-means (FCM) algorithm. The clusters of hand postures determine the states of the FSM and give the appropriate gesture recognition.
Karami et al. (2011) developed a system based on a MultiLayered Perceptron (MLPNN) for the recognition of 32 classes of Persian sign language (PSL). The features for classification were extracted using the discrete wavelet transform (DWT), and the system achieved a recognition accuracy of 94.06%. Maraqa and Abu-Zaiter (2008) introduced recurrent neural networks (RNN) to recognise static Arabic signs. The proposed model attained an accuracy of 95.11%. Bhat et al. (2013) proposed Self-Organizing Maps (SOM) for hand gesture recognition. Their system transformed the image into radially enclosed edges used to train the SOM and achieved a recognition accuracy of 92%. Lim et al. (2016) proposed a feature covariance matrix with a serial particle filter for isolated sign language recognition. The proposed American sign language recognition system achieved a recognition accuracy of 87.33%. Camgoz et al. (2017) introduced a novel deep learning architecture called SubUNets, based on convolutional neural networks (CNNs) and Bidirectional Long Short-Term Memory (BLSTM) with connectionist temporal classification (CTC), to recognise continuous sign language. The model was trained on a one-million hand gesture dataset with an accuracy improvement of about 30% over the previous state-of-the-art system. Abraham et al. (2019) proposed a real-time Indian sign language recognition system using LSTM-based neural networks. The proposed model attained a recognition accuracy of 98% for 26 gestures. Mittal et al. (2019) designed a modified Long Short-Term Memory (LSTM) network to recognise isolated and continuous gestures of Indian sign language. The proposed system achieved average recognition accuracies of 72.3% and 89.5% for continuous signs and isolated sign words, respectively. Vincent et al. (2019) developed an American sign language recognition system for human activities.
The system incorporates a Convolutional Neural Network (ConvNet) and a Recurrent Neural Network with Long Short-Term Memory (LSTM) ( Bantupalli & Xie, 2019 ), called DeepConvLSTM. It achieved a recognition accuracy of 91.1% with data augmentation on the testing datasets. Tornay et al. (2020) introduced Kullback-Leibler divergence HMM (KL-HMM) for a multilingual sign language recognition system. The performance of the system was validated using three different sign languages. Meng and Li (2021) proposed a new multi-scale dual sign language recognition network (SLR-Net) based on a graph convolutional network (GCN) to overcome the challenges of redundant information, finger occlusion, motion blurring and variation in the ways people sign. Their model consists of three sub-modules: a multi-scale attention network (MSA), a multi-scale spatiotemporal attention network (MSSTA), and an attention-enhanced temporal convolution network (ATCN). Table 16 presents a summary of the reviewed less popular machine learning algorithms employed in sign language recognition.

Conclusion and future work
With the recent advancements in machine learning and computational intelligence methods, intelligent systems for sign language recognition continue to attract the attention of academic researchers and industrial practitioners. This study presents a systematic analysis of intelligent systems employed in sign language recognition related studies between 2001 and 2021. An overview of intelligent-based sign language recognition research trends is provided based on 649 full-length research articles retrieved from the Scopus database. Using the publication trends of the retrieved articles, this study shows that machine learning and intelligent technologies in sign language recognition have been proliferating over the last 12 years. The countries and academic institutions with large numbers of published articles and solid international collaborations have been identified and presented in this paper. It is expected that this study will provide an opportunity for researchers in countries with fewer collaborations to broaden their research collaborations.
As part of this work, this review gives an insightful analysis of the techniques used by various researchers over two decades at the different stages involved in vision-based SLR, including image acquisition, image segmentation, feature extraction, and the classification algorithms employed to achieve recognition accuracy. The study also established numerous shortcomings and challenges facing the vision-based approach to sign language recognition, namely the cost of implementation, the techniques, the system's accuracy, the nature of the sign, complex image backgrounds, variation in image illumination and computational time. Several devices have been used to acquire sign data, such as datagloves, Kinect, the Leap Motion controller and cameras. Although these devices have contributed to the performance and accuracy of ASLR systems, they have some shortcomings, such as the high cost and inconvenience associated with datagloves. Images acquired from a low-resolution camera also affect the recognition accuracy of the system. Therefore, there is a need for more research that fuses images from multiple devices, such as a camera, dataglove and Kinect, to produce better results without manual feature extraction. Skin colour segmentation and edge detection techniques have demonstrated robust, improved segmentation performance. Hybridization of two or more feature extraction techniques has also been shown to produce more robust recognition features.
Numerous approaches have been proposed for manually acquired sign images involving handshapes, and remarkable success has been attained. To attain benchmark performance in this context, the following points are worthy of more attention in future research:
1. Further study of non-manual signs involving the face region, including the movement of the head, eye blinking, eyebrow movement, and mouth shape.
2. The need to address the recognition of signs combining facial expression, hand gestures and body movement simultaneously, with better recognition accuracy in real time and improved performance. The researchers envisage that these challenges can be addressed using a deep learning approach with a high-configuration system to process the input data with low computational time.
3. Different studies have been carried out on words, alphabets and numbers; however, more research is needed in future on sentence recognition in sign language.
4. Most of the work on intelligent-based sign recognition systems is at the research and prototype stage. Implementation of the proposed models will find practical application in automatic sign language recognition.