
1 Introduction

Computer vision enables a computer or machine to perceive objects, limited by its processing speed. Human vision differs from computer vision in that computer vision works with frames; at 60 frames per second (fps) the perception is much smoother [1]. Computer vision can be used to detect and recognize objects of interest. Hand gestures, which are a mode of communication for the hearing impaired, can be recognized using computer vision by applying image processing and pattern recognition techniques.

In this paper we look at image processing techniques and feature extraction for gestures. ASL is the fourth most frequently used language in the United States of America. Human eyes can only see objects in the presence of adequate lighting, and uniform lighting keeps the objects to be recognized distinct. Palmistry [2] shows that there are six types of palms regardless of skin color. ASL characters in finger spelling differ from one another, and to contrast between these characters, a number of features that make each gesture unique need to be identified. Image processing under uniform light distribution and under scattered lighting on the scene produces different outputs.

Different approaches are available for classification. Euclidean distance is the simplest approach, and Pansare et al. [4] achieved a minimum of 85% recognition accuracy for 26 different classes. A database of images needs to be established before the classification process.

Gestures with a changing background require background segmentation prior to the feature extraction process, so that only the features of the gestures are extracted. Kulkarni and Lokhande [5] proposed a method for static gestures where the background was kept constant. The region of interest is the gesture, and since the background remains constant there is not much need to track the hands or perform foreground subtraction. Skin color detection is very sensitive to lighting conditions [6], thus it is better to work with a color space that is intensity normalized [7].

The paper is structured as follows. Section 2 discusses the basics of image processing and image acquisition for the implemented system. In Sect. 3, the feature extraction of the gestures is introduced. The classification methods are described in Sect. 4. Results from real-time experimentation are discussed in Sect. 5. The conclusion, future recommendations and proposed work are given in Sect. 6.

2 Image Acquisition and Processing

In a computer vision approach it is necessary to interface a visual device with the machine and acquire videos, or frames of videos, from the 3D surroundings. A webcam is used for acquiring images. A camera's sensitivity to light is determined by its ISO speed, and a camera can operate at different ISO speeds. A high ISO can give a noisy image by amplifying the image together with the noise present, thus requiring more image processing and filtering. For a real-time system it is preferable to work in an environment with adequate lighting so that the webcam can acquire images fast enough for processing. Extremely high luminance can cause occlusion problems and the presence of multiple shadows and noise. A hand gesture is a dynamic movement of static hand postures [3], and in order to acquire images from the video input to the machine, it is good practice to track the hand before taking snapshots or acquiring frames from the real-time video. Tracking and detection of the hand is excluded from the algorithm, since the system is designed so that the gestures are performed in a confined space with a static background. A video resolution of 640 × 480 was set for the webcam interfaced with the machine before grabbing images from the video. Since a 640 × 480 image has four times more data to process than a 320 × 240 image, the acquired images were scaled down to 320 × 240. The features of the gesture were envisioned to be extracted from the edge representation of the image. The scale used to convert any image with a 4:3 aspect ratio to 320 × 240 resolution is given in Eq. 1. The scale was multiplied element-wise with the matrix representation of the image size.

$$ \sqrt{\frac{320 \times 240}{\text{sum of pixels per column} \times \text{sum of pixels per row}}} $$
(1)
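A minimal Matlab sketch of this scaling step is given below; the file name and the use of imresize are illustrative assumptions rather than details taken from the implementation.

```matlab
% Hedged sketch of the down-scaling in Eq. 1 (file name is a placeholder).
img = imread('gesture.jpg');            % frame grabbed from the 640 x 480 video
[rows, cols, ~] = size(img);

% Eq. 1: square root of (target pixel count / current pixel count).
scale = sqrt((320 * 240) / (cols * rows));

% Multiply the image dimensions element-wise by the scale and resize.
small = imresize(img, scale);           % a 640 x 480 frame becomes 320 x 240
```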

The first step towards feature extraction is image processing. The images are processed to give a binary image. The color space of the input image will differ depending on the video input settings. Some color spaces used in acquiring images are YCbCr, RGB and YPbPr, which is a scaled version of the YUV color space [8, 9]. The HSV color space is used mostly for skin detection purposes. Figure 1 shows a gesture representing the character 4 in ASL finger spelling converted from RGB to HSV. The gesture was performed on a reflective white background. It can be clearly noticed that the skin-colored part is contrasted as green and blue against the background. However, the edges of the thumb are not noticeable, so that feature is lost to some extent.
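A minimal Matlab sketch of this color-space conversion is shown below; the file name is a placeholder and the display code is only for visual comparison.

```matlab
% Hedged sketch of the RGB-to-HSV conversion illustrated in Fig. 1.
rgb = imread('gesture_4.jpg');          % gesture for the ASL character 4 (assumed file)
hsv = rgb2hsv(rgb);                     % hue, saturation and value in [0, 1]

figure;
subplot(1, 2, 1); imshow(rgb); title('RGB');
subplot(1, 2, 2); imshow(hsv); title('HSV');   % skin region stands out from the white background
```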

Fig. 1. Converted from RGB to HSV (Color figure online)

The acquired images are converted to binary images for extracting features. Figure 2 shows the major processes involved in image conversion. The result of the conversion is shown in Fig. 3 where the gesture representing the number 1 in ASL finger spelling is used.

Fig. 2. Image conversion process

Fig. 3. Image processing of a gesture

Several image processing techniques are used to achieve the desired binary image. Light and illumination are factors that affect the results. One approach is to adjust the brightness and contrast and to correct the illumination.

An image can be represented as a two-dimensional function [10], (x, y) → f(x, y). The intensity of an image is proportional to the light energy entering the capturing device and to its resolution. The image can be regarded as a function of light with two components: the illumination of the light on the scene and the reflectance of the objects in the scene.

$$ f(x,y) = i(x,y)\,r(x,y) $$
(2)

where i(x, y) is the illumination from the light source, ranging from 0 to infinity, and r(x, y) is the reflectance of the object, limited to the range 0 to 1. A value of 0 indicates total absorption, whereas a value of 1 indicates total reflection. The brightness and contrast of the snapshots are adjusted in the processing stage so that the output after edge detection has less noise and more relevant data. The contrast was first adjusted manually in a photo viewer, and the resulting values were then tested on the image in Matlab. The brightness of an object determines the presence of light in the image, whereas contrast is the difference between objects and regions [11]. In a constant lighting environment, brightness does not need to be altered much. To minimize noise and to smooth the objects in the scene, illumination correction methods are applied. Figure 4 shows the result of illumination correction on the static background that is used. The image becomes smoother with uniform illumination.
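The paper does not state the exact adjustment and correction procedure, so the Matlab sketch below only illustrates one common combination: contrast stretching followed by division by a smoothed illumination estimate. The blur size and file name are assumptions.

```matlab
% Illustrative sketch only; parameter values are assumptions, not from the paper.
gray = rgb2gray(imread('gesture.jpg'));

% Brightness/contrast adjustment: stretch the intensities to the full range.
adjusted = imadjust(gray);

% Estimate the slowly varying illumination with a heavy Gaussian blur and
% divide it out, which evens out bright reflective regions of the background.
illum     = imgaussfilt(double(adjusted), 30);
corrected = mat2gray(double(adjusted) ./ (illum + eps));
```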

Fig. 4. Uniform illumination of background

Figure 5 illustrates a gesture performed on a white reflective background, with its surface plot before and after the illumination of the scene is redistributed. Illumination correction helps in removing the noise caused by highly reflective regions.

Fig. 5. Illumination correction of gesture in background

To distinguish between the background and the object in front of it, a background segmentation technique is utilized in which the background is removed from the entire image. Since the background is the same at all times, tracking and foreground segmentation are not given much priority in the system. The system is designed with a fixed video input directed at a static white background on which the gestures are performed and tested. The light acting on the background plane has uniform intensity and varies very little. Any scattering of light on the scene is normalized and redistributed in the image processing step.
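As a hedged illustration, the Matlab sketch below removes a stored image of the empty static background by absolute differencing and thresholding; the file names, threshold handling and noise-removal size are assumptions rather than the paper's exact procedure.

```matlab
% Minimal background-removal sketch, assuming one stored image of the empty scene.
background = rgb2gray(imread('background.jpg'));   % static white background, captured once
frame      = rgb2gray(imread('gesture.jpg'));      % the same scene with the gesture

% With a static background a simple absolute difference plus a global
% threshold isolates the hand region.
diffImg = imabsdiff(frame, background);
mask    = imbinarize(diffImg);                     % Otsu threshold by default
mask    = bwareaopen(mask, 50);                    % remove small noise blobs
```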

3 Feature Extraction

Feature extraction relies mostly on the success of the image processing. In feature extraction, unique features which distinguish one gesture from another are mined and represented as feature vectors. Two common geometric features are the area and perimeter of the segmented image. An integral image is used in [12], in which the sum over a rectangular area of the image gives the area. For feature extraction, three processed images were used for simplicity. The first image is the boundary, or edge, representation of the gesture; the second image is the filled version of this edge representation. Two edge detection techniques, 'Canny' edge detection and the 'Laplacian of Gaussian', were multiplied together to obtain the third image, which had a much finer boundary but was more discontinuous. The sum of pixels in the filled edge image gives the area, whereas the total number of pixels in the edge image gives the perimeter. The relationship between the perimeter and the area of a circle was used to find the value of r, which was termed the apothem of the gesture.

$$ \frac{radius}{2} = \frac{{Area_{circle} }}{{Perimeter_{circle} }} $$
(3)

The apothem is two times the area divided by the perimeter of the object and represents the radius of the inscribed circle.
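A minimal Matlab sketch of these three processed images and the resulting area, perimeter and apothem features is given below; the binary image `mask` is assumed to come from the processing stage in Sect. 2.

```matlab
% Hedged sketch of the edge-based features described above.
edgeImg   = edge(mask, 'canny');          % first image: boundary representation
filledImg = imfill(edgeImg, 'holes');     % second image: filled edge representation
logImg    = edge(mask, 'log');            % Laplacian of Gaussian edges
fineEdge  = edgeImg & logImg;             % third image: product of the two edge maps

gestureArea  = sum(filledImg(:));         % pixel count of the filled region
gesturePerim = sum(edgeImg(:));           % pixel count of the boundary
apothem      = 2 * gestureArea / gesturePerim;   % Eq. 3: radius of the inscribed circle
```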

Most of the features were extracted using the regionprops function in Matlab. The lengths of the major axis and minor axis were used as features, and the eccentricity and orientation were also used. Finger counting was used as a gesture feature in [13]. It was observed that the spacings between consecutive fingers are approximately the same as the width of the fingers at a point slightly above the base of the fingers; thus a range of pixel values is used to find the number of fingers and thumb present. The image, with its 4:3 aspect ratio, was divided into 40 equal intervals, and the length and width between the extreme edges were found in each interval. The ratio of the length to the width was calculated and termed the gradient. These 40 gradients represent the nature of the gesture and are plotted as a graph. The area under the gradient plot was calculated and used as a feature of the gesture. Figure 6 describes how the length and width of the image were extracted at different intervals. The plot of gradients for the gesture representing 'A' is shown in Fig. 7.

Fig. 6. Length and width of image at intervals

Fig. 7. Plot of ratio of length to width

To prevent the ratio from becoming infinite, the absolute values of the lengths and widths were taken and a very small value of 0.001 was added before calculating the ratio.
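The paper does not spell out the exact measurement per interval, so the Matlab sketch below shows only one plausible reading of the 40-interval gradient feature: the image is split into 40 vertical strips and, in each strip, the spans between the extreme edge pixels are taken as the length and width.

```matlab
% Illustrative reading of the gradient feature; treat as an assumption.
nIntervals = 40;
[~, cols]  = size(edgeImg);
bounds     = round(linspace(1, cols, nIntervals + 1));

gradients = zeros(1, nIntervals);
for k = 1:nIntervals
    strip  = edgeImg(:, bounds(k):bounds(k + 1));
    [r, c] = find(strip);                  % edge pixels inside this interval
    if isempty(r)
        continue;                          % no edges here; gradient stays 0
    end
    len = abs(max(r) - min(r)) + 0.001;    % vertical span between extreme edges
    wid = abs(max(c) - min(c)) + 0.001;    % horizontal span between extreme edges
    gradients(k) = len / wid;              % the "gradient" of this interval
end
plotArea = trapz(gradients);               % area under the plot (Fig. 7), used as a feature
```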

4 Classification Approach for Gestures

Classification was done based on ten features and, upon observing the results, the classification was changed and modified slightly. The first approach was to find the distance of the performed gesture's feature vector from the stored feature vectors. The stored feature vector of each class was saved as the mean of each feature over that class. There are 10 features and 36 classes.

$$ \left[ {\begin{array}{*{20}c} {\mu_{1,1} } & \cdots & {\mu_{36,1} } \\ \vdots & \ddots & \vdots \\ {\mu_{1,10} } & \cdots & {\mu_{36,10} } \\ \end{array} } \right] $$
(4)

The test features were represented as a column vector and subtracted element-wise from each of the columns in the mean feature matrix obtained. For every feature, the closest matching class was found by taking the minimum value in each row and noting its class. The Euclidean distance for the classes with the closest matches was then calculated and compared.

$$ \left[ {\begin{array}{*{20}c} {f_{1} - \mu_{1,1} } & \cdots & {f_{1} - \mu_{36,1} } \\ \vdots & \ddots & \vdots \\ {f_{10} - \mu_{1,10} } & \cdots & {f_{10} - \mu_{36,10} } \\ \end{array} } \right] $$
(5)
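A minimal Matlab sketch of this first approach follows; the mean matrix `mu` (10 features × 36 classes, as in Eq. 4) and the test vector `f` (10 × 1) are assumed to have been built beforehand from training data.

```matlab
% Hedged sketch of the mean-vector / Euclidean-distance approach (Eqs. 4 and 5).
D = abs(mu - repmat(f, 1, size(mu, 2)));   % per-feature differences, Eq. 5
[~, closest] = min(D, [], 2);              % closest class for every single feature
candidates   = unique(closest);            % classes that win at least one feature

% Full Euclidean distance is compared only for those candidate classes.
dists = sqrt(sum((mu(:, candidates) - repmat(f, 1, numel(candidates))).^2, 1));
[~, idx] = min(dists);
predictedClass = candidates(idx);
```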

The next classification approach used KNN (K-nearest neighbor), where values of K of 3, 5 and 7 were tested. A value of 7 was used to implement the system. In this classification the mean was not calculated; instead, all the feature vectors of each class were used. The classes of the 7 nearest neighbors are stored in an array and the unique classes among them are found. If two or more unique classes exist, the single nearest neighbor, with K set to 1, is used instead (Fig. 8).
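The Matlab sketch below illustrates this KNN step with the fallback to K = 1; the training matrix `X` (one 10-element feature vector per row), its label vector `labels` and the test vector `f` are assumptions.

```matlab
% Hedged sketch of the KNN step summarised in Fig. 8.
K = 7;
dists = sqrt(sum((X - repmat(f', size(X, 1), 1)).^2, 2));   % distance to every training sample
[~, order] = sort(dists);
neighbourClasses = labels(order(1:K));

% If the 7 neighbours do not all share one class, fall back to K = 1.
if numel(unique(neighbourClasses)) > 1
    predicted = labels(order(1));
else
    predicted = neighbourClasses(1);
end
```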

Fig. 8. Flowchart of KNN approach used

The third and fourth classification approaches are the same as the first and second but with reduced features. The feature representing the connectivity of the gesture, the Euler number [14], was eliminated. This was done because the variance of the Euler number between classes is very high: if 9 features of a gesture are extremely close while the connectivity feature is very far, the calculated Euclidean distance becomes very high, indicating that the gesture is not closely matched. The final design was based on conditions on the features. In this approach one of the features was given priority and used as a condition to partition the classes into subclasses. A feature with a relatively high variance was chosen to act as the decision node, thus making the accuracy of the system depend on the decision of a less important feature. The finger count plus the spacings between the fingers was the feature given priority. The value of this feature ranges from 0 to 5, so the classes were divided into 6 sections. Table 1 shows the division.

Table 1. Spacing index for gestures

Each spacing index represented a new class, and the gestures became its subclasses. Classification was again done based on Euclidean distance for this approach.
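A minimal Matlab sketch of this conditional approach is shown below; `spacingTable` (a 1 × 36 vector giving each class's spacing index from Table 1), `spacingIndex`, `mu` and `f` are assumed to be available.

```matlab
% Hedged sketch of the spacing-index condition followed by Euclidean distance.
subset = find(spacingTable == spacingIndex);               % candidate classes for this index
dists  = sqrt(sum((mu(:, subset) - repmat(f, 1, numel(subset))).^2, 1));
[~, idx] = min(dists);
predictedClass = subset(idx);
```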

5 Results and Accuracy

The classification results were checked over ten tests for each class. The characters were performed in an ordered sequence from 'a' to 'z' and '0' to '9'. Accuracy for single characters was tested first. ASL characters were then used to form words containing two, three or four characters. The results of the tests are given in Table 2.

Table 2. Results of classification (%)

Gestures were performed both in an ordered sequence and randomly. The recognition rate was calculated as the ratio of correct classifications to total tests. The total number of tests was 360: 10 tests for each of the 36 gesture classes.

$$ recognition = \frac{\text{total correct}}{\text{total tested}} \times 100\% $$
(6)

Since the system runs in real time, the algorithm was designed to show all the classification results in the Matlab workspace so that patterns could be noticed. Five classification methods, all expansions of the Euclidean distance approach, were used to test for results.

The results obtained by the system are comparable to those obtained by Kulkarni and Lokhande [5], in which a neural network was used as the classification technique. Nearest neighbor classification is a simple technique compared to other state-of-the-art techniques.

6 Conclusions

Illumination correction of the hand gesture resulted in a better and smoother binary image, which was used to extract features to compare and classify gestures. Lighting conditions affect the results: a very dark environment with little illumination resulted in a binary image with more noise and thus required much more image processing. A large number of features yielded good results, but the inclusion of a redundant feature affected the results greatly. Eliminating the redundant feature improved the classification accuracy. On the other hand, giving priority to a single feature of less importance also reduced the classification accuracy. The Euclidean distance approach, nearest neighbor and Kth nearest neighbor are simple to implement and can be extended with conditional algorithms to obtain good results. The system is very cheap to implement and can easily be extended to the recognition of other objects or sign languages.

Classification using two camera inputs is a possible future study. Hybrid classification with feature weighting is also an area of exploration to obtain further improved results. The current research benefits communication with hearing-impaired people by making it easier through simple, readily available technology. The algorithm can also be used in human-machine interfaces for controlling devices and in other forms of non-verbal communication.