
1 Introduction

Computer vision enables a computer or machine to perceive objects, limited by its processing speed. Human vision differs from computer vision in that computer vision works with frames; at 60 frames per second (fps) the perception is much smoother [1]. Computer vision can be used to detect and recognize objects of interest. Hand gestures, which are a mode of communication for the hearing impaired, can be recognized using computer vision by applying image processing and pattern recognition techniques.

In this paper we look at image processing techniques and feature extraction for gestures. ASL is the fourth most frequently used language in the United States of America. Human eyes can only see objects in the presence of adequate lighting, and uniform lighting keeps the objects to be recognized distinct. Palmistry [2] shows that there are six types of palms regardless of skin color. ASL characters in finger spelling differ from one another, and to contrast between these characters, a number of features that make each gesture unique need to be identified. Image processing under uniform light distribution and under scattered lighting on the scene produces different outputs.

Different approaches are available for classification. Euclidean distance is the simplest approach, and Pansare et al. [4] achieved a minimum of 85% recognition accuracy for 26 different classes. A database of images needs to be established before the classification process.

Gestures with a changing background require background segmentation prior to the feature extraction process, so that only the features of the gestures are extracted. Kulkarni and Lokhande [5] proposed a method for static gestures where the background was kept constant. The region of interest is the gesture, and since the background remains constant there is not much need to track the hands or perform foreground subtraction. Skin color detection is very sensitive to lighting conditions [6], thus it is better to work with a color space that is intensity normalized [7].

The paper is structured as follows. Section 2 discusses the basics of image processing and image acquisition for the implemented system. In Sect. 3, the feature extraction of the gestures is introduced. The classification methods are described in Sect. 4. Results from real-time experimentation are discussed in Sect. 5. The conclusion, future recommendations and proposed work are given in Sect. 6.

2 Image Acquisition and Processing

In a computer vision approach it is necessary to interface a visual device with the machine and acquire videos, or frames of videos, from the 3D surroundings. A webcam is used for acquiring images. A camera's sensitivity to light is determined by its ISO speed, and a camera can operate at different ISO speeds. A high ISO can give a noisy image by amplifying the image together with the noise present, thus requiring more image processing and filtering. For a real-time system it is preferable to work in an environment with adequate lighting so that the webcam can acquire images fast enough for processing. Extremely high luminance can cause occlusion problems and the presence of multiple shadows and noise. A hand gesture is a dynamic movement of static hand postures [3], and in order to acquire images from the video input to the machine, it is good practice to track the hand before taking snapshots or acquiring frames from the real-time video. Tracking and detection of the hand is excluded from the algorithm, since the system is designed so that the gestures are performed in a confined space with a static background. A video resolution of 640 × 480 was set for the webcam interfaced with the machine before grabbing images from the video. Since a 640 × 480 image has four times more data to process than a 320 × 240 image, the acquired images were scaled down to 320 × 240. The features of the gesture were envisioned to be extracted from the edge representation of the image. The scale used to convert any image with a 4:3 aspect ratio to 320 × 240 resolution is given in Eq. 1. The scale was multiplied element-wise with the matrix representation of the image size.

$$ \sqrt{\frac{320 \times 240}{\text{sum of pixels per column} \times \text{sum of pixels per row}}} $$
(1)
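A minimal Matlab sketch of this scaling step is given below; the file name and the use of imresize are illustrative assumptions rather than details taken from the implementation.

```matlab
% Hedged sketch of the down-scaling in Eq. 1 (file name is a placeholder).
img = imread('gesture.jpg');            % frame grabbed from the 640 x 480 video
[rows, cols, ~] = size(img);

% Eq. 1: square root of (target pixel count / current pixel count).
scale = sqrt((320 * 240) / (cols * rows));

% Multiply the image dimensions element-wise by the scale and resize.
small = imresize(img, scale);           % a 640 x 480 frame becomes 320 x 240
```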

The first step towards feature extraction is image processing. The images are processed to give a binary image. The color space of the input image will differ depending on the video input settings. Some color spaces used in acquiring images are YCbCr, RGB and YPbPr, which is a scaled version of the YUV color space [8, 9]. The HSV color space is used mostly for skin detection purposes. Figure 1 shows a gesture representing the character 4 in ASL finger spelling converted from RGB to HSV. The gesture was performed on a reflective white background. It can be clearly noticed that the skin-colored part is contrasted as green and blue against the background. However, the edges of the thumb are not noticeable, so that feature is lost to some extent.
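A minimal Matlab sketch of this color-space conversion is shown below; the file name is a placeholder and the display code is only for visual comparison.

```matlab
% Hedged sketch of the RGB-to-HSV conversion illustrated in Fig. 1.
rgb = imread('gesture_4.jpg');          % gesture for the ASL character 4 (assumed file)
hsv = rgb2hsv(rgb);                     % hue, saturation and value in [0, 1]

figure;
subplot(1, 2, 1); imshow(rgb); title('RGB');
subplot(1, 2, 2); imshow(hsv); title('HSV');   % skin region stands out from the white background
```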

Fig. 1. Converted from RGB to HSV (Color figure online)

The acquired images are converted to binary images for extracting features. Figure 2 shows the major processes involved in image conversion. The result of the conversion is shown in Fig. 3 where the gesture representing the number 1 in ASL finger spelling is used.

Fig. 2. Image conversion process

Fig. 3. Image processing of a gesture

Several image processing techniques are used to achieve the desired binary image. Light and illumination are factors that affect the results. One approach is to adjust the brightness and contrast and to correct the illumination.

An image can be represented as a two-dimensional function [10], (x, y) → f(x, y). The intensity of an image is proportional to the light energy entering the capturing device and to its resolution. The image can be regarded as a function of light with two components: the illumination of the light on the scene and the reflectance of the objects in the scene.

$$ f(x,y) = i(x,y)\,r(x,y) $$
(2)

where i(x, y) is the illumination from the light source, ranging from 0 to infinity, and r(x, y) is the reflectance of the object, limited to the range 0 to 1. A value of 0 indicates total absorption, whereas a value of 1 indicates total reflection. The brightness and contrast of the snapshots are adjusted in the processing stage so that the output after edge detection has less noise and more relevant data. The contrast was first adjusted manually in a photo viewer, and the resulting values were then tested on the image in Matlab. The brightness of an object determines the presence of light in the image, whereas contrast is the difference between objects and regions [11]. In a constant lighting environment, brightness does not need to be altered much. To minimize noise and to smooth the objects in the scene, illumination correction methods are applied. Figure 4 shows the result of illumination correction on the static background that is used. The image becomes smoother with uniform illumination.
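The paper does not state the exact adjustment and correction procedure, so the Matlab sketch below only illustrates one common combination: contrast stretching followed by division by a smoothed illumination estimate. The blur size and file name are assumptions.

```matlab
% Illustrative sketch only; parameter values are assumptions, not from the paper.
gray = rgb2gray(imread('gesture.jpg'));

% Brightness/contrast adjustment: stretch the intensities to the full range.
adjusted = imadjust(gray);

% Estimate the slowly varying illumination with a heavy Gaussian blur and
% divide it out, which evens out bright reflective regions of the background.
illum     = imgaussfilt(double(adjusted), 30);
corrected = mat2gray(double(adjusted) ./ (illum + eps));
```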

Fig. 4. Uniform illumination of background

Figure 5 illustrates a gesture performed on a white reflective background, with its surface plot before and after the illumination of the scene is redistributed. Illumination correction helps in removing the noise caused by highly reflective regions.

Fig. 5. Illumination correction of gesture in background

To distinguish between the background and the object in front of it, a background segmentation technique is utilized in which the background is removed from the entire image. Since the background is the same at all times, tracking and foreground segmentation are not given much priority in the system. The system is designed with a fixed video input directed at a static white background on which the gestures are performed and tested. The light acting on the background plane has uniform intensity and varies very little. Any scattering of light on the scene is normalized and redistributed in the image processing step.
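As a hedged illustration, the Matlab sketch below removes a stored image of the empty static background by absolute differencing and thresholding; the file names, threshold handling and noise-removal size are assumptions rather than the paper's exact procedure.

```matlab
% Minimal background-removal sketch, assuming one stored image of the empty scene.
background = rgb2gray(imread('background.jpg'));   % static white background, captured once
frame      = rgb2gray(imread('gesture.jpg'));      % the same scene with the gesture

% With a static background a simple absolute difference plus a global
% threshold isolates the hand region.
diffImg = imabsdiff(frame, background);
mask    = imbinarize(diffImg);                     % Otsu threshold by default
mask    = bwareaopen(mask, 50);                    % remove small noise blobs
```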

3 Feature Extraction

Feature extraction relies mostly on the success of the image processing. In feature extraction, unique features which distinguish one gesture from another are mined and represented as feature vectors. Two common geometric features are the area and perimeter of the segmented image. An integral image is used in [12], in which the sum over a rectangular area of the image gives the area. For feature extraction, three processed images were used for simplicity. The first image is the boundary, or edge, representation of the gesture; the second image is the filled version of this edge representation. Two edge detection techniques, 'Canny' edge detection and the 'Laplacian of Gaussian', were multiplied together to obtain the third image, which had a much finer boundary but was more discontinuous. The sum of pixels in the filled edge image gives the area, whereas the total number of pixels in the edge image gives the perimeter. The relationship between the perimeter and the area of a circle was used to find the value of r, which was termed the apothem of the gesture.

$$ \frac{radius}{2} = \frac{{Area_{circle} }}{{Perimeter_{circle} }} $$
(3)

The apothem is two times the area divided by the perimeter of the object and represents the radius of the inscribed circle.
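A minimal Matlab sketch of these three processed images and the resulting area, perimeter and apothem features is given below; the binary image `mask` is assumed to come from the processing stage in Sect. 2.

```matlab
% Hedged sketch of the edge-based features described above.
edgeImg   = edge(mask, 'canny');          % first image: boundary representation
filledImg = imfill(edgeImg, 'holes');     % second image: filled edge representation
logImg    = edge(mask, 'log');            % Laplacian of Gaussian edges
fineEdge  = edgeImg & logImg;             % third image: product of the two edge maps

gestureArea  = sum(filledImg(:));         % pixel count of the filled region
gesturePerim = sum(edgeImg(:));           % pixel count of the boundary
apothem      = 2 * gestureArea / gesturePerim;   % Eq. 3: radius of the inscribed circle
```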

Most of the features were extracted using the regionprops function in Matlab. The lengths of the major axis and minor axis were used as features, and the eccentricity and orientation were also used. Finger counting was used as a gesture feature in [13]. It was observed that the spacings between consecutive fingers are approximately the same as the width of the fingers at a point slightly above the base of the fingers; thus a range of pixel values is used to find the number of fingers and thumb present. The image, with its 4:3 aspect ratio, was divided into 40 equal intervals, and the length and width between the extreme edges were found in each interval. The ratio of the length to the width was calculated and termed the gradient. These 40 gradients represent the nature of the gesture and are plotted as a graph. The area under the gradient plot was calculated and used as a feature of the gesture. Figure 6 describes how the length and width of the image were extracted at different intervals. The plot of gradients for the gesture representing 'A' is shown in Fig. 7.

Fig. 6. Length and width of image at intervals

Fig. 7. Plot of ratio of length to width

To prevent the ratio from becoming infinite, the absolute values of the lengths and widths were taken and a very small value of 0.001 was added before calculating the ratio.
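The paper does not spell out the exact measurement per interval, so the Matlab sketch below shows only one plausible reading of the 40-interval gradient feature: the image is split into 40 vertical strips and, in each strip, the spans between the extreme edge pixels are taken as the length and width.

```matlab
% Illustrative reading of the gradient feature; treat as an assumption.
nIntervals = 40;
[~, cols]  = size(edgeImg);
bounds     = round(linspace(1, cols, nIntervals + 1));

gradients = zeros(1, nIntervals);
for k = 1:nIntervals
    strip  = edgeImg(:, bounds(k):bounds(k + 1));
    [r, c] = find(strip);                  % edge pixels inside this interval
    if isempty(r)
        continue;                          % no edges here; gradient stays 0
    end
    len = abs(max(r) - min(r)) + 0.001;    % vertical span between extreme edges
    wid = abs(max(c) - min(c)) + 0.001;    % horizontal span between extreme edges
    gradients(k) = len / wid;              % the "gradient" of this interval
end
plotArea = trapz(gradients);               % area under the plot (Fig. 7), used as a feature
```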

4 Classification Approach for Gestures

Classification was done based on ten features and, upon observing the results, the classification was changed and modified slightly. The first approach was to find the distance of the performed gesture's feature vector from the stored feature vectors. The stored feature vector of each class was saved as the mean of each feature over that class. There are 10 features and 36 classes.

$$ \left[ {\begin{array}{*{20}c} {\mu_{1,1} } & \cdots & {\mu_{36,1} } \\ \vdots & \ddots & \vdots \\ {\mu_{1,10} } & \cdots & {\mu_{36,10} } \\ \end{array} } \right] $$
(4)

The test features were represented as a column vector and subtracted element-wise from each of the columns in the mean feature matrix obtained. For every feature, the closest matching class was found by taking the minimum value in each row and noting its class. The Euclidean distance for the classes with the closest matches was then calculated and compared.

$$ \left[ {\begin{array}{*{20}c} {f_{1} - \mu_{1,1} } & \cdots & {f_{1} - \mu_{36,1} } \\ \vdots & \ddots & \vdots \\ {f_{10} - \mu_{1,10} } & \cdots & {f_{10} - \mu_{36,10} } \\ \end{array} } \right] $$
(5)
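A minimal Matlab sketch of this first approach follows; the mean matrix `mu` (10 features × 36 classes, as in Eq. 4) and the test vector `f` (10 × 1) are assumed to have been built beforehand from training data.

```matlab
% Hedged sketch of the mean-vector / Euclidean-distance approach (Eqs. 4 and 5).
D = abs(mu - repmat(f, 1, size(mu, 2)));   % per-feature differences, Eq. 5
[~, closest] = min(D, [], 2);              % closest class for every single feature
candidates   = unique(closest);            % classes that win at least one feature

% Full Euclidean distance is compared only for those candidate classes.
dists = sqrt(sum((mu(:, candidates) - repmat(f, 1, numel(candidates))).^2, 1));
[~, idx] = min(dists);
predictedClass = candidates(idx);
```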

The next classification approach used KNN (K-nearest neighbor), where values of K of 3, 5 and 7 were tested. A value of 7 was used to implement the system. In this classification the mean was not calculated; instead, all the feature vectors of each class were used. The classes of the 7 nearest neighbors are stored in an array and the unique classes among them are found. If two or more unique classes exist, the single nearest neighbor, with K set to 1, is used instead (Fig. 8).
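The Matlab sketch below illustrates this KNN step with the fallback to K = 1; the training matrix `X` (one 10-element feature vector per row), its label vector `labels` and the test vector `f` are assumptions.

```matlab
% Hedged sketch of the KNN step summarised in Fig. 8.
K = 7;
dists = sqrt(sum((X - repmat(f', size(X, 1), 1)).^2, 2));   % distance to every training sample
[~, order] = sort(dists);
neighbourClasses = labels(order(1:K));

% If the 7 neighbours do not all share one class, fall back to K = 1.
if numel(unique(neighbourClasses)) > 1
    predicted = labels(order(1));
else
    predicted = neighbourClasses(1);
end
```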

Fig. 8. Flowchart of KNN approach used

The third and fourth classification approaches are the same as the first and second but with reduced features. The feature representing the connectivity of the gesture, the Euler number [14], was eliminated. This was done because the variance of the Euler number between classes is very high: if 9 features of a gesture are extremely close while the connectivity feature is very far, the calculated Euclidean distance becomes very high, indicating that the gesture is not closely matched. The final design was based on conditions on the features. In this approach one of the features was given priority and used as a condition to partition the classes into subclasses. A feature with a relatively high variance was chosen to act as the decision node, thus making the accuracy of the system depend on the decision of a less important feature. The finger count plus the spacings between the fingers was the feature given priority. The value of this feature ranges from 0 to 5, so the classes were divided into 6 sections. Table 1 shows the division.

Table 1. Spacing index for gestures

Each spacing index represented a new class, and the gestures became its subclasses. Classification was again done based on Euclidean distance for this approach.
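A minimal Matlab sketch of this conditional approach is shown below; `spacingTable` (a 1 × 36 vector giving each class's spacing index from Table 1), `spacingIndex`, `mu` and `f` are assumed to be available.

```matlab
% Hedged sketch of the spacing-index condition followed by Euclidean distance.
subset = find(spacingTable == spacingIndex);               % candidate classes for this index
dists  = sqrt(sum((mu(:, subset) - repmat(f, 1, numel(subset))).^2, 1));
[~, idx] = min(dists);
predictedClass = subset(idx);
```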

5 Results and Accuracy

The classification results were checked over ten tests for each class. The characters were performed in an ordered sequence from 'a' to 'z' and '0' to '9'. Accuracy for single characters was tested first. ASL characters were then used to form words containing two, three or four characters. The results of the tests are given in Table 2.

Table 2. Results of classification (%)

Gestures were performed both in an ordered sequence and randomly. The recognition rate was calculated as the ratio of correct classifications to total tests. The total number of tests was 360: 10 tests for each of the 36 gesture classes.

$$ recognition = \frac{\text{total correct}}{\text{total tested}} \times 100\% $$
(6)

Since the system runs in real time, the algorithm was designed to show all the classification results in the Matlab workspace so that patterns could be noticed. Five classification methods, all expansions of the Euclidean distance approach, were used to test for results.

The results obtained by the system are comparable to those obtained by Kulkarni and Lokhande [5], in which a neural network was used as the classification technique. Nearest neighbor classification is a simple technique compared to other state-of-the-art techniques.

6 Conclusions

Illumination correction of the hand gesture resulted in a better and smoother binary image, which was used to extract features to compare and classify gestures. Lighting conditions affect the results: a very dark environment with little illumination resulted in a binary image with more noise and thus required much more image processing. A large number of features yielded good results, but the inclusion of a redundant feature affected the results greatly. Eliminating the redundant feature improved the classification accuracy. On the other hand, giving priority to a single feature of less importance also reduced the classification accuracy. The Euclidean distance approach, nearest neighbor and Kth nearest neighbor are simple to implement and can be extended with conditional algorithms to obtain good results. The system is very cheap to implement and can easily be extended to the recognition of other objects or sign languages.

Classification using two camera inputs is a possible future study. Hybrid classification with feature weighting is also an area of exploration to obtain further improved results. The current research benefits communication with hearing-impaired people by making it easier through simple, readily available technology. The algorithm can also be used in human-machine interfaces for controlling devices and in other forms of non-verbal communication.