Hand Gesture Recognition System Based on a Geometric Model and Rule Based Classifier

This work was carried out in collaboration between all authors. Author SMS designed the study, performed the statistical analysis, and wrote the protocol.


INTRODUCTION
Gestures are a powerful means of communication among humans. Indeed, gesturing is so deeply rooted in human communication that people often continue gesturing when speaking with each other in daily life [1,2].
The strong role of grammar and context makes sign language (SL) robust enough to fulfill the needs of deaf people in their day-to-day life. SL is the main gestural communication used in the deaf community, in which postures and gestures have congruent meanings within an adequate grammar. Like any other verbal language, its discourse consists of well-structured reception and rendering of non-verbal signals according to the context rules of a complex grammar. Postures are the basic units of an SL; when collected along a time axis and arranged according to the grammar rules, they convey a specific meaning [3].
This paper's contribution is the recognition of Arabic SL static gestures using a vision-based geometric model integrated with a rule-based classifier. The paper is organized as follows. Section two introduces previous research on hand gesture and SL recognition. Section three describes the steps of the hand gesture algorithm, including the feature extraction and classification phases. The experimental results are discussed in section four. Finally, the conclusion and future work are presented.

Device-Based Approaches
Data arriving from multiple sensor streams at a processing unit form the basis for all detection and recognition. Sensors or trackers capture all information related to the signing articulators. Signers wear sensors that may include displacement sensors or positional trackers. When a signer gestures, the articulators' data are sampled at a specific rate and fed to the recognition stage.
A data glove [15] with multiple electronic sensors (installed on the finger joints, palm, and wrist) is shown in Fig. 1; these sensors deliver their measurements in real time to a processing unit [16] and [17]. The processing unit compares the captured static sign samples with saved templates and generates the output. Many researchers [12][13][14][15][16] have used this technique in hand gesture recognition tasks. Unlike vision-based methods, these techniques are robust and efficient for a minimal vocabulary set [20][21][22][23][24][25]. On the other hand, they strongly limit user independence because of the dense mesh of installed sensors [26] and [27].

Vision-Based Approaches
Vision-based SL recognition techniques need hand detection and tracking algorithms to extract hand shapes and location. Color, edge information, or motion is generally used to detect hands from input data. Vision-based approaches have limitations because of imaging constraints and conditions such as illumination, background, clothing, and so on [6].
Yang and Sarkar proposed an ASL spotting method based on conditional random fields (CRFs) [7]. Their system used motion information as features and the Kanade-Lucas-Tomasi method to track the motion of salient corner points. Yang et al. [8] proposed an ASL recognition method based on an enhanced Level Building algorithm, while Nayak et al. proposed an ASL recognition method based on a continuous state-space model [9] using an unsupervised approach. Wang et al. [10] achieved an accuracy of about 94% using a recognition algorithm based on DTW/ISODATA.
Researchers in [11][12][13][14] used Hidden Markov Models (HMMs) to recognize signs in different sign languages, achieving accuracies ranging from 80% to 95%.

Hand localization
The description and implementation of image processing algorithms require a suitable mathematical representation of the various types of images. In gesture recognition, color is the most frequently used feature for hand localization, since the shape and size of the hand's projection onto the two-dimensional image plane vary greatly. The colors of the object can be represented by a three-dimensional discrete histogram h_object(r, g, b), whose dimensions correspond to the red, green, and blue components. The total sum of h_object over all colors is therefore equal to the number of considered object pixels:

n_object = Σ_{r,g,b} h_object(r, g, b)    (1)

Given a pixel from the object, the probability of it having a certain color (r, g, b) can be computed from h_object as

P(r, g, b | object) = h_object(r, g, b) / n_object    (2)

By creating a complementary histogram h_bg of the background colors, the corresponding probability for the background is obtained in the same way:

P(r, g, b | bg) = h_bg(r, g, b) / n_bg    (3)

Applying Bayes' rule, the probability of any pixel representing a part of the object can be computed from its color (r, g, b) using equations (2) and (3):

P(object | r, g, b) = P(r, g, b | object) P(object) / [P(r, g, b | object) P(object) + P(r, g, b | bg) P(bg)]    (4)

Here P(object) and P(bg) denote the a priori object and background probabilities, respectively, with P(object) + P(bg) = 1. The object probability image I_obj,prob is created from I as

I_obj,prob(x, y) = P(object | I(x, y))    (5)

A data structure suitable for representing this classification is a binary mask, obtained by thresholding I_obj,prob at a threshold Ɵ:

I_obj,mask(x, y) = 1 if I_obj,prob(x, y) ≥ Ɵ, and 0 otherwise    (6)

As presented in [28], the object probability threshold Ɵ can be computed automatically, without the use of high-level knowledge, as follows:
1. Arbitrarily define a set of background pixels (pixels with a high a priori background probability are usually chosen, e.g. the four corners of the image). All other pixels are defined as foreground. This constitutes an initial classification.
2. Compute the mean values µ_object and µ_bg for foreground and background, based on the most recent classification. If the mean values are identical to those computed in the previous iteration, halt.
3. Compute a new threshold Ɵ = ½(µ_object + µ_bg), perform another classification of all pixels, then go to step 2.
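The localization scheme above can be sketched in Python. The 8-bins-per-channel histogram quantization, the function names, and the NumPy formulation are implementation choices made for this sketch, not details from the paper; only the Bayes-rule probability and the iterative threshold selection of [28] follow the text.

```python
import numpy as np

def object_probability(image, h_object, h_bg, p_object=0.5):
    """Per-pixel object probability via Bayes' rule (equation (4)).

    image is an HxWx3 uint8 array.  h_object and h_bg are color
    histograms with a reduced number of bins per channel -- the
    quantization is an implementation choice, not prescribed here."""
    bins = h_object.shape[0]
    q = (image.astype(np.int64) * bins) // 256            # quantize channels
    p_c_obj = h_object[q[..., 0], q[..., 1], q[..., 2]] / max(h_object.sum(), 1)
    p_c_bg = h_bg[q[..., 0], q[..., 1], q[..., 2]] / max(h_bg.sum(), 1)
    num = p_c_obj * p_object                              # P(c|obj) P(obj)
    den = num + p_c_bg * (1.0 - p_object)
    return np.where(den > 0, num / np.maximum(den, 1e-12), 0.0)

def auto_threshold(prob):
    """Automatic threshold selection as in [28]: seed the background with
    the four image corners, then alternate computing the class means and
    reclassifying until the means stop changing."""
    mask = np.ones(prob.shape, dtype=bool)                # initial foreground
    for cy, cx in ((0, 0), (0, -1), (-1, 0), (-1, -1)):
        mask[cy, cx] = False                              # corners = background
    prev = None
    while True:
        mu_obj = prob[mask].mean() if mask.any() else 0.0
        mu_bg = prob[~mask].mean() if (~mask).any() else 0.0
        if (mu_obj, mu_bg) == prev:                       # step 2: halt
            break
        prev = (mu_obj, mu_bg)
        mask = prob >= 0.5 * (mu_obj + mu_bg)             # step 3: reclassify
    return 0.5 * (prev[0] + prev[1]), mask
```

On a probability image with a clearly bimodal distribution, the threshold settles between the two class means within a few iterations.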

Hand Region Description
The algorithm for finding the border points [29,30] of all regions in an image, shown in Fig. 2, is as follows.
1. Create a helper matrix m with the same dimensions as I_obj,mask and initialize all entries to 0. Define (x, y) as the current coordinates and initialize them to (0, 0). Define (x', y') and (x'', y'') as temporary coordinates.
2. Iterate from left to right through all image rows successively, starting at y = 0.
3. Create a list B of border points and store (x, y) as its first element.
4. Set (x', y') = (x, y).
5. Scan the 8-neighborhood of (x', y') using (x'', y''), starting at the pixel that follows the last pixel stored in B in counterclockwise orientation, or at (x'−1, y'−1) if B contains only one pixel. Proceed counterclockwise, skipping coordinates that lie outside the image, until I_obj,mask(x'', y'') = 1. If (x'', y'') is identical to the first element of B, go to step 6; else store (x'', y'') in B, set (x', y') = (x'', y''), and repeat step 5.
6. Iterate through B, considering, for every element, its predecessor and successor; the successor of the last element is the first, and the predecessor of the first element is the last. If y_{i−1} = y_{i+1} ≠ y_i, set m(x_i, y_i) = 1 to indicate that the border touches the line y = y_i at x_i. Otherwise, if y_{i−1} ≠ y_i ∨ y_i ≠ y_{i+1}, set m(x_i, y_i) = 2 to indicate that the border intersects the line y = y_i at x_i.
7. Add B to the list of computed borders and proceed with step 2.
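The border-following core (steps 2-5) can be sketched as follows. This minimal tracer handles a single region only; the helper matrix m and the multi-region bookkeeping of steps 6-7 are omitted, and the neighbor ordering is one plausible reading of the counterclockwise scan.

```python
import numpy as np

# 8-neighborhood offsets of (x', y') in a fixed cyclic order, starting at
# (x'-1, y'-1) as in step 5 of the algorithm.
NEIGHBORS = [(-1, -1), (0, -1), (1, -1), (1, 0),
             (1, 1), (0, 1), (-1, 1), (-1, 0)]

def trace_border(mask):
    """Trace the border of the first region found in a binary mask,
    following steps 2-5 above; returns the list B of (x, y) points."""
    h, w = mask.shape
    start = None
    for y in range(h):                     # step 2: scan rows left to right
        for x in range(w):
            if mask[y, x]:
                start = (x, y)
                break
        if start is not None:
            break
    if start is None:
        return []
    border = [start]                       # step 3: list B of border points
    cur, back = start, 7                   # first scan starts at (x-1, y-1)
    while True:
        x, y = cur
        found = None
        for k in range(8):                 # step 5: scan the 8-neighborhood,
            d = (back + 1 + k) % 8         # starting after the arrival pixel
            nx, ny = x + NEIGHBORS[d][0], y + NEIGHBORS[d][1]
            if 0 <= nx < w and 0 <= ny < h and mask[ny, nx]:
                found = (nx, ny)
                back = (d + 4) % 8         # direction pointing back to cur
                break
        if found is None or found == start:
            return border                  # loop closed (or isolated pixel)
        border.append(found)
        cur = found
```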

The Geometric Features
The main geometric features [30] that need to be extracted are the following.

 Border Length
The border length l is computed from the border point list B by summing the distances between successive border points: the distance between two neighboring points is 1 if either their x or y coordinates are equal, and √2 otherwise. l depends on scale/resolution, and is translation and rotation invariant.

 Orientation
The region's main axis is defined as the axis of least moment of inertia. Its orientation α is given by

α = ½ arctan( 2µ1,1 / (µ2,0 − µ0,2) )

 Compactness
A shape's compactness c ∈ [0, 1] is defined as

c = 4πa / l²

Compact shapes (c → 1) have short borders l that contain a large area a. The most compact shape is a circle (c = 1), while for elongated or frayed shapes, c → 0. Compactness is rotation, translation, and scale/resolution invariant.
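The moment-based features can be sketched directly from their definitions; the function names are illustrative, and arctan2 is used to avoid a division by zero when µ2,0 = µ0,2.

```python
import numpy as np

def central_moment(mask, p, q):
    """Central moment mu_{p,q} of a binary region."""
    ys, xs = np.nonzero(mask)
    xc, yc = xs.mean(), ys.mean()                  # center of gravity
    return np.sum((xs - xc) ** p * (ys - yc) ** q)

def orientation(mask):
    """Main-axis orientation alpha = 1/2 arctan(2 mu11 / (mu20 - mu02)),
    in radians."""
    mu11 = central_moment(mask, 1, 1)
    mu20 = central_moment(mask, 2, 0)
    mu02 = central_moment(mask, 0, 2)
    return 0.5 * np.arctan2(2.0 * mu11, mu20 - mu02)

def compactness(area, border_length):
    """c = 4 pi a / l**2, in [0, 1]; exactly 1 for an ideal circle."""
    return 4.0 * np.pi * area / border_length ** 2
```

For a region whose pixels lie on the main diagonal, the orientation comes out as π/4, and an ideal circle (a = πr², l = 2πr) has compactness exactly 1.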

 Border Features
In addition to the border length l, the minimum and maximum pixel coordinates of the object, x_min, x_max, y_min, and y_max, as well as the minimum and maximum distances from the center of gravity to the border, r_min and r_max, can be calculated from B. The minimum and maximum coordinates are not invariant to any transformation; r_min and r_max are invariant to translation and rotation, but variant to scale/resolution.
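All of these border features can be computed in one pass over the point list B, given the center of gravity; the helper below is a sketch with illustrative names.

```python
import math

def border_features(border, cog):
    """Bounding box, border length l, and r_min/r_max from the border
    point list B and the center of gravity cog = (x_cog, y_cog)."""
    xs = [p[0] for p in border]
    ys = [p[1] for p in border]
    bbox = (min(xs), max(xs), min(ys), max(ys))    # x_min, x_max, y_min, y_max
    # Border length: a step between successive points counts 1 if they
    # share an x or y coordinate, and sqrt(2) for a diagonal step.
    length = 0.0
    for (x0, y0), (x1, y1) in zip(border, border[1:] + border[:1]):
        length += 1.0 if (x0 == x1 or y0 == y1) else math.sqrt(2.0)
    dists = [math.hypot(x - cog[0], y - cog[1]) for x, y in border]
    return bbox, length, min(dists), max(dists)
```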

 Normalization
To be used in a real-life application, the above features must be normalized to eliminate translation and scale variance. A resolution- and translation-invariant feature can be obtained, for example, by expressing a coordinate relative to the center of gravity and dividing by a scale-dependent quantity.

 Derivatives
For features computed on dynamic gestures, invariance to a constant offset may also be achieved by differentiation. Computing the derivative f'(t) of a feature f(t) and using it as an additional element of the feature vector, to emphasize changes in f(t), can be a simple yet effective method to improve classification performance.
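Both ideas can be sketched in a few lines. The choice of √a as the scale normalizer is an assumption made for this sketch, not the paper's exact formula, and the discrete difference stands in for f'(t).

```python
def normalize_position(x, y, cog, area):
    """Translation- and scale-invariant position: subtract the center of
    gravity, then divide by sqrt(area).  Dividing by sqrt(a) is one
    common scale normalizer and is assumed here."""
    s = area ** 0.5
    return (x - cog[0]) / s, (y - cog[1]) / s

def derivative(series):
    """Discrete first derivative f'(t) ~ f(t) - f(t-1), appended to the
    feature vector to emphasize changes over time."""
    return [b - a for a, b in zip(series, series[1:])]
```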

Rule-based Classification
A simple heuristic approach to classification is a set of explicit IF-THEN rules that refer to the target's features and require them to lie within a certain range that is typical of a specific gesture. Research on the automatic learning of such rules is presented in [29]. Rule-based classification is also commonly used as a preliminary step in dynamic gesture classification algorithms.

IF a < ƟaN2 THEN the observed gesture is the stop gesture.
IF a >= ƟaN2 THEN the object is the hand.
The threshold specified for c was determined experimentally from a data set containing multiple productions of the gesture and of all other gestures, performed by different users. The steps of the hand gesture algorithm are represented in Fig. 2.
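A rule set of this form amounts to range checks over the extracted features, which can be sketched generically as below. The feature names and threshold values are illustrative placeholders, not the experimentally calibrated values of the paper.

```python
def classify(features, rules):
    """Evaluate IF-THEN rules against a feature dictionary.  Each rule is
    (label, {feature_name: (low, high)}); the first rule whose ranges all
    hold determines the gesture."""
    for label, ranges in rules:
        if all(lo <= features[name] <= hi for name, (lo, hi) in ranges.items()):
            return label
    return "unknown"

# Hypothetical rule set: e.g. a "stop" gesture might be compact with a
# large area.  These numbers are placeholders for the learned thresholds.
RULES = [
    ("stop",  {"compactness": (0.6, 1.0), "area": (3000.0, 1e9)}),
    ("right", {"orientation": (-0.3, 0.3), "compactness": (0.0, 0.6)}),
]
```

Falling through to "unknown" when no rule fires corresponds to the "No gesture" state of the user interface.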

EXPERIMENTS AND RESULTS
The implementation of this hand gesture recognition algorithm uses the AForge.Net [33] framework to capture frames from a webcam in real time. The captured frames are processed to detect the hand, and then the features are extracted. Using these features, the gestures are identified as described in the previous section. The program was written in C# with .NET Framework 4. Fig. 4, Fig. 5, and Fig. 6 show the user interface for the "No gesture", "Right", and "Left" states, respectively. The program also includes other gestures from the Arabic alphabet, such as Alef, Baa, Taa, and Thaa.

Advantages of the Proposed Model
The algorithm works in real time. Real-world capturing conditions differ greatly from laboratory recording conditions: they differ in frame content, lighting, backgrounds, and user independence.
1. Neutrality: The signer gestures without wearing data gloves, colored gloves, or other types of sensors or markers.
2. Self-adaptation to changing external conditions such as lighting, backgrounds, capturing setup, and image content.
3. Camera: Camera hardware and/or parameters may change from take to take, or even during a single take in the case of automatic dynamic adaptation.
4. Signer independence: The system can be similar to robust automatic speech recognition systems.
5. Performance: The algorithm works in real time, taking about 1.38 seconds to recognize each gesture.
The most successful results were obtained for the stop gesture, since the hand is in its basic shape. Table 2 presents a comparative study between the proposed system and ArSLAT [34]; our recognition rate is higher. Their results were taken from their publication and compared with ours.

Drawbacks of the Proposed Model
The proposed model may show some drawbacks as the vocabulary grows, since the geometric features of one gesture will begin to overlap with those of another. Fortunately, ArSL has only about 40 static gestures that differ in shape.
Consequently, the proposed model may produce false positives or incorrect hand gesture identifications when two or more hand gestures are similar in appearance.

CONCLUSION AND FUTURE WORK
A vision-based static ArSL recognition system has been developed. Color is used for hand localization, since the size and shape of the hand vary dramatically. A hand region description algorithm is then used to find the border points of all regions in an image. Thereafter, a set of geometric features is extracted: eccentricity, orientation, compactness, border length, area, center of gravity (COG), and second-order moments. Finally, a rule-based classifier maps the extracted features to the correct sign. A recognition rate of about 95.3% on testing data was achieved over a dataset of 7 words. In future work, this method will be applied to a larger dataset and compared with other methods, such as artificial neural networks, for a better evaluation of its performance.