The Effect of Classification Methods on Facial Emotion Recognition Accuracy

Interest in developing accurate automatic facial emotion recognition methodologies is growing rapidly, and it remains an active, ever-growing research field.


INTRODUCTION
Facial expression is a visible manifestation of the affective state, cognitive activity, intention, personality, and psychopathology of a person [1]. Since each person has a different cultural and social background, it is difficult to precisely determine the number of facial expressions used in daily life. However, research on analyzing facial expressions has focused on seven basic expressive gestures: anger, disgust, fear, happiness, sadness, neutral, and surprise [2]. To date, the recognition of facial expressions remains a problem with a large number of challenges, mainly for the following reasons: (i) face detection in the captured image and the segmentation process for Region of Interest (ROI) extraction is a difficult task [3]; (ii) generating effective descriptors from the ROI (i.e., the face region) that reduce the differences between samples within a class and increase the differences between classes is an important and difficult step in building a high-accuracy facial expression recognition system; and (iii) it is difficult to select a classifier that assigns the facial expression to the correct emotional state [4].
In general, there are two main methods for facial emotion feature generation: geometric-based features and appearance-based features. Geometric features are always present in the face but may be deformed by any kind of facial expression. The basic components of the face, or the points that represent facial features, are extracted from the image spatial domain and can be used to derive discriminating geometric features for classification [5]. Appearance-based methods use features that appear temporarily in the face during a facial expression (for example, the presence of specific facial wrinkles and bulges, and the texture of the facial skin in the regions surrounding the mouth and eyes). Transform filters, such as Haar wavelets and integral image filters, are applied to the ROI, which is either the overall face area or specific parts of the face region, to extract the feature vector [6].
Classification is the procedure of discriminating one class's data from other data in the feature space. There are many methodologies in pattern classification; the most widely used are those based on the statistical approach, the syntactic approach, template matching, support vector machines, and neural networks [7].
Anmar & Ahmed [8] used the Euclidean metric to measure the distance between a feature vector generated using the wavelet transform and the Discrete Cosine Transform (DCT) and the feature vectors of the six basic facial expression images. A recognition rate of about 92.2% was achieved using 123 facial image samples taken from the CK database for the six basic facial emotions (excluding the neutral class).
Zhang et al. [9] used an ANN to recognize the emotion class of the input image using two typical appearance-based facial features: Local Binary Patterns (LBP) and Gabor wavelet representations. Experimental results on 470 image samples taken from the CK database give accuracies of about 97.14% and 98.09% for LBP and Gabor wavelets, respectively.
Hen et al. [10] proposed the use of the Gabor filter with an SVM for face verification. When compared to state-of-the-art methods such as PCA-SVM and Eigenface, the proposed SVM outperformed both, with an error rate of 6.19%. Wang et al. [11] used SVMs to classify feature vectors generated using the Discriminant Laplacian Embedding (DLE) method. Six emotions from the CK database were used for performance evaluation; the attained test result was 87.4%.
In this paper, the effectiveness of three different classification methods (a classical statistical method, a neural network method, and a support vector machine method) for facial emotion recognition is discussed. The features used are geometric, representing various facial expressions.

MATERIALS AND METHODS
To build a facial emotion recognition system, four main stages are needed: (i) the face detection stage, (ii) the Regions of Interest (ROIs) extraction stage, (iii) the feature vector extraction stage, and (iv) the classification stage.

Face Detection and ROIs Extraction Stage
To extract a set of good discriminating features from the allocated ROIs, the face region is first allocated using knowledge-based face detection, in which human knowledge about face structure and the symmetry property is exploited. A face pattern is built, and the search for the face region is done against this pattern. After detection of the face region, the basic facial components (i.e., eyebrows, eyes, nose, and mouth) are extracted.
Since the illumination conditions and the skin and hair colors differ considerably across the tested dataset, adaptive contrast enhancement methods are used to extract the ROIs regardless of these difficulties (a more detailed description of this stage can be found in the published paper [12]).

Feature Vector Extraction Stage
In this paper, a set of geometric features is extracted using two types of geometrical operations: the distance between any two points and the angle formed by any three points.
A set of 26 facial points of interest is generated from the basic facial components (i.e., five points for each segment except the nose region, which is represented by a single point). Fig. 1 shows the locations of these 26 points; the point locations are labeled using capital English letters (from A to Z).
All possible distances and angles between these facial points are determined; in total there are 5525 possible features of both types (i.e., distances and angles).
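As a concrete illustration, the two geometric operations can be sketched as below; the landmark coordinates here are made-up stand-ins for the 26 detected points:

```python
import itertools
import math

# Hypothetical 2-D facial landmark coordinates (x, y); a real system would
# extract these from the detected face region.
points = {"A": (10.0, 20.0), "B": (14.0, 20.0), "C": (12.0, 25.0)}

def distance(p, q):
    """Euclidean distance between two landmark points."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def angle(p, vertex, q):
    """Angle (in degrees) at `vertex` formed by points p and q."""
    v1 = (p[0] - vertex[0], p[1] - vertex[1])
    v2 = (q[0] - vertex[0], q[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

# Distance-type features: every unordered pair of points.
dists = {pair: distance(points[pair[0]], points[pair[1]])
         for pair in itertools.combinations(points, 2)}

# Angle-type features: the angle at each possible vertex of a point triple.
angles = {(a, v, b): angle(points[a], points[v], points[b])
          for v in points
          for a, b in itertools.combinations([k for k in points if k != v], 2)}
```

With all 26 points, enumerating the pairs and triples in this way yields the candidate feature pool described above.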

For the distance-based features, a normalization process is applied to reduce the effects of face size and camera zoom variability. Normalization was performed according to formula (1):

Djn(i) = Dj(i) / µ(j)                                              (1)

where j is the index of the person (subject), i is the index of the distance to be normalized, µ(j) is the mean of all distances determined between every possible pair of facial reference points extracted from the sample of subject j, and Dj(i) and Djn(i) are the distance feature of index i for subject j before and after normalization, respectively.
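A minimal sketch of this normalization, assuming formula (1) is a division of each distance by the subject's mean distance (consistent with removing scale effects); the distance values are illustrative:

```python
# D_j(i): raw distance features for one subject j (made-up values).
raw_distances = [4.0, 6.0, 10.0]

# mu(j): mean of all the subject's distances.
mu = sum(raw_distances) / len(raw_distances)

# D_j^n(i): normalized distances, now independent of face size / zoom.
normalized = [d / mu for d in raw_distances]
```

After this step, the normalized distances of every subject have a mean of 1, so two images of the same face at different zoom levels yield comparable feature values.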

Feature Vector Clustering
The next step in the feature extraction stage is feature clustering; this step is needed for the following reasons: (a) a single person can express a given emotion in different ways, each time in a way that may differ from another time (as shown in Fig. 2a); (b) the way one person expresses a given emotion may differ from the way another person expresses it (as shown in Fig. 2b); (c) a given emotion can be expressed with different intensities (as shown in Fig. 2c).
To obtain more than one template for each class, data clustering is needed. Data here means the best N features that give the highest discrimination abilities. Different algorithms can be used for this purpose; one of them is the K-Means algorithm, which clusters each class into a fixed number (K) of subclasses. In this paper, a modified K-Means clustering algorithm is proposed. The modification is based on the idea of minimizing the clustering error in a cascade way (i.e., at each step, minimization is done by tuning each feature independently, such that the accumulative error over the whole feature set is minimized).

In mathematical form, for the feature vector of sample Si containing N features (f1i, f2i, f3i, …, fNi), the minimized accumulative mean absolute error of the features of a given class j from its mean (template) vector Tj is calculated as follows:

E(j) = Σ over i in class j Σ k=1..N | fki − Tj(k) |                       (3)

In equation (3), the error is calculated over all features to form one value that represents the total (accumulative) error; this means that minimizing the clustering error requires minimizing the total error resulting from all features together. The error can also be calculated and minimized for each feature individually as follows:

E(j, k) = Σ over i in class j | fki − Tj(k) |,  so that  E(j) = Σ k=1..N E(j, k)    (4)

By minimizing the error of each feature individually, the total clustering error for each class can be reduced effectively. We have exploited this property to enhance the K-Means clustering abilities: the final clustering error of each class is reduced by swapping the template values of each feature within the class and recalculating the total error for that class; if the new error is lower than the old one, the new template values are adopted. Algorithm 1 illustrates the steps of the modified K-Means clustering algorithm.
This clustering algorithm is applied to each of the 5525 previously extracted candidate features, and each feature is clustered into two templates within each class. More than two templates could be generated per feature, but in this work the list is confined to two templates per class to keep the clustering task manageable.
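The per-feature clustering can be sketched as follows for a single feature within one class; the `refine` function is one plausible reading of the template-swapping step described above, not the paper's exact Algorithm 1, and the deterministic initialization is our choice:

```python
def kmeans_1d(values, k=2, iters=50):
    """Plain 1-D K-Means over one feature's values within a class.
    Deterministic init: spread the k templates across the value range."""
    lo, hi = min(values), max(values)
    templates = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - templates[c]))
            clusters[nearest].append(v)
        templates = [sum(c) / len(c) if c else templates[i]
                     for i, c in enumerate(clusters)]
    return sorted(templates)

def clustering_error(values, templates):
    """Accumulative absolute error: each value vs. its nearest template."""
    return sum(min(abs(v - t) for t in templates) for v in values)

def refine(values, templates, candidates):
    """Greedy reading of the swap step: try each candidate value in place
    of each template and keep any swap that lowers the class error."""
    best, best_err = list(templates), clustering_error(values, templates)
    for i in range(len(templates)):
        for c in candidates:
            trial = best.copy()
            trial[i] = c
            err = clustering_error(values, trial)
            if err < best_err:
                best, best_err = trial, err
    return best

feature_values = [1, 1, 2, 8, 9, 9]        # one feature, one class
templates = kmeans_1d(feature_values)       # two template values per class
refined = refine(feature_values, templates, feature_values)
```

Note that the refinement lowers the absolute-error criterion of equation (3) below what the mean-based K-Means templates achieve, which is the motivation the paper gives for the modification.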

Feature Dimensional Reduction
Dimensionality reduction means the transformation of high-dimensional data into a meaningful representation of reduced dimensionality. Ideally, the reduced representation should have a dimensionality that corresponds to the intrinsic dimensionality of the data, with the minimum number of parameters needed to account for the observed properties of the data [13]. Linear discriminant analysis (LDA), also known as Fisher discriminant analysis, is an approach used for supervised dimensionality reduction and classification; it is based on classical statistical methods.
LDA reduces feature dimensionality by simultaneously minimizing the within-class distances among each class's samples and maximizing the between-class distances among all classes, thus achieving maximum class discrimination. Features are combined with each other (with the assumption that there are two templates for each feature), and the feature combination that gives the best recognition ability is chosen [14].

Fig. 2. Different cases in facial emotion expression: (a) two different ways to express the "happy" emotion for the same subject, (b) two different ways to express the "sad" emotion for two different subjects, (c) two different intensities to express the "fear" emotion for the same subject
The LDA algorithm indicated that the features listed in Table 1 are the best ones among the 5525 possible features (of distance and angle types).
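The selection criterion behind this LDA-style ranking can be illustrated with a small Fisher-score sketch (between-class scatter divided by within-class scatter, computed per feature); the data below is made up, standing in for two of the 5525 candidate features:

```python
def fisher_score(values, labels):
    """Fisher criterion for one feature: a feature scores highly when the
    class means are far apart relative to the within-class spread."""
    classes = sorted(set(labels))
    overall = sum(values) / len(values)
    between = 0.0
    within = 0.0
    for c in classes:
        vc = [v for v, l in zip(values, labels) if l == c]
        mc = sum(vc) / len(vc)
        between += len(vc) * (mc - overall) ** 2
        within += sum((v - mc) ** 2 for v in vc)
    return between / within if within else float("inf")

# Feature A separates the two classes cleanly; feature B does not.
labels = [0, 0, 0, 1, 1, 1]
feat_a = [1.0, 1.1, 0.9, 5.0, 5.1, 4.9]
feat_b = [1.0, 5.0, 3.0, 1.1, 4.9, 3.1]
scores = {name: fisher_score(vals, labels)
          for name, vals in [("A", feat_a), ("B", feat_b)]}
# Features would then be ranked by score and the top combination kept.
```

Under this criterion feature A scores far higher than feature B, so it would survive the reduction while B would be discarded.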

Classification Stage
In this stage three different classification methods have been analyzed to evaluate the recognition ability of the proposed feature vector.

Statistical classification method
In statistical classification approaches, an underlying probability model is exploited, which provides the probability of a sample being in each class rather than simply a classification [8]. To compare patterns, this approach uses measures (such as the Euclidean distance) computed between points in the statistical space [7]. Patterns to be classified are represented by a set of features defining a specific multidimensional vector; by doing so, each pattern is represented by a point in the multidimensional feature space [15].
To perform the classification task using this method, the distance between the feature vector of the input image and the templates of each emotion class is calculated using the Euclidean distance measure, and the class that shows the lowest distance is taken as the final classification result. Taking into account that there are two templates of the final feature vector within each class (so the total number of templates over all classes is 14), the Euclidean distance is computed against all possible feature-vector templates, as shown in Fig. 3.
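A minimal sketch of this nearest-template rule; the 3-D template vectors below are made-up stand-ins for the real feature vectors (the real system uses 7 classes × 2 templates = 14 templates):

```python
import math

# Two templates per class, as produced by the clustering stage
# (illustrative values only, and only three of the seven classes shown).
templates = {
    "happy":    [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "sad":      [[0.0, 1.0, 0.0], [0.1, 0.9, 0.1]],
    "surprise": [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
}

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def classify(x):
    # Compare against every template of every class; lowest distance wins.
    return min(((euclidean(x, t), cls)
                for cls, ts in templates.items() for t in ts))[1]

print(classify([0.05, 0.95, 0.05]))  # → sad
```

Because every class contributes all of its templates to the comparison, a class wins whenever any one of its templates is the nearest, which is exactly why multiple templates per class help with within-class variability.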

ANN classification method
An ANN is an information processing paradigm inspired by the way biological nervous systems (e.g., the brain) process information. An ANN is configured for a specific application (for example, pattern recognition, function approximation, prediction, clustering, and many others) [16].

A feed-forward neural network with the following structure is adopted (see Fig. 4):

No. of input layer nodes = 39
No. of hidden layer nodes = 20
No. of output layer nodes = 3
The network is trained using supervised learning to adjust the weight values so as to achieve the best representation of the emotion classes. A set of feature vectors (the training set) is given to the ANN during the training phase, and the suitable number of hidden nodes is chosen based on the data structure; in our system, the number that gave the best accuracy was 20 nodes in the hidden layer.
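Structurally, the adopted 39-20-3 network amounts to the forward pass sketched below. The weights are random stand-ins for the trained values, and the reading of the 3 output nodes as a compact code over the 7 emotion classes is our assumption (the paper does not spell out the output encoding):

```python
import numpy as np

# Random stand-ins for the learned weights of the 39-20-3 network.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(20, 39)), np.zeros(20)  # input  -> hidden
W2, b2 = rng.normal(size=(3, 20)), np.zeros(3)    # hidden -> output

def forward(x):
    """One forward pass: 39 inputs, 20 hidden tanh units, 3 sigmoid outputs."""
    h = np.tanh(W1 @ x + b1)
    return 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))

out = forward(rng.normal(size=39))   # a 39-D feature vector in, 3 values out
print(out.shape)                      # (3,)
```

During supervised training, the weight matrices W1 and W2 would be adjusted (e.g., by backpropagation) so that the 3 outputs match the target code of each training sample's emotion class.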

Fig. 3. Statistical classification method
SVM classification method
Recently, many machine learning algorithms have been used in recognition applications. In fact, the Support Vector Machine (SVM) has been a main choice of classification method over the last decade. The SVM classifier was presented by Cortes and Vapnik [10] as a supervised machine learning algorithm, and it has outperformed the state-of-the-art methods used for the same purpose [17]. The algorithm was originally proposed for binary classification (two classes), but it can easily be extended to multi-class classification. The main point of the algorithm is to find the optimal hyper-plane that separates the class labeled +1 from the class labeled -1; when determining the boundary of the data, it constructs this hyper-plane (as shown in Fig. 5).
The line between the two dashed lines is called the hyper-plane. The distance between these two lines is called the margin, and the points lying on the boundaries are called support vectors. Apart from that, the boundary implementation is based on the complexity of the decision function, and only a small subset of the training vectors lies on the margin.
Several researchers have used SVMs for classification purposes such as fingerprint recognition, face detection, speech recognition, image retrieval systems, and handwriting recognition. This is due to the fact that the SVM has many desirable characteristics, which make it very popular and commonly used.
Basically, different kernels have been proposed for the SVM, namely Linear, Polynomial, Radial Basis Function (RBF), and Sigmoid. The SVM depends mainly on two parameters, namely the cost (C) and gamma (γ); the best values are obtained using a grid search over the space {2^-15, 2^-13, …} for C and γ, respectively, on the training data. In addition, in this research a one-versus-all classification scheme has been used. Lastly, this paper implements the SVM classifier based on LIBSVM with its four kernels and a grid search algorithm to find the optimal SVM parameters [18]. The algorithm below illustrates the LIBSVM implementation.

RESULTS AND DISCUSSION
To analyze the performance of the proposed system with different classification methods, a subset consisting of 870 facial image samples is taken from the Cohn-Kanade (CK) database (from the last three frames of each sequence). This set of samples covers the seven basic facial expressions: neutral (NE), happy (HA), angry (AN), disgust (DI), fear (FE), sad (SA), and surprise (SU). The images are grayscale (in png format), and the size of each image is either 640x490 or 640x480 pixels.
When the Euclidean distance is used for classification with two templates for each emotion class, the recognition rate reached 83.22%. Table 2 shows the confusion matrix obtained when the statistical classifier is used.
To evaluate the performance of the ANN and SVM classifiers, the dataset was divided into two subsets: training and testing. The training subset consists of about 75% of each class's samples, taken randomly; the remaining samples form the test set.
After training the ANN with the training set and determining the weights of each feature in the final feature vector, the recognition ability of the proposed system is checked. The system's classification ability is evaluated with different random training and testing sets; Table 3 shows the accuracy results of the best 10 runs.

Input: training and testing data
Output: classification decision
Begin
• Scale the training and testing data.
• Apply the grid search method on the scaled training data with K-fold cross-validation (K = 10) to find the optimal SVM parameters C and γ.
• Select one of the SVM kernels, train the SVM based on the optimal C and γ, and build the training model.
• Test the SVM model using the testing data.
End

As shown in Table 3, the best system accuracy with the NN method over 10 random runs reached 91.72%. Table 5 shows the confusion matrix for this run.
As mentioned earlier, the SVM kernels depend mainly on two parameters, namely the cost and gamma; a grid search has been used to find the optimal values. In the experiments, the best parameters achieved are C = 32 and γ = 0.5, with k-fold cross-validation = 10. Table 6 illustrates the classification results of the SVM classifier with its four kernels. As shown in Table 6, the RBF-kernel SVM classifier gives a higher level of accuracy compared with the other kernels.
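The scaling and grid-construction steps of the SVM pipeline can be sketched as below. The exact scaling equation is not reproduced in the text, so min-max scaling to [0, 1] is an assumption, as is the upper bound of the power-of-two grid (the text shows only {2^-15, 2^-13, …}):

```python
def minmax_scale(column):
    """Scale one feature column to [0, 1] (a common choice for SVM input;
    the paper's exact scaling equation is not shown)."""
    lo, hi = min(column), max(column)
    return [(v - lo) / (hi - lo) if hi > lo else 0.0 for v in column]

# Power-of-two candidate grid for gamma, matching the partially shown
# search space {2^-15, 2^-13, ...}; the upper bound 2^3 is an assumption.
gamma_grid = [2.0 ** e for e in range(-15, 4, 2)]

scaled = minmax_scale([2.0, 4.0, 6.0])
```

Note that the reported best value γ = 0.5 is 2^-1, i.e., it lies exactly on this power-of-two grid, which is consistent with a grid-search result.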
In addition, to verify the stability of the proposed system, 10 different random training and testing sets were tested with the RBF-kernel SVM classifier. Table 7 shows the results of the 10 different random runs. As shown in Table 7, the average over the 10 runs gives a higher classification rate compared with the other state-of-the-art algorithms. Moreover, the standard deviation (STD) indicates that the proposed system is stable and performs well under different random runs. Fig. 6 gives a classification comparison of the three classification methods, namely statistical, NN, and SVM. Basically, the machine learning methods (SVM and NN) outperform the statistical method (Euclidean distance). The NN classifier gives a high classification rate due to its clustering ability, but the SVM classifier outperforms the benchmark methods with stable results; this is because the SVM works well with high-dimensional data and bases its classification decision on the located support vectors.

CONCLUSIONS
A new facial emotion recognition system using geometric features is presented. The LDA algorithm was used for dimensionality reduction of the feature space. To cluster each feature into K templates, a modified K-Means clustering algorithm is presented that minimizes the final clustering error. Three different classification methods are tested in the classification stage, and a subset of 870 facial image samples from the CK database is used to evaluate them. The experimental results show that the SVM classifier with the Radial Basis Function kernel outperforms the benchmark classifiers; this is due to the fact that the SVM can deal with high-dimensional feature vectors and constructs the hyper-plane from only some points of the training data (the support vectors), not all of it. As future work, more than two templates could be used in the modified K-Means clustering, and the classification methods could be tested on these feature clusters, where more accurate results might be achieved, especially for the statistical and SVM classifiers.

DISCLAIMER
The images used in this research belong to the public Cohn-Kanade (CK) database, which can be downloaded from the web site of the Affect Analysis Group at Pittsburgh (http://www.pitt.edu/~emotion/ck-spread.htm).