An Effective Combination of Textures and Wavelet Features for Facial Expression Recognition

To meet the research goals of facial expression recognition, a proper combination of a classifier and adequate feature extraction is necessary: if inadequate features are used, even the best classifier can fail to achieve accurate recognition. In this paper, a new fusion technique for human facial expression recognition is used to accurately recognize human facial expressions. A combination of Discrete Wavelet Transform (DWT), Local Binary Pattern (LBP), and Histogram of Oriented Gradients (HOG) feature extraction techniques was used to investigate six human emotions. K-Nearest Neighbors (KNN), Decision Tree (DT), Multi-Layer Perceptron (MLP), and Random Forest (RF) were chosen for classification. These algorithms were implemented and tested on the Static Facial Expressions in the Wild (SFEW) dataset, which consists of facial expression images captured under near real-world conditions. The proposed algorithm exhibited 87% accuracy, which is higher than the accuracy of the individual algorithms.
Keywords—ANN; FER; DWT; LBP; HOG; K-Nearest Neighbors

I. INTRODUCTION
Facial expressions are a means of expressing sentiment and of non-verbal communication. Various systems deal with recognizing human attitude and point of view. Facial Expression Recognition (FER) is one of the most discussed scientific areas nowadays, and the problem is particularly important in Human-Computer Interaction (HCI) [1, 2]. FER is used to describe the mental state of human beings [3]. Meanwhile, the appearance of images can be altered by disturbances in the pixels, and illumination problems can occur in both indoor and outdoor photographs. This work examines those issues and proposes a strategy that fuses several available features to overcome them [4].

II. LITERATURE SURVEY AND THEORETICAL FRAMEWORK
Identifying human facial regions is very difficult, so a technique for recognizing facial indicators must be implemented to handle it efficiently. One aspect that is vital to understand is the dynamic appearance of the face in video [5]. The lower- and upper-face method extends the spatial pyramid histogram of edges, which supports 3-dimensional facial recognition; in this method, features are essentially investigated for happiness and sadness indicators [6]. LBP and Improved Local Binary Pattern have been applied alongside Coordinate Bunching Representation [7]. Face recognition using an optimized algorithm chain for both 2D and 3D images gives an accuracy of about 96% with an SVM classifier using LBP and PCA. Further testing on 2D and 3D images using LBP and PCA with a Feed-Forward Back-Propagation Neural Network (FFBPNN) proved less effective and efficient than the SVM classifier [8]. Locality Preserving Projections (LPPs) have been used for manifold systems originating from Local Binary Pattern (LBP) subjects [9]. First, a pyramid transform is used to divide the test photographs into different regions, so that the target sub-images are isolated. After this, ELBP is applied to the sub-images to compute the ELBP pyramid, and the locally determined values of the sub-images are weighted by an AWM, which estimates the importance of the information obtained. Finally, the AWELBPP feature is assembled from the combination of the ELBP pyramid and the AWM [10]. Support Vector Machines (SVMs) have been applied to noisy pictures for feature extraction [11, 12].
Background subtraction [13] showed good results when applied to real-time feeds. In that work, a Gaussian Mixture model was used to describe the image pixels, the parameters of the model were calculated with the Expectation-Maximization (EM) algorithm, and shadows were also spotted effectively. Background subtraction also proved very effective and met the requirements of drowning detection. The authors in [14] reviewed earlier approaches and addressed the problems of recognizing actions and behaviors and of dealing with moderately crowded scenes with a good modeling technique. The conventional techniques were mixed, and a Gaussian distribution was used to model the temporal variation of the background pixels in [15]; this has been proven insufficient for extremely non-stationary environments. However, a thresholding method with hysteresis dealt with the issue of choosing thresholds in the background subtraction context. Stationary cameras have also been used to find drowning persons in swimming pools [6, 17, 18]. In contrast to previous works based on geometrical and 3D Mahalanobis distance features, the method presented in [18] captured the temporal and spatial correlation of the swimmers along with color information using a Markov Random Field (MRF) context to give better performance. Promising outcomes for drowning detection were achieved using an exclusive functional link net which optimally fused the descriptors of the extracted swimmers. An improved descriptor fusion technique associated with a hierarchical technique was proposed in [19]. Current drowning detection techniques can be broadly classified into vision-based schemes and systems based on wearable sensors [20-22]. On the other hand, a combination of aerial and underwater cameras was used to monitor postures for FER in [23], whereas a CNN model achieved 99.78% accuracy [24].
An even higher accuracy was achieved in learning similarities and dissimilarities among the faces of the dataset using FDREnet in [25].
III. SYSTEM METHODOLOGY
Cameras are usually present in most areas for security purposes, and these already installed cameras can be utilized for monitoring and expression detection. A few critical frames are extracted from the video for use: a facial expression video is divided into frames to be processed, the image frames extracted from the video are used for feature extraction, and then classification is carried out.

A. Feature Extraction
The input dataset is too large to be handled and processed directly. It is assumed to be redundant (enough data, but not abundant information), so the input dataset is converted into a reduced representation set of features, named the feature vector (Fv). This process is known as feature extraction. Extracting the salient features from the images reduces the dimension of the Fv by removing the redundancy in the images and compressing the relevant data into a much smaller Fv. The DWT-based extraction consists of:
• Decomposing the image with the DWT in N levels, using decimation and filtering, to get the approximation and detailed coefficients.
• Feature extraction using the DWT coefficients output.
• The features that were taken out from the DWT coefficients of the images are considered as the input to classifiers because of their operative representation.
The algorithmic steps for feature extraction from the dataset are:
• Step 1: The image data are decomposed into 4 subbands (approximation plus details) by the DWT.
• Step 2: The approximation coefficients are further decomposed by the DWT to obtain localized data from the detailed subbands (horizontal, vertical, and diagonal).
• Step 3: For processing and analysis, the detailed coefficients of all 4 levels are calculated.
• Step 4: Finally, the features are analyzed and tabulated to be used as the input of the classifier.
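The steps above can be sketched with a minimal single-level Haar filter in NumPy. This is a simplified illustration, not the paper's implementation: the helper names, the even image dimensions, and the choice of per-subband statistics are assumptions.

```python
import numpy as np

def haar_dwt2(img):
    """One level of a 2-D Haar DWT: returns the approximation (LL) and the
    horizontal (LH), vertical (HL), and diagonal (HH) detail subbands,
    each half the size of the input (assumes even dimensions)."""
    img = img.astype(float)
    # Filter and decimate columns: low-pass = pairwise mean, high-pass = pairwise difference
    lo = (img[:, 0::2] + img[:, 1::2]) / 2.0
    hi = (img[:, 0::2] - img[:, 1::2]) / 2.0
    # Then rows
    ll = (lo[0::2, :] + lo[1::2, :]) / 2.0
    lh = (lo[0::2, :] - lo[1::2, :]) / 2.0
    hl = (hi[0::2, :] + hi[1::2, :]) / 2.0
    hh = (hi[0::2, :] - hi[1::2, :]) / 2.0
    return ll, lh, hl, hh

def dwt_features(img, levels=4):
    """Decompose in N levels and collect simple statistics of every
    detail subband as the feature vector (Fv)."""
    feats = []
    approx = img
    for _ in range(levels):
        approx, lh, hl, hh = haar_dwt2(approx)
        for band in (lh, hl, hh):
            feats.extend([band.mean(), band.std()])
    return np.array(feats)

rng = np.random.default_rng(0)
img = rng.random((144, 176))  # stand-in for a face image frame
fv = dwt_features(img, levels=4)
print(fv.shape)  # 4 levels x 3 detail subbands x 2 statistics = (24,)
```

Each pass halves both dimensions, so the 4-level loop produces localized detail coefficients at four scales, as described in Steps 1-3.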

C. Feature Extraction via Histogram of Oriented Gradients (HOG)
The Histogram of Oriented Gradients (HOG) is a type of feature descriptor. The purpose of a feature descriptor is to generalize the object in such a way that the same object (in this case a person) produces as similar a feature descriptor as possible when viewed under different conditions, which makes the classification task simpler. Regarding the block normalization for HOG, let v be the non-normalized vector containing all histograms of a given block, ||v||_k be its k-norm for k = 1, 2, and e be some small constant. The normalized descriptor is defined as:

f = v / sqrt(||v||_2^2 + e^2)

Static Facial Expressions in the Wild (SFEW) was built by selecting frames from AFEW. The dataset covers unconstrained facial expressions, numerous head poses, a large age range, occlusions, and near real-world illumination. Frames were extracted from AFEW sequences and labeled based on the label of the series. In total, SFEW includes 700 images that have been classified into six fundamental expressions (anger, disgust, fear, happiness, sadness, and surprise) by independent labelers.
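The block normalization above can be illustrated in NumPy. This is a sketch under assumed parameters (9 orientation bins, 8×8-pixel cells, 2×2-cell blocks, a toy 16×16 input); the helper names are hypothetical and this is not the paper's code.

```python
import numpy as np

def cell_histogram(mag, ang, bins=9):
    """Unsigned-orientation histogram (0-180 degrees) of one cell,
    weighted by gradient magnitude."""
    edges = np.linspace(0.0, 180.0, bins + 1)
    hist, _ = np.histogram(ang % 180.0, bins=edges, weights=mag)
    return hist

def normalize_block(v, eps=1e-5):
    """L2 block normalization: f = v / sqrt(||v||_2^2 + e^2)."""
    return v / np.sqrt(np.sum(v ** 2) + eps ** 2)

rng = np.random.default_rng(0)
img = rng.random((16, 16))
gy, gx = np.gradient(img)           # image gradients
mag = np.hypot(gx, gy)              # gradient magnitude
ang = np.degrees(np.arctan2(gy, gx))  # gradient orientation

# One 2x2-cell block of 8x8-pixel cells: histograms concatenated, then normalized
cells = [cell_histogram(mag[r:r + 8, c:c + 8], ang[r:r + 8, c:c + 8])
         for r in (0, 8) for c in (0, 8)]
f = normalize_block(np.concatenate(cells))
print(f.shape)  # 4 cells x 9 bins = (36,)
```

The small constant e keeps the division stable for nearly uniform blocks, where the gradient energy is close to zero.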

D. Feature Extraction via Local Binary Pattern (LBP)
The LBP method is applied to the facial images in order to extract features that can be used to measure similarity. First, the pictures are divided into several blocks, and the LBP histogram is calculated for each block.
The value of the LBP code of a pixel (x_c, y_c) is computed as:

LBP_{P,R}(x_c, y_c) = sum_{p=0}^{P-1} s(g_p - g_c) 2^p,  with s(x) = 1 if x >= 0 and 0 otherwise,

where g_c is the gray value of the center pixel and g_p are the gray values of its P neighbors on a circle of radius R. The histogram of the labeled image f_l(x, y) is

H_i = sum_{x,y} I{ f_l(x, y) = i },  i = 0, ..., n-1,

where n is the number of different labels produced by the LBP operator, and I{A} is 1 if A is true and 0 if it is false. Further, the histograms of the image patches that are to be compared must be normalized in order to get a coherent description:

N_i = H_i / sum_{j=0}^{n-1} H_j

Then, the block LBP histograms are concatenated into a single vector. The histograms are then compared using a similarity measure [16]. Each bin in a histogram counts the occurrences of its label within the region. Finally, the feature vector is constructed by concatenating the local histograms into one large histogram.
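A minimal sketch of the basic 8-neighbor LBP operator and the normalized histogram N_i, assuming the standard 3×3 (P = 8, R = 1, 256-label) variant; the function names are hypothetical, not the paper's code.

```python
import numpy as np

def lbp_code(patch):
    """LBP code of the center pixel of a 3x3 patch: threshold the 8
    neighbors against the center, i.e. s(g_p - g_c), and read the bits
    as a binary number sum_p s(g_p - g_c) * 2^p."""
    center = patch[1, 1]
    # Clockwise neighbor order starting at the top-left corner
    neighbours = patch[[0, 0, 0, 1, 2, 2, 2, 1], [0, 1, 2, 2, 2, 1, 0, 0]]
    bits = (neighbours >= center).astype(int)
    return int((bits * (2 ** np.arange(8))).sum())

def lbp_histogram(img, n_labels=256):
    """Histogram H_i = sum over pixels of I{LBP(x, y) = i}, normalized
    so that the bins sum to 1 (the N_i above)."""
    codes = [lbp_code(img[r - 1:r + 2, c - 1:c + 2])
             for r in range(1, img.shape[0] - 1)
             for c in range(1, img.shape[1] - 1)]
    hist = np.bincount(codes, minlength=n_labels).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(10, 10))  # stand-in for one image block
h = lbp_histogram(img)
print(h.shape)  # (256,)
```

In the full method this per-block histogram would be computed for every block of the face image and the blocks concatenated into one large feature vector, as described above.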

IV. RESULTS AND DISCUSSION
In this study, the SFEW dataset, which is close to a real-world environment, was used for testing. It comprises 300 color images in 6 emotion categories of 50 pictures each, with dimensions of 143×181 pixels. The classes are Surprise, Fear, Anger, Sadness, Disgust, and Happiness, represented by SU, F, A, S, D, and H respectively. The results were evaluated with assessment metrics including the confusion matrix, precision, recall, and F1 score. To compute the overall precision, we used micro-averaging to combine the results across the 6 categories. We divided our dataset into 80% training and 20% testing subsets. These sets were fed to the learning system, which utilized the K-Nearest Neighbors (KNN), Decision Tree (DT), Multilayer Perceptron (MLP), and Random Forest (RF) algorithms. Our experimental model was divided into four parts. The mentioned machine learning algorithms were applied directly to the first part of the dataset; Table I shows the original dataset accuracies, where the maximum accuracy, achieved by the RF, was only 32%. Then, the accuracies of all algorithms were computed using DWT, LBP, and HOG for the other parts of the dataset (Tables II-IV). Finally, the features were combined, achieving a maximum accuracy of 87% with the MLP and a minimum of 29% with KNN (Table V), shown respectively in the confusion matrices of Figures 2 and 3. Further, we also calculated some edges of the face generated by DWT, LBP, and HOG. The original image is reconstructed using the Haar DWT; the retained energy is 99.40%.
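Micro-averaged precision pools true positives and false positives over all classes before dividing; for single-label multi-class data this coincides with overall accuracy. A small sketch of the computation from a confusion matrix (the matrix values below are invented for illustration and are not the paper's results):

```python
import numpy as np

def micro_precision(cm):
    """Micro-averaged precision from a confusion matrix (rows = true class,
    columns = predicted class): pooled TP / pooled (TP + FP). Every
    prediction is a TP or an FP for some class, so the denominator is
    the total number of predictions."""
    tp = np.trace(cm)        # correctly classified samples on the diagonal
    total = cm.sum()         # all predictions
    return tp / total

# Toy 6-class confusion matrix for the classes SU, F, A, S, D, H
cm = np.array([
    [8, 1, 0, 1, 0, 0],
    [1, 7, 1, 0, 1, 0],
    [0, 1, 8, 0, 1, 0],
    [1, 0, 0, 9, 0, 0],
    [0, 1, 1, 0, 8, 0],
    [0, 0, 0, 0, 0, 10],
])
print(round(micro_precision(cm), 3))  # 50 correct out of 60 -> 0.833
```

Per-class precision and recall follow the same pattern using the column sums and row sums of the matrix respectively.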

V. CONCLUSION AND FUTURE WORK
The proposed model combines DWT, HOG, and LBP features in a feature extraction technique with machine learning algorithms as an effective way of enhancing the accuracy of facial expression recognition. Six facial expressions from the SFEW database were used for training and validation. The results indicated that the accuracy of the combined methods is 87%, which is higher than the individual accuracies of the combined algorithms. However, the proposed combination has a generalization issue, which may be addressed in our future work. FER is one of the most well-known areas in image processing and has been given increasing attention nowadays. The proposed technique also gives an overview of facial recognition methods. Feature extraction is vital, as it reduces a very large amount of data to only the required set; this shortens the processing time of the machine and makes the results more accurate. In future work, the accuracy may be augmented by using more learning algorithms. A similar approach using a Convolutional Neural Network can be combined with the prevailing support vector classifier.