A Machine Learning Approach for Expression Detection in Healthcare Monitoring Systems

Abstract: Expression detection plays a vital role in determining a patient's condition in healthcare systems. It helps monitoring teams respond swiftly in an emergency. In unconstrained environments, results are often compromised by variations in pose, scale, occlusion and illumination in the image of the patient's face, because suitable methods are lacking. A novel patch-based multiple local binary patterns (LBP) feature extraction technique is proposed for analyzing human behavior using facial expression recognition. It consists of three-patch LBP (TPLBP) and four-patch LBP (FPLBP) based feature engineering. Image representation is encoded from local patch statistics using these descriptors. In contrast to pixel-based methods, TPLBP and FPLBP use short bit strings to encode similarities between adjacent patches of pixels. Coded images are transformed into the frequency domain using a discrete cosine transform (DCT). The most discriminant features extracted from the coded DCT images are combined to generate a feature vector. Support vector machine (SVM), k-nearest neighbor (KNN), and Naïve Bayes (NB) classifiers are used to classify facial expressions from the selected features. Extensive experiments are performed on the standard extended Cohn-Kanade (CK+) and Oulu-CASIA datasets. Results demonstrate that the proposed methodology outperforms the other techniques used for comparison.


Introduction
In the last two decades, extensive research has been done in the field of facial expression recognition (FER). Facial expression is the most expressive means of communication among humans and is generally categorized into seven basic types: anger, disgust, fear, happiness, sadness, surprise, and neutral [1]. Human behavior detection through recognition of expressions plays a very important role in human-computer interaction and has attracted much attention in surveillance, healthcare, forensics, missing individual identification, crime investigation, interactive games, intelligent transportation and many other applications [2]. Facial expression recognition is a pattern recognition and computer vision problem in which a two-dimensional image of the face is acquired and features are extracted from it for classification. Thorough research has been carried out in recent years and several techniques have been proposed to achieve better performance. These techniques usually produce good results in a constrained environment. However, identifying expressions from images captured in an unconstrained environment is still challenging because of variations in resolution and illumination in certain datasets. The proposed method is able to deal with images that resemble real-world images in terms of these factors and to estimate the expression adequately.
In facial expression recognition, the main step is feature extraction from still images or video frames to obtain the appearance-based and geometric-based variations that map to a target facial expression [3]. This paper investigates the use of multiple LBPs (TPLBP and FPLBP) in facial expression recognition. Coded images are obtained after the extraction of features using TPLBP and FPLBP. In the next step, these coded images are converted into the frequency domain using DCT. The most discriminative features are computed from both forms of coded DCT images and combined into a fused feature vector for analyzing emotions. SVM, KNN and NB classifiers are used to classify the emotions in two publicly available databases, namely CK+ [4][5][6][7][8][9][10] and Oulu-CASIA [11][12][13]. The proposed technique gives better performance in terms of accuracy and robustness than the techniques presented in the literature survey.
This paper is organized into five sections. Section 2 briefly reviews the relevant literature. The proposed technique is presented in Section 3. Section 4 describes the experimental results and discussion. Finally, Section 5 presents the conclusion and future work.

Related Work
Texture-based and appearance-based descriptors are popular and extensively used for multi-scale facial expression analysis; examples include principal component analysis (PCA) to reduce dimensionality, linear discriminant analysis (LDA) for feature selection, and local binary patterns (LBP) [14,15]. LBP [16][17][18][19] is a successful feature extraction technique for emotion recognition and other image processing applications. LBP is calculated for each pixel in an image from the neighbors around that pixel. The eight neighbors are thresholded against the value of the central pixel to produce a binary number. The binary values are used to generate histograms representing the appearance-based regions. Many LBP variations have been investigated in previous work to resolve the issues of illumination, multi-scale, and high-dimension variations in FER. Automatic emotion recognition is investigated in [20,21] using the Weber local descriptor (WLD) technique for frontal and spontaneous images and implemented in an e-healthcare environment. The WLD histogram features are computed using the Fisher discriminant ratio (FDR). These features are classified through a support vector machine (SVM). The WLD-based system gives better performance on the JAFFE and Cohn-Kanade databases. Guo et al. [22] proposed an enhanced deep learning hybrid CNN-BiLSTM (EJH-CNN-BiLSTM) algorithm to detect pain intensity from facial expressions. A fine-tuned, pre-trained VGG-Face model is used as the feature extraction tool on the balanced UNBC-McMaster Shoulder Pain Archive database. Principal component analysis is applied to reduce dimensionality and enhance efficiency. The algorithm is used to estimate four levels of pain. The results indicate that the algorithm is a potential tool for automatic pain detection in medical diagnostics.
A novel algorithm called online sequential extreme learning machine and spherical clustering (OSELM-SC) is proposed by Muhammad et al. [23]. In this approach, several techniques are applied to the original face images, including the Viola-Jones detector for face detection and cropping, and histogram equalization to handle illumination variations. Features are extracted by applying the curvelet transform to every region of the face image. Statistical features are then extracted through the mean, standard deviation, and entropy. The features are classified through the proposed OSELM algorithm. The best performance is achieved for frontal and spontaneous images. A new facial decomposition technique based on the IntraFace (IF) algorithm is presented that uses landmarks to compute regions of interest (ROI). Texture and shape-based features (LTP, HOG, LBP, and CLBP) are extracted and classified through SVM. This technique achieves a better recognition rate than other decomposition-based techniques. In [24], appearance-based and geometric-based features are extracted from local face regions and then combined. LBP and normalized central moments (NCM) are used as features. The technique compares local region-based and grid-based holistic representations. The local region-based method performs better than the grid-based representation when the SVM classifier is applied to the features. The histogram of oriented gradients (HOG) and the most discriminant discrete cosine transform (DCT) features are extracted in [25]. The system is accurate and reliable in handling illumination variation and multi-scale (resolution variance) problems. It achieved better performance on images from the MMI and CK+ datasets, with features classified using KNN, sequential minimal optimization (SMO), and random forest (RF).
Donia et al. [26] proposed a new framework, DSAE, which is based on a deep sparse network that automatically recognizes expressions. This technique is purely feature-based: Active Appearance Model (AAM), principal component analysis (PCA), and histogram of oriented gradients (HOG) features are extracted. The output features are used as input to a deep sparse coding network, the Deep Sparse Auto Encoder (DSAE), which gives better performance in terms of accuracy. Weber's local binary image cosine transform (WLBI-CT) has been proposed in [27]. This technique investigates multi-orientation and multi-scale face images. The frequency components of images are computed through local binary pattern (LBP) descriptors and Weber local descriptors (WLD). The WLD and LBP generate face image features that are obtained through DCT with orientation and without orientation, respectively. These feature vectors are then evaluated with different classifiers, including Naïve Bayes (NB), sequential minimal optimization (SMO), multilayer perceptron (MLP), k-nearest neighbors (KNN), and classification tree. The main aim of the research in [28] is to recognize the basic emotions. Microsoft Kinect is used for 3D face modeling: it models the face using 121 specific points, and the Kinect device plots them as coordinates. Six action units, which the FACS system uses to describe emotions, are used as features and classified through KNN and MLP. Min Guo et al. proposed the K-ELBP algorithm by integrating the extended local binary pattern (ELBP) and the Karhunen-Loeve transform (KLT) [29]. ELBP retains the uniform patterns and removes the others. KLT is used to reduce the dimensionality. The K-ELBP histograms are obtained from the segmented blocks. The multi-SVM classifier is applied to the combined histogram to identify expressions accurately.
A salient geometric feature-based framework is presented by Ekman et al. [29] for an automatic FER system. The elastic bunch graph matching (EBGM) algorithm and the Kanade-Lucas-Tomasi (KLT) tracker are used to initialize the facial feature points and to track them, respectively. Three different types of geometric features (point, line, and triangle) are extracted from the tracked facial points. The discriminant line and triangle features are extracted through the extreme learning machine (ELM) and AdaBoost. SVM is used to classify the selected features. The line and triangle-based features are computed when the features are selected, while the point-based features are computed directly. The best performance is obtained using multiple datasets. To improve the power of learning deep features, a novel island loss technique is implemented for convolutional neural networks (CNN) [30]. Island loss reduces the intra-class variations caused by head position changes, occlusions and illumination variations. The island loss technique with CNN (IL-CNN) outperforms the baseline CNN, and it outperforms the other techniques on the CK+ dataset with seven expression classes and on the Oulu-CASIA dataset. To enhance healthcare services in smart cities, an FER technique [30] is proposed that extracts sub-bands by applying the bandlet transform to the face image. The weighted center-symmetric local binary pattern (CS-LBP) is applied block-wise to every sub-band of the image. The feature vector is formed by combining the CS-LBP histograms.
The most dominant features are extracted and classified by a support vector machine (SVM) and a Gaussian mixture model. The technique performs better on the JAFFE and CK datasets [30]. A novel FER system with a support vector machine (FERS) is proposed by Bargshady et al. [31]. Faces are detected in the image by combining a self-quotient image (SQI) filter with Haar-like features. The SQI filter is used to overcome lighting variations. Features are computed from the faces using the angular radial transform (ART), the discrete cosine transform (DCT), and the Gabor filter (GF). The SVM classifies the features and gives the best performance in terms of recognition rate for training and testing patterns. The simultaneous feature and dictionary learning (SFDL) technique is proposed for sets of face images. In each training and testing set, the images were captured with different illumination and pose variations. The SFDL method is applied to the raw face pixels to learn the features and dictionaries. In stage one of the learning procedure, the facial image sets are processed together. The deep SFDL (D-SFDL) method is proposed for non-linear face samples of image sets; it learns both class-specific dictionaries and hierarchical non-linear transformations. A shallow module is executed to extract the most discriminant information from the global and local regions to learn the low-level features. Then a part-based module is constructed to extract and learn dynamic local region information related to facial expressions. Long short-term memory (LSTM) and gated recurrent unit (GRU) layers are used to learn long-term dependencies. Extensive experiments show that this technique gives better performance on the CK+ and Oulu-CASIA datasets.

Proposed Patch Based Multiple Descriptors Technique
The proposed technique describes novel patch-based multiple LBP descriptors using TPLBP and FPLBP. Image representation is computed from local patch values through these descriptors, which encode the properties of the local micro-texture around every pixel using short binary strings. Three-patch LBP and four-patch LBP [31] are implemented to encode the most discriminative types of local texture-based information. The technique consists of four steps. In the preprocessing step, faces are detected and cropped using the Viola-Jones algorithm to overcome multi-scale variations, as described in Section 3.1. The feature engineering and fusion step introduces texture-based features. It helps in face detection through multiple LBP-based techniques. The discriminative features are extracted through DCT, and the feature vector is formed by fusing the selected features, as described in Section 3.2. Finally, the classification step identifies the expression type. The experimental results for each dataset are discussed in Section 3.3. The architecture of the proposed patch-based multiple descriptors technique is shown in Fig. 1.

Preprocessing
The input images are preprocessed using the Viola-Jones algorithm to detect and crop the face from the entire image, as shown in Fig. 2. The faces are converted to grayscale if the images are in RGB form. Because the resolution of the images varies, preprocessing uniformly maps all the data to 384 × 288 for the CK+ dataset and 65 × 65 for the Oulu-CASIA dataset. The features are then extracted and encoded in the feature engineering and fusion step.

Feature Engineering and Fusion
This section describes the three-patch LBP (TPLBP) and four-patch LBP (FPLBP) feature extraction mechanisms and the encoding process, which are performed on the pre-processed images.

Three Patch LBP (TPLBP) Coding
TPLBP and FPLBP work like the simple local binary pattern (LBP) [19] technique but extend it to a patch-based version. In TPLBP, the values of three patches are compared with each other to produce a single value, which is assigned to every pixel in the image to form a TPLBP coded image. For every pixel in the input face image, a w × w patch centered on that pixel is considered. S additional patches are distributed uniformly on a ring of radius r around the central patch, and the central patch is compared with pairs of patches that lie z patches apart along the ring. Each pair contributes a single bit, whose value depends on which of the two patches is more similar to the central patch, resulting in an S-bit code per pixel. The following formula is applied to every pixel in the image to produce the three-patch LBP code.
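The equation itself does not appear in this text. A reconstruction that is consistent with the symbols used in the surrounding paragraphs, and that assumes the standard three-patch LBP definition, is sketched below. The summation bounds and the exact form of the threshold function f are assumptions rather than the authors' typeset equations.

```latex
\mathrm{TPLBP}_{r,S,w,z}(p) = \sum_{i=0}^{S-1} f\bigl( d(Y_i, Y_p) - d(Y_{(i+z) \bmod S}, Y_p) \bigr)\, 2^{i} \tag{1}

f(x) = \begin{cases} 1, & x \ge t \\ 0, & x < t \end{cases} \tag{2}
```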
In Eq. (1), Y_p denotes the central patch, and Y_i and Y_{(i+z) mod S} denote the two patches along the ring. The function d(p1, p2) measures the distance between any two patches p1 and p2 (e.g., the L2 norm of their gray-level differences). The function f is a thresholding function, given in Eq. (2). The value t in Eq. (2) provides some stability in uniform regions (e.g., t = 0.01). Processing speed is increased by obtaining the patches through nearest-neighbor sampling instead of interpolating their values.
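As a rough illustration of the coding step described above, the following pure-Python sketch computes one TPLBP code per pixel. It assumes the standard three-patch rule (bit i is set when d(Y_i, Y_p) - d(Y_{(i+z) mod S}, Y_p) >= t), an odd patch size w, border handling by clamping coordinates, and illustrative default parameters. The function names and all of these choices are assumptions for the example, not the authors' implementation.

```python
import math


def patch(img, cy, cx, w):
    """Flat list of the w x w patch centred at (cy, cx); w is assumed odd.
    Coordinates outside the image are clamped to the border (an assumption)."""
    rows, cols = len(img), len(img[0])
    half = w // 2
    return [
        img[min(max(cy + dy, 0), rows - 1)][min(max(cx + dx, 0), cols - 1)]
        for dy in range(-half, half + 1)
        for dx in range(-half, half + 1)
    ]


def l2(a, b):
    """L2 norm of the gray-level differences between two patches."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def tplbp_code(img, y, x, r=2, S=8, w=3, z=2, t=0.01):
    """S-bit three-patch code for the pixel at (y, x)."""
    centre = patch(img, y, x, w)
    dists = []
    for i in range(S):
        angle = 2 * math.pi * i / S
        # Nearest-neighbour sampling: round ring positions to whole pixels.
        cy = y + int(round(-r * math.sin(angle)))
        cx = x + int(round(r * math.cos(angle)))
        dists.append(l2(patch(img, cy, cx, w), centre))
    code = 0
    for i in range(S):
        if dists[i] - dists[(i + z) % S] >= t:
            code |= 1 << i
    return code


def tplbp_image(img, **params):
    """Coded image with one TPLBP code per pixel."""
    return [
        [tplbp_code(img, y, x, **params) for x in range(len(img[0]))]
        for y in range(len(img))
    ]
```

On a perfectly uniform image every patch distance is zero, so no difference reaches the threshold t and every code is 0, which is the stabilizing effect of t mentioned above.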
The input image is encoded to produce a coded image, in a manner similar to the CSLBP descriptor [24]. The coded image is split into a grid of non-overlapping regions, and for every region a histogram is computed that measures the frequency of every binary code. Each region histogram is normalized to unit length, its values are truncated at 0.2, and it is normalized to unit length again. The histograms generated for an image are concatenated to form a single vector.
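The histogram step can be sketched as follows. This is a minimal illustration that assumes "unit length" means unit L2 norm and that the image dimensions are divisible by the grid size. The function names `normalise_histogram` and `grid_histograms` and the `grid` and `bins` parameters are hypothetical.

```python
import math


def normalise_histogram(hist, clip=0.2):
    """Normalise to unit L2 length, truncate at `clip`, then normalise again."""
    def unit(values):
        norm = math.sqrt(sum(v * v for v in values))
        return [v / norm for v in values] if norm > 0 else list(values)

    return unit([min(v, clip) for v in unit(hist)])


def grid_histograms(coded, grid, bins):
    """Concatenate normalised code histograms over a grid x grid partition.
    Assumes every code is in range(bins) and the image splits evenly."""
    rows, cols = len(coded), len(coded[0])
    region_h, region_w = rows // grid, cols // grid
    vector = []
    for gy in range(grid):
        for gx in range(grid):
            hist = [0] * bins
            for y in range(gy * region_h, (gy + 1) * region_h):
                for x in range(gx * region_w, (gx + 1) * region_w):
                    hist[coded[y][x]] += 1
            vector.extend(normalise_histogram(hist))
    return vector
```

Truncating at 0.2 limits the influence of a few dominant codes, so a region in which one code is very frequent does not swamp the rest of the descriptor.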

Four Patch LBP Coding
In four-patch LBP, two circles of radii r1 and r2 are centered on every pixel in the input image. S additional patches of size w × w are distributed uniformly around each circle. Two center-symmetric patches on the inner circle are compared with two center-symmetric patches on the outer circle, which are located z patches away along the circle. Of the two pairs compared, the one with the higher similarity determines the value of one bit in the pixel's code. With S patches along every circle there are S/2 center-symmetric pairs, which sets the length of the binary code [14]. The coded image is computed by executing this two-step process. The following equation computes the FPLBP coded image.
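The equation is not reproduced in this text. Assuming the standard four-patch LBP definition, and writing Y_{1,i} and Y_{2,i} for the i-th patch on the inner and outer circle (notation introduced here for illustration only), it can be sketched as below, where d is a patch distance and f is a thresholding function with threshold t, as in the three-patch case.

```latex
\mathrm{FPLBP}_{r_1,r_2,S,w,z}(p) = \sum_{i=0}^{S/2-1}
  f\Bigl( d\bigl(Y_{1,i},\, Y_{2,(i+z) \bmod S}\bigr)
        - d\bigl(Y_{1,i+S/2},\, Y_{2,(i+S/2+z) \bmod S}\bigr) \Bigr)\, 2^{i}
```

Under this form each pixel receives an S/2-bit code, which matches the S/2 center-symmetric pairs mentioned above.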

Discrete Cosine Transform (DCT)
DCT represents an image as a sum of sinusoids of varying magnitudes and frequencies. For feature extraction, the DCT-II variant is used in the proposed approach. The two-dimensional (2-D) DCT is applied to an input image and converts it into a matrix of DCT coefficients of the same size as the input image. The most significant information is stored in just a few coefficients in the top-left corner of the transform output, which correspond to the low frequencies. These are extracted in a zigzag manner and the high frequencies are discarded, as shown in Figs. 3c and 3d. For this reason, the DCT is often used in image compression applications such as JPEG. For an X × Y input image, the DCT is computed by the following equation for x = 0, 1, ..., A and y = 0, 1, ..., B, where f(u, v) denotes the image intensity and F(x, y) denotes the computed DCT coefficients in 2-D matrix form.
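For illustration, a direct (unoptimized) 2-D DCT-II can be written in pure Python as below. The orthonormal scaling factors and the assumption that x and y range over the full image size are choices made for this sketch, since the text's equation is not reproduced. A practical system would use a fast library routine instead of the quadruple loop.

```python
import math


def dct2(image):
    """Orthonormal 2-D DCT-II of a list-of-lists image f(u, v).
    Returns F(x, y) with the same shape as the input."""
    rows, cols = len(image), len(image[0])

    def alpha(k, n):
        # Scaling that makes the transform orthonormal (an assumption here).
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * cols for _ in range(rows)]
    for x in range(rows):
        for y in range(cols):
            total = 0.0
            for u in range(rows):
                for v in range(cols):
                    total += (
                        image[u][v]
                        * math.cos((2 * u + 1) * x * math.pi / (2 * rows))
                        * math.cos((2 * v + 1) * y * math.pi / (2 * cols))
                    )
            out[x][y] = alpha(x, rows) * alpha(y, cols) * total
    return out
```

A constant image has all of its energy in the single top-left (DC) coefficient, which is the extreme case of the low-frequency concentration described above.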

Feature Fusion
In the feature engineering phase, the TPLBP and FPLBP codes are generated and features are extracted through DCT separately for each of these methods. The features are then fused to construct a feature vector in a zigzag manner. The zigzag function takes a matrix and a number of features, such as 64 (8 × 8), as input and returns a one-dimensional array containing the result of the zigzag scan. For example, it stores the value of the first pixel and then proceeds in the right and down directions until the 8 × 8 matrix is complete, as shown in Figs. 4d and 4h. The same process is repeated for all the images coded through TPLBP and FPLBP to form a concatenated feature vector of 128 values (64 from each coded image), as shown in Tab. 1.
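The scan and fusion can be sketched as follows. The sketch assumes that the 64 features are the top-left 8 × 8 block of each DCT matrix read in JPEG-style zigzag order; the function names `zigzag` and `fuse` and the `block` parameter are hypothetical.

```python
def zigzag(matrix, block=8):
    """Read the top-left block x block corner of `matrix` in zigzag order."""
    order = sorted(
        ((i, j) for i in range(block) for j in range(block)),
        # Walk the anti-diagonals; alternate direction on odd and even ones.
        key=lambda ij: (ij[0] + ij[1], ij[0] if (ij[0] + ij[1]) % 2 else -ij[0]),
    )
    return [matrix[i][j] for i, j in order]


def fuse(tplbp_dct, fplbp_dct, block=8):
    """Concatenate the zigzag features of the two coded DCT images."""
    return zigzag(tplbp_dct, block) + zigzag(fplbp_dct, block)
```

With `block=8`, `fuse` returns 64 + 64 = 128 values, matching the feature vector length stated above.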

Classification
The features are encoded and extracted in the feature engineering step and then combined in the feature fusion step to form the feature vectors for each dataset separately. SVM, KNN and SMO classifiers [31] are used to classify facial expressions using the selected features.
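Of the classifiers named above, KNN is the simplest to illustrate without external libraries. The sketch below assumes Euclidean distance and majority voting with k = 3; the paper does not state the distance measure or the value of k here, so both are assumptions, and `knn_predict` is a hypothetical name.

```python
import math
from collections import Counter


def knn_predict(train_x, train_y, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    def euclidean(a, b):
        return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

    nearest = sorted(
        range(len(train_x)), key=lambda i: euclidean(train_x[i], query)
    )[:k]
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]
```

In the setting described in this paper, each training vector would be a 128-value fused feature vector and each label one of the expression classes.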

Experimental Results and Discussion
Experiments have been performed on two different facial expression datasets: the extended Cohn-Kanade (CK+) dataset and a partial Oulu-CASIA dataset. Details of the datasets are given in Tab. 2.
The proposed technique has been assessed using performance measures such as precision, recall, accuracy, specificity, sensitivity, and F-score. To determine the efficiency of the proposed technique, it has been compared with existing methods on the included datasets.
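These measures can all be derived from a confusion matrix. The sketch below assumes that rows hold the true class and columns the predicted class, and computes per-class values in a one-vs-rest fashion; the function names are hypothetical. Note that recall and sensitivity are the same quantity.

```python
def per_class_metrics(confusion, cls):
    """One-vs-rest precision, recall (sensitivity), specificity and F-score
    for class index `cls`; rows are true classes, columns are predictions."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[cls][cls]
    fn = sum(confusion[cls]) - tp
    fp = sum(confusion[r][cls] for r in range(n)) - tp
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_score = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f_score": f_score,
    }


def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total
```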

Experiments on Extended Cohn Kanade (CK+) Dataset
Initially, experiments are performed on the CK+ dataset according to the descriptions given in Tab. 3; the distribution of the images from the dataset is shown in Tab. 4. Fig. 5 shows the performance of multiple classifiers (SVM, KNN, and SMO) on the CK+ dataset. Tab. 5 shows the recognition rate of the linear-kernel SVM classifier. Values in bold indicate the best recognition cases for each class of the CK+ dataset. The recognition rate is high, but anger is misclassified as disgust and happy.
Tab. 6 shows the recognition rate when the KNN classifier is used. The recognition rate is good, but anger and disgust are confused with fear, and disgust is misclassified as happy. Tab. 7 shows the recognition rate for the SMO classifier. The recognition rate is high, but anger is confused with sad, while disgust and sad are misclassified as anger.
A comparison of the proposed technique with the techniques presented in the literature on the CK+ dataset is given in Tab. 8. In one of the compared techniques, histogram of oriented gradients features are generated along with the discrete cosine transform; this hybrid technique performs better when images with multi-scale and illumination variations are used. In another, local binary pattern (LBP) and Weber local descriptor (WLD) features are obtained from the face images, along with a DCT-based feature extraction mechanism, to perform classification; that technique is observed not to be robust to noise. Tab. 8 shows that the performance of the proposed system is comparable with other relevant techniques, and the results demonstrate the effectiveness of the proposed technique.

Experiments on Oulu-CASIA Dataset
The performance of the proposed technique is also evaluated on a subset of the Oulu-CASIA dataset. The dataset uses two camera types, NIR (near infrared) and VL (visible light), to produce image sequences. The images are captured with these cameras under different illumination conditions: dark, strong, and weak light. The main dataset contains the image sequences of 80 different persons. Data for 20 persons is obtained from the VL camera with occlusion (with glasses) and without occlusion (without glasses), separately. For each expression, the image sequences with dark, strong and weak light for each of the 20 persons are combined, separately for the with-occlusion and without-occlusion cases.

Experiments on Oulu-CASIA with Occlusion
The experiments are performed on a combination of dark, strong, and weak illumination images, frontal and spontaneous, of faces with glasses. Fig. 6 shows the average accuracy, sensitivity, specificity, precision, recall and F-score using multiple classifiers (SVM, KNN and SMO). Tab. 9 shows the recognition rate of the linear-kernel SVM classifier on the Oulu-CASIA dataset with occlusion. The recognition rate is good, but disgust is misclassified as anger, while happiness and surprise are confused with fear. Tab. 10 shows the recognition rate for the KNN classifier. Bold values indicate the best recognition rate on the Oulu-CASIA dataset with occlusion. The recognition rate is high, but disgust is misclassified as fear and happiness, while fear is confused with surprise. Tab. 11 shows the recognition results of the SMO classifier on the Oulu-CASIA dataset with occlusion. The confusion matrix indicates that disgust is confused with anger and sad, while fear is classified as surprise. Sad is also confused with anger, disgust and fear.

Experiments on Oulu-CASIA Without Occlusion
The experiments are performed on the dark, strong, and weak illumination images, frontal and spontaneous, of faces without glasses, combined into one dataset. Fig. 7 shows the average accuracy, sensitivity, specificity, precision, recall, and F-score using multiple classifiers. Tab. 12 shows the results of the linear-kernel SVM classifier on the Oulu-CASIA dataset without occlusion. The recognition rate shows that anger is confused with disgust. The results obtained with the SMO classifier on the Oulu-CASIA dataset without occlusion are listed in Tab. 14. The values reveal that disgust is confused with fear and happiness. The average recognition rate of the proposed technique is compared with existing methods on the same Oulu-CASIA dataset. The IL-CNN technique is often used to reduce intra-class variations, but its performance and accuracy are lower than those of the proposed technique. In another compared technique, long short-term memory (LSTM) and gated recurrent unit (GRU) layers are used to learn long-term dependencies. Tab. 15 shows that the performance of the proposed system is better than that of the other techniques.

Conclusion and Future Work
A novel technique based on patch-based multiple LBP descriptors, namely three-patch local binary patterns (TPLBP) and four-patch local binary patterns (FPLBP), has been proposed. The proposed system exploits the feature extraction ability of TPLBP and FPLBP along with DCT to overcome the issues of illumination, scale variations, high dimensions, noisy images, and the higher computational complexity of texture-based features. Multiple classifiers are used to classify the standard CK+ and Oulu-CASIA datasets, which contain posed and spontaneous emotions and illumination-variant and multi-scale face images. The proposed technique achieves a high recognition rate, although recognition remains comparatively difficult under variations in angle and noise. The performance can be further improved to manage these factors by using additional pre-processing techniques along with TPLBP and FPLBP.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.