A Novel Maximum Entropy Markov Model for Human Facial Expression Recognition

Muhammad Hameed Siddiqi; Md. Golam Rabiul Alam; Choong Seon Hong; Adil Mehmood Khan; Hyunseung Choo

doi:10.1371/journal.pone.0162702

Abstract

Research in video based FER systems has exploded in the past decade. However, most of the previous methods work well when they are trained and tested on the same dataset. Illumination settings, image resolution, camera angle, and physical characteristics of the people differ from one dataset to another. Considering a single dataset keeps the variance, which results from differences, to a minimum. Having a robust FER system, which can work across several datasets, is thus highly desirable. The aim of this work is to design, implement, and validate such a system using different datasets. In this regard, the major contribution is made at the recognition module which uses the maximum entropy Markov model (MEMM) for expression recognition. In this model, the states of the human expressions are modeled as the states of an MEMM, by considering the video-sensor observations as the observations of MEMM. A modified Viterbi is utilized to generate the most probable expression state sequence based on such observations. Lastly, an algorithm is designed which predicts the expression state from the generated state sequence. Performance is compared against several existing state-of-the-art FER systems on six publicly available datasets. A weighted average accuracy of 97% is achieved across all datasets.

Citation: Siddiqi MH, Alam MGR, Hong CS, Khan AM, Choo H (2016) A Novel Maximum Entropy Markov Model for Human Facial Expression Recognition. PLoS ONE 11(9): e0162702. https://doi.org/10.1371/journal.pone.0162702

Editor: Zhaohong Deng, Jiangnan University, CHINA

Received: March 15, 2016; Accepted: August 27, 2016; Published: September 16, 2016

Copyright: © 2016 Siddiqi et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: We used publicly available standard databases of facial expressions, which are available for free, such as 1- Extended Cohn-Kanade Dataset (CK+) of Carnegie Mellon University, USA. 2- Japanese Female Facial Expression (JAFFE) Dataset of Ritsumeikan 195 University, Kyoto, Japan. 3- Multimedia Understanding Group (MUG) Dataset of Aristotle University of 204 Thessaloniki, Thessaloniki, Greece. 4- USTC-NVIE spontaneous-based Dataset of University of Science and Technology, Hefei, Anhui, P.R. China. 5- Indian Movie Face Database (IMFDB) of Indian Institute of Information Technology, Hyderabad, India. 6- Acted Facial Expressions in the Wild Database (AFEW) of University of Miami, Florida, USA. Every researcher can use those databases.

Funding: This research was supported by the MSIP, Korea, under the G-ITRC support program (IITP-2015-R6812- 15-0001) supervised by the IITP, and by the Priority Research Centers Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (NRF-2010- 0020210).

Competing interests: The authors have declared that no competing interests exist.

Introduction

Knowledge about each other’s emotional states is important for effective communication among humans. They are responsive to each other’s emotions, and computers should gain this ability, too. Several scientific studies have been carried out to automatically detect human emotions in various fields. These include human-computer interaction [1, 2], psychology and cognitive sciences [3], access control and surveillance systems [4], and driver state surveillance. Physiological state of human body, such as blood pressure, heart rate, speech etc., is one way of monitoring someone’s emotions. Emotion recognition by recognizing facial expression offers a simple yet effective alternative [5–8].

A typical facial expression recognition (FER) system performs four tasks. These include: preprocessing of video data, feature extraction, feature selection, and recognition, as shown in Fig 1. The preprocessing module processes the video frames to remove noise, detects facial boundaries, and performs face segmentation. The segmented facial region is processed by the feature extraction module to extract distinguishing features for each type of expression, which are then quantified as discrete symbols [9]. The feature selection module selects a subset of extracted features using techniques such as linear discriminant analysis. Finally, the recognizer module uses a trained classifier on the selected features to recognize the expression in the incoming video stream.

Download:

Fig 1. General flow diagram for a typical facial expression recognition (FER) system.

https://doi.org/10.1371/journal.pone.0162702.g001

Previous studies in FER have mostly focused on the use of traditional learning methods in the recognizer module [10]. These include artificial neural networks (ANN), Gaussian mixture model (GMM), support vector machine (SVM), hidden Morkov model (HMM), deep learning methods, and hidden conditional random fields. Among these, HMM is the most commonly used learner for FER problems. However, as stated by [7, 11–13], the main weakness with HMM is its assumption that the current state depends on only the previous state.

Having these limitations and lack of improvement in HMM learning model, this paper investigates the use maximum entropy Markov model (MEMM) for FER. More specifically, in the proposed method the video observations are considered to be the observations of MEMM, and the facial expressions are modeled as the states of MEMM. A modified Viterbi is then used to generate the most probable expression state sequence based on modeled observations. Finally, the expression state is predicted from the most likely state sequence. It is also investigated and shown that the existing models are limited due to their independent assumptions which may result in decreasing the classification accuracy. For feature extraction and selection wavelet transform coupled with optical flow and stepwise linear discriminant analysis (SWLDA) are used, respectively. The proposed approach is tested and validated on six publicly available datasets. The average recognition accuracy is 97% across all the datasets. To the best of our knowledge, it is the first time that MEMM model is being utilized as a classifier for FER systems.

Related Works

This section summarizes different classification methods that have been used in existing studies. For instance, artificial neural networks (ANNs) were used by [14, 15] in their work on FER. The major problem with ANNs is their high computational complexity. They may suffer from the problem of local minima as well [7].

Other systems, including [16–19] achieved good recognition performance by utilizing support vector machines (SVMs). However, SVM does not exploit temporal dependencies between adjacent video frames and each frame is processed statistically independent of others [7]. Similarly, Gaussian mixture model (GMM) was employed by [20–22] in their respective systems. However, GMM lacks ability to model abrupt changes, which limits its applicability for recognizing spontaneous expressions [23].

Different kinds of facial expressions were recognized by [24, 25] using decision trees. The memory requirements of a decision tree-based classifier are usually high. In addition to this, the patterns in a decision tree are defined on expectations and these expectations could be illogical, which could result in error-prone decision trees. Although, a decision tree follows a pattern matching for events and relationships between them, it may not be possible to cover all the combinations. Such oversights can lead to bad decisions, which shows the limitation of decision trees. [26].

Some works, such as [27, 28] have employed bayesian networks-based classifiers. However, a bayesian network-based classifier requires prior knowledge. Having limited or incorrect prior knowledge degrades the recognition performance. Moreover, it is very difficult for bayesian networks to handle continuous data [29].

As stated in [7, 30], the most commonly used learning method for FER is the HMM. It offers advantage of handling sequential data when frame-level features are used. In such a case, vector-based classifiers, e.g., GMM, ANN, SVM, decision tree, and bayes classifier, do not perform well. However, HMM has a well-known problem: it assumes that the current state depends only on the previous state, due to which these two states must occur consecutively in the observation sequence. This assumption does not hold in reality. To solve this, non-generative models such as conditional random fields (CRF) [31] and hidden conditional random fields (HCRF) [7, 11, 13] were proposed. HCRF is an extension of CRF to learn hidden structure of sequential data through hidden states. Both of them use global normalization instead of per-state normalization. This allows for weighted scores and makes the parameter space larger than that of HMM. However, HCRF requires explicitly involving the full covariance Gaussian distribution in the observation level which may cause the complexity issue [7].

Materials and Methods

The details of each component of the proposed FER system is as follows.

Preprocessing

Global histogram equalization (GHE) [5] is used to improve the image quality. GHE does that by increasing the dynamic range of the intensity using the histogram of the whole image. It obtains the scale factor from the normalized cumulative distribution of the brightness distribution of the original image and multiplies this scale factor by the original image to redistribute the intensity [32]. GHE finds the running sum of the histogram values and then normalizes it by dividing it by the total number of pixels. This value is then multiplied by the maximum gray-level value and then mapped onto the previous values in a one-to-one correspondence [32].

For the face detection and extraction, active contour (AC) based model is used [30]. This method automatically detects and extracts human faces from the expression frames, which is based on level sets integrated with two energy functions: Chan-Vese (CV) energy function to remove the dissimilarities within a face, and Bhattacharyya distance function to maximize the distance between the face and background.

Feature Extraction and Selection

In order to represent movable parts of the face, features are extracted by applying the wavelet transform on the extracted facial regions. More specifically, the symlet wavelet transform coupled with optical flow is used. The former helps in diminishing the noise, whereas the latter extracts the facial movement features.

In order to remove any redundancy in the feature space,a non-linear feature selection method called stepwise linear discriminant analysis (SWLDA) is applied to the selected feature space. SWLDA selects the most informative features a forward selection model and removes the irrelevant features through a backward regression model. Further details are available in [30].

Proposed Model

Details of the Maximum Entropy Markov Model (MEMM).

As mentioned earlier, in this work the expression states are modeled as MEMM, as it is one of the best candidates for modeling the sequential states and observations similar to HMM. In generative HMM, the joint probability is used to determine the maximum likelihood of observation sequence. On the other hand, in discriminative MEMM, conditional probability is used to predict the state sequence from the observation sequence [33]. The dependency among the states and observations in HMM and MEMM are presented by the dependency graph shown in Fig 2.

Download:

Fig 2. (a) shows the dependency graph of HMM, while (b) presents the dependency graph of MEMM.

https://doi.org/10.1371/journal.pone.0162702.g002

Fig 3 presents the M state MEMM model. The set of states is defined as the facial expressions Ψ = {χ₁, χ₂, …, χ_M} = {Happy, Anger, Sad, Surprise, Fear, Disgust}. The corresponding frame observations are represented by the set Φ = {φ₁, φ₂, …, φ_ℑ}, where ℑ observation ranking in time. Each φ_i is the vector of observed discriminative features {δ₁, δ₂, …, δ_n}, which are extracted from the expression frames at time slot t_i. Finally, ℵ is the total number discriminative features. Now the primal objective is to determine the most likely state sequence L = {l₁, l₂, …, l_p} ∈ Ψ based on the current sequential observations Φ for the duration ℑ.

Download:

Fig 3. MEMM based on expression state model for FER system.

https://doi.org/10.1371/journal.pone.0162702.g003

To generate the most likely state sequence, HMM requires transition probability P (Ψ_i|Ψ_i−1), emission probability P (Φ_i|Ψ_i), and initial probability P (Ψ_i). On the other hand, MEMM requires a single function P (Ψ_i|Ψ_i−1, Φ_i), which is easily obtainable from the maximum entropy model, as discussed in next section. These properties of MEMM is the reason that this work uses it to model expression states for determining the hidden expression state sequences.

Learning and Parameter Estimation in MEMM.

Various methods exist in literature for estimating the parameters of MEMM, which are thoroughly described in [33]. This work utilizes the maximum entropy (MaxEnt: Ω) model (1) to estimate the transition probability from state Ψ_i−1 to Ψ_i based on the observation Φ. (1) where δ_k is the feature value of observations of the training dataset considering χ features in total, ζ_k is the trainable weights of the multinomial logistic regression.

Now to fulfill the probability axiom of summation of probabilities of whole state space should be equal to 1. Therefore, the right hand side of Eq(1) is is normalized through a normalization factor ℜ to make the left hand side as a probability distribution of Ψ. (2) (3) (4) According to Eq(4), to find out P (Ψ_i|Ψ_i−1, Φ_i) the (MaxEnt: Ω) parameter ζ_k is now the major concern as the feature parameter δ_k is already known from the training dataset. Based on the MEMM modeling the facial expression classes are considered as the states of MEMM. To define the facial expression class level, the probability of the defined class should be greater than other facial expression classes. Therefore, maximization of P (Ψ_i|Ψ_i−1, Φ_i) through parameter ζ is formulated as the following optimization problem Eq (5). (5) By assuming total D instances in training dataset and considering log likelihood probability, Eq (5) can be written as in Eq (6). (6) Afterwards, the regularization is used to penalize the large values of parameter ζ. (7) Here, the Gaussian distribution N(μ, σ²) of parameter ζ is used for regularization as shown in Eq (8). (8) As Eq (8) is a log − sum exponential equation, the popular Broyden Fletcher Goldfarb Shanno (BFGS) unconstrained optimization method is used to learn optimal weight parameter ζ of MEMM. The training process is explained in Algorithm 1.

Algorithm 1: MEMM learning (Ψ, Φ).

begin

Initialize S ← Ψ = {χ₁, χ₂, …, χ_M}

Randomly select a state χ_i

while S do

Find all pairs of state-observation (χ_i, φ_i)

Consider the selected χ_i as the state Ψ_i−1 in the determining

P(Ψ_i|Ψ_i−1, Φ_i)

Determine optimal weight parameter ζ from Eq (8) through L-BFGS optimization method to maximize the log likelihood probability

P(Ψ_i|Ψ_i−1, Φ_i)

S ← S\χ_i

Select a state χ_i from S

end

Generation of Expression State Sequence through Viterbi Algorithm.

Commonly, the Viterbi algorithm is applied in dynamic programming approach (such as finite state Markov process) in order to determine the most likely state sequence by analyzing the corresponding observation sequence. In this work, an improved Viterbi algorithm (as shown in Algorithm 2) is implemented to determine the most likely hidden expression state sequence from a sequence of observations Φ. As described before, extracted features from video frame at time τ_i is considered as observation φ_i.

The legacy Viterbi determines most likely hidden expression state sequence through initial, emission and transition probabilities i.e., P(χ_i), P(φ_τ|χ_i), and P(χ_i|χ_k) respectively. On the other hand, the modified Viterbi employs only a single function P(χ_i|χ_k), φ_τ. Hence, Eq (9) is is used to determine the Viterbi value η.

(9)

Here, state i lies in 1 ≤ k < M. However, P (χ_i|χ_k, φ_τ) is determined through Eq (3) using optimal parameter ζ from the trained system. In respect to observation Φ, the modified Viterbi returns a sequence of most likely expression states L = {l₁, l₂, …, l_p} ∈ Ψ. Finally, the predicted expression is inferred from the generated of most likely expression state sequence L of the overall expression state of ℑ duration.

Algorithm 2: Modified Viterbi (Ω, Ψ, Φ).

begin

M = |Ψ|

i = 1

while (i ≤ M) do

η₁(i) = P(χ_i|φ₁)

λ₁(i) = 0

i = i + 1

end

τ = 2

while(τ ≤ ℑ) do

i = 1

while(i ≤ Z) do

i = i + 1

end

τ = τ + 1

end

τ = ℑ − 1

while τ ≥ 1 do

τ = τ − 1

end

return L

end

Prediction of the Expression State.

The expression may vary in several video frames of ℑ duration. However, to define expression state of ℑ duration, the cardinality of each state within ℑ is determined. Different states cardinality i,e., |χ₁, χ₁, …, χ_M| is measured from L and the expression state with highest cardinality is defined as the predicted expression. Algorithm 3 shows stepwise procedure to predict expressions from generated expression states sequence.

Algorithm 3: Expression state prediction (Ω, Ψ, Φ, γ).

begin

L = Viterbi (Ω, Ψ, Φ)

M = |Ψ|

i = 1

while (i ≤ M) do

F_{χ_i} = 0

P = |L|

j = 1

while (j ≤ P) do

if χ_i = = l_j then

F_{χ_i} = F_{χ_i} + 1

end

|χ_i| = F_{χ_i}

end

i = 1

while (i ≤ M) do

if |χ_i| > γ₁ && {’Happy’}then

return χ_i

end

else if |χ_i| > γ₂ && {’Anger’}then

return

end

else if |χ_i| > γ₃ && {’Sad’}then

return

end

else if |χ_i| > γ₄ && {’Surprise’}then

return

end

else if |χ_i| > γ₅ && {’Fear’}then

return

end

else if |χ_i| > γ₆ && {’Disgust’}then

return

end

else

return

end

System Validation

Datasets Used

For performance evaluation, six publicly available standard datasets of facial expressions are used, which are as follows.

Extended Cohn-Kanade Dataset (CK+):
This dataset contains 593 videos sequences comprising seven facial expressions recorded by 123 subjects (university students) [34]. The subjects include majority of female students with age range from 18 to 30 years. Out of total 593 sequences, 309 sequences are used in this work. Out of seve, six expressions are used for evaluation. The size of each frame is 640×480 pixels in some images, and 640×490 pixels in others with 8-bits precision for gray-scale values. This dataset is publicly available and can be found using http://www.consortium.ri.cmu.edu/ckagree/. This dataset belongs to Carnegie Mellon University, USA.
Japanese Female Facial Expression (JAFFE) Dataset:
The expressions in this dataset were collected from 10 different (Japanese female) subjects [35]. Each image has been rated on six expression adjectives by 60 Japanese subjects. Most of the expression frames were taken from the front view of the camera with tied hair in order to expose the entire face. This dataset consists of 213 facial frames and has seven expressions, including the neutral expression. Out of these, 193 facial frames for six facial expressions are used. The size of each facial frame is 256×256 pixels. This dataset can be downloaded by using http://www.kasrl.org/jaffe.html. This dataset belongs to Ritsumeikan University, Kyoto, Japan.
Multimedia Understanding Group (MUG) Dataset:
In this dataset, 86 subjects performed six expressions with constant blue background with the frontal view of the camera [36]. Two light sources of 300W each, mounted on stands at a height of 130cm approximately were used. A predefined setup with the help of umbrella was utilized in order to diffuse light and avoid shadow. The images were captured at a rate of 19 frames per second. The original size of each image is 896×896 pixels. The dataset is available in http://mug.ee.auth.gr/fed/. This dataset belongs to Aristotle University of Thessaloniki, Thessaloniki, Greece.
USTC-NVIE spontaneous-based Dataset:
In USTC-NVIE dataset, an infrared thermal and a visible camera was used in order to collect both spontaneous and posed expressions, but in this work, we only utilize the spontaneous-based expressions [37]. There were a total 105 subjects. They performed a series of expressions with illumination from three different directions: front illumination, left illumination, and right illumination. Subjects’ age range was from 17 to 31 years. Some of them worn glasses, whereas others were without glasses. The size of each facial frame is 640×480 or 704×490 pixels. In total, 910 expression frames are utilized from this dataset. This facial expression dataset is publicly available in http://nvie.ustc.edu.cn/index.html. This dataset belongs to University of Science and Technology, Hefei, Anhui, P.R. China.
Indian Movie Face Database (IMFDB):
The IMFDB dataset was collected from Indian movies of different languages [38]. Most of the videos were collected from the last two decades which contain large diversity in illumination, and image resolution. In IMFDB, the subjects wore partial or full-makeup. The images are from frontal, left, right, up, and down views of camera. The dataset has basic six expressions captured from 67 male and 33 female actors of different age groups, such as children (1–12 years), young adults (13–30 years), middle aged (31–50 years), and elderly (Above 50 years) with at least 200 images from each actor. Some subjects wore glasses and had beard, ornaments, hair, hand, or none. In order to maintain consistency among the images, a heuristic method for cropping is applied, and all the images are manually selected and cropped from the video frames. The size of each image which we used for our experiments is 140×180 pixels. The dataset can be downloaded by using http://cvit.iiit.ac.in/projects/IMFDB/, which belongs to Indian Institute of Information Technology, Hyderabad, India.
Acted Facial Expressions in the Wild Database (AFEW):
AFEW dataset [39] is publicly available standard dataset that has been collected from movies in indoor and outdoor (real world) environments. The age range of the subjects were from 1-70 years. All the expression related information such as name, age, pose, gender, expression type, etc were stored in XML schema. Static Facial Expressions in the Wild (SFEW) has been developed by selecting frames from AFEW. The database covers unconstrained facial expressions, varied head poses, large age range, occlusions, varied focus, different resolution of face and close to real world illumination. Frames were extracted from AFEW sequences and labelled based on the label of the sequence. In total, SFEW contains 700 images and which include seven basic expressions happy, anger, sad, surprise, fear, disgust, and neutral. But, we have selected the six basic expressions excluding neutral for fair comparison. The AFEW dataset of facial expression can be downloaded by using https://cs.anu.edu.au/few/AFEW.html, and the dataset belongs to University of Miami, Florida, USA.

It should be noted that since each dataset contains different expressions, six common expressions among them are selected for this work. These are happy, anger, sad, surprise, fear, and disgust. Furthermore, the datasets contain a high degree of variability in terms of scale, pose, illumination, resolution, occlusion, makeup, age and other physical characteristics of the participants. It is this high degree of variance which usually results in degrading the performance of and FER system when tested for different datasets.

Experimental Setup

For a thorough validation, the following set of four experiments is performed, and all the experiments are performed in Matlab using an Intel Core^™ i7-6700 (3.4 GHz) with a RAM capacity of 16 GB.

In the first experiment, performance of the proposed model is analyzed on each dataset using a 10–fold cross-validation scheme. In other words, each dataset is divided into ten equal parts. From these, one is used for testing; whereas, the remaining nine are used for training the system.
In the second experiment, the robustness of the proposed model is assessed. For this experiment, out of six datasets, one dataset is used for training; whereas, the other five datasets are used for testing purpose. This process is repeated six times so that each dataset is used exactly once as the training dataset.
In the third experiment, the setup of the first experiment is repeated; however, the classification module, i.e., MEMM is replaced with HMM. The purpose is to evaluate the performance of the proposed classification model against the traditionally used model, i.e., HMM.
Finally, in the fourth experiment, the proposed FER system is compared against state-of-the-art systems for FER.

Results and Discussion

First Experiment

The overall results are shown in Table 1 and Fig 4 (using CK+ dataset), Table 2 and Fig 5 (using JAFFE dataset), Table 3 and Fig 6 (using MUG dataset), Table 4 and Fig 7 (using USTC-NVIE dataset), Table 5 and Fig 8 (using IMFDB dataset) and Table 6 and Fig 9 respectively.

Download:

Table 1. Recognition rate of the proposed FER system using CK+ dataset of facial expressions (Unit: %).

https://doi.org/10.1371/journal.pone.0162702.t001

Download:

Table 2. Recognition rate of the proposed FER system using JAFFE dataset of facial expressions (Unit: %).

https://doi.org/10.1371/journal.pone.0162702.t002

Download:

Table 3. Recognition rate of the proposed FER system using MUG dataset of facial expressions (Unit: %).

https://doi.org/10.1371/journal.pone.0162702.t003

Download:

Table 4. Recognition rate of the proposed FER system using USTC-NVIE dataset of facial expressions (Unit: %).

https://doi.org/10.1371/journal.pone.0162702.t004

Download:

Table 5. Recognition rate of the proposed FER system using IMFDB dataset of facial expressions (Unit: %).

https://doi.org/10.1371/journal.pone.0162702.t005

Download:

Table 6. Recognition rate of the proposed FER system using AFEW dataset of facial expressions (Unit: %).

https://doi.org/10.1371/journal.pone.0162702.t006