Design and Implementation of HMM for 3D Emotion Recognition

Abstract—Facial expression is one of the most useful sources of information in human-robot interaction. To improve the accuracy of 3D-based facial expression recognition, Hidden Markov Models (HMMs) are used in this study to recognize emotions from facial expressions. In particular, facial expressions are measured by two parameters, which are given by previous work. The human emotions are defined as anger, smile, normal, sadness, fear, and surprise. The referred parts of the human face are selected based on their activeness during facial expression. The activity and arousal values of each facial part are used as the observations for each hidden state in the HMMs. The Baum-Welch algorithm is used to train the hidden Markov models. As a result, six different emotions are recognized efficiently through the trained HMMs.


I. INTRODUCTION
Facial expression is one of the most powerful, natural, and immediate means for human beings to communicate their emotions and intentions [1]. Understanding human emotion is an essential issue for intelligent robots and systems that aim to communicate effectively with humans while providing desirable services.
In past research on facial expression recognition, Ekman and Friesen [2] developed the Facial Action Coding System (FACS) to describe facial expressions. They separated the facial expression into an upper face and a lower face. Ying-li and her coworkers [1] used a neural network to build another feature-based automatic face analysis system based on Ekman's study. Furthermore, Hong-Wei Ng [5] used a deep convolutional neural network for facial expression recognition. These studies have found that neural networks are effective in emotion recognition. However, most of the above systems predict a facial expression from a single image frame. Instead, emotions are more often shown as a time sequence of facial expressions. It is therefore important to develop an emotion recognition model based on a time-sequence database to increase the system accuracy.
This paper follows from the previous work [4], in which the Kinect sensor is used to capture human facial expression data.
The activity and arousal [3] values of each facial part are given by [4] and serve as the observations in Hidden Markov Models (HMMs). The Baum-Welch algorithm is used as the training approach for the HMMs.

A. Surface Common Feature
In machine learning, logical and meaningful data are very important. Raw 3D point cloud data contain only 3D spatial position information, which means the point cloud alone does not carry much meaning for machine learning. To mine valid features, a method named the surface-common-feature (SCF) map [3] is used to represent the point cloud data. This method is based on surface normal vectors. The SCF image is shown in Fig. 1. This method is more practical than other approaches under inadequate illumination.

B. Autoencoder
An autoencoder is a widely used machine learning approach. It can obtain the initial parameters of a neural layer by reconstructing the input feature itself. Suppose there is an input sequence {x_i}, i = 1, ..., n; an encoder maps each input to a hidden representation {y_i}, i = 1, ..., n, e.g. y = s(Wx + b), where s is a nonlinear activation function (such as the sigmoid function).
After that, y is mapped onto a reconstruction z of the same shape as x: z = s(W′y + b′). The autoencoder is trained to minimize the reconstruction error, such as the squared error L(x, z) = ||x − z||^2.
In this research, we used an autoencoder as a pre-training method for the deep neural network.
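As an illustration of the pre-training idea, a minimal numpy sketch of an autoencoder trained by gradient descent on the squared reconstruction error could look as follows. The layer sizes, learning rate, and toy data are assumptions for illustration, not the configuration used in this work:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy data standing in for per-part facial features (values in [0, 1)).
X = rng.random((100, 8))

n_in, n_hidden = 8, 3                                # hypothetical sizes
W = rng.normal(scale=0.1, size=(n_in, n_hidden))     # encoder weights
b = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_hidden, n_in))    # decoder weights
b2 = np.zeros(n_in)

def reconstruct(X):
    # z = s(W'y + b'), with y = s(Wx + b)
    return sigmoid(sigmoid(X @ W + b) @ W2 + b2)

loss_before = np.mean((reconstruct(X) - X) ** 2)

lr = 0.5
for _ in range(2000):
    Y = sigmoid(X @ W + b)                  # encode
    Z = sigmoid(Y @ W2 + b2)                # decode
    err = Z - X                             # gradient of the squared error w.r.t. Z
    dZ = err * Z * (1 - Z)                  # backprop through the sigmoids
    dY = (dZ @ W2.T) * Y * (1 - Y)
    W2 -= lr * (Y.T @ dZ) / len(X)
    b2 -= lr * dZ.mean(axis=0)
    W -= lr * (X.T @ dY) / len(X)
    b -= lr * dY.mean(axis=0)

loss_after = np.mean((reconstruct(X) - X) ** 2)
```

After training, the encoder weights W and b can serve as the initial parameters of the corresponding network layer.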

C. Convolutional Neural Network
As a well-known deep learning model, a convolutional neural network (CNN) has a layered structure with at least three kinds of layers:
a) Convolutional layer: the core building block of a CNN, consisting of a grid of neurons.
b) Pooling layer: another important component of the CNN, a form of nonlinear down-sampling.
c) Fully-connected layer: always appears after several convolutional and pooling layers; the high-level reasoning in the network is performed via the fully connected layers.
For instance, convolving a W × W pixel image with an H × H pixel filter (valid convolution, stride 1) yields an output of size (W − H + 1) × (W − H + 1). We extract the activity and arousal parameters from the surface-common-feature maps with a convolutional neural network, and then use these two parameters as the observations of the Hidden Markov Models.
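The output-size rule above can be checked with a small numpy sketch of a valid convolution (strictly a cross-correlation, as is conventional in CNN implementations; the 6 × 6 image and 3 × 3 averaging filter are illustrative assumptions):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Valid 2D convolution: a W x W image with an H x H filter
    yields a (W - H + 1) x (W - H + 1) output."""
    W, H = image.shape[0], kernel.shape[0]
    out = np.zeros((W - H + 1, W - H + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Sum of the element-wise product over the H x H window.
            out[i, j] = np.sum(image[i:i + H, j:j + H] * kernel)
    return out

img = np.arange(36.0).reshape(6, 6)   # W = 6
k = np.ones((3, 3)) / 9.0             # H = 3 averaging filter
feat = conv2d_valid(img, k)           # feat has shape (4, 4)
```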

A. Data Obtaining
The processing proceeds as follows. In previous work, the Kinect sensor was used to capture point cloud data of facial expressions. Surface-common-feature maps are extracted from each facial expression. Seven parts of the human face are selected based on their activeness during facial expression, such as the left eye, right eye, nose, upper lip, lower lip, and cheeks. The filters for each part are pre-trained through an autoencoder.
Then, the CNN is used to obtain their activity and arousal parameters from feature maps. The activity and arousal parameters are defined as observations of HMMs.
Considering the resolution of the data, the distance D between the camera and the human face is fixed at 90 cm. The face is segmented into 7 parts: nose, left eye, right eye, left cheek, right cheek, left mouth corner, and right mouth corner. The capture range of each part is defined as 32 × 26 pixels.
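The segmentation step amounts to cropping a fixed-size patch around each facial part. A minimal sketch is given below; the map resolution and the (row, col) centers of the parts are hypothetical placeholders, since the actual coordinates come from the segmentation in the original work:

```python
import numpy as np

# Hypothetical (row, col) centers of the seven facial parts in a feature map;
# the real coordinates are produced by the segmentation step of the original work.
PART_CENTERS = {
    "nose": (120, 160), "left_eye": (90, 120), "right_eye": (90, 200),
    "left_cheek": (140, 110), "right_cheek": (140, 210),
    "left_mouth_corner": (170, 130), "right_mouth_corner": (170, 190),
}
PATCH_H, PATCH_W = 32, 26   # capture range per part

def crop_parts(face_map):
    """Crop a fixed 32 x 26 patch centered on each facial part."""
    patches = {}
    for name, (r, c) in PART_CENTERS.items():
        r0 = r - PATCH_H // 2
        c0 = c - PATCH_W // 2
        patches[name] = face_map[r0:r0 + PATCH_H, c0:c0 + PATCH_W]
    return patches

face = np.zeros((240, 320))   # stand-in for a captured feature map
parts = crop_parts(face)      # seven 32 x 26 patches
```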

B. Hidden Markov Model
In this paper, the face expression based emotion recognition task is formulated as a sequence recognition problem.
The HMMs are set up as two Markov chains: one uses activity values as observations and the other uses arousal values. The whole architecture of the HMMs is shown in Fig. 2.
The necessary parameters of the HMMs are as follows: A = {a_ij}, where a_ij is the transition probability from state s_i to state s_j; B, the observation probability distribution; and π, the initial state distribution. We treat the whole dataset as a concatenated timing sequence of frames. The frame sequence is represented as Y = {y_1, y_2, ..., y_T}, where y_i corresponds to the feature of the ith frame. The length of the sequence is T = 10.
The emotional activity and arousal values of the facial expression are treated as being generated sequentially from a Markov process that transitions between the states S = {s_1, s_2, ..., s_6}.
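The parameter set λ = (A, B, π) described above can be represented concretely as numpy arrays; the sketch below also samples one T = 10 observation sequence from the model. The number of discrete observation symbols and the random initialization are illustrative assumptions:

```python
import numpy as np

n_states = 6   # six emotions: anger, smile, normal, sadness, fear, surprise
n_obs = 4      # hypothetical number of discrete observation symbols

rng = np.random.default_rng(1)

def row_stochastic(M):
    # Normalize each row so it is a valid probability distribution.
    return M / M.sum(axis=1, keepdims=True)

A = row_stochastic(rng.random((n_states, n_states)))  # transitions a_ij
B = row_stochastic(rng.random((n_states, n_obs)))     # observation probabilities
pi = np.full(n_states, 1.0 / n_states)                # uniform initial distribution

def sample_sequence(A, B, pi, T=10):
    """Generate T hidden states and observations from the Markov process."""
    states, obs = [], []
    s = rng.choice(n_states, p=pi)
    for _ in range(T):
        states.append(s)
        obs.append(rng.choice(n_obs, p=B[s]))
        s = rng.choice(n_states, p=A[s])
    return states, obs

states, obs = sample_sequence(A, B, pi, T=10)
```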

C. Training Approach
The Baum-Welch algorithm is used to infer the unknown parameters of the HMMs. The training procedure is as follows.

1) First, the emotional activity values of all facial parts are combined into one value by a Bayesian network. The emotional arousal values are handled in the same way.
2) Then, 10 sequential emotional activity values and 10 sequential emotional arousal values are fed into the two Markov chains as observations.
3) Calculate the temporary variables according to Bayes' theorem. Given the observed sequence Y and the parameters θ, the probability of being in state i at time t is

γ_t(i) = P(X_t = s_i | Y, θ) = α_i(t)β_i(t) / Σ_{j=1}^{N} α_j(t)β_j(t),

where α_i(t) is the forward message (the probability of the observations up to time t, ending in state i) and β_i(t) is the backward message (the probability of the observations after time t, given state i at time t). Likewise, the probability of being in states i and j at times t and t+1, respectively, is

ξ_t(i, j) = P(X_t = s_i, X_{t+1} = s_j | Y, θ) = α_i(t) a_ij β_j(t+1) b_j(y_{t+1}) / Σ_{k=1}^{N} Σ_{l=1}^{N} α_k(t) a_kl β_l(t+1) b_l(y_{t+1}),

where b_j(y_{t+1}) is the probability of observing symbol y_{t+1} in state j.

The trained HMMs reached a convergence region, which confirmed that the training was successful. The trained HMM parameters are shown in Eqs. 14 and 15.

In this paper, an emotion recognition system based on Hidden Markov Models is designed to recognize six different emotions, using the Baum-Welch algorithm as the training method. As a result, the system recognizes the six emotions successfully; for example, the smile sample is decoded as {'Normal,' 'Happy,' 'Happy,' 'Happy,' 'Happy,' 'Happy,' 'Happy,' 'Happy,' 'Happy,' 'Happy'}.
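For reference, the E-step that computes γ and ξ from scaled forward and backward messages can be sketched in numpy as follows. This is a minimal discrete-observation sketch, not the exact implementation used in this work; the toy two-state model and the observation sequence are assumptions:

```python
import numpy as np

def forward_backward(Y, A, B, pi):
    """Scaled forward/backward messages for a discrete-observation HMM."""
    T, N = len(Y), A.shape[0]
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    c = np.zeros(T)                               # scaling factors (avoid underflow)
    alpha[0] = pi * B[:, Y[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, Y[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = (A @ (B[:, Y[t + 1]] * beta[t + 1])) / c[t + 1]
    return alpha, beta

def e_step(Y, A, B, pi):
    """Temporary variables of Baum-Welch: gamma_t(i) and xi_t(i, j)."""
    alpha, beta = forward_backward(Y, A, B, pi)
    T, N = len(Y), A.shape[0]
    # gamma_t(i) = P(X_t = i | Y, theta)
    gamma = alpha * beta
    gamma /= gamma.sum(axis=1, keepdims=True)
    # xi_t(i, j) = P(X_t = i, X_{t+1} = j | Y, theta)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, Y[t + 1]] * beta[t + 1])[None, :]
        xi[t] /= xi[t].sum()
    return gamma, xi

# Toy two-state, two-symbol model with a T = 10 observation sequence.
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
Y = [0, 1, 0, 0, 1, 1, 0, 1, 0, 0]
gamma, xi = e_step(Y, A, B, pi)
```

In the M-step, γ and ξ are accumulated to re-estimate π, A, and B, and the two steps alternate until convergence.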
In future work, we will optimize the observation probability used to combine the six emotional values into one observation, and we will expand the database by increasing the amount of data.