A model for inference of emotional state based on facial expressions

Non-verbal communication is of paramount importance in person-to-person interaction, as emotions are an integral part of human beings. A sociable robot should therefore display similar abilities as a way to interact seamlessly with the user. This work proposes a model for inference of conveyed emotion in real situations where a human is talking. It is based on the analysis of instantaneous emotion by Kalman filtering and the continuous movement of the emotional state over an Emotional Surface, resulting in evaluations similar to humans in conducted tests. A simulation-optimization heuristic for system tuning is described and allows easy adaptation to various facial expression analysis applications.

Introduction

Body language, prosody and intonation convey at least 65 % of the context information in a typical conversation [1]. Applications that strive to understand these communication modes and integrate them into human-machine interfaces are crucial to "user-centric experience" paradigms [2,3]. Although voice, face and gesture recognition are now used in video games and affective computing frameworks, the inference of emotional states remains an open problem.
It has been demonstrated that recognizing emotions is not easy, even for humans, who employ specialized brain subsystems for the task [4]. Multimodal studies have shown that humans correctly recognize the conveyed emotion expressed through speech in about 60 % of interactions. For facial recognition, the success rate rises to 70-98 % [2,5,6]. This paper focuses on emotion recognition based on facial expressions. State-of-the-art reviews of automatic facial expression detection techniques can be found in [7] and [8].
As an introductory case, consider the frames from a video shown in Fig. 1 and the outputs of the commercially available edition of eMotion [9], shown in Fig. 2.
From eMotion's output data in Fig. 2, it would be impossible for a human subject to make an educated guess about the expressed emotion. If classification were performed based solely on the highest mean value, the result would be Sadness. However, watching the video, even without sound, a human would easily choose Anger as the speaker's emotional state.
This work discusses a general model for the detection of emotional states and presents a model to detect the slow-dynamic emotions that constitute the perceived emotional state of the speaker. It is organized as follows: reference material is presented in Sect. 2; Sect. 3 presents the general model; Sect. 4 describes the specific proposed model, the Kalman filtering technique and the heuristics used for model tuning; Sect. 5 describes the proposed experiment and its results.

Fig. 1 From left to right, eMotion classified these frames as happiness (100 %), sadness (70 %), fear (83 %) and anger (76 %), respectively. Video s43_an_2 of the eNTERFACE'05 Audio-Visual Emotion Database [26]. Extracted from [28]

Fig. 2 Graphical representation of eMotion's output for the video of Fig. 1. eMotion analyses each video frame individually and outputs the estimated probability for each emotion category at that frame

Background
After decades of Behaviourism dominance in Psychology, Appraisal Theories have gained strength since the 1960s [10,11]. These theories postulate that emotions are elicited from appraisals. Emotions, according to appraisal theorists, may be defined as ". . . an episode of interrelated, synchronized changes in the states of all or most of the five organismic subsystems in response to the evaluation of an external or internal stimulus event as relevant to major concerns of the organism" [10]. Appraisals differ from person to person, but the appraisal processes are the same for all people. They therefore offer a model which justifies common behavior while, at the same time, allowing for individual differences. Among all such events, the conveyed emotion, as perceived in facial expressions, is the focus of this work.
In the 70's, Ekman and co-workers proposed the universality of facial expressions related to emotions [6]. Their thesis was based on a series of experiments with different cultures around the world. Most notable were the results obtained with pre-literate and culturally isolated tribes which were able to classify photos of facial expressions better than chance [6]. A sample of their work is shown in Table 1, giving support for the universality of recognition of emotions on faces.
The 30-year-long debate around the universality thesis, its acceptance and its implications is discussed in [5] and [12] (Table 1 is extracted from [5]). Ekman and Friesen also established the Facial Action Coding System (FACS), a seminal work for emotion recognition from faces, which decomposes the face into Action Units (AUs) and assembles them to characterize an emotional expression [13]. The universality thesis is strongly relevant to this work because it implies universality for the proposed model; the thesis, however, still receives criticism [14]. Recent approaches to computational facial expression analysis can be classified into two groups. The first comprises innovative techniques focusing on spatiotemporal features, usually employing HMM-based classifiers [15,16]; their recent popularity, due to the arrival of cheap 3D cameras, may lead to significant changes in this field. The second group consists of more traditional approaches: Haar-like and geometric features, polygonal and Bezier mesh fitting, Action Unit tracking and energy displacement maps [17-19]. The latter methods are currently employed in both academic and commercial developments, and the most recent proposals employ multimodal analysis of emotional states [20].
Among the second group's most mature solutions, we cite eMotion, developed at the Universiteit van Amsterdam [9], and FaceDetect, by the Fraunhofer Institute [21], both commercially available. Both software packages focus on detecting emotion in facial expressions from each video frame, and they show excellent results in posed, semi-static situations. However, during a conversation the face is distorted by speech in many ways, leading the algorithms to detect the conveyed emotion incorrectly. Moreover, lip movement during a conversation that resembles a smile, for instance, does not mean the speaker is happy. It may instead be an instantaneous emotion: the speaker saw something unrelated to the conversation, and that made him smile. There is a difference between the emotion momentarily expressed on the face and the overall emotional state of the speaker.

Overview of proposed model
The proposed model determines perceived emotion from instantaneous facial expressions based on the displacement of a particle over a surface, subject at every moment to velocity changes proportional to the current probability of each emotion. We propose calling this surface the "Dynamic Emotional Surface" (DES). Over the surface, attractors corresponding to each detectable emotion are placed. The particle moves freely over the DES; its velocity is at each instant proportional to the instantaneous emotions detected. The particle may also slide towards the neutral state, placed at the origin of the coordinate system, the point of minimum energy, or towards any other local minimum.
As input, the model takes emotion detections from video frames, as developed by many authors [7,8,22,23]. Any of these software packages for facial expression analysis can be taken as a "raw sensor" supplying the data to be processed by the proposed model. The data are processed by Kalman filtering, to remove noisy outputs, and by an integration phase over a Dynamic Emotional Surface (DES), as depicted in Fig. 3.
The raw signals related to each emotion are fed into low-pass filters so that both instantaneous marker expressions and spurious high-frequency variations are attenuated.
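To make the filtering idea concrete, the sketch below (illustrative only, not the authors' implementation; signal shapes and the smoothing constant are assumed) shows how a first-order low-pass filter attenuates a brief marker expression while preserving a slowly varying conveyed emotion:

```python
# Hypothetical sketch: a first-order IIR low-pass filter (exponential
# moving average) damps a 3-frame "marker" spike riding on a sustained
# emotion signal. All values are illustrative assumptions.

def low_pass(signal, alpha=0.1):
    """y[t] = y[t-1] + alpha * (x[t] - y[t-1]); smaller alpha = slower filter."""
    y, out = 0.0, []
    for x in signal:
        y += alpha * (x - y)
        out.append(y)
    return out

# Sustained Happiness at 0.8, with a 3-frame spike to 1.0 around frame 50.
raw = [0.8] * 100
for t in (50, 51, 52):
    raw[t] = 1.0

smooth = low_pass(raw)
# The spike moves the filtered signal far less than the 0.2 jump in the raw one.
overshoot = max(smooth[50:60]) - smooth[49]
```

With `alpha=0.1` the three-frame spike shifts the filtered output by only a fraction of the raw 0.2 jump, which is the behavior the slow-dynamic detector relies on.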
To illustrate this, consider a conversation with a friend: the overall conveyed emotion could be Happiness (the slow dynamic). But suddenly the speaker remembers someone he hates: Anger may be displayed as a marker expression. The event could also be external: the speaker may see someone doing something wrong and display Anger. In both cases, Anger is displayed as the fast dynamic, lasting no more than a couple of frames. For the listener, the appraisal process might lead to ignoring the Anger and continuing the conversation, or to changing the subject to investigate what caused this change in the speaker's face. The proposed model has been developed to detect the slow dynamic.

Fig. 4 An emotional curve. In this example the system detected an expression related to sadness, thus the particle has a V_sad component. The sliding velocity, V_slide, is proportional to the curve's steepness, tending to a stable point, normally the neutral emotion

Proposed model
As stated before, the perceived emotion is inferred from the displacement of a particle over a surface, subject at every moment to velocity changes proportional to the current probability of each emotion as detected by the raw sensors.
The instantaneous velocity of the particle is determined by Eq. (1):

V_p = V_s + Σ_a V_a  (1)

where V_p is the particle's velocity, V_s the sliding velocity, parallel to the DES' gradient at the current position, and V_a the velocity towards each attractor, always tangent to the DES. Consider, as an example, the two-dimensional case where the detectable emotions are Happiness and Sadness, shown in Fig. 4.
The example demonstrates some key aspects of the DES. The attractors for Happiness and Sadness are placed at (∞, 0) and (−∞, 0), respectively. When the raw sensor detects some probability or intensity of an emotion, this signal is interpreted as a velocity along the trajectory towards the corresponding attractor, and the particle moves along the emotional curve. In the absence of emotional facial expressions, the particle slides to the local minimum. In this example, one may infer the emotional state of the speaker by observing the position of the particle along the X axis. The DES concept extends this example by defining a surface, or even a hypersurface, over which attractors representing the modeled emotions are placed, together with the relationship between the particle's position and the emotional classification. The idea of an emotional surface, as shown in Fig. 5 [11,24], has been proposed by psychologists to discuss the trajectories of someone's internal (appraised) emotional state; in this paper, it is used to detect the overall perceived emotion during man-machine interaction.
The DES concept also differs from Zeeman's model by presenting the emotions as attractors positioned on the XY plane instead of attributing them to the axes themselves.
A DES in a 3D space is defined by Eq. (2), a surface over the XY plane:

z = S(x, y)  (2)
The velocity in the direction of each attractor, V_a, is proportional to the probability of each emotion as detected by existing software such as eMotion, and it is tangent to the surface. It is defined by Eq. (3):

V_a = F_a · û_a  (3)

where F_a is the filtered signal associated with the attractor's emotion and û_a is the unit vector, tangent to the DES, pointing towards the attractor.
It should be noted that the frame-by-frame approach used by the raw sensors does not take into account the continuous natural facial movements and the transitions between expressions. As shown in Fig. 3, a filtering process is applied to raw sensor outputs prior to DES calculations.
The analysis of multimodal realistic videos must account for different noise sources in the process and its observation. Unexpected camera and head motions, face deformation due to speech, CCD performance and minor light source variations result in intrinsically noisy data. Besides, low-pass filtering is necessary because the slow conveyed emotions are to be detected. Both Kalman filtering and moving-average filtering were tested, as presented in Sect. 5.3.
Given these requirements, a Kalman filter is a natural candidate. Kalman filtering is a well-established technique for linear systems subject to zero-mean Gaussian noise in both the process and the sensory acquisition. There is no empirical evidence supporting these hypotheses for the problem of emotional expression analysis; however, given the complexity and apparent randomness of head movements, of muscular facial deformations due to speech, and of light variations in the scene, the Gaussian assumption was adopted, with the central limit theorem as its rationale. Filter convergence during the experiments gave further support to this assumption.
The use of Kalman filters requires selecting underlying linear models for the update phase. It is proposed that a well-tuned first-order system, as in Eqs. (4) and (5), doubles as the filter's internal update mechanism and as the low-pass filter:

τ ẋ_s(t) + x_s(t) = u(t)  (4)
y(t) = K x_s(t)  (5)

with u(t) the raw input signal. The filtering output for each emotion, denoted F_a, is used in Eq. (3).
where K is the system's gain, τ the system's time constant, x_s the state variable, and y the filter output. The Kalman filtering equations are thus written as follows.

Predict: the state and its covariance are propagated through the linear model, where x_{s,t} is the current value of the state variable, x_{s,t−1} its estimate at the previous instant, w the covariance of the process noise, N(0, w), and p the covariance of x_t, N(0, p).

Update: the prediction is corrected with the current reading, where m is the residual covariance, v the covariance of the observation noise, N(0, v), r_t the current reading from the facial expression analysis software, and y_t the current filter output.

The estimation process has two steps. First, the filter runs the prediction using a proper time step. If there is a raw sensor reading for that timestamp, it runs the update phase. Note that the state variable x_s is only an internal calculated value: the proposed filtering relies solely on readings from the facial expression analysis software to estimate the internal state of the system.
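A scalar Kalman filter of this kind can be sketched as follows. This is a hedged illustration, not the authors' exact formulation: the exponential discretization of the first-order model and all numeric parameter values (K, τ, w, v, the frame rate, and the readings) are assumptions:

```python
# Minimal scalar Kalman filter sketch: a first-order linear model
# (gain K, time constant TAU) drives the predict step; each raw-sensor
# reading r_t corrects the state. All numeric values are illustrative.

import math

K = 1.0        # system gain (assumed)
TAU = 0.5      # time constant in seconds (assumed)
DT = 1.0 / 25  # one video frame at 25 fps (assumed)
W = 1e-3       # process-noise covariance (tuned by SA in the paper)
V = 1e-1       # observation-noise covariance (tuned by SA in the paper)

a = math.exp(-DT / TAU)  # assumed discretization of the first-order dynamics

def predict(x_s, p):
    """Propagate the state and its covariance through the linear model."""
    return a * x_s, a * a * p + W

def update(x_s, p, r_t):
    """Correct the state with a raw-sensor reading r_t."""
    m = K * K * p + V      # residual covariance
    g = K * p / m          # Kalman gain
    x_s = x_s + g * (r_t - K * x_s)
    p = (1 - g * K) * p
    return x_s, p

x_s, p = 0.0, 1.0
readings = [0.2, 0.9, 0.1, 0.85, 0.8, 0.9, 0.15, 0.95]  # noisy emotion signal
for r_t in readings:
    x_s, p = predict(x_s, p)
    x_s, p = update(x_s, p, r_t)
y = K * x_s  # filter output, the F_a fed into Eq. (3)
```

If a frame arrives without a usable reading, only `predict` runs for that timestamp, matching the two-step process described above.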
Lastly, we propose a simulation-optimization heuristic to tune the filters' w and v parameters. It employs Simulated Annealing (SA) to find a set of parameters that minimizes an energy function related to the classification error. The simulation phase comprises a round of video analysis with the currently proposed parameters and is used to calculate a global energy value; the optimization phase is discussed next.
Vectors for the process noise (Q_n) and the observation noise (R_n) are defined, along with a starting temperature (T_0) and a cooling constant K_t < 1. The process iterates until the system's temperature matches room temperature (T_room). One may calculate the number of steps using Eq. (14):

n = ⌈ log(T_room / T_0) / log(K_t) ⌉  (14)

For each video, the emotional particle's trajectory is divided into two halves. The energy (E_i) is calculated as the number of points in the latter half that fall outside the sector of the video's nominal classification. A global energy measure is defined by Eq. (15):

E_global = Σ_i E_i  (15)
The system then randomly generates neighbor parameter vectors Q_{n+1} and R_{n+1}. It reanalyzes the tuning videos and obtains E_global,n+1. The probability of accepting the new parameters as the solution is given by the Metropolis criterion: accept with probability 1 if ΔE ≤ 0, and with probability exp(−ΔE/T_n) otherwise, where ΔE = E_global,n+1 − E_global,n. These steps are summarized in Algorithm 1.
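The tuning loop can be sketched as below. This is a toy stand-in: the real simulation phase re-analyzes the tuning videos to obtain E_global, whereas here a simple two-parameter bowl plays that role, and the temperatures, cooling constant and perturbation range are assumed values:

```python
# Sketch of the simulated-annealing tuning loop: geometric cooling from T0
# to T_ROOM with the Metropolis acceptance criterion. The energy function
# is a toy stand-in for re-analyzing the tuning videos.

import math, random

random.seed(0)

T0, T_ROOM, K_T = 10.0, 0.1, 0.99  # assumed temperatures and cooling constant

def energy(params):
    """Stand-in for E_global: here, a bowl with its minimum at (2, 3)."""
    w, v = params
    return (w - 2.0) ** 2 + (v - 3.0) ** 2

def neighbor(params):
    """Randomly perturb the (w, v) parameter vector."""
    return tuple(p + random.uniform(-0.5, 0.5) for p in params)

params = (10.0, 10.0)
e = energy(params)
t, steps = T0, 0
while t > T_ROOM:
    cand = neighbor(params)
    e_new = energy(cand)
    # Metropolis criterion: always accept improvements; accept worse
    # candidates with probability exp(-(e_new - e) / t).
    if e_new < e or random.random() < math.exp(-(e_new - e) / t):
        params, e = cand, e_new
    t *= K_T
    steps += 1

# Number of cooling steps matches Eq. (14).
expected = math.ceil(math.log(T_ROOM / T0) / math.log(K_T))
```

The loop count equals the ceiling in Eq. (14), since the temperature follows T_n = T_0 · K_t^n.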

Experiments
Experiments were conducted to test the proposed model for the detection of the slow emotional dynamic.

Corpus selection
Selecting videos for emotion inference experiments presents some challenges: the videos must respect the conditions imposed by the raw sensors, such as lighting, head positioning, duration and resolution, and they must also contain expressions displayed in a natural way. Additionally, they must be generally available, so that further research may reproduce and compare results. The eNTERFACE'05 Audio-Visual Emotion Database [25] was selected as the baseline corpus for both the research on emotion inference from facial expressions and multimodal inference [26]. This database consists of volunteers acting in a series of short scenes, expressing emotions through facial expressions, speech and vocalization. The volunteers are not professional actors and, as will be demonstrated, there are some cases where it is not possible to classify the conveyed emotion based solely on the facial expressions. Therefore, an initial experiment was conducted to select viable videos.

Algorithm 1 Simulation-optimization algorithm for tuning the filters' parameters
A set of 50 videos from the eNTERFACE'05 Audio-Visual Emotion Database was selected. These videos were presented twice, one at a time, without sound, to 17 undergraduate subjects from the Mechatronics course. The students were given a multiple-choice form in which they were asked to classify each video as Happiness, Sadness, Anger or Fear, leaving no blanks. This methodology differs from [27] and [28], where the videos were chosen by the

Table 2 Human classification for videos classified as happiness

Data acquisition
This section describes the data acquisition specifically related to the eMotion software. The process starts by splitting the selected videos, according to the criteria in Sect. 5.1, into two groups: one for system tuning and one for testing. Each video was submitted sequentially to the eMotion software, and control points for mesh adjustment were selected. After mesh fitting, each video was played back to verify that the mesh remained attached to the face's control points throughout the whole video. In case of abnormal mesh deformation, the current analysis was discarded and the operator returned to the mesh-fitting step. The output data for each video were collected in a separate CSV dump file containing frame-by-frame values.

Filter selection
The results of Kalman filtering and moving-average (window size of 20 frames) for the example video (sample frames in Fig. 1 and raw sensor output on Fig. 2) are shown in Fig. 7.
As can be seen from Table 6, the overall emotion conveyed by the video, Anger, was correctly detected with Kalman filtering, although with a large standard deviation. Kalman filtering was therefore selected for the automatic classification.

DES selection
A paraboloid with the parameters shown in Eq. (17) and attractors placed as in Table 7 was chosen as the DES.
One may note that the Fear attractor was placed in the fourth quadrant, which is not its usual position in the Arousal-Valence space. In fact, the placement of the attractors is arbitrary and depends on the DES, the phenomena to be modeled and how the classifying function is defined. The paraboloid DES was used to model "reasonable" social displays of emotion, and the particle's position is associated with an attractor when both lie in the same quadrant. It also allows the simplifications that follow.
Considering P as the particle's current position and A as the position of the attractor (emotion), their distance can be calculated as in Eq. (18):

d = ‖A − P‖  (18)

Defining the ratio r as in Eq. (19),

r = a_py / a_px,  a_px ≠ 0  (19)

the DES S(x, y) may be written, along the vertical section y = r·x, as a function of the single variable x:

S(x) = (a_1 + a_2 r²) x²  (20)

The particle's velocity towards the attractor is then calculated as

V_a = F_a · (1, r, 2(a_1 + a_2 r²) P_x) / √(1 + r² + [2(a_1 + a_2 r²) P_x]²)  (21)

Figure 8 shows the XY projection of the emotional particle's trajectory for the example video (all frames).
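The attractor-velocity computation of Eqs. (19)-(21) can be sketched directly. The paraboloid coefficients and the sample attractor position below are assumed values for illustration; note that the resulting vector's magnitude equals the filtered signal F_a, since the direction is normalized:

```python
# Sketch of the attractor-velocity computation for a paraboloid DES
# S(x, y) = A1*x^2 + A2*y^2. A1, A2 and the attractor position are assumed.

import math

A1, A2 = 0.1, 0.1  # paraboloid coefficients (assumed)

def attractor_velocity(f_a, p_x, attractor):
    """Velocity component toward `attractor`, tangent to the DES,
    scaled by the filtered emotion signal f_a."""
    a_px, a_py = attractor
    r = a_py / a_px                    # ratio of Eq. (19); requires a_px != 0
    c = 2 * (A1 + A2 * r * r) * p_x    # DES slope along the section y = r*x
    direction = (1.0, r, c)
    norm = math.sqrt(1 + r * r + c * c)
    return tuple(f_a * d / norm for d in direction)

v = attractor_velocity(f_a=0.8, p_x=1.0, attractor=(5.0, 5.0))
```

Because the tangent direction is normalized, the particle's speed toward the attractor is exactly the filtered intensity `f_a`, which keeps the raw-sensor scale meaningful on the surface.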
The XY projection of the emotional particle's trajectory for the example video shows the particle moving within the second quadrant; the emotional state of the speaker is therefore described as Anger. This inference matches the human observation; see Table 10, video "s43_an_2".
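The quadrant-based classifying function used with this DES can be sketched as below. The Anger (second quadrant) and Fear (fourth quadrant) placements follow the text; the Happiness and Sadness quadrants are assumptions for illustration, as the full Table 7 layout is not reproduced here:

```python
# Minimal sketch of the quadrant-based classifying function: the particle's
# (x, y) position maps to the emotion whose attractor shares its quadrant.
# Anger and Fear placements follow the text; the others are assumed.

def classify(x, y):
    if x > 0 and y > 0:
        return "Happiness"  # first quadrant (assumed)
    if x < 0 and y > 0:
        return "Anger"      # second quadrant, per the example video
    if x < 0 and y < 0:
        return "Sadness"    # third quadrant (assumed)
    if x > 0 and y < 0:
        return "Fear"       # fourth quadrant, as stated in the text
    return "Neutral"        # on an axis or at the origin

label = classify(-0.7, 0.4)  # a particle in the second quadrant
```

For the example video, the trajectory's points lie in the second quadrant, so this function would return Anger, in agreement with the human classification.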

Tuning Kalman filters
The 31 valid videos were split into two groups: 16 videos for Kalman filter tuning and 15 for testing the proposed model. Based on previous experience in system tuning [27,28], the system gain and time constant of the underlying linear models were fixed for all four filters. Algorithm 1 was used to calibrate the w and v parameters. The initial w and v were chosen randomly from a uniform distribution over the interval [0.001, 1000]. These conditions lead to 11,041 iterations. Tuning was repeated for 18 runs, looking for convergence to a minimum. The results are presented in Table 8.
The graph in Fig. 9 represents all accepted solutions during the simulation-optimization process, which resulted in a minimum energy of 447. The resulting parameters are presented in Table 9, along with the fixed gains and time constants.

Automatic classification
The 15 remaining videos, i.e., those not used for adjusting the Kalman filter, were then submitted to the system, yielding the results shown in Table 10.
The XY projection for the misclassified file s43_sa_5 is shown in Fig. 10.

Fig. 10 Emotional trajectory for file "s43_sa_5". Note that the particle oscillates inside the second quadrant, yielding the classification Anger; the correct classification is Sadness

Conclusions
A reference model for the recognition of emotions on faces has been introduced, as well as a computational model to detect slowly conveyed emotions and to infer the speaker's overall emotional state. In the conducted tests, the model produced classifications similar to those of human observers.
The proposed architecture allows these techniques to be integrated, with minimal changes, with almost any facial expression analysis software available. The proposed simulation-optimization heuristic provides automatic configuration and system tuning. Although recent techniques employ spatiotemporal features, they could still benefit from the proposed model to infer the overall perceived emotion in natural interactions.
In future work we plan to test the model for fast emotions. The main obstacle we foresee is the lack of a corpus for this kind of test. Finally, we plan to apply the proposed model in a multimodal inference engine, as proposed in [28].