Towards Efficient Multi-Modal Emotion Recognition

The paper presents a multi-modal emotion recognition system exploiting audio and video (i.e., facial expression) information. The system first processes both sources of information individually to produce corresponding matching scores and then combines the computed matching scores to obtain a classification decision. For the video part of the system, a novel approach to emotion recognition, relying on image-set matching, is developed. The proposed approach avoids the need for detecting and tracking specific facial landmarks throughout the given video sequence, which represents a common source of error in video-based emotion recognition systems, and therefore adds robustness to the video processing chain. The audio part of the system, on the other hand, relies on utterance-specific Gaussian Mixture Models (GMMs) adapted from a Universal Background Model (UBM) via maximum a posteriori (MAP) estimation. It improves upon the standard UBM-MAP procedure by exploiting gender information when building the utterance-specific GMMs, thus ensuring enhanced emotion recognition performance. Both uni-modal parts as well as the combined system are assessed on the challenging multi-modal eNTERFACE'05 corpus with highly encouraging results. The developed system represents a feasible solution to emotion recognition that can easily be integrated into various systems, such as humanoid robots, smart surveillance systems and the like.


Introduction
Augmenting humanoid robotic systems with emotion recognition capabilities has recently attracted a lot of attention from both the speech and computer vision communities. This increased attention has resulted in a plethora of emotion recognition methods that can be found in the literature.
In this paper we build upon our work presented in [1,2] and present a novel multi-modal emotion recognition system exploiting video (i.e., facial expression) and audio information. The proposed system processes each source of information separately and then combines the results at the matching score level. Both the video- and audio-processing parts of the system are implemented using novel approaches that improve upon existing methods from the literature.
Existing video-based emotion recognition techniques, for example, are typically grouped into [3]:
- feature-based techniques that detect and track specific facial features, such as the corners of the mouth or eyebrows, and use the obtained information to conduct emotion recognition, and
- region-based approaches, where facial motion is first measured on certain regions of the face, such as the eye or mouth region, and then exploited for emotion recognition.
Both types of methods require the detection and tracking of specific facial landmarks throughout the entire length of the image or video sequence and are, due to the difficulty of this task, prone to error [1]. In this paper, we take a fundamentally different approach and develop a novel method for emotion recognition from video data that adopts matching of image sets [4,5]. With the proposed approach, no tracking of individual facial landmarks is needed. Instead, the procedure relies solely on the facial region as a whole, which can be robustly and efficiently extracted from video data using existing face detection techniques, for example, the Viola-Jones face detector [6].
As with the video modality, numerous techniques for audio-based emotion recognition can also be found in the literature. Here, the techniques differ mainly in terms of the modeling approach used to represent the given audio features. Schuller et al. [7] classify the existing techniques into two classes:
- frame-level modeling techniques, which build statistical models of feature vectors extracted from overlapping frames of a given utterance, and
- supra-segmental modeling techniques, where a number of statistical functionals are applied to the frame-level features of one utterance, yielding a single high-dimensional feature vector per utterance.
The low-level acoustic features for both types of modeling techniques typically consist of spectral, prosodic and voice quality features [7,8]. Although the final recognition performance for both types of modeling techniques depends heavily on the classification method adopted, various group evaluations have shown that both are capable of yielding state-of-the-art recognition results for the task of emotion recognition from audio data [9,10].
The contribution of this paper with respect to audio-based emotion recognition stems from an improved approach to frame-level modeling, which relies on Gaussian Mixture Models (GMMs). While in the commonly used approach a single Universal Background Model (UBM) is built first from the extracted acoustic feature vectors and then adapted by Maximum A Posteriori (MAP) adaptation, this paper examines the possibility of decoupling emotion-specific information from other paralinguistic cues that exist in the UBM-MAP derived GMMs. We show that by exploiting gender information, we consistently improve upon the recognition performance of the standard UBM-MAP technique.
We assess both proposed uni-modal approaches using the eNTERFACE'05 [11] database, which is one of the few freely available databases containing multi-modal recordings of various types of emotions. Additionally, we study the applicability of several fusion schemes to further improve upon the results obtained with the individual modalities. Our results show that both uni-modal approaches as well as the proposed combined system compare favorably with state-of-the-art techniques from the literature [9,12,13,14].
The rest of the paper is structured as follows. In Section 2 we elaborate on the proposed multi-modal emotion recognition system and describe in detail the video and audio processing parts of the system as well as the fusion schemes used to combine both parts into a coherent system. In Section 3 we present the experimental database on which our system was evaluated and the main findings of the paper. We conclude the paper with some final comments in Section 4.

Overview
The multi-modal emotion recognition system introduced in this paper consists of an audio and a video sub-system. Figure 1 presents the basic structure of the system. Each sub-system first processes its corresponding input and then produces a matching score. The two scores are then fused at the matching-score level to allow for a reliable classification decision.
The video sub-system comprises:
- a face detection module that detects the facial region throughout the given video sequence,
- a subspace creation module, which constructs a subspace from the extracted facial images to encode the emotional state, and
- a matching module that compares the subspace constructed from the video sequence to the prototypical subspaces of the emotional classes using canonical correlations.
Similarly, the audio sub-system comprises:
- a feature extraction and modeling module that calculates the feature vectors from each sample recording and creates a probabilistic model from the computed feature vectors, and
- a matching module, which produces the scores based on the support vector models of each class.
A detailed description of all system parts is presented in the remainder of the paper.

The video sub-system
This section introduces our approach to emotion recognition from video data. It presents all the procedural steps that need to be taken to achieve reliable emotion recognition using holistic (appearance-based) techniques applied to image sets.

Face detection
The first procedural step required for building an emotion recognition system based on video data is the extraction of the region of interest from each frame of a given video sequence.
As our video sub-system relies on facial expression analysis, we adopt the established Viola-Jones face detector [6] for this purpose and employ it to detect the boundaries of facial regions in each frame of the currently processed video. Once the entire video sequence is processed, we resize the detected regions to a fixed size of 64×64 pixels and finally photometrically normalize them using histogram equalization. The result of the described procedure is a set of facial images as shown in Figure 2.
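The photometric normalization step can be illustrated with a minimal sketch of histogram equalization. It assumes an 8-bit grayscale face crop stored as a list of rows of integers; the function name `equalize_histogram` is hypothetical, and face detection and resizing to 64×64 pixels are assumed to have been done beforehand.

```python
def equalize_histogram(image, levels=256):
    """Histogram-equalize a grayscale image given as a list of rows of ints."""
    pixels = [p for row in image for p in row]
    total = len(pixels)

    # Histogram of intensity values.
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of the intensities.
    cdf = []
    running = 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    if total == cdf_min:
        # Constant image: there is nothing to equalize.
        return [row[:] for row in image]

    # Look-up table that stretches the cumulative distribution to [0, levels - 1].
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [[lut[p] for p in row] for row in image]


normalized = equalize_histogram([[50, 50], [100, 200]])
```

After equalization the darkest occupied intensity maps to 0 and the brightest to 255, so crops recorded under different lighting end up with comparable intensity ranges.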
Note here that no geometric alignment of the facial regions based on specific landmarks is performed, which significantly increases the robustness of our approach when compared to existing methods from the literature, as no (error-prone) facial-landmark-localization procedure is needed [1,2].

Subspace creation
The extracted and normalized facial regions constructed with the procedure presented in the previous section form the foundation for the second step of our video sub-system, namely, the creation of a subspace that relates to the emotional state expressed in the given video sequence.
To facilitate the theoretical derivation of our subspace creation procedure, let us consider a set of facial images $\mathcal{X}_v = \{\mathbf{x}_i \in \mathbb{R}^d\}$, for $i = 1, 2, \ldots, n_v$, extracted from a given video sequence $v$. Here, $\mathbf{x}_i$ represents the $i$-th $d$-dimensional facial image (in vector form) from the sequence $v$ and $n_v$ denotes the total number of frames constituting $v$. When building a subspace from the facial images in $\mathcal{X}_v$, we assume that each image $\mathbf{x}_i$ can be decomposed into a constant, identity-specific part $\mathbf{x}_{\mathrm{id}}$ and a variable part $\mathbf{c}_i$ (often referred to as the channel part) caused by non-identity related factors, such as illumination, pose and/or facial expression. Thus, we can write:
$$\mathbf{x}_i = \mathbf{x}_{\mathrm{id}} + \mathbf{c}_i.$$
Since we presume that illumination changes are satisfactorily compensated for with our histogram equalization procedure (and the exclusion of the first three basis vectors of the created subspace), the remaining variability must inevitably be linked to pose and facial expression changes, which are reasonable indicators of the emotional state of the subject shown in the given video sequence. Clearly, if we were able to estimate the variable part of each image in $\mathcal{X}_v$, we could estimate an emotion-specific subspace that could serve as the basis for recognition.
Let us assume that the variable part $\mathbf{c}_i$ of the images $\mathbf{x}_i$, for $i = 1, 2, \ldots, n_v$, represents a random variable drawn from the standardized normal distribution $\mathcal{N}(0,1)$. The video-sequence-conditional mean $\boldsymbol{\mu}_v$ then serves as the (variation-free) estimate of the constant identity-specific part of the images $\mathbf{x}_i$ of $v$, as shown by the following expression:
$$\boldsymbol{\mu}_v = E[\mathbf{x}_i \mid v] = \mathbf{x}_{\mathrm{id}} + E[\mathbf{c}_i] = \mathbf{x}_{\mathrm{id}}.$$
Considering this observation, we can conclude that removing the sequence-specific mean $\boldsymbol{\mu}_v$ from all images in $\mathcal{X}_v$ results in a new set $\mathcal{Y}_v$ that encodes only the variable part of the video sequence, i.e.:
$$\mathcal{Y}_v = \{\mathbf{y}_i = \mathbf{x}_i - \boldsymbol{\mu}_v\}, \quad i = 1, 2, \ldots, n_v.$$
An example of the estimated identity-specific part as well as some channel images (computed based on the sequence shown in Figure 2) are presented in Figure 3. Note how well the sequence-specific mean captures the identity of the subject shown in the video sequence, while the channel images capture the variability caused by pose and facial expression changes.
To capture the information contained in this set in a subspace useful for emotion recognition, we first compute a scatter matrix $\boldsymbol{\Sigma}$ from all images in $\mathcal{Y}_v$. If we arrange the images in $\mathcal{Y}_v$ into the matrix $\mathbf{Y} \in \mathbb{R}^{d \times n_v}$, where $\mathbf{Y} = [\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_{n_v}]$, then the scatter matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{d \times d}$ can be computed as
$$\boldsymbol{\Sigma} = \mathbf{Y}\mathbf{Y}^T,$$
where $T$ denotes the transpose operator.
Finally, the subspace encoding the variable part of the facial images (i.e., the maximum variance directions [15]) is characterized by the leading eigenvectors (corresponding to non-zero eigenvalues) of the following eigenproblem:
$$\boldsymbol{\Sigma}\mathbf{u}_j = \lambda_j \mathbf{u}_j, \quad \lambda_1 \geq \lambda_2 \geq \ldots$$
It should be noted that for classification purposes we discard the first three computed eigenvectors, as these usually correlate heavily with illumination changes. Thus, for a given video sequence $v$ we construct a subspace $\mathbf{U}_v$ of dimensionality $d'$ of the following form:
$$\mathbf{U}_v = [\mathbf{u}_4, \mathbf{u}_5, \ldots, \mathbf{u}_{d'+3}] \in \mathbb{R}^{d \times d'}.$$
Some examples of the subspace basis (in image form) corresponding to the video sequence in Figure 2 are shown in Figure 4.
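The subspace creation steps described above (mean removal, eigen-analysis of the scatter matrix, discarding the first three eigenvectors) can be sketched with NumPy as follows. This is an illustrative sketch rather than the original implementation: the function name `build_subspace` and its parameters are hypothetical, and the eigenvectors are obtained from a singular value decomposition of the mean-free image matrix, which yields the same basis as the eigen-decomposition of the scatter matrix without forming the d×d matrix explicitly.

```python
import numpy as np


def build_subspace(frames, n_dims, n_skip=3):
    """Build a basis for the variable (channel) part of a face sequence.

    frames: array-like of shape (n_frames, d), one vectorized face image per row.
    n_dims: number of basis vectors to keep (the subspace dimensionality).
    n_skip: number of leading eigenvectors to discard (illumination-related).
    Returns a d x n_dims matrix with orthonormal columns.
    """
    X = np.asarray(frames, dtype=float)
    mu = X.mean(axis=0)            # sequence-specific mean (identity estimate)
    Y = (X - mu).T                 # d x n_frames matrix of mean-free images

    # Left singular vectors of Y are the eigenvectors of the scatter matrix
    # Y Y^T; singular values are returned in decreasing order.
    U, s, _ = np.linalg.svd(Y, full_matrices=False)

    # Keep only directions with (numerically) non-zero eigenvalues.
    U = U[:, s > 1e-10 * s.max()]
    return U[:, n_skip:n_skip + n_dims]


rng = np.random.default_rng(0)
basis = build_subspace(rng.normal(size=(20, 50)), n_dims=5)
```

Because mean removal reduces the rank by one, a sequence of n frames provides at most n - 1 usable directions, of which the first three are skipped here.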

Constructing the templates
To be able to compare the subspaces computed from individual video sequences, we require some prototypical subspaces that serve as templates for our emotional classes. To construct these templates, we follow an approach similar to the one presented in the previous section and compute a subspace for each emotional class featured in the training data.
As we have emphasized above, the canonical correlations measure the similarity between two subspaces. The first canonical correlation accounts for the similarity of the closest two basis vectors of the two subspaces, while the remaining ones carry information about the proximity of the basis vectors in other dimensions [4,5]. For classification purposes we use only the first (the maximum) canonical correlation and define the similarity between two subspaces as $\delta(\mathbf{U}_1, \mathbf{U}_2) = \cos\theta_1$, where $\theta_1$ is the smallest principal angle between the two subspaces. Thus, we formulate the classification problem as follows:
$$\delta(\mathbf{U}_v, \mathbf{U}_k) = \max_{j = 1, \ldots, N} \delta(\mathbf{U}_v, \mathbf{U}_j) \;\Rightarrow\; \mathbf{U}_v \in C_k.$$
The above expression postulates that in case the similarity between the subspaces $\mathbf{U}_v$ and $\mathbf{U}_k$ is the highest among the similarities to all $N$ template subspaces, then the subspace $\mathbf{U}_v$ is assigned to the $k$-th class $C_k$.
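The matching rule can be sketched as follows, under the assumption that every subspace is given by a matrix with orthonormal columns. In that case the canonical correlations are the singular values of the product of one basis (transposed) with the other, and the first canonical correlation is the largest of them. The function names and the class labels in the example are hypothetical.

```python
import numpy as np


def first_canonical_correlation(U1, U2):
    """Largest canonical correlation between two subspaces.

    U1, U2: matrices with orthonormal columns spanning the two subspaces.
    """
    s = np.linalg.svd(U1.T @ U2, compute_uv=False)
    # Clip tiny numerical overshoots above 1.
    return float(min(s[0], 1.0))


def classify(U_test, templates):
    """Assign the test subspace to the class of the most similar template.

    templates: dict mapping a class label to its template basis.
    Returns the winning label and all similarity scores.
    """
    scores = {label: first_canonical_correlation(U_test, U_k)
              for label, U_k in templates.items()}
    return max(scores, key=scores.get), scores


# Toy example with axis-aligned subspaces of a 6-dimensional space.
I = np.eye(6)
label, scores = classify(I[:, [0, 1]],
                         {"AN": I[:, [2, 3]], "HA": I[:, [0, 4]]})
```

In the toy example the test subspace shares one direction with the "HA" template, so its first canonical correlation with that template is 1, while it is orthogonal to the "AN" template.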

The audio sub-system
The audio part of our emotion recognition system builds on the traditional UBM-MAP technique of representing acoustic feature vectors. In this section we elaborate on our approach and describe the entire procedure of emotion recognition based on audio data.

Acoustic features
The acoustic feature vectors used in our experiments comprise the standard set of Mel-frequency Cepstral Coefficients (MFCCs) 1-12 plus energy. The MFCC features are first smoothed with a moving average filter of length 3 and then normalized using Cepstral Mean Normalization (CMN). In order to include temporal information as well, the first-order delta coefficients are also generated and added to the feature vector. Thus, the final length of the feature vector equals 26. The described procedure is implemented using the openSMILE feature extractor [16].
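The post-processing chain (smoothing, CMN, delta coefficients) can be illustrated with a small pure-Python sketch. It is not the openSMILE implementation: the function names are hypothetical, the 13 static coefficients per frame are assumed to be available already, all 13 are smoothed and normalized for simplicity, and the delta coefficients are approximated by a central difference, which may differ from the regression formula used by the actual feature extractor.

```python
def smooth(frames, width=3):
    """Moving-average filter over time; edge frames use the available neighbours."""
    half = width // 2
    out = []
    for t in range(len(frames)):
        window = frames[max(0, t - half): t + half + 1]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean computed over the whole utterance."""
    means = [sum(col) / len(frames) for col in zip(*frames)]
    return [[x - m for x, m in zip(frame, means)] for frame in frames]


def add_deltas(frames):
    """Append first-order deltas (central difference, edge frames replicated)."""
    n = len(frames)
    out = []
    for t in range(n):
        prev, nxt = frames[max(0, t - 1)], frames[min(n - 1, t + 1)]
        delta = [(b - a) / 2.0 for a, b in zip(prev, nxt)]
        out.append(frames[t] + delta)
    return out


def make_features(static_frames):
    """13 static coefficients per frame in, 26-dimensional vectors out."""
    return add_deltas(cepstral_mean_normalize(smooth(static_frames)))


# Five synthetic frames with 13 static coefficients each.
features = make_features([[float(t + k) for k in range(13)] for t in range(5)])
```

Each output frame holds the 13 normalized static coefficients followed by their 13 deltas, which gives the feature length of 26 stated above.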

GMM-UBM modeling
The frame-level features presented in the previous section are used to construct Gaussian mixture models (GMMs), which represent generative statistical models capable of characterizing arbitrary data distributions [17,18].
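Formally, a GMM is defined as a linear combination of several multivariate Gaussian probability density functions (PDFs), i.e.,
$$p(\mathbf{x} \mid \lambda) = \sum_{i=1}^{M} w_i\, g_i(\mathbf{x}),$$
where $w_i$ denotes the weight associated with the $i$-th Gaussian PDF $g_i(\mathbf{x})$:
$$g_i(\mathbf{x}) = \frac{1}{(2\pi)^{d/2}\,|\boldsymbol{\Sigma}_i|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu}_i)^T \boldsymbol{\Sigma}_i^{-1} (\mathbf{x} - \boldsymbol{\mu}_i)\right).$$
In the above equations $\boldsymbol{\mu}_i$ denotes the mean vector of the $i$-th Gaussian PDF, $\boldsymbol{\Sigma}_i$ denotes the covariance matrix of the $i$-th Gaussian PDF, $d$ stands for the dimensionality of the PDF and $\lambda = \{w_i, \boldsymbol{\mu}_i, \boldsymbol{\Sigma}_i\}$ (for $i = 1, 2, \ldots, M$) represents the set of GMM parameters. Note that an $M$-component GMM is fully characterized by the values of its parameters $\lambda$.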
The concept of universal background models (UBM) was first introduced for the problem of speaker verification [18]. In general, a UBM represents a Gaussian mixture model, which is trained on some generic training data (usually all available training samples). The parameters of the UBM, i.e., $\lambda_{\mathrm{UBM}}$, are estimated based on the maximum likelihood (ML) criterion via the expectation-maximization (EM) algorithm [19]. The model is typically initialized using either k-means clustering or the Linde-Buzo-Gray algorithm.
Once the UBM is computed, the maximum a posteriori (MAP) estimation criterion (as described in [19]) is used to adapt the UBM to an utterance-specific GMM. The means of the utterance-specific GMM represent a new feature vector. While the GMM for a given test utterance could also be calculated directly from the set of feature vectors extracted from the utterance, adapting the UBM to the data in the given utterance has three important advantages:
- it ensures that the ordering of the GMM parameters in $\lambda$ is the same as in the UBM for each computed GMM;
- it compensates for the insufficient amount of data in the given utterance; and
- it incorporates domain-specific knowledge into the computed GMM.
When computing a GMM from the UBM, the first step is to determine the probabilistic alignment $P(i \mid \mathbf{x}_j)$ of a particular sample against all $M$ UBM components as follows:
$$P(i \mid \mathbf{x}_j) = \frac{w_i\, g_i(\mathbf{x}_j)}{\sum_{k=1}^{M} w_k\, g_k(\mathbf{x}_j)},$$
where $g_i(\mathbf{x}_j)$ again denotes the Gaussian probability density function of the feature vector $\mathbf{x}_j$ for the $i$-th component of the GMM, $j$ denotes the feature vector index with $j = 1, 2, \ldots, N$, $N$ stands for the total number of feature vectors extracted from the given utterance, and $w_i$ represents the weight associated with the $i$-th GMM component.
In the second step, the sufficient statistics for updating the mean feature vectors are computed. In general, the MAP estimation procedure updates the means, variances and weights of the GMM, but commonly the focus is only on updating the GMM's means. The statistics required for the MAP adaptation are:
$$n_i = \sum_{j=1}^{N} P(i \mid \mathbf{x}_j)$$
and
$$\mathbf{E}_i = \frac{1}{n_i} \sum_{j=1}^{N} P(i \mid \mathbf{x}_j)\, \mathbf{x}_j,$$
where $n_i$ and $\mathbf{E}_i$ stand for the zeroth and first order sufficient statistics.
So far the presented adaptation procedure is identical to the Expectation step when using the ML criterion in the EM algorithm. The difference to the ML-based procedure appears in the Maximization step, where the update rule becomes:
$$\hat{\boldsymbol{\mu}}_i = \alpha_i \mathbf{E}_i + (1 - \alpha_i)\, \boldsymbol{\mu}_i,$$
as postulated by the MAP criterion.
The adaptation parameter $\alpha_i$, which controls the balance between the old values of the means and the new estimate, is computed as:
$$\alpha_i = \frac{n_i}{n_i + r},$$
where $r$ is the relevance factor, which is the same for all components of the GMM. The value of the relevance factor is chosen experimentally and usually falls in the interval between 8 and 16.
The described procedure is iterated until the change in the component means is sufficiently small or a predefined number of iterations is reached. The size of the resulting vector of GMM means equals the dimension of the original feature vector multiplied by the number of components of the GMM and, thus, grows with the number of GMM components $M$.
The result of the described procedure is a UBM and a separate super-vector (comprising the mean vectors of the $M$ GMM components) for each available utterance. The super-vector of means is taken as a feature representing the utterance, and Support Vector Machines (SVMs) are typically used for classification.
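A single pass of the mean-only MAP adaptation and the construction of the super-vector can be sketched in pure Python as follows. The sketch makes several simplifying assumptions that are not taken from the paper: diagonal covariance matrices, one adaptation pass instead of an iterated procedure, and a relevance factor of 10 picked arbitrarily from the 8 to 16 range mentioned above. All function names are hypothetical.

```python
import math


def _log_gauss_diag(x, mean, var):
    """Log-density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def map_adapt_means(frames, weights, means, variances, relevance=10.0):
    """Adapt the UBM means to the frames of one utterance (one pass)."""
    M, d = len(weights), len(means[0])
    n = [0.0] * M                          # zeroth-order statistics
    first = [[0.0] * d for _ in range(M)]  # unnormalized first-order statistics
    for x in frames:
        # Posterior probability of every component given the frame.
        logs = [math.log(weights[i]) + _log_gauss_diag(x, means[i], variances[i])
                for i in range(M)]
        top = max(logs)
        post = [math.exp(l - top) for l in logs]
        norm = sum(post)
        for i in range(M):
            p = post[i] / norm
            n[i] += p
            for k in range(d):
                first[i][k] += p * x[k]
    adapted = []
    for i in range(M):
        alpha = n[i] / (n[i] + relevance)  # data-dependent adaptation weight
        e_i = [f / n[i] for f in first[i]] if n[i] > 0 else means[i]
        adapted.append([alpha * e + (1 - alpha) * m
                        for e, m in zip(e_i, means[i])])
    return adapted


def supervector(adapted_means):
    """Concatenate the adapted component means into one feature vector."""
    return [value for mean in adapted_means for value in mean]


# Two-component, one-dimensional toy UBM and ten identical frames.
sv = supervector(map_adapt_means([[6.0]] * 10, [0.5, 0.5],
                                 [[-5.0], [5.0]], [[1.0], [1.0]]))
```

In the toy example practically all frames are assigned to the second component, so its mean moves halfway from 5 towards 6 (ten frames against a relevance factor of 10), while the first mean stays at its UBM value.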

UBM-MAP derived super-vectors for emotion recognition
As already emphasized in the previous section, the use of Universal Background Models (UBM) in combination with the maximum a posteriori (MAP) adaptation criterion was initially introduced to the field of speaker recognition by Reynolds et al. [19], but has since been successfully applied to the recognition of other paralinguistic information in speech as well [10,20].
Clearly, if the information not related to the task at hand (in our case emotion recognition) could be excluded from the recognition procedure, it would improve the performance of the recognition task. Due to the limited amount of data (per speaker) available in most corpora commonly used in the field of emotion recognition, there is not enough statistical information for decoupling the speaker-specific information. We therefore take a different approach and exploit the possibility of excluding gender-specific information. As we show in Section 3, MAP-derived models reliably distinguish between genders, which allows the system to take this information into account when making predictions about the emotional class of test utterances.
The procedure can be described in more detail as follows. During training, a single UBM is built using all of the available training data. This UBM is then adapted via the MAP criterion to produce two gender-specific UBMs. Note that in practice only the mean vectors of all Gaussian mixtures are adapted, while the covariance matrices and weights of the initial UBM are left unchanged. Once the two gender-specific UBMs are computed, the training utterances are partitioned into two disjoint sets in accordance with their gender labels (which for the training data are known in advance).
Next, a super-vector comprising the mean vectors of the estimated Gaussian mixtures is constructed for each training utterance by transforming the appropriate gender-specific UBM via the MAP rule. Using this procedure, we arrive at two sets of super-vectors, one for males and one for females, with each super-vector corresponding to a given emotional class.
In the final stage of the training procedure, a pairwise SVM classification scheme is trained based on the constructed super-vectors to discriminate between the different emotional classes.
While the gender labels of the training utterances are known in advance, this is not the case for the test utterances. Hence, to be able to exploit gender-specific information for the emotion recognition task, the gender of the speaker of a given test utterance has to be determined first. This can be done efficiently by a likelihood comparison against the male and female UBMs.
Once the gender is known, a super-vector is constructed for the given test utterance by MAP adaptation of the predicted gender's UBM and concatenation of the means of the Gaussian mixtures. The resulting super-vector is ultimately classified into an emotional class using the trained SVM classification scheme.
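The gender detection step at test time can be sketched as a comparison of average frame log-likelihoods under the two gender-specific UBMs. The sketch assumes diagonal covariances and represents each model as a hypothetical `(weights, means, variances)` tuple; the function names and the toy models are illustrative only.

```python
import math


def avg_log_likelihood(frames, gmm):
    """Average per-frame log-likelihood of an utterance under a diagonal GMM."""
    weights, means, variances = gmm
    total = 0.0
    for x in frames:
        logs = []
        for w, m, v in zip(weights, means, variances):
            quad = sum(math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi
                       for xi, mi, vi in zip(x, m, v))
            logs.append(math.log(w) - 0.5 * quad)
        # Log-sum-exp over the components for numerical stability.
        top = max(logs)
        total += top + math.log(sum(math.exp(l - top) for l in logs))
    return total / len(frames)


def detect_gender(frames, male_ubm, female_ubm):
    """Pick the gender whose UBM explains the utterance better."""
    male_score = avg_log_likelihood(frames, male_ubm)
    female_score = avg_log_likelihood(frames, female_ubm)
    return "male" if male_score >= female_score else "female"


# One-dimensional toy models with a single component each.
male = ([1.0], [[0.0]], [[1.0]])
female = ([1.0], [[3.0]], [[1.0]])
predicted = detect_gender([[2.8], [3.1]], male, female)
```

The predicted label then selects which gender-specific UBM is adapted to the test utterance before the super-vector is passed to the SVM stage.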

Information fusion
To combine the information from the two sub-systems presented in Sections 2.2 and 2.3, we assess two fusion schemes in this paper. The first is a weighted sum-rule, while the second is a weighted product-rule [21]. Assume that for a given test recording and a given emotional class our video sub-system has produced a matching score $s_v$ and, similarly, that our audio sub-system has produced a matching score $s_a$ for the same recording and the same emotional class. Then the weighted sum-rule generates a new matching score $s_{\mathrm{sum}}$ based on the following expression:
$$s_{\mathrm{sum}} = \omega\, s_v + (1 - \omega)\, s_a,$$
where $\omega \in [0,1]$ is a weighting factor that needs to be set based on some training/development data.
Similarly, the weighted product-rule produces a matching score of:
$$s_{\mathrm{prd}} = s_v^{\omega}\, s_a^{1 - \omega},$$
where $\omega \in [0,1]$ is again a weighting factor that needs to be set in advance.
Since two different classification techniques are used for the video and audio modality, the matching scores $s_v$ and $s_a$ need to be normalized to balance their impact. To this end, we apply rank normalization to the matching scores prior to the fusion process [22].
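The two fusion rules can be sketched as follows. The sketch assumes that the weighting factor is applied to the video score and its complement to the audio score, and it uses a simple form of rank normalization that replaces each score by its rank divided by the number of scores; the exact normalization variant of [22] may differ. Function names and the example scores are hypothetical.

```python
def rank_normalize(scores):
    """Map scores to ranks in (0, 1]; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank / len(scores)
        i = j + 1
    return ranks


def fuse_sum(video, audio, omega):
    """Weighted sum-rule applied class by class."""
    return [omega * v + (1 - omega) * a for v, a in zip(video, audio)]


def fuse_product(video, audio, omega):
    """Weighted product-rule applied class by class."""
    return [v ** omega * a ** (1 - omega) for v, a in zip(video, audio)]


# Scores of one recording for three classes from each sub-system.
video = rank_normalize([0.2, 0.9, 0.4])    # e.g. canonical correlations
audio = rank_normalize([-1.0, 0.5, 2.0])   # e.g. SVM outputs
fused_sum = fuse_sum(video, audio, omega=0.7)
fused_prd = fuse_product(video, audio, omega=0.7)
best_class = max(range(len(fused_sum)), key=fused_sum.__getitem__)
```

Because the normalized ranks are strictly positive, the product-rule never collapses to zero for a class that one modality scores poorly.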

Database and experimental protocol
For the experiments presented in the remainder of this section, we adopt the publicly available eNTERFACE'05 [11] corpus. The corpus contains recordings of 44 subjects of 14 different nationalities, uttering 5 sentences for each of the 6 emotional classes. These 6 classes correspond to the 'big six' archetypal emotions, as proposed by Ekman in [23]. They are also adopted in the MPEG-4 standard and represent anger (AN), disgust (DI), fear (FE), happiness (HA), sadness (SA) and surprise (SU). Some frames (after the face detection step) extracted from video sequences of all six emotional classes of a random subject from the eNTERFACE'05 database are shown in Figure 5.
In our experiments 43 subjects are used; subject 6 is omitted, as only one recording exists for each emotion. Furthermore, only 2 sentences portraying happiness can be found in the database for subject 23. Hence, the total number of utterances available for our experiments is 1287 (43 × 30 − 3). Since we experimented with the exclusion of gender information from our emotion recognition task, we annotated all data with gender labels prior to the experiments.
For a robust estimate of the recognition performance of the proposed system, a 5-fold cross-validation protocol is employed (1030 samples were used for training and 257 for testing). The folds are selected randomly without regard to the distribution of speakers. The evaluation measure for all tests is the unweighted (UW) class-wise recall (averaged over the 5 folds), which is the predominant way of measuring emotion recognition accuracy [9]. We also report the weighted average (WA) recalls of our experiments, even though these are not as reliable, since emotional databases tend to have imbalanced emotional classes.
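The difference between the two evaluation measures can be made concrete with a short sketch. The unweighted recall is the plain mean of the per-class recalls, while the weighted average recall weights every class by its number of samples and therefore equals the overall accuracy. The function name `recalls` and the toy labels are hypothetical.

```python
def recalls(true_labels, predicted_labels):
    """Return (unweighted recall, weighted average recall)."""
    classes = sorted(set(true_labels))
    per_class = []
    correct_total = 0
    for c in classes:
        idx = [i for i, t in enumerate(true_labels) if t == c]
        correct = sum(1 for i in idx if predicted_labels[i] == c)
        per_class.append(correct / len(idx))
        correct_total += correct
    uw = sum(per_class) / len(classes)       # every class counts equally
    wa = correct_total / len(true_labels)    # every sample counts equally
    return uw, wa


# Imbalanced toy set: a classifier that always answers "AN".
uw, wa = recalls(["AN"] * 4 + ["HA"], ["AN"] * 5)
```

In this toy case the weighted average recall is 0.8 although one class is never recognized, whereas the unweighted recall of 0.5 exposes the failure; this is why the unweighted measure is preferred for imbalanced emotional databases.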

Results
Assessing the video sub-system
Our first series of experiments aims at assessing the recognition performance of the proposed video sub-system. Specifically, we are interested in the recognition results obtained with respect to the dimensionality of the linear subspace, which is denoted with d' in Section 2.2. Hence, we vary the dimensionality of the subspaces from �� � � to �� � ���� and observe the average unweighted recall of the experiments (see Figure 6).
Note that the recognition performance steadily improves with increasing dimensionality, at the expense of computational complexity. However, the recognition performance improves only marginally when the dimensionality of the subspace is increased from �� � ��� to � � � ����� Thus, we select a dimensionality of �� � ��� for our subspace and use this value for our following experiments. Figure 7 shows more detailed results for this subspace dimensionality in the form of confusion matrices for all 5 folds of our cross-validation procedure.

Assessing the audio sub-system
The second series of experiments evaluates the performance of the audio sub-system. Throughout all presented experiments different numbers of GMM components were assessed. Generally, the recognition performance increases with the number of Gaussian mixtures, but with a limited amount of data one can quickly over-train the models, or singularities can occur during covariance calculations. In Section 2.3 we stated that gender-specific UBMs can be used via likelihood calculations to recognize gender. Table 1 presents the gender recognition results based on likelihood calculations against male and female UBMs with respect to GMM complexity. As expected, the recognition rate increases with the number of GMM components. Even with 8 components the results are above 94%.
While a simple likelihood comparison is sufficient to obtain good results for the gender recognition task, the emotion recognition experiments require more advanced approaches. Thus, following the gender detection step, an utterance-specific vector of means is produced based on the MAP criterion and the UBM of the predicted gender. This super-vector is finally subjected to our SVM classification scheme.
As shown in Figure 8, the proposed gender-specific UBM-MAP method outperforms the standard UBM-MAP approach, except in the case of 8 GMM components, where higher gender detection errors cause a slight decrease in emotion recognition performance. With an increasing number of GMM components the emotion recognition performance increases for both systems, but since the gender predictions become more accurate, the benefit of our procedure becomes more evident.
As with the video sub-system, we also present detailed results of our experiments with the audio sub-system in the form of confusion matrices for all 5 folds of our cross-validation procedure. The matrices are presented in Figure 9.

Audio-Video Fusion
Our third series of experiments assessed the performance of the combined multi-modal system with different fusion techniques. In order to train the fusion parameters (i.e., the weighting factors $\omega$), the test samples are randomly split into two parts. The first half is used for the estimation of the fusion parameters, and the second half for evaluation. The scores produced during classification from both modalities are combined in order to give the final prediction for each test utterance. The results of the weighted sum and weighted product fusion are presented in the lower part of Table 2.

Modality  Description                     UW    WA
Video     SAMMI framework [12,24]         28.0  /
Video     Video sub-system [13]           37.0  /
Video     LBPs+HMMs [14]                  37.7  /
Video     Canonical correlations (ours)   52.8  52.2
Fusion    Audio Video HMM [14]            56.3  /
Fusion    SAMMI framework [12,24]         67.0  /
Fusion    Async. feature fusion [13]      71

The differences between the weighted sum-rule and weighted product-rule are minor, with the highest UW recall of 77.5% achieved by the weighted product-rule fusion procedure.

Comparison with the state-of-the-art
Last but not least, we compared the performance of both developed sub-systems as well as the multi-modal emotion recognition system as a whole to results published in the literature. The results of this comparison are shown in Table 2.
Since there is no strictly defined protocol for the eNTERFACE'05 corpus, different experimental setups were used for the cited results and, thus, a strict comparison is not possible. Note, however, that our experimental protocol was at least as challenging as any from the cited sources. It is evident that our results at least match the highest reported results from the literature. Furthermore, our results are obtained without incorporating any prosodic or voice quality features, which could further improve the results.

Conclusion
In this paper we presented a multi-modal emotion recognition system. Both the audio and video sub-systems were implemented using novel approaches. For the audio sub-system we have shown that the standard UBM-MAP procedure can be further improved by incorporating gender-specific information. For the video sub-system we presented an approach to emotion recognition based on image-set matching. Both sub-systems were evaluated individually, resulting in competitive performance when compared to the state-of-the-art results from the literature. The fusion of both sub-systems resulted in an additional increase in the emotion recognition performance when compared to the results obtained with the uni-modal systems. Moreover, the achieved average unweighted recall of 77.5% on the eNTERFACE'05 corpus also compares favorably with other techniques from the literature.
In our future work on multi-modal emotion recognition we plan to evaluate other possibilities for excluding non-emotion related information from the audio signals. For the video sub-system we plan to assess different, possibly non-linear, options for image-set matching, such as kernel canonical correlation analysis [25,26] or related techniques.

Figure 1. Block diagram of the multi-modal emotion recognition system

Figure 2.

Figure 3. The estimated identity-specific part of the images in the video sequence (left), channel images (right)

Figure 4. Some examples of the computed subspace basis for the video sequence shown in Figure 2

Figure 5. Sample frames extracted from video sequences of a random subject from the eNTERFACE'05 database depicting (from top to bottom): anger, disgust, fear, happiness, sadness and surprise. Note that the presented frames are not sampled in equal intervals from the video sequences and that they are processed by the face detection module of our system.

Figure 6. Video recognition results in the form of average unweighted recalls for different subspace dimensionalities

Figure 7. Confusion matrices for all 5 folds of our cross validation procedure generated using the presented video sub-system.

Figure 8. Emotion recognition based on audio data results

Figure 9. Confusion matrices for all 5 folds of our cross validation procedure generated using the presented audio sub-system

Table 1. Gender recognition results for the audio sub-system

Table 2. Comparison of emotion recognition results (in %)