Collaborative expression representation using peak expression and intra class variation face images for practical subject-independent emotion recognition in videos
Introduction
With the increasing availability of computers and powerful electronic devices, human computing needs to develop more human-centered user interfaces that respond promptly to naturally occurring human communication [1]. An important functionality of such interfaces is to understand emotions conveyed by facial expressions [1]. Facial expressions are the most natural and effective means by which humans communicate their emotions, express their intentions, and interact with each other [2]. These reasons underscore the importance of automatic facial expression recognition (FER) and explain the great interest this research topic has attracted in recent years [2].
Several research efforts have been made regarding automatic FER. In general, existing FER methods can be classified into geometric-based and appearance-based methods [3], [4]. Early FER methods were mostly based on facial landmark localization and face geometry (so-called geometric-based methods). These methods use characteristics of a face such as the shape and locations of facial components (mouth, eyes, eyebrows, and nose), the distances between pairs of facial landmark points, or the velocities of particular facial landmark points [3], [5]. It has been reported that geometric features can provide sufficient information to achieve accurate FER [6]. However, one crucial limitation of these methods is that they suffer from face misalignment caused by inaccurate detection or tracking of facial landmark points under challenging imaging conditions (e.g., occlusion, low resolution, illumination change) [7], [8], [11]. Face misalignment unavoidably degrades feature extraction for FER. Another limitation is that many of them require a neutral face of the corresponding subject, either for the normalization of a query face (i.e., eliminating the effect of facial identity [9], [10], [11], [12], [13]) or for the initialization of a facial landmark tracker (e.g., [35]). However, a neutral face of a subject is not always available in real applications [4], [32]. The authors in [14] proposed neutral-independent geometric features for FER that do not require a neutral face. In this method, the locations of eight facial landmark points and six distances between them were used as features for FER [14]. However, such features (e.g., distances between facial feature points) can also be used for face recognition (i.e., identifying the subject) [15], which means that geometric features may vary across subjects and are not fully subject-independent. Indeed, as seen in Table 7, this method shows relatively poor FER performance under subject-independent recognition.
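As a concrete illustration of such geometric features, the following minimal sketch computes pairwise landmark distances normalized by the inter-ocular distance. This is not the exact descriptor of [14]; the landmark indexing and normalization are assumptions made for the sketch.

```python
import numpy as np
from itertools import combinations

def geometric_features(landmarks, eye_idx=(0, 1)):
    """Pairwise distances between facial landmarks, normalized by the
    inter-ocular distance so the feature is scale-invariant.
    `landmarks` is an (n_points, 2) array; `eye_idx` gives the (assumed)
    indices of the two eye centers."""
    pts = np.asarray(landmarks, dtype=float)
    iod = np.linalg.norm(pts[eye_idx[0]] - pts[eye_idx[1]])  # inter-ocular distance
    dists = [np.linalg.norm(pts[i] - pts[j])
             for i, j in combinations(range(len(pts)), 2)]
    return np.asarray(dists) / max(iod, 1e-6)
```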
Appearance-based methods aim to capture changes in face texture, such as those created by wrinkles and bulges [3], [5]. These methods apply image filters (such as Gabor wavelets [16]) to the whole face, to specific face regions, or to local patches around facial components [5]. Most appearance-based FER methods that rely on static appearance in still images were investigated using either still-image datasets (e.g., [17]) or manually selected peak expression faces from video sequences (e.g., [7], [18], [19], [32], [43], [44], [45]). To exploit the temporal dynamic information present in a video sequence, some methods have used spatio-temporal appearance descriptors such as Local Binary Patterns from Three Orthogonal Planes (LBP-TOP) [20], [21], Local Phase Quantization from Three Orthogonal Planes (LPQ-TOP) [5], [21], and the Spatio-Temporal Local Monogenic Binary Pattern (STLMBP) [34], which exploits the local phase and local orientation of an image in addition to the local magnitude of LBP. To capture both spatial and temporal discriminative information, a texture operator is applied independently to each of the three orthogonal planes of a video volume (the XY plane captures spatial appearance, while the XT and YT planes capture appearance changes of the facial expression over time [20]). The video volume can cover the whole face or a local region of the face. However, spatio-temporal appearance descriptors have two main limitations. First, the temporal motion appearance in the XT or YT plane can negatively affect FER when the faces within a sequence are not temporally consecutive. Specifically, if some faces in the sequence are missed by the face detector or tracker, parts of the appearance associated with the facial expression change are lost, degrading the discriminative capability of the spatio-temporal descriptor. Second, the appearance of the XY plane can depend on the facial identity, which is undesirable for subject-independent FER.
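To make the three-plane idea concrete, here is a heavily simplified sketch of an LBP-TOP-style descriptor. The real LBP-TOP of [20] pools codes over many pixels of the volume and typically uses uniform patterns; sampling only the three center planes, as done here, is an assumption of the sketch.

```python
import numpy as np

def lbp_codes(img):
    """Basic 3x3 LBP code map over the interior pixels of a 2D array."""
    c = img[1:-1, 1:-1]
    codes = np.zeros(c.shape, dtype=np.int32)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        nbr = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        codes += (nbr >= c).astype(np.int32) << bit  # one bit per neighbor
    return codes

def lbp_top(volume):
    """Concatenated LBP histograms from the XY, XT and YT center planes
    of a (T, H, W) face volume: spatial appearance plus horizontal and
    vertical appearance changes over time."""
    T, H, W = volume.shape
    planes = [volume[T // 2],        # XY: spatial appearance
              volume[:, H // 2, :],  # XT: horizontal change over time
              volume[:, :, W // 2]]  # YT: vertical change over time
    hists = [np.bincount(lbp_codes(p).ravel(), minlength=256).astype(float)
             for p in planes]
    return np.concatenate([h / max(h.sum(), 1.0) for h in hists])
```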
It is known that combining geometric and appearance features outperforms using either alone for FER [4]. Such a hybrid representation incorporates local pixel variation patterns (related to face texture) while exploiting face geometry at a global level [56]. An increasing number of FER methods make use of hybrid feature extraction. To generate a hybrid feature, the methods in [12] and [56] simply concatenated the appearance and geometric features. However, not every appearance or geometric feature is guaranteed to be helpful for classification. To address this issue, feature selection techniques have been adopted to obtain more discriminative hybrid features [57], [58]. In [57], AdaBoost [59] was used to select a set of discriminative geometric and appearance features for recognizing facial action units. In [58], backward elimination was used to select discriminative features: out of all extracted features, the method iteratively discarded the feature whose removal maximized the quadratic mutual information [60] between the remaining feature set and the emotion category [58]. In [56] and [58], the geometric and appearance features were extracted directly from a face containing both identity and expression information; this may be suboptimal for subject-independent FER scenarios because identity and expression can be confounded. In [12] and [57], a neutral face was used to eliminate the effect of facial identity during appearance and geometric feature extraction. However, the main limitation of these methods is that a neutral face of a subject is not always available in real applications.
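A minimal sketch of the concatenate-then-select idea follows, using scikit-learn's AdaBoost feature importances as the selection criterion in the spirit of [57]. The feature dimensions, random stand-in data, and the median threshold are illustrative assumptions, not the settings of the cited methods.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
geom = rng.random((200, 30))    # stand-in geometric features (e.g., landmark distances)
appr = rng.random((200, 236))   # stand-in appearance features (e.g., LBP histograms)
y = rng.integers(0, 6, 200)     # six basic emotion labels

hybrid = np.concatenate([geom, appr], axis=1)  # naive concatenation as in [12], [56]

# keep only features whose boosted importance is above the median
selector = SelectFromModel(AdaBoostClassifier(n_estimators=100), threshold="median")
hybrid_selected = selector.fit_transform(hybrid, y)
print(hybrid.shape, "->", hybrid_selected.shape)
```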
Instead of exploiting hand-crafted descriptors, some research efforts have been dedicated to learning semantic expression features using deep learning [45], [65], [66]. In [65], a deep-architecture-based FER method was proposed, inspired by the facial action coding system (FACS). In this method, a convolutional neural network (CNN) was used to generate an over-complete representation of expression-specific appearance variation [65]. To simulate specific action units (AUs), groups of local patches, called AU-aware receptive fields (AURFs), were selected [65]. Restricted Boltzmann machines [70] were then used to extract high-level features of the AURFs, which were concatenated to construct the final hierarchical feature for FER [65]. Unlike the methods of [45], [65], which rely on a still image (i.e., a peak expression face), the method in [66] applied a 3D CNN to a face image sequence to exploit facial dynamics for FER. The 3D CNN convolves 3D kernels over the cube constructed from face images. However, these methods were evaluated with manual preprocessing (peak expression face selection in [45], [65] or manual face alignment in [65], [66]); thus, their feasibility in fully automatic FER applications was not verified.
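As an illustration of the 3D-convolution idea, a minimal PyTorch sketch follows. The layer sizes, 16-frame clip, and six-class output are assumptions made for the sketch, not the architecture of [66].

```python
import torch
import torch.nn as nn

# 3D kernels convolve jointly over time and space of a face-image cube
model = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),  # T x H x W kernels
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(2, 2, 2)),
    nn.Conv3d(16, 32, kernel_size=(3, 3, 3), padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(32, 6),  # six basic emotion classes
)

clip = torch.randn(1, 1, 16, 64, 64)  # (batch, channel, frames, height, width)
logits = model(clip)                   # -> shape (1, 6)
```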
In this paper, we propose a new practical FER method for video sequences that aims to overcome the following three difficulties. First, a video sequence may exhibit incomplete temporal dynamics (such as temporal discontinuity) due to face detection or tracking errors. Second, many methods require normalization with a neutral face of the same subject for subject-independent FER, a requirement that cannot always be satisfied. Third, subtle misalignment of a face image may occur due to incorrect landmark detection [55]. Many FER methods were evaluated using manually aligned faces [7], [8], [32], [34], [45]. However, manual face alignment is far from realistic: in a fully automatic FER method, faces are not always perfectly aligned [61]. Experimental results in [61] showed that FER performance can be severely degraded even by small landmark perturbations (e.g., 3% of the eye distance). The main contributions of the paper are threefold:
1) We propose a new feature extraction method, called collaborative expression representation (CER), in which the peak expression face of a video sequence and an artificially generated face collaboratively represent the expression-related facial appearance (see the sketch after this list). The artificial face image is called an intra-class variation (ICV) face, and it aims to eliminate the intra-class variation caused by the facial identity appearing in the peak expression face image. An ICV face is generated by combining the training face images of an expression class and is constructed to be similar to the facial identity appearance of the peak expression face [32]. To reduce the appearance related to facial identity, the proposed CER computes the distances between locally pooled texture features of the peak expression face and the ICV face. The CER needs no prior knowledge of the query subject's neutral state. In addition, because it is built from differences of locally pooled texture features, the representation is robust to subtle rotation and misalignment of the peak expression face.
2) We propose a method to select the peak expression face from a video sequence while discarding degraded faces. Here, a degraded face refers to a face image of low quality in terms of alignment and/or frontal pose. In terms of alignment, incorrectly scaled or rotated face images [55] caused by incorrect landmark detection are regarded as degraded. In terms of frontal pose, face images with out-of-plane rotations (rotations in yaw or pitch [52]) are regarded as degraded. Among the properly aligned, frontal faces, we select the most expressive (peak expression) face image.
3) We propose a sparsity-based weighting scheme that fuses the complementary CERs derived from a given peak expression face and multiple ICV face images (see the sketch below). We exploit the assumption that a sparser solution provides more discriminative information for classification [23], [30]. Under this assumption, the influence of more discriminative CERs is emphasized while that of less discriminative ones is suppressed. The weighting scheme is practical because it adapts to the appearance of the given peak expression face image rather than being pre-trained on fixed training face images.
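As referenced in contributions 1) and 3) above, the following sketch illustrates both ideas under explicit assumptions: per-block LBP histograms (reusing `lbp_codes` from the LBP-TOP sketch above) stand in for the locally pooled texture features, Euclidean block distances form the CER, and the sparsity concentration index (SCI) of [23] computed on a Lasso code serves as a stand-in sparsity measure; the paper's exact weighting formula is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import Lasso

def block_features(img, grid=(8, 8)):
    """Locally pooled texture features: one LBP histogram per grid block."""
    codes = lbp_codes(img)  # reuses the helper from the LBP-TOP sketch above
    bh, bw = codes.shape[0] // grid[0], codes.shape[1] // grid[1]
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            block = codes[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            hist = np.bincount(block.ravel(), minlength=256).astype(float)
            feats.append(hist / max(hist.sum(), 1.0))
    return np.asarray(feats)                    # (n_blocks, 256)

def cer(peak_face, icv_face, grid=(8, 8)):
    """CER sketch: per-block distances between locally pooled features of the
    peak expression face and an ICV face. Differencing pooled features rather
    than raw pixels is what buys robustness to subtle misalignment."""
    diff = block_features(peak_face, grid) - block_features(icv_face, grid)
    return np.linalg.norm(diff, axis=1)         # (n_blocks,)

def sci(code, atom_labels):
    """Sparsity concentration index of a sparse code over class labels [23]."""
    classes = np.unique(atom_labels)
    k = len(classes)
    l1 = max(np.abs(code).sum(), 1e-12)
    best = max(np.abs(code[atom_labels == c]).sum() for c in classes)
    return (k * best / l1 - 1.0) / (k - 1.0)

def fuse_weights(cers, dictionary, atom_labels, alpha=0.01):
    """One weight per CER: a sparser code (higher SCI) is assumed to signal a
    more discriminative CER, so its contribution is emphasized in the fusion."""
    weights = []
    for v in cers:   # dictionary: (len(v), n_atoms) matrix of training CERs
        code = Lasso(alpha=alpha, max_iter=5000).fit(dictionary, v).coef_
        weights.append(sci(code, atom_labels))
    weights = np.asarray(weights)
    return weights / max(weights.sum(), 1e-12)
```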
The proposed method is similar to the method in [32] in that both methods make use of ICV face images for subject-independent FER. In terms of practical usage and robustness, the main differences between the two methods are as follows. First, the method in [32] uses a manually selected peak expression face image for feature extraction, whereas the proposed method selects it automatically. By discarding degraded face images during the selection, the proposed method converts an unconstrained automatic FER task into a more controlled one with feasible performance (see Section 3.2). Second, the method in [32] uses the simple image difference between the peak expression and ICV face images for feature extraction, which can be sensitive to subtle face rotation or alignment error. In contrast, the proposed method is much more robust to moderate face rotation and alignment error because it uses differences of locally pooled features instead of image differences. Thanks to its ability to discard degraded face images and its robust feature extraction from the peak expression face image, the proposed method is more feasible for practical use in videos.
Extensive comparative experiments have been performed on the extended Cohn-Kanade (CK+) [24], MMI [25], and natural visible and infrared facial expression (NVIE) [39] databases under a fully subject-independent recognition protocol (see the evaluation sketch below). Experimental results show that the proposed FER method is feasible even in very challenging conditions, such as video sequences characterized by spontaneously induced expressions and unconstrained head movements. In addition, the proposed method performs comparably to or better than competing FER methods in the recent literature.
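For reference, a subject-independent (leave-one-subject-out) evaluation loop can be set up as below. The random stand-in features, labels, and linear-SVM classifier are placeholders for illustration, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((300, 64))            # placeholder CER feature vectors
y = rng.integers(0, 6, 300)          # emotion labels
subjects = rng.integers(0, 20, 300)  # subject ID per sample

# no subject ever appears in both the training and the test fold
accs = [SVC(kernel="linear").fit(X[tr], y[tr]).score(X[te], y[te])
        for tr, te in LeaveOneGroupOut().split(X, y, groups=subjects)]
print(f"mean subject-independent accuracy: {np.mean(accs):.3f}")
```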
The remainder of this paper is organized as follows: Section 2 presents the proposed FER method based on CER using peak and ICV face images. Section 3 presents experimental results, followed by conclusions in Section 4.
Proposed facial expression recognition method using collaborative expression representation
As shown in Fig. 1, the proposed FER method consists of five sequential steps. Given a video frame sequence, an automatic facial landmark detector is used to localize and align the face region of each video frame. The aligned face region is cropped, resulting in a face image. Peak expression face selection is performed to select the most useful expression face (called the peak expression face) within the face image sequence. To extract the facial appearance related to the expression, an ICV face is generated for each expression class from its training face images, and the CER is computed between the peak expression face and the ICV faces.
Experiment
To verify the proposed method, experiments were performed on three public databases, i.e., the CK+ [24], MMI [25], and NVIE [39] databases. The face images used in the experiments were cropped and aligned based on the two eye locations [49], using the facial landmark detection method detailed in [22]. Fig. 9 shows an example of facial landmark detection. In Fig. 9, the coordinates of the left eye and right eye are obtained by averaging the coordinates of facial landmarks No. 20–25 and the …
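For reference, eye-based cropping and alignment can be realized with a similarity transform as sketched below. The canonical eye height, eye spacing, and output size are illustrative assumptions, not the settings of [49].

```python
import numpy as np
import cv2

def align_face(img, left_eye, right_eye, out_size=128, eye_y=0.35, eye_dist=0.45):
    """Rotate and scale the image so the eyes land on a horizontal line at a
    canonical height and spacing, then crop a square face region."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))   # in-plane eye-line angle
    scale = (eye_dist * out_size) / max(np.hypot(rx - lx, ry - ly), 1e-6)
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)        # midpoint between the eyes
    M = cv2.getRotationMatrix2D(center, angle, scale)
    # shift so the eye midpoint maps to its canonical position in the crop
    M[0, 2] += out_size * 0.5 - center[0]
    M[1, 2] += out_size * eye_y - center[1]
    return cv2.warpAffine(img, M, (out_size, out_size), flags=cv2.INTER_LINEAR)
```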
Conclusion
In this paper, we proposed a new facial expression recognition (FER) method designed for subject-independent FER in general scenarios, including non-frontal head poses and varying illumination in face sequences. A robust method was proposed that automatically selects the most useful face image (called the peak expression face) from a video sequence. This method selects the most expressive face within a face sequence while discarding degraded faces (in terms of face alignment and frontal pose).
Conflict of interest
None declared.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2015R1A2A2A01005724).
References (73)
- et al., Static and dynamic 3D facial expression recognition: a comprehensive survey, Image Vis. Comput. (2012)
- et al., Gabor wavelets and general discriminant analysis for face identification and verification, Image Vis. Comput. (2007)
- et al., Beyond sparsity: the role of l1-optimizer in pattern classification, Pattern Recognit. (2012)
- et al., Multi-PIE, Image Vis. Comput. (2010)
- et al., Multimodal learning for facial expression recognition, Pattern Recognit. (2015)
- et al., Spontaneous facial expression recognition: a robust metric learning approach, Pattern Recognit. (2014)
- et al., Generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. (1997)
- et al., Facial expression recognition based on local binary patterns: a comprehensive study, Image Vis. Comput. (2009)
- et al., Score normalization in multimodal biometric systems, Pattern Recognit. (2005)
- M.F. Valstar, B. Jiang, M. Mehu, M. Pantic, K. Scherer, The First Expression Recognition and Analysis Challenge, IEEE...
- A survey of affect recognition methods: audio, visual, and spontaneous expressions, IEEE Trans. Pattern Anal. Mach. Intell.
- Facial expression analysis
- A dynamic appearance descriptor approach to facial actions temporal modeling, IEEE Trans. Syst. Man Cybern. B
- Local directional number pattern for face analysis, IEEE Trans. Image Process.
- Classifying facial actions, IEEE Trans. Pattern Anal. Mach. Intell.
- Facial expression recognition in image sequences using geometric deformation features and support vector machine, IEEE Trans. Image Process.
- Facial expression recognition in perceptual color space, IEEE Trans. Image Process.
- Component-based recognition of faces and facial expressions, IEEE Trans. Affect. Comput.
- Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE Trans. Pattern Anal. Mach. Intell.
- A review on Gabor wavelets for face recognition, Pattern Anal. Appl.
- Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell.
Seung Ho Lee is currently working towards the Ph.D. degree in the Image and Video Systems Laboratory at KAIST. His research interests include facial expression recognition, face recognition, pattern recognition, and machine learning. In 2011, he was a visiting researcher at the University of Toronto in Toronto, Ontario, Canada.
Wissam J. Baddar is currently working towards the Ph.D. degree in the Image and Video Systems Laboratory at KAIST. His research interests include face recognition/detection, face expression recognition/analysis, biometrics, medical imaging, pattern recognition, and machine learning.
Yong Man Ro is a professor in the Department of Electrical Engineering at KAIST. His research interests include image/video processing, biometrics, medical imaging, and pattern recognition. He has served as an Associate Editor for IEEE Signal Processing Letters and as a TPC member of many international conferences, including as program chair of IWDW 2004.