Collaborative expression representation using peak expression and intra class variation face images for practical subject-independent emotion recognition in videos
Introduction
With the increasing availability of computers and powerful electronic devices, human computing needs to develop more human-centered user interfaces that respond promptly to naturally occurring human communication [1]. An important functionality of such interfaces is to understand emotions conveyed by facial expressions [1]. Facial expressions are the most natural and effective means by which humans communicate their emotions, express their intentions, and interact with each other [2]. These reasons underscore the importance of automatic facial expression recognition (FER) and explain the great interest this research topic has attracted in recent years [2].
Several research efforts have been made regarding automatic FER. In general, existing FER methods can be classified into geometric-based and appearance-based methods [3], [4]. Early FER methods were mostly based on facial landmark localization and face geometry (so-called geometric-based methods). These methods use characteristics of a face such as the shape and locations of facial components (mouth, eyes, eyebrows, and nose), the distances between pairs of facial landmark points, or the velocities of particular facial landmark points [3], [5]. It has been reported that geometric features can provide sufficient information to achieve accurate FER [6]. However, one crucial limitation of these methods is that they suffer from face misalignment caused by inaccurate detection or tracking of facial landmark points under challenging imaging conditions (e.g., occlusion, low resolution, illumination change) [7], [8], [11]. Face misalignment unavoidably degrades feature extraction for FER. Another limitation is that many of them require a neutral face of the corresponding subject, either for the normalization of a query face (i.e., eliminating the effect of facial identity [9], [10], [11], [12], [13]) or for the initialization of a facial landmark tracker (e.g., [35]). However, a neutral face of a subject is not always available in real applications [4], [32]. The authors in [14] proposed neutral-independent geometric features for FER that do not require a neutral face. In this method, the locations of eight facial landmark points and six distances between them were used as features for FER [14]. However, such features (e.g., distances between facial feature points) can also be used for face recognition (i.e., identifying the subject) [15], which means that geometric features may vary across subjects and are not fully subject-independent. Indeed, as seen in Table 7, this method shows relatively poor FER performance under subject-independent recognition.
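As a concrete illustration of such geometric features, the following minimal sketch computes pairwise landmark distances normalized by the inter-ocular distance. This is not the exact descriptor of [14]; the landmark indexing and normalization are assumptions made for the sketch.

```python
import numpy as np
from itertools import combinations

def geometric_features(landmarks, eye_idx=(0, 1)):
    """Pairwise distances between facial landmarks, normalized by the
    inter-ocular distance so the feature is scale-invariant.
    `landmarks` is an (n_points, 2) array; `eye_idx` gives the (assumed)
    indices of the two eye centers."""
    pts = np.asarray(landmarks, dtype=float)
    iod = np.linalg.norm(pts[eye_idx[0]] - pts[eye_idx[1]])  # inter-ocular distance
    dists = [np.linalg.norm(pts[i] - pts[j])
             for i, j in combinations(range(len(pts)), 2)]
    return np.asarray(dists) / max(iod, 1e-6)
```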
Appearance-based methods aim to capture changes in face texture, such as those created by wrinkles and bulges [3], [5]. These methods apply image filters (such as Gabor wavelets [16]) to the whole face, to specific face regions, or to local patches around facial components [5]. Most appearance-based FER methods that rely on static appearance in still images were investigated using either still-image datasets (e.g., [17]) or manually selected peak expression faces from video sequences (e.g., [7], [18], [19], [32], [43], [44], [45]). To exploit the temporal dynamic information present in a video sequence, some methods have used spatio-temporal appearance descriptors such as Local Binary Patterns from Three Orthogonal Planes (LBP-TOP) [20], [21], Local Phase Quantization from Three Orthogonal Planes (LPQ-TOP) [5], [21], and the Spatio-Temporal Local Monogenic Binary Pattern (STLMBP) [34], which exploits the local phase and local orientation of an image in addition to the local magnitude of LBP. To capture both spatial and temporal discriminative information, a texture operator is applied independently to each of the three orthogonal planes of a video volume (the XY plane captures spatial appearance, while the XT and YT planes capture appearance changes of the facial expression over time [20]). The video volume can cover the whole face or a local region of the face. However, spatio-temporal appearance descriptors have two main limitations. First, the temporal motion appearance in the XT or YT plane can negatively affect FER when the faces within a sequence are not temporally consecutive. Specifically, if some faces in the sequence are missed by the face detector or tracker, parts of the appearance associated with the facial expression change are lost, degrading the discriminative capability of the spatio-temporal descriptor. Second, the appearance of the XY plane can depend on the facial identity, which is undesirable for subject-independent FER.
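To make the three-plane idea concrete, here is a heavily simplified sketch of an LBP-TOP-style descriptor. The real LBP-TOP of [20] pools codes over many pixels of the volume and typically uses uniform patterns; sampling only the three center planes, as done here, is an assumption of the sketch.

```python
import numpy as np

def lbp_codes(img):
    """Basic 3x3 LBP code map over the interior pixels of a 2D array."""
    c = img[1:-1, 1:-1]
    codes = np.zeros(c.shape, dtype=np.int32)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    for bit, (dy, dx) in enumerate(offsets):
        nbr = img[1 + dy:img.shape[0] - 1 + dy, 1 + dx:img.shape[1] - 1 + dx]
        codes += (nbr >= c).astype(np.int32) << bit  # one bit per neighbor
    return codes

def lbp_top(volume):
    """Concatenated LBP histograms from the XY, XT and YT center planes
    of a (T, H, W) face volume: spatial appearance plus horizontal and
    vertical appearance changes over time."""
    T, H, W = volume.shape
    planes = [volume[T // 2],        # XY: spatial appearance
              volume[:, H // 2, :],  # XT: horizontal change over time
              volume[:, :, W // 2]]  # YT: vertical change over time
    hists = [np.bincount(lbp_codes(p).ravel(), minlength=256).astype(float)
             for p in planes]
    return np.concatenate([h / max(h.sum(), 1.0) for h in hists])
```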
It is known that combining geometric and appearance features outperforms using either alone for FER [4]. Such a hybrid representation incorporates local pixel variation patterns (related to face texture) while exploiting face geometry at a global level [56]. An increasing number of FER methods make use of hybrid feature extraction. To generate a hybrid feature, the methods in [12] and [56] simply concatenated the appearance and geometric features. However, not every appearance or geometric feature is guaranteed to be helpful for classification. To address this issue, feature selection techniques have been adopted to obtain more discriminative hybrid features [57], [58]. In [57], AdaBoost [59] was used to select a set of discriminative geometric and appearance features for recognizing facial action units. In [58], backward elimination was used to select discriminative features: out of all extracted features, the method iteratively discarded the feature whose removal maximized the quadratic mutual information [60] between the remaining feature set and the emotion category [58]. In [56] and [58], the geometric and appearance features were extracted directly from a face containing both identity and expression information; this may be suboptimal for subject-independent FER scenarios because identity and expression can be confounded. In [12] and [57], a neutral face was used to eliminate the effect of facial identity during appearance and geometric feature extraction. However, the main limitation of these methods is that a neutral face of a subject is not always available in real applications.
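A minimal sketch of the concatenate-then-select idea follows, using scikit-learn's AdaBoost feature importances as the selection criterion in the spirit of [57]. The feature dimensions, random stand-in data, and the median threshold are illustrative assumptions, not the settings of the cited methods.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.default_rng(0)
geom = rng.random((200, 30))    # stand-in geometric features (e.g., landmark distances)
appr = rng.random((200, 236))   # stand-in appearance features (e.g., LBP histograms)
y = rng.integers(0, 6, 200)     # six basic emotion labels

hybrid = np.concatenate([geom, appr], axis=1)  # naive concatenation as in [12], [56]

# keep only features whose boosted importance is above the median
selector = SelectFromModel(AdaBoostClassifier(n_estimators=100), threshold="median")
hybrid_selected = selector.fit_transform(hybrid, y)
print(hybrid.shape, "->", hybrid_selected.shape)
```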
Instead of exploiting hand-crafted descriptors, some research efforts have been dedicated to learning semantic expression features using deep learning [45], [65], [66]. In [65], a deep-architecture-based FER method was proposed, inspired by the facial action coding system (FACS). In this method, a convolutional neural network (CNN) was used to generate an over-complete representation of expression-specific appearance variation [65]. To simulate specific action units (AUs), groups of local patches, called AU-aware receptive fields (AURFs), were selected [65]. Restricted Boltzmann machines [70] were then used to extract high-level features of the AURFs, which were concatenated to construct the final hierarchical feature for FER [65]. Unlike the methods of [45], [65], which rely on a still image (i.e., a peak expression face), the method in [66] applied a 3D CNN to a face image sequence to exploit facial dynamics for FER. The 3D CNN convolves 3D kernels over the cube constructed from face images. However, these methods were evaluated with manual preprocessing (peak expression face selection in [45], [65] or manual face alignment in [65], [66]); thus, their feasibility in fully automatic FER applications was not verified.
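As an illustration of the 3D-convolution idea, a minimal PyTorch sketch follows. The layer sizes, 16-frame clip, and six-class output are assumptions made for the sketch, not the architecture of [66].

```python
import torch
import torch.nn as nn

# 3D kernels convolve jointly over time and space of a face-image cube
model = nn.Sequential(
    nn.Conv3d(1, 16, kernel_size=(3, 5, 5), padding=(1, 2, 2)),  # T x H x W kernels
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(2, 2, 2)),
    nn.Conv3d(16, 32, kernel_size=(3, 3, 3), padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool3d(1),
    nn.Flatten(),
    nn.Linear(32, 6),  # six basic emotion classes
)

clip = torch.randn(1, 1, 16, 64, 64)  # (batch, channel, frames, height, width)
logits = model(clip)                   # -> shape (1, 6)
```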
In this paper, we propose a new practical FER method for video sequences that aims to overcome the following three difficulties. First, a video sequence may exhibit incomplete temporal dynamics (such as temporal discontinuity) due to face detection or tracking errors. Second, many methods require normalization with a neutral face of the same subject for subject-independent FER, a requirement that cannot always be satisfied. Third, subtle misalignment of a face image may occur due to incorrect landmark detection [55]. Many FER methods were evaluated using manually aligned faces [7], [8], [32], [34], [45]. However, manual face alignment is far from realistic: in a fully automatic FER method, faces are not always perfectly aligned [61]. Experimental results in [61] showed that FER performance can be severely degraded even by small landmark perturbations (e.g., 3% of the eye distance). The main contributions of the paper are threefold:
1) We propose a new feature extraction method, called collaborative expression representation (CER), in which the peak expression face of a video sequence and an artificially generated face collaboratively represent the expression-related facial appearance (see the sketch after this list). The artificial face image is called an intra-class variation (ICV) face, and it aims to eliminate the intra-class variation caused by the facial identity appearing in the peak expression face image. An ICV face is generated by combining the training face images of an expression class and is constructed to be similar to the facial identity appearance of the peak expression face [32]. To reduce the appearance related to facial identity, the proposed CER computes the distances between locally pooled texture features of the peak expression face and the ICV face. The CER needs no prior knowledge of the query subject's neutral state. In addition, because it is built from differences of locally pooled texture features, the representation is robust to subtle rotation and misalignment of the peak expression face.
2) We propose a method to select the peak expression face from a video sequence while discarding degraded faces. Here, a degraded face refers to a face image of low quality in terms of alignment and/or frontal pose. In terms of alignment, incorrectly scaled or rotated face images [55] caused by incorrect landmark detection are regarded as degraded. In terms of frontal pose, face images with out-of-plane rotations (rotations in yaw or pitch [52]) are regarded as degraded. Among the properly aligned, frontal faces, we select the most expressive (peak expression) face image.
3) We propose a sparsity-based weighting scheme that fuses the complementary CERs derived from a given peak expression face and multiple ICV face images (see the sketch below). We exploit the assumption that a sparser solution provides more discriminative information for classification [23], [30]. Under this assumption, the influence of more discriminative CERs is emphasized while that of less discriminative ones is suppressed. The weighting scheme is practical because it adapts to the appearance of the given peak expression face image rather than being pre-trained on fixed training face images.
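As referenced in contributions 1) and 3) above, the following sketch illustrates both ideas under explicit assumptions: per-block LBP histograms (reusing `lbp_codes` from the LBP-TOP sketch above) stand in for the locally pooled texture features, Euclidean block distances form the CER, and the sparsity concentration index (SCI) of [23] computed on a Lasso code serves as a stand-in sparsity measure; the paper's exact weighting formula is not reproduced here.

```python
import numpy as np
from sklearn.linear_model import Lasso

def block_features(img, grid=(8, 8)):
    """Locally pooled texture features: one LBP histogram per grid block."""
    codes = lbp_codes(img)  # reuses the helper from the LBP-TOP sketch above
    bh, bw = codes.shape[0] // grid[0], codes.shape[1] // grid[1]
    feats = []
    for i in range(grid[0]):
        for j in range(grid[1]):
            block = codes[i * bh:(i + 1) * bh, j * bw:(j + 1) * bw]
            hist = np.bincount(block.ravel(), minlength=256).astype(float)
            feats.append(hist / max(hist.sum(), 1.0))
    return np.asarray(feats)                    # (n_blocks, 256)

def cer(peak_face, icv_face, grid=(8, 8)):
    """CER sketch: per-block distances between locally pooled features of the
    peak expression face and an ICV face. Differencing pooled features rather
    than raw pixels is what buys robustness to subtle misalignment."""
    diff = block_features(peak_face, grid) - block_features(icv_face, grid)
    return np.linalg.norm(diff, axis=1)         # (n_blocks,)

def sci(code, atom_labels):
    """Sparsity concentration index of a sparse code over class labels [23]."""
    classes = np.unique(atom_labels)
    k = len(classes)
    l1 = max(np.abs(code).sum(), 1e-12)
    best = max(np.abs(code[atom_labels == c]).sum() for c in classes)
    return (k * best / l1 - 1.0) / (k - 1.0)

def fuse_weights(cers, dictionary, atom_labels, alpha=0.01):
    """One weight per CER: a sparser code (higher SCI) is assumed to signal a
    more discriminative CER, so its contribution is emphasized in the fusion."""
    weights = []
    for v in cers:   # dictionary: (len(v), n_atoms) matrix of training CERs
        code = Lasso(alpha=alpha, max_iter=5000).fit(dictionary, v).coef_
        weights.append(sci(code, atom_labels))
    weights = np.asarray(weights)
    return weights / max(weights.sum(), 1e-12)
```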
The proposed method is similar to the method in [32] in that both methods make use of ICV face images for subject-independent FER. In terms of practical usage and robustness, the main differences between the two methods are as follows. First, the method in [32] uses a manually selected peak expression face image for feature extraction, whereas the proposed method selects it automatically. By discarding degraded face images during the selection, the proposed method converts an unconstrained automatic FER task into a more controlled one with feasible performance (see Section 3.2). Second, the method in [32] uses the simple image difference between the peak expression and ICV face images for feature extraction, which can be sensitive to subtle face rotation or alignment error. In contrast, the proposed method is much more robust to moderate face rotation and alignment error because it uses differences of locally pooled features instead of image differences. Thanks to its ability to discard degraded face images and its robust feature extraction from the peak expression face image, the proposed method is more feasible for practical use in videos.
Extensive comparative experiments have been performed on the extended Cohn-Kanade (CK+) [24], MMI [25], and natural visible and infrared facial expression (NVIE) [39] databases under a fully subject-independent recognition protocol (see the evaluation sketch below). Experimental results show that the proposed FER method is feasible even in very challenging conditions, such as video sequences characterized by spontaneously induced expressions and unconstrained head movements. In addition, the proposed method performs comparably to or better than competing FER methods in the recent literature.
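For reference, a subject-independent (leave-one-subject-out) evaluation loop can be set up as below. The random stand-in features, labels, and linear-SVM classifier are placeholders for illustration, not the paper's exact pipeline.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((300, 64))            # placeholder CER feature vectors
y = rng.integers(0, 6, 300)          # emotion labels
subjects = rng.integers(0, 20, 300)  # subject ID per sample

# no subject ever appears in both the training and the test fold
accs = [SVC(kernel="linear").fit(X[tr], y[tr]).score(X[te], y[te])
        for tr, te in LeaveOneGroupOut().split(X, y, groups=subjects)]
print(f"mean subject-independent accuracy: {np.mean(accs):.3f}")
```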
The remainder of this paper is organized as follows: Section 2 presents the proposed FER method based on CER using peak and ICV face images. Section 3 presents experimental results, followed by conclusions in Section 4.
Proposed facial expression recognition method using collaborative expression representation
As shown in Fig. 1, the proposed FER method consists of five sequential steps. Given a video frame sequence, an automatic facial landmark detector is used to localize and align the face region of each video frame. The aligned face region is cropped, resulting in a face image. Peak expression face selection is performed to select the most useful expression face (called the peak expression face) within the face image sequence. To extract the facial appearance related to the expression, an ICV face is generated for each expression class from its training face images, and the CER is computed between the peak expression face and the ICV faces.
Experiment
To verify the proposed method, experiments were performed on three public databases, i.e., the CK+ [24], MMI [25], and NVIE [39] databases. The face images used in the experiments were cropped and aligned based on the two eye locations [49], using the facial landmark detection method detailed in [22]. Fig. 9 shows an example of facial landmark detection. In Fig. 9, the coordinates of the left eye and right eye are obtained by averaging the coordinates of facial landmarks No. 20–25 and the …
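For reference, eye-based cropping and alignment can be realized with a similarity transform as sketched below. The canonical eye height, eye spacing, and output size are illustrative assumptions, not the settings of [49].

```python
import numpy as np
import cv2

def align_face(img, left_eye, right_eye, out_size=128, eye_y=0.35, eye_dist=0.45):
    """Rotate and scale the image so the eyes land on a horizontal line at a
    canonical height and spacing, then crop a square face region."""
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))   # in-plane eye-line angle
    scale = (eye_dist * out_size) / max(np.hypot(rx - lx, ry - ly), 1e-6)
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)        # midpoint between the eyes
    M = cv2.getRotationMatrix2D(center, angle, scale)
    # shift so the eye midpoint maps to its canonical position in the crop
    M[0, 2] += out_size * 0.5 - center[0]
    M[1, 2] += out_size * eye_y - center[1]
    return cv2.warpAffine(img, M, (out_size, out_size), flags=cv2.INTER_LINEAR)
```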
Conclusion
In this paper, we proposed a new facial expression recognition (FER) method designed for subject-independent FER in general scenarios, including non-frontal head poses and varying illumination in face sequences. A robust method was proposed that automatically selects the most useful face image (called the peak expression face) from a video sequence. This method selects the most expressive face within a face sequence while discarding degraded faces (in terms of face alignment and frontal pose).
Conflict of interest
None declared.
Acknowledgements
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2015R1A2A2A01005724).
References (73)
- et al., Static and dynamic 3D facial expression recognition: a comprehensive survey, Image Vis. Comput. (2012)
- et al., Gabor wavelets and general discriminant analysis for face identification and verification, Image Vis. Comput. (2007)
- et al., Beyond sparsity: the role of l1-optimizer in pattern classification, Pattern Recognit. (2012)
- et al., Multi-PIE, Image Vis. Comput. (2010)
- et al., Multimodal learning for facial expression recognition, Pattern Recognit. (2015)
- et al., Spontaneous facial expression recognition: a robust metric learning approach, Pattern Recognit. (2014)
- et al., Generalization of on-line learning and an application to boosting, J. Comput. Syst. Sci. (1997)
- et al., Facial expression recognition based on local binary patterns: a comprehensive study, Image Vis. Comput. (2009)
- et al., Score normalization in multimodal biometric systems, Pattern Recognit. (2005)
- M.F. Valstar, B. Jiang, M. Mehu, M. Pantic, K. Scherer, The First Expression Recognition and Analysis Challenge, IEEE...
- A survey of affect recognition methods: audio, visual, and spontaneous expressions, IEEE Trans. Pattern Anal. Mach. Intell.
- Facial expression analysis
- A dynamic appearance descriptor approach to facial actions temporal modeling, IEEE Trans. Syst. Man Cybern. B
- Local directional number pattern for face analysis, IEEE Trans. Image Process.
- Classifying facial actions, IEEE Trans. Pattern Anal. Mach. Intell.
- Facial expression recognition in image sequences using geometric deformation features and support vector machine, IEEE Trans. Image Process.
- Facial expression recognition in perceptual color space, IEEE Trans. Image Process.
- Component-based recognition of faces and facial expressions, IEEE Trans. Affect. Comput.
- Dynamic texture recognition using local binary patterns with an application to facial expressions, IEEE Trans. Pattern Anal. Mach. Intell.
- A review on Gabor wavelets for face recognition, Pattern Anal. Appl.
- Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell.
Seung Ho Lee is currently working towards the Ph.D. degree in the Image and Video Systems Laboratory at KAIST. His research interests include facial expression recognition, face recognition, pattern recognition, and machine learning. In 2011, he was a visiting researcher at the University of Toronto in Toronto, Ontario, Canada.
Wissam J. Baddar is currently working towards the Ph.D. degree in the Image and Video Systems Laboratory at KAIST. His research interests include face recognition/detection, face expression recognition/analysis, biometrics, medical imaging, pattern recognition, and machine learning.
Yong Man Ro is a professor in the Department of Electrical Engineering at KAIST. His research interests include image/video processing, biometrics, medical imaging, and pattern recognition. He has served as an Associate Editor for IEEE Signal Processing Letters and as a TPC member of many international conferences, including as program chair of IWDW 2004.