Neurocomputing

Volume 409, 7 October 2020, Pages 341-350

Infrared facial expression recognition via Gaussian-based label distribution learning in the dark illumination environment for human emotion detection

https://doi.org/10.1016/j.neucom.2020.05.081

Highlights

  • The natural correlation ambiguity is revealed, and a novel label distribution is constructed.

  • An end-to-end learning framework of FER is proposed in both feature learning and classifier learning.

  • Experimental results demonstrate that the proposed model achieves the best performance.

Abstract

Facial expression recognition (FER), as a crucial step for emotion recognition, remains an open challenge due to individual expression correlation/ambiguity. In this paper, to tackle these challenges, a novel model with correlation emotion label distribution learning is proposed for near-infrared (NIR) facial expression recognition, which associates multiple emotions with each expression depending on the similarity of expressions. Firstly, the similarities of the seven basic expressions are calculated and then used to guide the correlation emotion label distribution by predicting the latent label probability distribution of each expression. Furthermore, the proposed model can be learned in an end-to-end manner via a constructed convolutional neural network to classify the six basic facial expressions. Experimental results on the Oulu_CASIA database demonstrate that the proposed method achieves superior performance on NIR expression recognition.

Introduction

Emotion recognition from facial expressions in human–computer interaction systems [1], [2] is one of the challenging research topics in the field of artificial intelligence and has attracted plenty of attention in recent years. However, it is difficult to achieve natural and harmonious emotional interaction with traditional interaction methods such as the keyboard, mouse, screen, and pattern input, which fall far short of the requirements for artificial intelligence [2]. The human facial expression [3] is the most important carrier of emotion perception and the most direct and obvious way of expressing emotions. Thus, facial expression recognition (FER) has important theoretical significance for improving the emotional interaction ability of computers [4], [5]. Furthermore, facial expression is arguably the most natural, powerful and immediate signal for communicating emotional states and intentions [1]. However, even with the widespread use of deep learning techniques [6], [7], automatic FER remains difficult in unconstrained real-life situations. It encounters various challenges caused by occlusion, face pose variations, illumination changes, head motion, expression ambiguity and so on. An ideal automatic FER system is supposed to be able to tackle these challenges.

It is well known that active near-infrared (NIR) (780–1100 nm) imaging [8] is an alternative method to overcome the problem of illumination variations and is robust even in near darkness. In Fig. 1, differences in facial features, such as wrinkles and texture (shown by red arrows), can be observed between the NIR images and the visible (VIS) images. The NIR images are clear and free of shadows, while some dark areas caused by self-occlusion can be found in the VIS images (Fig. 1(a)). Li et al. [9] were the first to develop an active NIR imaging system to recognize human faces under different illuminations.

Over the past two decades, many FER algorithms [10], [11], [12] have been proposed to classify the six basic emotions, namely anger (An), disgust (Di), fear (Fe), happiness (Ha), sadness (Sa) and surprise (Su), by assigning each facial image a predefined emotion category. However, it is difficult to obtain the ground truth of a facial expression in practice. Usually, approximate approaches are adopted to acquire facial expressions. For example, the Oulu_CASIA database is collected by asking subjects to make a facial expression based on an expression sample displayed in picture sequences under a laboratory-controlled environment, following the facial action coding system (FACS) [13]. However, there are factors that may cause inaccurate results. First of all, an expression is formed by the combination of multiple facial action units, and it is not guaranteed that the changes in the facial action units of different subjects are completely the same. Secondly, movements of the same facial action unit occur in different expressions. As a result, even when two facial images are labeled with the same emotion, they might correspond to quite different real emotions. Moreover, most emotions appear in a combined, mixed or composite form of basic emotions according to Plutchik's wheel of emotions theory [14]. Furthermore, humans express their feelings through a facial appearance that is often a fusion or compound of different emotions rather than a single basic feeling, and each basic emotion plays a different role in the expression. In this sense, a facial expression is ambiguous or correlative, i.e., multiple emotion labels might be needed to describe the appearance of a human face. Thus, single-label learning methods [15], [16], which identify one basic emotion per expression, may fail to describe the correlation/ambiguity among different emotions and may not be applicable to real-life expression recognition applications.

To address these problems, a multi-output Laplacian dynamic ordinal regression method was proposed by Rudovic et al. [17], which can estimate the probability of each emotion label as well as its intensity. However, it assumes that each expression has one correct emotion label and outputs the emotion with the highest probability as the result, which may fail in mixed-emotion situations. Moreover, multi-label learning (MLL) [18] is suitable for describing each expression image with several related emotions in FER tasks when each basic emotion is considered a single label. Li et al. [19] developed a database of VIS multi-label facial expressions, and preserved the manifold structure of emotion labels and the local affinity of deep features to learn discriminative features of multi-label expressions. However, MLL fails to learn the degree to which each emotion describes the expression. Gan et al. [20] employ a CNN and softened labels with a diverse ensemble that associates multiple emotions with each expression, and this has achieved impressive results. However, this method is only suitable for VIS facial images and often fails on NIR facial images, because the features of the NIR images (Fig. 1(g)–(l)) are essentially different from those of the VIS facial images (Fig. 1(a)–(f)).

Thus, a new emotion label distribution learning method is proposed for NIR FER, which assigns a value to each basic emotion to describe facial expressions. While the ground-truth emotion of a facial image is considered the most relevant label to the image, emotions close to the ground-truth emotion can be utilized to describe the facial image with lower relevance. Our proposed method allows direct modeling of the different importance of each label to an instance, and thus can better match the nature of many real practical applications.
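To make the construction concrete, the following minimal sketch (not the authors' exact implementation) builds such a Gaussian-softened label distribution from pairwise expression similarities. The per-class feature prototypes, the cosine-distance-based dissimilarity, and the bandwidth sigma = 0.4 are illustrative assumptions; only the overall recipe, cosine similarity followed by Gaussian softening and normalization, follows the paper's description.

```python
import numpy as np

# The six basic emotions used throughout the paper.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def class_similarity_matrix(class_features):
    """Pairwise cosine similarities between per-class feature prototypes.

    `class_features` is a (k, q) array whose i-th row is a feature vector
    representing the i-th emotion class (a hypothetical choice of prototype).
    """
    k = class_features.shape[0]
    S = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            S[i, j] = cosine_similarity(class_features[i], class_features[j])
    return S

def gaussian_label_distribution(true_idx, S, sigma=0.4):
    """Soften a one-hot label into a distribution over all k emotions.

    Each class receives mass from a Gaussian kernel over its cosine
    distance to the ground-truth class, so similar expressions obtain
    higher description degrees; the result is normalized to sum to 1.
    `sigma` is an assumed bandwidth, not a value from the paper.
    """
    cos_dist = 1.0 - S[true_idx]                    # distance to each class
    degrees = np.exp(-cos_dist ** 2 / (2.0 * sigma ** 2))
    return degrees / degrees.sum()

# Toy usage with random 128-d prototypes (illustration only).
rng = np.random.default_rng(0)
S = class_similarity_matrix(rng.normal(size=(len(EMOTIONS), 128)))
print(gaussian_label_distribution(EMOTIONS.index("fear"), S).round(3))
```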

In this paper, inspired by our observations, an attempt is made to reveal the correlation between different frontal facial expressions, which is dataset-independent and universal. Specifically, not only do we need to understand the emotions associated with facial expressions, but we also need to learn the degree to which each emotion describes the expression. A novel automatic FER framework is then proposed based on a constructed deep convolutional neural network (CNN) and label distribution. In the first stage, the expression feature similarity is calculated using the cosine distance between feature vectors learned from NIR FER datasets with frontal face images. Then, the constructed expression label distribution is learned via an end-to-end CNN. The contributions of our study can be summarized as follows.

  • 1)

Based on expression feature similarities, the natural correlation/ambiguity among expressions is revealed, and a novel label distribution is constructed in this paper. To the best of our knowledge, this is the first time the natural relationships among different expressions have been revealed and modeled.

  • 2)

A new end-to-end FER learning framework is proposed, which learns the correlation emotion label distribution and regresses the ground-truth expression jointly in both feature learning and classifier learning (see the training-objective sketch after this list).

  • 3)

Experimental results on active NIR public datasets demonstrate that the proposed model achieves better performance than state-of-the-art methods.
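As a rough illustration of contribution 2, the sketch below combines a KL-divergence term against the constructed label distribution with a cross-entropy term on the ground-truth expression. The specific loss form, the weighting factor alpha, and the PyTorch framing are assumptions; the paper only states that label-distribution learning and ground-truth regression are trained jointly end to end.

```python
import torch
import torch.nn.functional as F

def label_distribution_loss(logits, target_dist, target_idx, alpha=0.5):
    """Hypothetical joint objective: KL divergence between the predicted
    distribution and the constructed emotion label distribution, plus
    cross-entropy on the ground-truth expression; `alpha` is an assumed
    trade-off weight."""
    log_pred = F.log_softmax(logits, dim=1)
    # KL(target || pred), averaged over the batch.
    kl = F.kl_div(log_pred, target_dist, reduction="batchmean")
    ce = F.cross_entropy(logits, target_idx)
    return alpha * kl + (1.0 - alpha) * ce

# Toy usage: a batch of 4 samples over the 6 emotion classes.
logits = torch.randn(4, 6, requires_grad=True)
target_dist = torch.softmax(torch.randn(4, 6), dim=1)  # constructed distributions
target_idx = torch.tensor([0, 3, 5, 2])                # ground-truth emotions
loss = label_distribution_loss(logits, target_dist, target_idx)
loss.backward()
print(loss.item())
```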

The rest of this paper is organized as follows. The correlation or ambiguity of different expressions is revealed in Section 2. The details of the proposed method are presented in Section 3. Experimental results on the dataset and their analysis are provided in Section 4. Finally, Section 5 concludes this paper.


NIR facial expression recognition

NIR FER procedures can generally be divided into face acquisition, feature extraction and expression classification. Fortunately, a new state of the art in expression recognition is being driven by deep learning technology. However, it is urgently necessary to learn, from face images, features that simultaneously reflect the characteristics of real life so as to meet the requirements of practical applications. Facial expressions are generated by the contraction of facial muscles, causing temporary deformation of facial features.
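For orientation, a minimal PyTorch sketch of this conventional two-stage split (feature extraction followed by expression classification) is given below. This is not the network constructed in the paper; the layer widths, depth, and input size are placeholders.

```python
import torch
import torch.nn as nn

class TinyNIRExpressionNet(nn.Module):
    """Illustrative CNN mirroring the feature-extraction / classification
    split described above; layer widths and depth are placeholders."""

    def __init__(self, num_classes=6):
        super().__init__()
        self.features = nn.Sequential(                 # feature extraction
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, num_classes)  # expression classification

    def forward(self, x):                              # x: (B, 1, H, W) NIR crop
        return self.classifier(self.features(x).flatten(1))

# Toy forward pass on a batch of 64x64 single-channel NIR face crops.
net = TinyNIRExpressionNet()
print(net(torch.randn(2, 1, 64, 64)).shape)            # torch.Size([2, 6])
```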

Problem formulation

In the correlation emotion label distribution learning system, emotion label distribution learning is formally defined as follows. Let $\mathcal{X} = \mathbb{R}^q$ denote the input space of expressions, and let $\mathcal{Y} = \{y_1, y_2, \ldots, y_k\}$ represent the $k$ possible emotion labels corresponding to the basic expressions. Given a training set $S = \{(X_1, D_1), (X_2, D_2), \ldots, (X_n, D_n)\}$, where $D_i = \{d_{X_i}^{y_1}, d_{X_i}^{y_2}, \ldots, d_{X_i}^{y_k}\}$ is the emotion label distribution associated with $X_i$, the value $d_{X_i}^{y_j}$, named the emotion description degree, stands for the degree to which the emotion $y_j$ describes the expression instance $X_i$.
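Under the standard label distribution learning constraints (each description degree lies in [0, 1] and the k degrees sum to 1, assumed here to carry over from the general LDL setting), a training pair can be represented and validated as in the following sketch; the feature dimension and degree values are illustrative.

```python
import numpy as np

def is_valid_label_distribution(d, atol=1e-6):
    """Check the constraints assumed on an emotion label distribution D_i:
    every description degree lies in [0, 1] and the k degrees sum to 1."""
    d = np.asarray(d, dtype=float)
    return bool(np.all(d >= 0.0) and np.all(d <= 1.0)
                and np.isclose(d.sum(), 1.0, atol=atol))

# A toy training pair (X_i, D_i) for k = 6 emotions; values are illustrative.
X_i = np.random.rand(256)                       # expression feature in R^q
D_i = np.array([0.05, 0.10, 0.60, 0.05, 0.15, 0.05])
assert is_valid_label_distribution(D_i)
```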

Experiment settings

The Oulu_CASIA database [27] consists of 2880 videos from 80 subjects. Each subject is labeled with one of the basic facial expressions, namely anger, disgust, fear, happiness, sadness and surprise. Two types of cameras, NIR and VIS, are utilized to capture the video sequences. Among them, only 480 sequences are labeled by the NIR system with one of the basic facial expressions. Each video sequence starts from a neutral facial expression, and the last frame reaches the peak of the expression.
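Since each sequence runs from neutral to the expression apex, the labeled samples are typically taken from the last frames. The sketch below illustrates one such selection; the directory layout, file extension, and the choice of the last three frames are hypothetical.

```python
from pathlib import Path

def peak_frames(sequence_dir, n_last=3):
    """Return the last `n_last` frames of an expression sequence.

    Sequences start neutral and end at the expression apex, so the final
    frames are commonly taken as labeled expression samples. The layout,
    file extension, and n_last=3 are illustrative assumptions.
    """
    frames = sorted(Path(sequence_dir).glob("*.jpeg"))
    return frames[-n_last:]

# Hypothetical layout: <root>/NI/Strong/P001/Anger/000.jpeg, 001.jpeg, ...
print(peak_frames("OuluCASIA/NI/Strong/P001/Anger"))
```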

Conclusion

In this work, an end-to-end learning framework is proposed for NIR facial expression recognition under different lighting conditions. We reveal the ambiguity or correlation of different expressions, and the proposed model learns ground-truth emotion label distributions based on facial expression similarity distributions. Firstly, the cosine distance is utilized to calculate the similarities of the different expressions, and the labels are then softened into a Gaussian distribution. Then, the emotion label distribution is learned together with the expression classifier in an end-to-end manner.

CRediT authorship contribution statement

Zhaoli Zhang: Data curation. Chenghang Lai: Writing - original draft. Hai Liu: Writing - review & editing. You-Fu Li: Conceptualization, Methodology.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors sincerely thank the anonymous reviewers for their constructive comments, and thank Dr. Xiaoxuan Shen and Dr. Taihe Cao, whose suggestions helped improve this paper. This work was supported in part by the National Natural Science Foundation of China under Grant 61875068, Grant 61873220, and Grant 61505064, by the National Key Research and Development Program of China under Grant 2017YFB1401300 and Grant 2017YFB1401303, and by the Research Grants Council of Hong Kong under Project CityU 11205015 and Project


References (36)

  • N. Zeng et al.

    An improved particle filter with a novel hybrid proposal distribution for quantitative analysis of gold immunochromatographic strips

    IEEE Transactions on Nanotechnology

    (2019)
  • T. Liu et al.

Fast blind instrument function estimation method for industrial infrared spectrometers

    IEEE Transactions on Industrial Informatics

    (2018)
  • S.Z. Li et al.

    Illumination invariant face recognition using near-infrared images

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2007)
  • S. Li et al.

    Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition

    IEEE Transactions on Image Processing

    (2018)
  • S. Taheri et al.

    Structure-preserving sparse decomposition for facial expression analysis

    IEEE Transactions on Image Processing

    (2014)
  • P. Ekman, W. Friesen, J. Hager, Facial Action Coding System, Salt Lake City, UT, ...
  • R. Plutchik, A general psychoevolutionary theory of emotion, in: Theories of Emotion, Elsevier, 1980, pp....
  • O. Rudovic, V. Pavlovic, M. Pantic, Multi-output Laplacian dynamic ordinal regression for facial expression recognition...

    Zhaoli Zhang (M’16) received the M.S. degree in Computer Science from Central China Normal University, Wuhan, China, in 2004, and the Ph.D. degree in Computer Science from Huazhong University of Science and Technology in 2008. He is currently a professor in the National Engineering Research Center for E-Learning, Central China Normal University. His research interests include signal processing, knowledge services and software engineering. He is a member of IEEE and CCF (China Computer Federation).

    Chenghang Lai received the B.S. degrees from Quzhou University, Quzhou, China, in 2018. He is currently pursuing the M.S. degree with the National Engineering Research Center for E-Learning, Central China Normal University, Wuhan, under the supervision of Professor Hai Liu and Zhaoli Zhang. His research interests include facial expression recognition, image processing, computer vision, pattern recognition, and multimedia applications.

    Hai Liu (S’12–M’14) received the M.S. degree in applied mathematics from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2010, and the Ph.D. degree in pattern recognition and artificial intelligence from the same university, in 2014.

Since June 2017, he has been an Assistant Professor with the National Engineering Research Center for E-Learning, Central China Normal University, Wuhan. He was a "Hong Kong Scholar" postdoctoral fellow with the Department of Mechanical Engineering, City University of Hong Kong, Kowloon, Hong Kong, hosted by Professor You-Fu Li; he held the position for two years, until March 2019. He has authored more than 60 peer-reviewed articles in international journals across multiple domains such as pattern recognition and image processing. More than six of his articles have been selected as highly cited papers.

His current research interests include facial expression recognition, big data processing, artificial intelligence, spectral analysis, optical data processing and pattern recognition. Dr. Liu frequently serves as a reviewer for more than six international journals, including the IEEE Transactions on Industrial Informatics, IEEE Transactions on Cybernetics, IEEE/ASME Transactions on Mechatronics, and IEEE Transactions on Instrumentation and Measurement. He is also a communication evaluation expert for the National Natural Science Foundation of China.

    You-fu Li (M’91–SM’01) received the B.S. and M.S. degrees in electrical engineering from the Harbin Institute of Technology, Harbin, China, and the Ph.D. degree in robotics from the Department of Engineering Science, University of Oxford, Oxford, U.K., in 1993.

From 1993 to 1995, he was a research staff member in the Department of Computer Science, University of Wales, Aberystwyth, U.K. He joined the City University of Hong Kong, Hong Kong, in 1995, and is currently a Professor in the Department of Mechanical and Biomedical Engineering. His current research interests include robot sensing, robot vision, three-dimensional vision, and visual tracking.

    Professor Li has served as an Associate Editor of the IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING and is currently an Associate Editor of the IEEE ROBOTICS AND AUTOMATION MAGAZINE. He is an Editor of the IEEE Robotics and Automation Society Conference Editorial Board, and the IEEE Conference on Robotics and Automation.
