Neurocomputing

Volume 409, 7 October 2020, Pages 341-350

Infrared facial expression recognition via Gaussian-based label distribution learning in the dark illumination environment for human emotion detection

https://doi.org/10.1016/j.neucom.2020.05.081

Highlights

  • The natural correlation ambiguity is revealed, and a novel label distribution is constructed.

  • An end-to-end learning framework of FER is proposed in both feature learning and classifier learning.

  • Experimental results demonstrate that the proposed model achieves the best performance.

Abstract

Facial expression recognition (FER), as a crucial step for emotion recognition, remains an open challenge due to individual expression correlation/ambiguity. In this paper, to tackle these challenges, a novel model with correlation emotion label distribution learning is proposed for near-infrared (NIR) facial expression recognition, which associates multiple emotions with each expression depending on the similarity of expressions. Firstly, the similarities of the seven basic expressions are calculated and then used to guide the correlation emotion label distribution by predicting the latent label probability distribution of each expression. Furthermore, the proposed model can be learned in an end-to-end manner via a constructed convolutional neural network to classify the six basic facial expressions. Experimental results on the Oulu_CASIA database demonstrate that the proposed method achieves superior performance on NIR expression recognition.

Introduction

Emotion recognition from facial expressions in human–computer interaction systems [1], [2] is one of the challenging research topics in the field of artificial intelligence and has attracted plenty of attention in recent years. However, it is difficult to achieve natural and harmonious emotional interaction with traditional interaction methods such as the keyboard, mouse, screen, and pattern input, which fall far short of the requirements for artificial intelligence [2]. The human facial expression [3] is the most important carrier of emotion perception and the most direct and obvious way of expressing emotions. Thus, facial expression recognition (FER) has important theoretical significance for improving the emotional interaction ability of computers [4], [5]. Furthermore, facial expression is arguably the most natural, powerful and immediate signal for communicating emotional states and intentions [1]. However, even with the widespread use of deep learning techniques [6], [7], automatic FER remains difficult in unconstrained real-life situations. It encounters various challenges caused by occlusion, face pose variations, illumination changes, head motion, expression ambiguity and so on. An ideal automatic FER system is supposed to be able to tackle these challenges.

It is well known that active near-infrared (NIR) (780–1100 nm) imaging [8] is an alternative method to overcome the problem of illumination variations and is robust even in near darkness. In Fig. 1, differences in facial features, such as wrinkles and texture (shown by red arrows), can be observed between the NIR images and the visible (VIS) images. The NIR images are clear and free of shadows, while some dark areas caused by self-occlusion can be found in the VIS images (Fig. 1(a)). Li et al. [9] were the first to develop an active NIR imaging system to recognize human faces under different illuminations.

Over the past two decades, many FER algorithms [10], [11], [12] have been proposed to classify the six basic emotions, namely anger (An), disgust (Di), fear (Fe), happiness (Ha), sadness (Sa) and surprise (Su), by assigning each facial image a predefined emotion category. However, it is difficult to obtain the ground truth of a facial expression in practice. Usually, approximate approaches are adopted to acquire facial expressions. For example, the Oulu_CASIA database is collected by asking subjects to make a facial expression based on an expression sample displayed in picture sequences under a laboratory-controlled environment, following the facial action coding system (FACS) [13]. However, there are factors that may cause inaccurate results. First of all, an expression is formed by the combination of multiple facial action units, and it is not guaranteed that the changes in the facial action units of different subjects are completely the same. Secondly, movements of the same facial action unit occur in different expressions. As a result, even when two facial images are labeled with the same emotion, they might correspond to quite different real emotions. Moreover, most emotions appear in a combined, mixed or composite form of basic emotions according to Plutchik's wheel of emotions theory [14]. Furthermore, humans express their feelings through a facial appearance that is often a fusion or compound of different emotions rather than a single basic feeling, and each basic emotion plays a different role in the expression. In this sense, a facial expression is ambiguous or correlative, i.e., multiple emotion labels might be needed to describe the appearance of a human face. Thus, single-label learning methods [15], [16], which identify one basic emotion per expression, may fail to describe the correlation/ambiguity among different emotions and may not be applicable to real-life expression recognition applications.

To address these problems, a multi-output Laplacian dynamic ordinal regression method was proposed by Rudovic et al. [17], which can estimate the probability of each emotion label as well as its intensity. However, it assumes that each expression has one correct emotion label and outputs the emotion with the highest probability as the result, which may fail in mixed-emotion situations. Moreover, multi-label learning (MLL) [18] is suitable for describing each expression image with several related emotions in FER tasks when each basic emotion is considered a single label. Li et al. [19] developed a database of VIS multi-label facial expressions, and preserved the manifold structure of emotion labels and the local affinity of deep features to learn discriminative features of multi-label expressions. However, MLL fails to learn the degree to which each emotion describes the expression. Gan et al. [20] employ a CNN and softened labels with a diverse ensemble that associates multiple emotions with each expression, and this has achieved impressive results. However, this method is only suitable for VIS facial images and often fails on NIR facial images, because the features of the NIR images (Fig. 1(g)–(l)) are essentially different from those of the VIS facial images (Fig. 1(a)–(f)).

Thus, a new emotion label distribution learning method is proposed for NIR FER, which assigns a value to each basic emotion to describe facial expressions. While the ground-truth emotion of a facial image is considered the most relevant label to the image, emotions close to the ground-truth emotion can be utilized to describe the facial image with lower relevance. Our proposed method allows direct modeling of the different importance of each label to an instance, and thus can better match the nature of many real practical applications.
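To make the construction concrete, the following minimal sketch (not the authors' exact implementation) builds such a Gaussian-softened label distribution from pairwise expression similarities. The per-class feature prototypes, the cosine-distance-based dissimilarity, and the bandwidth sigma = 0.4 are illustrative assumptions; only the overall recipe, cosine similarity followed by Gaussian softening and normalization, follows the paper's description.

```python
import numpy as np

# The six basic emotions used throughout the paper.
EMOTIONS = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def class_similarity_matrix(class_features):
    """Pairwise cosine similarities between per-class feature prototypes.

    `class_features` is a (k, q) array whose i-th row is a feature vector
    representing the i-th emotion class (a hypothetical choice of prototype).
    """
    k = class_features.shape[0]
    S = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            S[i, j] = cosine_similarity(class_features[i], class_features[j])
    return S

def gaussian_label_distribution(true_idx, S, sigma=0.4):
    """Soften a one-hot label into a distribution over all k emotions.

    Each class receives mass from a Gaussian kernel over its cosine
    distance to the ground-truth class, so similar expressions obtain
    higher description degrees; the result is normalized to sum to 1.
    `sigma` is an assumed bandwidth, not a value from the paper.
    """
    cos_dist = 1.0 - S[true_idx]                    # distance to each class
    degrees = np.exp(-cos_dist ** 2 / (2.0 * sigma ** 2))
    return degrees / degrees.sum()

# Toy usage with random 128-d prototypes (illustration only).
rng = np.random.default_rng(0)
S = class_similarity_matrix(rng.normal(size=(len(EMOTIONS), 128)))
print(gaussian_label_distribution(EMOTIONS.index("fear"), S).round(3))
```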

In this paper, inspired by our observations, an attempt is made to reveal the correlation between different frontal facial expressions, which is dataset-independent and universal. Specifically, not only do we need to understand the emotions associated with facial expressions, but we also need to learn the degree to which each emotion describes the expression. A novel automatic FER framework is then proposed based on a constructed deep convolutional neural network (CNN) and label distribution. In the first stage, the expression feature similarity is calculated using the cosine distance between feature vectors learned from NIR FER datasets with frontal face images. Then, the constructed expression label distribution is learned via an end-to-end CNN. The contributions of our study can be summarized as follows.

  • 1)

Based on expression feature similarities, the natural correlation/ambiguity among expressions is revealed, and a novel label distribution is constructed in this paper. To the best of our knowledge, this is the first time the natural relationships among different expressions have been revealed and modeled.

  • 2)

A new end-to-end FER learning framework is proposed, which learns the correlation emotion label distribution and regresses the ground-truth expression jointly in both feature learning and classifier learning (see the training-objective sketch after this list).

  • 3)

Experimental results on active NIR public datasets demonstrate that the proposed model achieves better performance than state-of-the-art methods.
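As a rough illustration of contribution 2, the sketch below combines a KL-divergence term against the constructed label distribution with a cross-entropy term on the ground-truth expression. The specific loss form, the weighting factor alpha, and the PyTorch framing are assumptions; the paper only states that label-distribution learning and ground-truth regression are trained jointly end to end.

```python
import torch
import torch.nn.functional as F

def label_distribution_loss(logits, target_dist, target_idx, alpha=0.5):
    """Hypothetical joint objective: KL divergence between the predicted
    distribution and the constructed emotion label distribution, plus
    cross-entropy on the ground-truth expression; `alpha` is an assumed
    trade-off weight."""
    log_pred = F.log_softmax(logits, dim=1)
    # KL(target || pred), averaged over the batch.
    kl = F.kl_div(log_pred, target_dist, reduction="batchmean")
    ce = F.cross_entropy(logits, target_idx)
    return alpha * kl + (1.0 - alpha) * ce

# Toy usage: a batch of 4 samples over the 6 emotion classes.
logits = torch.randn(4, 6, requires_grad=True)
target_dist = torch.softmax(torch.randn(4, 6), dim=1)  # constructed distributions
target_idx = torch.tensor([0, 3, 5, 2])                # ground-truth emotions
loss = label_distribution_loss(logits, target_dist, target_idx)
loss.backward()
print(loss.item())
```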

The rest of this paper is organized as follows. The correlation or ambiguity of different expressions is revealed in Section 2. The details of the proposed method are presented in Section 3. Experimental results on the dataset and their analysis are provided in Section 4. Finally, Section 5 concludes this paper.


NIR facial expression recognition

NIR FER procedures can generally be divided into face acquisition, feature extraction and expression classification. Fortunately, a new state of the art in expression recognition is being driven by deep learning technology. However, it is urgently necessary to learn, from face images, features that simultaneously reflect the characteristics of real life so as to meet the requirements of practical applications. Facial expressions are generated by the contraction of facial muscles, causing temporary deformation of facial features.
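For orientation, a minimal PyTorch sketch of this conventional two-stage split (feature extraction followed by expression classification) is given below. This is not the network constructed in the paper; the layer widths, depth, and input size are placeholders.

```python
import torch
import torch.nn as nn

class TinyNIRExpressionNet(nn.Module):
    """Illustrative CNN mirroring the feature-extraction / classification
    split described above; layer widths and depth are placeholders."""

    def __init__(self, num_classes=6):
        super().__init__()
        self.features = nn.Sequential(                 # feature extraction
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(128, num_classes)  # expression classification

    def forward(self, x):                              # x: (B, 1, H, W) NIR crop
        return self.classifier(self.features(x).flatten(1))

# Toy forward pass on a batch of 64x64 single-channel NIR face crops.
net = TinyNIRExpressionNet()
print(net(torch.randn(2, 1, 64, 64)).shape)            # torch.Size([2, 6])
```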

Problem formulation

In the correlation emotion label distribution learning system, emotion label distribution learning is formally defined as follows. Let $\mathcal{X} = \mathbb{R}^q$ denote the input space of expressions, and let $\mathcal{Y} = \{y_1, y_2, \ldots, y_k\}$ represent the $k$ possible emotion labels corresponding to the basic expressions. Given a training set $S = \{(X_1, D_1), (X_2, D_2), \ldots, (X_n, D_n)\}$, where $D_i = \{d_{X_i}^{y_1}, d_{X_i}^{y_2}, \ldots, d_{X_i}^{y_k}\}$ is the emotion label distribution associated with $X_i$, the value $d_{X_i}^{y_j}$, named the emotion description degree, stands for the degree to which the emotion $y_j$ describes the expression instance $X_i$.
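Under the standard label distribution learning constraints (each description degree lies in [0, 1] and the k degrees sum to 1, assumed here to carry over from the general LDL setting), a training pair can be represented and validated as in the following sketch; the feature dimension and degree values are illustrative.

```python
import numpy as np

def is_valid_label_distribution(d, atol=1e-6):
    """Check the constraints assumed on an emotion label distribution D_i:
    every description degree lies in [0, 1] and the k degrees sum to 1."""
    d = np.asarray(d, dtype=float)
    return bool(np.all(d >= 0.0) and np.all(d <= 1.0)
                and np.isclose(d.sum(), 1.0, atol=atol))

# A toy training pair (X_i, D_i) for k = 6 emotions; values are illustrative.
X_i = np.random.rand(256)                       # expression feature in R^q
D_i = np.array([0.05, 0.10, 0.60, 0.05, 0.15, 0.05])
assert is_valid_label_distribution(D_i)
```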

Experiment settings

The Oulu_CASIA database [27] consists of 2880 videos from 80 subjects. Each subject is labeled with one of the basic facial expressions, namely anger, disgust, fear, happiness, sadness and surprise. Two types of cameras, NIR and VIS, are utilized to capture the video sequences. Among them, only 480 sequences are labeled by the NIR system with one of the basic facial expressions. Each video sequence starts from a neutral facial expression, and the last frame reaches the peak of the expression.
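Since each sequence runs from neutral to the expression apex, the labeled samples are typically taken from the last frames. The sketch below illustrates one such selection; the directory layout, file extension, and the choice of the last three frames are hypothetical.

```python
from pathlib import Path

def peak_frames(sequence_dir, n_last=3):
    """Return the last `n_last` frames of an expression sequence.

    Sequences start neutral and end at the expression apex, so the final
    frames are commonly taken as labeled expression samples. The layout,
    file extension, and n_last=3 are illustrative assumptions.
    """
    frames = sorted(Path(sequence_dir).glob("*.jpeg"))
    return frames[-n_last:]

# Hypothetical layout: <root>/NI/Strong/P001/Anger/000.jpeg, 001.jpeg, ...
print(peak_frames("OuluCASIA/NI/Strong/P001/Anger"))
```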

Conclusion

In this work, an end-to-end learning framework is proposed for NIR facial expression recognition under different lighting conditions. We reveal the ambiguity or correlation of different expressions, and the proposed model learns ground-truth emotion label distributions based on facial expression similarity distributions. Firstly, the cosine distance is utilized to calculate the similarities of the different expressions, and the labels are then softened into a Gaussian distribution. Then, the emotion label distribution is learned together with the expression classifier in an end-to-end manner.

CRediT authorship contribution statement

Zhaoli Zhang: Data curation. Chenghang Lai: Writing - original draft. Hai Liu: Writing - review & editing. You-Fu Li: Conceptualization, Methodology.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

The authors sincerely thank the anonymous reviewers for their constructive comments, and thank Dr. Xiaoxuan Shen and Dr. Taihe Cao, whose suggestions helped improve this paper. This work was supported in part by the National Natural Science Foundation of China under Grant 61875068, Grant 61873220, and Grant 61505064, by the National Key Research and Development Program of China under Grant 2017YFB1401300 and Grant 2017YFB1401303, and by the Research Grants Council of Hong Kong under Project CityU 11205015 and Project


References (36)

  • N. Zeng et al.

    An improved particle filter with a novel hybrid proposal distribution for quantitative analysis of gold immunochromatographic strips

    IEEE Transactions on Nanotechnology

    (2019)
  • T. Liu et al.

Fast blind instrument function estimation method for industrial infrared spectrometers

    IEEE Transactions on Industrial Informatics

    (2018)
  • S.Z. Li et al.

    Illumination invariant face recognition using near-infrared images

    IEEE Transactions on Pattern Analysis and Machine Intelligence

    (2007)
  • S. Li et al.

    Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition

    IEEE Transactions on Image Processing

    (2018)
  • S. Taheri et al.

    Structure-preserving sparse decomposition for facial expression analysis

    IEEE Transactions on Image Processing

    (2014)
  • P. Ekman, W. Friesen, J. Hager, Facial Action Coding System, Salt Lake City, UT, ...
  • R. Plutchik, A general psychoevolutionary theory of emotion, in: Theories of Emotion, Elsevier, 1980, pp....
  • O. Rudovic, V. Pavlovic, M. Pantic, Multi-output Laplacian dynamic ordinal regression for facial expression recognition...

    Zhaoli Zhang (M’16) received the M.S. degree in Computer Science from Central China Normal University, Wuhan, China, in 2004, and the Ph.D. degree in Computer Science from Huazhong University of Science and Technology in 2008. He is currently a professor in the National Engineering Research Center for E-Learning, Central China Normal University. His research interests include signal processing, knowledge services and software engineering. He is a member of IEEE and CCF (China Computer Federation).

    Chenghang Lai received the B.S. degrees from Quzhou University, Quzhou, China, in 2018. He is currently pursuing the M.S. degree with the National Engineering Research Center for E-Learning, Central China Normal University, Wuhan, under the supervision of Professor Hai Liu and Zhaoli Zhang. His research interests include facial expression recognition, image processing, computer vision, pattern recognition, and multimedia applications.

    Hai Liu (S’12–M’14) received the M.S. degree in applied mathematics from Huazhong University of Science and Technology (HUST), Wuhan, China, in 2010, and the Ph.D. degree in pattern recognition and artificial intelligence from the same university, in 2014.

Since June 2017, he has been an Assistant Professor with the National Engineering Research Center for E-Learning, Central China Normal University, Wuhan. He was a "Hong Kong Scholar" postdoctoral fellow with the Department of Mechanical Engineering, City University of Hong Kong, Kowloon, Hong Kong, hosted by Professor You-Fu Li; he held the position for two years, until March 2019. He has authored more than 60 peer-reviewed articles in international journals across multiple domains such as pattern recognition and image processing. More than six of his articles have been selected as highly cited papers.

His current research interests include facial expression recognition, big data processing, artificial intelligence, spectral analysis, optical data processing and pattern recognition. Dr. Liu frequently serves as a reviewer for more than six international journals, including the IEEE Transactions on Industrial Informatics, IEEE Transactions on Cybernetics, IEEE/ASME Transactions on Mechatronics, and IEEE Transactions on Instrumentation and Measurement. He is also a communication evaluation expert for the National Natural Science Foundation of China.

    You-fu Li (M’91–SM’01) received the B.S. and M.S. degrees in electrical engineering from the Harbin Institute of Technology, Harbin, China, and the Ph.D. degree in robotics from the Department of Engineering Science, University of Oxford, Oxford, U.K., in 1993.

From 1993 to 1995, he was a research staff member in the Department of Computer Science, University of Wales, Aberystwyth, U.K. He joined the City University of Hong Kong, Hong Kong, in 1995, and is currently a Professor in the Department of Mechanical and Biomedical Engineering. His current research interests include robot sensing, robot vision, three-dimensional vision, and visual tracking.

    Professor Li has served as an Associate Editor of the IEEE TRANSACTIONS ON AUTOMATION SCIENCE AND ENGINEERING and is currently an Associate Editor of the IEEE ROBOTICS AND AUTOMATION MAGAZINE. He is an Editor of the IEEE Robotics and Automation Society Conference Editorial Board, and the IEEE Conference on Robotics and Automation.
