1 Introduction

The face is one of the most powerful channels of nonverbal communication [3, 5]. Facial expression provides cues about emotion, intention, alertness, pain, and personality; it regulates interpersonal behavior [4]; and it communicates psychiatric [8] and biomedical status [10], among other functions.

There has been increasing interest in automated facial expression analysis within the computer vision and machine learning communities. Several applications for related technologies exist: distracted driver detection [27], emotional response measurement for advertising [23, 25], and human-robot collaboration [2] are just some possibilities.

Given the time-consuming nature of manual facial expression coding and the alluring possibilities of the aforementioned applications, recent research has pursued computerized systems capable of automatically analyzing facial expressions. The dominant approach adopted by these researchers has been to identify a number of fiducial points on the face, extract hand-crafted or learned features that characterize the appearance of the skin, and train classifiers in a supervised manner to detect the absence or presence of expressions.

Recently, deep learning based solutions have been proposed for coding holistic facial expressions and facial action units. Li et al. [21] used a convolutional neural network (CNN) based deep representation of facial 3D geometric and 2D photometric attributes for recognizing holistic facial expressions. Liu et al. [22] proposed an Action Unit aware deep architecture to learn local appearance variations on the face and constructed a group-wise sub-network to code facial expressions. Xu et al. [28] explored transfer learning of high-level features from face identification data to holistic facial expression recognition. Only recently did Jaiswal and Valstar [12] propose a deep learning approach for recognizing facial action units under uncontrolled conditions. Action Units were coded using a memory network that jointly learns shape, appearance and dynamics in a deep manner.

Even though significant progress has been made [7], the current state-of-the-art science is still limited in several key respects. Stimuli to elicit spontaneous facial actions have been highly controlled and camera orientation has been frontal with little or no variation in head pose. Head motion and orientation to the camera are important if AU detection is to be accomplished in social settings where facial expressions often co-occur with head motion [1, 17]. As the head pose moves away from frontal, parts of the face may become self-occluded and the classifier’s ability to measure expressions degrades. Here, we study the efficiency of a novel deep learning method for AU detection under large head poses.

This paper advances two main novelties:

  • AU Detection under Large Head Poses with 3D Augmentation. In our work we use the BP4D-Spontaneous dataset and its extension, detailed in Sects. 2.2 and 2.3. An augmented dataset was created using the 3D information and renderings of the faces over a broad range of yaw and pitch rotations. We show that performance remains high for networks trained around different pose directions, opening the door for a number of useful applications.

  • Selective Gradient Descent Optimization. Threshold performance metrics (such as the \(F_1\) score) are piecewise-constant functions and including them directly in the CNN cost function would degrade the convergence of the optimization method. In our algorithm, we combined gradient descent with selective methods to overcome this issue. This approach results in a small but highly effective network that outperforms the more complex state-of-the-art systems.

The paper is organized as follows. The method section (Sect. 2) contains the overview of the architecture (Sect. 2.1), the description of the database (Sect. 2.2), its extension (Sect. 2.3), the facial landmark tracking method (Sect. 2.4) and the deep learning components (Sect. 2.5). These descriptions are followed by our results (Sect. 3) and the related discussion (Sect. 4). We conclude in the last section (Sect. 5).

2 Methods

2.1 Architecture

The main pre-processing steps, such as face detection, mesh fitting, and pose estimation, are depicted in Fig. 1. Details follow below.

Fig. 1. Overview of the system. The original image underwent face tracking and was pre-processed in three different ways: Histogram of Gradients (HoG), similarity normalized (scaled and cropped), and cut and put together from patches around landmark positions (Mosaic). Training methods included Support Vector Machine (SVM) and Convolutional Neural Networks (CNN) in single- and multi-label versions.

2.2 BP4D-Spontaneous Dataset

We used the BP4D-Spontaneous dataset [31] from the FERA 2015 Challenge [26]. This database includes digital video of 41 participants (56.1 % female, 49.1 % white, ages 18–29). These individuals were recruited from the departments of psychology and computer science and from the school of engineering at Binghamton University. All participants gave informed consent to the procedures and permissible uses of their data. Participants sat approximately 51 in. in front of a Di3D dynamic face capturing system during a series of eight emotion elicitation tasks. Target emotional expressions include anxiety, surprise, embarrassment, fear, pain, anger, and disgust. Example tasks include being surprised by a loud sound, submerging a hand in ice water, and smelling rotten meat. For each task, the 20-second segment with the highest AU density was identified; this segment then was coded for AU onset (start) and offset (end) by certified and reliable FACS coders.

The FERA 2015 Challenge [26] employed the 41 subjects from the BP4D-Spontaneous dataset [31] as a training set. In this paper we refer to this subset as the "Train" set. Additional videos from 20 subjects were collected using the same setup and were used for testing in the challenge [26]. In this paper we refer to this subset as the "Test" set.

2.3 Database Extension

The subjects in the BP4D-Spontaneous dataset exhibit only a moderate level of head movements in the video sequences. The dataset [31] comes with frame-level high-resolution 3D models. To validate the proposed method on larger viewpoint angles, an augmented dataset has been created using the 3D information and renderings of the faces with different yaw and pitch rotations. We used all the FACS coded data to synthesize the rotated views.
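The following sketch illustrates only the geometric part of such a view synthesis, rotating mesh vertices by yaw and pitch and projecting them orthographically; the mesh array, the angle grid and the projection are illustrative assumptions, not the exact rendering pipeline used for the augmented dataset.

```python
# Minimal sketch: rotate a 3D face mesh by yaw/pitch and project it to 2D.
import numpy as np

def rotation_matrix(yaw_deg, pitch_deg):
    """Rotation about the vertical axis (yaw) followed by the horizontal axis (pitch)."""
    y, p = np.radians(yaw_deg), np.radians(pitch_deg)
    R_yaw = np.array([[ np.cos(y), 0, np.sin(y)],
                      [ 0,         1, 0        ],
                      [-np.sin(y), 0, np.cos(y)]])
    R_pitch = np.array([[1, 0,          0         ],
                        [0, np.cos(p), -np.sin(p)],
                        [0, np.sin(p),  np.cos(p)]])
    return R_pitch @ R_yaw

def project_rotated_view(vertices, yaw_deg, pitch_deg):
    """Rotate head-centered mesh vertices (N x 3) and project orthographically
    onto the image plane (keep x, y; drop depth)."""
    rotated = vertices @ rotation_matrix(yaw_deg, pitch_deg).T
    return rotated[:, :2]

# Example: views every 18 degrees of yaw from frontal to profile.
vertices = np.random.randn(30000, 3)      # stand-in for one frame-level face mesh
views = {yaw: project_rotated_view(vertices, yaw, 0) for yaw in (0, 18, 36, 54, 72)}
```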

2.4 Facial Landmark Tracking and Face Normalization

The first step in automatically detecting AUs was to locate the face and facial landmarks. Landmarks refer to points that define the shape of permanent facial features, such as the eyes and lips. This step was accomplished using the ZFace tracker [14, 15], a generic tracker that requires no individualized training to track facial landmarks of persons it has never seen before. It locates the two- and three-dimensional coordinates of the main fiducial landmarks in each image. These landmarks correspond to important facial points such as the eye and mouth corners, the tip of the nose, and the eyebrows. The moderate level of rigid head motion exhibited by the subjects in the BP4D-Spontaneous dataset was minimized as follows: facial images were warped to the average pose and face using a similarity transformation on the tracked facial landmarks. The average face was normalized to a 100-pixel inter-ocular distance and the normalized images were cropped to \(256\times 256\) pixels. This procedure created a common space, where variation in head size and orientation would not confound the measurement of facial actions.
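A minimal sketch of this normalization step follows, assuming scikit-image; `tracked_landmarks` (ZFace output for one frame) and `reference_landmarks` (the average face, pre-scaled to a 100-pixel inter-ocular distance in the 256 x 256 output frame) are placeholder arrays, not the actual data.

```python
# Similarity normalization: align a face image to the average face via the landmarks.
import numpy as np
from skimage import transform as tf

def similarity_normalize(image, tracked_landmarks, reference_landmarks,
                         output_size=(256, 256)):
    """Warp `image` so its landmarks align with the reference (average) face."""
    tform = tf.SimilarityTransform()
    # Estimate the transform mapping reference coordinates to image coordinates;
    # warp() uses exactly this (output -> input) mapping to pull pixels.
    tform.estimate(reference_landmarks, tracked_landmarks)
    return tf.warp(image, tform, output_shape=output_size)  # float image in [0, 1]
```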

2.5 Deep Learning

Deep learning aims to overcome the curse of dimensionality problem of MLPs via a number of architectural inventions. The increase of the number of layers lessens the transformational tasks of each layer. Rectified linear units (ReLUs) are favoured, since their sensitive range is large, the rectification can efficiently shatter the space, and supervised training does not require unsupervised pre-training (see [9] and the references therein).

Layers of the Network. Convolutional layers are another efficient innovation. They are particularly useful for images. One can view each layer as a set of trainable template matchings [6]. Convolutional layers have the following attractive properties: (a) The templates (also called filters) can be matched at each pixel of the image relatively quickly due to the convolution operation itself [20]. The result for each filter is called a feature map. (b) While the number of neurons can be large, the number of variables, the weights, is kept low, saving memory and mitigating the curse of dimensionality problem. (c) Each convolutional layer may be followed by a subsampling layer. The role of this step is to decrease the number of units, which scales as the product of the dimension of the input of that layer and the number of filters. Max-pooling, which selects the largest response in each pooling region, is one of the preferred methods. The effective result of pooling is that the precision of the feature map degrades, which is nicely compensated by the number of feature maps and the option of further convolutional processing steps without an explosion in the number of units. Subsampling also reduces overfitting. For more details, see [19] and the references therein.

Convolutional networks typically add densely connected layers after the convolutional layers, often made of ReLUs. Our architecture is sketched in Fig. 2.

Fig. 2. Deep neural network, main components: convolutional layers with ReLUs (CL), pooling layers (PL), fully connected layers (FC), output layer with logistic regression (OL). There are two versions. (a): CL-PL-CL-PL-FC-FC-OL; (b): an additional CL between the second PL and the first FC layer.

We used typical regularization, stabilization, early stopping, and local-minima-avoiding procedures [24] with a reasonably small network, and we found that larger networks would not improve performance considerably. The parameters and some procedures of the architecture are as follows (a code sketch of items (a)-(e) follows the list):

  (a) The dimension of the input layer is \(256 \times 256\). The original three-channel color images were converted to a single grayscale channel and the values were scaled between 0 and 1.

  (b) The first and second convolutional cascades have 16 filters each, of \(5 \times 5\) pixels in the first and \(4 \times 4\) pixels in the second cascade. The stride was 1 in both cases. Max pooling was \(4 \times 4\) and \(2 \times 2\), applied with a stride of 4 and 2, respectively. Occasionally a third convolutional layer with 16 filters of \(3\times 3\) pixels each was added when the representational power of the architecture was in question (Fig. 2).

  (c) There are two densely connected layers with 2,000 ReLU units each.

  (d) The output is a sigmoid layer for the action units. Special procedures include dropout with a 50 % rate before the two dense layers and the sigmoid layer. Gradient training is controlled by Adamax (see later). The minibatch size is 500.

  (e) The cost function to be minimized is the sum of two terms: a regularizing \(\ell _2\) norm on the weights and the binary cross-entropy cost on the outputs of the network. The latter is the average of the cross-entropies over the sample: assume that we have \(1 \le n \le N\) samples with binary labels \(y_n \in \{0, 1\}\) and network responses \(\hat{y}_n\) for all n. The loss function is

    $$\begin{aligned} J(\hat{y}_1, \ldots , \hat{y}_N) = -\frac{1}{N} \sum _{n=1}^N \left[ y_n \log \hat{y}_n + (1-y_n) \log (1-\hat{y}_n)\right] , \end{aligned}$$
    (1)

    where the proper range of the estimate is warranted by the logistic function: \(\hat{y}_n(z)=1/(1+e^{-\theta z})\), with z being the input to the \(n^{th}\) output unit and \(\theta \) being a trainable parameter.
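The sketch below expresses items (a)-(e) with Lasagne/Theano, the libraries named under "Applied Software"; the learning rate, the \(\ell _2\) weight and the number of output units are placeholder values chosen for illustration, not reported settings.

```python
# Sketch of the network of Fig. 2 (version (a)) and the cost function of Eq. (1).
import theano.tensor as T
import lasagne
from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            DenseLayer, DropoutLayer, get_output, get_all_params)
from lasagne.nonlinearities import rectify, sigmoid

def build_network(n_aus, input_var):
    net = InputLayer((None, 1, 256, 256), input_var=input_var)           # (a)
    net = Conv2DLayer(net, 16, (5, 5), stride=1, nonlinearity=rectify)   # (b) CL 1
    net = MaxPool2DLayer(net, pool_size=(4, 4))                          # (b) PL 1
    net = Conv2DLayer(net, 16, (4, 4), stride=1, nonlinearity=rectify)   # (b) CL 2
    net = MaxPool2DLayer(net, pool_size=(2, 2))                          # (b) PL 2
    net = DropoutLayer(net, p=0.5)                                       # (d) dropout
    net = DenseLayer(net, 2000, nonlinearity=rectify)                    # (c) FC 1
    net = DropoutLayer(net, p=0.5)
    net = DenseLayer(net, 2000, nonlinearity=rectify)                    # (c) FC 2
    net = DropoutLayer(net, p=0.5)
    return DenseLayer(net, n_aus, nonlinearity=sigmoid)                  # (d) output

input_var, target_var = T.tensor4('X'), T.matrix('y')
network = build_network(n_aus=11, input_var=input_var)   # 11 AUs, as listed in Sect. 3
prediction = get_output(network)

# (e): binary cross-entropy averaged over the minibatch, plus l2 regularization.
loss = lasagne.objectives.binary_crossentropy(prediction, target_var).mean()
loss += 1e-4 * lasagne.regularization.regularize_network_params(
    network, lasagne.regularization.l2)

params = get_all_params(network, trainable=True)
updates = lasagne.updates.adamax(loss, params, learning_rate=0.002)
```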

Early Stopping. Training stops early if performance on the validation set does not improve for, say, m epochs. This makes overfitting less probable. In our case, \(m=5\) was chosen. The \(F_1\) score is the typical measure for face-related estimation tasks. However, the \(F_1\) score has discontinuities and constant regions, making it unsuitable for gradient-based methods. Our approach to overcoming this problem is the following: we computed the gradient for the binary cross-entropy, but used the \(F_1\) score as the performance measure in the validation step. This way, the early stopping of gradient descent was guided by the \(F_1\) score itself. The high quality of the results that we reached with a relatively simple network may be partially due to this procedure.
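A minimal sketch of this selection step follows, assuming hypothetical `train_one_epoch` and `predict` helpers; the gradient itself acts on the binary cross-entropy, while the \(F_1\) score on the validation set only decides when to stop.

```python
# Early stopping guided by the validation F1 score (patience m = 5 epochs).
import numpy as np
from sklearn.metrics import f1_score

def train_with_f1_early_stopping(train_one_epoch, predict,
                                 X_val, y_val, max_epochs=200, patience=5):
    best_f1, stalled, best_params = -np.inf, 0, None
    for epoch in range(max_epochs):
        params = train_one_epoch()                    # cross-entropy gradient steps
        y_hat = (predict(X_val) > 0.5).astype(int)    # threshold the sigmoid outputs
        # Mean F1 over the action units on the validation set.
        val_f1 = np.mean([f1_score(y_val[:, k], y_hat[:, k])
                          for k in range(y_val.shape[1])])
        if val_f1 > best_f1:
            best_f1, best_params, stalled = val_f1, params, 0
        else:
            stalled += 1
            if stalled >= patience:                   # validation F1 stalled: stop
                break
    return best_params, best_f1
```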

Cross-Validation. All the results and methods reported on the “Train” set have been validated with a 10-fold, subject-independent cross-validation. In the other experiments we trained on the “Train” set and reported performance measures on the “Test” set, following the challenge protocol [26].

Details of the Backpropagation Algorithm. Beyond the advances of GPU technology and deep learning architectures, error backpropagation has also undergone fast and efficient changes. We used one of the most recent methods, called Adamax [18]. It is a version of the Adam algorithm, a first-order gradient-based optimization method designed for stochastic objective functions that exploits adaptive estimates of lower-order moments. Adam estimates the \(\ell _2\) norm of the current and past gradients; if the gradients are small, the step size is made larger, and vice versa (inverse proportionality). Adamax generalizes the \(\ell _2\) norm to the \(\ell _p\) norm and takes the \(p \rightarrow \infty \) limit. For more details, see [18].
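For reference, the Adamax parameter update of [18] written out in NumPy, making the \(p \rightarrow \infty \) (infinity-norm) step explicit; the hyperparameter values are the defaults suggested in that paper, and the small constant in the denominator is only added here for numerical safety.

```python
# One Adamax update step (Kingma & Ba [18]).
import numpy as np

def adamax_step(theta, grad, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """theta: parameters, grad: current gradient, m: first-moment estimate,
    u: exponentially weighted infinity norm of past gradients, t: step (1-based)."""
    m = beta1 * m + (1 - beta1) * grad           # biased first-moment estimate
    u = np.maximum(beta2 * u, np.abs(grad))      # l_inf norm replaces Adam's l_2 norm
    theta = theta - (alpha / (1 - beta1 ** t)) * m / (u + 1e-8)
    return theta, m, u
```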

Applied Software. There are many implementations of deep learning, mostly based on Python or C++. The link http://deeplearning.net/software_links/ is currently a good starting point for a comprehensive list of software tools. We used Lasagne, a lightweight library built on top of Theano. Theano (http://deeplearning.net/software/theano) has been developed by the Montreal Institute for Learning Algorithms; it is a symbolic expression compiler, written in Python, that works both on CPU and on GPU.

3 Results

First, we evaluated the performance on the FERA Train set, employing a 10-fold, subject-independent CV. According to Table 1, the HoG-based SVM is the best for AU14 and AU15, with the largest margin for AU15; the representation around the decision surface seems superior for these AUs. For the other AUs, SNI-based CNNs with single-AU classification are better. Multi-label classification is somewhat worse for almost all AUs, but note that its evaluation is faster: in the single-AU case, evaluation time scales linearly with the number of AUs.

Table 1. \(F_1\) measures on the FERA BP4D Train set with different classifiers (C), input features (IF) and output label (OL) structures. The input features are Histogram of Gradients (HOG), mosaic images (MI), and similarity normalized images (SNI). The output structures are either single- (S) or multi-label (M).
Table 2. Results on the FERA BP4D Test set with multi-label CNN and SNI. Performance measures include \(F_1\) score, its skew normalized version (\(F_{1}^{s.n.}\)) [13], and area under ROC curve (AUC). The table shows the degree of skew (ratio of negative and positive labels) for each AU.

In the next experiment we trained the system on the FERA 2015 Train set and tested it on the Test set. The AU base-rates are significantly different on these subsets [26] and the \(F_1\) score is attenuated by skewed distributions [13]. For this reason we report the degree of skew, the \(F_1\) score, its skew normalized version (\(F_{1}^{s.n.}\)) [13], and the area under the receiver operating characteristic (ROC) curve. These values are shown in Table 2 for the FERA BP4D Test set, where the skew parameters range between 1 and 20.
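A sketch of these per-AU measures follows, assuming scikit-learn; the skew-normalized \(F_1\) here reflects our reading of [13], i.e. the negative-class counts are rescaled by the inverse skew to estimate the score under balanced class priors.

```python
# Per-AU performance measures: skew, F1, skew-normalized F1, and AUC.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def au_metrics(y_true, y_score, threshold=0.5):
    y_pred = (y_score > threshold).astype(int)
    skew = np.sum(y_true == 0) / np.sum(y_true == 1)   # negatives per positive
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    f1_skew_norm = 2 * tp / (2 * tp + fn + fp / skew)  # balanced-prior estimate
    return {'skew': skew,
            'F1': f1_score(y_true, y_pred),
            'F1_s.n.': f1_skew_norm,
            'AUC': roc_auc_score(y_true, y_score)}
```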

Fig. 3. \(F_1\) measures as a function of yaw rotation on the augmented BP4D Train set, using the single-label classifier.

Fig. 4. \(F_1\) measures as a function of pitch rotation on the augmented BP4D Train set, using the single-label classifier.

Head pose has three main angles: roll, yaw, and pitch. Roll can be compensated for in the frontal view by the normalization step; the case is more complex for non-frontal views. We studied yaw and pitch angles around the frontal view. Yaw is symmetric in this case, and we show data for \((-18^{\circ },+18^{\circ })\) ranges around head poses of 0, 18, 36, 54, and 72 degrees, covering the full frontal-to-profile range. The angle dependence is relatively large for AU4, AU15, and AU23, but the mean \(F_1\) score is a weak function of the head pose angle (Fig. 3).

We studied the asymmetric pitch around the frontal view for \((-18^{\circ },+18^{\circ })\) ranges around \(-36\), \(-18\), +18, and +36 degrees. The mean \(F_1\) score is also a weak function of the pitch angle. AU1, AU4, and AU23 are affected by this angle more strongly than the other AUs (AU2, AU6, AU7, AU10, AU12, AU14, AU15, and AU17); see Fig. 4.

Occlusion sensitivity maps [30] were generated for the different action units. We used 200 images per subject, giving 8,200 images for map generation. Around selected central pixels, the pixels of \(21 \times 21\) patches were set to 0.5, the middle of the normalized [0, 1] range. The central pixels were laid out uniformly on each image at \(32\times 32=1,024\) positions. The modified images, more than 8 million in total, were then evaluated by the trained network for each AU and the binary cross-entropy measure was computed. Results are shown in condensed form: the value is color coded on a \(32\times 32\) occlusion sensitivity map in Fig. 5.
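A minimal sketch of this procedure for one image and one AU follows, assuming a hypothetical `predict_au` function that returns the network's sigmoid output; the grid layout, the patch value (0.5) and the sizes follow the text.

```python
# Occlusion sensitivity map: grey out patches on a 32x32 grid and record the loss.
import numpy as np

def occlusion_map(image, label, predict_au, grid=32, patch=21):
    """image: 256x256 array in [0, 1]; label: 0/1 AU label; returns a grid x grid map
    of binary cross-entropy values, one per occluder position."""
    h, _ = image.shape
    centers = np.linspace(patch // 2, h - 1 - patch // 2, grid).astype(int)
    heat = np.zeros((grid, grid))
    for i, cy in enumerate(centers):
        for j, cx in enumerate(centers):
            occluded = image.copy()
            y0, x0 = cy - patch // 2, cx - patch // 2
            occluded[y0:y0 + patch, x0:x0 + patch] = 0.5      # mid-grey patch
            p = np.clip(predict_au(occluded), 1e-7, 1 - 1e-7)  # sigmoid output
            heat[i, j] = -(label * np.log(p) + (1 - label) * np.log(1 - p))
    return heat
```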

Fig. 5. Occlusion Sensitivity Maps [30]. (a): cropped \(256 \times 256\) pixel images are covered by uniform grey \(21\times 21\) pixel patches centered at the pixels of a \(32\times 32\) grid placed uniformly over the image. (b)-(n): the modified images are evaluated for binary cross-entropy performance. Performance is color coded at the central pixel of the patch, and the \(32\times 32\) map is depicted for the different AUs.

We end the results section by comparing our results with the most recent ones reported in the literature (Table 3): the Local Gabor Binary Pattern (LGBP) [26], the geometric feature based deep network (GDNN) [12], the Discriminant Laplacian Embedding (DLE) [29], Deep Learning with Global Contrast Normalization (DL) [11], and the Convolutional and Bi-directional Memory Neural Network (CRML) [12] methods. DLE wins for AU15, CRML is the best for AU10 and AU14, and DL performs best for AU1 and AU2. Our architecture comes first for the other AUs, and with one exception it is the single-label version that wins. Since the multi-label case is considerably harder, we suspect that better training could improve the results further, e.g., by adding noise to the input on top of the dropout and/or enlarging the database.

Table 3. Comparison of the single-label (SL) and multi-label (ML) version with other methods in the literature.

The single-label case produced the best mean value. A key difference between the CRML method and ours is that we can work on single images, whereas CRML requires frame sequences. Furthermore, the inclusion of temporal information should improve performance in our case, too.

4 Discussion

Recent progress in convolutional neural networks (e.g., [12, 22, 28, 30]; see also the general review [24] and the references cited therein) shows that deep neural networks, including CNNs, are flexible enough to compete with hand-crafted features such as HoG, SIFT, Gabor filters, and LBP, among many others. The adaptivity of the CNN structure tunes the convolutional layers to the statistics of the database. The fully connected layers, on the other hand, serve to collect, combine, and exclude certain portions of the image.

Much of this progress is due to tricks for avoiding local minima during the training procedure, and the collection of such methods keeps growing. We used high dropout rates, early stopping, and rectified linear units to reduce the danger of falling into a local minimum too early during training. We have no doubt that this quickly developing field will come up with superior solutions and that performance will increase further. The maturation of deep learning technologies offers great promise in the field of facial expression estimation.

The success of our relatively small network is most probably due to one additional trick: we combined gradient descent with selective methods. Although the contribution of this trick, which we detail below, is hard to quantify, we note that we used neither a binary mask [12] nor temporal extensions [12, 16], both known to have a considerable impact on performance.

The problem of optimization lies in the \(F_1\) score, which is not a good cost function due to its discontinuities and flat, constant regions. Instead, a closely related quantity, the binary cross-entropy, is preferred for gradient computations. Selection does not require well-behaved, smooth costs, and it can be introduced into the procedure at the validation step that guides early stopping. If performance is not improving on the validation set for a number of steps, even though it still improves on the training set, then the gradient procedure should be stopped, since a local minimum of the training set is being approached. Upon early stopping, a new minibatch can be used to improve performance.

This validation step can serve the selective process if gradient descent is stopped according to a measure different from the cost function. In our case, this measure was the \(F_1\) score. It should be noted that the ideal values of the \(F_1\) score and the binary cross-entropy are attained at the same (perfect) solution, although this is rarely reached in real problems.

Clearly, special procedures such as binary masks and temporal information should improve our results further, similar to the performance increases in the studies mentioned previously.

Our main finding is that performance is a weak function of the head pose for CNNs and remains high for a broad range of angles. This opens the possibility of many real-life applications, from cyber-physical systems with a human in the loop, including smart factories, medical cyber-physical systems, and independent living situations, among many others. Furthermore, insights, sometimes of diagnostic value, can be gained for affective disorders, addiction, and social relations. Progress in GPU technology will provide further gains in evaluation speed, which will also decrease training time. The single-label version of our system runs at 58 FPS, while the multi-label version reaches over 600 FPS on a Titan X GPU.

Real-life applications may require "in the wild" databases; how the method generalizes to such data remains to be seen.

5 Conclusions

Recent progress in deep learning technology and the availability of high-quality databases have enabled powerful learning methods to enter the field of face processing. We used these deep learning methods and the BP4D database to train an architecture for action unit recognition. Our results surpass the state-of-the-art for single images and could be improved further if temporal information were available. The main result is that the angle dependence is minor: a large yaw and pitch range can be covered without considerable deterioration in performance. In turn, relevant applications from human-computer interaction to psychiatric interviews may gain momentum by applying such tools.