1 Introduction

The face is one of the most powerful channels of nonverbal communication [3, 5]. Facial expression provides cues about emotion, intention, alertness, pain, and personality; it regulates interpersonal behavior [4]; and it communicates psychiatric [8] and biomedical status [10], among other functions.

There has been increasing interest in automated facial expression analysis within the computer vision and machine learning communities. Several applications for related technologies exist: distracted driver detection [27], emotional response measurement for advertising [23, 25], and human-robot collaboration [2] are just some possibilities.

Given the time-consuming nature of manual facial expression coding and the alluring possibilities of the aforementioned applications, recent research has pursued computerized systems capable of automatically analyzing facial expressions. The dominant approach adopted by these researchers has been to identify a number of fiducial points on the face, extract hand-crafted or learned features that characterize the appearance of the skin, and train classifiers in a supervised manner to detect the absence or presence of expressions.

Recently, deep learning based solutions have been proposed for coding holistic facial expressions and facial action units. Li et al. [21] used a convolutional neural network (CNN) based deep representation of facial 3D geometric and 2D photometric attributes for recognizing holistic facial expressions. Liu et al. [22] proposed an Action Unit aware deep architecture to learn local appearance variations on the face and constructed a group-wise sub-network to code facial expressions. Xu et al. [28] explored transfer learning of high-level features from face identification data to holistic facial expression recognition. Only recently did Jaiswal and Valstar [12] propose a deep learning approach for recognizing facial action units under uncontrolled conditions. Action Units were coded using a memory network that jointly learns shape, appearance and dynamics in a deep manner.

Even though significant progress has been made [7], the current state-of-the-art science is still limited in several key respects. Stimuli to elicit spontaneous facial actions have been highly controlled and camera orientation has been frontal with little or no variation in head pose. Head motion and orientation to the camera are important if AU detection is to be accomplished in social settings where facial expressions often co-occur with head motion [1, 17]. As the head pose moves away from frontal, parts of the face may become self-occluded and the classifier’s ability to measure expressions degrades. Here, we study the efficiency of a novel deep learning method for AU detection under large head poses.

This paper advances two main novelties:

  • AU Detection under Large Head Poses with 3D Augmentation. In our work we use the BP4D-Spontaneous dataset and its extension, detailed in Sects. 2.2 and 2.3. An augmented dataset was created using the 3D information and renderings of the faces over a broad range of yaw and pitch rotations. We show that performance remains high for networks trained around different pose directions, opening the door for a number of useful applications.

  • Selective Gradient Descent Optimization. Threshold performance metrics (such as the \(F_1\) score) are piecewise-constant functions and including them directly in the CNN cost function would degrade the convergence of the optimization method. In our algorithm, we combined gradient descent with selective methods to overcome this issue. This approach results in a small but highly effective network that outperforms the more complex state-of-the-art systems.

The paper is organized as follows. The method section (Sect. 2) contains the overview of the architecture (Sect. 2.1), the description of the database (Sect. 2.2), its extension (Sect. 2.3), the facial landmark tracking method (Sect. 2.4) and the deep learning components (Sect. 2.5). These descriptions are followed by our results (Sect. 3) and the related discussion (Sect. 4). We conclude in the last section (Sect. 5).

2 Methods

2.1 Architecture

The main pre-processing steps, such as face detection, mesh fitting, and pose estimation, are depicted in Fig. 1. Details follow below.

Fig. 1. Overview of the system. The original image underwent face tracking and was pre-processed in three different ways: Histogram of Gradients (HoG), similarity normalized (scaled and cropped), and cut and put together from patches around landmark positions (Mosaic). Training methods included Support Vector Machine (SVM) and Convolutional Neural Networks (CNN) in single- and multi-label versions.

2.2 BP4D-Spontaneous Dataset

We used the BP4D-Spontaneous dataset [31] from the FERA 2015 Challenge [26]. This database includes digital video of 41 participants (56.1 % female, 49.1 % white, ages 18–29). These individuals were recruited from the departments of psychology and computer science and from the school of engineering at Binghamton University. All participants gave informed consent to the procedures and permissible uses of their data. Participants sat approximately 51 in. in front of a Di3D dynamic face capturing system during a series of eight emotion elicitation tasks. Target emotional expressions include anxiety, surprise, embarrassment, fear, pain, anger, and disgust. Example tasks include being surprised by a loud sound, submerging a hand in ice water, and smelling rotten meat. For each task, the 20-second segment with the highest AU density was identified; this segment then was coded for AU onset (start) and offset (end) by certified and reliable FACS coders.

The FERA 2015 Challenge [26] employed the 41 subjects from the BP4D-Spontaneous dataset [31] as a training set. In this paper we refer to this subset as the "Train" set. Additional videos from 20 subjects were collected using the same setup and were used for testing in the challenge [26]. In this paper we refer to this subset as the "Test" set.

2.3 Database Extension

The subjects in the BP4D-Spontaneous dataset exhibit only a moderate level of head movements in the video sequences. The dataset [31] comes with frame-level high-resolution 3D models. To validate the proposed method on larger viewpoint angles, an augmented dataset has been created using the 3D information and renderings of the faces with different yaw and pitch rotations. We used all the FACS coded data to synthesize the rotated views.
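The following sketch illustrates only the geometric part of such a view synthesis, rotating mesh vertices by yaw and pitch and projecting them orthographically; the mesh array, the angle grid and the projection are illustrative assumptions, not the exact rendering pipeline used for the augmented dataset.

```python
# Minimal sketch: rotate a 3D face mesh by yaw/pitch and project it to 2D.
import numpy as np

def rotation_matrix(yaw_deg, pitch_deg):
    """Rotation about the vertical axis (yaw) followed by the horizontal axis (pitch)."""
    y, p = np.radians(yaw_deg), np.radians(pitch_deg)
    R_yaw = np.array([[ np.cos(y), 0, np.sin(y)],
                      [ 0,         1, 0        ],
                      [-np.sin(y), 0, np.cos(y)]])
    R_pitch = np.array([[1, 0,          0         ],
                        [0, np.cos(p), -np.sin(p)],
                        [0, np.sin(p),  np.cos(p)]])
    return R_pitch @ R_yaw

def project_rotated_view(vertices, yaw_deg, pitch_deg):
    """Rotate head-centered mesh vertices (N x 3) and project orthographically
    onto the image plane (keep x, y; drop depth)."""
    rotated = vertices @ rotation_matrix(yaw_deg, pitch_deg).T
    return rotated[:, :2]

# Example: views every 18 degrees of yaw from frontal to profile.
vertices = np.random.randn(30000, 3)      # stand-in for one frame-level face mesh
views = {yaw: project_rotated_view(vertices, yaw, 0) for yaw in (0, 18, 36, 54, 72)}
```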

2.4 Facial Landmark Tracking and Face Normalization

The first step in automatically detecting AUs was to locate the face and facial landmarks. Landmarks refer to points that define the shape of permanent facial features, such as the eyes and lips. This step was accomplished using the ZFace tracker [14, 15], a generic tracker that requires no individualized training to track facial landmarks of persons it has never seen before. It locates the two- and three-dimensional coordinates of the main fiducial landmarks in each image. These landmarks correspond to important facial points such as the eye and mouth corners, the tip of the nose, and the eyebrows. The moderate level of rigid head motion exhibited by the subjects in the BP4D-Spontaneous dataset was minimized as follows: facial images were warped to the average pose and face using a similarity transformation on the tracked facial landmarks. The average face was normalized to a 100-pixel inter-ocular distance and the normalized images were cropped to \(256\times 256\) pixels. This procedure created a common space, where variation in head size and orientation would not confound the measurement of facial actions.
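A minimal sketch of this normalization step follows, assuming scikit-image; `tracked_landmarks` (ZFace output for one frame) and `reference_landmarks` (the average face, pre-scaled to a 100-pixel inter-ocular distance in the 256 x 256 output frame) are placeholder arrays, not the actual data.

```python
# Similarity normalization: align a face image to the average face via the landmarks.
import numpy as np
from skimage import transform as tf

def similarity_normalize(image, tracked_landmarks, reference_landmarks,
                         output_size=(256, 256)):
    """Warp `image` so its landmarks align with the reference (average) face."""
    tform = tf.SimilarityTransform()
    # Estimate the transform mapping reference coordinates to image coordinates;
    # warp() uses exactly this (output -> input) mapping to pull pixels.
    tform.estimate(reference_landmarks, tracked_landmarks)
    return tf.warp(image, tform, output_shape=output_size)  # float image in [0, 1]
```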

2.5 Deep Learning

Deep learning aims to overcome the curse of dimensionality problem of MLPs via a number of architectural inventions. The increase of the number of layers lessens the transformational tasks of each layer. Rectified linear units (ReLUs) are favoured, since their sensitive range is large, the rectification can efficiently shatter the space, and supervised training does not require unsupervised pre-training (see [9] and the references therein).

Layers of the Network. Convolutional layers are another efficient innovation. They are particularly useful for images. One can view each layer as a set of trainable template matchings [6]. Convolutional layers have the following attractive properties: (a) The templates (also called filters) can be matched at each pixel of the image relatively quickly due to the convolution operation itself [20]. The result for each filter is called a feature map. (b) While the number of neurons can be large, the number of variables, the weights, is kept low, saving memory and mitigating the curse of dimensionality problem. (c) Each convolutional layer may be followed by a subsampling layer. The role of this step is to decrease the number of units, which scales as the product of the dimension of the input of that layer and the number of filters. Max-pooling, which selects the largest response in each pooling region, is one of the preferred methods. The effective result of pooling is that the precision of the feature map degrades, which is nicely compensated by the number of feature maps and the option of further convolutional processing steps without an explosion in the number of units. Subsampling also reduces overfitting. For more details, see [19] and the references therein.

Convolutional networks typically add densely connected layers after the convolutional layers, often made of ReLUs. Our architecture is sketched in Fig. 2.

Fig. 2. Deep neural network, main components: convolutional layers with ReLUs (CL), pooling layers (PL), fully connected layers (FC), output layer with logistic regression (OL). There are two versions. (a): CL-PL-CL-PL-FC-FC-OL; (b): an additional CL between the second PL and the first FC layer.

We used typical regularization, stabilization, early stopping, and local-minima-avoiding procedures [24] with a reasonably small network, and we found that larger networks would not improve performance considerably. The parameters and some procedures of the architecture are as follows (a code sketch of items (a)-(e) follows the list):

  (a) The dimension of the input layer is \(256 \times 256\). The original three-channel color images were converted to a single grayscale channel and the values were scaled between 0 and 1.

  (b) The first and second convolutional cascades have 16 filters each, of \(5 \times 5\) pixels in the first and \(4 \times 4\) pixels in the second cascade. The stride was 1 in both cases. Max pooling was \(4 \times 4\) and \(2 \times 2\), applied with a stride of 4 and 2, respectively. Occasionally a third convolutional layer with 16 filters of \(3\times 3\) pixels each was added when the representational power of the architecture was in question (Fig. 2).

  (c) There are two densely connected layers with 2,000 ReLU units each.

  (d) The output is a sigmoid layer for the action units. Special procedures include dropout with a 50 % rate before the two dense layers and the sigmoid layer. Gradient training is controlled by Adamax (see later). The minibatch size is 500.

  (e) The cost function to be minimized is the sum of two terms: a regularizing \(\ell _2\) norm on the weights and the binary cross-entropy cost on the outputs of the network. The latter is the average of the cross-entropies over the sample: assume that we have \(1 \le n \le N\) samples with binary labels \(y_n \in \{0, 1\}\) and network responses \(\hat{y}_n\) for all n. The loss function is

    $$\begin{aligned} J(\hat{y}_1, \ldots , \hat{y}_N) = -\frac{1}{N} \sum _{n=1}^N \left[ y_n \log \hat{y}_n + (1-y_n) \log (1-\hat{y}_n)\right] , \end{aligned}$$
    (1)

    where the proper range of the estimate is warranted by the logistic function: \(\hat{y}_n(z)=1/(1+e^{-\theta z})\), with z being the input to the \(n^{th}\) output unit and \(\theta \) being a trainable parameter.
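The sketch below expresses items (a)-(e) with Lasagne/Theano, the libraries named under "Applied Software"; the learning rate, the \(\ell _2\) weight and the number of output units are placeholder values chosen for illustration, not reported settings.

```python
# Sketch of the network of Fig. 2 (version (a)) and the cost function of Eq. (1).
import theano.tensor as T
import lasagne
from lasagne.layers import (InputLayer, Conv2DLayer, MaxPool2DLayer,
                            DenseLayer, DropoutLayer, get_output, get_all_params)
from lasagne.nonlinearities import rectify, sigmoid

def build_network(n_aus, input_var):
    net = InputLayer((None, 1, 256, 256), input_var=input_var)           # (a)
    net = Conv2DLayer(net, 16, (5, 5), stride=1, nonlinearity=rectify)   # (b) CL 1
    net = MaxPool2DLayer(net, pool_size=(4, 4))                          # (b) PL 1
    net = Conv2DLayer(net, 16, (4, 4), stride=1, nonlinearity=rectify)   # (b) CL 2
    net = MaxPool2DLayer(net, pool_size=(2, 2))                          # (b) PL 2
    net = DropoutLayer(net, p=0.5)                                       # (d) dropout
    net = DenseLayer(net, 2000, nonlinearity=rectify)                    # (c) FC 1
    net = DropoutLayer(net, p=0.5)
    net = DenseLayer(net, 2000, nonlinearity=rectify)                    # (c) FC 2
    net = DropoutLayer(net, p=0.5)
    return DenseLayer(net, n_aus, nonlinearity=sigmoid)                  # (d) output

input_var, target_var = T.tensor4('X'), T.matrix('y')
network = build_network(n_aus=11, input_var=input_var)   # 11 AUs, as listed in Sect. 3
prediction = get_output(network)

# (e): binary cross-entropy averaged over the minibatch, plus l2 regularization.
loss = lasagne.objectives.binary_crossentropy(prediction, target_var).mean()
loss += 1e-4 * lasagne.regularization.regularize_network_params(
    network, lasagne.regularization.l2)

params = get_all_params(network, trainable=True)
updates = lasagne.updates.adamax(loss, params, learning_rate=0.002)
```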

Early Stopping. Training stops early if performance on the validation set does not improve for, say, m epochs. This makes overfitting less probable. In our case, \(m=5\) was chosen. The \(F_1\) score is the typical measure for face-related estimation tasks. However, the \(F_1\) score has discontinuities and constant regions, making it unsuitable for gradient-based methods. Our approach to overcoming this problem is the following: we computed the gradient for the binary cross-entropy, but used the \(F_1\) score as the performance measure in the validation step. This way, the early stopping of gradient descent was guided by the \(F_1\) score itself. The high quality of the results that we reached with a relatively simple network may be partially due to this procedure.
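A minimal sketch of this selection step follows, assuming hypothetical `train_one_epoch` and `predict` helpers; the gradient itself acts on the binary cross-entropy, while the \(F_1\) score on the validation set only decides when to stop.

```python
# Early stopping guided by the validation F1 score (patience m = 5 epochs).
import numpy as np
from sklearn.metrics import f1_score

def train_with_f1_early_stopping(train_one_epoch, predict,
                                 X_val, y_val, max_epochs=200, patience=5):
    best_f1, stalled, best_params = -np.inf, 0, None
    for epoch in range(max_epochs):
        params = train_one_epoch()                    # cross-entropy gradient steps
        y_hat = (predict(X_val) > 0.5).astype(int)    # threshold the sigmoid outputs
        # Mean F1 over the action units on the validation set.
        val_f1 = np.mean([f1_score(y_val[:, k], y_hat[:, k])
                          for k in range(y_val.shape[1])])
        if val_f1 > best_f1:
            best_f1, best_params, stalled = val_f1, params, 0
        else:
            stalled += 1
            if stalled >= patience:                   # validation F1 stalled: stop
                break
    return best_params, best_f1
```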

Cross-Validation. All the results and methods reported on the “Train” set have been validated with a 10-fold, subject-independent cross-validation. In the other experiments we trained on the “Train” set and reported performance measures on the “Test” set, following the challenge protocol [26].

Details of the Backpropagation Algorithm. Beyond the advances of GPU technology and deep learning architectures, error backpropagation has also undergone fast and efficient changes. We used one of the most recent methods, called Adamax [18]. It is a version of the Adam algorithm, a first-order gradient-based optimization method designed for stochastic objective functions that exploits adaptive estimates of lower-order moments. Adam estimates the \(\ell _2\) norm of the current and past gradients; if the gradients are small, the step size is made larger, and vice versa (inverse proportionality). Adamax generalizes the \(\ell _2\) norm to the \(\ell _p\) norm and takes the \(p \rightarrow \infty \) limit. For more details, see [18].
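For reference, the Adamax parameter update of [18] written out in NumPy, making the \(p \rightarrow \infty \) (infinity-norm) step explicit; the hyperparameter values are the defaults suggested in that paper, and the small constant in the denominator is only added here for numerical safety.

```python
# One Adamax update step (Kingma & Ba [18]).
import numpy as np

def adamax_step(theta, grad, m, u, t, alpha=0.002, beta1=0.9, beta2=0.999):
    """theta: parameters, grad: current gradient, m: first-moment estimate,
    u: exponentially weighted infinity norm of past gradients, t: step (1-based)."""
    m = beta1 * m + (1 - beta1) * grad           # biased first-moment estimate
    u = np.maximum(beta2 * u, np.abs(grad))      # l_inf norm replaces Adam's l_2 norm
    theta = theta - (alpha / (1 - beta1 ** t)) * m / (u + 1e-8)
    return theta, m, u
```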

Applied Software. There are many implementations of deep learning, mostly based on Python or C++. The link http://deeplearning.net/software_links/ is currently a good starting point for a comprehensive list of software tools. We used Lasagne, a lightweight library built on top of Theano. Theano (http://deeplearning.net/software/theano) has been developed by the Montreal Institute for Learning Algorithms; it is a symbolic expression compiler, written in Python, that works both on CPU and on GPU.

3 Results

First, we evaluated the performance on the FERA Train set, employing a 10-fold, subject-independent CV. According to Table 1, the HoG-based SVM is the best for AU14 and AU15, with the largest margin for AU15; the representation around the decision surface seems superior for these AUs. For the other AUs, SNI-based CNNs with single-AU classification are better. Multi-label classification is somewhat worse for almost all AUs, but note that its evaluation is faster: in the single-AU case, evaluation time scales linearly with the number of AUs.

Table 1. \(F_1\) measures on the FERA BP4D Train set with different classifiers (C), input features (IF) and output label (OL) structures. The input features are Histogram of Gradients (HOG), mosaic images (MI), and similarity normalized images (SNI). The output structures are either single- (S) or multi-label (M).
Table 2. Results on the FERA BP4D Test set with multi-label CNN and SNI. Performance measures include \(F_1\) score, its skew normalized version (\(F_{1}^{s.n.}\)) [13], and area under ROC curve (AUC). The table shows the degree of skew (ratio of negative and positive labels) for each AU.

In the next experiment we trained the system on the FERA 2015 Train set and tested it on the Test set. The AU base-rates are significantly different on these subsets [26] and the \(F_1\) score is attenuated by skewed distributions [13]. For this reason we report the degree of skew, the \(F_1\) score, its skew normalized version (\(F_{1}^{s.n.}\)) [13], and the area under the receiver operating characteristic (ROC) curve. These values are shown in Table 2 for the FERA BP4D Test set, where the skew parameters range between 1 and 20.
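A sketch of these per-AU measures follows, assuming scikit-learn; the skew-normalized \(F_1\) here reflects our reading of [13], i.e. the negative-class counts are rescaled by the inverse skew to estimate the score under balanced class priors.

```python
# Per-AU performance measures: skew, F1, skew-normalized F1, and AUC.
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def au_metrics(y_true, y_score, threshold=0.5):
    y_pred = (y_score > threshold).astype(int)
    skew = np.sum(y_true == 0) / np.sum(y_true == 1)   # negatives per positive
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    f1_skew_norm = 2 * tp / (2 * tp + fn + fp / skew)  # balanced-prior estimate
    return {'skew': skew,
            'F1': f1_score(y_true, y_pred),
            'F1_s.n.': f1_skew_norm,
            'AUC': roc_auc_score(y_true, y_score)}
```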

Fig. 3. \(F_1\) measures as a function of yaw rotation on the augmented BP4D Train set, using the single-label classifier.

Fig. 4. \(F_1\) measures as a function of pitch rotation on the augmented BP4D Train set, using the single-label classifier.

Head pose has three main angles: roll, yaw, and pitch. Roll can be compensated for in the frontal view by the normalization step; the case is more complex for non-frontal views. We studied yaw and pitch angles around the frontal view. Yaw is symmetric in this case, and we show data for \((-18^{\circ },+18^{\circ })\) ranges around head poses of 0, 18, 36, 54, and 72 degrees, covering the full frontal-to-profile range. The angle dependence is relatively large for AU4, AU15, and AU23, but the mean \(F_1\) score is a weak function of the head pose angle (Fig. 3).

We studied the asymmetric pitch around the frontal view for \((-18^{\circ },+18^{\circ })\) ranges around \(-36\), \(-18\), +18, and +36 degrees. The mean \(F_1\) score is also a weak function of the pitch angle. AU1, AU4, and AU23 are affected by this angle more strongly than the other AUs (AU2, AU6, AU7, AU10, AU12, AU14, AU15, and AU17); see Fig. 4.

Occlusion sensitivity maps [30] were generated for the different action units. We used 200 images per subject, giving 8,200 images for map generation. Around selected central pixels, the pixels of \(21 \times 21\) patches were set to 0.5, the middle of the normalized [0, 1] range. The central pixels were laid out uniformly on each image at \(32\times 32=1,024\) positions. The modified images, more than 8 million in total, were then evaluated by the trained network for each AU and the binary cross-entropy measure was computed. Results are shown in condensed form: the value is color coded on a \(32\times 32\) occlusion sensitivity map in Fig. 5.
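A minimal sketch of this procedure for one image and one AU follows, assuming a hypothetical `predict_au` function that returns the network's sigmoid output; the grid layout, the patch value (0.5) and the sizes follow the text.

```python
# Occlusion sensitivity map: grey out patches on a 32x32 grid and record the loss.
import numpy as np

def occlusion_map(image, label, predict_au, grid=32, patch=21):
    """image: 256x256 array in [0, 1]; label: 0/1 AU label; returns a grid x grid map
    of binary cross-entropy values, one per occluder position."""
    h, _ = image.shape
    centers = np.linspace(patch // 2, h - 1 - patch // 2, grid).astype(int)
    heat = np.zeros((grid, grid))
    for i, cy in enumerate(centers):
        for j, cx in enumerate(centers):
            occluded = image.copy()
            y0, x0 = cy - patch // 2, cx - patch // 2
            occluded[y0:y0 + patch, x0:x0 + patch] = 0.5      # mid-grey patch
            p = np.clip(predict_au(occluded), 1e-7, 1 - 1e-7)  # sigmoid output
            heat[i, j] = -(label * np.log(p) + (1 - label) * np.log(1 - p))
    return heat
```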

Fig. 5. Occlusion Sensitivity Maps [30]. (a): cropped \(256 \times 256\) pixel images are covered by uniform grey \(21\times 21\) pixel patches centered at the pixels of a \(32\times 32\) grid placed uniformly over the image. (b)-(n): the modified images are evaluated for binary cross-entropy performance. Performance is color coded at the central pixel of the patch, and the \(32\times 32\) map is depicted for the different AUs.

We end the results section by comparing our results with the most recent ones reported in the literature (Table 3): the Local Gabor Binary Pattern (LGBP) [26], the geometric feature based deep network (GDNN) [12], the Discriminant Laplacian Embedding (DLE) [29], Deep Learning with Global Contrast Normalization (DL) [11], and the Convolutional and Bi-directional Memory Neural Network (CRML) [12] methods. DLE wins for AU15, CRML is the best for AU10 and AU14, and DL performs best for AU1 and AU2. Our architecture comes first for the other AUs, and with one exception it is the single-label version that wins. Since the multi-label case is considerably harder, we suspect that better training could improve the results further, e.g., by adding noise to the input on top of the dropout and/or enlarging the database.

Table 3. Comparison of the single-label (SL) and multi-label (ML) version with other methods in the literature.

The single-label case produced the best mean value. A key difference between the CRML method and ours is that we can work on single images, whereas CRML requires frame sequences. Furthermore, the inclusion of temporal information should improve performance in our case, too.

4 Discussion

Recent progress in convolutional neural networks (e.g., [12, 22, 28, 30]; see also the general review [24] and the references cited therein) shows that deep neural networks, including CNNs, are flexible enough to compete with hand-crafted features such as HoG, SIFT, Gabor filters, and LBP, among many others. The adaptivity of the CNN structure tunes the convolutional layers to the statistics of the database. The fully connected layers, on the other hand, serve to collect, combine, and exclude certain portions of the image.

Much of this progress is due to tricks for avoiding local minima during the training procedure, and the collection of such methods keeps growing. We used high dropout rates, early stopping, and rectified linear units to reduce the danger of falling into a local minimum too early during training. We have no doubt that this quickly developing field will come up with superior solutions and that performance will increase further. The maturation of deep learning technologies offers great promise in the field of facial expression estimation.

The success of our relatively small network is most probably due to one additional trick: we combined gradient descent with selective methods. Although the contribution of this trick, which we detail below, is hard to quantify, we note that we used neither a binary mask [12] nor temporal extensions [12, 16], both known to have a considerable impact on performance.

The problem of optimization lies in the \(F_1\) score, which is not a good cost function due to its discontinuities and flat, constant regions. Instead, a closely related quantity, the binary cross-entropy, is preferred for gradient computations. Selection does not require well-behaved, smooth costs, and it can be introduced into the procedure at the validation step that guides early stopping. If performance is not improving on the validation set for a number of steps, even though it still improves on the training set, then the gradient procedure should be stopped, since a local minimum of the training set is being approached. Upon early stopping, a new minibatch can be used to improve performance.

This validation step can serve the selective process if gradient descent is stopped according to a measure different from the cost function. In our case, this measure was the \(F_1\) score. It should be noted that the ideal values of the \(F_1\) score and the binary cross-entropy are attained at the same (perfect) solution, although this is rarely reached in real problems.

Clearly, special procedures such as binary masks and temporal information should improve our results further, similar to the performance increases in the studies mentioned previously.

Our main finding is that performance is a weak function of the head pose for CNNs and remains high for a broad range of angles. This opens the possibility of many real-life applications, from cyber-physical systems with a human in the loop, including smart factories, medical cyber-physical systems, and independent living situations, among many others. Furthermore, insights, sometimes of diagnostic value, can be gained for affective disorders, addiction, and social relations. Progress in GPU technology will provide further gains in evaluation speed, which will also decrease training time. The single-label version of our system runs at 58 FPS, while the multi-label version reaches over 600 FPS on a Titan X GPU.

Real-life applications may require "in the wild" databases; how the method generalizes to such data remains to be seen.

5 Conclusions

Recent progress in deep learning technology and the availability of high-quality databases have enabled powerful learning methods to enter the field of face processing. We used these deep learning methods and the BP4D database to train an architecture for action unit recognition. Our results surpass the state-of-the-art for single images and could be improved further if temporal information were available. The main result is that the angle dependence is minor: a large yaw and pitch range can be covered without considerable deterioration in performance. In turn, relevant applications from human-computer interaction to psychiatric interviews may gain momentum by applying such tools.