Comparing ensemble strategies for deep learning: An application to facial expression recognition
Introduction
As facial expressions play a vital role in human interaction and nonverbal communication, Facial Expression Recognition (FER) is of crucial importance to the development of interactive computer systems. A facial expression is a signal that humans use, intentionally or otherwise, to convey a message, i.e. an emotion, an affective state, or a health condition. Ekman and Friesen (1971) demonstrated that facial expressions of emotion are universal. Their study was carried out on both literate and preliterate cultures: the universality of the human way of expressing an emotion is thought to be an evolutionary, biological fact, not dependent on any specific culture. This finding allows modern Computer Vision studies to focus on the signal (the facial expression) in order to analyze the message (the emotion), giving rise to plentiful applications in different fields, ranging from Human Computer Interaction to Data Analytics (Martinez & Valstar, 2016). Furthermore, FER has been automated, and several machine learning algorithms have been specifically proposed for this task (Gross & Brajovic, 2003; Martinez & Valstar, 2016; Pramerdorfer & Kampel, 2016; Zhang, Mahoor, & Mavadati, 2015a). Effective solutions for expert systems targeting the FER problem therefore deserve extensive investigation. Automated FER approaches attempt to classify the face in a given single image as showing one of the six basic emotions, namely anger, disgust, fear, happiness, sadness, and surprise.
A related and relatively recent topic in affective computing research is emotion recognition from videos. On the one hand, video data pose additional challenges to the task of emotion recognition compared to static images, e.g. the quick and variable dynamics between the beginning of the expression (onset), its peak, and its vanishing (offset). On the other hand, the amount of information provided by the sequence of correlated frames and, in some instances, by the associated speech enables a variety of automated methods for robust feature extraction and classification in a multi-modal setting. In this work we focus on FER from static images rather than from video sequences: the nature of video data and the underlying assumptions (multiple modalities of audiovisual data, correlation among sequential frames) make it a different, although closely related, and more challenging topic than static image analysis. Nevertheless, many recent works attest that FER on static images is still an active research area, and advances in this field may have a positive impact on emotion recognition from videos too.
In this context, researchers have collected several annotated face databases, both in spontaneous, uncontrolled settings (Dhall, Goecke, Joshi, Sikka, & Gedeon, 2014) and in more strictly controlled environments (Gross, Matthews, Cohn, Kanade, & Baker, 2010; Lucey, Cohn, Kanade, Saragih, Ambadar, & Matthews, 2010; Lyons, Akamatsu, Kamachi, & Gyoba, 1998). Images acquired in controlled conditions (or lab conditions) consist of posed expressions of frontal faces, with standard illumination and background. Nowadays, emotion recognition in this scenario is considered a solved problem and primarily serves as a proof of concept for feature extraction and classification methods (Pramerdorfer & Kampel, 2016; Sariyanidi, Gunes, & Cavallaro, 2015). Indeed, several works have shown that a recognition rate above 90% can be achieved under these conditions (Dornaika, Moujahid, & Raducanu, 2013; Mahersia & Hamrouni, 2015) and, in more recent proposals, accuracy values close to 100% on well-known benchmark datasets have been reached (Liang, Liang, Yu, & Zhang, 2019; Xie, Hu, & Wu, 2019). Several classical machine learning algorithms, notably Support Vector Machines and Bayesian classifiers, have proved able to classify posed facial expressions generated in a controlled environment. Nevertheless, these approaches fall short in generalization capability. FER under naturalistic conditions, often referred to as in-the-wild, is the scenario of interest for the above-mentioned applications (Dhall et al., 2014). The factors of variation that make this a harder problem are the subtlety of spontaneous expressions, head pose, illumination, and occlusions.
A standard algorithmic pipeline for the FER problem on static images relies on the crucial step of feature extraction. Traditional approaches consist of determining features by hand through mathematical descriptors (e.g. Gabor filters, Local Binary Patterns, the Scale Invariant Feature Transform) (Hussain, Khan, Nazir, & Iqbal, 2012), or of using facial landmarking (Tie & Guan, 2013). Exploiting hand-crafted features has proved inadequate for the in-the-wild task: an optimal feature extractor should provide information useful for the classification step while being robust to the above-mentioned nuisance factors. Recent research has therefore investigated the possibility of learning features directly from data (Hertel, Barth, Käster, & Martinetz, 2015). In computer vision, the most popular models used for this purpose are Convolutional Neural Networks (CNNs), and new practical methodologies have been studied for their employment in modern expert systems (Han, Liu, & Fan, 2018).
The origin of CNNs dates back to the ’80s; nevertheless, they were largely ignored by the mainstream computer vision and machine learning communities until the ImageNet competition in 2012. Their resurgence, as well as the impressive results they achieve, can be ascribed to the efficient use of GPUs, ReLUs, dropout, and new data augmentation techniques (LeCun, Bengio, & Hinton, 2015). CNNs have thus revolutionized the field of computer vision, and nowadays they are the dominant approach for all recognition and detection tasks, even approaching human performance in specific cases.
The theoretical advantages of using deep architectures have been highlighted in the literature (Bengio, 2009). On the one hand, cognitive processes in humans seem to have a deep structure, with different levels of representation and abstraction: CNNs are inspired by the mammalian visual system and, in particular, by the two-dimensional structure of the visual cortex and its biological neurons. On the other hand, overly shallow architectures fail to represent the desired function with a reasonable number of parameters: the required number of units may grow exponentially as depth is reduced. Furthermore, it has been shown that the spatial regions of the input facial image which maximally excite neurons in the hidden layers of a convolutional network correspond to the Facial Action Units described by Ekman (Khorrami, Paine, & Huang, 2015); that is, the network is able to learn relevant high-level features.
It is worth underlining that research on the FER task is hindered by the lack of the large amounts of labeled training data typically necessary in current deep learning approaches. Indeed, unlike visual object databases such as ImageNet, existing FER databases often have a limited number of subjects, few sample images or videos per expression, or small variations between sets, hampering the neural network training procedure. For instance FER-2013, one of the largest databases built so far, consists of 35,887 images of different subjects and, yet, only 547 of them portray disgust. Gathering and annotating new data is often a difficult, expensive, and time-consuming task. The challenge is thus to find alternative methods to improve the performance of automatic FER systems.
The reported average human accuracy on the FER-2013 dataset is 65%. With the work presented in (Tang, 2013), Tang won the machine learning competition in the ICML 2013 Challenges in Representation Learning, achieving a test accuracy of 71.2% using a CNN with an L2-SVM loss. This figure has been further improved in recent years; hereafter, we recall some of the most significant works along with their performance, i.e. the accuracy obtained over the test set. In (Kim, Dong, Roh, Kim, & Lee, 2016a), Kim et al. achieved a 73.73% test accuracy by means of an ensemble of CNNs that uses both aligned and non-aligned images: the key factor for the performance improvement is a pre-processing (alignment) operation carried out by yet another deep convolutional network that learns the proper mapping. In (Connie, Al-Shabi, Cheah, & Goh, 2017), Connie et al. proposed a model that combines SIFT features and CNN features: the aggregation of three models allowed them to achieve a 73.4% accuracy. Excellent results have also been obtained by exploiting diversified learning information: Zhang, Luo, Loy, and Tang (2015b) reached a 75.1% test accuracy by fusing training data from multiple sources.
The work presented in this paper aims to propose general, more efficient procedures for constructing ensembles of CNNs for the FER task: since we seek general results, the experimentation must focus on the basic FER ability of each single network, independently of any possible ancillary geometric pre-processing. However, it has been shown that such pre-processing is crucial to obtaining top accuracies (Kim et al., 2016a). As a consequence, the performance obtained through a basic ensembling procedure is not directly comparable with that of the most accurate state-of-the-art systems, because of the different nature of their building blocks. Nevertheless, even though the studied configurations adopt neither face-alignment techniques (or tricks of this kind) nor specific hand-crafted features, their delivered accuracy closely trails that of the top-performing works.
Ensemble solutions are widely exploited in neural networks with the aim of boosting classification performance. Giacinto and Roli highlight the key point of network ensembles for image classification purposes (Giacinto & Roli, 2001): beyond general theoretical analyses (Brown, Wyatt, Harris, & Yao, 2005), experimental evidence shows that an ensemble can outperform the best single neural network, provided that the networks make different errors. From this perspective, producing error-independent networks is not trivial, mainly because of the weight space symmetry. Two approaches can be exploited to design ensembles of neural networks: implicit and explicit methods. The former consists of directly creating error-independent neural networks by forcing diversity among them; the latter consists of producing a large set of base classifiers and selecting the optimal subset with respect to a given measure of error diversity. In our work we focus on the implicit ensemble design strategy. The most commonly adopted strategies to create heterogeneous ensembles are: i) varying the initial random weights, ii) varying the network architecture, iii) varying the network type, and iv) varying the training data (Giacinto & Roli, 2001). The way the networks' outputs are aggregated into a final output is another central issue in the ensembling approach: the most straightforward strategies for combining the outputs computed by base classifiers are Average and Majority Voting (Ponti Jr, 2011), sketched in the example below.
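As a minimal illustration of these two fusion rules (the array shapes and function names below are our own, not taken from the paper), consider an ensemble whose base networks each output a softmax probability vector over the emotion classes:

```python
import numpy as np

def average_fusion(probs: np.ndarray) -> int:
    """Average the class probabilities of all base networks, then pick the argmax."""
    return int(np.mean(probs, axis=0).argmax())

def majority_voting(probs: np.ndarray) -> int:
    """Each base network casts one vote for its top class; ties go to the lowest index."""
    votes = probs.argmax(axis=1)  # one predicted label per network
    return int(np.bincount(votes, minlength=probs.shape[1]).argmax())

# Example: three base CNNs voting over the six basic emotions.
probs = np.array([[0.10, 0.60, 0.10, 0.10, 0.05, 0.05],
                  [0.20, 0.50, 0.10, 0.10, 0.05, 0.05],
                  [0.40, 0.20, 0.10, 0.10, 0.10, 0.10]])
print(average_fusion(probs), majority_voting(probs))  # -> 1 1
```

Note that the two rules can disagree: Majority Voting discards the confidence of each base network, while Average fusion lets a single highly confident network outweigh the others.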
The network size is a key factor for the learning dynamics: deeper models are less sensitive to randomness in the initialization and training procedure, leading to a less sparse distribution of loss values over multiple repetitions (Choromanska, Henaff, Mathieu, Arous, & LeCun, 2015). As a consequence, ensemble learning has proven more effective when using shallow networks (Ju, Bibaut, & van der Laan, 2018), since their high sensitivity to initialization and training most likely results in different local minima, as sketched below.
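As a minimal, hypothetical sketch of how such seed-induced diversity can be realized (the architecture, sizes, and names are illustrative assumptions, not the paper's configuration):

```python
import torch
import torch.nn as nn

def seeded_cnn(seed: int) -> nn.Module:
    """Build one base network under a given seed; the seed drives the weight
    initialization (and, in a full pipeline, data shuffling and dropout)."""
    torch.manual_seed(seed)
    return nn.Sequential(
        nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(), nn.Linear(32 * 24 * 24, 6),  # six basic emotions
    )

# Three shallow base CNNs that differ only in their pseudorandom initialization;
# trained independently, each is likely to settle in a different local minimum.
ensemble = [seeded_cnn(s) for s in (0, 1, 2)]
```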
Recent proposals in the field of expert and intelligent systems have mainly focused on devising more sophisticated ensemble design and fusion strategies for the FER problem. Liu and Zhang proposed a two-step ensemble framework in the context of granular computing (Liu & Zhang, 2019). In their approach, ensembles can be viewed as information granules; a further level of ensembling and information fusion completes the granular architecture with a coarser level of granularity. The approach, however, is validated on a training dataset of only 344 instances, on which deep learning architectures fail to obtain competitive performance. In (Gan, Chen, & Xu, 2019), the authors combined eight neural networks obtained by training a CNN architecture on a training set with different perturbations of soft labels, which are supposed to capture latent similarities and mixtures among different facial expressions. Hierarchical committees of CNNs have been introduced as well (Kim, Roh, Dong, & Lee, 2016b); in this case the single decisions are fused according to a multi-level structure. Finally, Wen et al. proposed a probability-based fusion rule to combine diverse base classifiers obtained by varying the initialization of the parameters and hyperparameters of a CNN architecture (Wen et al., 2017). None of these works, however, has improved state-of-the-art performance on the FER-2013 dataset or disentangled the factors that influence the performance of ensembles of neural networks.
In this paper, in order to find efficient procedures for ensemble construction in FER problems, we evaluate the effectiveness of several strategies that exploit the sensitivity of shallow networks used as base classifiers. We fix the classifier type, namely CNN, and the network architecture, using a model that can be considered shallow compared to modern very deep architectures such as VGG (Simonyan & Zisserman, 2014), Inception (Szegedy, Vanhoucke, Ioffe, Shlens, & Wojna, 2016), and ResNet (He, Zhang, Ren, & Sun, 2016). As the ensemble performance is tightly related to the diversity of the classifiers making up the ensemble, particular care must be taken in selecting the strategy used to generate the base classifiers. We present an analysis of various techniques for generating diversity within ensembles of base networks, varying the training data and the weight initialization. Furthermore, we compare different fusion schemes for merging the outputs of the base classifiers. We also investigate whether it is appropriate to use aggregation methods other than the common Average and Majority Voting. To this aim, we experiment with different Ordered Weighted Averaging (OWA) operators (Fodor, Marichal, & Roubens, 1995; Yager, 1993), illustrated by the sketch below. The final objective is to provide some indications for building effective ensembles of CNNs.
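As a hedged sketch of how an OWA operator can act as a decision-fusion rule (the function and variable names are ours, not from the paper): an OWA operator applies its weight vector to the inputs after sorting them in descending order, so weights attach to ranks rather than to specific base classifiers. The uniform weight vector recovers the plain average, while degenerate weight vectors recover the maximum and minimum; intermediate vectors interpolate between these extremes.

```python
import numpy as np

def owa(values: np.ndarray, weights: np.ndarray) -> float:
    """Ordered Weighted Averaging: weights are applied to the values sorted in
    descending order; weights are assumed non-negative and to sum to 1."""
    return float(np.dot(np.sort(values)[::-1], weights))

def owa_fusion(probs: np.ndarray, weights: np.ndarray) -> int:
    """Fuse an (n_networks, n_classes) array of softmax outputs by applying
    the OWA operator per class across networks, then picking the argmax."""
    fused = np.apply_along_axis(owa, 0, probs, weights)
    return int(fused.argmax())

n = 5                              # ensemble size
average = np.full(n, 1.0 / n)      # uniform weights: the plain average
maximum = np.eye(n)[0]             # (1, 0, ..., 0): the max operator
minimum = np.eye(n)[-1]            # (0, ..., 0, 1): the min operator
```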
Although ensemble learning has been widely employed in FER and in other similar contexts, an experimental analysis of the aspects that influence its performance with CNNs as base classifiers has never been carried out. The main contributions of this study can be summarized as follows:
- Achieving competitive performance for in-the-wild emotion recognition, avoiding hand-crafted features and adopting simple design strategies for ensembles of CNNs;
- Shedding light on the factors that influence the recognition accuracy of such ensembles. Notably, the extensive analysis compares different simple strategies for generating diversity among base classifiers and different aggregation schemes for combining predictions at the decision fusion level, and further investigates how the performance depends on the ensemble size.
The paper is organized as follows: In Section 2 we provide background on CNNs. In Section 3 we describe the approaches for designing different ensembles of CNNs and the adopted fusion strategies. Section 4 presents the experimental setup: we describe the datasets used in the present work and the details of our implementation. In Section 5 we show and discuss the experimental results. Section 6 draws concluding remarks.
Section snippets
Convolutional neural networks background
A CNN (LeCun, Kavukcuoglu, & Farabet, 2010) is a type of feed-forward artificial neural network. The building blocks of a convolutional layer are the following: (i) the linear convolutional stage performs the convolution operation between a kernel, or filter, and an input bi-dimensional array; (ii) the non-linear activation stage consists of the pointwise application of a non-linear function; (iii) the pooling stage computes a summary statistic of a group of input neurons. Thanks to its
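As a minimal sketch of the three stages listed above (the filter count, kernel size, and input resolution are illustrative assumptions, not the architecture used in the paper):

```python
import torch
import torch.nn as nn

# One convolutional block: (i) linear convolution, (ii) pointwise
# non-linearity, (iii) pooling as a summary statistic over a neighborhood.
block = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=32, kernel_size=3, padding=1),  # (i)
    nn.ReLU(),                                                            # (ii)
    nn.MaxPool2d(kernel_size=2),                                          # (iii)
)

x = torch.randn(1, 1, 48, 48)  # one 48x48 grayscale face, as in FER-2013
print(block(x).shape)          # torch.Size([1, 32, 24, 24])
```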
Ensemble design
In this section, we first describe the different ensemble design strategies to investigate, pointing out how variability among the base networks is generated, and then we show the aggregation methods adopted in our experiments.
Experimental set-up
In this section we describe the datasets used in the experiments and the details of our implementation.
Experimental results
The performance metric used in the present work is the accuracy on the reference dataset, i.e. the percentage of correctly classified examples. We evaluated the four selected strategies relying on two measures: the base classifier accuracy and the ensemble accuracy. The difference between these two values represents the performance gain obtained thanks to the combination of the base classifiers. For each strategy we performed three repetitions with different seeds in order to assess the
Conclusion
We have presented a comprehensive comparison of four different strategies for the design of an ensemble of CNNs in the context of facial expression recognition. The Seed Strategy simply combines CNNs generated using different pseudorandom number generator initializations; since the generator affects the behaviour of several CNN components, variability is induced. The Preprocessing Strategy employs diverse image preprocessing methods and different seeds. The Pretraining Strategy aims to
Author contributions
Alessandro Renda and Marco Barsacchi shaped the presented approach. The initial ideas have been refined under the coordination of Alessio Bechini and Francesco Marcelloni. Alessandro Renda implemented the required software and performed the experimental tests, according to the indications decided with all the other authors. All the authors contributed to writing the final manuscript.
Declaration of competing interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work has been partially supported by the University of Pisa under grant PRA_2017 “IoT e Big Data: Metodologie e tecnologie per la raccolta e l’elaborazione di grosse moli di dati.”
References (55)
- Choromanska, Henaff, Mathieu, Arous, & LeCun (2015). The loss surfaces of multilayer networks. In Proc. of the 18th Conf. on Artificial Intelligence and Statistics.
- Fodor, Marichal, & Roubens (1995). Characterization of the ordered weighted averaging operators. IEEE Transactions on Fuzzy Systems.
- Digital image processing (3rd edition) (2006).
- Goodfellow, Bengio, & Courville (2016). Deep learning.
- Han, Liu, & Fan (2018). A new image classification method using CNN transfer learning and web data augmentation. Expert Systems with Applications.
- Hussain, Khan, Nazir, & Iqbal (2012). Survey of various feature extraction and classification techniques for facial expression recognition. In Proc. of the 11th WSEAS Int'l Conf. on Electronics, Hardware, Wireless and Optical Communications, Proc. of the 11th WSEAS Int'l Conf. on Signal Processing, Robotics and Automation, and Proc. of the 4th WSEAS Int'l Conf. on Nanotechnology.
- Ioffe & Szegedy (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift.
- Martinez & Valstar (2016). Advances, challenges, and opportunities in automatic facial expression recognition. In Advances in Face Detection and Facial Image Analysis.
- Sariyanidi, Gunes, & Cavallaro (2015). Automatic analysis of facial affect: A survey of registration, representation, and recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Zhang et al. (2014). BP4D-Spontaneous: A high-resolution spontaneous 3D dynamic facial expression database. Image and Vision Computing.
- Zhang et al. (2016). Multimodal spontaneous emotion corpus for human behavior analysis. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition.
- Using OWA fusion operators for the classification of hyperspectral images. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing.
- Bengio (2009). Learning deep architectures for AI. Foundations and Trends in Machine Learning.
- Breiman (1996). Bagging predictors. Machine Learning.
- Brown, Wyatt, Harris, & Yao (2005). Diversity creation methods: A survey and categorisation. Information Fusion.
- Connie, Al-Shabi, Cheah, & Goh (2017). Facial expression recognition using a hybrid CNN–SIFT aggregator. In Int'l Workshop on Multi-disciplinary Trends in Artificial Intelligence.
- Dhall, Goecke, Joshi, Sikka, & Gedeon (2014). Emotion recognition in the wild challenge 2014: Baseline, data and protocol. In Proc. of the 16th Int'l Conf. on Multimodal Interaction.
- Dhall et al. (2012). Collecting large, richly annotated facial-expression databases from movies. IEEE MultiMedia.
- Dornaika, Moujahid, & Raducanu (2013). Facial expression recognition using tracked facial actions: Classifier performance analysis. Engineering Applications of Artificial Intelligence.
- Ekman & Friesen (1971). Constants across cultures in the face and emotion. Journal of Personality and Social Psychology.
- Gan, Chen, & Xu (2019). Facial expression recognition boosted by soft label with a diverse ensemble. Pattern Recognition Letters.
- Giacinto & Roli (2001). Design of effective neural network ensembles for image classification purposes. Image and Vision Computing.
- Goodfellow et al. (2015). Challenges in representation learning: A report on three machine learning contests. Neural Networks.
- Gross & Brajovic (2003). An image preprocessing algorithm for illumination invariant face recognition. In Proc. of the 4th Int'l Conf. on Audio- and Video-based Biometric Person Authentication.
- Gross, Matthews, Cohn, Kanade, & Baker (2010). Multi-PIE. Image and Vision Computing.
- He, Zhang, Ren, & Sun (2016). Deep residual learning for image recognition. In Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition.
- Hertel, Barth, Käster, & Martinetz (2015). Deep convolutional neural networks as generic feature extractors. In 2015 International Joint Conference on Neural Networks (IJCNN).