Facial Expression Recognition Based on Multi-dataset Neural Network

Facial activity is the most powerful and natural means by which humans express and understand emotion. In recent years, extensive efforts have been devoted to facial expression recognition using neural networks. However, automated emotion recognition in the wild from facial images remains a challenging problem. In this paper, an effective facial expression recognition scheme is proposed. A multi-dataset neural network is developed to learn facial expression features from several different but related datasets. The multi-dataset network fuses the intermediate layers of a deep convolutional neural network (CNN) by using separate CNNs and a multi-dataset loss function. Experimental results on an emotion database demonstrate that the proposed method outperforms state-of-the-art methods.


Introduction
Facial Expression Recognition (FER) is a facet of human intelligence that has been argued to be indispensable, and even the most important, for a successful social life. At the core of social and emotional intelligence is the ability to notice and understand the emotional states and other social signals of a person. An automatic facial expression recognition system is desirable in emerging human-computer interaction (HCI) applications, such as online/remote education, interactive games, and intelligent transportation.
In recent years, many works have been published on accurately recognizing user behavior [1][2][3]. The support vector machine (SVM) is a supervised learning model, with associated learning algorithms, that analyzes data for classification and regression. It has been widely applied in HCI. Although SVM-based methods have achieved significant performance [1], major challenges remain open for HCI. A main concern is that SVM-based approaches may yield unsatisfactory results on emotion recognition tasks, particularly facial expression recognition. Numerous studies have been conducted on automatic FER because of its practical importance in human-computer interaction systems, such as user-interface design, medical treatment, and driver fatigue surveillance.
In recent years, deep learning techniques have gained a strong reputation in computer vision. Features learned via deep learning usually represent the data better than handcrafted features [7]. Compared with feature extraction based on expert knowledge, deep learning is considered a more effective way to discover the discriminative representations inherent in human faces, because it incorporates feature extraction into the task learning process. Deep learning can also be used by non-experts in their research and applications, especially in the FER field. FER systems can be roughly divided into two main categories: static image FER and dynamic sequence FER [3]. In static-based methods, the facial features [4] are generated with only spatial information from the current single image. Dynamic-based methods, on the other hand, consider the temporal relation among contiguous frames in the input facial expression sequence [5]. In this work, we focus on the former (static-based methods) because of their relatively low computational cost. Multi-dataset learning is inspired by human learning activities, in which people often take advantage of knowledge from previous tasks to help learn a new task. Compared with current state-of-the-art FER techniques, the contributions of our work can be summarized as follows:
• Many existing techniques focus on improving performance on a few public FER datasets, which may limit them in real-world cases. The proposed neural network is trained on images not only from FER datasets but also from wild-scenario datasets. Therefore, it can recognize expressions in real-life scenarios.
• A multi-dataset loss function is proposed to enhance the discriminative power of the deeply learned features.
• A novel multi-dataset network architecture is developed to improve the robustness of the proposed FER scheme.
The rest of the paper is organized as follows. Section 2 presents the review of FER approaches. The proposed method is presented in Sec. 3. Section 4 presents the experimental results and discussions. Finally, the concluding remarks are given in Sec. 5.

Related Works
Various approaches have produced important results for FER, especially considering that the problems they tackle were previously (almost) unexplored, and a large set of tools is now available. Among them, methods based on artificial neural networks have proven promising on challenging examples. CNNs have been extensively used in diverse computer vision applications, including [6], steganography [7], FER, and other applications [8][9][10]. Fasel [11] found that CNNs are robust to changes in face location and to scale variations. He also found that CNNs perform better than the multilayer perceptron (MLP) in human face detection when the face pose variations have not been seen previously. Matsugu et al. [12] proposed a CNN-based scheme to address translation, rotation, and scale invariance when recognizing facial expressions.
In wild scenarios, the accuracy of FER is unsatisfactory because a single descriptor lacks discriminative power, so fusion techniques can be used to combine multimodal features. Sikka et al. [13] combined multiple visual descriptors and paralinguistic audio features to classify video clips in a multimodal way. The extracted visual and acoustic features were combined using Multiple Kernel Learning. Girshick et al. [14] explored Regions with CNN features (R-CNN), in which a high-capacity CNN is applied to bottom-up region proposals to localize and segment objects. Inspired by R-CNN, Sun et al. [15] designed two temporal-spatial dense scale-invariant feature transform (SIFT) features and combined these multimodal features to recognize human expressions from image sequences. Linear SVM and partial least squares were used as classifiers for these features on the static and acted facial expressions in the wild data. They also proposed a fusion network to combine all the extracted features at the decision level, and achieved significant scores on public datasets.
Many FER schemes seldom consider the interaction between latent factors. In wild scenarios, FER may be affected by head pose, illumination, and so on. Multi-dataset learning models, on the other hand, aim to learn a cross-dataset parameter sharing strategy that reflects the similarities and differences between datasets. The goal of such selective parameter sharing is to remain robust to these differences when exploiting data from different datasets. Figure 1 illustrates a multi-dataset network.
The hyper-features must be associated in a way that effectively encodes features common to multiple datasets. Multi-dataset neural networks are trained with task data from several different but related datasets. The differences between multi-dataset learning and multitask learning are subtle, and some multi-dataset learning problems can be solved by methods developed for multitask learning. To learn different tasks using images from different databases, Fourure et al. [16] proposed a multitask CNN for semantic segmentation of outdoor images. They developed a selective softmax cross-entropy function with which images from another task can be used for training without parameter loss. They illustrated its capability with the following example. Let $D_{\mathrm{Grass}}$ and $D_{\mathrm{Vegetation}}$ be the labeled databases of grass and vegetation, respectively. The multitask CNN in [16] can accurately estimate the image from $D_{\mathrm{Grass}}$ when the input image belongs to $D_{\mathrm{Vegetation}}$. For human faces, however, the accuracy of their approach is not satisfactory [17]. Zhang et al. [18] adopted a cascaded structure with three stages of carefully designed deep convolutional networks that predict face and landmark locations in a coarse-to-fine manner. They achieved superior accuracy on several challenging datasets.
To address the problem of transferring representations learned from multiple source datasets, Xian et al. [19] utilized multiple CNN models trained on different labeled source datasets, feeding each other soft labels obtained by clustering on the target dataset. The enhanced model can learn more discriminative person representations than a single model trained on multiple datasets. In [20], a lightweight but powerful fully convolutional network was proposed, in which a dense anchor strategy and a scale-aware anchor matching scheme were used to improve the recall rate of small faces. Recently, a four-task network was proposed to detect human faces [21]. Facial landmarks and pose can be located and estimated simultaneously, and the method can also estimate gender. The authors developed a separate fusion CNN to fuse the intermediate layer features. However, the overall dimension of these intermediate layer features is too large to be learned by the network. A further weakness of their approach is that it uses only a single data source: all images used for training must be labeled for all relevant tasks.
Inspired by promising work on multi-dataset learning, we propose a novel FER neural network. With this network, knowledge can be transferred from other relevant tasks, while disturbing factors are disentangled.

Design of Emotion Database
To meet the needs of different scenarios, we use two kinds of datasets to form an emotion database. In principle, more than two datasets could be used to improve the robustness of the proposed FER network. In this work, the experimental database contains two types of datasets: facial emotion data collected under experimental conditions and facial expression data from wild scenarios.
The first includes the FER2013 dataset [22] and experimentally collected images. FER2013 is a large-scale, unconstrained database collected automatically with the Google image search API. All images have been registered and resized to 48 × 48 pixels after rejecting wrongly labeled frames and adjusting the cropped region. FER2013 contains 28,709 training images, 3,589 validation images, and 3,589 test images with seven expression labels. However, the number of Asian faces in many public datasets is quite small. To improve the robustness of our proposed method, especially for Asian faces, we added a large number of Asian faces to the experimental dataset. A total of 1400 facial images of subjects aged twenty to fifty were obtained. Each subject was asked to perform seven facial expressions (i.e., happiness, sadness, surprise, anger, disgust, fear, and neutral). A sample of our collection is shown in Fig. 2.
The second dataset includes the Static Facial Expressions in the Wild (SFEW) dataset [23] and experimentally collected images. SFEW is commonly used as training data for emotion recognition in wild scenes. It consists of static frames taken from short video clips of popular Hollywood movies. Each frame shows a film actor and is labeled with one of the seven basic facial expression categories, namely Anger, Disgust, Fear, Happiness, Neutral, Sadness, and Surprise. Several sample images from SFEW are presented in Fig. 3. As with the first dataset, we added many faces of Asian actors to the second dataset.
As a pre-processing step, face alignment is necessary in many facial recognition applications. In this work, we use the face alignment strategy proposed in [24] to robustly align human faces 'in-the-wild'. Finally, each aligned facial image was cropped to 48 × 48 pixels and converted to grayscale.
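The final pre-processing step (reducing an aligned face to a 48 × 48 grayscale image) can be sketched as follows. This is only an illustration: the BT.601 luminance weights and the nearest-neighbour resizing are assumptions, since the text does not state how the grayscale conversion or the size reduction was implemented, and the function names are hypothetical.

```python
def to_grayscale(rgb_image):
    """Convert rows of (R, G, B) tuples to luminance.

    The BT.601 weights below are an assumption, not taken from the paper.
    """
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]


def resize_nearest(image, out_h=48, out_w=48):
    """Nearest-neighbour resize of a 2-D image (assumed stand-in for cropping)."""
    in_h, in_w = len(image), len(image[0])
    return [[image[(y * in_h) // out_h][(x * in_w) // out_w]
             for x in range(out_w)]
            for y in range(out_h)]


# A synthetic 96 x 96 white image standing in for an aligned face.
aligned_face = [[(255, 255, 255)] * 96 for _ in range(96)]
network_input = resize_nearest(to_grayscale(aligned_face))
print(len(network_input), len(network_input[0]))
```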

Deep Multi-dataset Network
Because computational capacity is still limited, especially on wearable devices, a single model that serves multiple applications is more appropriate in some cases. The proposed method also needs to accommodate real-world conditions conveniently. Many existing FER networks focus on a single task and learn features that are sensitive to expressions without considering interactions with other latent factors. In the real world, factors such as lighting and head pose can disturb the FER task. We therefore use a multi-dataset learning technique to improve the robustness of FER networks.

Network Architecture
As shown in Fig. 4, the proposed architecture for facial expression recognition includes a pair of identical CNN components. The parameters C and m take different values in different layers of the network, as shown in the figure.
Each component contains five convolutional layers. Following the final convolutional layer, two fully connected (FC) layers consisting of 2048 neurons are employed. In the convolutional layers, we use kernels of size $m \times m \times C$, where $C$ is the depth of a filter and $m$ is the size of the convolutional kernel. Each convolutional layer is followed by a pooling layer. We use max-pooling to decrease the size of the feature maps. Let $l$ denote the index of a max-pooling layer. The layer's output is a set $O_l$ of square maps of size $w_l$, obtained from $O_{l-1}$. The map size is given by $w_l = w_{l-1} / k$, where $k$ is the size of the square max-pooling kernel.
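The size relation for max-pooling, in which each output map side equals the input side divided by k, can be checked with a small sketch. It assumes non-overlapping k × k windows (stride equal to k) and a map side divisible by k; the function name is illustrative only.

```python
def max_pool2d(feature_map, k):
    """Non-overlapping k x k max-pooling over a square map (list of lists)."""
    w_prev = len(feature_map)
    if w_prev % k != 0:
        raise ValueError("map size must be divisible by k")
    w_out = w_prev // k  # output side = input side / k
    return [
        [max(feature_map[r * k + dr][c * k + dc]
             for dr in range(k) for dc in range(k))
         for c in range(w_out)]
        for r in range(w_out)
    ]


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
pooled = max_pool2d(fmap, 2)
print(pooled)  # [[6, 8], [14, 16]]
```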
As an extension of the Rectified Linear Unit (ReLU) [23] activation function, the Concatenated Rectified Linear Unit (CReLU) was proposed to reduce the number of computations by half without losing accuracy [24]. It conserves both the positive and the negative linear responses after convolution so that each filter can efficiently represent its unique direction. Because CReLU retains the available information of the input while keeping the non-saturated non-linearity, it may be beneficial for more complex machine learning tasks, such as structured output prediction and multitask learning.
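As a minimal sketch of the activation itself, CReLU concatenates ReLU(x) and ReLU(−x), so both the positive and the negative responses are kept and the number of output channels doubles. The example works on a flat list of responses for a single spatial position; applying it per channel inside a network is left out.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def crelu(values):
    """Concatenate ReLU(x) and ReLU(-x); the output is twice as long as the input."""
    return relu(values) + relu([-v for v in values])


responses = [1.5, -2.0, 0.0]
activated = crelu(responses)
print(activated)  # [1.5, 0.0, 0.0, 0.0, 2.0, 0.0]
```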
To reduce the number of parameters and improve computational efficiency, an inception module [25] is used in our proposed network. The inception module produces output activations with receptive fields of different sizes. In the layers close to the input, the units are concentrated in local regions. Therefore, we place 1 × 1 convolution layers after the second convolutional layer to capture objects that vary greatly in size. The architecture of the proposed inception module is presented in Fig. 5.
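The parameter saving obtained by placing a 1 × 1 convolution before a larger kernel can be illustrated with simple arithmetic. The channel counts and the 5 × 5 kernel below are hypothetical and are not taken from Fig. 5; bias terms are ignored.

```python
def conv_params(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


# Hypothetical sizes: 256 input channels, 64 output channels.
direct = conv_params(5, 256, 64)
# Same output, but a 1 x 1 convolution first reduces 256 channels to 32.
reduced = conv_params(1, 256, 32) + conv_params(5, 32, 64)

print(direct, reduced)  # 409600 59392
```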

Network Integration
The outputs of the component networks must be integrated in the final step. A common way is weighted summation. Let $S_i$ be the final score of emotion $i = 1, \dots, p$, where $p$ is the total number of emotion classes; $S_i$ is calculated as the weighted sum of the scores that the component networks assign to emotion $i$. In general, the weight of each component network $j$ is set to the same value.
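The weighted summation of the component outputs can be sketched as below. The function name, the example scores, and the default of equal weights that sum to one are assumptions made for illustration.

```python
def fuse_scores(score_lists, weights=None):
    """Weighted sum of per-class scores from several component networks."""
    n_networks = len(score_lists)
    if weights is None:
        # Equal weights, as is common; normalising them to sum to 1 is an assumption.
        weights = [1.0 / n_networks] * n_networks
    n_classes = len(score_lists[0])
    return [sum(w * scores[i] for w, scores in zip(weights, score_lists))
            for i in range(n_classes)]


# Hypothetical class scores from the two component networks (3 classes shown).
net_a = [0.1, 0.7, 0.2]
net_b = [0.3, 0.5, 0.2]
fused = fuse_scores([net_a, net_b])
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```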
The above method is simple to use, but it may not fully exploit the ability of the two networks, which take different datasets as input. Therefore, we propose a fine-tuning method with a novel loss function for alternating integration of the two networks, which improves FER performance across multiple datasets. Different loss functions are used for each task, and training alternates between the different domains. Softmax loss is usually used in CNNs to force the features of different classes to stay apart. To further reduce the intra-class variations, the center loss method [26] was developed. It estimates the center of each class of features and, at the same time, pulls the features belonging to the same class toward their center. We start with the definition of the center loss function $L_C$:
$$L_C = \frac{1}{2} \sum_{i=1}^{k} \left\| x_i - c_{y_i} \right\|_2^2$$
where $y_i$ and $x_i$ are the class label of the $i$-th sample and the feature of the $i$-th sample generated from the fully connected layer, respectively, $c_{y_i}$ denotes the center of the cluster in which all samples are labeled as $y_i$, and $k$ is the number of samples. The authors of [26] compute a joint supervision loss to minimize the intra-class variations while keeping the features of different classes separable. The centers are updated in each iteration using Stochastic Gradient Descent (SGD) [27] as part of the CNN training.
During forward propagation, the weighted sum of the softmax loss and the center loss is calculated:
$$L = L_S + \lambda L_C$$
where $L_S$ is the softmax loss and $\lambda$ is a parameter that balances the softmax loss and the center loss.
In the backward propagation process, the partial derivative of the center loss $L_C$ with respect to the input sample $x_i$ is:
$$\frac{\partial L_C}{\partial x_i} = x_i - c_{y_i}$$
The centers are updated in the iterative optimization as defined below:
$$\Delta c_j = \frac{\sum_{i=1}^{k} \delta(y_i, j)\,(c_j - x_i)}{1 + \sum_{i=1}^{k} \delta(y_i, j)}$$
where $\delta(y_i, j)$ is defined as:
$$\delta(y_i, j) = \begin{cases} 1, & y_i = j \\ 0, & \text{otherwise} \end{cases}$$
However, in some cases the clusters of different classes overlap when the center loss is used. To enhance the discriminative power of the deeply learned features, we propose a novel multi-dataset (MD) loss function for FER, which further increases the distance between different clusters.
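The center loss quantities described above (the loss value, its gradient with respect to each feature, and the center update) can be sketched in plain Python. The sketch follows the standard center loss formulation of [26]; the step size `alpha` applied to the center update, the function names, and the toy features are assumptions for illustration.

```python
def center_loss(features, labels, centers):
    """Half the summed squared distance between each feature and its class center."""
    total = 0.0
    for x, y in zip(features, labels):
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
    return 0.5 * total


def center_loss_grad(features, labels, centers):
    """Gradient of the center loss with respect to each feature: x_i - c_{y_i}."""
    return [[xi - ci for xi, ci in zip(x, centers[y])]
            for x, y in zip(features, labels)]


def update_centers(features, labels, centers, alpha=0.5):
    """Move each center toward the mini-batch samples of its class.

    delta_j = sum over samples of class j of (c_j - x_i), divided by (1 + count_j);
    the update c_j <- c_j - alpha * delta_j with step size alpha is an assumption
    following the usual center loss recipe.
    """
    updated = []
    for j, c in enumerate(centers):
        members = [x for x, y in zip(features, labels) if y == j]
        delta = [sum(cd - x[d] for x in members) / (1 + len(members))
                 for d, cd in enumerate(c)]
        updated.append([cd - alpha * dd for cd, dd in zip(c, delta)])
    return updated


# Toy mini-batch: three 2-D features, two classes, centers initialised at the origin.
features = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]]
labels = [0, 0, 1]
centers = [[0.0, 0.0], [0.0, 0.0]]

loss = center_loss(features, labels, centers)
grads = center_loss_grad(features, labels, centers)
new_centers = update_centers(features, labels, centers)
print(loss, grads, new_centers)
```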
Unlike the center loss, the MD loss considers all the distances between a sample and the centers of the other classes. In the initial objective, $k$ is the number of clusters and $\lambda$ is a predefined margin parameter. The definitions involve $c_i$ and $c_j$, the $i$-th and $j$-th cluster centers, and $\varphi$, the variance of the features around their respective class centers. We used SGD to optimize the parameters. A parameter $P$ restricts the objective to the nearest class centers: the MD loss function in (6) is then refined so that, with $P$ carefully selected, class centers that are relatively far from the current center are skipped.
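To make the idea concrete, the sketch below shows one plausible margin-based loss that compares a sample's distance to its own center with its distances to the P nearest other-class centers. It is not the MD loss equation of this paper, whose exact form is not reproduced here; the hinge form, the function names, the fixed `margin` value, and the toy data are all assumptions.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def margin_center_loss(features, labels, centers, margin=1.0, num_nearest=2):
    """Illustrative hinge loss over the nearest other-class centers.

    For each sample, the distance to its own center should be smaller than the
    distance to each of the `num_nearest` closest other centers by `margin`.
    This is an assumed form, not the paper's MD loss.
    """
    total = 0.0
    for x, y in zip(features, labels):
        d_own = sq_dist(x, centers[y])
        d_others = sorted(sq_dist(x, c) for j, c in enumerate(centers) if j != y)
        # Far-away centers beyond the num_nearest closest ones are skipped.
        for d_other in d_others[:num_nearest]:
            total += max(0.0, margin + d_own - d_other)
    return total


centers = [[0.0, 0.0], [4.0, 0.0], [0.0, 10.0]]
features = [[1.0, 0.0]]
labels = [0]
print(margin_center_loss(features, labels, centers, margin=10.0, num_nearest=1))  # 2.0
```

Restricting the sum to the nearest centers keeps the cost per sample independent of the total number of classes, which matches the role the text gives to P.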
Given a set of images from $N$ different datasets whose label spaces differ, let $L_{ML_k}$ denote the multi-dataset loss of the $k$-th dataset. The overall loss function is given by:
$$L = \sum_{k=1}^{N} \lambda_k L_{ML_k}$$
where $\lambda_k$ is the weight of each loss.
The update is performed on mini-batches, which avoids a large amount of computation and improves stability. Following the classical back-propagation (BP) algorithm, the entire parameter updating process of the MD loss is summarized in Algorithm 1.

Experiment Results and Analysis
For deep feature learning, we use TensorFlow, which is commonly used in recent works. The first network was pre-trained on the first dataset (i.e., the experimental facial emotion images) in the MTE database, as described in Sec. 3.1. All samples were divided into training data (70%) and test data (30%). The base learning rate is set to 0.01 and is divided by 10 after every 10,000 iterations. In each iteration, 256 samples are used for stochastic gradient optimization. After 200 epochs of training, the first sub-network obtained 70.14% accuracy on the first dataset in MTE. The second network was pre-trained on the second dataset in MTE with the same training strategy as the first network. The second sub-network reached 64.11% accuracy on the MTE database after 311 epochs of training. In the fine-tuning stage, we exchanged the tuning datasets, that is, the second dataset was used for the first network and the first dataset (i.e., the experimental facial emotion dataset) for the second network. The base learning rate was changed to 0.001. The validation accuracy converged after 200 epochs of fine-tuning.
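The training setup above (a 70%/30% split and a learning rate divided by 10 every 10,000 iterations) can be written down as a short sketch. The function names, the random shuffling, and the seed are assumptions; the text does not say how the samples were assigned to the two subsets.

```python
import random


def step_decay_lr(iteration, base_lr=0.01, step=10_000, factor=10.0):
    """Learning rate divided by `factor` after every `step` iterations."""
    return base_lr / (factor ** (iteration // step))


def split_train_test(samples, train_fraction=0.7, seed=0):
    """Shuffle and split samples; shuffling with a fixed seed is an assumption."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


train, test = split_train_test(range(100))
print(len(train), len(test))
print(step_decay_lr(0), step_decay_lr(10_000), step_decay_lr(25_000))
```

For the fine-tuning stage, the same schedule would be called with `base_lr=0.001`.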
Note that those two softmax functions were used only in the training step. All experiments in this work were run on an NVIDIA GeForce GTX Titan GPU. We applied dropout to each fully connected layer. The recognition accuracy for class $i$ is computed as
$$\mathrm{Accuracy}_i = \frac{TP_i + TN_i}{N}$$
where $i$ specifies the class, i.e., the $i$-th emotion category, $TP_i$ (true positives) are correctly identified test instances of class $i$, $TN_i$ (true negatives) are test images correctly labeled as not belonging to class $i$, and $N$ is the total number of test images.
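Given a confusion matrix, a per-class accuracy in terms of TP_i, TN_i, and N can be computed as sketched below, assuming the usual form (TP_i + TN_i) / N. Rows are taken to be true classes and columns predicted classes; the matrix values are made up for illustration and are not results from this paper.

```python
def per_class_accuracy(confusion):
    """Per-class accuracy (TP_i + TN_i) / N from a square confusion matrix.

    confusion[r][c] counts test images of true class r predicted as class c
    (this orientation is an assumption).
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracies = []
    for i in range(n_classes):
        tp = confusion[i][i]
        row_sum = sum(confusion[i])                               # images of class i
        col_sum = sum(confusion[r][i] for r in range(n_classes))  # predicted as class i
        tn = total - row_sum - col_sum + tp                       # neither true nor predicted i
        accuracies.append((tp + tn) / total)
    return accuracies


# Hypothetical 3-class confusion matrix.
example = [[5, 1, 0],
           [2, 3, 1],
           [0, 0, 8]]
print(per_class_accuracy(example))  # [0.85, 0.8, 0.95]
```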

Face detection:
We use the Selective Search algorithm, in a similar manner to R-CNN, to generate region proposals for faces in an image. A region with an overlap of more than 0.5 with the ground-truth bounding box is considered a positive sample (l = 1). Candidate regions with an overlap of less than 0.35 are treated as negative instances (l = 0). All other regions are ignored.
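The labeling rule for region proposals can be sketched as follows. Measuring overlap as intersection over union (IoU) is an assumption based on common R-CNN practice, since the text only says "overlap"; the box format and function names are likewise illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def label_region(proposal, ground_truth, pos_thresh=0.5, neg_thresh=0.35):
    """Return 1 (positive), 0 (negative), or None (ignored) for a proposal."""
    overlap = iou(proposal, ground_truth)
    if overlap > pos_thresh:
        return 1
    if overlap < neg_thresh:
        return 0
    return None  # between the two thresholds: ignored during training


face = (0, 0, 10, 10)
print(label_region((3, 0, 13, 10), face))  # 1
print(label_region((5, 0, 15, 10), face))  # 0
print(label_region((4, 0, 14, 10), face))  # None
```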
Face alignment: Face alignment is conducted to reduce variation in face scale and in-plane rotation across different facial images. In this work, we use the face alignment strategy proposed in [24] to robustly align human face 'in-the-wild'.
The next experiment, emotion recognition, was performed on the MTE database. The confusion matrix of the proposed multitask CNN model is reported in Tab. 1.
We also test our approach on well-known, publicly available datasets. The confusion matrices on CK+ [28] and MMI [29] are listed in Tab. 2 [...] dataset when recognizing the fear emotion. However, few methods perform well in fear recognition, since this emotion is too subtle to be learned on MMI.
To evaluate the proposed approach comprehensively, we compare it with several state-of-the-art FER methods [30][31][32]. The database used in this experiment consisted of CK+ [28], MMI [33], and MTE. We used 70% of the total samples for training and the rest for testing. As shown in Tab. 4, the experimental results demonstrate that our scheme obtains better recognition performance than the others, which we attribute to the multi-dataset architecture and the MD loss function. The overall time taken to perform both tasks was 2 s per image. We also replaced the loss function in [30][31][32] with the proposed MD loss function to test its validity. The results are listed in the third column of Tab. 4 (the value after the slash). The proposed MD loss function increases the accuracy of FER by enhancing the discriminative power of the deeply learned features.
To further test the usability of our proposed multi-dataset network, we evaluated its performance on facial landmark localization. The algorithm proposed in [24] was used to extract the facial landmark localization data in the MTE database. The training and fine-tuning steps were the same as for the FER task described at the beginning of this section. The methods used for comparison include Li et al. [34] and Ramanan et al. [...]. The comparison result shown in Fig. 6 demonstrates that our multi-dataset strategy can be applied well to facial landmark localization.

Conclusion
In this paper, we presented a novel facial expression recognition method based on machine learning. A neural network trained on multiple datasets was proposed to learn expression features from different datasets, which increases the robustness of the FER task. To enhance the discriminative power of the deeply learned features, we proposed a novel multi-dataset loss function based on the center loss algorithm. Extensive experiments demonstrated that the proposed method is able to recognize both experimental facial emotions and facial expressions in wild scenarios. In the future, we will evaluate the performance of our method on other applications, such as simultaneously predicting facial landmarks, human pose, and gender.