Facial Expression Recognition by Jointly Partial Image and Deep Metric Learning

The performance of facial expression recognition (FER) tends to deteriorate due to high intraclass variations and high interclass similarities. To address this problem, an expression recognition model based on a joint partial image and deep metric learning method (PI&DML) is proposed. First, we propose cropping the action units (AUs) that are most closely related to the expression to generate a partial image for feature extraction, which helps to mitigate the negative impact of the abovementioned problems to some extent. Second, a novel expression metric loss function (EMLF) is suggested to enhance the intraclass similarities and interclass variations. Finally, superior performance is achieved by jointly optimizing the expression metric loss and classification loss. As demonstrated by the visualization results, the proposed EMLF is effective at increasing the distance between different expressions and reducing the distance between the same expressions. The evaluations on three public expression databases have demonstrated that our method is capable of achieving better results than the state-of-the-art methods.


I. INTRODUCTION
A facial expression is considered a major manifestation of human emotion. Therefore, if a machine is capable of accurately recognizing the facial expressions of human beings, it can improve the outcomes of human-computer interaction (HCI). FER has attracted increasing attention due to its widespread applications in HCI systems such as sociable robots, medical treatments, driver fatigue surveillance and so on [1]. The generic FER framework applied in most works can be split into three major parts: face detection, facial feature extraction and classification. Among them, the extraction of the most discriminative facial features is viewed as a significant factor in determining model performance, and these features can be roughly divided into two categories: human-designed features and learned features [2].
The human-crafted features primarily refer to local features, such as SIFT [3], HOG [4], LBP [5], [6], LPQ [7], etc. In addition to the abovementioned methods used to extract the 2D features of static images, methods that focus on the temporal and spatial information in an image sequence have also been proposed, such as spatiotemporal covariance descriptors (Cov3D) [8], the temporal modeling of the shape (TMS) [9], expressionlets on a spatiotemporal manifold (STM-ExpLet) [10], etc. The FER methods based on human-crafted features require additional classifiers for classification, such as the K-NN classifier [11], the SVM classifier [12], and the hidden Markov model [13]. Although these methods have been applied in more cases, their features tend to be relatively singular, and they are susceptible to disruptions caused by head pose and illumination changes [14].
The associate editor coordinating the review of this manuscript and approving it for publication was Yudong Zhang.
The learned features mainly refer to the features extracted using deep learning methods [15]-[18]. These features are diversified and robust to illumination changes and different head poses, and the methods based on them have achieved remarkable results in recent years. Khorrami et al. performed emotion recognition on video data using both CNNs and RNNs [19]. The CNN is employed to extract the image features, and the RNN is used to express temporal information changes. Yang et al. suggested a de-expression residual learning method based on the cGAN [20]. Reference [21] took the SIFT features of landmark points as input data and applied them to a well-designed DNN model to extract the optimal discriminative features for expression classification.
Despite the excellent performances achieved by these works on some datasets, they rarely focus on issues such as high intraclass variations and high interclass similarities, which could be caused by diverse head poses, illuminations, occlusions, and personal attributes (skin tone, age, gender, ethnicity, etc.). Addressing these issues remains a challenge for applying FER in the real world. As illustrated in Fig.1.(a), there are high intraclass variations and high interclass similarities due to different personal attributes or illuminations, and the learned features in the same class are scattered, which makes it difficult to perform classification in an embedded space. To solve this problem, we combined partial images and the expression metric loss function to reduce the intraclass variations and enhance the interclass variations.
(Fig.1: The length of the green arrow represents the distance between the same expressions, and the length of the red arrow represents the distance between different expressions. The greater the distance, the smaller the similarity.)
In this paper, distinct from previous works, we have not only researched extracting discriminative features based on the proposed metric loss function, but also explored the importance of partial images in determining FER performance. First, we crop the action units that are most relevant to the expression changes to generate a partial image, which is effective at addressing the abovementioned problems resulting from different illuminations, occlusions and other diverse factors in the overall image. Through experiments, we found that partial images can not only greatly reduce the dimensions of the input data, but also achieve better performance than the original image. Second, we apply the hard sample mining strategy to identify the hardest positive and negative sample pairs in the embedded space, which involves a relatively small amount of computation when compared to previous metric-based learning methods.
Third, we propose a novel expression metric loss function (EMLF) that is capable of achieving fast convergence to increase not only the similarity between positive sample pairs but also the variations between negative sample pairs, as shown in Fig.1.(b). Finally, by jointly optimizing the expression metric loss and classification loss in a unified framework, an improved classification accuracy compared to the state-of-the-art methods is achieved. Furthermore, we found that the joint optimization of the proposed metric loss and classification loss converges in fewer training epochs than optimizing the classification loss alone.
Overall, the contributions of this work are four-fold. 1) The PI&DML model is proposed, which aims to learn discriminative representations with lower intraclass variations and higher interclass distances. 2) A method for constructing partial images is proposed, which helps to mitigate the negative impacts of the abovementioned problems to some extent, and can greatly reduce the dimensions of the input data and the amount of calculation. 3) A novel expression metric loss function (EMLF) is suggested to enhance the intraclass similarities and interclass variations. 4) Superior performance is achieved by jointly optimizing the expression metric loss and classification loss. The evaluations on three public expression databases have demonstrated that our method is capable of achieving better results than the state-of-the-art methods.
The rest of the paper is organised as follows. Section 2 briefly reviews the related topics. Section 3 outlines the methods proposed in this paper, including the method of constructing partial images, the hard sample mining strategy and the proposed expression metric loss function. The experimental results compared with the state-of-the-art methods are given in Section 4. Finally, Section 5 presents a brief conclusion to this paper.

II. RELATED WORK
As mentioned in the introduction, expression recognition methods can be grouped into two categories: still image-based and sequence-based approaches. Since still image methods are more generic and can also be used to identify expressions from video sequences, we focus on methods for recognizing expressions using still images. Among these, deep learning methods based on convolutional neural network (CNN) architectures have recently shown excellent performance on FER tasks. Despite their popularity, these methods have two weaknesses. First, the learned features may yield similar representations for different expressions, especially for the same person or the same image brightness. Second, CNNs may generate high variations for the same expression, especially for different people and images with different brightnesses [2]. The emerging deep metric learning methods have demonstrated strong effectiveness in vision tasks with high intraclass variations and high interclass similarities, such as image retrieval [39], [40], person reidentification [41], [42], etc., which suggests that deep metric learning can also address these problems in FER.
Conventional metric learning methods usually learn a linear embedding of the data using the Mahalanobis distance [43], [44], but this is not enough to characterize the nonlinear relationships between sample pairs, which are quite common in real-world applications. Although the kernel trick can be adopted to address this limitation, the expressive power of kernel functions is often not flexible enough to capture the nonlinearity in the data [45]. Inspired by deep learning, which can effectively solve the nonlinearity problem of samples, deep metric learning has been proposed to learn nonlinear mappings [46]-[48]. For example, Hu et al. proposed a new discriminative deep metric learning method using deep neural networks for face verification [46], and Wang et al. proposed an angular loss for learning better similarity metrics [49]. The multi-similarity loss under the general pair weighting framework was proposed in [28].
VOLUME 8, 2020
For expression recognition, most recently, the island loss function based on the center loss [22] was proposed to reduce the intraclass variations while enlarging the interclass difference [2], which has led to satisfactory performance. Nevertheless, this method is more sensitive to noisy samples, and there are more hyperparameters that need to be determined. In addition, there are some other research works based on metric learning that have produced positive results [14], [23], [24]. However, a majority of them necessitate the selection of sample pairs and the labeling of identity information in advance, which requires much extra work.

III. APPROACH
In this section, we will start by presenting the overall framework for the proposed model, and then introduce the approach for constructing the partial image, the hard sample mining strategy and the proposed EMLF.

A. FRAMEWORK
The overall framework of the proposed model is illustrated in Fig.2. First, the mini-batch samples are cropped to generate partial images, which will be sent to the CNN for feature extraction. Then, the hardest positive and negative sample pairs are mined by applying hard sample mining technology in the embedded space for the calculation of the expression loss using the proposed EMLF. Third, the classification loss is calculated at the last fully connected layer. Finally, the overall network is optimized by minimizing the sum of the metric loss and classification loss expressions.
The specific architecture of our proposed model is inspired by the VGG block [50], which consists of a sequence of convolutional layers, followed by a max pooling layer for spatial downsampling. This allows the deep model to be constructed by reusing simple basic blocks. By carefully designing the network parameters, we also found that it is more efficient to use several deep and narrow convolution (i.e. 3 × 3) layers than a few wide convolution layers. In terms of specific structural parameters, our PI&DML model for both datasets is I(  ,32) [...], where ( ,32) is a convolutional layer with 32 3×3 filters, FC(512) refers to a fully connected layer with 512 nodes, FC(n_classes) is the softmax layer with n_classes outputs (n_classes being the number of expression classes for each dataset), and P(2) means a 2×2 max pooling layer. The stride of each layer was 1 with the exception of the pooling layers, whose stride was set to 2. Convolutional layers are used to extract the features of expressions. By applying the hard sample mining strategy and calculating the metric loss at the penultimate layer of the network, more features of each sample are retained, so that the similarity information between the samples can be fully utilized.
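The preference for several narrow 3 × 3 layers over a few wide ones can be illustrated with a quick parameter count. The sketch below only illustrates the general VGG-style argument and is not the paper's configuration: the channel width of 32 on both sides, the 5 × 5 comparison layer and the helper names are assumptions, and biases are ignored.

```python
def conv_params(kernel: int, in_ch: int, out_ch: int) -> int:
    """Number of weights in one kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_ch * out_ch


def receptive_field(num_layers: int, kernel: int) -> int:
    """Receptive field of stacked stride-1 convolutions sharing one kernel size."""
    return 1 + num_layers * (kernel - 1)


channels = 32  # assumed width, chosen only for illustration

# Two stacked 3x3 layers cover the same 5x5 window as a single 5x5 layer.
stacked = 2 * conv_params(3, channels, channels)
wide = conv_params(5, channels, channels)

print(receptive_field(2, 3), receptive_field(1, 5))  # 5 5
print(stacked, wide)  # 18432 25600
```

Under these assumptions the stacked variant reaches the same receptive field with fewer weights and one extra nonlinearity between the two layers.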

B. METHOD OF CONSTRUCTING PARTIAL IMAGE
Human expressions are conveyed by the movements of facial components, such as the eyes, the mouth and so on. Inspired by this, we select the action units (AUs) [25] (eyes, nose, mouth) that are considered to be most relevant to the expression to generate a partial image for extracting the discriminative features. The steps of forming a partial image are shown in Fig.3.(a), and an example of a partial image is shown in Fig.3.(b). Compared with the corresponding original images from the CK+ dataset [32], the partial images in Fig.4 clearly mitigate the influence of personal attributes, illumination and occlusion. Formally, we apply the face detection method from [26] and the landmark detection method from [27] to obtain the face matrix $A$ and the landmark coordinates $(x_i, y_i)$, $i = 0, 1, \dots, 67$. After this step, we can obtain the boundary point coordinates of each AU. According to the boundary point coordinates, each AU can be cropped from the original image, and the composition of each AU is shown in (1), where $\tau_1$ and $\tau_2$ represent the expanded range at the boundary points of each AU. In order to generate a partial image, each AU needs to be resized to a fixed size $S$, and the composition of the partial image is indicated in (2), where $C$ denotes the operation that splices the images together.
It is known from (1) and (2) that the partial image data are composed of only three parts with respect to the original image data A, which greatly reduces the dimensions of the input data, thereby reducing the amount of calculations.
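The crop, resize and splice steps can be sketched as follows. This is a minimal illustration and not the authors' implementation: treating the two margins as a horizontal and a vertical expansion, using nearest-neighbour resizing, stacking the regions vertically, and every function name and the toy landmark coordinates are assumptions made here.

```python
from typing import List, Sequence, Tuple

Image = List[List[int]]   # grayscale image stored as rows of pixel values
Point = Tuple[int, int]   # (x, y) landmark coordinate


def crop_region(image: Image, points: Sequence[Point],
                tau_x: int, tau_y: int) -> Image:
    """Crop the landmarks' bounding box, expanded by the margins and clipped."""
    height, width = len(image), len(image[0])
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    left = max(min(xs) - tau_x, 0)
    right = min(max(xs) + tau_x, width - 1)
    top = max(min(ys) - tau_y, 0)
    bottom = min(max(ys) + tau_y, height - 1)
    return [row[left:right + 1] for row in image[top:bottom + 1]]


def resize_nearest(region: Image, out_h: int, out_w: int) -> Image:
    """Nearest-neighbour resize of a region to a fixed (out_h, out_w) size."""
    in_h, in_w = len(region), len(region[0])
    return [[region[r * in_h // out_h][c * in_w // out_w]
             for c in range(out_w)]
            for r in range(out_h)]


def build_partial_image(image: Image, groups: Sequence[Sequence[Point]],
                        tau_x: int, tau_y: int,
                        size: Tuple[int, int]) -> Image:
    """Crop each landmark group, resize it, and stack the results vertically."""
    out_h, out_w = size
    partial: Image = []
    for points in groups:
        region = crop_region(image, points, tau_x, tau_y)
        partial.extend(resize_nearest(region, out_h, out_w))
    return partial


# Toy 100x100 "face" and hypothetical eye, nose and mouth landmarks.
face = [[(x + y) % 256 for x in range(100)] for y in range(100)]
groups = [
    [(30, 35), (70, 38)],   # eyes
    [(50, 45), (50, 60)],   # nose
    [(35, 75), (65, 80)],   # mouth
]
partial = build_partial_image(face, groups, tau_x=5, tau_y=3, size=(10, 20))
print(len(partial), len(partial[0]))  # 30 20
```

In this toy run the 100 × 100 input shrinks to 30 × 20 pixels, which mirrors the dimensionality reduction described above.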

C. HARD SAMPLE MINING STRATEGY
In the embedded space, let $x_i \in \mathbb{R}^d$ be the feature of the $i$-th sample; then, we can obtain a feature matrix $X \in \mathbb{R}^{m \times d}$ for the mini-batch samples, where $m$ indicates the batch size. The similarity between two samples is defined as $S_{ij} = \langle x_i, x_j \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the dot product. Then we can obtain an $m \times m$ similarity matrix $S$ for the mini-batch samples, whose element at $(i, j)$ is $S_{ij}$. Our aim is to enhance the similarity between samples of the same class while reducing the similarity between samples of different classes. A simple way to do this is to identify the samples of the same class that have low similarity to the current sample (the anchor), i.e. the hardest positive pairs, and increase their similarity to the anchor, and to identify the samples of different classes that have high similarity to the anchor, i.e. the hardest negative pairs, and reduce their similarity to the anchor. We apply the hard mining strategy from [28] to identify the hardest positive pairs and negative pairs, as illustrated in Fig.5. Formally, if $x_i$ is an anchor, a hardest negative pair $\{x_i, x_j\}$ is selected if $S_{ij}$ satisfies condition (3), where $\varepsilon$ indicates a given margin and $y_k$ denotes the label of the $k$-th sample. Similarly, a hardest positive pair $\{x_i, x_j\}$ must satisfy condition (4).
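A minimal sketch of the mining step is given below. It assumes that the selection conditions follow the pair-mining rule of the multi-similarity loss in [28]: a negative is kept when its similarity to the anchor exceeds the anchor's lowest positive similarity minus the margin, and a positive is kept when its similarity is below the anchor's highest negative similarity plus the margin. The function names and the toy features are hypothetical.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]  # (anchor index, other index)


def similarity_matrix(features: Sequence[Sequence[float]]) -> List[List[float]]:
    """Pairwise dot-product similarities between all feature vectors."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in features]
            for xi in features]


def mine_hard_pairs(sim: Sequence[Sequence[float]], labels: Sequence[int],
                    margin: float) -> Tuple[List[Pair], List[Pair]]:
    """Return (hard positive pairs, hard negative pairs) for every anchor."""
    positives: List[Pair] = []
    negatives: List[Pair] = []
    m = len(labels)
    for i in range(m):
        pos = [j for j in range(m) if j != i and labels[j] == labels[i]]
        neg = [j for j in range(m) if labels[j] != labels[i]]
        if not pos or not neg:
            continue  # the anchor needs both kinds of neighbours
        min_pos = min(sim[i][j] for j in pos)
        max_neg = max(sim[i][j] for j in neg)
        # Assumed rule: negatives that are nearly as similar as a positive.
        negatives += [(i, j) for j in neg if sim[i][j] > min_pos - margin]
        # Assumed rule: positives that are nearly as dissimilar as a negative.
        positives += [(i, j) for j in pos if sim[i][j] < max_neg + margin]
    return positives, negatives


# Toy batch: four unit-length 2-D features from two classes.
features = [(1.0, 0.0), (0.6, 0.8), (0.8, 0.6), (0.0, 1.0)]
labels = [0, 0, 1, 1]
sim = similarity_matrix(features)
hard_pos, hard_neg = mine_hard_pairs(sim, labels, margin=0.1)
print(hard_pos)
print(hard_neg)
```

In the toy batch, the pair (0, 3) is discarded as an easy negative because its similarity is far below that of the anchor's positive, so it would contribute nothing useful to the loss.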

D. EXPRESSION METRIC LOSS FUNCTION
Distinct from the previous works such as the contrastive loss [29], triplet loss [30], lifted structure loss [31], etc., a new pair-based expression metric loss function (EMLF) is proposed that removes the need for hyperparameters and is capable of achieving faster convergence rates. Our EMLF is built from the hardest positive pairs loss $J_{P\_Loss}$ in (5) and the hardest negative pairs loss $J_{N\_Loss}$ in (6), where $P_i$ and $N_i$ represent the hardest positive pairs and hardest negative pairs, respectively. It can be seen from (6) that reducing this loss value is equivalent to reducing the similarity between the hardest negative sample pairs, and (5) can be analyzed in the same way for the hardest positive sample pairs. The softmax loss is used to calculate the classification loss, and the L2-norm is applied to prevent overfitting. The total loss combines the expression metric loss and the classification loss, together with the L2 regularization term.
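The way a pair-based term and a classification term are added into one objective can be sketched as follows. The pair term here is only a placeholder and is not the proposed EMLF, whose exact form is not reproduced in this sketch. It assumes L2-normalized features (so similarities lie in [-1, 1]), omits the L2 regularization term, and uses hypothetical function names throughout.

```python
import math
from typing import List, Sequence, Tuple

Pair = Tuple[int, int]


def softmax_cross_entropy(logits: Sequence[float], label: int) -> float:
    """Cross-entropy of one sample, computed with the log-sum-exp trick."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[label]


def stand_in_metric_loss(sim: Sequence[Sequence[float]],
                         positives: Sequence[Pair],
                         negatives: Sequence[Pair]) -> float:
    """Placeholder pair loss (NOT the EMLF): raise positive similarities
    towards 1 and lower negative similarities towards 0."""
    pos_term = sum(1.0 - sim[i][j] for i, j in positives) / max(len(positives), 1)
    neg_term = sum(sim[i][j] for i, j in negatives) / max(len(negatives), 1)
    return pos_term + neg_term


def total_loss(logits_batch: Sequence[Sequence[float]], labels: Sequence[int],
               sim: Sequence[Sequence[float]],
               positives: Sequence[Pair], negatives: Sequence[Pair]) -> float:
    """Mean classification loss plus the (placeholder) metric loss."""
    cls = sum(softmax_cross_entropy(z, y)
              for z, y in zip(logits_batch, labels)) / len(labels)
    return cls + stand_in_metric_loss(sim, positives, negatives)


# Toy batch of three samples: 0 and 1 share a class, 2 belongs to another.
sim = [[1.0, 0.5, 0.4],
       [0.5, 1.0, 0.1],
       [0.4, 0.1, 1.0]]
logits = [[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
labels = [0, 0, 1]
loss = total_loss(logits, labels, sim, positives=[(0, 1)], negatives=[(0, 2)])
print(round(loss, 4))
```

Because both terms are differentiable functions of the same embedding, a single backward pass through the sum updates the network for classification and for the metric space at once.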

IV. EXPERIMENTS
In this section, to demonstrate the effectiveness of the proposed method for facial expression recognition, experiments are conducted on the CK+ [32], Oulu-CASIA [33] and MMI [34] public facial expression databases to evaluate the proposed model. Furthermore, in order to demonstrate the effectiveness of the proposed partial image method and EMLF, the PI&DML model is compared with three baseline CNNs, which have the same network structure as the PI&DML. They are the following: (1) Original images (the images of the detected face) + Softmax loss + EMLF (OSE), (2) Partial images + Softmax loss (PS), and (3) Partial images + Softmax loss + EMLF (PSE).
A. EXPERIMENTAL DATASETS
CK+ dataset: The CK+ dataset contains a total of 327 image sequences collected from 118 subjects, each of which is labeled as one of 7 expressions, i.e. anger, contempt, disgust, fear, happiness, sadness and surprise. Each sequence starts with a neutral face and reaches the peak in the last frame. Similar to other works [2], [14], the last three frames of each sequence are selected to generate 981 images for the experimental dataset.
Oulu-CASIA dataset: The Oulu-CASIA dataset contains a total of 480 image sequences collected from 80 subjects, each of which contains one of 6 expressions, i.e. anger, disgust, fear, happiness, sadness and surprise. Similar to the CK+ database, each sequence starts with a neutral facial expression and ends with the facial expression of each emotion. Following the previous works [2], [14], [16], the last three frames are collected as the peak frames of the labeled expression for each sequence. Thus, the Oulu-CASIA dataset contains 1,440 images for our experiments.
MMI dataset: The MMI database consists of 236 image sequences collected from 31 subjects. Each sequence is labeled as one of 6 basic expressions, i.e. anger, disgust, fear, happiness, sadness, and surprise, and starts from a neutral expression, passes through a peak phase in the middle, and returns to a neutral face at the end. Similar to other works [2], [14], [20], we selected 208 sequences captured in frontal view, and the three frames in the middle of each image sequence are collected as peak frames associated with the provided label. Hence, there are a total of 624 images used in our experiments.
Preprocessing: The image resolutions of the CK+ dataset, Oulu-CASIA dataset and MMI dataset are 640×490, 320×320, and 186×185, respectively. To select the size of the partial image, we observed the sizes of all partial images and took an equilibrium value (60×30) from them as the size of the final partial image. This equilibrium value does not cause obvious distortion in any of the partial images, and it reduces the dimensions of the input data compared to the initial size of the images in the dataset. Face alignment is performed in [2], [14], [19] and [20], and the image contrast is adjusted in [2] and [18]. Neither operation is performed on the images here, yet excellent results are achieved in our work, suggesting that the partial image is effective.
Training/testing strategy: To demonstrate the effectiveness of the proposed method, similar to other works [2], [20], a 10-fold subject-independent cross-validation is adopted for the evaluations conducted on all datasets, where each dataset is further split into 10 subsets. For each run, the data from 8 subsets are used for training and those from the remaining 2 subsets are used for testing. The results are reported as the average of the 10 runs. During each run, the same kind of expression of the same person must not appear in both the training set and the test set, because otherwise the model is likely to learn to determine whether the images show the same person rather than whether they show the same expression.
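A subject-independent split of this kind can be sketched as follows. This is one possible realization and not the authors' protocol: splitting at the level of whole subjects (which is sufficient to keep a person's expressions out of both sets), the round-robin assignment of subjects to folds, the choice of two consecutive folds as the test set of each run, and the function names are all assumptions.

```python
from typing import List, Sequence, Tuple


def subject_folds(subject_ids: Sequence[int], n_folds: int = 10) -> List[List[int]]:
    """Assign every subject to exactly one fold (round-robin over sorted ids)."""
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for pos, subject in enumerate(sorted(set(subject_ids))):
        folds[pos % n_folds].append(subject)
    return folds


def split_for_run(subject_ids: Sequence[int], folds: Sequence[Sequence[int]],
                  run: int, n_test_folds: int = 2) -> Tuple[List[int], List[int]]:
    """Return (train indices, test indices) for one cross-validation run."""
    test_subjects = set()
    for offset in range(n_test_folds):
        test_subjects.update(folds[(run + offset) % len(folds)])
    train_idx = [i for i, s in enumerate(subject_ids) if s not in test_subjects]
    test_idx = [i for i, s in enumerate(subject_ids) if s in test_subjects]
    return train_idx, test_idx


# Toy dataset: 20 subjects with 3 images each; entry i is the subject of image i.
subject_ids = [subject for subject in range(20) for _ in range(3)]
folds = subject_folds(subject_ids)

for run in range(10):
    train_idx, test_idx = split_for_run(subject_ids, folds, run)
    train_subjects = {subject_ids[i] for i in train_idx}
    test_subjects = {subject_ids[i] for i in test_idx}
    assert train_subjects.isdisjoint(test_subjects)

print(len(train_idx), len(test_idx))  # 48 12
```

Averaging the test accuracy over the ten runs then gives the reported figure.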

B. PARAMETERS SETTINGS
In (1), we empirically set $\tau_1 = 25$, $\tau_2 = 9$, and $S = (40, 25)$ for both datasets. In (3) and (4), $\varepsilon$ is set to 0.1. For the metric space learning and classification, the Adam [35] optimizer with a batch size of 8 and a learning rate of 0.0001 is used to train the proposed model. The weights of the convolutional layers and fully connected layers were both initialized randomly using the "Xavier" procedure [51], and the number of training epochs is set to 200.
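The settings above can be gathered in one place, together with a sketch of Xavier initialization. The uniform variant with limit sqrt(6 / (fan_in + fan_out)) is assumed here, since the text does not say whether the uniform or the normal variant was used; the layer sizes, dictionary keys and function name are illustrative only.

```python
import math
import random
from typing import List

# Hyperparameters stated in the text; the key names are hypothetical.
config = {
    "optimizer": "adam",
    "batch_size": 8,
    "learning_rate": 1e-4,
    "epochs": 200,
    "margin": 0.1,
}


def xavier_uniform(fan_in: int, fan_out: int,
                   rng: random.Random) -> List[List[float]]:
    """Sample a (fan_out x fan_in) weight matrix from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible
# Example sizes only: a 512-node layer feeding a 7-class output layer.
weights = xavier_uniform(fan_in=512, fan_out=7, rng=rng)
limit = math.sqrt(6.0 / (512 + 7))
print(len(weights), len(weights[0]), round(limit, 4))
```

Scaling the range by the layer's fan-in and fan-out keeps the variance of activations and gradients roughly constant across layers at the start of training.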

C. EXPERIMENTS RESULTS
Results on the CK+ dataset: The mean accuracy of the 10-fold cross validation is indicated in Table 1. As revealed by the last three results, better recognition accuracy can be achieved by jointly optimizing the expression metric loss and softmax loss than by using the softmax loss alone, which shows that the proposed metric loss function plays a positive role. Moreover, the recognition accuracy when using partial images is higher than the accuracy when using the original images, which shows that partial images can not only greatly reduce the amount of calculation, but also help to reduce the adverse effects caused by the original images. In comparison, the proposed PI&DML model outperforms both the human-crafted feature-based methods and the deep learning methods. Table 2 shows the confusion matrix of the PI&DML model on the CK+ dataset, and it can be seen that our proposed method performs reasonably well at recognizing all emotions.
Results on the Oulu-CASIA dataset: Table 3 summarizes the comparison results on the Oulu-CASIA dataset, and our proposed method improves the accuracy by 7% compared to the current state-of-the-art methods. In addition, it can be clearly seen that the recognition accuracy is low when experiments are performed on the original images, which demonstrates that the partial image is capable of mitigating the influences of illumination, personal attributes and other factors to some extent. The confusion matrix shown in Table 4 demonstrates that all emotions are accurately recognized.
Results on the MMI dataset: Since the MMI dataset contains a small number of samples, it is not large enough to train a deep model. Table 5 reports the average accuracy of 10 runs on the MMI database for recognizing six expressions. It can be clearly seen that the recognition accuracy of the proposed PI&DML model is significantly better than those of all the state-of-the-art methods.
As shown in the confusion matrix in Table 6, our algorithm was not successful enough for the fear emotion. In particular, most of the fear emotions were confused with surprise, which is the same as the results of other works [2], [16], [20], but our model has a higher recognition rate for other expressions.

D. VISUALIZATION RESULTS
To further illustrate the effectiveness of the proposed method, we visualized the features learned by the OSE, PS and PSE methods on the CK+ dataset, and these feature vectors are visualized using t-SNE [37], which provides a useful tool for the visualization of high-dimensional data. As shown in Fig.6.(a), the input data were spread on a random basis, and most overlap each other.
It can be seen from the comparison between Fig.6.(b) and Fig.6.(d) that the classification effect of the partial images is better than that of the original images when both the classification loss and the proposed metric loss are used. The distance between different classes is relatively large when using partial images, and the features extracted from the last fully connected layer of the proposed model are well separated according to their labels.
Comparing Fig.6.(c) with Fig.6.(d), it can be seen that the classification accuracy can be improved by using the proposed metric loss function: there is almost no overlap between different classes, and data of the same class are well clustered together. Therefore, our proposed method reduces the distance within the same class while increasing the variations between different classes.

E. DISCUSSION ON THE COMPUTATIONAL COST
First, the method of constructing partial images does not require a large amount of calculation, because on top of face detection we only add the step of detecting the eyes, mouth and nose. Although this step slightly reduces the detection speed, the detected regions are simply concatenated, which requires little computation. Second, because the partial image is much smaller than the original image, the amount of calculation needed to extract image features is reduced. Finally, the hard sample mining strategy and metric learning do increase the amount of calculation of the model, but with a small batch size the increase is not significant. For example, in the hard sample mining strategy, the similarity matrix is $S = X X^T$, where $X \in \mathbb{R}^{\text{batchsize} \times 256}$ and $S \in \mathbb{R}^{\text{batchsize} \times \text{batchsize}}$, so obtaining $S$ requires batchsize × batchsize dot products of length 256, i.e. 256 × batchsize × batchsize multiplications. To reduce the amount of calculation, we choose a small batch size of 8. Although the batch size is small, the experimental results verify that a better convergence effect can still be achieved.
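How the cost of the similarity matrix grows with the batch size can be checked with a short count. The sketch counts one multiply-accumulate per term of each dot product in $S = X X^T$, which is the standard cost of a dense matrix product; the function name and the larger batch sizes used for comparison are assumptions.

```python
def similarity_cost(batch_size: int, dim: int) -> int:
    """Multiply-accumulate operations needed for S = X @ X.T,
    where X has shape (batch_size, dim)."""
    # One dot product of length `dim` for each of the batch_size^2 entries.
    return batch_size * batch_size * dim


for batch_size in (8, 32, 128):
    print(batch_size, similarity_cost(batch_size, 256))
# 8 16384
# 32 262144
# 128 4194304
```

The cost is quadratic in the batch size, so a batch of 8 keeps the mining step negligible next to the convolutional layers.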

V. CONCLUSION
To address the problems of high intraclass variations and high interclass similarities in FER, an expression recognition model based on a joint partial image and deep metric learning method is proposed in this paper. First, the partial image helps to reduce, to some extent, these problems caused by personal attributes, illuminations, occlusion and other factors. Second, the proposed EMLF in combination with the hard sample mining strategy is applied to learn the nonlinear metric space. Finally, superior performance is achieved by jointly optimizing the expression metric loss and classification loss when compared to the state-of-the-art methods on the CK+, Oulu-CASIA and MMI databases.