Feature Guided CNN for Baby’s Facial Expression Recognition

State-of-the-art facial expression recognition methods now outperform human beings, largely thanks to the success of convolutional neural networks (CNNs). However, most existing work focuses on adult faces and ignores two important questions: how can facial expressions be recognized from a baby's face image, and how difficult is the task? In this paper, we first introduce a new face image database, named BabyExp, which contains more than 12,000 images of babies younger than two years old, each labeled with one of three facial expressions (happy, sad, or normal). To the best of our knowledge, BabyExp is the first dataset dedicated to analyzing baby face images; it complements the existing adult face datasets and can shed some light on baby face analysis. We also propose a feature guided CNN method with a new loss function, called distance loss, to optimize interclass distance. To facilitate further research, we provide an expression recognition benchmark on the BabyExp dataset. Experimental results show that the proposed network achieves a recognition accuracy of 87.90% on BabyExp.


Introduction
Facial expressions play an important role in human communication. The ability to differentiate genuine displays of emotional experience from posed ones is very important for dealing with day-to-day social interactions, and both humans and computer algorithms can benefit greatly from being able to distinguish a genuine expression from a posed one. Possible applications of automated facial expression recognition include better transcription of videos, movie or advertisement recommendations, and detection of pain in telemedicine. Therefore, facial expression recognition has attracted a vast amount of attention in the past two decades [1][2][3][4][5][6].
The development of facial expression recognition relies heavily on adequate databases of facial expressions. However, due to the nature of facial expressions, only a limited number of publicly available databases provide a sufficient number of facial images tagged with accurate expression information. Table 1 shows the major differences among the existing image databases in terms of the number of images, number of subjects, expression distribution, data size, and release year. Most of the existing works and datasets [7][8][9][10][11] focus on adult faces and ignore how to analyze facial expressions from baby face images. Although some datasets include children, they contain very few images of very young children, and none of them is specifically designed to explore the expressions of babies. There are two main reasons for the lack of research on baby face analysis. The first reason is that the community has not recognized the application value of analyzing babies' facial expressions. In fact, there are many such applications, including advertising and marketing aimed at parents, intelligent family child care, and scientific parenting. The second reason may be traced to the additional challenge of obtaining baby face datasets with accurate expression labels.
The period from 0 to 2 years old is widely regarded as a golden period for a baby's development and for laying a solid foundation for lifelong physical and mental health. Therefore, it is valuable to develop algorithms that interpret a baby's facial expressive signals for scientific parenting. In addition, owing to the support of national policies and people's growing attention to the growth and development of babies, the parenting market has been expanding. Accurate recognition of a baby's facial expressions is of great significance for the development of scientific parenting. All these real needs strongly motivate the study of baby facial expression recognition.
Recently, researchers have recognized the importance of children's facial expressions for studying how the interpretation of expressions develops. For example, the new NIMH Children's Emotional Face Picture Collection (NIMH-ChEFS) contains photos of children aged 10-17 [12], the Radboud Faces Database includes photos of 8- to 12-year-olds [13], and the CAFE set features photographs of 2- to 8-year-old children [14]. Although these new datasets give researchers the option to use a sample of children aged 2-17 years, no dataset to date features younger children. Moreover, all the children's facial expression datasets mentioned above contain only a small number of images, which makes them unsuitable for training convolutional neural network (CNN) models. In addition, these datasets contain facial images with posed expressions captured in a lab-controlled environment.
In this paper, to address the aforementioned issues, we propose a new image dataset with expression labels of baby faces for automatic facial expression recognition. Our dataset, which is called the BabyExp dataset, contains more than 12,000 images from babies younger than two years old showing spontaneous expressions in an uncontrolled environment. Each face image is annotated with one of three facial expressions (i.e., happy, sad, and normal). It is complementary to existing adult face datasets and can shed some light on exploring baby face analysis. Our key contributions are summarized as follows: (1) We present a facial expression dataset, named BabyExp, which contains more than 12,000 images from babies showing spontaneous genuine expressions in an uncontrolled environment. Each image is annotated with one of three facial expressions (i.e., happy, sad, and normal). (2) We propose a new distance loss function to effectively enhance the discriminative ability of distance between classes in unconstrained facial expression recognition tasks.

Data Collection.
Our baby face images come from both static images and video sequences uploaded by parents using smartphones. For both the original images and the original video data, we first perform face detection and then face cropping; for video frames, we finally perform picture similarity detection. Each step is described in detail below.

Image Preprocessing.
For image preprocessing, we first use the Dlib visual library [15] and the OpenCV visual library to perform face detection and cropping on the original image. During face detection, we adopt the following strategy. First, if a face is detected, the face region is extracted. Second, if no face is detected, we rotate the image clockwise in steps of 90 degrees, up to 270 degrees in total, and rerun detection after each step; if a face appears in any of the three rotated images, we crop and save the face image. Last, if two or more faces are detected in the image, we assume that the image contains an adult face or a region that is not a human face but is misidentified as one, and we discard the image. It is important to note that the baby's face usually occupies only a small area of the original picture, so much of the picture is redundant. If such pictures are used directly for training, the model converges slowly, resulting in poor test results. Therefore, to reduce the large amount of nonface information in the image, after applying the Dlib face detection strategy above, we crop the face area according to a specific manually designed strategy and save it. The main purpose is to obtain a clean, good-quality baby face image dataset, which yields a better model during training and better accuracy during testing. We then crop the original image according to the new picture size and finally normalize the cropped image to 256 × 256.
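The detection strategy above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the detector is passed in as a hypothetical callable `detect_faces` (in practice it could wrap a Dlib or OpenCV detector), images are plain nested lists, and two points the text leaves open are treated as assumptions: the multiple-face rule is applied at every orientation, and an image with no detection after all rotations is discarded.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Image = List[List[int]]          # grayscale image stored as rows of pixel values
Box = Tuple[int, int, int, int]  # (top, left, bottom, right); bottom/right exclusive


def rotate90_clockwise(image: Image) -> Image:
    """Rotate a row-major image by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def detect_single_face(
    image: Image,
    detect_faces: Callable[[Image], Sequence[Box]],  # hypothetical detector wrapper
) -> Optional[Image]:
    """Return the cropped face region, or None if the image should be discarded."""
    current = image
    for attempt in range(4):  # orientations: 0, 90, 180, and 270 degrees
        boxes = detect_faces(current)
        if len(boxes) >= 2:
            # Two or more detections: an adult is present or a non-face was misdetected.
            return None
        if len(boxes) == 1:
            top, left, bottom, right = boxes[0]
            return [row[left:right] for row in current[top:bottom]]
        if attempt != 3:
            current = rotate90_clockwise(current)
    return None  # assumption: no face at any orientation means the image is dropped
```

Passing the detector as an argument keeps the rotation logic independent of any particular library, so the same function can be exercised with a stub detector.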

Video Preprocessing.
We segment the original video data by taking one image every 30 frames and then apply the same preprocessing as for static images to the sampled frames: detecting, rotating, and finally cropping the baby's face picture. Because pictures obtained from video frames may be highly similar, many of the resulting images are redundant. The only operation that differs from the static image pipeline is therefore that, after a picture is cropped and saved, we perform picture similarity matching to filter the images. We use SSIM [16] for similarity matching and delete images whose similarity is greater than 90%.
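The frame sampling and similarity filtering can be illustrated as follows. This is a simplified sketch under stated assumptions: SSIM is computed once over the whole image (the reference formulation [16] averages it over local windows), images are flattened lists of grayscale values in [0, 255], and each candidate is compared against every image already kept, a detail the text does not specify. The function names are illustrative.

```python
from typing import List, Sequence


def sample_every(frames: Sequence, step: int = 30) -> List:
    """Keep one frame out of every `step` frames, starting with the first."""
    return list(frames[::step])


def global_ssim(a: Sequence[float], b: Sequence[float], data_range: float = 255.0) -> float:
    """Single-window SSIM between two equally sized, flattened grayscale images."""
    if len(a) != len(b) or not a:
        raise ValueError("images must be non-empty and have the same size")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    var_a = sum((x - mean_a) * (x - mean_a) for x in a) / n
    var_b = sum((y - mean_b) * (y - mean_b) for y in b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    c1 = (0.01 * data_range) ** 2  # standard SSIM stabilizing constants
    c2 = (0.03 * data_range) ** 2
    numerator = (2 * mean_a * mean_b + c1) * (2 * cov + c2)
    denominator = (mean_a * mean_a + mean_b * mean_b + c1) * (var_a + var_b + c2)
    return numerator / denominator


def filter_similar(images: Sequence[Sequence[float]], threshold: float = 0.9) -> List:
    """Drop any image whose SSIM with an already kept image exceeds the threshold."""
    kept: List = []
    for image in images:
        if all(global_ssim(image, other) <= threshold for other in kept):
            kept.append(image)
    return kept
```

Comparing against all kept images is quadratic in the number of frames; for long videos, comparing only with the most recently kept frame would be a cheaper alternative.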

Data Annotation.
After preprocessing, we obtain 7,600 images, which we then tag with facial expressions. Because the babies are all 0-2 years old, their expressions are not as diverse as those of adults. For this reason, we selected three main baby expressions (normal, sad, and happy) for the BabyExp dataset. The labeling process is divided into three steps: manual labeling, label statistical analysis, and label aggregation.
In the manual labeling step, 10 raters from Harbin Institute of Technology were selected to manually label the data. Without being given any additional information, the raters were asked to classify the photos according to their own experience. To save time and boost labeling efficiency, we implemented a manual labeling tool in C++ that records each rater's choice of expression label. For each input image, each of the 10 raters was required to choose exactly one of four categories: happy, sad, normal, or error. The error category indicates that the image is not a human face or that the face is unclear.
The second step is label statistical analysis. After the 10 raters have finished labeling, we tally, for each picture, the expression categories they selected. With labels from 10 raters for each face image, we can generate a probability distribution over the emotions captured by the facial expression. Let $N$ denote the number of training examples $I_i$, $i = 1, \ldots, N$. Given the $i$-th example $I_i$, its label distribution from the raters can be expressed as $p_i = (p_i^1, \ldots, p_i^K)$, where $p_i^k$ is the fraction of raters who chose the $k$-th category.
The final step is to aggregate the labels of each image, combining the labels produced by the 10 raters into a single result: happy, normal, sad, or error. In most of the existing facial expression datasets, each facial image is associated with only one label. If an image has received more than one label, it is natural to assign it the label with the largest $p_i^k$. We therefore adopt a majority voting scheme. More formally, we create a new target distribution in which the majority label receives probability 1 and all other labels receive probability 0.
After aggregation, each image is assigned the expression category that received the most votes. If two categories tie for the maximum number of votes, the image is left unclassified and is labeled a second time to determine its expression label. In the end, we obtained 2,502 happy images, 4,028 normal images, and 1,070 sad images, as shown in Figure 1. It can be clearly seen that the three expressions are unbalanced in the baby expression dataset. This is because babies differ from adults, whose rich expressions lead to a more uniform expression distribution. The expressions of babies from 0 to 2 years old are still developing, and the expression types are relatively monotonous. In the absence of outside interference, a baby is in a calm state most of the time, followed by laughter and, least often, sadness. The proportion of normal images is therefore relatively large and the proportion of sad images relatively small, which is consistent with the expression characteristics of babies. However, imbalanced data may strongly affect the accuracy of experimental results; one solution is to use data augmentation and synthesis to balance the class distribution during the preprocessing phase.
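The label statistics and majority-vote aggregation described above can be sketched as follows. The function names are illustrative, and returning `None` for a tie is an assumed way of flagging an image for the second labeling round; the authors' C++ tool is not reproduced here.

```python
from collections import Counter
from typing import Dict, Optional, Sequence

LABELS = ("happy", "normal", "sad", "error")


def label_distribution(votes: Sequence[str]) -> Dict[str, float]:
    """Fraction of raters who chose each category for one image."""
    counts = Counter(votes)
    return {label: counts[label] / len(votes) for label in LABELS}


def majority_label(votes: Sequence[str]) -> Optional[str]:
    """Label with the most votes, or None when the top vote count is tied."""
    counts = Counter(votes)
    best = max(counts[label] for label in LABELS)
    winners = [label for label in LABELS if counts[label] == best]
    return winners[0] if len(winners) == 1 else None  # None: send back for relabeling


votes = ["happy"] * 6 + ["normal"] * 3 + ["sad"]
print(label_distribution(votes))  # happy 0.6, normal 0.3, sad 0.1, error 0.0
print(majority_label(votes))      # happy
```

Ties are detected on the raw vote counts rather than on the fractions, which avoids comparing floating-point values for equality.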

Data Augmentation.
According to the dataset statistics above, the dataset is imbalanced, which will adversely affect the subsequent experimental work. Although deep learning has a strong feature learning ability, some technical hurdles prevent its successful application to our dataset. First, deep neural networks require a lot of training data to avoid overfitting. Additionally, models trained on imbalanced facial expression samples generalize poorly and are prone to overfitting, as illustrated later in the experimental section. We therefore perform data augmentation to promote data balance and facilitate the use of deep learning methods in our experiments.
At present, generative adversarial networks (GANs) [17] are a popular research topic in the field of machine learning. Their basic idea is derived from a two-player game in game theory. In the GAN framework, a "generator" network is tasked with fooling a "discriminator" network into believing that its own samples are real data. Inspired by the successful application of GANs in the field of image style transfer, we use a GAN as the network model for data augmentation. In general, the resulting generative model can be used to generate faces with specific expressions from nothing but random noise. Many types of GANs require paired datasets for image style transfer. Our baby expression images do not include sad and happy images paired with normal images of the same baby, so this part draws on the key idea of CycleGAN [18] training for unpaired image-to-image translation. Specifically, we use CycleGAN to augment the sad and happy facial expression images in the imbalanced baby facial expression data.
The CycleGAN architecture contains two generators (Generator A and Generator B) and two adversarial discriminators. The data augmentation of sad expressions follows the same process as that of happy expressions and is not described in detail here. Because the number of normal expression images is already sufficient, we augmented only the sad and happy expression image data. After CycleGAN data augmentation, 1,498 generated happy expression images and 2,955 generated sad expression images were selected.
The total amount of facial expression data we obtained is shown in Table 2. After data augmentation, we have 4,000 happy images, 4,028 normal images, and 4,025 sad images, for a total of 12,053 baby facial expression images, of which 4,453 are generated. We call this collection the BabyExp dataset. The three facial expression classes are now approximately balanced, which benefits future academic research.
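As a quick consistency check, the class counts reported above can be recombined in a few lines. The helper names are illustrative, and the max-to-min ratio is just one simple way to quantify imbalance, not a measure used by the authors.

```python
from typing import Dict

# Counts reported in the text: labeled originals and CycleGAN-generated images.
original = {"happy": 2502, "normal": 4028, "sad": 1070}
generated = {"happy": 1498, "normal": 0, "sad": 2955}


def merge_counts(a: Dict[str, int], b: Dict[str, int]) -> Dict[str, int]:
    """Add per-class counts from two sources."""
    return {label: a[label] + b[label] for label in a}


def imbalance_ratio(counts: Dict[str, int]) -> float:
    """Size of the largest class divided by the size of the smallest class."""
    return max(counts.values()) / min(counts.values())


final = merge_counts(original, generated)
print(final)                                 # {'happy': 4000, 'normal': 4028, 'sad': 4025}
print(sum(final.values()))                   # 12053
print(sum(generated.values()))               # 4453
print(round(imbalance_ratio(original), 2))   # 3.76
print(round(imbalance_ratio(final), 2))      # 1.01
```

The largest class is almost four times the smallest before augmentation and nearly equal to it afterwards.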

Proposed Methods.
The overall pipeline of the proposed deep learning approach is depicted in Figure 3. Our proposed framework, called VFESO-DLSE, is composed of four modules: feature extraction, feature refinement, covariance pooling, and CNN classification. We also propose a new loss function, called distance loss, denoted as $L_{DL}$.

Distance Loss.
Min Xia et al. [19] found that a feature constraint helps enlarge the distance between the feature spaces of different age ranges in face images with similar feature distributions. Inspired by this, we propose a novel loss function, called distance loss, which brings a strong feature constraint into baby facial expression learning. The distance loss aims to learn representations with lower intraclass variations and higher interclass distances. By pushing the samples toward their corresponding class centers in the feature space during training, the center loss [20] significantly reduces the intraclass difference. The center loss is defined as the sum of the squared distances between each sample and its corresponding class center in the feature space. In its standard form, the center loss, denoted as $L_C$, is written as

$$L_C = \frac{1}{2} \sum_{i=1}^{m} \left\lVert x_i - c_{y_i} \right\rVert_2^2,$$

where $y_i$ is the class label of the $i$-th sample; $x_i$ denotes the feature vector of the $i$-th sample taken from the FC layer before the decision layer; $c_{y_i}$ denotes the center of all the samples with the same class label as $x_i$; and $m$ is the number of samples in the mini-batch. Our distance loss, denoted as $L_{DL}$, combines two terms defined over the class centers, where $N_j$ and $N_k$ denote the sets of expression labels and $C_k$ and $C_j$ denote the $k$-th and $j$-th centers. Specifically, the first term narrows the distance between each sample and the center of its class, and the second term penalizes the similarity between different expressions. $\lambda_1$ is used to balance the weights of the two terms. By minimizing the distance loss function, samples of the same expression are brought closer together, and different expressions are pushed apart in the feature space.
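To make the two-term structure concrete, the sketch below combines a center term with an interclass term. The center term follows the standard center loss, with its conventional factor of 1/2. The interclass term is only an assumed placeholder, the sum of 1 / (1 + squared distance) over pairs of class centers, chosen because it shrinks as centers move apart; it should not be read as the formula used in this paper. All function names are illustrative.

```python
from typing import Dict, Sequence

Vector = Sequence[float]


def squared_distance(u: Vector, v: Vector) -> float:
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((a - b) * (a - b) for a, b in zip(u, v))


def center_term(features: Sequence[Vector], labels: Sequence[int],
                centers: Dict[int, Vector]) -> float:
    """Half the summed squared distance from each sample to its class center."""
    return 0.5 * sum(squared_distance(x, centers[y]) for x, y in zip(features, labels))


def separation_term(centers: Dict[int, Vector]) -> float:
    """ASSUMED interclass penalty: large when two class centers are close."""
    keys = sorted(centers)
    total = 0.0
    for index, j in enumerate(keys):
        for k in keys[index + 1:]:
            total += 1.0 / (1.0 + squared_distance(centers[j], centers[k]))
    return total


def distance_loss_sketch(features: Sequence[Vector], labels: Sequence[int],
                         centers: Dict[int, Vector], lambda_1: float = 1.0) -> float:
    """Center term plus lambda_1 times the assumed interclass penalty."""
    return center_term(features, labels, centers) + lambda_1 * separation_term(centers)
```

With centers at (0, 0) and (3, 4), the penalty is 1/26; moving the second center farther away lowers it, which is the qualitative behavior the second term is meant to encourage.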

Feature Guided CNN.
Expression changes in babies aged 0 to 2 years involve relatively small facial distortions. Although CNNs have achieved great performance in image processing [21][22][23], traditional CNNs consist of fully connected layers, maximum or average pooling layers, and convolutional layers, which capture only first-order information [24]. We believe that second-order statistics are better suited than first-order statistics to capturing such distortions in babies' expressions. We therefore take the network architecture model-4 presented in [25] as the baseline model. Related studies [26,27] have shown that a trained deep convolutional network can be used as a feature extraction tool for classification tasks and that it generalizes well. Following this idea, we apply the well-known VGG16 [28] model for feature extraction in our method. VGG16 is a typical CNN model. It has 13 convolutional layers, 5 pooling layers, and 3 fully connected layers for face recognition. To extract expression features, we use a VGG16 network pretrained on the expression dataset (referred to as VFE). For each facial image, we use the 14 × 14 × 512 feature maps of the fourth pooling layer as the image feature.
For the feature refinement stage, we use the squeeze-and-excitation (SE) block [29] to refine the CNN features and highlight the expression-relevant regions, explicitly modeling the interdependencies between channels by adaptively recalibrating each channel's feature response. The detailed structure can be seen in Figure 4, where c is a scaling parameter (16 in this paper) whose purpose is to reduce the number of channels and thus the computation, C represents the number of channels, and H and W represent the height and width of the feature map. In essence, the SE module performs an attention or gating operation on the channel dimension. This attention mechanism allows the model to focus on the most informative channel features and suppress unimportant ones. The SE block is followed by three convolutions with kernel size 3 × 3; we use ReLU [20] as the activation function for each convolution layer, together with two max pooling layers. Then, as in the baseline [25], we apply covariance pooling after the last convolutional layer and before the fully connected layers. In the final classification part, the total training loss of our network architecture is formulated as

$$L = L_s + \lambda L_{DL},$$

where $L_s$ denotes the softmax loss and $L_{DL}$ denotes the distance loss. The hyperparameter $\lambda$ is used to balance the two loss functions.
Table 3 shows the results of the first experiment. The second experiment demonstrates the effectiveness of the proposed method VFESO-DLSE. We compare our method against four architectures: DLP [31], the baseline [25], baseline + distance loss (SO-DL), and baseline + distance loss + SE block (SO-DLSE) (the structure can be seen in Figure 5). It should be noted that since our baseline network is based on the model from [31], we trained and tested these models from scratch on our own BabyExp dataset for better comparison.
As in [25], we train the network with the center loss [32] in all cases rather than the locality preserving loss [31], because we do not deal with compound emotions. Table 4 shows the results of this experiment. To measure performance objectively, the BabyExp dataset is divided into training and test sets: the test set contains 2,413 images, and the remaining 9,640 images are used as the training set. The images are then resized to a fixed size of 100 × 100 and sent to the CNN classifier for expression recognition. It should be noted that images are resized to 224 × 224 only when they are fed to the VFESO-DLSE method.
The labeled facial expression dataset is quite small; thus, we use conventional data augmentation to generate more training data. In the data augmentation stage, we augment the set of training images in BabyExp by random flipping, rotation by ±10°, and random cropping. We then train our networks for 700 epochs with the following parameters: learning rate 0.0001-0.005, weight decay 0.05, momentum 0.9, batch size 128, and linear learning rate decay in the Adaptive Moment Estimation (Adam) optimizer. To better measure the usability of the BabyExp dataset and the accuracy of the results, we report total accuracy and per-class precision, recall, and F1-measure as the evaluation metrics.
The last experiment examines the results obtained when the data are not balanced by CycleGAN. Table 5 shows the results of this experiment. The original dataset contains 7,600 pictures, including 2,502 happy images, 4,028 normal images, and 1,070 sad images. To measure performance objectively, it is divided into training and test sets: the test set contains 1,522 images, and the remaining 6,078 images are used as the training set. We choose the two methods with the best results in the second experiment: SO-DLSE and VFESO-DLSE. Experimental settings, parameter settings, and the number of iterations are the same as those in the second experiment.
Table 3 shows the experimental results of adult expression recognition models trained on the adult dataset and tested on the adult and BabyExp datasets. As we can see, the performance of these methods on BabyExp is significantly lower than that on the adult dataset SFEW2.0 (54.45% on SFEW2.0 vs. 39.7% on BabyExp, and 58.14% on SFEW2.0 vs. 40.78% on BabyExp), indicating that baby faces differ greatly from adult faces and that it is important to develop facial expression recognition approaches specifically for baby images. The overall expression recognition performance of the different experiments trained from scratch on the BabyExp dataset is shown in Table 4.
From the results, we make the following observations. First, the accuracy of the DLP and baseline methods improves greatly when they are trained from scratch and tested on the BabyExp dataset rather than trained on the adult dataset SFEW2.0 (39.7% to 65.02% and 40.78% to 79.57%, respectively), once again indicating that baby faces differ greatly from adult faces. Second, our proposed method VFESO-DLSE achieves the best result, 87.90%, which is about 4.8 percentage points higher than SO-DLSE, suggesting that VGG16 extracts features better than the other CNN methods. From the results of the baseline, SO-DL, and SO-DLSE, we can see that the distance loss and the SE block achieve an improvement of about 1.8%. The purpose of the distance loss is to learn lower variation within the same class and larger distances between different classes, and the SE block automatically learns the importance of each feature channel. Third, the recall, precision, and F1-measure further confirm the reliability of our results and the validity of our method.
The expression recognition performance on the original data, which are not balanced by CycleGAN, can be seen in Table 5. We make two observations. First, the two methods, SO-DLSE and VFESO-DLSE, achieve 58.61% and 74.24% on the original data, both lower than the 83.13% and 87.90% obtained on BabyExp balanced by CycleGAN (Table 4). Second, despite these overall accuracies, the recall and F1-measure are not very high, especially for the sad expression. This is because the distribution of expressions is unbalanced, and models trained on imbalanced original facial expression samples generalize poorly and are prone to overfitting.
With the SO-DLSE method, the recall, precision, and F1-score for sad expressions are all 0, whereas the VFESO-DLSE method obtains 38.79%, 76.14%, and 51.39% in recall, precision, and F1-score, respectively. On the one hand, this again suggests that VGG16 extracts features better than the other CNN methods. On the other hand, it shows that data augmentation is needed to promote data balance and facilitate the use of deep learning methods, which validates the importance of CycleGAN for data balancing. This conclusion can also be drawn from the experimental results in Table 4.
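The per-class metrics used above can be computed as in the following minimal sketch. The function names are illustrative, and returning 0 when a denominator is empty is an assumed convention; it is consistent with a class that is never predicted correctly scoring 0 on all three metrics, as reported for the sad class under SO-DLSE.

```python
from typing import Dict, Sequence, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0 when both are 0."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def per_class_metrics(y_true: Sequence[str], y_pred: Sequence[str],
                      labels: Sequence[str]) -> Dict[str, Tuple[float, float, float]]:
    """Return (precision, recall, F1) for each label."""
    result = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention for 0/0
        recall = tp / (tp + fn) if tp + fn else 0.0
        result[label] = (precision, recall, f1_score(precision, recall))
    return result
```

For example, `f1_score(0.7614, 0.3879)` evaluates to roughly 0.514, which agrees with the reported 51.39% up to rounding of the inputs.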

Discussion
Facial expression recognition (FER) has always been a challenging topic in computer vision. Researchers usually aim to build a system that can identify different expressions in the images automatically [33]. Research on facial expression recognition relies heavily on an adequate dataset of facial expressions. However, due to the inherent nature of facial expressions and the difficulty of obtaining them, there are currently only a limited number of publicly available databases, which provide a sufficient number of facial images and are tagged with accurate facial expression information. Table 1 shows the summary of the existing image databases with the number of images, number of subjects, expression distribution, data size, and released years.
However, these datasets have several limitations. Most of the existing works and datasets [7,8] focus on adult faces and ignore how to analyze facial expressions from baby face images. Recently, researchers have recognized the importance of children's facial expressions for studying how the interpretation of expressions develops. For example, the new NIMH Children's Emotional Face Picture Collection (NIMH-ChEFS) contains photos of children aged 10-17 [12], the Radboud Faces Database includes photos of 8- to 12-year-olds [13], and the CAFE set features photographs of 2- to 8-year-old children [14]. Although these new datasets give researchers the option to use a sample of children aged 2-17 years, no dataset to date includes younger children. Moreover, all the children's facial expression datasets mentioned above contain only a small number of images, which makes them unsuitable for training CNN models. In addition, these datasets contain posed expressions captured in a lab-controlled environment rather than spontaneous or natural facial expressions.

Conclusions
In this paper, to address the aforementioned issues, we propose a new image dataset with expression labels of baby faces for automatic facial expression recognition. Our dataset, which we call the BabyExp dataset, contains more than 12,000 images from babies younger than two years old showing spontaneous expressions in an uncontrolled environment. Each face image is annotated with one of three facial expressions (i.e., happy, sad, and normal). It is complementary to the existing adult face dataset and can shed some light on exploring baby face analysis, and it will enable the academic research community to study baby faces in a manner comparable to the vast literature that relies heavily on adult faces.
As a result, we expect our novel dataset to become an important milestone for human expression researchers and an important resource for the computer vision community to benchmark and compare results. We further evaluate state-of-the-art adult face analysis methods on BabyExp; the results indicate that adult facial expression recognition methods are not suitable for baby facial expression recognition and that new methods need to be developed for babies. Besides, we have proposed a deep learning baseline for automatic baby expression recognition. We conduct several experiments and report the baseline performance on the BabyExp dataset. The proposed CNN architecture achieves an average classification accuracy of 87.90% on the BabyExp dataset. The performance of adult-trained methods on the BabyExp dataset is significantly lower than that on the other datasets, indicating that baby face images differ greatly from adult faces and that it is important for the community to develop facial expression recognition approaches for babies.
We hope that the release of the BabyExp dataset will encourage more research on real-world children's expression recognition and that it will be a useful benchmark resource for researchers to validate their facial expression analysis algorithms under challenging conditions. We will collect more data and assign more specific facial expression labels (e.g., crying and laughing) to each image in order to extend the dataset. We will also continue to explore methods that achieve better performance for baby facial expression recognition in the future.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.