Landmark-Based Facial Feature Construction and Action Unit Intensity Prediction

Human face recognition has been used in many fields, including biorobots, driver fatigue monitoring, and polygraph tests. However, the end-to-end models fitted by most existing algorithms are poorly interpretable because complex classifiers are constructed directly from facial images. In addition, some models do not fully consider the dynamic characteristics of individual subjects, so dynamic information is not extracted. To solve these problems, this paper proposes an action unit intensity prediction model. The three-dimensional coordinates of 68 facial landmarks are obtained with the convolutional experts constrained local model (CE-CLM), which enables the construction of dynamic facial features. Based on an error analysis of the CE-CLM algorithm, the dimension of the constructed features is reduced by principal component analysis (PCA). A radial basis function (RBF) neural network is then constructed to train the action unit prediction models. The proposed method is verified experimentally, and its overall mean square error (MSE) is 0.01826. Lastly, the network construction process is optimized so that, for the same training samples, the models are fitted in fewer iterations; the number of iterations decreases by 27 on average. In summary, this paper provides a method to rapidly construct action unit (AU) intensity prediction models and constructs automatic AU intensity estimation models for facial images.


Introduction
The human face is a carrier of multiple information types. It not only directly shows personal information (e.g., age, gender, and race) but also indirectly conveys various emotions (e.g., pleasure, anger, sorrow, and joy). Moreover, the facial expression of emotion is one of the important channels of interpersonal communication. According to Mehrabian et al. [1], in human interactions, 55% of the information is conveyed by facial expressions. The facial action coding system (FACS) has been widely used in research on facial information. It was first developed by Ekman and Friesen in 1975 [2] and then improved in 2002 [3]. In the FACS, a unique set of basic facial muscle actions is defined and denoted as the action unit (AU). The FACS involves 44 visible AUs linked to facial muscular movements [4], including 33 independent AUs and 11 additional nonstrictly defined action descriptors (ADs) [5], which are given in Tables 1 and 2, respectively. The FACS is comprehensive and objective. Any facial expression of emotion is generated by activating one or more groups of facial muscles, and each possible facial expression can be represented as a combination of different AUs. Due to the complexity of the FACS coding rules, AU annotation can be completed only by FACS-certified coders, who must be professionally and strictly trained for at least 100 hours [4]. In addition, manual coding of facial images is a tedious and time-consuming process; for instance, even a well-trained coder needs about one hour to complete the facial coding of a one-minute video clip [6].
To solve the problem of time-consuming and inefficient facial coding, researchers have considered using computer technology to automate the coding process. This study aims to realize automatic AU intensity prediction by constructing facial landmark-based features, thereby improving the accuracy and efficiency of facial coding. The rest of this article is organized as follows. In Section 2, the related literature is reviewed, and the existing AU intensity estimation algorithms are analyzed. In Section 3, feature construction and dimension reduction of the AU sample library are conducted by facial feature analysis, based on which a neural network model for AU intensity estimation is constructed and trained. In Section 4, the trained network model is optimized based on the experimental results to reduce training time. Lastly, in Section 5, the main conclusions are stated, and future research directions are presented.

Literature Review
Remarkable progress has been made in the field of automatic facial expression recognition over the past two decades [7, 8]. According to previous studies, the two main tasks in the facial expression recognition process are facial information description and machine learning model design [9, 10]. The purpose of facial information description is to obtain a set of features from the original facial images such that the features of different faces expressing the same information are more similar than those of the same face expressing different information.
Facial landmark detection is the basic way to describe facial information.
There are many well-developed algorithms for facial landmark detection. The active shape model (ASM) [11], in which the range of landmarks is constrained based on the Mahalanobis distance, is the earliest landmark model construction algorithm. It not only contributes to facial landmark detection but also applies to gesture and body movement detection. By adding texture information to the ASM during shape feature statistics, the active appearance model (AAM) [12] locates landmarks more accurately based on shape features, but its real-time performance has been unsatisfactory. In contrast, the constrained local model (CLM) [13] strikes a balance between accuracy and real-time performance by exploiting the advantages and disadvantages of the ASM and AAM. It achieves higher computing efficiency by replacing the global texture features of the AAM with texture features near the landmarks of an average face. This enhancement of computing efficiency has come with increasingly complex machine learning models. Mahoor et al. [14] classified combined AUs using a sparse representation (SR) classifier. Both Wang and Lien [15] and Valstar and Pantic [16] constructed AU recognition models using the hidden Markov model (HMM). Kaltwang et al. [17] used relevance vector regression to predict AU intensity. The support vector machine (SVM) was used by Mahoor et al. [18] and Valstar et al. [19] for emotion recognition, and Savran et al. [4] predicted AU intensity using support vector regression (SVR). Recently, neural networks have become dominant in model construction; deep neural networks were used by Liu et al. [20], Gudi et al. [21], and Zhao et al. [22] to build AU recognition models.
Hitherto, most studies on automated AU analysis have focused on AU recognition, including simple AU recognition [23], combined AU recognition [24], AU-emotion coupled recognition [25], and AU dependence recognition [26]. However, little attention has been paid to AU intensity prediction, which is a more challenging task than AU recognition [27]. Due to the complexity of AU detection, including the large number of categories, the subtlety of the models, and the minor differences between AUs, automatic facial AU analysis remains an open challenge in both AU recognition and AU intensity prediction [28].

[Table 1: List of AUs, giving each AU number and its FACS name, e.g., 1 Inner brow raiser, 2 Outer brow raiser, 4 Brow lowerer, 5 Upper lid raiser, 6 Cheek raiser, 7 Lid tightener, ..., 37 Lip wipe. AU, action unit; FACS, facial action coding system.]

Facial Landmark Localization
Method. Facial muscles contract and stretch when humans convey emotions through facial expressions, causing the corresponding facial regions to change in shape or location. Therefore, facial images contain a large amount of emotional information. In this study, the emotional information contained in facial images is identified based on changes in the locations of facial landmarks. The convolutional experts constrained local model (CE-CLM) [29] is used to recognize the three-dimensional information of facial landmarks. By combining the advantages of the convolutional neural network and the original CLM algorithm, the CE-CLM achieves high robustness against interference factors such as sunlight, angle, and occlusion. The input is a facial image, in which the human face is located by a face detection algorithm. Next, three-dimensional facial alignment is conducted using the trained average face model, and the exact location of each of the 68 landmarks is determined using the local constraint algorithm. Lastly, the three-dimensional coordinates of all 68 facial landmarks are obtained. As shown in Figure 1, the coordinate origin is the center of the camera lens. The optical axis of the camera denotes the z-axis, with the direction from the camera toward the face being positive.
The direction perpendicular to the lens center represents the positive direction of the y-axis; the direction from the lens center horizontally to the right represents the positive direction of the x-axis.

Construction of Complete Facial Information Feature.
After the three-dimensional coordinates of the 68 facial landmarks are obtained by the CE-CLM algorithm, the complete facial information features are constructed. The coordinates of points in space can differ significantly depending on the origin and axis directions chosen when building the coordinate system. Therefore, by constructing the features from the relationships between points in space, changes in sample features caused by different coordinate system constructions can be avoided.

Feature Construction Method.
The facial features constructed from the relations between facial landmarks can be divided into the following three groups: the Euclidean distance between two points, the angle formed by a point and two other points, and the perpendicular distance from a point to the line between two other points, as shown in Figure 2. The angle formed by a point and two other points varies only within a small range, and because of the principle of triangle similarity, angles cannot fully express the three-dimensional facial information.
According to the formula for the area of a triangle, the perpendicular distance from a point k to the line between two other points i and j, shown in Figure 3, can be calculated by

h_{k,(i,j)} = 2S_{ijk} / d_{i,j},  (1)

where S_{ijk} is the area of the triangle formed by points i, j, and k, and d_{i,j} is the Euclidean distance between points i and j. According to equation (1), this distance can be derived from the Euclidean distances between pairs of points, but the dimension constructed from the perpendicular distances (C(68,3) = 50116) is far larger than that of the pairwise Euclidean distances (C(68,2) = 2278). From the engineering perspective, the point-to-line distance increases the computational complexity without introducing an effective feature; for this reason, it is not used as a facial feature in this work. As shown in Figure 4, facial features with 2278 dimensions are constructed from the Euclidean distances between the 68 facial landmarks.
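As a concrete sketch, the pairwise-distance construction can be written in a few lines of Python; the function name and the use of NumPy are illustrative assumptions, not the authors' implementation:

```python
import numpy as np
from itertools import combinations

def pairwise_distance_features(landmarks):
    """Build the distance feature vector from 3-D landmarks.

    landmarks: (68, 3) array of (x, y, z) coordinates, e.g. CE-CLM output.
    Returns a (2278,) vector of Euclidean distances, one per landmark pair,
    since C(68, 2) = 2278.
    """
    pairs = combinations(range(len(landmarks)), 2)
    return np.array([np.linalg.norm(landmarks[i] - landmarks[j])
                     for i, j in pairs])
```

The pair order is fixed by `combinations`, so the same feature index always refers to the same landmark pair across images.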

Feature Construction Optimization.
Considering the huge advantages of convolutional neural networks in image processing, researchers have recently emphasized AU recognition based on prior models constructed from static images. However, static image feature-based methods can only obtain representative geometric facial features from images, such as texture, color, and shape. They can present individual differences well but cannot focus on a certain AU across the whole group or summarize the features of AUs. To overcome the individual differences between static images, an individual sample calibration-based facial feature with movement differences is proposed. It can not only eliminate the differences between individuals but also match the AU descriptions in the facial coding system.
In common calibration-based facial action feature construction methods, the difference between the eigenvalues of facial images with and without expressions can be calculated by

Δd_{i,j} = d'_{i,j} − d_{i,j},  (2)

where d_{i,j} denotes the Euclidean distance between landmarks i and j in the neutral image, d'_{i,j} denotes the corresponding distance in the expressive image, and Δd_{i,j} denotes the difference between the two.

Mathematical Problems in Engineering
However, some different Δd_{i,j} values, such as Δd_{51,8} and Δd_{51,57} shown in Figure 5, are similar due to the characteristics of the facial landmark distribution, which reduces the amount of feature information. For this reason, the variation rate of the features between facial images with and without expressions is used for facial feature construction in this study; it is expressed as

r_{i,j} = Δd_{i,j} / d_{i,j} = (d'_{i,j} − d_{i,j}) / d_{i,j}.  (3)
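A minimal sketch of this calibration step follows; the function name and the small epsilon guard against zero distances are our own assumptions:

```python
import numpy as np

def variation_rate_features(d_expr, d_neutral, eps=1e-8):
    """Rate of change of pairwise-distance features vs. the neutral face.

    d_expr, d_neutral: (2278,) distance feature vectors of the expressive
    and the neutral image of the same subject. Dividing by the neutral
    distance cancels per-subject face size, matching the calibration idea.
    """
    return (d_expr - d_neutral) / (d_neutral + eps)
```

Because the rate is relative, a subject with a larger face does not automatically produce larger feature values than one with a smaller face.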

Feature Dimension
Reduction. Facial features with 2278 dimensions are constructed from the 68 landmarks to obtain complete facial information. From an intuitive perspective, not all of these features are valid for a particular AU; in other words, some features are invalid for particular AUs. Therefore, feature dimension reduction is necessary for each AU. Extracting only the valid features substantially enhances the construction efficiency.

Dimension Reduction Methods.
Traditional feature dimension reduction methods mainly depend on a subjective understanding of the existing features. The most representative features are determined based on the judgment of several authoritative experts on each feature, which can be practical for a low feature dimension. However, the workload of the experts increases with the feature dimension, which makes this approach impractical here. Currently, there are several methods for feature dimension reduction, including the genetic algorithm, Fisher's linear discriminant (FLD), maximum relevance and minimum redundancy (mRMR), and principal component analysis (PCA).
As a common optimization algorithm, the genetic algorithm simulates natural survival of the fittest. It can generate and select the most appropriate features based on particular rules. However, its parameters and the way new features are formed depend on manual setup, which, together with the lack of a theoretical basis for certain features, makes it unsuitable for extensive application. The FLD is a widely used dimension reduction method in which samples are projected onto a straight line, i.e., the data are projected into a one-dimensional space. It divides features into valid and invalid ones, but valid yet linearly correlated features cannot be removed. The purpose of the mRMR is to minimize the correlation between the selected features while maximizing the correlation between the selected features and the target variable. It is an efficient feature selection method, but it does not consider forming new features by combining existing ones.

Principal Component Analysis.
The steps of the PCA are as follows: gather a dataset, select the direction with the greatest variance and set it as a new axis, examine the variation in the remaining data, and find an axis that is perpendicular to the first one and covers as much of the remaining variance as possible. These steps are repeated until all the possible coordinate axes are identified. In this way, the new variables denote the axes of a rectangular coordinate system, and the covariance matrix becomes diagonal; namely, each new variable is correlated only with itself and not with any other variable.
Essentially, the PCA rebuilds a new coordinate system without changing the relationships between data points, which ensures that the original data can be converted without any information loss. Thus, dimensions with small data variation can be removed directly without substantially affecting data variability. In Figure 6, XY represents the original coordinate system and X'Y' represents the coordinate system after the PCA conversion. However, there is no uniform standard for the thresholds on which the dimension removal procedure relies.
In the PCA-based dimension reduction method, an eigenvalue threshold must be set, and dimensions corresponding to eigenvalues smaller than the threshold are removed. Because the PCA algorithm is applied to data of different forms, there is no uniform way to set the eigenvalue threshold. For optimal dimension reduction, it is necessary to analyze the errors introduced by dynamic video acquisition and algorithm processing and to set a reasonable threshold accordingly. The specific method is as follows. Videos of subjects at complete rest were acquired to estimate the statistical error. The subjects were required not to make any facial expression during data acquisition, including blinking, so each acquisition was kept short. To ensure the quality of the statistical data, a one-minute video of each subject in the neutral state was acquired, and a total of 18 clips in which the subjects did not blink were obtained. The videos were acquired at 30 fps with a resolution of 1280 × 960 using a Microsoft Kinect sensor, yielding 18 groups of image data containing 1800 images each.
First, the data of each image were processed and converted into features with 2278 dimensions using the aforementioned algorithm. For each frame of features, the variance corresponding to each dimension of features from the start frame to the current frame was calculated. In the end, the variances of all the dimensions were averaged to obtain the average feature variance of each frame of data, as shown in Figure 7. By using 100 frames as a step size, the information entropy of the data within every 100 frames was calculated, and the results are shown in Figure 8. As presented in Figure 8, the information entropy started to stabilize after 1500 frames, suggesting that the number of 1500 frames was the minimum requirement for error statistics in order to ensure high stability of data error. In this study, the average feature variance of 1800 frames of data was 1.16.
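The error statistics described above can be sketched roughly as follows; the exact windowing and entropy estimator used by the authors are not specified, so the histogram-based Shannon entropy below is an assumption:

```python
import numpy as np

def mean_feature_variance(features):
    """Average per-dimension variance of a neutral-face clip.

    features: (n_frames, n_dims) array of distance features. The variance
    of each dimension over all frames is averaged across dimensions,
    giving one scalar noise estimate per clip (1.16 in the paper's data).
    """
    return features.var(axis=0).mean()

def windowed_entropy(values, window=100, bins=20):
    """Shannon entropy of the value distribution in consecutive windows."""
    out = []
    for start in range(0, len(values) - window + 1, window):
        hist, _ = np.histogram(values[start:start + window], bins=bins)
        p = hist / hist.sum()
        p = p[p > 0]                       # 0 * log(0) is taken as 0
        out.append(-(p * np.log2(p)).sum())
    return np.array(out)
```

Tracking the windowed entropy until it stabilizes mirrors the paper's check that 1500 frames suffice for a stable error estimate.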
According to the mathematical derivation of the PCA algorithm, each eigenvalue equals the variance of the corresponding dimension after the data rotation. If the variance in a certain dimension is smaller than the average variance observed without facial expression changes, that dimension contains little information, and removing it will not impact the overall information content. Therefore, the PCA segmentation threshold is set to 1.16 in this work.
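A sketch of threshold-based PCA reduction, assuming the noise-variance threshold of 1.16 derived above (the implementation details are our own, not the authors'):

```python
import numpy as np

def pca_reduce(X, eig_threshold=1.16):
    """PCA keeping only components whose eigenvalue (variance along the
    rotated axis) exceeds the neutral-face noise threshold.

    X: (n_samples, n_features). Returns the projected data and the basis.
    """
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)     # ascending eigenvalue order
    keep = eigvals > eig_threshold
    basis = eigvecs[:, keep][:, ::-1]          # largest-variance axes first
    return Xc @ basis, basis
```

Components whose variance falls below the neutral-face noise floor are treated as measurement error and dropped, exactly as the threshold argument above suggests.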

Sample-Tag Relationship.
When the sample data are preprocessed, the sample features after dimension reduction are obtained. Additionally, the sample data contain the AU intensity tags of all images. These tags are calibrated by FACS-certified experts after careful long-term image observation. The FACS is an objective and comprehensive system constructed based on experimental psychological research; it aims to provide observers with an objective method for measuring facial actions. In behavioristics, the FACS is the most widely accepted measure of facial emotions, so these AU intensity tags are the most recognized description of subtle facial expressions. There are two difficulties in constructing an automatic AU intensity prediction model from the sample features and their AU tags: (1) The FACS definitions of the intensity levels are shown in Figure 9. Levels A, B, C, D, and E represent a faint sign of the action, a slight but inconspicuous action, an obviously present action, a drastic action, and an action that has reached its limit, respectively. Each intensity level involves a series of appearance changes, so it can be inferred from the FACS definitions that the relationship between the intensity level and the range of facial actions can be nonlinear. (2) The AU intensity tags of the sample database, which represent expert annotations of the subjects' facial expressions determined by the FACS, are generally acceptable despite their subjectivity; thus, they are taken as the true values of the sample data output.
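For regression, the five ordinal FACS levels must be encoded numerically. The 1-5 mapping and the tag format below are illustrative assumptions, since FACS itself only defines the ordered categories A through E:

```python
# Hypothetical numeric encoding of the five FACS intensity levels.
INTENSITY_LEVELS = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5}

def encode_intensity(tag):
    """Map an AU annotation like 'AU12C' to (au_number, numeric_intensity).

    An unlettered tag (AU present but intensity unscored) maps to None,
    so such samples can be filtered out before regression.
    """
    head = tag[2:]                          # strip the 'AU' prefix
    if head[-1] in INTENSITY_LEVELS:
        return int(head[:-1]), INTENSITY_LEVELS[head[-1]]
    return int(head), None
```

Note that an ordinal 1-5 encoding does not assume equal spacing between levels; the nonlinearity noted in point (1) is left for the regression model to absorb.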

Radial Basis Function Neural Network.
To solve the aforementioned problems, radial basis function (RBF) neural network regression is used for AU intensity prediction. An RBF is a real-valued function whose value depends only on the distance from a sample to a point w in space, namely, g(x, w) = g(‖x − w‖). Any function g that satisfies g(x) = g(‖x‖) can be called an RBF. Since the Euclidean distance is the most commonly used distance measure in RBFs, the RBF is also known as the Euclidean radial basis function.
In this study, the RBF is a Gaussian kernel function that can be expressed as g(x, w, σ) = exp(−‖x − w‖² / (2σ²)), where x denotes the sample input, w denotes the center of the kernel function, and σ denotes the width parameter that controls the radial range of action of the function. The RBF neural network is a three-layer network consisting of an input layer, a hidden layer, and an output layer. The conversion from the input layer to the hidden layer is nonlinear, whereas that from the hidden layer to the output layer is linear. The specific structure of the RBF neural network is shown in Figure 10. The basic idea of the RBF network is to use the RBF as the activation function of the hidden-layer neurons so that the input vector can be mapped directly without weighted connections; once the center of the RBF is determined, the mapping relationship is also determined. The mapping from the hidden layer to the output layer is linear, meaning that the network output is a linear weighted sum of the hidden-layer outputs, where the weights are the adjustable network parameters. Thus, although the mapping from the network input to the network output is nonlinear, the mapping from the network output to the adjustable parameters is linear, so the network weights can be solved directly from a system of linear equations. This speeds up the learning process and avoids the problem of local minima. These features of the RBF network are ideal for the application considered in this study: the relationship between sample features and sample tags can be properly handled, and the correspondence between the features of the training samples and their tags can be effectively regressed. Therefore, RBF neural network regression is performed to fit the AU intensity prediction model, with the mean square error (MSE) used as the cost function during model training and as the final evaluation indicator of the model regression.
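The closed-form training idea above can be sketched as a minimal RBF regressor. The fixed center set, the bias column, and the least-squares solve are our own simplifications of the structure described in the text, not the paper's exact training procedure:

```python
import numpy as np

def gaussian_rbf(X, centers, sigma):
    # (n, d) inputs against (k, d) centers -> (n, k) hidden activations
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

class RBFNetwork:
    """Three-layer RBF regressor: Gaussian hidden layer, linear output
    layer solved in closed form by least squares (no local minima)."""

    def __init__(self, centers, sigma):
        self.centers = np.asarray(centers, dtype=float)
        self.sigma = float(sigma)
        self.w = None

    def fit(self, X, y):
        H = gaussian_rbf(np.asarray(X, float), self.centers, self.sigma)
        H = np.hstack([H, np.ones((len(H), 1))])       # bias column
        self.w, *_ = np.linalg.lstsq(H, np.asarray(y, float), rcond=None)
        return self

    def predict(self, X):
        H = gaussian_rbf(np.asarray(X, float), self.centers, self.sigma)
        H = np.hstack([H, np.ones((len(H), 1))])
        return H @ self.w
```

Because only the output weights are learned, fitting reduces to one linear solve, which is the speed advantage the text attributes to the RBF architecture.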
The regression error is quantified by calculating the MSE between the true and predicted values, based on which the model parameters are optimized to ensure model validity. The MSE is calculated by

MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²,

where y_i denotes the AU intensity tag of the ith sample and ŷ_i denotes the predicted value of the ith sample.

Database Introduction.
The Bosphorus database was developed by Bosphorus University in Turkey; it includes 105 subjects (60 males and 45 females), of whom 29 are professional actors/actresses, and consists of 4652 facial images. The images were marked by FACS-certified experts with the tags AU1, AU2, AU4, AU9, AU10, AU12, AU14, AU15, AU16, AU17, AU18, AU20, AU22, AU23, AU24, AU25, AU26, AU27, AU28, AU34, AU43, and AU44. This database is one of the AU-tagged databases with the largest sample size. The CK+ database was developed by Jeffrey F. Cohn and Takeo Kanade; it includes 210 subjects (145 females and 65 males) and consists of 2105 facial images marked by FACS-certified experts with AU tags. However, this database was intended to present the six basic emotions of the subjects: it contains many different AU tags, but the number of samples containing a single AU is small. According to the statistics, the number of valid tags for each AU is smaller than 30. Since it would be difficult to fit an effective model with so few samples per AU tag, the two databases were combined to expand the sample size.

Dataset Construction.
A sample library was constructed for each AU in this study. Take AU1 as an example: the images with AU1 tags and the images of subjects without facial expressions were extracted from the two databases. Next, the aforementioned sample feature construction was performed to obtain the sample features and the corresponding tag data. Lastly, all the data with AU1 tags were used to form a complete dataset containing the features and AU1 tags.

Evaluation Criterion.
To evaluate the model regression effect more reliably, five-fold cross-validation was introduced to measure the regression effect of the final model. The dataset was randomly divided into five equal parts, of which one was used as the validation set while the remaining parts were used as training sets. The training was repeated five times, and the MSE values of the five validation sets were averaged and taken as the model evaluation indicator.
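The five-fold procedure can be sketched as follows; the shuffling seed and the `fit_predict` callback interface are assumptions for illustration:

```python
import numpy as np

def five_fold_mse(X, y, fit_predict, k=5, seed=0):
    """Average validation MSE over k random folds.

    fit_predict(X_train, y_train, X_val) is any routine that trains a model
    and returns predictions on X_val (e.g. an RBF regression).
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)
    mses = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        pred = fit_predict(X[train], y[train], X[val])
        mses.append(np.mean((y[val] - pred) ** 2))
    return float(np.mean(mses))
```

Every sample is used for validation exactly once, so the averaged MSE is less sensitive to any particular train/validation split.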
In addition, the correlation coefficient (CORR) between the sample tags and the predicted values was introduced as another indicator of model regression quality; it was calculated by

CORR = cov(y, ŷ) / (σ_y · σ_ŷ),

i.e., the Pearson correlation coefficient between the tag vector y and the prediction vector ŷ, where cov denotes the covariance and σ the standard deviation.

Fitting Results.
In this study, the AU intensities predicted by the proposed method were compared with those of the radial basis neural network (RBNN) without feature dimension reduction, the backpropagation neural network (BPNN), and the support vector regression (SVR) algorithm [25]; the results are presented in Table 3.
The MSE shows the difference between the estimated and true values: a lower MSE represents a more accurate prediction. In addition, a correlation coefficient closer to 1 represents a stronger correlation. As shown in Table 3, the MSE of the proposed method was smaller than the MSEs of the three other methods for all AUs, and the CORR values were all larger than 0.98, indicating a high correlation between the predicted and true values.

Training Process Analysis.
When constructing the radial basis neural network (RBNN) using the existing algorithms, neurons are added with a uniform step size. To minimize the neural network structure, neurons were added with a step size of one; in this way, no more neurons are added once the model converges, i.e., once the MSE stops decreasing during model training.
The number of neurons and the MSE of each AU model in the RBNN were recorded during training. In Figure 11, for each subplot, the abscissa denotes the number of iterations during the training process and the ordinate denotes the MSE at the corresponding number of iterations. As shown in Figure 11, the MSE decreased until it converged during all AU model training processes. The MSE showed two different local downtrends: a concave downtrend followed by a convex downtrend, after which it converged.
In the proposed RBNN construction method, neurons are added with a dynamic step size. Specifically, the most recent MSE values are calculated during network construction, and the concavity or convexity of the MSE curve is identified. If the MSE curve is concave, the step size of neuron addition is increased; if it is convex, the step size is reset to one. The pseudocode is given in Algorithm 1.
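Since Algorithm 1 is given only in outline, the dynamic step-size idea can be sketched as follows; the window length, the second-difference test for convexity, and the callback interface are our own assumptions:

```python
def train_rbnn_dynamic(add_neurons, mse_of, max_neurons=500, window=5):
    """Grow an RBF network with a dynamic step size.

    add_neurons(n) is assumed to add n hidden neurons and retrain the
    output weights; mse_of() returns the current training MSE. While the
    recent MSE curve is concave (still falling fast), the step grows;
    once it turns convex (flattening toward convergence), the step
    resets to one so convergence is not overshot.
    """
    history, step, total = [], 1, 0
    while total < max_neurons:
        add_neurons(step)
        total += step
        history.append(mse_of())
        if len(history) >= 2 and history[-1] >= history[-2]:
            break                       # converged: MSE stopped decreasing
        if len(history) >= window:
            recent = history[-window:]
            # positive second differences on average -> convex curve
            d2 = [recent[i + 2] - 2 * recent[i + 1] + recent[i]
                  for i in range(window - 2)]
            step = 1 if sum(d2) > 0 else step + 1
        else:
            step += 1
    return total, history
```

During the concave phase several neurons are added per fit, so far fewer model fittings are needed than with a fixed step of one, which is the saving reported in Table 4.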
A comparison of the model training times before and after the algorithm improvement is given in Table 4, which shows that both the number of model trainings and the model fitting time decreased after the model was constructed by dynamic neuron addition.

Conclusions
This paper proposed an AU intensity prediction model. First, the three-dimensional coordinates of 68 facial landmarks in the image set are calculated using the CE-CLM algorithm. Next, the variation rate between the Euclidean distances of images with AUs and those without facial expressions is calculated to obtain the variations of facial features. Then, the variations of facial features in expressionless videos are used to estimate the range of the feature error of the CE-CLM algorithm, and a reasonable threshold is determined for the PCA algorithm to eliminate the error while retaining as much information as possible. Lastly, an RBNN is constructed to fit the model for each AU. The fitting algorithm is optimized by observing the fitting process, thereby reducing the number of fittings and shortening the model training time.
To decrease the detection error, further optimization of the CE-CLM algorithm is necessary, which will be part of our future work. In addition, the arrangement of the 68 landmarks should be optimized to achieve more efficient model detection.

Data Availability
All data, models, or code that support the findings of this study are available from the corresponding author upon reasonable request.

Conflicts of Interest
The authors declare that they have no conflicts of interest regarding the publication of this paper.