Improved Facial Expression Recognition Method Based on GAN

Recognizing facial expressions accurately and effectively is of great significance to medical and other fields. To address the low recognition accuracy of traditional methods, an improved facial expression recognition method is proposed. The proposed method conducts continuous adversarial training between the discriminator and the generator of a generative adversarial network (GAN) to enhance the extraction of image features from the dataset under test. High-accuracy recognition of facial expressions is then realized. To reduce the amount of calculation, the GAN generator is improved based on the idea of the residual network. The image is first reduced in dimension and then processed, which preserves the high accuracy of the recognition method and improves real-time performance. The experimental part of the paper uses the JAFFE, CK+, and FER2013 datasets for simulation verification. The proposed recognition method shows clear advantages on datasets of different sizes. The average recognition accuracy rates are 96.6%, 95.6%, and 72.8%, respectively, which indicates that the proposed method has generalization ability.


Introduction
Recognizing facial expressions can provide a more comprehensive understanding of people's inner world [1]. It has many applications in medicine, transportation, culture, and education [2][3][4][5]. Therefore, the recognition and analysis of facial expressions have important research significance and value.
At present, researchers have studied image-based expression recognition, whose purpose is to accurately classify and recognize seven basic emotional expressions in facial images [6,7]: anger, disgust, fear, happiness, sadness, surprise, and neutrality.
Traditional facial expression feature extraction uses mathematical methods to calculate and process facial expression images. It can be divided into two cases: processing static images and processing dynamic images. Statistical methods, the Gabor wavelet, and the local binary pattern method belong to the feature extraction of static images [8,9]. The geometric method, the optical flow method, and the model method belong to the feature extraction of dynamic images [10][11][12]. However, image acquisition is diverse and complex, and traditional facial expression recognition methods face the problem of nonlinear uncertainty in the sample data. The features selected in facial expression feature extraction have limited representation ability [13] and need to be extracted manually according to people's experience. These problems have a great impact on the recognition accuracy of the model and result in poor generalization ability.
In essence, facial expression recognition research optimizes and analyzes massive data [14,15]. Benefiting from the development of artificial intelligence and big data technology, deep network models can extract effective image features from massive multidimensional image data through continuous iterative learning of multilayer networks. Owing to this strong learning ability, they can classify facial expressions more accurately and quickly than traditional facial expression recognition methods [16][17][18][19]. Reference [20] analyzed the facial expression information of time series with a part-based hierarchical bidirectional recurrent neural network and extracted facial temporal features from the dataset, which allows a comprehensive analysis of facial expressions. Reference [21] proposed a method based on the fusion of a deep belief network (DBN) and local features. This method extracts the eyebrows, eyes, and mouth, which carry rich expression information, as local expression images. It also combines the Log-Gabor feature carrying texture information with the second-order histogram of the gradient direction feature carrying shape information to realize facial expression recognition. Reference [22] combined spatio-temporal features and used deep residual networks to extract features. Reference [23] used three channels to extract features of expression images separately. The extracted features are then concatenated and sent to the next layer for processing.
Building on the previous work, the network designed in this paper has the following main innovations: (1) Through continuous adversarial training of the generator and discriminator in the GAN, deep extraction of the features of the processed expression dataset is realized, thereby improving the performance of facial expression classification and recognition. (2) The performance and speed of the network are improved. Adding the idea of the residual network to the generator network improves operation efficiency while ensuring accuracy. The rest of the paper is organized as follows. The second section introduces related theoretical methods. The third section introduces the improved network and explains the structure of its discriminator and generator. The fourth section presents the experimental results and analysis based on different datasets. The fifth section is the conclusion.

General Steps.
Face recognition technology includes four steps: face detection, face alignment, face representation, and face matching, as shown in Figure 1. The face detection module detects the position of the face in the input image. The face alignment module automatically locates key points of the face in the input, such as the eyebrows, eyes, mouth corners, nose tip, and contour points. Face representation takes the face located by the two steps above and extracts or converts it into a feature vector. In the face matching step, the extracted feature vector is compared with those in the database. Based on the similarity between the two, it is possible to determine whether they belong to the same person in the database. The image needs to be preprocessed to improve the accuracy of judgment [24]. Traditional face key point detection algorithms have a clear architecture and are easy to understand. However, their operation efficiency is not high, and they are not suitable for processing a large number of images.
For face representation, the feature vector often contains information such as the positions of the eyebrows, nose, and eyes, and even additional information such as contour and shape. Classic methods include the HOG method, the Haar wavelet method, and the eigenface method. However, traditional methods work well on frontal faces but are less effective on profile faces [25].
Face matching generally compares the extracted face feature vector with those in the database. If the distance to a stored feature vector is small enough, the corresponding identity information is output. If no face in the database matches, the output is that the face cannot be recognized.
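The matching rule described above can be sketched as a nearest-neighbor search with a rejection threshold. This is a minimal illustration and not the paper's implementation: the Euclidean distance, the function names `euclidean_distance` and `match_face`, the threshold value, and the three-dimensional feature vectors are all assumptions chosen for readability.

```python
import math


def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def match_face(query, database, threshold):
    """Return the identity whose stored vector is closest to `query`,
    or None when no stored vector lies within `threshold`."""
    best_name, best_dist = None, float("inf")
    for name, vector in database.items():
        dist = euclidean_distance(query, vector)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None


# Hypothetical feature vectors, for illustration only.
database = {"alice": [0.1, 0.9, 0.3], "bob": [0.8, 0.2, 0.5]}
print(match_face([0.12, 0.88, 0.31], database, threshold=0.2))  # alice
print(match_face([5.0, 5.0, 5.0], database, threshold=0.2))     # None
```

In practice the threshold trades false matches against false rejections and would be tuned on held-out data.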

Convolutional Neural Network (CNN).
CNNs are good at processing images [26]. Traditional face recognition methods perform poorly in complex scenes. CNN-based deep learning methods can automatically extract features from a large amount of image data and perform well in complex scenes. The CNN model is essentially a deep feedforward model that updates its parameters through backpropagation. To obtain better results, the kernels of the convolutional layers and pooling layers generally need to be designed, and these layers are stacked repeatedly to obtain better image features.

Generative Adversarial Network.
Due to its multilayer structure, the convolutional neural network has many parameters to set, which makes the CNN face recognition training process very fragile [27]. In face recognition research, subtle changes in the CNN structure or small adjustments of parameters lead to deviations in the recognition results.
As a deep learning model that is widely used in current image analysis, GAN can solve the problem of instability in the training process through adversarial learning methods.
A typical GAN consists of two parts, namely, the generator G and the discriminator D. During training, these two subnetworks play a game, as shown in Figure 2.
First, the generated image and the real image are input into the discriminator at the same time, and the discriminator is trained. As training progresses, the images generated by the generator become more and more realistic, and the classification ability of the discriminator gradually improves. Finally, the training process converges: the discriminator cannot tell whether an input image is real or fake, and the generated image is indistinguishable from the real image; that is, the Nash equilibrium is reached.
The training process of the entire game can be described by the following value function $V(D, G)$:

$$\min_G \max_D V(D, G) = E_{x \sim p_{data}(x)}[\log D(x)] + E_{z \sim p_z(z)}[\log(1 - D(G(z)))],$$

where $E_{x \sim p_{data}(x)}$ and $E_{z \sim p_z(z)}$ are expectations, $x$ is the real image, and $z$ is the input of the generator. The generator converts the variable $z$ into the generated image $G(z)$, and $D(G(z))$ is the probability that the generated image is judged to be a real image. The variable $z$ is a sample from the distribution $p_z$. Ideally, the distribution of the generated images should converge to the data distribution $p_{data}$. Practice has shown that, for the generator, maximizing $\log(D(G(z)))$ works better than minimizing $\log(1 - D(G(z)))$.
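Since the GAN has two models, each has its own loss. The loss of the discriminator follows from maximizing the value function $V(D, G)$ with respect to D. When the loss function of the generator is trained, the discriminator is assumed to have the best ability. The part of the value function that does not depend on G is then a constant, so the loss of the generator reduces to the remaining part.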
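The losses discussed above can be sketched numerically from discriminator outputs. The block below assumes the standard GAN formulation, in which the discriminator minimizes the negative of $V(D, G)$ and the generator either minimizes $\log(1 - D(G(z)))$ or, in the non-saturating variant, maximizes $\log(D(G(z)))$. The function names and the sample probabilities are hypothetical and are not taken from the paper.

```python
import math


def value_function(d_real, d_fake):
    """Monte Carlo estimate of V(D, G).

    d_real: values D(x) for real images, each in (0, 1)
    d_fake: values D(G(z)) for generated images, each in (0, 1)
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def discriminator_loss(d_real, d_fake):
    """The discriminator maximizes V, so it minimizes -V."""
    return -value_function(d_real, d_fake)


def generator_loss_saturating(d_fake):
    """Minimize log(1 - D(G(z))), the part of V that depends on G."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)


def generator_loss_non_saturating(d_fake):
    """Maximize log(D(G(z))), written as a loss to minimize."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)


# Hypothetical discriminator outputs early in training: fakes are easy to spot.
d_real, d_fake = [0.9, 0.8], [0.05, 0.1]
print(discriminator_loss(d_real, d_fake))
print(generator_loss_saturating(d_fake))
print(generator_loss_non_saturating(d_fake))
```

When $D(G(z))$ is close to 0, the slope of $\log(1 - p)$ with respect to $p$ is about $-1$, while the slope of $-\log(p)$ is $-1/p$ and therefore much larger in magnitude. This is one common explanation for why the non-saturating form gives the generator a stronger training signal early on.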

Discriminator Network.
For the discriminator network in the GAN, this paper uses the VGG-16 network as the backbone [28]; the network structure is shown in Figure 3. The inputs are $x \in \mathbb{R}^{512 \times 512 \times 3}$ and $y \in \mathbb{R}^{512 \times 512 \times 1}$. When the input is the pair $(x_i, y_i)$, the correct output of the discriminator is 1; when the input is the pair $(x_i, G(x_i))$, the correct output is 0. Leaky-ReLU is used as the nonlinear activation function in each convolutional layer of the discriminator.
First, two convolution-and-pooling stages are applied to the image, each consisting of two convolutions and one max pooling. Then, three more convolution-and-pooling stages are applied, each consisting of three convolutions and one max pooling. Finally, there are three fully connected layers and one Softmax layer. Similar to the traditional generative adversarial network, the discriminator mainly judges the authenticity of its input image. The input image has the same size and dimension as the generated image; both are 3 × 48 × 48. The adversarial loss $E_{adv}$ is defined as follows: where $x_i$ is the real image, $G_f$ is the feature extractor, $\theta_f$ is the parameter of the feature extractor, $G_h$ is the feature synthesizer, $\theta_h$ is the parameter of the feature synthesizer, $G_d$ is the discriminator, $\theta_d$ is the parameter of the discriminator, and $L_d$ is the loss calculation function of the discriminator. Then, the total loss function $E_{total}$ is as follows:
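The stage layout described above (two stages of two convolutions, three stages of three convolutions, then three fully connected layers) can be checked with a few lines of bookkeeping. The sketch below assumes that convolutions preserve spatial size and that each max pooling halves it with floor rounding; these are common VGG-16 conventions rather than details stated in the text, and the names `STAGES` and `spatial_size_after_stages` are hypothetical.

```python
# Number of convolutions in each stage; every stage ends with one max pooling.
STAGES = [2, 2, 3, 3, 3]
NUM_FC_LAYERS = 3


def count_weight_layers(stages=STAGES, fc_layers=NUM_FC_LAYERS):
    """Total number of convolutional plus fully connected layers."""
    return sum(stages) + fc_layers


def spatial_size_after_stages(size, stages=STAGES):
    """Side length of the feature map after each stage.

    Assumes size-preserving convolutions and 2 x 2 max pooling with
    stride 2 and floor rounding (an assumption, not from the paper).
    """
    sizes = []
    for _ in stages:
        size = size // 2
        sizes.append(size)
    return sizes


print(count_weight_layers())           # 16, matching the name VGG-16
print(spatial_size_after_stages(48))   # [24, 12, 6, 3, 1]
print(spatial_size_after_stages(512))  # [256, 128, 64, 32, 16]
```

Under these assumptions a 48 × 48 input shrinks to a single spatial position before the fully connected layers, whereas a 512 × 512 input leaves a 16 × 16 map.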

Generator Network.
The generator network in the GAN uses $x \in \mathbb{R}^{w \times h \times c}$ as the input image, where $w = h = 512$ and $c = 3$. The network structure is shown in Figure 4. Some previous segmentation methods use an encoder-decoder structure [29], which first down-samples and then gradually up-samples. This paper uses a U-shaped structure for the generator. The feature extractor is used to extract the features of the input image. The input image resolution is 3 × 48 × 48, and the backbone network is ResNet-18 [30]. Unlike in the traditional generative adversarial network, the generator input is not random noise but a facial expression image. First, the feature extractor applies a 3 × 3 convolution with a stride of 1 to the input image $x_i$ $(i = 1, 2, \ldots, n)$, followed by batch normalization and ReLU. Second, the convolution operations of 4 modules are performed in turn. After the convolutions, average pooling with a 2 × 2 window is performed, and dropout is applied after the average pooling.
Finally, the extracted features are input to two fully connected layers and one Softmax layer. The 512-dimensional feature vector is classified into 7 types of facial expressions, and the facial expression recognition results are obtained. The classification loss $E_c$ of the classifier is defined as follows: where $x_i$ is the original input image, $\theta_f$ is the parameter of the feature extractor, $G_f$ is the feature extractor, $\theta_c$ is the parameter of the classifier, $G_c$ is the classifier, $y_i$ is the real label, and $L_c$ is the classification loss.

Scientific Programming
At the same time, this paper adds a residual module to the generator, as shown in Figure 5(a). The structure of the forward propagation convolution unit is shown in Figure 5(b).
Through the adversarial training between the generator and the discriminator, the feature extractor's ability to extract features and the discriminator's recognition ability are improved. The feature synthesizer is symmetrical to the feature extractor and is mainly composed of convolutional layers and upsampling layers. After successive convolution and upsampling operations, the generated output image is restored to the original size.
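The idea behind a residual module can be illustrated with a skip connection on plain vectors. This is a generic sketch of the usual residual form, output = ReLU(F(x) + x), and does not reproduce the specific layers of Figure 5; the names `residual_block` and `transform` are hypothetical.

```python
def relu(values):
    """Element-wise ReLU."""
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """Apply `transform` to x, add the input back, then apply ReLU.

    `transform` stands in for the stacked convolution layers and must
    return a vector of the same length as x.
    """
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("transform must preserve the vector length")
    return relu([a + b for a, b in zip(fx, x)])


# With a transform that outputs zeros, the block passes the input through
# (up to the final ReLU), which is what makes deep stacks easier to train.
print(residual_block([1.0, -2.0, 3.0], lambda v: [0.0] * len(v)))  # [1.0, 0.0, 3.0]
print(residual_block([1.0, 2.0], lambda v: [0.5 * t for t in v]))  # [1.5, 3.0]
```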

Experimental Results and Analysis
The experiment uses the TensorFlow framework to train the network model on the simulation datasets. Python is used as the programming language, and NVIDIA CUDA 9.0 is used for GPU-accelerated computing.
The specific system development environment of the face recognition simulation experiment is shown in Table 1.

Parameter Setting.
When the face recognition network is trained, SGD is used as the optimizer, with the momentum set to 0.9 and the weight decay set to $10^{-4}$. The learning rate follows a decay strategy that multiplies the initial value $lr = 3 \times 10^{-3}$ by $(1 - \text{current\_iter}/\text{max\_iter})^{\text{power}}$, where power = 0.9, current_iter is the current number of iterations, and max_iter is the maximum number of iterations in the training process. For the discriminator network, the Adam optimizer is used with betas = (0.9, 0.99) and an initial $lr = 1 \times 10^{-4}$. The learning rate decay strategy is the same as the one used for training the segmentation network. Taking the GPU memory limitation into account, the image size in the experiment is set to 348 × 348 pixels.
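The polynomial decay strategy described above can be written as a small function. The default values are the ones stated in the text (initial learning rate 3 × 10⁻³, power 0.9); the function name `poly_lr` and the choice of 1000 iterations in the usage lines are assumptions for illustration.

```python
def poly_lr(base_lr, current_iter, max_iter, power=0.9):
    """Learning rate after `current_iter` of `max_iter` iterations:
    base_lr * (1 - current_iter / max_iter) ** power."""
    if max_iter <= 0 or not 0 <= current_iter <= max_iter:
        raise ValueError("require 0 <= current_iter <= max_iter and max_iter > 0")
    return base_lr * (1.0 - current_iter / max_iter) ** power


# Hypothetical schedule with 1000 iterations.
for step in (0, 250, 500, 750, 1000):
    print(step, poly_lr(3e-3, step, 1000))
```

The schedule starts at the initial learning rate, decreases monotonically, and reaches zero at the final iteration.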

Evaluation Index.
To measure the recognition performance of our method, objective and fair evaluation indexes should be used. Accuracy (AC), precision (P), and recall (R) are commonly used indicators in big data image classification research and can be used to analyze the performance of face recognition results. The calculation formulas are shown in formulas (7)-(9). P is the fraction of the samples predicted to be positive that are truly positive. R is the fraction of the truly positive samples that the model predicts to be positive.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}.$$

For classification problems, the combination of the model prediction and the true category of a sample falls into one of four cases: true positive (TP), true negative (TN), false positive (FP), and false negative (FN). The precision and recall rate can be read from a confusion matrix, as shown in Table 2.
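These three indicators follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions; the counts in the usage lines are hypothetical and are not results from the paper.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all samples that are classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def precision(tp, fp):
    """Fraction of predicted positives that are truly positive."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Fraction of true positives that are predicted as positive."""
    return tp / (tp + fn)


# Hypothetical counts for one expression class treated as "positive".
tp, tn, fp, fn = 40, 45, 5, 10
print(accuracy(tp, tn, fp, fn))  # 0.85
print(precision(tp, fp))
print(recall(tp, fn))            # 0.8
```

For the seven-class problem, each expression is treated in turn as the positive class and the remaining six as negative.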
At the same time, the loss value $c$ is used to evaluate the model and to measure the training quality of the GAN model. The appropriate number of iterations is determined during training. In this paper, the cross-entropy loss is used to express the probability that an input sample belongs to each class, and its expression is as follows: where $y$ is the true classification value, $a$ is the predicted value, and $c$ represents the loss value.
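A cross-entropy loss of this kind can be sketched in its common multi-class form, $c = -\sum_k y_k \log a_k$, where $y$ is a one-hot label and $a$ is a Softmax output. This form is an assumption; the paper's exact expression may differ (for example, it may average over samples). The function names and the example logits are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(y, a, eps=1e-12):
    """Cross-entropy between a one-hot label y and predicted probabilities a.
    Probabilities are clipped at eps to avoid log(0)."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y, a))


# Seven expression classes; the true class is index 3 (hypothetical scores).
label = [0, 0, 0, 1, 0, 0, 0]
confident = softmax([0.1, 0.2, 0.0, 4.0, 0.3, 0.1, 0.2])
uniform = softmax([0.0] * 7)
print(cross_entropy(label, confident))
print(cross_entropy(label, uniform))  # equals log(7) for a uniform prediction
```

The loss is small when the predicted probability of the true class is high and equals log 7 when the model assigns equal probability to all seven expressions.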

Training Process.
We analyze the recognition and classification performance of the GAN model on the collected face data and explore the convergence of the training process. Figure 6 shows the convergence performance of the training process for the expression dataset.
By the 10th iteration, the recognition accuracy on the training set has reached 95%. At the end of 15 iterations, the accuracy is close to 100%. At the same time, the loss function values of each iteration show that the training loss decreases quickly and effectively before the 10th iteration. At the 18th iteration, the loss function value is close to 0. In summary, the improved GAN-based expression recognition method has good convergence performance.

Simulation Analysis of General Experimental Dataset.
The experimental simulation analysis is carried out using the methods proposed in references [21][22][23] and in this paper. To verify the generalization performance of the proposed recognition method on datasets of different sizes, the JAFFE, CK+, and FER2013 datasets are selected as the small, medium, and large experimental datasets, respectively. The JAFFE dataset was created by the Michael Lyons team. It contains the expressions of 10 Japanese female participants, with a total of 213 facial images. The JAFFE dataset covers 7 types of basic expressions: anger, happy, sad, surprise, fear, disgust, and neutral.

JAFFE Dataset Experiment.
This paper chooses the JAFFE dataset as the small dataset to simulate and verify the performance of the proposed GAN facial expression recognition method. Table 3 shows the results on the JAFFE dataset under different methods.
It can be seen from Table 3 that the accuracy of our method on the JAFFE dataset is 96.6%, which is 0.9, 2.1, and 2.3 percentage points higher than those of references [21][22][23], respectively. The proposed method has no obvious advantage over the comparison methods in simulation runtime. Even so, given its higher accuracy, we believe that the proposed method is a suitable choice for expression recognition on small datasets.

CK+ Dataset Experiment.
The CK+ dataset is used as the medium-sized dataset for facial expression recognition in this paper, and the comparison methods are again analyzed against our method. The recognition performance on the CK+ dataset under different methods is shown in Table 4.
It can be seen from Table 4 that our method has an accuracy of 95.6% on the CK+ dataset, which is 5.3 percentage points higher than that of reference [23]. The PCNN used in reference [23] has more network layers and suffers from vanishing gradients during training, which causes a large gap in recognition accuracy compared with our method. The simulation time of the recognition method in this paper is 84.23 s, which is 5.3 s shorter than that of reference [21]. The simulation time of reference [23] is close to ours, but reference [23] has no advantage in recognition accuracy. This indicates that the GAN-based method has good accuracy and real-time performance for facial expression recognition on medium-sized datasets. Table 5 shows the simulation analysis results on the large dataset using different facial expression recognition and classification methods.

FER2013 Dataset Experiment.
From Table 5, the accuracy of expression classification and recognition of all methods on the FER2013 dataset is below 75%. This is because the FER2013 dataset contains a certain number of mislabeled samples, which lowers the accuracy of every recognition method. However, our method has the highest accuracy, 72.8%. The running time of our method is 134.23 s, which is more than 10 s shorter than those of references [21][22][23]. Therefore, the improved GAN-based expression recognition method proposed in this paper can also be used for large-scale dataset analysis.
Table fragment (values as extracted; columns presumably accuracy in % and running time in s): Reference [22]: 93.5, 55.32; Reference [23]: 94.3, 56.32.
To further illustrate recognition performance, a confusion matrix is used to display the recognition results obtained by our method, as shown in Figure 7. The accuracy of the method for the recognition of anger, disgust, fear, happiness, sadness, surprise, and neutral expressions is 65%, 62%, 57%, 88%, 58%, 85%, and 67%, respectively. Figure 7 shows that the method performs well in identifying "happy" and "surprised," with accuracy rates reaching 88% and 85%, respectively. In addition, the ability of the generative adversarial network to recognize "fear" expressions is low, with an accuracy rate of 57%. This is attributed to the poor label quality of the FER2013 dataset.
In summary, compared with other methods, our method has higher accuracy and operating efficiency on datasets of different sizes, which shows that it has excellent generalization ability.
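Per-class accuracy is read off the diagonal of a confusion matrix, normalized by each row. The sketch below shows the computation on a small hypothetical three-class matrix (the counts are invented for illustration) and also averages the seven per-class values reported above.

```python
def per_class_accuracy(matrix):
    """matrix[i][j] = number of samples of true class i predicted as class j.
    Returns, for each class, the fraction of its samples predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


def overall_accuracy(matrix):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def macro_average(values):
    """Unweighted mean, giving every class the same weight."""
    return sum(values) / len(values)


# Hypothetical, imbalanced 3-class confusion matrix.
matrix = [[8, 1, 1], [2, 6, 2], [0, 1, 79]]
print(per_class_accuracy(matrix))  # [0.8, 0.6, 0.9875]
print(overall_accuracy(matrix))    # 0.93

# Per-class accuracies (%) reported in the text for FER2013.
REPORTED = [65, 62, 57, 88, 58, 85, 67]
print(macro_average(REPORTED))     # about 68.86
```

The unweighted mean of the reported values, about 68.9%, is lower than the overall accuracy of 72.8%. One possible reason is that overall accuracy weights each class by its number of samples, so well-recognized classes with many samples raise it, as in the hypothetical matrix above.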

Conclusion
This paper proposes a facial expression recognition method based on GAN. The method relies on continuous adversarial training between the generator and the discriminator in the GAN, which realizes accurate extraction of dataset features and ensures accurate recognition of facial expressions. By improving the generator structure in the GAN, the residual network is combined with image processing technology.
Thus, the amount of calculation of the recognition network model is reduced. Finally, the performance of our method for facial expression recognition is validated on general datasets of different sizes. The results indicate that our method has clear advantages in recognition accuracy and processing speed.
In the future, we also plan to add an attention mechanism to the network to further improve accuracy and prune the network to improve efficiency and strive to achieve industrialization.
Data Availability
The data included in this paper are available without any restriction.

Conflicts of Interest
The author declares that there are no conflicts of interest regarding the publication of this paper.