Large-pose facial makeup transfer based on generative adversarial network combined face alignment and face parsing

Facial makeup transfer is a special form of image style transfer. When the reference makeup image has a large pose, improving the quality of the image generated after makeup transfer remains a challenging problem. In this paper, a large-pose makeup transfer algorithm based on a generative adversarial network (GAN) is proposed. First, a face alignment module (FAM) is introduced to locate the key points, such as the eyes, mouth and skin. Second, a face parsing module (FPM) and a face parsing loss are designed to analyze the source image and extract the face features. Then, the makeup style code is extracted from the reference image, and the makeup transfer is completed by integrating the facial features with the makeup style code. Finally, a large-pose makeup transfer (LPMT) dataset is collected and constructed. Experiments are carried out on the traditional makeup transfer (MT) dataset and the new LPMT dataset. The results show that the images generated by the proposed method are of better quality than those of the latest methods for large-pose makeup transfer.


Introduction
Humanity's pursuit of beauty has never changed, and facial appearance is a crucial part of it. Among facial beautification techniques, makeup is the most popular, supported by a range of commercial products including eye shadows, lipsticks, and foundations. The desire for beautification is rooted in social etiquette, which ties makeup to social activities in many cultures around the globe. With the development of the internet in recent years, online sales channels for cosmetics have grown rapidly worldwide, and the sales pages of cosmetics products now widely display photos of models wearing the makeup. Additionally, more and more beauty bloggers and live-streaming performers share their makeup looks on various online platforms. As the online beauty and makeup market expands, consumers will be exposed to more options and will require technology that ensures beauty protocols are followed properly. Makeup transfer technology serves this purpose by transferring the makeup of a made-up face photo to any plain face photo while keeping the identity information of the latter unchanged. In fact, makeup transfer technology has been widely used in many industries and fields. In the art industry, makeup reflects people's attitude towards life and their pursuit of quality of life in a given era, and the makeup of characters in works left by artists of different times helps us understand the customs of each period. In the film and television industry, proper makeup helps shape a character, and futuristic makeup in science fiction movies stimulates people's imagination and exploration of future concepts. Therefore, makeup transfer is a meaningful research direction.
Makeup transfer can be regarded as a special kind of image style transfer. Image style transfer refers to the technique of using an algorithm to learn the style of an image and then applying this style to another image. Image style transfer only involves the transfer at the domain-level, emphasizing the differences within the domain, while the makeup transfer is not only a global style transfer, but also several independent style transfers of different facial regions. Therefore, the makeup transfer requires additional makeup losses to achieve makeup details.
In recent years, machine learning and deep learning have become increasingly intertwined with image processing technology. Many well-known algorithms and models of this kind have been proposed and applied in the field of image processing, with excellent results. Makeup transfer has been studied for more than a decade. Early traditional makeup transfer algorithms [1−5] can be divided into two categories according to their requirements on the data set. The first category requires a large number of paired before-and-after makeup images as a training set, that is, a supervised model; the second category does not require paired images, that is, an unsupervised model. Makeup transfer algorithms based on deep learning have a more straightforward framework and are the current mainstream research direction. Currently, they primarily include algorithms based on database matching and algorithms based on generative adversarial networks (GANs).
The main inspiration for GAN comes from the zero-sum game in game theory. Applied to deep neural networks, the generator network G (Generator) and the discriminator network D (Discriminator) play a continuous game through which G learns the distribution of the data. Before GANs were used for makeup transfer, the task was divided into several parts that were processed separately and then integrated, so the output image often appeared unnatural. With the maturing of technologies such as GAN [6−10], facial makeup transfer algorithms have made significant progress. Most GAN-based makeup transfer algorithms use CycleGAN [11] as the primary network framework, whose structure includes two generators, G and F, and two discriminators, D and H. An image in the X domain is mapped to the Y domain by the generator G and then reconstructed back to the original X-domain input by the generator F; an image in the Y domain is mapped to the X domain by the generator F and then reconstructed back by the generator G. The discriminators D and H play a discriminative role in ensuring the style transfer of the images. PairedCycleGAN [12] further introduces asymmetric functions on top of CycleGAN to complete the makeup transfer task. BeautyGlow [13] uses the Glow framework to decompose facial features into makeup and non-makeup features. The GAN-based makeup transfer algorithm represented by BeautyGAN [14] takes two face images as input, one without makeup and one with makeup, and outputs the result of a makeup exchange: an image with makeup applied and an image with makeup removed. However, there are still difficulties in practical applications in uncontrolled environments, such as the problem of large poses. T. Nguyen et al. [15] introduce UV texture mapping into the GAN framework for extremely wild makeup to improve the makeup quality of the generated images, and collect and organize the CPM (Color-&-Pattern Makeup) dataset. The method proposed by Z. Sun et al. [16] adds a semantic segmentation loss to the traditional makeup transfer loss, which improves the makeup effect after transfer. In addition, Z. Huang et al. [17] propose a new real-world automatic face makeup network, IPM-Net. Z. Wan et al. [18] propose a novel Facial Attribute Transformer (FAT) and its variant Spatial FAT for high-quality makeup transfer. J. Lee et al. [19] propose to use the identical image with geometric distortion as a virtual reference, which makes it possible to obtain a ground truth for the colored output image. However, none of the above methods addresses the transfer of large-pose makeup.
The notion of large-pose makeup was introduced by W. Jiang et al. [20] in the PSGAN paper: makeup images with non-frontal poses and non-neutral expressions are collectively referred to as large-pose makeup images. Before PSGAN, makeup transfer algorithms represented by BeautyGAN [14] demonstrated success in makeup transfer. However, these methods require the source and reference images to be well-aligned frontal faces, and enforcing cycle consistency [11] does not guarantee correct spatial transformations. PSGAN [20] addresses these problems with an attentive makeup morphing (AMM) module [21], which introduces an attention mechanism into the network and handles large poses in makeup transfer, but it tends to produce shadows in the makeup of the final transferred image. SCGAN [22] is a flexible and controllable makeup transfer model that designs a latent space for extracting makeup from large poses. However, it processes the source image too simply, which results in several issues, including missing facial features and blurred, inaccurate makeup in the final transferred image.
In this paper, a large-pose makeup transfer model is proposed that adopts the makeup style extraction module and the makeup integration module of SCGAN. First, a face alignment module (FAM) is designed and introduced to better locate the key points of the face. Second, a face parsing module (FPM) is designed for the original facial feature extraction encoder; it introduces a convolutional neural network to improve the accuracy of facial identity feature extraction and is trained with a proposed face parsing loss. The FPM includes two branches, the face feature extraction branch and the face reconstruction branch. The former extracts face features, and the latter reconstructs a face from the extracted features and generates an image. The face parsing loss constrains the reconstructed face to be similar to the input face. The two branches work together to improve the accuracy of face feature extraction and maintain the consistency of face identity.
In addition, since few data sets are based on large-pose makeup images, a new makeup transfer data set composed mainly of large-pose makeup images (abbreviated as LPMT) is collected and established for the experiments. Because the quality of the generated makeup images is analyzed in this paper, the face regions are cropped out as rectangles from all images. Some large-pose images with and without makeup are shown in Figure 1, and some normal-pose images with and without makeup are shown in Figure 2.
Let X represent the source image domain and Y represent the reference image domain. Given any source image xsrc ∈ X and any reference image yref ∈ Y, the goal of the algorithm is to transfer the makeup style from the reference image yref to the source image xsrc and obtain the target image xt, so that xt has the makeup style of yref and the facial identity features of xsrc.
As shown in Figure 3, the overall network structure is divided into four parts: the FAM, the FPM, the makeup style module and the makeup fusion module. The makeup style module [22] extracts the makeup information of the reference image and maps it to a one-dimensional style code Z. The FAM locates the key parts of the face. The FPM extracts the face identity features of the source image. The makeup fusion module [22] fuses the makeup style and the facial identity features and generates the final result.

Face alignment module
Face alignment is used to locate the facial key points. Face preprocessing is very important before face parsing: the face must first be detected in the image and then mapped to an aligned standard face through a similarity transformation. At present, the most commonly used alignment method detects five facial key points with a CNN and then applies a similarity transformation to obtain the aligned face. The FAM in our method is used to locate the facial features accurately, improve the accuracy with which the face and the makeup code are fused, and improve the final makeup effect.
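The similarity-transformation step described above can be illustrated with a minimal sketch. It is not the FAM itself: the five-point template coordinates are hypothetical, the detected key points are simulated rather than predicted by a CNN, and the least-squares fit treats 2D points as complex numbers (rotation, uniform scale and translation only).

```python
import cmath
import math
from typing import List, Tuple

Point = Tuple[float, float]


def fit_similarity(src: List[Point], dst: List[Point]) -> Tuple[complex, complex]:
    """Least-squares similarity transform z -> a*z + b that maps src onto dst.

    Points are treated as complex numbers x + iy: `a` encodes rotation and
    uniform scale, `b` encodes translation.
    """
    p = [complex(x, y) for x, y in src]
    q = [complex(x, y) for x, y in dst]
    p_mean = sum(p) / len(p)
    q_mean = sum(q) / len(q)
    num = sum((pi - p_mean).conjugate() * (qi - q_mean) for pi, qi in zip(p, q))
    den = sum(abs(pi - p_mean) ** 2 for pi in p)
    a = num / den
    b = q_mean - a * p_mean
    return a, b


def apply_similarity(a: complex, b: complex, pts: List[Point]) -> List[Point]:
    """Apply z -> a*z + b to a list of (x, y) points."""
    out = []
    for x, y in pts:
        z = a * complex(x, y) + b
        out.append((z.real, z.imag))
    return out


# Hypothetical template: eye centers, nose tip, mouth corners in the aligned crop.
TEMPLATE = [(38.0, 52.0), (74.0, 52.0), (56.0, 72.0), (42.0, 92.0), (70.0, 92.0)]

# Simulated detections: the template rotated by 30 degrees, scaled by 2 and shifted.
detected = apply_similarity(cmath.rect(2.0, math.radians(30.0)), complex(10.0, 5.0), TEMPLATE)

a, b = fit_similarity(detected, TEMPLATE)
aligned = apply_similarity(a, b, detected)
max_error = max(math.hypot(x - tx, y - ty) for (x, y), (tx, ty) in zip(aligned, TEMPLATE))
```

In a full pipeline, the fitted transform would be used to warp the whole image so that the face lands on the template positions; that warping step is omitted here.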
The face alignment module consists of three convolutional layers, three max-pooling layers and one fully connected layer [23], and the structure is shown in Figure 4.

Face parsing module
With the development of neural networks, models based on convolutional neural networks have multiplied rapidly [24−26]. In this paper, the FPM is proposed to accurately extract face identity features from the source image, since the accuracy of facial feature extraction directly affects the quality of the final generated image. The FPM extracts the facial features of the input image x, which are recorded as Fid = FPM(x). As shown in Figure 5, the FPM consists of two branches: the face parsing branch and the face reconstruction branch. The former consists of two down-sampling layers, an encoder and three residual blocks [27]. The encoder consists of three convolution modules, whose structure [28] is shown in Figure 6. Each convolution module contains a 1 × 1 convolution layer, a batch normalization layer, a ReLU activation layer and a max-pooling layer. The parsing branch improves the accuracy of facial feature extraction by introducing convolutional neural networks. The face reconstruction branch contains a decoder consisting of deconvolution layers, which decodes a face image from the extracted face features. A reconstruction loss constrains the generated face so that the identity information of the reconstructed image is consistent with that of the input source image, which further improves the accuracy of face feature extraction. Both branches are run in the training phase, whereas only the face parsing branch is run in the inference phase.

Makeup style module
The structure of the makeup style module (MSM) is shown in Figure 7. It consists of two down-sampling layers, a mapping module [29], and a latent space [22]. Style codes could be obtained by simply averaging facial features into a one-dimensional vector, but such style codes follow the probability density of the training data, which leads to inevitable entanglement between facial components; a nonlinear mapping module solves this problem. First, the reference makeup image yref is decomposed into three parts, ylip, yeyes and yskin, by the face parser. Second, the initial style codes are embedded in the latent space by the nonlinear mapping module, so the generated style codes are not restricted by the distribution of the training data and can be decomposed. The specific structure of the mapping module is shown in Figure 8; it includes an average pooling layer and a 1 × 1 convolutional layer. Each component yi is mapped to a style code zi through the mapping module to obtain zlip, zeyes and zskin. The style codes of the three components are then concatenated to obtain the complete initial style code Z in the latent space. The makeup code Z in the latent space retains only the color information of the makeup and ignores irrelevant information (such as whether the mouth is open or closed, or whether the eyes are wide or narrow), which enables makeup transfer under large poses.
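The construction of the style code Z can be sketched as follows. This is a toy illustration under stated assumptions, not the MSM: features are plain nested lists, the component masks are given rather than produced by a face parser, pooling is averaged over each mask (an assumption), and the 1 × 1 convolution on a pooled vector is written as the equivalent matrix-vector product with hypothetical weights.

```python
from typing import Dict, List

Feature = List[List[List[float]]]  # channels x height x width
Mask = List[List[int]]             # height x width, entries in {0, 1}


def masked_average_pool(feat: Feature, mask: Mask) -> List[float]:
    """Average each channel over the pixels selected by the mask."""
    area = sum(sum(row) for row in mask)
    if area == 0:
        return [0.0] * len(feat)
    return [
        sum(v * m for f_row, m_row in zip(ch, mask) for v, m in zip(f_row, m_row)) / area
        for ch in feat
    ]


def conv1x1(vec: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """A 1 x 1 convolution applied to a pooled (1 x 1) feature is a linear map."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def style_code(feat: Feature, masks: Dict[str, Mask],
               weights: Dict[str, List[List[float]]],
               biases: Dict[str, List[float]]) -> List[float]:
    """Concatenate the per-component codes z_lip, z_eyes, z_skin into Z."""
    code: List[float] = []
    for name in ("lip", "eyes", "skin"):
        pooled = masked_average_pool(feat, masks[name])
        code.extend(conv1x1(pooled, weights[name], biases[name]))
    return code


# Toy 2-channel, 2 x 2 feature map and hypothetical component masks.
feat = [[[1.0, 2.0], [3.0, 4.0]], [[10.0, 20.0], [30.0, 40.0]]]
masks = {"lip": [[1, 0], [0, 0]], "eyes": [[0, 1], [1, 0]], "skin": [[0, 0], [0, 1]]}
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = {name: identity for name in masks}
biases = {name: [0.0, 0.0] for name in masks}

Z = style_code(feat, masks, weights, biases)
```

Because each component is pooled over its own region before mapping, the resulting code carries no spatial layout, which matches the idea that Z keeps color information and discards pose and expression.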

Makeup fusion module
The structure of makeup fusion module (MFM) is shown in Figure 9(a), which includes two down-sampling layers and three fusion blocks. Its function is to combine the makeup style code Z and the facial features Fid to generate a target image with the makeup style of the reference image and the facial feature information of the source image. The structure of the fusion block [22] is shown in Figure 9(b), including two convolutional layers, two AdaIN layers, and a ReLU layer, where each AdaIN [30] layer is used to connect Z and Fid, and finally get the target transfer image.
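The role of the AdaIN layers in the fusion block can be illustrated with a minimal sketch of adaptive instance normalization: each channel of the identity features is normalized to zero mean and unit variance and then rescaled and shifted by style-dependent parameters. How the scale and shift are derived from Z is not specified here, so `gamma` and `beta` are passed in directly as hypothetical values.

```python
import math
from typing import List, Tuple

Feature = List[List[List[float]]]  # channels x height x width


def channel_stats(channel: List[List[float]]) -> Tuple[float, float]:
    """Return the mean and (population) standard deviation of one channel."""
    vals = [v for row in channel for v in row]
    mean = sum(vals) / len(vals)
    var = sum((v - mean) ** 2 for v in vals) / len(vals)
    return mean, math.sqrt(var)


def adain(feat: Feature, gamma: List[float], beta: List[float], eps: float = 1e-5) -> Feature:
    """Adaptive instance normalization: gamma * (x - mean) / std + beta per channel."""
    out = []
    for ch, g, b in zip(feat, gamma, beta):
        mean, std = channel_stats(ch)
        denom = math.sqrt(std ** 2 + eps)  # eps avoids division by zero on flat channels
        out.append([[g * (v - mean) / denom + b for v in row] for row in ch])
    return out


# Toy identity features F_id (2 channels) and hypothetical style parameters.
f_id = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]]]
gamma = [2.0, 0.5]
beta = [1.0, -1.0]

fused = adain(f_id, gamma, beta)
fused_stats = [channel_stats(ch) for ch in fused]
```

After the operation, each output channel has approximately the mean `beta` and standard deviation `gamma`, so the style parameters fully control the channel statistics while the spatial structure of the identity features is kept.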

Loss function
Since makeup and non-makeup images are unpaired, a cyclic approach is used to train the generative adversarial network. Given a non-makeup image x and a makeup image y, the generative adversarial network is used to realize the mutual mapping between the X domain and the Y domain. Hence, the training scheme of CycleGAN [11] is adopted, and the loss function includes an adversarial loss, a cycle consistency loss, a perceptual loss, a makeup loss, and a face parsing loss.

Adversarial loss
The adversarial loss, the most basic loss of GAN [31], is used to guide the generator to generate more realistic images. In this paper, two discriminators, $D_X$ and $D_Y$, are used, and their outputs lie in [0,1], where $D_X = 1$ indicates that the image is judged to come from the X space, and $D_Y = 1$ indicates that the image is judged to come from the Y space. The purpose of GAN training is to make the generator learn the distribution of the data. Both discriminators adopt the structure of PatchGAN [11]. The adversarial loss of the generator, $\mathcal{L}_{adv}^{G}$, and the adversarial loss of the discriminator, $\mathcal{L}_{adv}^{D}$, are defined as expectations over the two domains, where $\mathbb{E}_{x\sim X}$ means taking samples from the X space, $\mathbb{E}_{y\sim Y}$ means taking samples from the Y space, $\mathbb{E}(\cdot)$ represents the expected value, and $G(\cdot)$ represents the image generated by the generator G.
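For orientation, one standard way to write such a pair of adversarial losses is the log-likelihood form below. This is an assumption for illustration rather than the exact definition used in this paper: it takes $G(x, y)$ to be the source image x with the makeup of y (an image in the Y space) and $G(y, x)$ to be the makeup-removed image (an image in the X space).

```latex
\begin{aligned}
\mathcal{L}_{adv}^{D} ={}& -\mathbb{E}_{x\sim X}\left[\log D_X(x)\right]
                           -\mathbb{E}_{y\sim Y}\left[\log D_Y(y)\right] \\
                         & -\mathbb{E}_{x\sim X,\,y\sim Y}\left[\log\left(1 - D_X(G(y,x))\right)\right]
                           -\mathbb{E}_{x\sim X,\,y\sim Y}\left[\log\left(1 - D_Y(G(x,y))\right)\right], \\
\mathcal{L}_{adv}^{G} ={}& -\mathbb{E}_{x\sim X,\,y\sim Y}\left[\log D_X(G(y,x))\right]
                           -\mathbb{E}_{x\sim X,\,y\sim Y}\left[\log D_Y(G(x,y))\right].
\end{aligned}
```

Under this form, the discriminators are rewarded for scoring real images near 1 and generated images near 0, while the generator is rewarded when its outputs are scored near 1.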

Cycle consistency loss
Since the source image is not paired with the reference image, the cycle consistency loss proposed by CycleGAN [11] is adopted so that the final generated image has both the facial identity features of the source image and the makeup style of the reference image. Since two generators are used, the network should learn both mappings at the same time, so that $G(G(x, y), x)$ is as similar to $x$ as possible and $G(G(y, x), y)$ is as similar to $y$ as possible; the L1 norm is used to constrain the reconstructed images. The cycle consistency loss $\mathcal{L}_{cyc}$ is defined as:
$$\mathcal{L}_{cyc} = \mathbb{E}_{x\sim X,\,y\sim Y}\left[\left\| G(G(x, y), x) - x \right\|_1 + \left\| G(G(y, x), y) - y \right\|_1\right]$$
where $\|\cdot\|_1$ represents the L1 norm, x is the source image, y is the reference image, and G is the generator.

Makeup loss
Makeup style varies from person to person, and makeup transfer is largely a matter of transferring independent styles in different areas of the face, so a makeup loss is introduced. The makeup loss was proposed by BeautyGAN [14] and is calculated by histogram matching (HM) to improve the makeup details of the generated images. The histogram matching loss is more stable than the Gram loss used on the style layers in traditional style transfer. The makeup loss $\mathcal{L}_{makeup}$ is defined with the L2 norm $\|\cdot\|_2$ between the generated image and its histogram-matched counterpart, where $HM(\cdot)$ denotes histogram matching and $HM(x, y)$ denotes an image that has the makeup style of y while preserving the identity features of x.
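Histogram matching and the resulting loss can be sketched for a single 8-bit channel as follows. This is a simplified illustration under stated assumptions: pixels are flat lists of integers, the region masks (lips, eyes, skin) are omitted so the whole list is treated as one region, and the loss is the plain L2 distance to the histogram-matched target.

```python
import math
from bisect import bisect_left
from typing import List


def cumulative_histogram(values: List[int], levels: int = 256) -> List[float]:
    """Normalized cumulative histogram (CDF) of integer intensities."""
    hist = [0] * levels
    for v in values:
        hist[v] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running / len(values))
    return cdf


def histogram_match(source: List[int], reference: List[int], levels: int = 256) -> List[int]:
    """Remap source intensities so that their CDF follows the reference CDF."""
    src_cdf = cumulative_histogram(source, levels)
    ref_cdf = cumulative_histogram(reference, levels)
    # For each source level, pick the first reference level whose CDF reaches it.
    lut = [min(bisect_left(ref_cdf, c), levels - 1) for c in src_cdf]
    return [lut[v] for v in source]


def makeup_loss(generated: List[int], source: List[int], reference: List[int]) -> float:
    """L2 distance between the generated pixels and HM(source, reference)."""
    target = histogram_match(source, reference)
    return math.sqrt(sum((g - t) ** 2 for g, t in zip(generated, target)))


# Toy region: the source keeps its pixel ordering but takes on the reference colors.
source = [0, 0, 1, 1, 2, 2, 3, 3]
reference = [10, 10, 20, 20, 30, 30, 40, 40]
matched = histogram_match(source, reference)
loss_if_perfect = makeup_loss(matched, source, reference)
loss_if_unchanged = makeup_loss(source, source, reference)
```

The matched image keeps the spatial arrangement of the source while adopting the color distribution of the reference, which is why it can act as a pseudo ground truth for the makeup of each region.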

Perceptual loss
Perceptual loss is used to ensure that the facial details of the generated image are similar to those of the source image and to improve the quality of the generated image. The perceptual loss proposed by J. Johnson et al. [32] is adopted, which is computed from the distance between two images in a high-dimensional feature space, measured with the L2 loss. In the perceptual loss $\mathcal{L}_{per}$, $F_l(\cdot)$ is the output of the $l$-th layer of the VGG-16 [25] model, which is pre-trained on ImageNet.

Face parsing loss
To accurately extract the face identity features of the source image, a face parsing loss is introduced into the FPM. The parsing loss constrains the image generated by the reconstruction branch to be similar to the input image. The parsing loss $\mathcal{L}_{fp}$ is computed over the n pixels of the input image x and of the image generated after reconstruction.

Total loss
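In summary, the overall loss of the entire network is defined as:
$$\mathcal{L}_{total} = \lambda_{adv}\mathcal{L}_{adv} + \lambda_{cyc}\mathcal{L}_{cyc} + \lambda_{per}\mathcal{L}_{per} + \lambda_{makeup}\mathcal{L}_{makeup} + \lambda_{fp}\mathcal{L}_{fp}$$
where each $\lambda$ is the weight of the corresponding loss term.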

Implementation details
In this paper, combined with the operation methods of the datasets in BeautyGAN [14] and PSGAN [20], 100 makeup images are randomly selected from the MT dataset, and 150 makeup images and 100 non-makeup images are randomly selected from the LPMT dataset for testing. The rest of the LPMT dataset images are used as the training set for the experiment. In the test phase, the large-pose makeup is tested, and the normal-pose face makeup is also tested.
All experiments in this paper are trained and tested on an NVIDIA GTX 1650 Ti GPU with PyTorch 1.6.0. Features are extracted from the Relu_4_1 layer of VGG-16 [25] to calculate the perceptual loss. The optimizer for both the generator and the discriminator is Adam [33], with β_1 = 0.2, β_2 = 0.9, a learning rate of 0.0002, and a batch size of 1. The weights of the loss functions are set to λadv = λmakeup = 1, λcyc = 10, λper = 0.005 and λfp = 0.01. To ensure the quality of the final generated images and a fair comparison with previous methods in the comparison experiments, the weights of the loss functions are kept consistent with those of previous methods.
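The way these weights combine the five loss terms can be sketched as a weighted sum. The weights are the ones listed above; the individual loss values are hypothetical placeholders, and the function name is illustrative only.

```python
from typing import Dict

# Loss weights as reported in the implementation details.
LOSS_WEIGHTS: Dict[str, float] = {
    "adv": 1.0,
    "cyc": 10.0,
    "per": 0.005,
    "makeup": 1.0,
    "fp": 0.01,
}


def total_loss(losses: Dict[str, float], weights: Dict[str, float] = LOSS_WEIGHTS) -> float:
    """Weighted sum of the individual loss terms."""
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"missing loss terms: {sorted(missing)}")
    return sum(weights[name] * losses[name] for name in weights)


# Hypothetical loss values for one training step.
example_losses = {"adv": 0.7, "cyc": 0.05, "per": 2.0, "makeup": 0.3, "fp": 0.4}
total = total_loss(example_losses)
```

With these weights, the cycle consistency term is emphasized most strongly, while the perceptual and face parsing terms act as comparatively small regularizers.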

Makeup and removal
In the experimental test process, simultaneously inputting a makeup image and a non-makeup image can get the image after makeup transfer and an image after makeup removal. In this paper, normal-pose makeup and large-pose makeup are tested successively, and the final results are shown in Figure 10. In both (a) and (b) of Figure 10, the first column is the image without makeup; the second column is the image with makeup, the third column is the image with makeup transferred, and the fourth column is the image with makeup removed.

Comparison of normal-pose makeup transfer
For reference images of normal-pose makeup, the method proposed in this paper is compared with current advanced makeup transfer methods, and the results are shown in Figure 11. Since the code and trained models of BeautyGAN [14], BeautyGlow [13], PairedCycleGAN [12], and SOGAN [34] have not been released, the result images given in the corresponding references are used directly for comparison. As the two groups of comparison images show, the images generated by DIA [35] have very light colors after the transfer of eye and lip makeup, and the makeup is not apparent enough. The images generated by CycleGAN [11] have a relatively weak sense of makeup and no specific makeup details; most obviously, the lip color fails to transfer in the first group. The images generated by PairedCycleGAN [12] also have a weak sense of makeup, and the skin color does not change in the second group. BeautyGAN [14] and BeautyGlow [13] generate better images than the previous three methods, but there is still room for improvement. In the first group, the lip shape in the image generated by BeautyGAN is not distinct; in the image generated by BeautyGlow, the red lip makeup shows faint traces of black, and the skin keeps the paler color of the source image. In the second group, the skin color in the image generated by BeautyGlow again fails to transfer and remains the darker yellow skin color of the source image. The face identity information of the images generated by LADN [36] is slightly changed compared with the source images; especially in the first group, the image generated by LADN has rough makeup, and the skin color of the forehead is layered.
The images generated by SOGAN [34] have a good makeup effect, but the facial identity information also changes somewhat. In the first group, the nose tip in the image generated by SOGAN shows signs of whitening, and the lip shape also changes; in the second group, the nasolabial folds in the image generated by SOGAN become lighter. The makeup effect of the images generated by PSGAN [20] and SCGAN [22] is better, but the makeup details can be improved. In the first group, the eye makeup in the image generated by PSGAN is very light, and the black eyeliner is not obvious enough; the lip makeup in the image generated by SCGAN is rough, and the lip peak also appears red. In the second group, the black eye shadow in the image generated by SCGAN turns purple. In contrast, the lip makeup in the image generated by our method in the first row fits the lip shape better, and the details of the eye makeup are also better. In the image generated by our method in the second row, the eye makeup is cleaner. Both images also preserve the identity features of the source image well. Because FPM and FAM locate the facial features more accurately, our method is suitable not only for large-pose faces but also for normal-pose faces: when the makeup is transferred, its color fits the contours of the facial features more closely, yielding better makeup details.
For reference images of large-pose makeup with different expressions and poses, the method proposed in this paper is compared with the current state-of-the-art methods SCGAN [22] and PSGAN [20]. The results are shown in Figure 12. The images generated by PSGAN [20] in the first, second, and third rows have serious shadow marks in the makeup, and the lip color of the generated image in the fourth row is wrong. The images generated by SCGAN [22] are prone to problems such as makeup that fails to be applied in some areas or that is not detailed enough. Taking skin color as an example, the skin color of the images generated in the first and second rows is wrong, the skin color in the third row does not change and the lip color is also lighter, and the eye makeup in the fourth row is not obvious enough. By comparison, with FPM and FAM added in our method, the generated image in the first row does not show black skin color on the right side of the face, and the black eyeliner is also more obvious. Additionally, the generated images in the second and third rows do not exhibit "dark circles". In the fourth row, the image generated by our method shows the details of the orange blush, and the lipstick color matches the lip shape better.

Objective evaluation
Most makeup transfer papers do not use objective indicators to evaluate the experimental results. In this paper, following CPM [15] and AesGAN [37], two objective evaluation indicators, the structural similarity index measure (SSIM) and the peak signal-to-noise ratio (PSNR), are used to assess image quality objectively on the MT and LPMT datasets. PSNR is the most common and widely used objective measure of the quality of generated images, and SSIM is also often used to evaluate image quality. Higher scores on both indicators indicate better image quality. The final average score of each indicator is shown in Table 1. Our method achieves the highest average score on both metrics, which we attribute to its better handling of makeup details.
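The two indicators can be sketched on flat lists of 8-bit pixel values as follows. PSNR follows its usual definition; the SSIM function is a simplified single-window (global) variant with the commonly used constants, whereas practical SSIM implementations average the index over local sliding windows. Which image pairs are compared is not specified here, so the inputs are toy data.

```python
import math
from typing import List


def psnr(a: List[float], b: List[float], max_val: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB; infinite for identical inputs."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_val ** 2 / mse)


def global_ssim(a: List[float], b: List[float], max_val: float = 255.0) -> float:
    """SSIM computed once over the whole image (simplified, no sliding window)."""
    c1 = (0.01 * max_val) ** 2
    c2 = (0.03 * max_val) ** 2
    n = len(a)
    mu_a = sum(a) / n
    mu_b = sum(b) / n
    var_a = sum((x - mu_a) ** 2 for x in a) / n
    var_b = sum((y - mu_b) ** 2 for y in b) / n
    cov = sum((x - mu_a) * (y - mu_b) for x, y in zip(a, b)) / n
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / (
        (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    )


# Toy example: the second image is the first one brightened by one gray level.
image_a = [10, 50, 90, 130, 170, 210]
image_b = [v + 1 for v in image_a]

psnr_value = psnr(image_a, image_b)
ssim_value = global_ssim(image_a, image_b)
```

Both scores increase as the two images become more similar: PSNR grows without bound as the mean squared error shrinks, and SSIM approaches 1.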

Subjective evaluation
In this paper, 10 participants are invited to score the generated makeup images subjectively. The participants are eight women and two men, all undergraduate or graduate students. The eight women are all familiar with beauty makeup, and the two men have limited knowledge of makeup application. The participants are asked to consider both the makeup effect and the preservation of face identity information, with 5 as the highest score and 1 as the lowest, and the final average is taken. The results are shown in Table 2, and the improved method in this paper is rated more favorably.

Ablation experiment
The ablation experiments are carried out on the face parsing module and the face alignment module. Without the face alignment module, the final generated image fails to apply makeup in some parts because the face features are positioned inaccurately. As shown in Figure 13, in the first row the image generated without FAM is missing eye makeup. In the second row, the image generated without FAM not only lacks eye makeup but also has red lips turning black. This is because the facial features of large-pose face images are more difficult to locate accurately than those of normal-pose face images. The SCGAN method does not perform additional face alignment on the source image and the reference image before makeup transfer, so the extraction of the identity features of the source image may fail, and the makeup of the reference image may also fail to be extracted, resulting in partial makeup transfer failure when the makeup is finally fused. When FAM is added, the accuracy of facial feature positioning improves, and the problem of missing facial makeup is solved.
Without the face parsing module, some parts of the face in the generated makeup images are prone to uneven coloring. As shown in Figure 14, the makeup effect after adding the face parsing module is more delicate and natural. In the first row, the image in the third column is generated without FPM: the eyes are covered with black eye shadow, and the area around the lips is also stained with orange lipstick. In the image generated with FPM (first row, fourth column), the white of the right eye can still be seen clearly, and the lipstick color fits the shape of the lips well. In the second row, the image generated without FPM (third column) has unevenly distributed black eye shadow, and the lipstick color does not fit the lip shape well. In the image generated with FPM (second row, fourth column), the eye shadow is clean and even, and the lip color fits the lip shape better.
Figure 15 shows the face segmentation maps, in which pixels of different colors represent different facial components. In the first row, the left-eye region extracted in the segmentation without FPM is too large and not precise enough, and the final eye makeup is not as good as that obtained with FPM. In the second row, the lip region extracted in the segmentation without FPM is inaccurate, and the final lip makeup is not as good as that obtained with FPM.

Discussion
Three conclusions can be drawn from our experimental results. First, the main purpose of this work is to improve makeup transfer for large-pose faces. However, few open-source large-pose facial makeup transfer datasets are currently available. Therefore, the LPMT dataset is collected and built. Comparative experiments with previous classical algorithms are then conducted on both the LPMT dataset and the normal-pose facial makeup transfer dataset, MT, which makes our results more convincing. Second, our improved algorithm achieves better results in both normal-pose and large-pose makeup transfer. Makeup transfer depends heavily on makeup details, and improving them improves the overall makeup effect. The images generated by our method are more delicate in makeup details, and the method solves the makeup transfer failure seen in some images. Lastly, whereas most makeup transfer algorithms do not use objective indicators to evaluate the generated images, SSIM and PSNR are used in this paper to evaluate the experimental results, and the proposed method scores well on both.