Face Swapping: Realistic Image Synthesis Based on Facial Landmarks Alignment

We propose an image-based face swapping algorithm, which can be used to replace the face in the reference image with the same facial shape and features as the input face. First, a face alignment is made based on a group of detected facial landmarks, so that the aligned input face and the reference face are consistent in size and posture. Second, an image warping algorithm based on triangulation is presented to adjust the reference face and its background according to the aligned input face. To achieve more accurate face swapping, a face parsing algorithm is introduced to detect the face-ROIs accurately, and the face-ROI in the reference image is then replaced with the input face-ROI. Finally, a Poisson image editing algorithm is adopted for boundary processing and color correction between the replacement region and the original background, which yields the final face swapping result. In the experiments, we compare our method with other face swapping algorithms and carry out a qualitative and quantitative analysis to evaluate the realism and the fidelity of the replaced face. The analysis results show that our method has some advantages in overall swapping performance.


Introduction
Face synthesis refers to the image processing technology that automatically fuses two or more different faces into one face. It is widely used in video synthesis, privacy protection, picture enhancement, and entertainment applications. For example, when we want to share something interesting on social networks, we can use face synthesis, which can be regarded as a fusion of facial features and details, to change our appearance appropriately without leaking private information. As another type of face fusion, face swapping combines some parts of one person's face with other parts of another person's face to form a new face image. For instance, in virtual hairstyle visualization, the client's facial area can be fused with the hair areas of the model images to form new photos, so that customers can virtually browse their own appearance with different hairstyles. This paper focuses on the face swapping problem in virtual browsing applications for hairstyle and dressing. The main contributions of the proposed algorithm are the following: (1) we construct a face swapping pipeline that integrates some learning-based modules into the traditional replacement-based approach, (2) we improve the realism and reliability of the synthesized face based on the precise detection of facial landmarks, and (3) we address the face occlusion problem by introducing an accurate face parsing algorithm.

Related Work
Existing face swapping algorithms can be roughly divided into three categories: replacement-based, model-based, and learning-based. The replacement-based method usually replaces the face region in the reference image with the input face region and then applies some image processing techniques to enhance the realism of the synthesized image. Bitouk et al. [1] built a face image set with different expressions, facial shapes, and postures, and the algorithm could automatically select the most similar reference face from the image set for face replacement. Although this method can reduce the influence of the differences in facial expression, face shape, and posture, its application is limited because users cannot freely choose their preferred reference face image. Mahajan et al. [2] established a mask image of facial features, and the mask-covered image was then attached to the reference image to achieve the face swapping. However, the selected mask image, which can only cover the facial features, is unable to retain other facial characteristics, such as wrinkles, textures, skin colors, and muscle deformations. In addition, this method requires users to supply interactive instructions at run time, which greatly reduces the practicability of the algorithm.
Mathematical Problems in Engineering
In the model-based approach [3], a two-dimensional or three-dimensional parametric feature model is established to represent the human face, and its parameters and features are fitted to the input image. Face reconstruction is then performed on the reference image based on the fitted model parameters. An early work by Blanz et al. [4] used a 3D model to estimate the face shape and posture, which mitigated the poor synthesis quality caused by differences in illumination and perspective. However, the algorithm requires a 3D input model and a manual initialization to get a good result, which imposes stricter requirements on data acquisition. Wang et al. [5] proposed an algorithm based on the active appearance model (AAM). Using a well-trained AAM, face swapping is realized in two steps: model fitting and component compositing. However, this method needs the face-ROI to be specified manually and requires a certain number of face images for model training. Lin et al. [6] presented a method of constructing a 3D model from the frontal face image to deal with the different perspectives of the reference image and the input image. However, the reconstructed model does not reflect the characteristics of the original face precisely and takes too much time to compute.
In most of the learning-based models, the reference image is converted into a synthesized face image by training a generative neural network that contains the information of the input image. Korshunova et al. [7] proposed a model based on a convolutional neural network, which can change the reference face into the input face while maintaining the posture, expression, and illumination of the former. Although this method [7] has some advantages in realism, it needs a lot of training data and a large amount of computation, and the trained network only works for one single person. In addition, generative adversarial networks (GANs) can also be used to generate synthetic face images [8]. This method replaces the latent space representation of the face and then reconstructs the entire face image with a region-separated GAN model. However, this model requires a large face image dataset with pixel-level parsing labels, and it slightly degrades the quality of the synthetic picture.
In summary, the replacement-based approach is simple and fast but sensitive to variations in posture and perspective. The model-based method can effectively solve the perspective problem; however, it usually needs three-dimensional face data, and its robustness is unsatisfactory. The learning-based approach can produce quite realistic and natural synthetic face images, but it usually requires a large amount of training data and places more restrictions on the input and reference faces. Taking the characteristics of these three approaches into account, we propose a face swapping algorithm supported by facial landmark alignment under the replacement-based framework.
In addition, other widely used algorithms are applied in our method to achieve better results, such as facial landmark detection [9,10], facial region segmentation [11,12], and face warping [13]. We detail how these algorithms are applied in the method section.

Method
The algorithm is composed of three steps: face alignment, warping and replacement. The accuracy and robustness of the algorithm are enhanced by introducing some learning-based modules like facial landmark detection and face parsing.
Pipeline of Face Swapping. Although face swapping seems straightforward, a carefully designed algorithm flow still affects the quality of the final result. The pipeline of the proposed algorithm starts with two channels that finally fuse into one, as shown in Figure 1. First, the input image is aligned with the reference image based on a facial landmark detection algorithm. Second, the reference image is warped to fit the aligned face of the input image. In the next step, the face-ROIs are extracted from the aligned input image and the warped reference image, respectively, with an advanced face parsing algorithm. Finally, the common steps of face replacement and color correction are applied to generate the final composite face image. In summary, the proposed algorithm is described in three parts: face alignment, face warping, and face replacement.
Face Alignment. As the first step of face swapping, alignment refers to matching the input face image $I_{in}$ and the reference face image $I_{ref}$ in size and direction. To detect the faces in the pictures, we apply the method of [14], which proposes a multiple sparse representation framework for visual tracking. Apart from increasing the speed of the algorithm, sparse coding and dictionary learning also enable this method to learn from relatively few samples. We then extract several stable key points from the images to mark the faces, a step referred to as facial landmark detection (FLD for short). In this paper, we employ a popular FLD algorithm [9,15] based on an ensemble of regression trees, using our own implementation, to detect the facial landmarks $\Omega_s = \{p_i^s \mid i = 1, 2, \ldots, 68\}$, as plotted in Figure 2(a). Each landmark is symmetric to another landmark with respect to the central axis of the face, such as $p_{22}^s$ and $p_{23}^s$, or $p_{49}^s$ and $p_{55}^s$. The landmarks located on the central axis, such as $p_{28}^s$ and $p_{31}^s$, are symmetric to themselves. To evaluate the rotation between the input face and the reference face, the central axis of the input face must be extracted first. According to this definition, the central axis $l_s$ can be evaluated by an optimization of the form

$$l_s = \arg\min_{l} \sum_{i=1}^{68} \left\| p_{i'}^s - M_l(p_i^s) \right\| \tag{1}$$

where $p_{i'}^s$ is the landmark symmetric to $p_i^s$ and $M_l(p_i^s)$ is the mapping of $p_i^s$ with respect to the line $l$. Equation (1) indicates that the central axis should be the line that optimizes the symmetry of all 68 detected facial landmarks. Consequently, the center $c_s$ of the face is defined as the middle point of the projections of all 68 detected facial landmarks on the central axis $l_s$. The average distance from all the detected landmarks to the center point is denoted by $d_s$, which can be used as the metric of the size of the input face. Therefore, the average distance can be written as

$$d_s = \frac{1}{|\Omega_s|} \sum_{p_i^s \in \Omega_s} \left\| p_i^s - c_s \right\| \tag{2}$$

where $|\Omega_s| = 68$.
Similarly, the central axis and the size of the reference face are denoted by $l_r$ and $d_r$, respectively. Face alignment is then implemented by rotating the input image $I_{in}$ by the angle $\theta$ and scaling it by the factor $k$, where $\theta$ is the rotation angle from $l_s$ to $l_r$ and $k$ is the ratio between $d_r$ and $d_s$, as shown in

$$\theta = \angle(l_s, l_r), \qquad k = \frac{d_r}{d_s}. \tag{3}$$

After the face alignment, the original facial landmarks $p_i^s$ move to the new locations $p_i^a$, and the aligned input image is denoted by $I_{in}^a$. Figure 2(b) shows the input face and the reference face with different sizes and poses. Figure 2(c) displays the results of facial landmark detection and face alignment. It can be seen from the result that the aligned input face is very similar in size and direction to the reference face.
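The alignment step can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: all function names are hypothetical, landmark indices are 0-based, the axis is approximated by a line fit through the midpoints of symmetric landmark pairs rather than by the optimization in Equation (1), the center is taken as the midpoint of the projection extent, and the rotation is assumed to pivot on the input face center.

```python
import numpy as np

def central_axis(landmarks, pairs, on_axis):
    """Approximate the central axis as (point on axis, unit direction).

    Assumption: a line fit through midpoints of symmetric pairs and the
    self-symmetric landmarks stands in for the optimization in Equation (1).
    """
    mids = [(landmarks[i] + landmarks[j]) / 2.0 for i, j in pairs]
    pts = np.array(mids + [landmarks[i] for i in on_axis])
    origin = pts.mean(axis=0)
    # The first right singular vector is the principal direction of the points.
    _, _, vt = np.linalg.svd(pts - origin)
    direction = vt[0]
    if direction[1] < 0:  # fix the sign; assumes a roughly upright face
        direction = -direction
    return origin, direction

def face_center_and_size(landmarks, origin, direction):
    """Center: midpoint of the projections on the axis; size: mean distance to it."""
    t = (landmarks - origin) @ direction  # scalar projections on the axis
    center = origin + 0.5 * (t.min() + t.max()) * direction
    size = np.linalg.norm(landmarks - center, axis=1).mean()
    return center, size

def alignment_params(lm_in, lm_ref, pairs, on_axis):
    """Return rotation angle theta, scale k, and both face centers."""
    lm_in, lm_ref = np.asarray(lm_in, float), np.asarray(lm_ref, float)
    o_s, u_s = central_axis(lm_in, pairs, on_axis)
    o_r, u_r = central_axis(lm_ref, pairs, on_axis)
    c_s, d_s = face_center_and_size(lm_in, o_s, u_s)
    c_r, d_r = face_center_and_size(lm_ref, o_r, u_r)
    # Signed angle from the input axis to the reference axis.
    theta = np.arctan2(u_s[0] * u_r[1] - u_s[1] * u_r[0], u_s @ u_r)
    return theta, d_r / d_s, c_s, c_r

def align_landmarks(lm_in, theta, k, c_s):
    """Rotate by theta and scale by k about the input face center (assumed pivot)."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s], [s, c]])
    return (np.asarray(lm_in, float) - c_s) @ (k * rot).T + c_s
```

The same rotation and scaling would be applied to the image itself with any standard similarity warp; only the landmark bookkeeping is shown here.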
Face Warping. To preserve the shape of the swapped face, we warp the reference image to fit the aligned input face before face replacement. The warping is based on the alignment of facial landmarks. We pick 18 of the 68 facial landmarks, with indices $\Phi = \{1, 2, \ldots, 17, 34\}$ (see Figure 3(a)), which are considered to have a significant impact on facial shape. The corresponding landmarks of the reference face are denoted by $p_i^r$, $i \in \Phi$. The new locations $p_i^w$ of most of these landmarks (all except $p_1^r$, $p_{17}^r$, and $p_{34}^r$) after the image warping should be perfectly aligned with the input face, so we have

$$p_i^w = \begin{cases} p_i^a, & i \in \Phi \setminus \{1, 17, 34\}, \\ p_i^r, & i \in \{1, 17, 34\}. \end{cases} \tag{4}$$

Figure 3(b) illustrates the original landmarks (red points) and their new locations (green points). To realize the image warping, the reference image $I_{ref}$ is first decomposed into triangular pieces based on the landmarks. The triangulation is required to minimize the change of the image background, because we generally want to preserve parts of the background such as hair, body, and dress (which is why $p_1^r$, $p_{17}^r$, and $p_{34}^r$ are not moved). The final layout of the triangulation is shown in Figure 3(a). The image warping is then realized by applying a specific affine transformation to each triangle.
Suppose that there is a triangle whose three vertices are $p_i$, $p_j$, and $p_k$, and the corresponding locations after the warping are denoted by $p_i^w$, $p_j^w$, and $p_k^w$, respectively. Then the affine transformation of the triangle can be described as

$$\begin{bmatrix} p_i^w & p_j^w & p_k^w \end{bmatrix} = T \begin{bmatrix} p_i & p_j & p_k \\ 1 & 1 & 1 \end{bmatrix}. \tag{5}$$

Here $T \in \mathbb{R}^{2 \times 3}$ is a transform matrix whose closed-form solution is

$$T = P^w \hat{P}^{-1}, \quad P^w = \begin{bmatrix} p_i^w & p_j^w & p_k^w \end{bmatrix}, \quad \hat{P} = \begin{bmatrix} p_i & p_j & p_k \\ 1 & 1 & 1 \end{bmatrix}. \tag{6}$$

With the matrix $T$, each pixel inside the triangle can be transformed from its original location $q$ to the new location $q^w$. Since the coordinates of the new locations are not integers in general, bilinear interpolation is used to generate the final warped reference image $I_{ref}^w$. Figure 3(c) shows that the warped reference face has exactly the same shape as the aligned input face.

Face Replacement. Some existing methods [2,16] use a convex hull of the facial landmarks as the face-ROI, which can cover distracting areas such as hair, hat, forehead, and neck. Therefore, in our model, a higher-precision face parsing algorithm based on deep learning [17] is introduced to extract a more accurate face-ROI. Unlike the multiclass parsing network in [17], we use only two class labels, face and nonface, to train the parsing neural network on the Helen dataset [18], which contains 2330 face images with pixel-level ground truth. Because our method requires face parsing rather than skin segmentation, a contour detector is used. We initialize the encoder with a pretrained VGG-16 net and the decoder with random values. During training, we fix the encoder parameters and only optimize the decoder parameters. The architecture of the network is shown in Figure 4. We set the learning rate to 0.0001 and train the network for 30 epochs, with all the training images processed in each epoch. The face-ROIs are then generated by the retrained face parsing network.
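Returning to the per-triangle warp of the Face Warping step, the following minimal NumPy sketch solves the 2×3 affine matrix from three vertex correspondences and samples a grayscale image bilinearly. The function names are hypothetical and the code is an illustration under these assumptions, not the authors' implementation.

```python
import numpy as np

def triangle_affine(tri, tri_warped):
    """Solve the 2x3 matrix T that maps the three vertices of `tri`
    (in homogeneous form) onto the vertices of `tri_warped`."""
    p_hat = np.vstack([np.asarray(tri, float).T, np.ones(3)])  # 3x3, columns [x, y, 1]
    p_w = np.asarray(tri_warped, float).T                      # 2x3, warped vertices
    return p_w @ np.linalg.inv(p_hat)

def apply_affine(T, points):
    """Map an (N, 2) array of points through the affine matrix T."""
    pts = np.asarray(points, float)
    return pts @ T[:, :2].T + T[:, 2]

def bilinear_sample(image, x, y):
    """Bilinearly interpolate a grayscale image at a real-valued location (x, y)."""
    h, w = image.shape
    x = min(max(float(x), 0.0), w - 1.0)  # clamp to the image domain
    y = min(max(float(y), 0.0), h - 1.0)
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * image[y0, x0] + fx * image[y0, x1]
    bottom = (1 - fx) * image[y1, x0] + fx * image[y1, x1]
    return (1 - fy) * top + fy * bottom
```

A degenerate triangle with collinear vertices makes the homogeneous vertex matrix singular, so a practical triangulation has to avoid such pieces.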
It can be seen from Figure 5 that the retrained face parsing network can precisely extract face-ROIs in most cases. However, some parts of the boundary of the face-ROI $R_{ref}^w$ obtained by the algorithm are too close to the edge of the hair. If we directly replace the warped reference face-ROI $R_{ref}^w$ with the aligned input face-ROI $R_{in}^a$, the boundary between the replaced face-ROI and the original background of the warped reference image becomes very sharp, which disturbs the boundary smoothing and color correction in the next step. To solve this problem, a conditional image erosion algorithm is applied, which can be written in the form

$$\hat{R}_{ref}^w(x, y) = \begin{cases} (R_{ref}^w \ominus B)(x, y), & (x, y) \in \Omega_{edg}, \\ R_{ref}^w(x, y), & \text{otherwise}, \end{cases} \tag{7}$$

where $B$ is a 5 × 5 square structure element and $\Omega_{edg}$ is the set of pixels that are too close to the boundary between face and background. The exact area of $\Omega_{edg}$ can be obtained by

$$\Omega_{edg} = \{ (x, y) \mid \| \nabla I_{ref}^w(x, y) \| > t \} \tag{8}$$

where $\| \nabla I_{ref}^w(x, y) \|$ is the gradient amplitude of the pixel $(x, y)$ in the warped reference image and $t$ is a threshold of gradient amplitude. Equations (7) and (8) together shrink the face-ROI only where its boundary lies too close to strong edges such as the hairline.

Figure 4: Architecture of the face parsing network.

Figure 6(b) shows that the color of the replaced face is slightly different from that of the neck, ears, forehead, and other residual parts of the reference face. To eliminate the boundary effect and make the composite face more realistic, we use generic interpolation machinery based on solving Poisson equations [19,20] to correct the color of the replaced face and smooth the boundary. The main idea of the color correction algorithm is to minimize the color difference between the original pixels and the replaced ones while preserving the gradient of the pixels at the boundary. It can be seen from Figure 6(c) that this method achieves a seamless clone of human face colors, making the image more realistic and natural.
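The idea behind the Poisson-based color correction can be sketched on a single grayscale channel: inside the replaced region the result keeps the Laplacian of the pasted face, while on the region boundary it takes the values of the background. The sketch below uses plain Jacobi iteration for clarity; the function name, the iteration count, and the solver choice are assumptions and do not describe the solver used in [19,20] or in the authors' code.

```python
import numpy as np

def poisson_blend(src, dst, mask, iters=500):
    """Blend `src` into `dst` inside the boolean `mask` (single channel).

    Solves the discrete Poisson equation with the Laplacian of `src` as the
    guidance term and `dst` as the boundary condition, by Jacobi iteration.
    """
    src = np.asarray(src, float)
    dst = np.asarray(dst, float)
    inner = np.asarray(mask, bool).copy()
    # Pixels on the image border have no full 4-neighbourhood; keep them fixed.
    inner[0, :] = inner[-1, :] = False
    inner[:, 0] = inner[:, -1] = False

    # Guidance term: 4 * g_p minus the sum of the four neighbours of g_p.
    lap = np.zeros_like(src)
    lap[1:-1, 1:-1] = (4.0 * src[1:-1, 1:-1]
                       - src[:-2, 1:-1] - src[2:, 1:-1]
                       - src[1:-1, :-2] - src[1:-1, 2:])

    out = dst.copy()
    out[inner] = src[inner]  # start from the naive paste
    for _ in range(iters):
        nb = np.zeros_like(out)
        nb[1:-1, 1:-1] = (out[:-2, 1:-1] + out[2:, 1:-1]
                          + out[1:-1, :-2] + out[1:-1, 2:])
        # Jacobi update: f_p = (sum of neighbours + guidance term) / 4.
        out[inner] = (nb[inner] + lap[inner]) / 4.0
    return out
```

For a color image the same routine would be run on each channel. A pasted face that differs from the background only by a constant color offset is pulled back to the background colors, which is the behavior the color correction step relies on.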

Experiment
In the experimental part, we verify the performance of our model by comparing it with three popular face swapping algorithms [2,7,16] and analyze the experimental results from qualitative and quantitative aspects to explain the effectiveness of our method.
In the qualitative experiment, we use the photos of several public figures as input images and reference images, as shown in Figure 7 [7]. The five reference faces have different postures, genders, skin tones, face shapes, and hairstyles, while the input faces include a male and a female face. The corresponding visual results obtained by the competing algorithms and our model are shown in Figure 8, where the five reference faces are replaced with the two input faces, respectively. Figures 8(a) and 8(b) give the face swapping results for the male input face and the female input face, respectively, where each column corresponds to a reference face and each row corresponds to a face swapping algorithm. On the whole, the face swapping results of the learning-based algorithm [7] (denoted as L1, hereafter the L-model) are more natural and realistic and adapt well to different perspectives, hairstyles, and skin colors. However, as mentioned above, this method is only valid for identities with a large number of training images, and each face identity requires its own generative neural network. In other words, this method cannot be applied to new face images that it was not trained on, which seriously restricts its application. Our model also produces realistic face swapping results, and it is only slightly worse than the L-model in dealing with perspective variance. In addition, because our method introduces precise face parsing and adaptive color correction, it can effectively handle skin color differences and hair occlusion. The other two replacement-based algorithms (denoted as R1 [16] and R2 [2]) adopt only a relatively simple algorithm flow, so they cannot remove the boundary effects completely.
Many specific applications require that the facial features and face shape be preserved in the face swapping, so a quantitative experiment is designed to verify the similarity between the face swapping result and the input face. Table 1 gives the similarity measures between the input face and the swapping results shown in Figure 8, which are obtained by a CNN-based model [21]. Higher scores in Table 1 represent higher similarity. According to Table 1, our model is clearly better than the L-model and the R2-model, while it is slightly lower than the R1-model in some cases. This is because the L-model produces the composite face with a slight modification of the facial features to ensure the realism of the swapping results. The R2-model does not involve the color correction step and thus has a lower score. The R1-model has the highest similarity because it retains the complete face region at the expense of partially reducing the realism of the boundaries between forehead and hair. In contrast, our model preserves almost all of the facial features of the original input face while maintaining the realism of the swapped face, which leads to a better balance between similarity and realism.
Figure 9: Failure examples of our method.

Table 2 compares the computational complexity and the execution time of our method and the other methods. All the algorithms are implemented in Python and tested on a single core of an i7-8700K CPU. Table 3 gives the Mean Opinion Scores (MOS) of the results shown in Figure 8. The MOS is the arithmetic mean over all individual values on a predefined rating scale that subjects assign to express their opinion of the face swapping quality. It is expressed as a single rational number, typically in the range 1-5, where 1 is the lowest perceived quality and 5 is the highest perceived quality. In this paper, the MOS are obtained from 100 students at Northeastern University in China according to the rating scale defined in Table 4.

Figure 9 shows some failure examples of our algorithm, where the middle images are the face swapping results. The main reason for the failure shown in Figure 9(a) is that our algorithm operates on two-dimensional digital images in order to reduce time complexity, whereas faces are actually three-dimensional. When the difference in face posture between the input image and the reference image is too large, the result of the face swapping is unsatisfactory. Figure 9(b) shows the failure result when the face in the input image is occluded by glasses. This is mainly because the face parsing algorithm only extracts the facial features. Figure 10 shows some other results of our method; all the images used are available in the public domain.
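As a small illustration of the MOS definition used for Table 3, the score is simply the arithmetic mean of the individual ratings on the 1-5 scale. The function name and the example ratings below are hypothetical and are not taken from the study.

```python
from statistics import mean

def mean_opinion_score(ratings, low=1, high=5):
    """Arithmetic mean of individual ratings, each required to lie in [low, high]."""
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not low <= r <= high:
            raise ValueError(f"rating {r} is outside the scale {low}-{high}")
    return mean(ratings)

# Hypothetical ratings from five subjects for one swapped image.
example_mos = mean_opinion_score([5, 4, 4, 3, 4])
```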

Conclusion
In this paper, a new face swapping algorithm based on facial landmark detection is proposed, which can achieve fast, stable, and robust face replacement without a three-dimensional model. Our approach directly reuses the training results of existing learning models. The proposed method does not require any additional training data, because it does not train a new learning model. The experimental results show that the composite image obtained by our model is highly realistic and adapts well to differences in skin color and to hair occlusion, while retaining most of the facial features of the input image. Compared with other algorithms, our model has some advantages in visual realism, time complexity, and data requirements. However, there is still room for improvement: the swapping result is not satisfactory when the input face and the reference face differ significantly in perspective and posture. Predicting and generating the face image from one given perspective to another is essential to solving this problem and is the main direction of our future research.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.

Additional Points
Highlights. (i) Face alignment ensures that the input face is consistent with the size and angle of the reference face. (ii) Reference image is deformed so that the face shape of face swapping result is consistent with the input face. (iii) A face parsing algorithm was introduced to achieve better visual effects. (iv) Poisson image editing algorithm improves the realism of the results.

Conflicts of Interest
The authors declare that they have no conflicts of interest to this work.