Article

Enriching Facial Anti-Spoofing Datasets via an Effective Face Swapping Framework

School of Electrical and Information Engineering, Tianjin University, Tianjin 300072, China
* Author to whom correspondence should be addressed.
Sensors 2022, 22(13), 4697; https://doi.org/10.3390/s22134697
Submission received: 4 June 2022 / Revised: 20 June 2022 / Accepted: 20 June 2022 / Published: 22 June 2022
(This article belongs to the Section Internet of Things)

Abstract

In the era of rapidly developing Internet of Things, deep learning, and communication technologies, social media has become an indispensable part of daily life. However, alongside the convenience brought by technological innovation, people also face its negative effects. Taking users’ portraits in multimedia systems as an example, as deep facial forgery technologies mature, personal portraits are exposed to malicious tampering and forgery, which poses a potential threat to personal privacy and to society. Current deep forgery detection methods are learning-based and therefore depend heavily on data, and enriching facial anti-spoofing datasets is an effective way to alleviate this problem. We therefore propose an effective face swapping framework based on StyleGAN. We utilize a feature pyramid network to extract facial features and map them to the latent space of StyleGAN. To realize the identity transformation, we explore the representation of identity information and propose an adaptive identity editing module. We further design a simple and effective post-processing procedure to improve the authenticity of the generated images. Experiments show that the proposed method effectively performs face swapping and provides high-quality data for deep forgery detection, helping to ensure the security of multimedia systems.

1. Introduction

With the development of Internet of Things technologies [1,2], deep learning [3,4,5], and communication technologies [6,7], wearable devices and intelligent terminals have developed rapidly and brought unprecedented changes to many fields. Intelligent medical systems, intelligent industrial systems, and smart home systems are changing people’s lifestyles, and they also accelerate data collection, processing, and transmission [8]. However, in the era of big data, malicious tampering and forgery of data are inevitable, which poses a potential threat to personal privacy and public opinion [9]. In particular, as deep learning technologies have matured, various forgery technologies have achieved realistic results. Taking personal portraits as an example, face forgeries can no longer be detected by the human eye, as shown in Figure 1. Many countries and regions have actively begun to improve the relevant regulations and laws. Portrait data, especially in digital medical systems, are extremely sensitive personal data, and improving their privacy and security has become a hot topic.
With the development of deep learning technologies [10,11,12], great progress has been made in image understanding [13,14], image generation [15,16], and image classification [17]. Alongside the development of deep face forgery technologies [18,19], researchers have also been developing deep forgery detection technologies [20,21]. However, most current deep forgery detection technologies are learning-based methods that rely heavily on data: diverse and sufficient data is one of the necessary means of improving their generalization and accuracy. Because human eyes are sensitive to familiar structures, generating high-quality face images that meet the requirements of these downstream tasks is very challenging.
In order to improve portrait security in social media, we propose an effective face swapping framework to enrich facial anti-spoofing datasets. First, we utilize a feature pyramid network to extract the latent codes of the images. Second, an identity weight generation module is used to increase the attention paid to the identity information in the latent codes. Third, we propose an adaptive identity editing module based on the AdaIN [22] mechanism, which realizes the identity transformation. Fourth, the edited latent codes are decoded into high-quality face images by the StyleGAN [15] generator. Finally, to improve the overall authenticity of the images, we propose a post-processing method. In this work, our main contributions are as follows:
  • We propose an effective face swapping framework, which can generate photorealistic results.
  • We use the identity weight generation module to increase the attention to the identity information.
  • We propose an adaptive identity editing module to realize the identity transformation.
  • We utilize post-processing to improve the authenticity, and experiments verify the efficiency of our face swapping framework.
The rest of this paper is arranged as follows: we introduce the related works of this paper in Section 2. In Section 3, we introduce the method in detail, including the feature pyramid network, mapping network, identity weight generation module, adaptive identity editing module, generator, post-processing process, and loss functions. In Section 4, we show the results of the proposed face swapping framework, and design ablation experiments and comparative experiments to verify the efficiency of our proposed method. Finally, in Section 5, we elaborate on our conclusions and the focus of future work.

2. Related Works

In this section, we introduce the related works of this paper from four aspects. In Section 2.1, we introduce the foundation of this work, the generative adversarial network. In Section 2.2, we introduce deep face forgery detection methods and the composition of existing face forgery datasets. In Section 2.3, we briefly introduce latent code editing in the latent space. In Section 2.4, we review related face forgery methods.

2.1. Generative Adversarial Network

Goodfellow et al. [3] proposed the generative adversarial network (GAN) in 2014. A GAN consists of a generator and a discriminator. The generator’s task is to generate sufficiently realistic images, while the discriminator’s task is to identify whether an input image is real (i.e., from the dataset) or fake (i.e., generated by the generator). The generator and the discriminator compete during training until, ideally, the discriminator can no longer distinguish real images from generated ones. This adversarial structure and training scheme encourage realistic generation, so GANs are widely used in image generation [15], image editing [18,23], and other fields [24,25].

2.2. Deep Face Forgery Detection Methods and Datasets

Since deepfake appeared in 2017, face forgery videos and images based on deep learning have proliferated on the Internet. Relevant statistics show that more than 96% of face forgery videos and images are illegal, which poses a potential threat to personal privacy and public opinion. At the same time, deep forgery detection technologies are also developing. Tariq et al. [26] proposed a forged-image forensics platform, FakeFaceDetect, which uses a dual-stream Faster R-CNN to capture high-level and low-level information in the images and thereby detect various kinds of forged facial images. Yang et al. [27] were the first to find subtle texture differences in image saliency between real and forged images; they used a guided filter with the saliency map as the guidance map to reveal the latent characteristics of forged facial images. Yu et al. [28] found that different generative adversarial networks leave specific fingerprints in the images they generate, and they used this feature to distinguish real images from forged ones. While exploring how CNNs distinguish real from fake images, Yang et al. [21] found that multi-scale texture information plays an important role in deep forgery detection; based on this observation, they proposed MTD-Net, which uses central difference convolution and atrous spatial pyramid pooling. Qian et al. [29] introduced the frequency domain into forgery detection, finding that it provides a complementary viewpoint on forgery artifacts and better describes compression errors, and proposed a novel frequency-aware face forgery detection network.
At present, deep forgery detection methods are mostly learning-based and rely heavily on data. Commonly used face forgery datasets include FaceForensics++, Celeb-DF [30], DeeperForensics [31], and DFDC [32]. FaceForensics++ includes fake and real face videos; the fake videos were generated by four face forgery methods: Deepfakes, FaceSwap, Face2Face, and NeuralTextures. FaceForensics++ divides the forged videos into three quality levels, from high to low: raw, c23, and c40. Celeb-DF contains 5639 high-quality deepfake videos and 590 real videos, corresponding to more than 2 million frames; the real videos were collected from YouTube and include 59 subjects who are diverse in gender, age, posture, and so on. DeeperForensics collected 100 face images and 1000 public YouTube videos; swapping each face onto its 10 corresponding videos yielded 1000 forged videos. DFDC is a large-scale face forgery dataset published by Facebook and Microsoft that contains 100,000 forged video clips, obtained with deepfake, GAN-based, and non-learned methods.

2.3. Latent Code Editing

Since the introduction of convolutional neural networks, researchers have found that the shallow features extracted by convolutional networks carry low-level semantic information, such as structure and color, while the deep features carry high-level semantic information, such as identity. In addition, researchers found that editing images through their latent codes can achieve better results. After Goodfellow et al. [3] proposed the generative adversarial network, latent space editing based on GANs made a great leap forward. Härkönen et al. [33] found latent directions for attributes in the latent space based on principal component analysis. Shen et al. [34] first showed that generative adversarial networks learn semantics in linear subspaces of the latent space. Since StyleGAN was proposed, a large amount of latent space editing work has been carried out in its latent space, owing to its powerful generation ability and its feature disentanglement. Denton et al. [35] were the first to use a fully supervised method to edit latent codes in StyleGAN and thereby change attributes. Patashnik et al. [36] used prior knowledge from natural language processing to find the relationship between features in the latent space and text information.

2.4. Face Forgery Technologies

According to the effect produced, face forgery technologies can be divided into three categories: face reenactment [37,38], face swapping [17,18], and face local editing [39]. Face reenactment changes only the expression and posture attributes of a facial image while keeping the other attributes unchanged; the effect is shown in the first row of Figure 1. Face swapping changes only the identity information of a facial image while keeping the other attributes unchanged; the effect is shown in the second row of Figure 1. Face local editing changes only a partial region of the facial image while keeping the other regions unchanged, such as changing the hairstyle, hair color, or eye shape.
Early face swapping works performed face cutting and fusion on the basis of similar poses. Such methods have great limitations and cannot complete face swapping when the skin colors or poses differ greatly. With the introduction of convolutional neural networks and generative adversarial networks, these problems have gradually been solved through the extraction and fusion of high-level and low-level features. Güera et al. [40] used a CNN to extract frame-level features and an RNN to judge whether a video had been manipulated. Natsume et al. [41] proposed RSGAN, which processes the face region and hair region independently in the latent space and realizes face swapping by replacing the latent representation of the face region. Nirkin et al. [42] introduced a continuous interpolation of face viewpoints based on reenactment, Delaunay triangulation, and barycentric coordinates, and realized subject-agnostic face swapping. Li et al. [19] proposed a dual-stream network that provides attribute information and identity information separately; through the proposed fusion model, high-fidelity results are achieved. Zhu et al. [43] realized one-shot face swapping with a face transfer module that leverages the powerful generation ability of StyleGAN.

3. Method

In this section, we describe the proposed method in detail. The section is divided into five parts. In Section 3.1, we explore the position of identity information in the latent codes, which informs our subsequent decisions on how to edit them. In Section 3.2, we briefly introduce the encoder, mapping network, and generator. In Section 3.3, we focus on the identity editing module, which is the key to achieving an accurate identity transformation. In Section 3.4, we introduce the proposed post-processing procedure, which improves the authenticity of the generated images. In Section 3.5, we describe the loss functions.
The overall structure of the network is shown in Figure 2. The network is composed of a feature pyramid network with a ResNet backbone, a mapping network, an identity weight generation module, an adaptive identity editing module, a generator, and a post-processing process. For convenience, we unify the notation here: $I_{id}$ denotes the image that provides the identity information, and $I_{att}$ denotes the image that provides all attributes other than identity. The output of the StyleGAN generator is denoted by $I_{result}$, and the final result after post-processing is denoted by $I_{final}$.

3.1. Exploration and Motivation

It is well known that the latent codes of an image contain all of its features. These features can be roughly divided into two categories: fine-grained features, such as identity, expression, and posture information, and coarse-grained features, such as color and structure information. In order to design the subsequent adaptive identity editing module, we first explore the location of identity information in the latent codes. As shown in Figure 2, we use an encoder, mapping network, identity weight generation module, and generator consistent with the structure used in the face swapping process.
In the latent space, the specific locations of these features are uncertain, and the features are entangled with each other, which causes considerable confusion. In order to edit only the features we want to edit and to prevent interference with other features as much as possible, we need to extract the region to be edited. At this stage, research on the structure of the latent space is still limited, so the specific locations of the corresponding features cannot be determined accurately. Therefore, we propose an identity weight generation module, which adaptively finds the location of the region to be edited through training on a large amount of data and updating its weights.
Different from the face swapping process, the function of the weight generation module here is to produce a weight map with values between 0 and 1, which is used to visualize the position of identity information in the latent codes. The loss functions in this process are consistent with those in the face swapping process, but the accuracy of the identity transformation is not required (the loss functions are introduced in detail in Section 3.5). In the identity transformation stage of this exploration, the obtained mask is multiplied by the latent codes mapped into StyleGAN’s latent space to obtain the region of interest, and a simple feature fusion is realized by element-wise addition. Finally, the fused latent codes are input into the generator to obtain a result in which only the identity information is changed.
When the network training converges, we visualize the output value of the identity weight generation module to obtain the heat map shown in Figure 3.
From Figure 3, we can see that the distribution of identity information in the latent codes does not follow a clear regularity, and the amount of identity information in each latent vector also differs. We believe the reason for this phenomenon is that the images are in the wild, unlike images generated by a neural network: each feature is entangled with the others, resulting in a global distribution of identity information across the latent codes.
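For reference, a 14 × 512 weight map output by the identity weight generation module can be rendered as a heat map like Figure 3 with a few lines of matplotlib. This is only an illustrative sketch; the variable below is a random placeholder standing in for the module’s actual output.

```python
import matplotlib.pyplot as plt
import torch

weight_map = torch.rand(14, 512)  # placeholder for the IWG output (values in [0, 1])

plt.figure(figsize=(10, 3))
plt.imshow(weight_map.detach().cpu().numpy(), aspect="auto", cmap="viridis", vmin=0.0, vmax=1.0)
plt.colorbar(label="identity weight")
plt.xlabel("latent dimension")
plt.ylabel("latent vector index")
plt.title("Identity weight map (one row per w+ latent vector)")
plt.show()
```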

3.2. Encoder, Mapping Network, and Generator

In order to extract the latent features of the image more comprehensively, we use a feature pyramid structure [44] with a ResNet [45] backbone as the encoder. To meet the requirements of StyleGAN for the chosen input resolution (256 × 256), we set the dimension of the encoder output to 14 × 512. The latent space of the codes output by the encoder and the w+ latent space of StyleGAN are two different spaces, so we set up a mapping network to map the encoder’s latent codes into the latent space of StyleGAN. The mapping network consists mainly of linear layers, and each latent vector corresponds to one mapping sub-network; that is, the mapping network is composed of 14 mapping sub-networks. We use the StyleGAN generator as the generator of our framework; with its powerful generation ability and face generation prior, we can obtain high-quality results.
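To make the dimensions concrete, the sketch below encodes a 256 × 256 image into 14 latent vectors of size 512 and passes each one through its own mapping sub-network. A plain ResNet-50 head stands in for the ResNet/FPN encoder, and the layer choices are assumptions for illustration only, not the authors’ exact configuration.

```python
import torch
import torch.nn as nn
import torchvision

class PerVectorMapper(nn.Module):
    """One mapping sub-network: maps a single 512-d latent vector into StyleGAN's w+ space."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.LeakyReLU(0.2),
            nn.Linear(dim, dim),
        )
    def forward(self, v):
        return self.net(v)

class Encoder(nn.Module):
    """Toy stand-in for the ResNet/FPN encoder: outputs 14 latent vectors of size 512."""
    def __init__(self, n_latents=14, dim=512):
        super().__init__()
        backbone = torchvision.models.resnet50(weights=None)
        backbone.fc = nn.Linear(backbone.fc.in_features, n_latents * dim)
        self.backbone = backbone
        self.n_latents, self.dim = n_latents, dim
    def forward(self, x):                      # x: (B, 3, 256, 256)
        flat = self.backbone(x)                # (B, 14 * 512)
        return flat.view(-1, self.n_latents, self.dim)

encoder = Encoder()
mappers = nn.ModuleList([PerVectorMapper() for _ in range(14)])

img = torch.randn(1, 3, 256, 256)
latents = encoder(img)                         # (1, 14, 512)
w_plus = torch.stack([mappers[i](latents[:, i]) for i in range(14)], dim=1)
print(w_plus.shape)                            # torch.Size([1, 14, 512])
```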

3.3. Identity Editing Module

We divide the latent code editing stage into two parts: identity weight generation and adaptive identity editing.

3.3.1. Identity Weight Generation

The weight generation module is composed of multiple 1 × 1 convolutions and a sigmoid activation function; its function is to impose an identity weight on the original latent code. For the latent code of $I_{id}$, it pays more attention to the region representing identity information, and for the latent code of $I_{att}$, it pays more attention to the features other than identity. The network contains two identity weight generation modules with the same structure, responsible for extracting the regions of interest of the identity information and the attribute information respectively. The process is as follows:
$mask_{id} = \mathrm{IWG}(L_{id})$
$mask_{att} = \mathrm{IWG}(L_{att})$
$RoI_{id}^{id} = mask_{id} \times L_{id}$
$RoI_{att}^{att} = mask_{att} \times L_{att}$
where $\mathrm{IWG}$ denotes the identity weight generation module, and $L_{id}$ and $L_{att}$ denote the latent codes of $I_{id}$ and $I_{att}$ respectively. $RoI_{id}^{id}$ denotes the latent code after enhancing the identity information in the latent code of $I_{id}$, and $RoI_{att}^{att}$ denotes the latent code after enhancing the attribute information in the latent code of $I_{att}$.
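A minimal PyTorch sketch of this step is shown below; the hidden channel width and the exact layout of the 1 × 1 convolutions over the 14 × 512 code are our assumptions, made only to illustrate the mask-and-multiply pattern of the equations above.

```python
import torch
import torch.nn as nn

class IdentityWeightGeneration(nn.Module):
    """IWG sketch: predicts a per-element weight in (0, 1) for a (14, 512) latent code."""
    def __init__(self, hidden=32):
        super().__init__()
        # 1x1 convolutions over the (14, 512) code treated as a 1-channel map
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden, kernel_size=1), nn.LeakyReLU(0.2),
            nn.Conv2d(hidden, 1, kernel_size=1), nn.Sigmoid(),
        )
    def forward(self, latent):                                # latent: (B, 14, 512)
        return self.net(latent.unsqueeze(1)).squeeze(1)       # weight map in (0, 1)

iwg_id, iwg_att = IdentityWeightGeneration(), IdentityWeightGeneration()
L_id, L_att = torch.randn(1, 14, 512), torch.randn(1, 14, 512)
RoI_id = iwg_id(L_id) * L_id                                  # mask_id  x L_id
RoI_att = iwg_att(L_att) * L_att                              # mask_att x L_att
```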

3.3.2. Adaptive Identity Editing

The adaptive identity editing module is the key to realizing the identity transformation in this paper. The module uses the identity information extracted by an existing face recognition network to change the corresponding information in the latent codes, i.e., to fix the direction in which the latent codes are edited. The adaptive mask in this module further extracts the feature regions that need to be edited and reduces the degradation of the overall effect caused by changes to other latent features.
The structure of the adaptive identity editing module is shown in Figure 4. The structure is based on the AdaIN [22] mechanism and the adaptive blending mechanism.
The main structure of the adaptive identity editing module is divided into three parallel channels, and the output result is the increment of identity transformation. The process is as follows:
Step 1: Input $RoI_{id}^{id}$ into the intermediate channel, that is, the adaptive mask generation path. The adaptive mask generation module is composed of two 1 × 1 convolutions and a sigmoid activation function. The process is as follows:
$mask = \mathrm{AMG}(RoI_{id}^{id})$
where $\mathrm{AMG}$ is the adaptive mask generation module.
Step 2: Input $RoI_{id}^{id}$ into the upper path, which edits the identity information in the latent code. The identity editing path is composed of four identity editing blocks based on AdaIN. Because the AdaIN mechanism has no learnable parameters, we design a residual structure for learning the identity editing. The process is as follows:
$L_{edit}^{id} = \mathrm{IEB}(RoI_{id}^{id})$
$L_{out} = \frac{L_{in} - \mathrm{mean}(L_{in})}{\mathrm{std}(L_{in}) + eps} \times \mathrm{std}(v_{id}) + \mathrm{mean}(v_{id})$
where $\mathrm{IEB}$ denotes the identity editing blocks, $L_{in}$ and $L_{out}$ are the input and output of the AdaIN layer respectively, and $v_{id}$ is the output of the face recognition network. $\mathrm{mean}(\cdot)$ computes the mean, $\mathrm{std}(\cdot)$ computes the standard deviation, and $eps = 1 \times 10^{-5}$ prevents the denominator from being zero.
Step 3: Use the adaptive mask obtained in Step 1 to eliminate irrelevant features, so that only the increment of the identity transformation is retained. The process is as follows:
$\Delta_{id} = mask \odot L_{edit}^{id} \oplus (1 - mask) \odot RoI_{id}^{id}$
where $\odot$ denotes element-wise multiplication and $\oplus$ denotes element-wise addition.
The reason for the adaptive mask in the adaptive identity editing module is to retain the information related to identity in $L_{edit}^{id}$, such as location information, while eliminating the information irrelevant to identity, such as color information, in $RoI_{id}^{id}$. Finally, the edited latent code is obtained as follows:
$L_{edit}^{all} = RoI_{att}^{att} + \Delta_{id}$
where $L_{edit}^{all}$ is the edited latent code. $L_{edit}^{all}$ is then input into the StyleGAN generator to obtain a high-quality result.
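To make the dataflow of Steps 1–3 concrete, the following PyTorch sketch assembles the module under our assumptions: the AdaIN statistics are taken over the last dimension, the residual branch is a single linear layer, and the masked combination of the edited and original codes is the element-wise blend reconstructed above. It is an illustrative sketch, not the authors’ exact implementation.

```python
import torch
import torch.nn as nn

def adain(latent, v_id, eps=1e-5):
    """AdaIN step: normalize the latent code, then rescale and shift it with the
    standard deviation and mean of the identity embedding v_id."""
    normalized = (latent - latent.mean(dim=-1, keepdim=True)) / (latent.std(dim=-1, keepdim=True) + eps)
    return normalized * v_id.std(dim=-1, keepdim=True) + v_id.mean(dim=-1, keepdim=True)

class IdentityEditingBlock(nn.Module):
    """One IEB sketch: a learnable residual branch wrapped around the parameter-free AdaIN."""
    def __init__(self, dim=512):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
    def forward(self, latent, v_id):                           # latent: (B, 14, 512), v_id: (B, 512)
        return latent + self.fc(adain(latent, v_id.unsqueeze(1)))

class AdaptiveMaskGeneration(nn.Module):
    """AMG sketch: two 1x1 convolutions followed by a sigmoid (Step 1)."""
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, hidden, kernel_size=1), nn.LeakyReLU(0.2),
            nn.Conv2d(hidden, 1, kernel_size=1), nn.Sigmoid(),
        )
    def forward(self, roi):                                    # roi: (B, 14, 512)
        return self.net(roi.unsqueeze(1)).squeeze(1)

class AdaptiveIdentityEditing(nn.Module):
    """Assumed composition of the three paths described in Steps 1-3."""
    def __init__(self, n_blocks=4):
        super().__init__()
        self.amg = AdaptiveMaskGeneration()
        self.iebs = nn.ModuleList([IdentityEditingBlock() for _ in range(n_blocks)])
    def forward(self, roi_id, roi_att, v_id):
        mask = self.amg(roi_id)                                # Step 1: adaptive mask
        edited = roi_id
        for block in self.iebs:                                # Step 2: four AdaIN-based editing blocks
            edited = block(edited, v_id)
        delta_id = mask * edited + (1.0 - mask) * roi_id       # Step 3: identity increment (element-wise blend)
        return roi_att + delta_id                              # L_edit_all, fed to the StyleGAN generator

aie = AdaptiveIdentityEditing()
roi_id, roi_att = torch.randn(1, 14, 512), torch.randn(1, 14, 512)
v_id = torch.randn(1, 512)                                     # identity embedding from the face recognition network
L_edit_all = aie(roi_id, roi_att, v_id)                        # (1, 14, 512)
```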

3.4. Post-Processing

In order to achieve high-resolution face generation, we use the StyleGAN generator as the face generator of the proposed method. However, relying on the face prior has a drawback: the generated images focus on the face region and often neglect the hair and background regions, which greatly reduces their authenticity. To solve this problem, we use an existing face parsing network to obtain a face semantic segmentation mask, use this mask to blend the generated image with the image that provides the other attributes, and finally obtain photo-realistic results. The specific steps are as follows:
Step 1: We use the face parsing network to obtain the face parsing map of $I_{att}$. A binary mask is obtained by classifying the foreground and background of the face parsing map.
Step 2: Because directly combining the two images produces sharp edges, we apply a Gaussian blur to the binary mask to obtain $mask_{gua}$. This forms a transition band at the edge between foreground and background, which eliminates the sharp edges.
Step 3: We use $mask_{gua}$ to combine $I_{att}$ and $I_{result}$ and obtain the final result $I_{final}$:
$I_{final} = I_{att} \odot (1 - mask_{gua}) \oplus I_{result} \odot mask_{gua}$
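A minimal sketch of this blending step is given below, assuming the binary face mask has already been produced by a face parsing network; the Gaussian kernel size is an arbitrary choice for illustration.

```python
import cv2
import numpy as np

def post_process(I_att, I_result, face_mask, blur_ksize=21):
    """Blend the generated face into the attribute image with a blurred parsing mask.

    I_att, I_result : uint8 images of the same size (H, W, 3)
    face_mask       : binary mask (1 = face foreground) from a face parsing network
    """
    mask = face_mask.astype(np.float32)
    # Step 2: Gaussian blur creates a soft transition band at the foreground/background edge
    mask_gua = cv2.GaussianBlur(mask, (blur_ksize, blur_ksize), 0)[..., None]
    # Step 3: I_final = I_att * (1 - mask_gua) + I_result * mask_gua
    blended = I_att.astype(np.float32) * (1.0 - mask_gua) + I_result.astype(np.float32) * mask_gua
    return blended.clip(0, 255).astype(np.uint8)
```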

3.5. Loss Functions

In this work, we use an identity loss $L_{id}$, a self-reconstruction loss $L_{recon}$, and an attribute loss $L_{att}$ as the loss functions. In both the exploration stage and the face swapping stage, we use the same loss functions to train the network.

3.5.1. Identity Loss

In order to ensure that the identity information of $I_{result}$ is consistent with that of $I_{id}$, we set an identity loss. We use a face recognition network as the identity information extractor (in this paper, ArcFace [46]). The identity loss is as follows:
$L_{id} = \cos(\mathrm{extractor}(I_{id}), \mathrm{extractor}(I_{result}))$
where $\cos$ denotes the cosine similarity loss and $\mathrm{extractor}$ denotes the face recognition network.
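A minimal PyTorch sketch of this term is given below, assuming `arcface` is a callable pretrained face recognition network that returns one embedding per image. Following common practice we write the loss as 1 minus the cosine similarity so that minimizing it pulls the two identities together; this phrasing is our assumption about how the cosine term is used during optimization.

```python
import torch
import torch.nn.functional as F

def identity_loss(arcface, I_id, I_result):
    """Cosine-similarity-based identity loss between face recognition embeddings."""
    emb_id = F.normalize(arcface(I_id), dim=-1)
    emb_res = F.normalize(arcface(I_result), dim=-1)
    # 1 - cos(.) decreases as the identities of I_id and I_result become more similar
    return 1.0 - (emb_id * emb_res).sum(dim=-1).mean()
```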

3.5.2. Self-Reconstruction Loss

Because there is no ground truth in the face swapping task, there is no strong constraint during training. In order to enhance the robustness and stability of the network, we introduce a self-reconstruction loss. We perform one self-reconstruction step in every four training steps; during self-reconstruction, $I_{id} = I_{att}$. The self-reconstruction loss is as follows:
$L_{recon} = L_1(I_{id}, \mathrm{net}(I_{id}, I_{id}))$
where $L_1$ is the L1 loss. Through this process, we impose a strong constraint during training, which greatly improves the stability and robustness of network training.
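A sketch of how this schedule could be implemented is shown below, assuming that "one self-reconstruction step in every four training steps" is meant literally and that `net(I_id, I_att)` is the full face swapping network.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(net, step, I_id):
    """Self-reconstruction loss, applied on every fourth training step with I_id = I_att."""
    if step % 4 != 0:
        return torch.zeros((), device=I_id.device)   # no reconstruction constraint this step
    reconstructed = net(I_id, I_id)                  # identity and attribute inputs are the same image
    return F.l1_loss(reconstructed, I_id)            # L_recon = L1(I_id, net(I_id, I_id))
```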

3.5.3. Attribute Loss

In order to ensure that the attributes remain consistent with $I_{att}$ during face swapping, we set an attribute loss. We use a VGG-19 [47] network loaded with pre-trained parameters as the attribute feature extractor, and constrain a subset of its feature maps to ensure the consistency of the attribute information:
$L_{att} = \frac{1}{L} \sum_{i=1}^{L} \left\| F_i(I_{result}) - F_i(I_{att}) \right\|_2$
where $L$ is the number of feature maps and $F_i$ is the i-th feature map extracted by VGG-19.
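A sketch of this term using torchvision's pretrained VGG-19 follows; the specific layer indices tapped for feature maps are our assumption, since the paper does not list them.

```python
import torch
import torch.nn as nn
import torchvision

class AttributeLoss(nn.Module):
    """Averages the L2 distance between selected VGG-19 feature maps of two images."""
    def __init__(self, layer_ids=(3, 8, 17, 26)):   # assumed indices inside vgg19.features
        super().__init__()
        vgg = torchvision.models.vgg19(weights=torchvision.models.VGG19_Weights.IMAGENET1K_V1).features.eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.layer_ids = vgg, set(layer_ids)

    def features(self, x):
        feats = []
        for i, layer in enumerate(self.vgg):
            x = layer(x)
            if i in self.layer_ids:
                feats.append(x)
        return feats

    def forward(self, I_result, I_att):
        f_res, f_att = self.features(I_result), self.features(I_att)
        # (1/L) * sum_i || F_i(I_result) - F_i(I_att) ||_2
        return torch.stack([torch.norm(a - b, p=2) for a, b in zip(f_res, f_att)]).mean()
```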

3.5.4. Objective Function

The overall objective function of the network is as follows:
$L_{net} = \lambda_{id} \times L_{id} + \lambda_{recon} \times L_{recon} + \lambda_{att} \times L_{att}$
where the hyper-parameters are $\lambda_{id} = 3$, $\lambda_{recon} = 5$, and $\lambda_{att} = 3$.
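Combining the three terms with the stated weights is then a one-line sketch that reuses the loss values computed above.

```python
def total_loss(L_id, L_recon, L_att, lambda_id=3.0, lambda_recon=5.0, lambda_att=3.0):
    """L_net = lambda_id * L_id + lambda_recon * L_recon + lambda_att * L_att."""
    return lambda_id * L_id + lambda_recon * L_recon + lambda_att * L_att
```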

4. Results and Experiments

We present this section in five parts: in Section 4.1, we introduce the dataset and experimental settings. In Section 4.2, we briefly introduce the evaluation metrics used for the quantitative comparison. In Section 4.3, we display some generation results of our proposed model. In Section 4.4, we demonstrate the superiority of the adaptive identity editing module. In Section 4.5, we compare our framework qualitatively and quantitatively with other models.

4.1. Dataset and Experimental Setting

4.1.1. Dataset

In this work, we use the CelebA-HQ dataset [48]. It is a large-scale high-definition face dataset with a resolution of 1024 × 1024 and is widely used in face recognition, face segmentation, and other face-related tasks. CelebA-HQ was built on the basis of CelebA [49] and contains 30,000 centered face images obtained through cropping and rotation.

4.1.2. Experimental Setting

The hardware used for the experiments is an Intel(R) Xeon(R) CPU E5-2620 v4 and two NVIDIA GTX Titan XP GPUs. The whole experiment is implemented on the PyTorch platform. We use Adam [50] as the optimizer. The network is trained for 250,000 iterations with a batch size of 1, and model weights and intermediate results are saved at intervals controlled by hyper-parameters.

4.2. Evaluation Metrics

In the quantitative experiment, we take identity similarity, expression similarity, and FID [51] value as quantitative metrics.

4.2.1. Identity Similarity

Similar to the identity loss, the formula for identity similarity is given in Equation (15). The larger the identity similarity, the closer the identity of the generated image is to that of $I_{id}$:
$\mathrm{id\ similarity} = \cos(\mathrm{extractor}(I_{result}), \mathrm{extractor}(I_{id}))$

4.2.2. Expression Similarity

We use a 3D face attribute extraction network to extract the expression information. The expression similarity is computed as in Equation (16); the smaller this value, the closer the expression of the result is to that of $I_{att}$:
$\mathrm{exp\ similarity} = L_1(I_{result}, I_{att})$
where the L1 distance is computed on the extracted expression information.

4.2.3. FID

FID is a metric used to measure the quality and diversity of generated images. The smaller the FID value, the better the quality of the generated images.
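For example, FID between a folder of real images and a folder of generated images can be computed with the community pytorch-fid package (assumed installed via `pip install pytorch-fid`); the folder paths below are placeholders.

```python
from pytorch_fid import fid_score

fid = fid_score.calculate_fid_given_paths(
    ["path/to/real_images", "path/to/generated_images"],  # placeholder folders of images
    batch_size=50, device="cuda", dims=2048,
)
print(f"FID: {fid:.4f}")
```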

4.3. The Generation Results of Our Model

As shown in Figure 5, we display some results of our framework. The proposed framework maintains the expression while transforming the identity and achieves photo-level authenticity.

4.4. Superiority of the Adaptive Identity Editing Module

In order to verify the superiority of the adaptive identity editing module in the identity transformation process, we set up an ablation experiment. We build a comparison model, net-w/o AIE, which has the same encoder, mapping network, identity weight generation module, and decoder as the face swapping network in this paper but does not include the adaptive identity editing module. The results generated by the two networks are shown in Figure 6.

4.4.1. Qualitative Analysis

From the qualitative analysis of the figure, we can see that although the results generated by net-w/o AIE reach photo-realistic authenticity, there are still many defects in terms of overall effect and details. First, in terms of overall effect, the images generated by net-w/o AIE are not as good as those of net in identity transformation, so we can infer that the AdaIN-based identity editing in the adaptive identity editing module is effective. Second, in terms of details, the hair color, skin color, and general tone of the background in the net-w/o AIE results are not consistent with $I_{att}$, so we can infer that the adaptive mask mechanism plays a key role. In summary, the adaptive identity editing module has advantages in both identity transformation and attribute maintenance.

4.4.2. Quantitative Analysis

In order to reflect the superiority of the adaptive identity editing module more clearly, we perform a quantitative analysis. We select 30 pairs of images as the objects of the analysis, calculate the identity similarity and expression similarity of each pair, calculate the FID between the two groups of images, and then average the identity and expression similarities. The results are shown in Table 1.
From the analysis of Table 1, it can be seen that when the adaptive identity editing module is removed, the performance of the generated images in identity similarity and expression similarity decreases. Although net-w/o AIE is slightly better than net in FID, the difference between the two networks is small.

4.5. Comparison with Other Models

We choose FaceSwap, FSGAN [42], SimSwap [52], and Faceshifter [19] as the comparison methods. The results generated by the different models are shown in Figure 7.

4.5.1. Qualitative Analysis

It can be seen from Figure 7 that our proposed method achieves the best overall trade-off between identity transformation and attribute maintenance. In terms of skin color maintenance, some results generated by Faceshifter are better than ours, but they are inferior to our method in identity transformation. We find abnormal facial distortions or irregular color patches in the results of FaceSwap; this is because FaceSwap is built on dlib and OpenCV and cannot fit well between samples with large differences in posture, skin color, and so on. Although FSGAN achieves good identity transformation, its overall image quality is low; when magnified, its results look worse than those of the other methods. In summary, our proposed method alleviates some defects of existing face swapping methods and noticeably improves identity transformation and attribute maintenance.

4.5.2. Quantitative Analysis

In order to clarify the detailed differences between the face swapping methods, we perform a quantitative comparison. We select 30 groups of results generated by the different models, calculate the identity similarity and expression similarity of each pair of images, calculate the FID between the two groups of images, and then average the identity and expression similarities. The results are shown in Table 2.
From the analysis of Table 2, we can see that our proposed method is the best in the two metrics of identity similarity and FID. Although our method is slightly inferior to FSGAN in expression similarity, the gap with Faceshifter is relatively small. On the whole, our proposed method is effective: it alleviates some problems of existing methods and provides more realistic, higher-definition face forgery images.

5. Conclusions and Expectations

In this work, we propose a face swapping framework for enriching facial anti-spoofing datasets. Experiments verify the effectiveness of the overall framework and the superiority of the adaptive identity editing module. The generated results achieve photo-level authenticity and can be used to improve the accuracy of deep forgery detection in social media and to ensure portrait security. In future work, we will focus on producing higher-quality and more detailed face swapping results and on the privacy and security of multimedia systems.

Author Contributions

Conceptualization, J.Y., G.L. and S.X.; methodology, J.Y., G.L. and S.X.; software, J.Y., G.L. and S.X.; validation, J.Y., G.L. and S.X.; formal analysis, G.L. and S.X.; investigation, G.L. and S.X.; resources, G.L. and S.X.; data curation, G.L. and S.X.; writing—original draft preparation, G.L.; writing—review and editing, G.L., S.X. and Y.L.; visualization, G.L. and S.X.; supervision, S.X., Y.L., J.W. and Y.Z.; funding acquisition, J.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This work is supported by the National Natural Science Foundation of China (No. 61871283).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We would like to thank all those who have contributed to the work of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Torabi, S.; Bou-Harb, E.; Assi, C.; Karbab, E.B.; Boukhtouta, A.; Debbabi, M. Inferring and Investigating IoT-Generated Scanning Campaigns Targeting a Large Network Telescope. IEEE Trans. Dependable Secur. Comput. 2022, 19, 402–418. [Google Scholar] [CrossRef]
  2. Yang, J.; Han, Y.; Wang, Y.; Jiang, B.; Lv, Z.; Song, H. Optimization of real-time traffic network assignment based on IoT data using DBN and clustering model in smart city. Future Gener. Comput. Syst. 2020, 108, 976–986. [Google Scholar] [CrossRef]
  3. Goodfellow, I.; Pougetabadie, J.; Mirza, M.; Xu, B.; Wardefarley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the 28th Annual Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 2672–2680. [Google Scholar]
  4. Lecun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436. [Google Scholar] [CrossRef] [PubMed]
  5. Wen, J.; Yang, J.; Li, Y.; Gao, L. Harmful algal bloom warning based on machine learning in maritime site monitoring. Knowl.-Based Syst. 2022, 245, 108569. [Google Scholar] [CrossRef]
  6. Feng, C.; Liu, B.; Yu, K.; Goudos, S.K.; Wan, S. Blockchain-Empowered Decentralized Horizontal Federated Learning for 5G-Enabled UAVs. IEEE Trans. Ind. Inform. 2022, 18, 3582–3592. [Google Scholar] [CrossRef]
  7. Lee, J.; Paek, J.S.; Hong, S. Millimeter-Wave Frequency Reconfigurable Dual-Band CMOS Power Amplifier for 5G Communication Radios. IEEE Trans. Microw. Theory Tech. 2022, 70, 801–812. [Google Scholar] [CrossRef]
  8. Wu, C.J.; Ku, C.F.; Ho, J.M.; Chen, M.S. A Novel Pipeline Approach for Efficient Big Data Broadcasting. IEEE Trans. Knowl. Data Eng. 2016, 28, 17–28. [Google Scholar] [CrossRef]
  9. Karnouskos, S. Artificial Intelligence in Digital Media: The Era of Deepfakes. IEEE Trans. Technol. Soc. 2020, 1, 138–147. [Google Scholar] [CrossRef]
  10. Li, Y.; Chao, X. Distance-Entropy: An effective indicator for selecting informative data. Front. Plant Sci. 2021, 12, 818895. [Google Scholar] [CrossRef]
  11. Li, Y.; Chao, X.; Ercisli, S. Disturbed-Entropy: A simple data quality assessment approach. ICT Express 2022, in press. [CrossRef]
  12. Yang, J.; Zhang, Z.; Gong, Y.; Ma, S.; Guo, X.; Yang, Y.; Xiao, S.; Wen, J.; Li, Y.; Gao, X.; et al. Do Deep Neural Networks Always Perform Better When Eating More Data? arXiv 2022, arXiv:2205.15187. [Google Scholar]
  13. Li, Y.; Yang, J.; Wen, J. Entropy-Based redundancy analysis and information screening. Digit. Commun. Netw. 2021, in press. [CrossRef]
  14. Li, Y.; Chao, X. Semi-supervised few-shot learning approach for plant diseases recognition. Plant Methods 2021, 17, 1–10. [Google Scholar] [CrossRef] [PubMed]
  15. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and Improving the Image Quality of StyleGAN. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 8107–8116. [Google Scholar] [CrossRef]
  16. Schönfeld, E.; Schiele, B.; Khoreva, A. A U-Net Based Discriminator for Generative Adversarial Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 8204–8213. [Google Scholar] [CrossRef]
  17. Li, Y.; Chao, X. Toward sustainability: Trade-off between data quality and quantity in crop pest recognition. Front. Plant Sci. 2021, 12, 811241. [Google Scholar] [CrossRef]
  18. Nirkin, Y.; Keller, Y.; Hassner, T. FSGAN: Subject Agnostic Face Swapping and Reenactment. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 7183–7192. [Google Scholar] [CrossRef] [Green Version]
  19. Li, L.; Bao, J.; Yang, H.; Chen, D.; Wen, F. Advancing High Fidelity Identity Swapping for Forgery Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 5073–5082. [Google Scholar] [CrossRef]
  20. Yang, J.; Xiao, S.; Li, A.; Lu, W.; Gao, X.; Li, Y. MSTA-Net: Forgery Detection by Generating Manipulation Trace Based on Multi-scale Self-texture Attention. In IEEE Transactions on Circuits and Systems for Video Technology; IEEE: Piscataway, NJ, USA, 2021; p. 1. [Google Scholar] [CrossRef]
  21. Yang, J.; Li, A.; Xiao, S.; Lu, W.; Gao, X. MTD-Net: Learning to Detect Deepfakes Images by Multi-Scale Texture Difference. IEEE Trans. Inf. Forensics Secur. 2021, 16, 4234–4245. [Google Scholar] [CrossRef]
  22. Huang, X.; Belongie, S. Arbitrary Style Transfer in Real-Time with Adaptive Instance Normalization. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1510–1519. [Google Scholar] [CrossRef] [Green Version]
  23. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2223–2232. [Google Scholar]
  24. Mogren, O. C-RNN-GAN: Continuous recurrent neural networks with adversarial training. arXiv 2016, arXiv:1611.09904. [Google Scholar]
  25. Antoniou, A.; Storkey, A.; Edwards, H. Data augmentation generative adversarial networks. arXiv 2017, arXiv:1711.04340. [Google Scholar]
  26. Tariq, S.; Lee, S.; Kim, H.; Shin, Y.; Woo, S.S. Gan is a friend or foe? a framework to detect various fake face images. In Proceedings of the 34th ACM/SIGAPP Symposium on Applied Computing, Limassol, Cyprus, 8–12 April 2019; pp. 1296–1303. [Google Scholar]
  27. Yang, J.; Xiao, S.; Li, A.; Lan, G.; Wang, H. Detecting fake images by identifying potential texture difference. Future Gener. Comput. Syst. 2021, 125, 127–135. [Google Scholar] [CrossRef]
  28. Yu, N.; Davis, L.; Fritz, M. Attributing Fake Images to GANs: Learning and Analyzing GAN Fingerprints. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea, 27 October–2 November 2019; pp. 7555–7565. [Google Scholar] [CrossRef] [Green Version]
  29. Qian, Y.; Yin, G.; Sheng, L.; Chen, Z.; Shao, J. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 86–103. [Google Scholar]
  30. Li, Y.; Yang, X.; Sun, P.; Qi, H.; Lyu, S. Celeb-DF: A Large-Scale Challenging Dataset for DeepFake Forensics. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 14–19 June 2020; pp. 3204–3213. [Google Scholar] [CrossRef]
  31. Jiang, L.; Li, R.; Wu, W.; Qian, C.; Loy, C.C. Deeperforensics-1.0: A large-scale dataset for real-world face forgery detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 2889–2898. [Google Scholar]
  32. Dolhansky, B.; Bitton, J.; Pflaum, B.; Lu, J.; Howes, R.; Wang, M.; Ferrer, C.C. The deepfake detection challenge (dfdc) dataset. arXiv 2020, arXiv:2006.07397. [Google Scholar]
  33. Härkönen, E.; Hertzmann, A.; Lehtinen, J.; Paris, S. Ganspace: Discovering interpretable gan controls. Adv. Neural Inf. Process. Syst. 2020, 33, 9841–9850. [Google Scholar]
  34. Shen, Y.; Yang, C.; Tang, X.; Zhou, B. InterFaceGAN: Interpreting the Disentangled Face Representation Learned by GANs. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 2004–2018. [Google Scholar] [CrossRef] [PubMed]
  35. Denton, E.; Hutchinson, B.; Mitchell, M.; Gebru, T. Detecting Bias with Generative Counterfactual Face Attribute Augmentation. 2019. Available online: https://www.arxiv-vanity.com/papers/1906.06439/ (accessed on 3 June 2022).
  36. Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; Lischinski, D. StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, BC, Canada, 11–17 October 2021; pp. 2065–2074. [Google Scholar] [CrossRef]
  37. Doukas, M.C.; Koujan, M.R.; Sharmanska, V.; Roussos, A.; Zafeiriou, S. Head2Head++: Deep Facial Attributes Re-Targeting. IEEE Trans. Biom. Behav. Identity Sci. 2021, 3, 31–43. [Google Scholar] [CrossRef]
  38. Wiles, O.; Koepke, A.; Zisserman, A. X2face: A network for controlling face generation using images, audio, and pose codes. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 670–686. [Google Scholar]
  39. Choi, Y.; Choi, M.; Kim, M.; Ha, J.W.; Kim, S.; Choo, J. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8789–8797. [Google Scholar]
  40. Güera, D.; Delp, E.J. Deepfake Video Detection Using Recurrent Neural Networks. In Proceedings of the 2018 15th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Auckland, New Zealand, 27–30 November 2018; pp. 1–6. [Google Scholar] [CrossRef]
  41. Natsume, R.; Yatagawa, T.; Morishima, S. Rsgan: Face swapping and editing using face and hair representation in latent spaces. arXiv 2018, arXiv:1804.03447. [Google Scholar]
  42. Nirkin, Y.; Hassner, T.; Keller, Y. FSGANv2: Better Subject Agnostic Face Swapping and Reenactment. In IEEE Transactions on Pattern Analysis and Machine Intelligence; IEEE: Piscataway, NJ, USA, 2022; p. 1. [Google Scholar] [CrossRef]
  43. Zhu, Y.; Li, Q.; Wang, J.; Xu, C.; Sun, Z. One Shot Face Swapping on Megapixels. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 19–25 June 2021; pp. 4832–4842. [Google Scholar] [CrossRef]
  44. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  45. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  46. Deng, J.; Guo, J.; Xue, N.; Zafeiriou, S. ArcFace: Additive Angular Margin Loss for Deep Face Recognition. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 4685–4694. [Google Scholar] [CrossRef] [Green Version]
  47. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  48. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. arXiv 2017, arXiv:1710.10196. [Google Scholar]
  49. Liu, Z.; Luo, P.; Wang, X.; Tang, X. Large-scale celebfaces attributes (celeba) dataset. Retrieved August 2018, 15, 11. [Google Scholar]
  50. Kingma, D.P.; Ba, J.L. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  51. Kangjin, W.; Yong, Y.; Ying, L.; Hanmei, L.; Lin, M. FID: A Faster Image Distribution System for Docker Platform. In Proceedings of the 2017 IEEE 2nd International Workshops on Foundations and Applications of Self* Systems (FAS*W), Tucson, AZ, USA, 18–22 September 2017; pp. 191–198. [Google Scholar] [CrossRef]
  52. Chen, R.; Chen, X.; Ni, B.; Ge, Y. SimSwap: An Efficient Framework for High Fidelity Face Swapping. In Proceedings of the 28th ACM International Conference on Multimedia, Seattle, WA, USA, 12–16 October 2020; pp. 2003–2011. [Google Scholar]
Figure 1. Some results of face forgery technologies. The first row shows face reenactment and the second row shows face swapping; the face swapping result is produced by the framework proposed in this paper.
Figure 2. The overall structure of the network, which is composed of a feature pyramid network with ResNet, mapping network, identity weight generation module, adaptive identity editing module, generator, and post-processing process.
Figure 3. The visualization result of the output value of the identity weight generation module.
Figure 4. The structure of the adaptive identity editing module.
Figure 5. Partial results of our framework.
Figure 6. The results of the experiment verifying the superiority of the adaptive identity editing module.
Figure 7. Comparison with other methods. The comparison methods are FaceSwap, FSGAN, SimSwap, and Faceshifter.
Table 1. Quantitative analysis of the superiority of the adaptive identity editing module. Bold represents the optimal value.

Method | Id Similarity ↑ | Exp Similarity ↓ | FID ↓
net-w/o AIE | 0.57 | 0.25 | 58.7864
net | 0.58 | 0.23 | 58.8624
Table 2. Quantitative analysis of the comparative experiment with other methods. Bold represents the optimal value.

Method | Id Similarity ↑ | Exp Similarity ↓ | FID ↓
FaceSwap | 0.37 | 3.32 | 216.78
FSGAN | 0.45 | 1.64 | 67.54
SimSwap | 0.54 | 1.31 | 69.84
Faceshifter | 0.51 | 0.19 | 58.9625
Our method | 0.58 | 0.19 | 58.8624
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.


