Facial UV Map Completion for Pose-invariant Face Recognition: A Novel Adversarial Approach based on Coupled Attention Residual UNets

Pose-invariant face recognition refers to the problem of identifying or verifying a person from face images captured under different poses. The problem is challenging due to large variations in pose, illumination and facial expression. A promising approach to handle pose variation is to complete the incomplete UV maps extracted from in-the-wild faces, attach the completed UV map to a fitted 3D mesh, and finally generate 2D faces of arbitrary poses. The synthesized faces enrich the pose variation for training deep face recognition models and reduce the pose discrepancy during the testing phase. In this paper, we propose a novel generative model called Attention ResCUNet-GAN to improve UV map completion. We enhance the original UV-GAN by coupling two U-Nets. In particular, the skip connections within each U-Net are boosted by attention gates, while the features from the two U-Nets are fused with trainable scalar weights. Experiments on popular benchmarks, including the Multi-PIE, LFW, CPLFW and CFP datasets, show that the proposed method yields superior performance compared to existing methods.


Introduction
Face recognition has gained much attention for decades [1][2][3]. Contrary to other popular biometrics, face recognition can be applied to uncooperative subjects in a non-intrusive manner. While (near-)frontal face recognition has gradually matured, face recognition in the wild is still challenging due to various unconstrained factors. In fact, the performance of a face recognition system heavily depends on the pose of the input faces. Recent studies show that face verification within the same view, such as frontal-frontal or profile-profile, achieves very high accuracy, but the performance dramatically degrades when verifying faces across different views, such as frontal-profile [4].
Pose-invariant face recognition refers to the problem of identifying or verifying a person from face images captured under different poses. In recent years, numerous pose-invariant face recognition methods have been proposed. In [5][6][7][8][9][10][11][12][13], the authors train deep neural networks on large-scale datasets to ease the effect of pose variation, which leads to significant improvements in face recognition performance. In [14], Masi et al. propose a method to enrich the pose variation in the training dataset by rotating faces across 3D space. In [15], Sagonas et al. propose a novel method to jointly learn frontal view reconstruction and landmark localization by solving a constrained optimization problem. Kan et al. [16] introduce stacked progressive auto-encoders (SPAE), which learn pose-robust features through a deep neural network that progressively transforms profile faces into frontal ones. In [17], Hassner et al. introduce a straightforward approach to generate frontal faces from a simple 3D shape. Peng et al. [18] propose a new reconstruction loss for disentangled learning that encourages identity features of the same subject to be clustered together despite pose variation.
Recently, generative adversarial networks (GANs) [19] have proved to be powerful at mimicking data distributions. GANs have been successfully applied to many computer vision tasks such as image inpainting [18,20,21], style transfer [22,23], image synthesis [24,25], super-resolution [26] and so on. These successful applications have motivated researchers to apply GANs to pose-invariant feature disentanglement [4,27], face completion [28] and face frontalization [4,[29][30][31][32]. In [28], Wang et al. propose a recurrent generative adversarial network (RGAN), which consists of a CompletionNet and a DiscriminationNet, for completing faces and recovering the missing regions automatically. Duan et al. [32] propose a boosting GAN (BoostGAN) for face deocclusion and frontalization, which can generate photorealistic, identity-preserving frontal faces from occluded profile faces. TP-GAN [29] uses a two-pathway GAN that simultaneously learns global structures and local information for photorealistic frontal view synthesis. Zhao et al. [33] propose a unified deep architecture containing a face frontalization module and a discriminative learning module, which can be jointly learned in an end-to-end fashion. Zhang et al. [34] propose a geometry-guided GAN to generate facial images with arbitrary expressions and poses conditioned on a set of facial landmarks; they embed a classifier into the GAN to facilitate image synthesis and perform facial expression recognition. In [27], Tran et al. propose DR-GAN, which can take one or multiple input images and produce one unified identity representation along with synthesized identity-preserved faces of various target poses. However, all of the methods mentioned above usually require a large number of paired faces across different poses for training, which is demanding in real-world applications.
In [35], Deng et al. propose an adversarial UV map completion framework called UV-GAN to solve pose-invariant face recognition without the need for extensive pose coverage in the training dataset. The authors of [35] first fit a 3DMM [36] to a 2D profile face to obtain an incomplete UV map, which is then completed by a straightforward pix2pix model [37,38]. The generator architecture in pix2pix follows the general shape of U-Net [39], adding skip connections between the encoder and decoder subnetworks to enhance the transfer of low-level information between input and output. One weakness of the original UV-GAN is the plain architecture of its generator, which has been shown to be inferior to residual networks [40]. Another weakness is that a single U-Net block seems insufficient to fuse low-level information in the encoder well with high-level semantic features in the decoder. In [41], Deng et al. use a UV-GAN with an architecture similar to [35] to extract side information as well as subspaces, and combine UV-GAN with robust PCA for the face recognition task. He et al. [42] introduce a framework for heterogeneous face synthesis from the near-infrared (NIR) to the visible domain. The framework consists of two adversarial generators that estimate a UV map and a facial texture map from an input NIR face, and then generate a corresponding frontal visible face. Nevertheless, both generators in this framework are based on the general U-Net structure [23,39]. Some efforts [43,44] stack multiple U-Nets together, but skip connections are utilized only inside each single U-Net. Ibtehaz et al. [45] propose residual paths with additional convolutional layers in the skip connections to reduce the semantic gap between encoder and decoder features. In [46], Oktay et al. introduce attention gates that implicitly learn to suppress irrelevant regions in an input image while highlighting salient features useful for a specific task. In [47], Tang et al. introduce a coupled U-Nets architecture, where coupling connections improve the information flow across U-Nets.
In this paper, we propose a new generative model architecture called Attention ResCUNet-GAN, in which the generator consists of coupled U-Nets and the backbone of each encoder is enhanced with a residual network architecture. We use attention gates in the skip connections within each U-Net to suppress irrelevant low-level information from the encoders. We also use skip connections across the two U-Nets to limit gradient vanishing and promote feature reuse. Experiments on popular benchmarks demonstrate that our Attention ResCUNet-GAN yields considerably better results than the original UV-GAN.
The rest of this paper is organized as follows. Details of our proposed method are presented in Section 2. Section 3 presents our experimental results. Finally, Section 4 concludes the paper.

Our Proposed Method
Following [35], we use 3DDFA [48] to fit 2D images to retrieve UV maps and 3D meshes. For a non-frontal face, the UV map generated by 3DDFA is always incomplete due to self-occlusion. Hence, we propose a new generative model architecture called Attention ResCUNet-GAN to improve the performance of the original UV-GAN [35] in filling in the missing contents of the UV map, which in turn helps to synthesize facial images of arbitrary poses. The overall pipeline to synthesize faces of various poses is depicted in Fig. 1.

3D Morphable Model

The authors of [49] introduce the 3D Morphable Model (3DMM) to recover the 3D face from a 2D image. Assume that a 3D face scan with N vertices can be represented as a 3N × 1 vector S = [x_1, y_1, z_1, ..., x_N, y_N, z_N]^T ∈ R^{3N}, where [x_i, y_i, z_i]^T are the object-centered Cartesian coordinates of the i-th vertex. Given a dataset of such 3D face scans, one would like to represent them with a smaller set of variables. The authors in [49] propose a two-stage principal component analysis (PCA) to estimate the shape identity parameters along with the expression parameters of the 3D faces. Suppose that, after the first stage, we keep the first n_s principal components with corresponding orthonormal basis s_1, s_2, ..., s_{n_s}; then a 3D face S can be represented as

S = S̄ + Σ_{i=1..n_s} α_i s_i,

where S̄ ∈ R^{3N} is the mean shape vector across the dataset of 3D face scans and α = [α_1, ..., α_{n_s}] are the shape parameters. In the second stage, a new PCA is trained on the offsets between expression scans and neutral scans. After this stage, the final shape representation is

S = S̄ + Σ_{i=1..n_s} α_i s_i + Σ_{i=1..n_e} β_i e_i,

where e_i, i = 1, ..., n_e are the orthonormal basis of the first n_e principal components and β = [β_1, ..., β_{n_e}] are the expression parameters. After the 3D face is constructed, a rigid transformation is applied to the shape from the barycentric coordinate system to the camera-based world coordinate system.
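As a minimal sketch of the two-stage 3DMM representation above (toy dimensions and random basis matrices, not the actual model used in the paper):

```python
import numpy as np

# Toy sketch of S = S_mean + sum_i alpha_i * s_i + sum_j beta_j * e_j.
N = 5            # number of mesh vertices (illustrative)
n_s, n_e = 4, 3  # number of identity / expression components kept

rng = np.random.default_rng(0)
S_mean = rng.standard_normal(3 * N)          # mean shape vector (3N,)
s_basis = rng.standard_normal((3 * N, n_s))  # identity basis, columns s_i
e_basis = rng.standard_normal((3 * N, n_e))  # expression basis, columns e_i

def reconstruct_shape(alpha, beta):
    """Linear 3DMM shape reconstruction from identity and expression parameters."""
    return S_mean + s_basis @ alpha + e_basis @ beta

# With zero parameters, the reconstruction reduces to the mean shape.
assert np.allclose(reconstruct_shape(np.zeros(n_s), np.zeros(n_e)), S_mean)
```

In practice the bases would come from the two-stage PCA described above rather than random numbers.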
Each 3D vertex v = [x, y, z]^T is rotated and translated as

v' = R v + t,

where R ∈ R^{3×3} and t = [t_x, t_y, t_z]^T are the 3D rotation and translation components, respectively. Finally, each 3D point is projected to its 2D location in the image plane with a scaled orthographic projection:

v_{2d} = f · Pr · v' + t_{2d},

where f is the scale factor, Pr = [[1, 0, 0], [0, 1, 0]] is the orthographic projection matrix and t_{2d} is the principal point, which is set to the image center.
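The rigid transform and scaled orthographic projection above can be sketched in a few lines (toy values; in the actual pipeline R, t and f come from the fitted model parameters):

```python
import numpy as np

# Orthographic projection matrix Pr drops the depth coordinate.
Pr = np.array([[1.0, 0.0, 0.0],
               [0.0, 1.0, 0.0]])

def project(vertices, f, R, t, t_2d):
    """vertices: (N, 3) array; returns (N, 2) image-plane coordinates.

    Implements v' = R v + t followed by v_2d = f * Pr * v' + t_2d.
    """
    transformed = vertices @ R.T + t       # rigid transform of each vertex
    return f * transformed @ Pr.T + t_2d   # drop depth, scale, translate

# Identity rotation, zero translation: the projection just scales x and y.
v = np.array([[1.0, 2.0, 3.0]])
out = project(v, f=2.0, R=np.eye(3), t=np.zeros(3), t_2d=np.zeros(2))
# out == [[2.0, 4.0]]
```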
The set of all model parameters is denoted by p = [f, R, t_{2d}, α, β].

3DDFA method
The 3DDFA method combines cascaded regression with a convolutional neural network (CNN). The cascaded CNN can be formulated as

p^{k+1} = p^k + Net^k(Feat(I, p^k)),

where p^k denotes the model parameters at the k-th iteration, which are updated by applying a CNN-based regressor Net^k to the shape-indexed feature Feat that depends on the input image I and the current parameters p^k.
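The cascaded update rule above can be illustrated with a toy loop. Here the "regressor" is a stand-in stub that returns a fixed fraction of the residual to a hypothetical ground truth, purely to show the iteration converging; it is not the actual 3DDFA network:

```python
import numpy as np

p_gt = np.array([1.0, -2.0, 0.5])   # hypothetical ground-truth parameters

def feat(image, p):
    return p                         # stand-in for the shape-indexed feature

def net_k(features):
    return 0.5 * (p_gt - features)   # stand-in CNN regressor (toy)

image = None
p = np.zeros(3)                      # initial parameters p^0
for k in range(20):                  # K cascade iterations
    p = p + net_k(feat(image, p))    # p^{k+1} = p^k + Net^k(Feat(I, p^k))

# After enough iterations, p approaches the ground truth.
assert np.allclose(p, p_gt, atol=1e-4)
```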
The purpose of the CNN regressors is to predict the parameter update Δp that shifts the initial parameters p^0 as close as possible to the ground truth p^g. In terms of the objective function, [48] proposes the Optimized Weighted Parameter Distance Cost (OWPDC):

E_{OWPDC} = (Δp − (p^g − p^k))^T · diag(w*) · (Δp − (p^g − p^k)),

where w* is the optimized parameter importance vector.

Proposed Network for UV Map Completion
The proposed Attention ResCUNet-GAN consists of a generator, two discriminators, and an identity preserving module. The global discriminator deals with the global structure of entire complete UV maps, while the local discriminator focuses on the local details of the face region.

Generator Network
An incomplete UV map is fed into the Attention ResCUNet-GAN generator, which acts as an auto-encoder to reconstruct the missing regions. We use the following reconstruction loss as in [35]:

L_{rec} = ‖ G(I_P) − I_F ‖_1,

where I_P is the input incomplete UV map, G(I_P) is the output of the generator, and I_F is the ground-truth texture. The generator (Fig. 2) consists of coupled U-Nets [47]. A drawback of UV-GAN's generator is its plain convolutional backbone, whose performance degrades rapidly as the network depth increases [40]. Therefore, we leverage the residual architecture of [40] to build a deeper backbone capable of extracting better high-level features without suffering from the degradation problem. In particular, as the backbone network for the encoders, we use ResNet-50 [40], which consists of multiple bottleneck residual blocks, each a stack of three successive layers with 1×1, 3×3 and 1×1 convolutions. Batch normalization is applied right after each convolution and before the activation layers. We use skip connections within each U-Net to transfer low-level information from the encoder to the high-level contextual features in the decoder. Attention gates [46] are used to suppress irrelevant low-level information from the encoders. Fig. 3 illustrates how a coarse feature map can guide a low-level feature map to ignore irrelevant information.
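A simplified, spatially aligned sketch of an additive attention gate in the spirit of [46] is shown below. The 1×1 convolutions are modeled as channel-mixing matrices on random feature maps; the real gates also resample feature maps to matching resolutions, which is omitted here:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gate(x, g, W_x, W_g, psi, b):
    """Toy additive attention gate.

    x:   (C_x, H, W) low-level encoder features (the skip connection)
    g:   (C_g, H, W) coarse gating signal from the decoder
    W_x: (C_int, C_x), W_g: (C_int, C_g), psi: (C_int,) -- 1x1 convolutions
         modeled as channel-mixing matrices in this sketch.
    Returns x scaled by per-pixel attention coefficients in (0, 1).
    """
    q = np.tensordot(W_x, x, axes=1) + np.tensordot(W_g, g, axes=1)  # (C_int, H, W)
    q = np.maximum(q, 0.0)                              # ReLU
    alpha = sigmoid(np.tensordot(psi, q, axes=1) + b)   # (H, W) coefficients
    return x * alpha[None, :, :]                        # suppress irrelevant regions

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))
g = rng.standard_normal((16, 4, 4))
out = attention_gate(x, g,
                     W_x=rng.standard_normal((4, 8)),
                     W_g=rng.standard_normal((4, 16)),
                     psi=rng.standard_normal(4), b=0.0)
assert out.shape == x.shape
```

Because the attention coefficients lie in (0, 1), the gate can only attenuate the skip-connection features, never amplify them.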
To combine features across the two U-Nets, one can apply a direct depth-wise concatenation of the coarse feature maps D_{U1}, D_{U2} extracted from the decoders of both U-Nets and the attentive information Ê_{U2} extracted from an attention gate of the encoder of the second U-Net. In such a combination, the latest feature map D_{U2}, which is thought to contain the most contextual information, would play the most crucial role in the contribution to the final output. However, such a direct concatenation requires more memory. Thus, before concatenating with D_{U2}, we apply fast normalized fusion [50] to combine D_{U1} and Ê_{U2} as follows:

D̃ = (w_1 · D_{U1} + w_2 · Ê_{U2}) / (w_1 + w_2 + ε),

where w_1, w_2 are learnable scalar weights that can be trained via the normal backpropagation algorithm and ε = 0.0001 is a small value to avoid numerical instability. The weights are kept positive by applying a ReLU activation to them.
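The fast normalized fusion step can be sketched as follows (numpy stand-in for the trainable operation; in training, w_1 and w_2 would be learnable parameters updated by backpropagation):

```python
import numpy as np

def fast_normalized_fusion(d_u1, e_u2, w1, w2, eps=1e-4):
    """Weighted fusion of two feature maps with ReLU-constrained scalar weights."""
    w1, w2 = max(w1, 0.0), max(w2, 0.0)  # ReLU keeps the weights non-negative
    return (w1 * d_u1 + w2 * e_u2) / (w1 + w2 + eps)

a = np.ones((2, 2))
b = np.zeros((2, 2))
fused = fast_normalized_fusion(a, b, w1=3.0, w2=1.0)
# weights normalize to roughly 0.75 / 0.25, so fused is ~0.75 everywhere
assert np.allclose(fused, 3.0 / (4.0 + 1e-4))
```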
Global and local discriminators. The global discriminator enforces the surrounding context of the facial image to be maintained, while the local discriminator focuses on the central face region to enforce better recovery of local details such as the eyes, nose and mouth. We keep the same architectures for the discriminators as described in [35] (Fig. 4). The following typical adversarial loss is used:

L_{adv} = E_{y∼p_d(y)}[log D(y)] + E_{x∼p_d(x), z∼p_d(z)}[log(1 − D(G(x, z)))],

where p_d(x), p_d(y), p_d(z) denote the distributions of the incomplete UV maps x, the complete UV maps y and the Gaussian noise z, respectively.
Identity preserving module. The synthetic faces must not only be photorealistic but also preserve identity information, which plays a crucial role in generation-based face recognition. To this end, the following identity loss [35] is used:

L_{id} = ‖ F(G(I_P)) − F(I_F) ‖_2^2,

where F(·) denotes the embedding features extracted by the last layer before the softmax of a pretrained CNN.
As the embedding feature extractor, we use FaceNet pretrained on the VGGFace2 dataset, which contains 3.31M face images of 9,131 identities. This feature extractor is frozen during training. The identity preserving module in Eq. (10) enforces the embedding features of the ground-truth UV map I_F and the generated UV map G(I_P) to be close to each other. The dimension of the embedding features is 512.

Final loss function. Overall, the total loss is a weighted sum of the above losses:

L = L_{rec} + λ_1 L_{adv}^{global} + λ_2 L_{adv}^{local} + λ_3 L_{id},

where λ_1, λ_2, λ_3 are weights that control the importance of the different losses. Moreover, a similar auxiliary loss L_{aux} is applied to the intermediate output of the generator right after the end of the first U-Net decoder. The auxiliary loss strengthens the gradient flow to the layers of the first U-Net so that its parameters can be trained more efficiently. Therefore, the final loss can be expressed as

L_{final} = L + η L_{aux},

where η is a parameter regulating the contribution of the auxiliary loss.
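The weighted combination of losses can be sketched as plain arithmetic. The individual loss values below are placeholders; in training they would come from the reconstruction, adversarial and identity terms, and the default weights mirror the values reported in the experimental settings:

```python
def total_loss(l_rec, l_adv_global, l_adv_local, l_id,
               lambda1=0.5, lambda2=0.5, lambda3=0.01):
    """Weighted sum of the reconstruction, adversarial and identity losses."""
    return l_rec + lambda1 * l_adv_global + lambda2 * l_adv_local + lambda3 * l_id

def final_loss(l_main, l_aux, eta=0.3):
    """Add the auxiliary loss on the first U-Net's intermediate output."""
    return l_main + eta * l_aux

l_main = total_loss(1.0, 0.2, 0.2, 0.5)   # placeholder loss values
l_final = final_loss(l_main, l_aux=0.8)
# l_main = 1.0 + 0.1 + 0.1 + 0.005 = 1.205; l_final = 1.205 + 0.24 = 1.445
```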

Datasets and settings
We train our Attention ResCUNet-GAN on the Multi-PIE dataset [51]. All subjects in this dataset were captured from 15 viewpoints, under 19 illumination conditions and with various facial expressions. In total, there are more than 750,000 images of 337 subjects. For each subject under each illumination condition and facial expression, we feed the 15 facial images captured from the 15 viewpoints into the 3DDFA model to retrieve separate incomplete UV maps. We then select the incomplete UV maps with yaw angles of 0°, −30° and 30° and merge them using Poisson blending [52] (Fig. 5) to create the corresponding ground-truth UV map. In this way, we can ideally create 15 training pairs for the generator, each consisting of an incomplete and a ground-truth UV map. However, when the quality of an input facial image is not good enough, the 3DDFA model cannot successfully detect the face landmarks, so the corresponding 3D mesh and incomplete UV map cannot be created; such cases are ignored in the training phase. All generated UV maps are rescaled to 256 × 256 to fit the input size of our Attention ResCUNet-GAN.
In addition to the proposed Attention ResCUNet-GAN, we also train a plain ResCUNet-GAN with a similar architecture but without attention gates or fast normalized fusion; in this model, concatenation is applied to all skip connections. Our networks are implemented in PyTorch. Training each network takes three days on a server with two RTX 2080Ti GPUs. We train each network for 100 epochs with a batch size of 16 and a learning rate of 10^{-4}. We empirically set the importance factors as follows: η = 0.3, λ_1 = λ_2 = 0.5, λ_3 = 0.01.
In order to evaluate the effectiveness of the proposed method, we conduct pose-invariant face recognition experiments on different benchmarks. CASIA-WebFace is a facial dataset that consists of 453,453 images over 10,575 identities. LFW (Labeled Faces in the Wild) is a well-known dataset for face verification in the wild; it contains more than 13,000 images of 1,680 identities, each with two or more images of various poses. CPLFW (Cross-Pose LFW) is an extended version of LFW, which is more difficult due to different illuminations, occlusions and expressions. The CFP dataset consists of 500 subjects, each with ten frontal and four profile images. There are two evaluation protocols for the CFP dataset: frontal-frontal (FF) and frontal-profile (FP) face verification. Each protocol has ten folds with 350 same-person pairs and 350 different-person pairs.

Figure 5 The creation of ground-truth complete UV maps. Three facial images with yaw angles of 0°, −30° and 30° are fed to the 3DDFA model to create three incomplete UV maps, which are then merged by Poisson blending to generate the ground-truth complete UV map.

Image Reconstruction
We use two metrics to evaluate the quality of the output of the Attention ResCUNet-GAN. The first metric is the structural similarity (SSIM), which measures the similarity between two images. The second is the peak signal-to-noise ratio (PSNR), which is commonly used to measure reconstruction quality. Table 1 shows that our method achieves better results than the original UV-GAN on both SSIM and PSNR. For frontal input faces, the UV maps completed by the different models are close to each other. However, for profile input faces, the results are quite different: UV-GAN produces the worst UV maps, the plain ResCUNet-GAN yields better results, and the Attention ResCUNet-GAN gives the most realistic ones with a smooth texture. Note that even the intermediate output obtained from the first U-Net of the Attention ResCUNet-GAN yields better results than UV-GAN's. The results for some in-the-wild input images are shown in Fig. 8. One can see that Attention ResCUNet-GAN yields significantly better results than the other models, especially the original UV-GAN.
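The PSNR metric can be computed directly from the mean squared error; a minimal sketch is shown below (assuming 8-bit images, so the peak value is 255; SSIM is more involved and omitted here):

```python
import numpy as np

def psnr(img_a, img_b, peak=255.0):
    """Peak signal-to-noise ratio in dB between two images of the same shape."""
    mse = np.mean((img_a.astype(np.float64) - img_b.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

a = np.full((4, 4), 100.0)
b = np.full((4, 4), 110.0)           # constant error of 10 -> MSE = 100
# PSNR = 10 * log10(255^2 / 100) ≈ 28.13 dB
print(psnr(a, b))
```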
In Fig. 9, Fig. 10 and Fig. 11, we show side-by-side synthetic images generated from the UV maps reconstructed by UV-GAN and by the proposed Attention ResCUNet-GAN. One can see that our model yields qualitatively better results than the original UV-GAN, especially for profile and in-the-wild input images.
The facial images in the Multi-PIE dataset are not diverse enough to reflect the real data distribution. Thus, in-the-wild faces occluded by unusual objects or with heavy makeup can lead to failures of the model, as illustrated in Fig. 12.

Attention map visualization
The attention coefficients of the proposed Attention ResCUNet-GAN are visualized in Fig. 13. These attention coefficients are obtained from the attention gate of the AFC node that takes S9 as input (see Fig. 2). One can see that the attention maps ignore the visible face regions and focus only on the missing regions of the incomplete UV maps.

Pose-invariant Face Recognition
We compare our methods with UV-GAN on the Multi-PIE dataset in the face verification task. We take facial images at poses ranging from 0° to 75°, frontalize them using UV-GAN and our methods, use a face detector [53] to crop the central faces from the generated complete UV maps, and push the cropped faces through ArcFace [54] to verify whether the synthetic frontal face and the ground-truth one belong to the same subject. The verification results are shown in Table 2. One can see that the verification accuracy decreases as the pose angle increases. Nevertheless, our proposed ResCUNet-GANs (and even ResUNet-GAN with a single U-Net block) always produce frontal faces that better preserve identity. Attention ResCUNet-GAN outperforms the other methods by a large margin at all profile poses. Surprisingly, for frontal faces, Attention ResCUNet-GAN yields slightly worse results than the plain ResCUNet-GAN. The reason may be that the information needed for recognition is almost fully present in frontal images; hence, a complicated transformation with attention gates and fast normalized fusion might unintentionally diminish some useful information and lead to a degradation in verification accuracy.
In the next experiment, we train a face recognition model on the CASIA dataset and evaluate its face verification performance on other datasets. First, we train a deep face feature extractor with a ResNet-101 backbone and the ArcFace [54] loss on the CASIA dataset augmented by Attention ResCUNet-GAN. For each identity in CASIA, we generate profile faces from the frontal one at yaw angles ranging from −80° to 80° with a step of 20°; in total, we synthesize approximately 300 frontal and profile images per identity. We train the network with a batch size of 128 for 30 epochs. The learned model is then used for the verification task on the LFW, CPLFW and CFP datasets. Note that for the CFP dataset we consider two verification protocols: frontal-frontal, which verifies two frontal faces, and frontal-profile, which verifies a frontal face against a profile one. We use k-fold cross-validation to evaluate the face verification task. Specifically, each dataset is divided into ten groups (k = 10). Each group is considered as the test set in turn, while the remaining groups are used to tune the best verification threshold. In total, we have ten runs for each face verification dataset, and we report the mean accuracy and standard deviation over the ten runs. Table 3 and Table 4 show that data augmentation using the proposed Attention ResCUNet-GAN improves the performance of the recognition model. Note that the LFW dataset does not pay much attention to cross-pose face verification, and most faces in this dataset are nearly frontal. Therefore, heavy facial pose augmentation with generative networks is probably unnecessary for training a recognition model for LFW. In fact, the verification performance on the LFW dataset fluctuates slightly over the ten runs when we apply the proposed generative model for data augmentation: the standard deviation of the accuracy increases from 0.032 to 0.391 (see Table 3).
Overall, however, using Attention ResCUNet-GAN still helps to improve the average cross-validation accuracy. In contrast to LFW, the CPLFW dataset has many positive face pairs with different poses, which enlarges the intra-class variance. In this case, our model brings more stable improvements: the standard deviation of the accuracy is almost the same as without data augmentation. The CFP dataset focuses on extreme poses, where many facial details are occluded (see Fig. 14). One can see from Table 4 that our Attention ResCUNet-GAN considerably improves the performance of the face recognition model, especially on the frontal-profile subtask.
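The k-fold verification protocol described above can be sketched as follows: for each fold, the best similarity threshold is tuned on the remaining folds and then applied to the held-out fold. The similarity scores and labels here are synthetic stand-ins for real embedding comparisons:

```python
import numpy as np

def accuracy(scores, labels, thr):
    """Fraction of pairs whose thresholded score matches the same/different label."""
    return np.mean((scores > thr) == labels)

def kfold_verification(scores, labels, k=10):
    """Mean and std of per-fold accuracy with thresholds tuned on the other folds."""
    folds = np.array_split(np.arange(len(scores)), k)
    accs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        candidates = np.unique(scores[train])        # candidate thresholds
        best_thr = max(candidates,
                       key=lambda t: accuracy(scores[train], labels[train], t))
        accs.append(accuracy(scores[test], labels[test], best_thr))
    return float(np.mean(accs)), float(np.std(accs))

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=200).astype(bool)
# same-person pairs score higher on average than different-person pairs
scores = np.where(labels, rng.normal(0.7, 0.1, 200), rng.normal(0.3, 0.1, 200))
mean_acc, std_acc = kfold_verification(scores, labels)
assert mean_acc > 0.9
```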

Conclusions and future work
In this paper, we introduce a novel generative model called Attention ResCUNet-GAN to generate complete facial UV maps, which allows us to synthesize faces of arbitrary poses and improve pose-invariant face recognition performance. We leverage the residual connections of ResNet as well as intra-U-Net and inter-U-Net feature fusion in coupled U-Nets to enhance the generator. The skip connections within each U-Net are amplified with attention gates, while the contextual feature maps from the two U-Nets are fused with trainable scalar weights. We jointly train global and local adversarial losses together with an identity preserving loss. The experiments show that the proposed Attention ResCUNet-GAN outperforms the original UV-GAN by a clear margin in terms of both reconstruction metrics and performance on the pose-invariant face verification task.
In future work, we would like to exploit recent efficient backbones such as EfficientNet [55] to improve the performance of the proposed approach. More complex shortcut connections [45,56] can also be utilized.

Table 4 Verification accuracy (%) comparison on the CFP dataset
Availability of data and materials
Not applicable.

Figure 12 Some failed cases when the input facial images are "abnormal" with respect to the training data. The top row shows the input images, the second row contains the incomplete UV maps and the third row displays the completed UV maps generated by our Attention ResCUNet-GAN.

Figure 13 Attention map visualization. The first column contains UV maps generated by the 3DDFA network, the second column contains generated UV maps overlaid by attention masks, and the last column illustrates the attention coefficients only.

Competing interests
The authors declare that they have no competing interests.