Complete Face Recovering: An Approach towards Recognizing a Person by a Single Partial Face Image without the Target Photo in Gallery

Complete face recovering (CFR) aims to recover the complete face image from a single partial face image of a target person whose photo may not be included in the gallery set. CFR has several attractive potential applications but is challenging. To the best of our knowledge, the CFR problem has yet to be explored in the literature. This paper therefore proposes an identity-preserved CFR approach (IP-CFR) to address it. First, a denoising auto-encoder based network is applied to acquire discriminative features. Then, we propose an identity-preserved loss function to retain the personal identity information. Furthermore, the acquired features are fed into a new variant of the generative adversarial network (GAN) to restore the complete face image. In addition, a two-pathway discriminator is leveraged to enhance the quality of the recovered image. Experimental results on benchmark datasets show the promise of the proposed approach.


Introduction
In real-world face identification applications, query face images are generally collected in a less constrained or unconstrained environment. Consider the following scenario: in a crowded area such as a supermarket or railway station, a target person can easily hide in the crowd, and his/her photos captured by surveillance cameras are likely to be partial. These partial faces can be: 1) occluded by disguises such as sunglasses, hats and scarves, or by the faces of other individuals; 2) captured in large-scale poses without user awareness; 3) positioned partially outside the camera's view, to name a few. Some examples of partial face images are illustrated in Figure 1. Under these circumstances, we can only acquire partial face images of a target person as query samples. Worse still, photos of the target person cannot be found in the gallery set. Generally speaking, such partial face images greatly increase the difficulty of practical face recognition, particularly when only a single partial face image is available. In this setting, the learning problem of how to recover a complete face image for recognition becomes even more challenging. We call this learning problem Complete Face Recovering (CFR). With such limited input information, the fundamental challenges of CFR are not only to recover visually realistic and semantically plausible content for the missing part, but also to preserve the personal identity information.
In the literature, the most related task is image inpainting [5], which, however, differs from CFR in two respects: (1) Image inpainting methods usually fill in visually plausible content for the missing region of an image based on the overall scene. However, they fail to restore a complete photo from a partial query photo because they are unable to predict a large missing region in a face image without relevant information for reference. (2) Image inpainting focuses on generating visually realistic and semantically reasonable images without considering the preservation of personal identity information during face image restoration. Consequently, it is hard to identify the target person from the recovered face images. Recently, partial face recognition methods [39,13,17] have received considerable attention and achieved remarkable performance. However, these methods require photos of the target person to be included in the gallery set; otherwise, they cannot retrieve a complete face image from the gallery, resulting in an incorrect solution.
In this paper, we therefore focus on the CFR problem. We use a denoising auto-encoder [36] based network with an identity-preserved loss function as the objective to seek the manifolds of the query images. To recover the complete face image, we propose a deep architecture exploiting a new variant of the generative adversarial network (GAN) built on an encoder-decoder [4] structure with the feature representation of the acquired manifold of the query image. Further, we enhance the global structure and local consistency of the recovered results with a two-pathway discriminator, i.e., a global discriminator and a local discriminator. Experiments show promising results. The main contributions of this paper are summarized below: - We attempt to address the new, challenging and practical CFR problem; - We propose a novel identity-preserved network to represent the features of a face image in manifold space so that the essential features of the face image can be obtained; - A new GAN-based structure with a two-pathway discriminator is proposed to boost the recovery result.
The rest of this paper is organized as follows. Section 2 reviews related work on GANs and image inpainting. Section 3 introduces the details of the proposed IP-CFR model. Section 4 presents the evaluation and experimental results, and Section 5 concludes our work.

Generative Adversarial Network (GAN)
The GAN [15] is a generative model that has led to significant progress in image synthesis [26,34,9,19], super resolution [25,7,10,38,41], image inpainting [44,20,43,27,42] and style transfer [30,48,21,2,22], to name a few. Its minimax two-player game is an ingenious strategy to estimate the target distribution and then generate photorealistic images. A widely accepted cognitive approach is to treat the GAN as a probability-based model (PBM), as in Deep Convolutional GAN (DCGAN) [33], Wasserstein GAN (WGAN) [1] and Improved Training of WGAN [16]: the discriminator calculates the conditional probability p(y|x) that the sample x belongs to category y, and the generator estimates the joint probability p(x, y). PBMs have at least three unsolved problems, i.e., training difficulty, hard-to-control diversity of the generated images, and mode collapse. Subsequently, researchers have studied GANs from another cognitive perspective, the energy-based model (EBM), which utilizes an energy function in place of probability calculations. Zhao et al. first proposed an energy-based GAN (EBGAN) [46] that gives a clearer physical explanation for the GAN, the Wasserstein distance, and the gradient penalty. Following this work, a number of EBM variants have been proposed, e.g., Margin Adaptation for GAN (MAGAN) [37], Loss-Sensitive GAN (LSGAN) [32] and Boundary Equilibrium GAN (BEGAN) [6]. Compared with PBMs, the merits of these EBMs are three-fold: stable convergence, ease of training, and robustness to hyper-parameters. In this paper, we therefore develop our CFR approach based on an EBM.

Image Inpainting
Image inpainting is a technique that aims to automatically fill in a missing region of an image. According to [20], image inpainting methods can be roughly classified into two categories: classical methods and learning-based methods.
Classical methods diffuse external information along the contour normal into the missing portion [35,5,8,14], or fill the missing area by copying patches from similar areas within the image [40,18,24,3,11]. However, these classical methods cannot handle complex images such as non-stationary texture images well, because they simply manipulate low-level features and are unable to capture a high-level understanding of the scene.
More recently, deep learning-based methods, especially GAN-based methods, have achieved great success at the task of image inpainting. These schemes fill the missing pixels using the learned data distribution, giving rise to a new family of image inpainting methods called learning-based methods [31,43,28,47]. These methods utilize global context information to generate the local missing region and make the restored image visually realistic. For example, Yu et al. [45] combined a coarse network with a contextual-attention-model-based refinement network to reconstruct the global image as well as to fine-tune texture details, which can fill in missing regions of arbitrary shape. Zheng et al. [47] integrated a reconstruction network into a generation network, where the former obtained the pixel distribution of the missing area while the latter generated a realistic area image according to that distribution. Nevertheless, GAN-based methods ignore the preservation of personal identity information during image restoration, which is unfavorable for subsequent face recognition. Besides, it has been shown that these methods fail to recover a missing region when that region possesses independent semantics (e.g., eyes, chin and forehead). In such cases, the generator in existing GAN-based inpainting methods may get out of control and generate strange texture that mismatches the surrounding region, as illustrated in Figures 4 and 5.
In summary, the traditional methods simply use low-level global information to restore the local missing region. Hence, they are incapable of dealing with complex non-stationary texture images and cannot recover unknown regions in images. Existing learning-based methods generate visually realistic and contextually plausible images, but without considering the personal identity information.

The Proposed Approach
The aim of CFR is to restore a query photo of a person containing a partial face, denoted x_p, into a desired complete face x̂_c that is close to the ground-truth complete photo x_c. Subsequently, it is expected that the restored query photo x̂_c can be used to recognize x_c correctly. The architecture of the proposed approach, denoted IP-CFR hereinafter, is illustrated in Figure 2.

Network Architecture
Identity-preserved Encoder: The proposed algorithm is inspired by the previous work of [36], where the partial face image x_p is treated as a kind of noisy input. Based on this, we design a stochastic operator p(x_c | x_p) = B_{g_θ(y)}(x_c), where y = f_{θ_f}(x_p), and treat y as an essential representation of x_p residing in the manifold. Moreover, it has been shown that both complete and partial face images of one person can be mapped into the same low-dimensional manifold [29]. Figure 3 gives a sketch of the mapping process. In the training stage, suppose the dataset contains M (M > 1) images for each person i (i = 1, 2, ..., N). We select one image x_gi as the control image for the person. Formally, x_gi and x^j_pi (j = 1, ..., M) should be mapped into the same manifold through the mapping function f_{θ_f}:

f_{θ_f}(x^j_pi) ≈ f_{θ_f}(x_gi),    (1)

where θ_f is the parameter of the encoder. Subsequently, we use the denoising encoder to seek the mapping function f_{θ_f}(·) and a decoder g_{θ_g}(·) to restore the complete face image x^j_ci as:

x̂^j_ci = g_{θ_g}(f_{θ_f}(x^j_pi)),    (2)

where θ_g is the parameter of the decoder. The proposed model can then be trained on the training set that contains M pairs of images x^j_pi and x^j_ci for person i, with the objective function:

min_{θ_G} Σ_{i,j} L(x^j_pi, x^j_ci, x_gi),    (3)

where θ_G = {θ_g, θ_f} and L is given by:

L = λ_ip L_ip + λ_pw L_pw.    (4)

Note that L_ip is the identity-preserved constraint that guarantees the partial query photo and the control image are mapped into the same manifold, so as to preserve the target person's identity. L_pw is the pixel-wise reconstruction residual. More detailed descriptions of the loss functions are given in Section 3.2. λ_ip and λ_pw are hyper-parameters.
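The identity-preserved mapping above can be sketched with a toy numpy encoder. Note that the linear projection `f`, the 128-dimensional feature size and the 64 × 64 × 3 image size are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the encoder f_{theta_f}: flatten the image and project it
# into a low-dimensional "manifold" feature space (assumed 128-d here).
W_enc = rng.standard_normal((128, 64 * 64 * 3)) * 0.01

def f(x):
    """Map a 64x64x3 image to a 128-d manifold feature."""
    return W_enc @ x.reshape(-1)

# A complete control image and a "partial" version with a region zeroed out.
x_g = rng.random((64, 64, 3))
x_p = x_g.copy()
x_p[16:48, 16:48, :] = 0.0  # simulate a missing region

# Identity-preserved constraint: features of the partial query and of the
# control image of the same person should coincide; training drives this
# distance toward zero.
L_ip = float(np.linalg.norm(f(x_p) - f(x_g)))
print(L_ip >= 0.0)  # True
```

With a trained encoder, L_ip would shrink as the two images of the same person are pulled onto the same point of the manifold.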
Face Recovery Decoder: To recover face images from the manifold in the low-dimensional space, this paper modifies the decoder by proposing a new recovery network architecture based on the EBM. Different from PBMs, which estimate the sample distribution, this paper introduces BEGAN to match the distributions of reconstruction errors. BEGAN exploits an encoder-decoder model as the classifier to match the reconstruction loss distributions based on the Wasserstein distance. The discriminator tries to enlarge the distance between the reconstruction loss distributions of real samples and generated samples, while the generator reduces this distance. This strategy makes the training process fast and stable. The objectives of our discriminator and generator are formulated as:

L_D = l_rec,θ_D(x_c) − k_t · l_rec,θ_D(g(f_p)),
L_g = l_rec,θ_D(g(f_p)),
k_{t+1} = k_t + λ_k (γ · l_rec,θ_D(x_c) − l_rec,θ_D(g(f_p))),    (5)

where θ_g and θ_D are the parameters of the face recovery decoder (g) and the discriminator (D), respectively. f_p = f_{θ_f}(x_p) is the feature expression in the manifold space, as described in Eq. (1). λ_k is the learning rate for k and is set to 0.001 in our experiments. The initial value of k is k_0 = 0. Furthermore, g : R^X → R^{W×H×C} decodes the features into the original image space. The feature f_p is an X-dimensional vector and the recovered image has size W × H × C. The structure of g is shown in Figure 2 (a).
In addition, l_rec,θ_D denotes the reconstruction error, which can be calculated by:

l_rec,θ_D(x) = ||x − D_{θ_D}(x)||_η,    (6)

where D_{θ_D} : R^{W×H×C} → R^{W×H×C} is the encoder-decoder function parameterized by θ_D, and x ∈ R^{W×H×C}. The structure of D will be described in detail in the following section. η denotes the norm, which can be 1 or 2. The hyper-parameter γ in Eq. (5) is defined as:

γ = E[l_rec,θ_D(g(f_p))] / E[l_rec,θ_D(x_c)].    (7)

Here, γ controls the balance between g and D, meanwhile balancing the diversity and quality of the samples generated by g; it takes a value in [0, 1]. When γ is lower, g pays more attention to the realism of the decoded images, leading to lower diversity of the generated images.
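A minimal sketch of this BEGAN-style bookkeeping, using scalar reconstruction errors as stand-ins for the expected losses; the function name and the clipping of k to [0, 1] are assumptions following common BEGAN practice rather than details stated in the text:

```python
import numpy as np

def began_step(l_rec_real, l_rec_fake, k, gamma=0.5, lambda_k=0.001):
    """One step of the BEGAN-style objective bookkeeping.

    l_rec_real: reconstruction error of real complete faces under D
    l_rec_fake: reconstruction error of recovered faces g(f_p) under D
    k:          balancing variable (k_0 = 0, as in the paper)
    """
    L_D = l_rec_real - k * l_rec_fake   # discriminator objective
    L_G = l_rec_fake                    # generator objective
    # k tracks gamma * E[l_rec(real)] - E[l_rec(fake)] with rate lambda_k.
    k_next = k + lambda_k * (gamma * l_rec_real - l_rec_fake)
    k_next = float(np.clip(k_next, 0.0, 1.0))  # assumed clipping to [0, 1]
    return L_D, L_G, k_next

# Example: fake samples are currently easier for D to reconstruct than
# gamma-scaled real ones, so k is pushed down and clipped at 0.
L_D, L_G, k = began_step(l_rec_real=0.8, l_rec_fake=0.6, k=0.0)
```

In a full training loop these scalars would be batch averages of Eq. (6) over real and generated images.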
Dual Pipeline Discriminator: Besides the improved structure of the recovery network, we also enhance the discriminator using a dual pipeline architecture [47] that contains a global and a local pathway. The global pathway takes the whole image as input to discern the overall consistency of the face image, while the local pathway observes one quarter of the entire face image to determine local consistency and detail information. The discriminator is also based on an encoder-decoder structure that reconstructs the input. The outputs of the two pathways are then synthesized by a convolution layer that reconstructs the entire image. Skip-connections are introduced to relieve the network architecture and fuse multi-scale features. Finally, the distribution of reconstruction errors can be acquired to distinguish generated images from real images. An overview of the discriminator is shown in Figure 2 (b), and the layers of the discriminator architecture are described in Table 2; the layers of the generator are described in Table 1. Here, enc. and dec. abbreviate encoder and decoder, respectively; conv. is short for convolutional layer and deconv. for deconvolution layer; k represents the kernel size, s denotes the stride and c is the number of channels. Every layer is followed by a Rectified Linear Unit (ReLU) activation layer except the synthesis layer. Every layer in the encoder is followed by a residual block before the ReLU layer, which is used for the skip connection. Compared with the generator, the structure of the discriminator is simpler and thus does not introduce too much computational cost during the training stage. The global pathway takes the entire image, resized to 128 × 128 × 3 pixels, as input. It consists of a down-sampling encoder and an up-sampling decoder.

Training Loss
The loss function is essential for training our network. We formulate it as a weighted sum of the terms in Eq. (4) and Eq. (5) to train the generator network; the following paragraphs give a detailed description.
Identity-preserved Loss: The identity-preserved loss function L_ip is formulated as:

L_ip = ||f_{θ_f}(x_p) − f_{θ_f}(x_g)||_η,    (8)

where η can be 1 or 2. Since the query image (which contains a partial face) and the control image depict the same person, they should be mapped into the same manifold in the low-dimensional space. f_{θ_f}(x_p) and f_{θ_f}(x_g) correspond to the outputs of the hidden layer, which can be seen as the feature expressions of the manifolds sharing the same identity; therefore, they are desired to be the same. This constraint helps make the encoder robust to variations and find the essential features of the query image.
Reconstruction Residual Loss: The pixel-wise reconstruction residual loss L_pw (we use the abbreviated notation L_pw = L_pw(x_p, x_c)) is expressed as:

L_pw = (1 / (W · H)) ||G(x_p) − x_c||²_2,    (9)

where W and H denote the width and height of the image, respectively, and G is the generator, which consists of the identity-preserved encoder and the face recovery decoder. This reconstruction residual loss measures the global similarity between the recovered image and the original complete image. Since the recovered face images should be similar to the complete face images in terms of pixel intensities, we employ the L2 norm to enforce this similarity. It means that the missing areas of the query image should be repaired after passing through the generator.
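The pixel-wise residual described above can be sketched in a few lines; the helper name is ours, and the normalization by W·H with a squared L2 norm follows the description in the text:

```python
import numpy as np

def l_pw(recovered, target):
    """Pixel-wise reconstruction residual: squared L2 difference over all
    pixels and channels, normalized by the spatial size W * H."""
    W, H = target.shape[0], target.shape[1]
    return float(np.sum((recovered - target) ** 2) / (W * H))

rng = np.random.default_rng(0)
x_c = rng.random((128, 128, 3))
print(l_pw(x_c, x_c))  # identical images -> 0.0
```

For a perfect recovery the residual is exactly zero; shifting every pixel by 1 yields a residual of 3 (one unit squared per channel).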
Adversarial Loss: Since there could be several possible mappings between partial and complete face images, CFR is inherently an underdetermined problem. Constraining the identity features and the pixel-intensity similarity may not guarantee that the decoder outputs realistic and reliable recovered face images. Thus, besides the loss functions described so far, the generative term L_g of Eq. (5) is also introduced into the loss function; it serves as a supervision that motivates the network to generate natural and reliable images by trying to mislead the discriminator.
Overall Loss Function: The overall loss function is a weighted sum of the foregoing losses:

L_overall = λ_ip L_ip + λ_pw L_pw + λ_g L_g.    (10)
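As a sketch, the weighted combination can be written as below, with default weights taken from the hyper-parameter values reported in Section 4.2 (λ_ip = 1, λ_pw = 1, λ_g = 5 × 10⁻³); the function name and the example loss values are illustrative:

```python
def overall_loss(L_ip, L_pw, L_g, lam_ip=1.0, lam_pw=1.0, lam_g=5e-3):
    """Weighted sum of the identity-preserved, pixel-wise and adversarial
    loss terms used to train the generator."""
    return lam_ip * L_ip + lam_pw * L_pw + lam_g * L_g

# Example with made-up loss values: 0.2 + 1.5 + 0.005 * 10.0 = 1.75
print(overall_loss(0.2, 1.5, 10.0))  # -> 1.75
```

The small λ_g keeps the adversarial term from dominating the identity and pixel-wise constraints early in training.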

Algorithm
Based on Sections 3.1 and 3.2, the training and testing of the proposed approach are as follows. Training: The standard forward-backward optimization paradigm is utilized to train the network. In the forward pass, the IP-CFR network takes x_p as input and outputs x̂_c. In the backward pass, the network parameters are updated based on Eq. (5) and Eq. (10) over the recovered image. Specifically, the Adam optimization algorithm [23] is utilized to update the parameters θ_G and θ_D. Testing: Once the network is trained, the discriminator can be discarded, and we use the generator to recover the complete face image from a partial face image. Accordingly, the procedure is given in Algorithm 1:

Algorithm 1: Identity-preserved complete face recovering
Input: partial face images x^j_pi, corresponding complete face images x^j_ci, control images x_gi (j = 1, ..., M, i = 1, 2, ..., N), learning rate α.
Output: recovered complete face image x̂^j_ci
1: i ← 0
2: while not converged do
3:   compute the θ_G gradient based on Eq. (10)
4:   compute the θ_D gradient based on Eq. (5)

Experiment
Extensive experiments are carried out to demonstrate the ability of the proposed model to recover missing content in face images.

Datasets and Evaluation Criteria
The experiments use two benchmark face datasets: the CelebFaces Attributes (CelebA) dataset and the Labeled Faces in the Wild (LFW) dataset.
We use the recovered results as the qualitative index. As for the quantitative indicator, it is worth noting that our model aims at generating identity-preserved complete face images rather than reproducing the exact pixels of the original images. Therefore, we use recognition accuracy to help evaluate the performance of our proposed model.

Implementation Details
The evaluation in this paper covers two aspects: 1) The effectiveness of IP-CFR. We randomly select 90% of the persons to train our model, and the remaining persons are used for testing. 702 persons with more than 6 images each are selected to calculate recognition accuracy. Among these images, we stochastically select 6 images for each identity. For each person, we randomly select one image as the control image (702 samples in total); the remaining 5 images (3510 samples in total) are used as query samples.
2) The transferability of IP-CFR. We use the CelebA dataset for training and LFW for testing. To calculate recognition accuracy, we choose 311 identities with no fewer than 6 images each, 5425 images in total. For each person, we randomly select one image as the control image; the remaining 5114 images are used as query samples.
All images are cropped to the face region and then re-scaled to 128 × 128 × 3 pixels. The partial face image x_p is randomly cropped from x_c with size 64 × 64 × 3 to simulate image occlusion.
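The occlusion simulation can be sketched in a few lines; the helper name and the use of numpy's random Generator are our assumptions:

```python
import numpy as np

def random_partial(x_c, crop=64, rng=None):
    """Simulate a partial query image: randomly crop a crop x crop x 3
    patch from the complete 128x128x3 image x_c."""
    rng = rng if rng is not None else np.random.default_rng()
    H, W = x_c.shape[0], x_c.shape[1]
    top = rng.integers(0, H - crop + 1)   # random top-left corner
    left = rng.integers(0, W - crop + 1)
    return x_c[top:top + crop, left:left + crop, :]

x_c = np.zeros((128, 128, 3))
x_p = random_partial(x_c, rng=np.random.default_rng(0))
print(x_p.shape)  # (64, 64, 3)
```

Each training pair (x_p, x_c) is then formed from the cropped patch and its source image.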
The experiments first extract deep features with InsightFace [12] and then use a cosine-distance metric to calculate the rank-1, rank-5 and rank-10 recognition accuracy. The hyper-parameters are set as: λ_ip = 1, λ_pw = 1 and λ_g = 5 × 10⁻³. We compare our model with three different kinds of methods: 1. InsightFace, a simple and effective face recognition method that greatly improves both speed and accuracy, and is strongly robust to large posture changes (roll, yaw, pitch).
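The rank-k evaluation described above can be sketched as follows; the feature extractor itself is omitted, so the toy 2-d features and the helper name are illustrative assumptions:

```python
import numpy as np

def rank_k_accuracy(query, gallery, labels_q, labels_g, k):
    """Rank-k accuracy using cosine similarity on L2-normalized features."""
    q = query / np.linalg.norm(query, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    sims = q @ g.T                           # cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]  # k most similar gallery entries
    hits = [labels_q[i] in labels_g[topk[i]] for i in range(len(labels_q))]
    return float(np.mean(hits))

# Tiny example: two queries against a gallery of three identities.
gallery = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
labels_g = np.array([0, 1, 2])
query = np.array([[0.9, 0.1], [0.1, 0.9]])
labels_q = np.array([0, 1])
print(rank_k_accuracy(query, gallery, labels_q, labels_g, k=1))  # 1.0
```

In the paper's setting, `gallery` would hold the control-image features and `query` the features of the recovered faces.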
2. Dynamic Feature Matching (DFM), proposed by He et al. [17]. DFM is a partial face recognition method that works for inputs of arbitrary size without pre-alignment.
3. Generative inpainting, proposed by Yu et al. [45], also trained on the CelebA dataset. It is a GAN-based inpainting method that can generate realistic inpainting results for missing pixels inside the input image.
For DFM and generative inpainting, we use the settings recommended in their papers. For all three methods, we use cropped images identical to the input of our proposed model as query samples.

Qualitative Comparison
We compare our model with generative inpainting. Figures 4 and 5 present examples of CFR results on the CelebA and LFW datasets. We can observe that generative inpainting has limited ability to predict the region outside the input: it generates only a few more visually meaningful pixels, because it mainly utilizes global image information to infer the distribution of the missing pixels inside the image. Once the missing region lies outside the input, the model has no global information as guidance and thus generates largely meaningless texture. Therefore, the inpainting method cannot handle the CFR problem. Our method uses an identity-preserved encoder to find the manifolds of the face images and then obtains the complete face images with a decoder: the global information is acquired by the encoder and the complete face image is recovered by the decoding process.

Quantitative Comparison
Tables 3 and 4 show the recognition accuracy of the state-of-the-art methods and IP-CFR. For generative inpainting, the information useful for recognition mainly comes from the input images; therefore, its recognition accuracy increases only slightly compared with directly using InsightFace. The rank-10 recognition accuracy of DFM is acceptable but not good enough. We also ran the experiment that uses the original complete face images as query samples with InsightFace, obtaining rank-10 results of 0.9063 on CelebA and 0.9969 on LFW. The recognition accuracy of our proposed method (0.8589 on CelebA and 0.9361 on LFW) is close to that obtained using the original complete face images as query samples.

Limitations and Discussion
Even though our approach can handle various partial face images, it has some limitations. Inputs with too little useful information cannot be repaired, owing to the difficulty of finding the feature expression of the low-dimensional manifold.
The failed examples mainly fall into two situations: 1) they contain little recognizable information; 2) they are of low resolution. The sparse useful information results from the random cropping; the low resolution is caused by cropping the face region and resizing in the preprocessing step. The failed results suggest that the eye region carries the most crucial information for a deep network to recognize. Resolution is beyond the scope of this work: mapping an image from low resolution to high resolution is a heterogeneous matching problem. We plan to investigate better identity-preserved networks that seek more discriminative features to address this issue in future work. Figure 6 shows several examples of failed recovered results. The other limitation is that the recovered results lose some subtle image detail. From Figures 4 and 5, we can see that the recovered parts become blurred to some extent, which diminishes visual quality; the limited available information causes this blur. More prior information, such as the locations of the facial features, could be exploited to enhance the clarity of the image in future work.

Conclusion
In this work, we have proposed the IP-CFR network to deal with the new but challenging CFR problem. In our model, the generator is composed of an identity-preserved encoder, constrained by an identity-preserved loss, followed by a complete-image recovery decoder. A dual pipeline discriminator (global and local) is then applied to enhance the recovered results. The proposed framework is capable of recovering the complete face image as well as preserving the identity of the query partial face image. Experiments have shown promising results of the proposed approach on benchmark face datasets.

Fig. 1 .
Fig. 1. Examples of partial face images collected in an unconstrained environment. These query photos suffer varying degrees of damage, including: 1) occlusion by disguises such as sunglasses, hats and scarves, or by the faces of other individuals; 2) capture in large-scale poses without user awareness; 3) partial positioning outside the camera's view.

Fig. 3 .
Fig. 3. The sketch of a manifold map, where each point represents the manifold of a person. Both complete and partial face images of one person should be mapped into the same low-dimensional manifold. The identity-preserved network seeks this mapping function.

The local pathway uses a similar structure to the global pathway, except that its input is a 64 × 64 × 3-pixel patch. As the recovered region is large, we divide the entire image I into four equally sized patches {p_1, p_2, p_3, p_4} and input them into the local pathway; each patch usually contains part of the recovered region. Finally, we fuse the reconstructed images from the two pathways by feeding them into a successive convolution layer to obtain the final output image. The structure of the discriminator is shown in Fig. 2 (b).
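The four-patch split fed to the local pathway can be sketched as follows; the helper name is ours:

```python
import numpy as np

def quadrants(img):
    """Split a 128x128x3 image into the four equally sized 64x64x3 patches
    {p_1, p_2, p_3, p_4} that are fed to the local pathway."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    return [img[:h, :w], img[:h, w:], img[h:, :w], img[h:, w:]]

img = np.arange(128 * 128 * 3, dtype=float).reshape(128, 128, 3)
patches = quadrants(img)
print(len(patches), patches[0].shape)  # 4 (64, 64, 3)
```

Because the recovered region is large, any quadrant can overlap it, so all four patches are passed through the local pathway.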
. Cheung and M.K. Li

Fig. 6 .
Fig. 6. Examples of failed results. The images in the first row are the inputs; the second row shows the recovered results; the last row shows the original images.

Table 1 .
Layers of the generator

Table 2 .
Layers of the discriminator

Table 3 .
Recognition accuracy comparison with state-of-the-art methods on the CelebA dataset (702 classes)

Table 4 .
Recognition accuracy comparison with state-of-the-art methods on the LFW dataset (311 classes)