Semantic-Aware Face Deblurring With Pixel-Wise Projection Discriminator

Most recent face deblurring methods have leveraged the distribution modeling ability of generative adversarial networks (GANs) to impose a constraint that the deblurred image should follow the distribution of sharp ground-truth images. However, generating sharp face images with high fidelity and realistic properties from a blurry face image remains challenging under the GAN framework. To this end, we focus on modeling the joint distribution of sharp face images and segmentation label maps for face image deblurring in a GAN framework. We propose a semantic-aware pixel-wise projection (SAPP) discriminator that models pixel-label matching with semantic label map information and generates a pixel-wise probability map of realness for the input image as well as a per-image probability. Moreover, we introduce a prediction-weighted (PW) loss to focus on erroneous pixels in the output of the decoder, using a per-pixel real/fake probability map to re-weight the contribution of each pixel in the decoder. Furthermore, we present a coarse-to-fine training technique for the generator, which encourages the generator to focus on global consistency in the early training stages and local details in the later stages. Extensive experimental results show that our method outperforms existing methods both quantitatively and qualitatively in terms of perceptual image quality.


I. INTRODUCTION
Single face image deblurring (SFID) aims to restore a sharp face image from a single blurred face image. It is an important but challenging research area in computer vision because face analysis plays an important role in many applications, including face detection [1], [2], [3], [4], face recognition [5], [6], [7], [8], and age prediction [9], [10], [11], [12]. SFID is a highly ill-posed problem because many sharp images are possible for a given blurred image; therefore, recent SFID methods have typically leveraged face-specific priors, including face landmarks [13], face sketches [14], face 3D shape [15], face segmentation label maps [16], [17], [18], and deep features [19]. Despite these efforts, these methods [13], [15], [17] often suffer from over-smoothed and perceptually unnatural results. Some SFID methods [14], [16], [18], [19] are effective at improving the perceptual quality of deblurred images on the strength of generative adversarial networks (GANs) [20]. GANs have demonstrated an ability to generate realistic samples via a min-max game between a generator and a discriminator. The generator captures the training data distribution and builds a mapping function from a prior noise distribution to the generated data distribution. The discriminator guesses whether the input sample is from the training sharp images (real) or the generator (fake) [21], as shown in Fig. 1 (a). Several SFID methods [14], [16], [18], [19] leverage this distribution modeling ability to impose a constraint that the deblurred image should follow the distribution of sharp ground-truth (GT) images.
The associate editor coordinating the review of this manuscript and approving it for publication was Taous Meriem Laleg-Kirati.
VOLUME 11, 2023. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
However, generating sharp face images with high fidelity and realistic properties from a blurry face image remains challenging under the GAN framework. One possible reason lies in the discriminator. Even when additional information, such as semantic labels, exists, the discriminator typically estimates the data distribution of only the sharp face images, and it does not learn the joint distribution of the sharp images and semantic labels. Even though face images are highly structured with semantic components, the decisions of the discriminator can be based on relatively unimportant details [22] and not on the semantically structured features of the face [23]. In addition, the discriminators in the above methods are limited to estimating the global (per-image) real/fake decision without considering local (per-pixel) decisions [24]. Hence, these approaches lack pixel-level details of the generated image and do not guarantee that their generators can synthesize locally plausible images [24], [25], [26].
For the joint distribution modeling of data and additional information such as labels, conditional GANs (cGANs) are widely used [21], [28], [29], [30], [31], [32], [33], [34], [35]. By incorporating data and additional labels, the discriminators identify real images in a principled way, thereby resulting in generators that produce realistic images [36], [37]. Recently, projection GANs [28], [33], [34], [35] have successfully decomposed joint distributions into image distribution (marginal) and label matching distribution (conditional). Specifically, as shown in Fig. 1 (b), the projection discriminator utilizes a class embedding matrix, an image embedding network (encoder), and a linear layer. Despite their promising distribution modeling ability, projection GANs cannot model the pixel-wise joint distribution of the image and semantic label map because they assume that each pixel in the input image shares the same label information.
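For reference, the projection discriminator of [28] is commonly written in the form below. The symbols $\phi$ (image embedding network), $\psi$ (linear layer), and the one-hot label vector $y$ are notation assumed here for illustration; only the class embedding matrix $V$ reappears later in this paper.

```latex
% Projection discriminator output (logit) for an image x with one-hot label y:
%   first term  = label-matching (conditional) part,
%   second term = image-only (marginal) part.
f(x, y) = y^{\top} V \, \phi(x) + \psi\big(\phi(x)\big),
\qquad y \in \{0,1\}^{N},\; V \in \mathbb{R}^{N \times d},\; \phi(x) \in \mathbb{R}^{d}
```

Because $y$ is a single vector per image, the same embedding row of $V$ is applied to the whole image, which is the image-level label assumption discussed above.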
Meanwhile, U-Net GAN [24] has been proposed to synthesize locally plausible images. As shown in Fig. 1 (c), U-Net GAN utilizes a U-Net [38] structure-based discriminator, which consists of an encoder and a decoder acting as a classifier and a segmenter, respectively. The U-Net discriminator simultaneously outputs the probabilities of whether the input samples are real or fake for both the entire image and each pixel. This global and local feedback encourages the generator to improve the quality of synthesized samples. However, one limitation is that this structure is not designed to take additional label information as an input. Another limitation is that feedback to the generator can be overwhelmed by the dominant correct pixels of the decoder in the discriminator, resulting in inefficient training. The quality of face images may depend on small components, such as the eyes, nose, and lips. Therefore, it is important to focus on erroneous pixels for generating high-quality details.
To address the limitations mentioned above, we propose a semantic-aware pixel-wise projection (SAPP) GAN with a SAPP discriminator for face image deblurring in a GAN framework. Our SAPP GAN exploits both the U-Net GAN [24] and projection GAN [28]. As shown in Fig. 1 (d), the SAPP discriminator models pixel-label matching with semantic label map information. Unlike projection GAN [28], which utilizes image-level label information as illustrated in Fig. 1 (b), the proposed discriminator models the pixel-wise joint distribution of the images and pixel-wise label maps. Furthermore, unlike U-Net GAN [24] (Fig. 1 (c)), which models the data distribution of only the sharp images, our discriminator can capture the joint distribution of sharp images and corresponding segmentation label maps. By using a face segmentation map as condition information, the SAPP discriminator considers face components when it makes pixel-wise real/fake decisions during training. Empowered by semantic-aware pixel-by-pixel feedback, the generator can restore more accurate and detailed face images with high perceptual quality. Moreover, we propose a prediction-weighted (PW) loss to focus on erroneous pixels in the output of the decoder. The PW loss utilizes a per-pixel real/fake probability map to re-weight the contribution of each pixel in the decoder. Thus, the decoder can discriminate between the generator distribution and the target distribution more precisely, thereby enabling the generator to obtain more powerful and accurate feedback. Furthermore, based on the global and local feedback from the SAPP discriminator, we introduce a coarse-to-fine training technique for the generator, which encourages the generator to focus more on global consistency in the early training stages and local details in the later stages.
We validate the performance of our method on the MSPL dataset [18] and Real-Blur dataset [39] and compare its performance with those of the other SFID methods. Based on these extensive experiments, we show that our method outperforms existing methods quantitatively and qualitatively.
Our contributions can be summarized as follows:
• We present a semantic-aware pixel-wise projection discriminator that models the joint distribution of sharp face images and segmentation label maps.
• We introduce a prediction-weighted loss that gives a high penalty for incorrectly predicted pixels.
• We propose a coarse-to-fine generator training technique that enables the generator to focus on global consistency in the early stages and local details in later stages.
• Our method achieves state-of-the-art performance for single face image deblurring both quantitatively and qualitatively in terms of perceptual image quality.
II. RELATED WORK
This section reviews general image deblurring and face image deblurring. We also discuss projection-based conditional GANs and U-Net based GANs, which are highly relevant to the proposed method.

A. SINGLE IMAGE DEBLURRING
1) GENERAL DEBLURRING
General image deblurring restores a sharp image from a blurred image captured in a general (natural) scene. Image deblurring has typically been considered an ill-posed problem with a large solution space [40]. To overcome this, various priors have been studied to regularize the solution space, such as Gaussian mixture [41], hyper-Laplacian [42], $\ell_1/\ell_2$-norms [43], $\ell_0$-norms [44], [45], variational Bayes approximations [46], [47], adaptive sparse priors [48], patch priors [49], and dark channel priors [50]. Although these methods perform well in certain cases, they are not flexible enough for real-world examples, owing to the restrictive assumptions imposed by regularization [51], [52]. Deep learning (DL)-based approaches have recently made significant advances in image deblurring. Early DL-based studies [51], [53], [54] combined convolutional neural networks (CNNs) with traditional optimization-based deconvolution algorithms. Most of these methods used CNNs for blur kernel estimation and then employed optimization-based methods to obtain sharp images. Hence, such methods relied on an accurate kernel estimation step [55]. In contrast, Nah et al. [55] proposed a deep neural network that directly restored a sharp image from a blurry image without estimating the blur kernel. In particular, they built a multi-scale CNN, which consisted of multiple sub-networks at each sub-scale and predicted sharp images in a coarse-to-fine manner. Instead of stacking multiple sub-networks, multi-recurrent approaches [52], [56] implement coarse-to-fine procedures with a single recurrent neural network (RNN). More recently, multi-patch hierarchy methods [57], [58], [59] have been proposed to restore sharp images progressively from non-overlapping patches. To effectively reduce the computational cost, Cho et al. [60] proposed a multi-input multi-output (MIMO) architecture that accepts multi-scale input images with a single encoder and outputs multiple scales of sharp images with a single decoder.

2) FACE DEBLURRING
While general deblurring models have been well generalized to capture the natural representation of images, they have not been specialized for specific domains, such as face and text images [16], [61]. In contrast, most face deblurring approaches primarily focus on facilitating face restoration by utilizing effective and powerful face prior information, e.g., reference priors [62], [63], face landmarks [13], [64], face sketches [14], face 3D shapes [15], face semantic segmentation maps [16], [17], [18], and deep feature priors [19]. Reference-based approaches [62], [63], [65] use an additional sharp reference face as a guide for face deblurring. However, such methods require a time-consuming procedure to find an adequate reference image. Instead of searching for similar faces, recent face deblurring methods have focused on estimating facial priors through deep neural networks (DNNs). Shen et al. [16] first proposed a DNN-based framework consisting of two sub-networks: a semantic face parsing network and a multi-scale deblurring network. The face parsing network first estimates the semantic segmentation maps from blurry face images. Then, the deblurring network performs restoration. Inspired by [16], Yasarla et al. [17] proposed an uncertainty-based multi-stream network (UMSN) that measures the uncertainty score to prevent the negative effects of inaccurate parsing maps. More recently, Lee et al. [18] proposed a multi-semantic progressive learning (MSPL) framework that progressively restores sharp faces component-by-component. Jung et al. [19] developed a deep feature prior-based method that extracts rich information from a pre-trained face recognition network to utilize not only the shape prior of the face but also the texture prior.
However, most existing methods still yield blurry face images because they try to model the data distribution of only the sharp images. In contrast to existing methods, the proposed method focuses on modeling a joint distribution of sharp images and their segmentation label maps for semantic-aware restoration.

2) U-NET BASED GANs
Another development direction of GANs is to generate locally coherent images [24], [27], [77]. In particular, U-Net GAN [24] implements a U-Net [38] based discriminator architecture, which consists of an encoder and a decoder, acting as a classifier and a segmenter, respectively. The U-Net based discriminator outputs the probabilities of the input sample being real over the entire image and per pixel through the encoder-decoder architecture. The per-pixel decision provides spatially coherent feedback to the generator, while the per-image decision gives globally coherent feedback. Encouraged by the feedback, the generator attempts to improve the quality of synthesized samples. Owing to its powerful data representation, U-Net GAN has been adopted in various studies [25], [26], [82].
Although the proposed SAPPGAN is highly inspired by the projection GAN [28] and U-Net GAN [24], there are some key differences. Unlike the projection GAN, which considers each pixel in the input image to share the same label information, our method has the ability to model the pixel-wise joint distribution of the image and semantic label map. Moreover, the U-Net GAN [24] has a limited ability to model the joint distribution of input data and external data because it has been developed under unconditional settings. In contrast, we condition the U-Net GAN [24] on additional information (segmentation maps) to construct the conditional framework.

III. PROPOSED METHOD
The proposed SAPPGAN consists of a deblurring network $G$ and a discriminator network $D$. The overall architecture is shown in Fig. 2. $G$ takes a blurred face image $I_{blur} \in \mathbb{R}^{H \times W \times 3}$ as its input, where $H$ and $W$ represent the height and width of the image, respectively. Then, $G$ outputs a deblurred face image $I_{deblur} \in \mathbb{R}^{H \times W \times 3}$ as follows:

$$I_{deblur} = G(I_{blur}). \tag{1}$$

$D$ takes a sharp face image $x \in \mathbb{R}^{H \times W \times 3}$ and a segmentation map $y_s \in \mathbb{R}^{H \times W \times 1}$ as its input and models the joint distribution $p(x, y_s)$. Here, $x$ can be a deblurred image $I_{deblur}$ or a GT sharp image $I_{GT}$, and $y_s$ is used as the condition information of $D$. The proposed discriminator can be employed in several SFID methods that use the GAN framework. Thus, we adopt DFPGNet [19] as our generator. Similar to other GAN-based networks [20], we alternately train our generator $G$ and discriminator $D$.
In this section, we first introduce our semantic-aware pixel-wise projection (SAPP) discriminator, which utilizes segmentation maps for pixel-wise real/fake decisions. We then describe our discriminator training technique with the proposed prediction-weighted (PW) loss, which re-weights the contribution of each pixel in the decoder using the probability map output from the SAPP discriminator. We subsequently introduce the generator training technique based on the coarse-to-fine strategy.

A. SEMANTIC-AWARE PIXEL-WISE PROJECTION DISCRIMINATOR
We propose a SAPP discriminator that considers face component information $y_s$ when it makes the pixel-wise real/fake decision. The encoder $D_{enc}$ and decoder $D_{dec}$ of our SAPP discriminator $D$ are adopted from those of the U-Net discriminator [24]. As shown in Fig. 2, the body network of the encoder $D^{body}_{enc}$ takes a face image $x$ as its input and outputs a feature map $Z \in \mathbb{R}^{\frac{H}{32} \times \frac{W}{32} \times 1024}$, on which the head network of the encoder $D^{head}_{enc}$ is applied to generate a probability of realness as follows:

$$p_{enc} = D^{head}_{enc}(Z), \qquad Z = D^{body}_{enc}(x), \tag{2}$$

where $p_{enc}$ denotes a global probability of $x$ being real. There are 5 downsampling stages in $D^{body}_{enc}$, where each stage consists of a 3 × 3 convolution layer, a ReLU activation, a 3 × 3 convolution layer, and a 2 × 2 average pooling layer. $D^{head}_{enc}$ consists of a global sum pooling layer followed by a fully connected layer.
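The spatial size of the encoder feature map follows from the five pooling stages. The snippet below is a minimal sketch of that arithmetic, assuming the 3 × 3 convolutions are padded so that only the 2 × 2 average pooling changes the resolution (the padding is not stated in the text). The function name `encoder_output_shape` is hypothetical.

```python
def encoder_output_shape(height, width, num_stages=5, out_channels=1024):
    """Shape of Z after `num_stages` stages that each halve height and width.

    Assumption: convolutions preserve spatial size, so only the 2x2 pooling
    in each stage reduces the resolution (a factor of 2**5 = 32 overall).
    """
    for _ in range(num_stages):
        height //= 2
        width //= 2
    return height, width, out_channels


# A 128 x 128 input (the size used for the timing experiments) gives a 4 x 4 map.
print(encoder_output_shape(128, 128))
```
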
Given $Z$ from $D^{body}_{enc}$, $D_{dec}$ outputs the per-pixel real/fake prediction. However, unlike the U-Net discriminator [24], the SAPP discriminator leverages face semantic maps $y_s$ to further decide whether the input image matches the semantic label map condition. Thus, $D_{dec}$ segments the input image as real or fake, conditioned on $y_s$, which results in $Q_{dec} \in \mathbb{R}^{H \times W \times 1}$.
As shown in Fig. 2, $D_{dec}$ consists of a body network $D^{body}_{dec}$, a label embedding matrix $V$, and a head layer $D^{head}_{dec}$, as in [28]. $D_{dec}$ is connected with $D_{enc}$ through skip connections that concatenate corresponding feature maps from the stages of $D^{body}_{enc}$, and the body network of the decoder produces

$$L = D^{body}_{dec}(Z), \tag{3}$$

where $L \in \mathbb{R}^{H \times W \times d}$ is the output feature map. $V \in \mathbb{R}^{N \times d}$ contains the $d$-dimensional row embedding vectors of the $N$ class labels. Note that unlike [28], $V$ embeds the segmentation label map $y_s$ pixel-wise into a feature map $F \in \mathbb{R}^{H \times W \times d}$. To this end, $y_s$ is first one-hot encoded and then unrolled to $\hat{y}_s \in \mathbb{R}^{HW \times N}$. The embedded matrix $\hat{F} \in \mathbb{R}^{HW \times d}$ is then obtained as:

$$\hat{F} = \hat{y}_s \odot V, \tag{4}$$

where $\odot$ represents matrix multiplication. Finally, $\hat{F}$ is rearranged into the feature map $F$ of dimension $H \times W \times d$. By taking the inner product between $L$ and $F$ at the pixel level, we can obtain a per-pixel conditional probability map $M$ as:

$$M_{i,j} = \langle F_{i,j}, L_{i,j} \rangle, \tag{5}$$

where $F_{i,j} \in \mathbb{R}^{1 \times 1 \times d}$ and $L_{i,j} \in \mathbb{R}^{1 \times 1 \times d}$ represent the vector elements at location $(i, j)$ of $F$ and $L$, respectively. $M_{i,j} \in \mathbb{R}^{1 \times 1 \times 1}$ represents the degree of matching between the pixel of $x$ and the semantic label of $y_s$ at location $(i, j)$. Thus, $M$ represents the conditional probabilities, i.e., the image-label matching map. $D^{head}_{dec}$ is a 1 × 1 convolution layer that takes $L$ as its input and outputs per-pixel marginal probabilities, i.e., an image-based real/fake probability map. Therefore, the final decoder output $Q_{dec}$ is calculated by summing the image-label matching map and the image-based real/fake probability map as follows:

$$Q_{dec} = M \oplus D^{head}_{dec}(L), \tag{6}$$

where $\oplus$ denotes element-wise summation. Providing a semantic face map $y_s$ as a condition enables $D_{dec}$ to make more accurate per-pixel decisions. Thus, $G$ can generate more accurate and realistic face details when driven by semantic-aware feedback.
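The pixel-wise projection described above can be illustrated with a small dependency-free sketch. It assumes integer class labels, nested Python lists in place of tensors, and a 1 × 1 convolution written as a per-pixel dot product with a weight vector `head_w` and bias `head_b`. These names and the toy values are hypothetical, and this is an illustration of the idea rather than the authors' implementation.

```python
def pixelwise_projection(label_map, V, L, head_w, head_b=0.0):
    """Return Q[i][j] = <V[label], L[i][j]> + (head_w . L[i][j] + head_b).

    label_map: H x W integer class labels (the segmentation map y_s)
    V:         N x d label embedding matrix (one row per class)
    L:         H x W x d decoder feature map
    head_w:    d weights of the 1x1 convolution head (assumed single output channel)
    """
    height, width = len(label_map), len(label_map[0])
    Q = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            feat = L[i][j]
            # Multiplying a one-hot row by V simply selects one row of V.
            emb = V[label_map[i][j]]
            matching = sum(e * f for e, f in zip(emb, feat))               # conditional term M
            marginal = sum(w * f for w, f in zip(head_w, feat)) + head_b   # marginal term
            Q[i][j] = matching + marginal
    return Q


# Toy example: a 1 x 2 "image", N = 2 classes, d = 2 feature channels.
V = [[1.0, 0.0], [0.0, 1.0]]
L = [[[2.0, 3.0], [4.0, 5.0]]]
labels = [[0, 1]]
Q = pixelwise_projection(labels, V, L, head_w=[0.5, 0.5])
print(Q)
```

Each pixel looks up its own embedding row, which is what distinguishes this from an image-level projection where a single row is shared by all pixels.
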

B. DISCRIMINATOR TRAINING
We propose a prediction-weighted (PW) loss that utilizes the probability map from the decoder to re-weight the contribution of each pixel to the decoder loss of [24]. For notational convenience, we define $Q^{r}_{dec} \in \mathbb{R}^{H \times W \times 1}$ and $Q^{f}_{dec} \in \mathbb{R}^{H \times W \times 1}$ as the decoder output $Q_{dec}$ when the input is real and fake, respectively:

$$Q^{r}_{dec} = Q_{dec}\big|_{x = I_{GT}}, \qquad Q^{f}_{dec} = Q_{dec}\big|_{x = I_{deblur}}. \tag{7}$$

Additionally, we define the probability map $p_r \in \mathbb{R}^{H \times W \times 1}$ for the real input and $p_f \in \mathbb{R}^{H \times W \times 1}$ for the fake input. Note that each pixel in both $p_r$ and $p_f$ represents the realness of the corresponding pixel, i.e., the probability of the real class when the input image is real and fake, respectively. Then, we can derive $p_r$ and $p_f$ from the original GAN loss [20] as follows:

$$-\log(p_r) = A(-Q^{r}_{dec}), \qquad -\log(\mathbf{1} - p_f) = A(Q^{f}_{dec}), \tag{8}$$

where $A(t) = \log(1 + \exp(t))$ refers to the SoftPlus function [28], [35]. By rearranging the above equations, $p_r$ and $p_f$ are obtained as:

$$p_r = \exp\big(-A(-Q^{r}_{dec})\big), \qquad p_f = \mathbf{1} - \exp\big(-A(Q^{f}_{dec})\big). \tag{9}$$

Finally, our PW loss is defined as:

$$\mathcal{L}_{D_{dec}} = \mathbb{E}\Big[\sum_{i,j} \big[\xi(\mathbf{1} - p_r) \otimes A(-Q^{r}_{dec})\big]_{i,j}\Big] + \mathbb{E}\Big[\sum_{i,j} \big[\xi(p_f) \otimes A(Q^{f}_{dec})\big]_{i,j}\Big]. \tag{10}$$

Here, $\otimes$ refers to element-wise multiplication and $[\cdot]_{i,j}$ represents pixel location $(i, j)$. $\xi(\cdot)$ is a normalization function defined as $\xi(t) = t / \sum_{i,j} t_{i,j}$, and $\mathbf{1} \in \mathbb{R}^{H \times W \times 1}$ is a matrix filled with ones. As the PW loss aims to emphasize the erroneous predictions of the decoder, $\mathbf{1} - p_r$ and $p_f$ are used as per-pixel weighting factors for the real and fake inputs, respectively. For example, when the discriminator incorrectly determines that a real pixel is fake, the value of that pixel in $p_r$ becomes low, so its weight $\mathbf{1} - p_r$ is high and it strongly affects the PW loss. Similarly, for fake data, misjudged pixels have high $p_f$ values and thus a large impact on the PW loss, and vice versa. As $\mathbf{1} - p_r$ and $p_f$ have different values for each pixel, the PW loss can highlight regions with wrong predictions, similar to [83].
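Overall, our discriminator objective function $\mathcal{L}_{D}$ consists of an encoder loss $\mathcal{L}_{D_{enc}}$ and a decoder loss $\mathcal{L}_{D_{dec}}$ as:

$$\mathcal{L}_{D} = \mathcal{L}_{D_{enc}} + \mathcal{L}_{D_{dec}}, \tag{11}$$

where the encoder loss is defined as follows [20]:

$$\mathcal{L}_{D_{enc}} = -\log D_{enc}(I_{GT}) - \log\big(1 - D_{enc}(G(I_{blur}))\big), \tag{12}$$

and the decoder loss $\mathcal{L}_{D_{dec}}$ is defined in Eq. (10).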
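The re-weighting idea behind the PW loss can be sketched on flat lists of decoder logits. The sketch assumes that the per-pixel probability is the sigmoid of the logit (which is what rearranging the SoftPlus form of the GAN loss gives), that the normalization divides each weight by the sum of all weights, and that a small `eps` guards against an all-zero weight map. The function names and `eps` are hypothetical, and whether the weights are treated as constants during backpropagation is not specified in the text.

```python
import math


def softplus(t):
    """A(t) = log(1 + exp(t)), written in a numerically stable form."""
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def sigmoid(t):
    """exp(-A(-t)), i.e. the probability of the 'real' class for logit t."""
    return math.exp(-softplus(-t))


def normalize(weights, eps=1e-12):
    """Divide by the sum so the weights add up to (almost exactly) one."""
    total = sum(weights) + eps
    return [w / total for w in weights]


def pw_decoder_loss(q_real, q_fake):
    """Prediction-weighted decoder loss for one real and one fake logit map."""
    p_r = [sigmoid(q) for q in q_real]   # realness of pixels of the real input
    p_f = [sigmoid(q) for q in q_fake]   # realness of pixels of the fake input
    w_r = normalize([1.0 - p for p in p_r])  # large where a real pixel looks fake
    w_f = normalize(p_f)                     # large where a fake pixel looks real
    loss_real = sum(w * softplus(-q) for w, q in zip(w_r, q_real))
    loss_fake = sum(w * softplus(q) for w, q in zip(w_f, q_fake))
    return loss_real + loss_fake, w_r, w_f


# Second pixel is misjudged in both maps, so it should receive the larger weight.
loss, w_r, w_f = pw_decoder_loss(q_real=[3.0, -2.0], q_fake=[-3.0, 2.0])
print(loss, w_r, w_f)
```

In this toy case the misclassified pixel dominates both weight vectors, so the loss is driven mostly by the pixels the decoder currently gets wrong.
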

C. GENERATOR TRAINING
The generator objective function includes a reconstruction loss $\mathcal{L}_{pixel}$, a prior feature loss $\mathcal{L}_{feat}$, and an adversarial loss $\mathcal{L}_{adv}$. The reconstruction loss is defined as the $L_1$ distance between the GT image $I_{GT}$ and the deblurred image $I_{deblur}$ in the image domain as follows:

$$\mathcal{L}_{pixel} = \lVert I_{GT} - I_{deblur} \rVert_{1}. \tag{13}$$

Inspired by [19], we employ a deep feature prior loss $\mathcal{L}_{feat}$ to utilize the rich information of deep features extracted from the well-trained VGGFace [84] network $\theta$. Let $\theta(\cdot)_l$ be the intermediate output features of the $l$-th layer of $\theta$. Then, $\mathcal{L}_{feat}$ minimizes the $L_2$ distance between the deep features obtained using $I_{GT}$ and $I_{deblur}$ as:

$$\mathcal{L}_{feat} = \sum_{l=1}^{3} \lVert \theta(I_{GT})_l - \theta(I_{deblur})_l \rVert_{2}. \tag{14}$$

Following [19], we select the relu1_2, relu2_2, and relu3_3 layers of VGGFace for $\theta(\cdot)_1$, $\theta(\cdot)_2$, and $\theta(\cdot)_3$, respectively. $\mathcal{L}_{adv}$ is an adversarial loss from $D$ that encourages $G$ to generate more realistic details as:

$$\mathcal{L}_{adv} = \alpha \, \mathcal{L}_{adv,enc} + (1 - \alpha) \, \mathcal{L}_{adv,dec}, \tag{15}$$

where $\mathcal{L}_{adv,enc}$ and $\mathcal{L}_{adv,dec}$ are defined as:

$$\mathcal{L}_{adv,enc} = -\log D_{enc}(G(I_{blur})), \qquad \mathcal{L}_{adv,dec} = \sum_{i,j} \big[A(-Q^{f}_{dec})\big]_{i,j}. \tag{16}$$

Here, $\alpha$ is a balancing coefficient that makes the training process effective by enabling coarse-to-fine training. As the deblurring task is a challenging and extremely ill-posed problem, it is effective to decompose it into smaller and easier sub-tasks. Thus, we divide the SFID task into two sub-tasks: learning the global face image distribution, and learning the structural features and detailed textures of the real face image. Accordingly, $\alpha$ is a scalar that decreases in proportion to the current epoch $\eta_c$:

$$\alpha = 1 - \frac{\eta_c}{\eta_t}, \tag{17}$$

where $\eta_t$ refers to the total number of epochs. In the early stages, $\mathcal{L}_{adv,enc}$ has a higher effect on $\mathcal{L}_{adv}$ than $\mathcal{L}_{adv,dec}$; thus, the generator focuses more on global consistency than on local consistency. As the training proceeds, i.e., for larger $\eta_c$, the effect of $\mathcal{L}_{adv,dec}$ increases, so that the generator focuses on local details. In this way, the generator can output detailed deblurred face images.
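Finally, the overall generator objective $\mathcal{L}_{G}$ becomes

$$\mathcal{L}_{G} = \lambda_{pixel} \mathcal{L}_{pixel} + \lambda_{feat} \mathcal{L}_{feat} + \lambda_{adv} \mathcal{L}_{adv}, \tag{18}$$

where $\lambda_{pixel}$, $\lambda_{feat}$, and $\lambda_{adv}$ are hyperparameters that are empirically set to 1, 1, and 0.06, respectively.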
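The coarse-to-fine balancing can be sketched as follows. The sketch assumes that the coefficient decays linearly from 1 to 0 over training and that the adversarial loss is a convex combination of the encoder (global) and decoder (local) terms. Both are readings of the description above rather than confirmed details, and all function names are hypothetical. Only the weights 1, 1, and 0.06 are taken from the text.

```python
def c2f_alpha(epoch, total_epochs):
    """Assumed linear schedule: 1.0 at the first epoch, 0.0 at the last."""
    return 1.0 - epoch / total_epochs


def adversarial_loss(l_adv_enc, l_adv_dec, epoch, total_epochs):
    """Blend global (encoder) and local (decoder) adversarial terms."""
    alpha = c2f_alpha(epoch, total_epochs)
    return alpha * l_adv_enc + (1.0 - alpha) * l_adv_dec


def generator_loss(l_pixel, l_feat, l_adv,
                   lam_pixel=1.0, lam_feat=1.0, lam_adv=0.06):
    """Weighted sum of reconstruction, feature-prior, and adversarial losses."""
    return lam_pixel * l_pixel + lam_feat * l_feat + lam_adv * l_adv


# Early training is dominated by the global term, late training by the local one.
for epoch in (0, 150, 300):
    print(epoch, c2f_alpha(epoch, 300), adversarial_loss(2.0, 4.0, epoch, 300))
```

With placeholder loss values 2.0 (global) and 4.0 (local), the blended value moves from the global term toward the local term as the epoch index grows.
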

IV. EXPERIMENTS
A. EXPERIMENTAL DETAILS
1) DATASETS
The training and evaluation are conducted on the MSPL dataset [18], which has been used in recent SFID studies [18], [19]. The MSPL dataset consists of a training set and a test set for face deblurring collected from various face images. The MSPL dataset is described in detail below.
• The MSPL training set consists of 24,183 pairs of blurred face images and the corresponding sharp GT face images. The GT face images are collected from the CelebAMask-HQ dataset [86], which contains pairs of high-quality (1024 × 1024 resolution) face images and corresponding segmentation label maps. Each segmentation label map is precisely and manually annotated with 19 classes, including facial components and accessories, such as the eyes, eyebrows, nose, mouth, lips, ears, hair, and skin. In practice, segmentation label maps for face image datasets can be obtained by leveraging pre-trained face parsing networks [87], [88], [89].
In [18], 18000 motion blur kernels are synthesized from random 3D trajectories, where the size of blur kernel ranges from 13
• The MSPL test set is further divided into the MSPL-Center test set and the MSPL-Random test set. The former primarily consists of images with a frontal face at the center position. The latter provides randomly rotated and/or cropped versions of the former. Each of the MSPL-Center and MSPL-Random test sets contains 240 sharp-blurry face pairs collected from CelebA [90], CelebAMask-HQ [86], and FFHQ [91].

2) IMPLEMENTATION DETAILS
We implement our model using PyTorch [92] and train it using two NVIDIA Titan Xp GPUs. During training, we adopt the Adam optimizer [93] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The learning rates of the generator and discriminator are initialized to $1 \times 10^{-4}$ and decayed exponentially by a factor of 0.99 every epoch. For every training iteration, the pairs of GT images, blurry images, and segmentation label maps are sampled with a batch size of 16. As in [17], [18], [19], random horizontal flips and random rotations are performed for data augmentation. The proposed network is trained for 300 epochs, which is sufficient for convergence.
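The learning-rate schedule described above can be written out directly. This is a minimal sketch that assumes the decay factor is applied once after each completed epoch, so that epoch 0 uses the initial rate. The function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, gamma=0.99):
    """Exponentially decayed learning rate: base_lr * gamma ** epoch."""
    return base_lr * gamma ** epoch


# Print the rate at a few points of a 300-epoch run.
for epoch in (0, 100, 200, 300):
    print(epoch, learning_rate(epoch))
```

Under this assumption the rate falls by roughly a factor of twenty over 300 epochs, ending slightly below 5e-6.
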

3) EVALUATION METRICS
For the quantitative evaluation, we employ the following perceptual image quality assessment metrics: the identity distance ($d_{ARC}$) [94] between the GT and restored face images using the pre-trained ArcFace [95] embedding vector, the feature distance ($d_{VGG}$) of the pre-trained VGGFace [84] to measure the similarity of the facial identity between the GT and deblurred images, and the learned perceptual image patch similarity (LPIPS) [96] for perceptual quality. Note that smaller values of $d_{ARC}$, $d_{VGG}$, and LPIPS indicate higher consistency with the GT face image. Moreover, we report widely used image quality assessment metrics, namely the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM) [97].
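Of the metrics above, PSNR has a simple closed form that can be sketched without any dependencies. The sketch uses the standard definition, 10 log10(MAX² / MSE), and assumes 8-bit intensities (MAX = 255) given as flat lists. The exact evaluation protocol used in the paper (color space, value range) is not stated, so those choices are assumptions.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Every pixel off by one intensity level gives an MSE of 1.
print(psnr([10.0, 20.0, 30.0], [11.0, 21.0, 31.0]))
```

Higher PSNR means a smaller pixel-wise error, which is why it is reported alongside, not instead of, the perceptual metrics.
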

B. COMPARISONS ON MSPL DATASET
We compare the proposed SAPPGAN with state-of-the-art deblurring models, including general models [57], [60] and face models [16], [17], [18], [19], [61], [85]. For general deblurring models [57], [60], which are originally trained on natural scenes, we report additional results using models retrained on the MSPL training set. As existing face deblurring models [16], [17], [18], [19] are trained on different training sets and/or synthetic blur kernels, we also retrained them on the MSPL training set for a fair comparison. Throughout this experimental section, models retrained using the MSPL training set are marked with *. The official models of [18], [19] are not retrained because they are already trained on the MSPL training set. All experiments are conducted with the official code provided by the authors. Note that we did not re-implement and retrain the model in [16] because the official training code has not yet been released. Table 1 reports the quantitative evaluation results on the MSPL-Center and MSPL-Random test sets. The proposed SAPPGAN outperforms the state-of-the-art methods in terms of perceptual metrics, such as LPIPS, $d_{VGG}$, and $d_{ARC}$. Importantly, our proposed SAPPGAN achieves significant improvements in perceptual metrics over recent GAN-based SFID methods [16], [18], [19] that were developed to restore perceptually satisfactory images. In contrast to GAN-based SFID methods whose objective function primarily focuses on making a global decision using the data distribution of only sharp face images, the proposed SAPPGAN estimates the joint probability of the sharp face images and the semantic label maps of the faces and is able to provide both pixel-level and global feedback to the generator. With this capability, the proposed SAPPGAN is able to restore images of high perceptual quality.
The perceptual improvement of SAPPGAN is also noticeable in visual comparisons on the MSPL-Center (see Fig. 3) and MSPL-Random test sets (see Fig. 4). The resulting images of [17], which is not based on GANs, appear overly smooth and lack sharp details. Moreover, GAN-based models (i.e., Lee et al. [18], Jung et al. [19], and our SAPPGAN) outperform other methods in restoring realistic facial details. Among them, the proposed SAPPGAN significantly improves image quality with fine details and realistic textures. Specifically, SAPPGAN restores the main components (i.e., the eyes, nose, mouth, and ears) of the face with high-fidelity textures (see the 3rd, 4th, and 8th rows in Fig. 3 and the 1st, 3rd, and 7th rows in Fig. 4 for eyes; the 1st, 6th, and 7th rows in Fig. 3 and the 1st, 2nd, and 3rd rows in Fig. 4 for nose; the 3rd, 6th, and 7th rows in Fig. 3 and 3 Fig. 4 for beard). These semantic-aware deblurred results are attributed to the powerful and detailed feedback from the SAPP discriminator.

C. COMPARISONS ON REAL BLURRED IMAGES
Most existing SFID methods [16], [17], [18], [19], [61], including the proposed SAPPGAN, are trained with datasets that are degraded by synthetic blur kernels. However, SFID in real-world scenarios must consider more complex degradation factors, such as motion blur, sensor saturation, lens distortion, nonlinear transform functions, noise, and compression [39]. We conduct experiments on the real-world blurred images provided by [16], [39] to demonstrate the generalization capability of our proposed method on the real-world SFID task. Since GT images do not exist for the real-world blurred images, only qualitative results are compared with competitive SFID methods [16], [17], [18], [19], [61], as shown in Fig. 5. The results of [16], [17], [61] are relatively smooth, whereas those of [18], [19] are sharper. Compared with [18], [19], our method improves the restoration of the fine details and rich textures of the face, which we attribute to the feedback from the proposed SAPP discriminator.

D. EXECUTION TIME AND FACE VERIFICATION
Considering that SFID can be used in the preprocessing step of high-level face-related vision tasks (e.g., face recognition [5], [6], [7], [8]), SFID methods must enable the accurate recovery of the identity of the GT face. Therefore, we report the performance of face verification using deblurred images on the CelebA test set provided by [16]. For a fair comparison, we follow the evaluation setting in [19] and measure the estimated mean accuracy (Acc) [98]. In addition, we compare the inference time and the number of model parameters of the existing methods and the proposed model. Following [16], [18], [19], the average inference time for 10 images is reported using a single NVIDIA Titan Xp GPU. The spatial size of each image is 128 × 128 × 3.
The experimental results are shown in Table 2. Comparing the verification Acc on the GT images and blurred images shows that Acc is degraded from 93.47% to 77.05% by blur artifacts. The Acc of the deblurred images using our method is 90.64%, which is the closest to the Acc of the GT images among the compared methods. In addition, Table 2 shows that our method maintains the number of parameters and inference time of the original model from [19]. This demonstrates that our SAPP discriminator and training method can be easily applied to other GAN-based SFID models, and that they improve the reconstruction quality of the deblurring network without adding parameters or inference time.

E. ABLATION STUDY
In this section, we conduct an ablation study to verify the effect of each component in our approach. Table 3 shows the configuration of each experiment and its quantitative results. Specifically, the baseline model, termed S0 in Table 3, uses the generator architecture of [19] and is trained with the sum of $\mathcal{L}_{pixel}$ (Eq. (13)) and $\mathcal{L}_{feat}$ (Eq. (14)) without adversarial loss. S1 is a model trained using the adversarial loss of the original U-Net discriminator [24], [25] in addition to the loss function of the S0 model. Compared to S0, the performance of S1 is improved in terms of $d_{ARC}$ and LPIPS by |0.4909 − 0.2844|/0.4909 = 42.07% and |0.0950 − 0.0879|/0.0950 = 7.47%, respectively. This verifies the effectiveness of per-pixel adversarial loss in the image deblurring task. When the proposed PW loss is involved (S2), our discriminator is trained to rapidly focus on difficult and misclassified examples. This allows the discriminator to be trained more accurately. Thus, the discriminator can provide more accurate adversarial feedback to the generator during training. This boosts the performance of S2 in terms of $d_{ARC}$ and LPIPS by |0.2844 − 0.2645|/0.2844 = 7.00% and |0.0879 − 0.0825|/0.0879 = 6.14%, respectively, compared to S1. The effectiveness of incorporating the SAPP discriminator instead of the U-Net discriminator is shown in S3 in Table 3. The results of S3 show that $d_{ARC}$ is improved by |0.2844 − 0.2613|/0.2844 = 8.12% and LPIPS is improved by |0.0879 − 0.0824|/0.0879 = 6.26%, compared to S1. These improvements demonstrate that our key concept, i.e., conditional image restoration by forcing the discriminator network to estimate pixel-wise semantic-aware probability, is effective in face deblurring tasks. The results of S4 in Table 3 indicate that using the PW loss and the SAPP discriminator together yields better performance than using them separately. The $d_{ARC}$ and LPIPS values are improved by |0.2613 − 0.2561|/0.2613 = 1.99% and |0.0824 − 0.0816|/0.0824 = 0.97%, respectively, compared to those of S3. The final model of our method (S5 in Table 3) outperforms S4 by |0.2561 − 0.2543|/0.2561 = 0.70% and |0.0816 − 0.0799|/0.0816 = 2.08% in terms of $d_{ARC}$ and LPIPS, respectively. These results demonstrate the effectiveness of the proposed coarse-to-fine training scheme (denoted as C2F), which allows our generator to focus first on the global consistency of the restored image and then on the local consistency.
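The relative improvements quoted in this ablation can be recomputed from the metric values given in the text. The short script below does that. The dictionaries simply transcribe the quoted identity distance and LPIPS values for S0 to S5, and the helper name `relative_improvement` is hypothetical.

```python
def relative_improvement(before, after):
    """Relative change |before - after| / before, expressed in percent."""
    return abs(before - after) / before * 100.0


# Metric values as quoted in the ablation discussion (lower is better).
d_arc = {"S0": 0.4909, "S1": 0.2844, "S2": 0.2645,
         "S3": 0.2613, "S4": 0.2561, "S5": 0.2543}
lpips = {"S0": 0.0950, "S1": 0.0879, "S2": 0.0825,
         "S3": 0.0824, "S4": 0.0816, "S5": 0.0799}

# (baseline, variant) pairs compared in the text.
pairs = [("S0", "S1"), ("S1", "S2"), ("S1", "S3"), ("S3", "S4"), ("S4", "S5")]

for base, new in pairs:
    arc = relative_improvement(d_arc[base], d_arc[new])
    lp = relative_improvement(lpips[base], lpips[new])
    print(f"{base} -> {new}: d_ARC {arc:.2f}%, LPIPS {lp:.2f}%")
```

Recomputing the ratios this way is a quick check that each percentage uses the intended baseline value in its denominator.
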

V. CONCLUSION
This paper presents a semantic-aware pixel-wise projection (SAPP) GAN, a novel GAN-based framework for single face image deblurring. The proposed SAPP discriminator is designed to incorporate a label matching (conditional) distribution into an image (marginal) distribution using a pixel-wise projection technique. This approach enables our discriminator to focus on the realness of the restored face by taking into account semantically important information. Furthermore, our SAPP discriminator provides global (per-image) and local (per-pixel) feedback to the generator by adopting a U-Net-like architecture. In addition, our discriminator can be trained more accurately with the proposed PW loss, which dynamically weights the incorrect predictions of the discriminator on a pixel-by-pixel basis. The generator is effectively trained through the proposed coarse-to-fine training technique to balance adversarial feedback between the global and local decisions of the discriminator. Overall, the proposed SAPPGAN improves on recent face image deblurring methods in terms of perceptual image quality. We believe that our SAPPGAN framework can be applied to various fields of face image restoration.