A Unified Framework From Face Image Restoration to Data Augmentation Using Generative Prior

In the domain of data enhancement, image restoration and data augmentation are two tasks gaining increasing attention. Current image restoration models focus on improving clarity using pre-trained generative models, while data augmentation methods generate new samples with the help of generative models. Although related, these two topics have long been studied completely separately. We propose a downstream-friendly restoration framework for face images, based on pre-trained generative models and equipped with data augmentation capability. We carefully design our framework to achieve high fidelity while inheriting the generation ability of the pre-trained generator. To achieve this goal, we use a modified U-Net to predict biases of the latent codes and feature maps that guide the generator. We further propose linear interpolation as an approach to enriching datasets for downstream tasks, especially class-imbalanced ones. The effectiveness of our method is demonstrated through experiments on three datasets and one downstream task.


I. INTRODUCTION
With the growing popularity of deep learning in tasks such as face recognition and fatigue monitoring, there is an increasing demand for large numbers of high-quality face images. However, face images obtained in real-world scenarios often come in low quality because of limited storage space, cost constraints of imaging hardware, and complex lighting environments. Optimal performance on downstream tasks depends on the provision of large, high-quality image datasets. Thus, face image restoration and data augmentation have long been spotlight topics in computer vision. Image restoration aims to upscale low-quality images to high-quality images with enhanced details. Data augmentation aims to increase the amount of data with slightly altered original data or new data generated from the original data. These two domains improve datasets for downstream tasks in quality and quantity, respectively. With the increasing effectiveness of deep generative models such as the Variational Auto-Encoder (VAE) [1] and the Generative Adversarial Network (GAN) [2], generative models are widely used in both image restoration and data augmentation. Nevertheless, the two fields have long developed independently.
Compared to general image restoration frameworks, face restoration methods benefit from incorporating domain-specific prior knowledge. This prior knowledge can usually be classified into three categories: facial structural information, attribute information, and reference images, provided in the form of image and text annotations, which are expensive to obtain. As a remedy, recent face restoration methods have started to adopt pre-trained generative models as generative priors. An additional advantage of these methods is that they are robust to different downsampling operators [4], [5]; in other words, they are suitable for blind image restoration and various scenarios.
Although these methods generate realistic images, the gap between the generative prior and the inputs is still a tough nut to crack, so they often suffer from low fidelity.

FIGURE 1. Panels (a) to (c) illustrate three methods of using a pre-trained StyleGAN [3]. The arrows indicate external guiding information for generation. Panel (d) illustrates linear interpolation for data augmentation. The blue crosses represent samples from a ''big class'' (a class with a large number of samples), the green circles represent samples from a ''small class'' (a class with few samples), and the green rings represent new samples we add to the dataset. We create new samples along the lines connecting two existing samples to fill the sparse input space.

To mitigate this
problem, many researchers have developed different kinds of methods to correct the outputs of the pre-trained generator. One straightforward idea is to introduce a post-processor to adjust the outputs: GLEAN [6] places a decoder after the pre-trained model to process its outputs. Another popular idea is to guide the generator during generation: HyperStyle [7] alters the weights of the generator, and GFP-GAN [8] manipulates the feature maps of the generator, as depicted in Fig. 1 (a) and (b). Despite the great improvement in fidelity, these methods still sacrifice correctness for high resolution. For example, GFP-GAN uses local discriminators for the eyes and mouth to enrich details, but may generate unreal eyes and mouths when the input face image has an exaggerated expression. The model hence fails in many real-world scenarios, propagating the failure to downstream tasks as well.
More importantly, as image restoration methods, the studies mentioned above cannot generate new samples for downstream tasks. Nevertheless, for deep learning models, the amount of training data is equally crucial. Unlike generative models, image restoration methods output a fixed result for a given input image and cannot generate new images, even though the pre-trained generative models they use possess such potential. For example, many researchers have found that linear interpolation of GAN latent codes can be used to generate realistic images. Therefore, we propose to add new samples while restoring images by means of linear interpolation. However, previous face restoration methods designed complex frameworks in pursuit of restored image quality, thus losing the ability to interpolate linearly. The significant non-linearity in the network does not guarantee continuity for linearly interpolated samples, so the outcome usually appears unnatural and unsuitable for downstream tasks. As shown in Fig. 1, methods of type (a) typically manipulate feature maps using addition and multiplication or geometric transformations, and methods of type (b) permanently change the nature of the generative model. The operations they conduct on weights or feature maps are irreversible and not applicable to images other than the input one, so it is impossible to interpolate the changes linearly.
On the other hand, data augmentation with image processing methods cannot bring new information. The traditional approach to augmenting data using generative models is conditional generation with class labels, which requires training a conditional generator on the existing training data [9], [10]. Another way of augmenting data with pre-trained generative models runs the risk of introducing data styles from the pre-training datasets. We believe that it is more reliable and convenient to generate data by linear interpolation between two images of the same class. As shown in Fig. 1 (d), linear interpolation can fill the input space when there are insufficient samples. This method increases the diversity of samples more than traditional geometric transformations when the samples are sparse, and there is no risk of randomly introducing samples whose style differs from the original samples.
To address these drawbacks, we propose a downstream-friendly face restoration model. Our model prioritizes fidelity and correctness when using generative priors. We propose a new framework in the line of pre-trained GAN-based image restoration methods. Unlike most restoration methods that pursue image clarity, our method takes both fidelity and the ability to augment data into consideration. As shown in Fig. 1 (c), we propose to adjust the generated content of the pre-trained model with biases only. We adopt an existing latent inversion model to produce a latent code for a coarse reconstruction. To make the fixed generator generate an image different from the initial reconstruction, we propose a Difference Extractor module and a set of mappers to change both the initial latent codes and the feature maps of the generator (shown in Fig. 2). The Difference Extractor module takes in both the low-quality input and the initial reconstruction and encodes their difference into a set of guidance information at various scales. After that, we introduce mappers to map the feature maps of the extractor into the feature space and latent space of the generator. Finally, the generator outputs the restoration with the help of the latent bias and feature biases. For data augmentation, we interpolate two samples to generate new samples for downstream tasks. We demonstrate in experiments that our model has higher fidelity than state-of-the-art (SOTA) methods and can generate reasonable new samples, which is beyond the capacity of the SOTA methods.
In summary, our key contributions are threefold:
• We design a new method based on a pre-trained generative model capable of delivering image restoration and data augmentation jointly. The framework uses latent codes, latent biases, and feature biases to create interfaces for data augmentation.
• We demonstrate that our method is comparable to the SOTA methods in terms of restoration performance through extensive experiments on both synthetic datasets and low-quality datasets in the wild.
• We show that our method is able to create high-quality synthetic samples by interpolating latent codes. That is to say, we improve not only the quality of the data but also its quantity. To our knowledge, we are the first to propose this method of two-fold data enhancement.

II. RELATED WORKS
A. FACE RESTORATION
Since the rise of GANs, this technology has quickly been applied to the field of face restoration. In this section, we review some popular GAN-based methods and the prior knowledge they use. Methods such as SeRNet [11] and CAGFace [12] utilize face parsing segmentation maps in addition to the original image to help the model learn the structural information of human faces. In addition to the generative model, these methods require a pre-trained face structure extraction network and a loss that measures the differences between the annotated face segmentation map and the segmentation map of the generated image. At the same time, the annotated face segmentation map is fed into the generation network together with the original image. Super-FAN [13] uses face heatmaps as auxiliary information and narrows the gap between the heatmaps of the generated and target images as an auxiliary task. Yin et al. [14] proposed a multi-task learning method to combine the two tasks of face structure extraction and face restoration. FSRNet [15] and some of its follow-up studies [16], [17] perform these two tasks alternately: they first upsample the low-quality image to obtain an intermediate result, then extract the face structure from this intermediate result. Finally, the generative network outputs the final result, taking the combination of the face structure and the intermediate result as input. These methods fully apply prior knowledge of face structure to restoration models, at the cost of expensive human annotation.

B. PRE-TRAINED GAN BASED IMAGE RESTORATION
The previous methods are based on upsampling from the low-quality input (a bottom-up approach): they find, among the images that are semantically similar to the low-quality image, the result that best matches the distribution of high-quality images. These methods are limited by the insufficient information provided by low-quality inputs. The excellent performance of BigGAN [18] and StyleGAN [3], [19], [20], [21] provides researchers with new ideas for image restoration tasks. Using these pre-trained generative models, the new methods find, among high-quality images, the result that is most semantically similar to the low-quality input (a top-down approach). The high-quality images generated by this approach have more realistic details. PULSE [4] first proposed this idea. Instead of the traditional reconstruction loss between the generated image and an annotated high-definition image, PULSE inputs random noise into StyleGAN and calculates the reconstruction loss between the downsampled images generated by StyleGAN and the low-quality inputs. GLEAN [6] goes a step further and proposes an encoder-latent bank-decoder architecture, treating the pre-trained generative adversarial network as a pool of generative priors. The authors of GFP-GAN [8] realized that using only latent codes to control the generated results of the pre-trained model leads to low fidelity, so they propose a method in which the encoder simultaneously outputs offsets to the feature maps of the pre-trained model and directly changes the feature maps during generation. GPEN [5] uses an encoder to map the original image to latent codes in the Z space and inputs these latent codes into the pre-trained generator to output a high-definition image. Unlike most methods, which do not change the pre-trained generator directly, GPEN fine-tunes the pre-trained generator during training. Panini-Net [22] interpolates input image features into generator features to adapt to degradation dynamically. The results generated by such methods are realistic and of high quality, but can be severely distorted when there are large stylistic differences from the pre-training data.

C. USE OF PRE-TRAINED GANs
As can be seen from the previous section, the main problem of pre-trained GAN-based image restoration methods is how to precisely control the generated content of the pre-trained GAN, which has been fully discussed by GAN researchers. We can divide this task into two parts: restoring the latent codes of the GAN and controlling other components of the GAN. There are usually two ways to restore latent codes. Studies such as PULSE and Image2StyleGAN [23], [24] use loss optimization, while methods such as pSp [25], e4e [26], and E2Style [27] train an encoder on a large number of samples.
What the second part of the task should control is an open question. HyperStyle and GPEN alter the weights of the pre-trained generator. As described earlier, GPEN changes the parameters once and for all by fine-tuning, whereas HyperStyle [7] indirectly changes the parameters through a hypernetwork that customizes a set of weights for the pre-trained generator for each input image during inference. As a GAN inversion method, HyperStyle retains some of the excellent properties of the pre-trained StyleGAN, such as the ability to edit images. Another class of methods, represented by StyleHEAT [28], controls the intermediate feature maps of the generator. The authors of StyleHEAT conducted detailed research on the intermediate feature maps of StyleGAN. They found that geometric transformations applied to the feature maps are preserved in the generated images, which proves that controlling the feature maps is a shortcut to controlling the generated images. Compared with altering weights, this approach saves much computational cost.

FIGURE 2. Overview of our method. We propose to use a Difference Extractor (shown with blue boxes and arrows) and mappers to encode a latent bias and feature biases that guide the pre-trained generator. The Difference Extractor takes in both the input image and an initial reconstructed image and outputs features, which are then mapped to a latent bias and feature biases by the corresponding mappers (denoted with yellow and orange arrows). The latent bias mapper is implemented as a one-layer MLP and the feature bias mappers are implemented as 1*1 convolution layers. Finally, the biases are added to the initial latent codes and feature maps of StyleGAN to change the output. The details of the convolution blocks of the Difference Extractor are shown in the blue box on the right side. The ''resample'' convolution layer conducts ''downsample'' in the encoder and ''upsample'' in the decoder by altering stride and padding size.

D. IMAGE DATA AUGMENTATION
In addition to increasing image quality, image restoration models are often used to improve the performance of downstream tasks by improving the quality of datasets. Most image restoration models, although based on generative models, lack the ability to add samples to datasets. In the field of computer vision, the commonly used image data augmentation methods are basic image operations, such as geometric transformations, flipping, color perturbation, cropping, adding noise, etc. [29]. Object detection and other fields have specially designed augmentation methods, such as CutMix [30], for their own tasks. GAN-based data augmentation has proven very effective in some downstream fields. Some researchers [31], [32], [33] use GANs to generate new samples or features, and others [9], [34], [35] use GANs to change the category of images. In the experiments section, we demonstrate that our method can generate new samples effectively for downstream datasets and significantly bolster the performance of downstream tasks.

III. PROPOSED METHOD
A. OVERVIEW
The image restoration task is an image-to-image translation task, which can be written mathematically as follows: given a low-quality image $I_{LQ}$, the problem is to find a model $F$ with parameters $\theta$ that maps $I_{LQ}$ to a high-quality image $I_{HQ}$:
$$I_{HQ} = F(I_{LQ}; \theta) \quad (1)$$
In generative prior based methods, there is a fixed generator $G$ in the model. Specifically, we use StyleGAN2 [19] trained on a high-quality face dataset as the prior generator, as it is one of the best-performing generative models. In order to effectively control the image content generated by StyleGAN2 and preserve the applicability of linear interpolation, our use of StyleGAN follows two principles. First, we avoid directly changing StyleGAN itself. Second, we only modify latent codes and convolutional feature maps with linear operations. The process of image restoration consists of four steps, and the framework of our method is depicted in Fig. 2. First, we use a fixed pSp [25] encoder to generate the initial inverted latent codes $\hat{w}_0$ from the input $I_{LQ}$. Given the initial latent codes, we produce an initial reconstruction $I_{HQ,0}$ with StyleGAN2. Second, we stack the two images together as $[I_{LQ}, I_{HQ,0}]$ and feed them into the Difference Extractor. The Difference Extractor maps the inputs into a set of guiding information and a latent bias $B_w$. Then, we use the Feat Bias Mappers to convert the guiding information into biases of feature maps. At the same time, we use the latent bias to adjust the initial latent codes. Finally, we use StyleGAN2 again to generate a high-quality image with the help of the adjusted latent codes and the guiding information, obtaining the new output $I_{HQ}$. The whole process can be briefly represented as:
$$\hat{w}_0 = E(I_{LQ}), \quad I_{HQ,0} = G(\hat{w}_0), \quad (B_w, B_f) = DE([I_{LQ}, I_{HQ,0}]), \quad I_{HQ} = G(\hat{w}_0 + B_w, B_f) \quad (2)$$
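To make the four steps concrete, here is a minimal PyTorch-style sketch of the pipeline. The module interfaces are hypothetical stand-ins, not released code: `psp_encoder`, a `stylegan2` generator accepting a `feature_biases` argument, and a `diff_extractor` exposing pooled and multi-scale outputs are all assumptions, as is resizing the reconstruction before stacking.

```python
import torch
import torch.nn.functional as F

def restore(i_lq, psp_encoder, stylegan2, diff_extractor,
            latent_bias_mapper, feat_bias_mappers):
    """Four-step restoration pipeline (sketch with hypothetical interfaces)."""
    # Step 1: invert the low-quality input to initial latents w_0 and
    # render a coarse reconstruction with the frozen generator.
    w0 = psp_encoder(i_lq)                        # (B, n_latents, 512)
    i_hq0, _ = stylegan2(w0)                      # initial reconstruction

    # Step 2: stack input and coarse reconstruction along the channel
    # axis; resizing to the input resolution is our assumption.
    i_hq0_small = F.interpolate(i_hq0, size=i_lq.shape[-2:])
    feats = diff_extractor(torch.cat([i_lq, i_hq0_small], dim=1))

    # Step 3: map the differences to one latent bias and per-scale
    # feature biases ("feats" is assumed to expose both views).
    b_w = latent_bias_mapper(feats["pooled"])     # one-layer MLP
    b_f = [m(f) for m, f in zip(feat_bias_mappers, feats["maps"])]

    # Step 4: rerun the frozen generator; the latent bias shifts w_0 and
    # each feature bias is added to the matching intermediate feature map.
    i_hq, _ = stylegan2(w0 + b_w, feature_biases=b_f)
    return i_hq
```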

B. DIFFERENCE EXTRACTOR
The Difference Extractor is designed to extract the difference between the input $I_{LQ}$ and the initial reconstruction $I_{HQ,0}$ and to output guiding information for the generator. To serve this purpose, the Difference Extractor is composed of an encoder and a decoder. The encoder is responsible for information extraction, and the decoder learns to adapt the refined low-dimensional features to the generator's style.

1) ENCODER
Inspired by HyperStyle [7], [36], we feed both the initial reconstruction and the input image to the encoder. This design has several benefits. First of all, the encoder is spared from memorizing the whole content of the input by learning the differences between the two images. What's more, the input $I_{LQ}$ contains the content information and $I_{HQ,0}$ contains the prior information. By examining the two images together, the encoder is able to obtain prior information without interacting with the generator. Considering that the ultimate goal of the Difference Extractor is to guide the generator, the distilled prior represented by $I_{HQ,0}$ helps the encoder achieve this goal well.

2) DECODER
The decoder part of the Difference Extractor aims to adapt the difference features to the style of the generator feature maps. StyleGAN2 is a stack of convolution blocks; we take the feature maps to be the output of the first convolution layer of every convolution block. The shape of the feature maps varies among blocks; we denote the feature map of the $i$-th block as $F_i \in \mathbb{R}^{h_i \times w_i \times c_i}$. We design the decoder with a structure similar to StyleGAN2 but with fewer channels to save computational cost.

3) ARCHITECTURE
From the perspective of model structure, we adopt the widely used U-Net [37] as the Difference Extractor. However, the size of the input image is 256*256 while the maximum size of the generator feature map is 1024*1024, so we attach two extra convolution blocks to the decoder. The U-Net convolution block we use is shown in the right part of Fig. 2 and consists of two branches. One branch has two 3*3 convolution layers, each followed by an activation layer. The other branch has a 1*1 convolution layer without activation. Finally, the two branches are merged by adding their feature maps. Like the classic U-Net, our model reduces information loss with skip connections from the encoder to the decoder.
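A sketch of this two-branch block in PyTorch, following the description above; the channel counts and the choice of LeakyReLU as the activation are assumptions, since the text does not name them:

```python
import torch
import torch.nn as nn

class UNetConvBlock(nn.Module):
    """Two-branch convolution block (sketch).

    Branch 1: two 3x3 convolutions, each followed by an activation.
    Branch 2: a single 1x1 convolution without activation.
    The branches are merged by element-wise addition.
    """
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),  # activation choice is an assumption
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.LeakyReLU(0.2),
        )
        self.branch2 = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.branch1(x) + self.branch2(x)
```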

C. LATENT BIAS
Many researchers [38], [39] have shown that latent codes dominate the content that StyleGAN generates; hence, it is necessary to revise the initial latent codes $\hat{w}_0$ obtained by the pSp encoder. We propose to modify the latent codes by adding biases to the initial latent codes. Our experiments show that this method can effectively modify the image content without destroying the latent space of the pre-trained StyleGAN.
We infer the revision of the latent codes from the concentrated information extracted by the Difference Extractor. We denote the features output by the encoder as $F_{DE}$; it is a concise representation of the difference. We map $F_{DE}$ to the dimension of the latent code with a one-layer MLP to get the latent bias $B_w$. $B_w$ is added to the initial latent codes $\hat{w}_0$ to form the new latent code $\hat{w}$:
$$\hat{w} = B_w + \hat{w}_0 \quad (3)$$
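As a sketch, the Latent Bias Mapper can be a single linear layer applied to the pooled Difference Extractor features; the feature dimension and the number of W+ style vectors below are illustrative assumptions (512 is the StyleGAN2 per-layer latent width):

```python
import torch
import torch.nn as nn

# One-layer MLP mapping pooled Difference Extractor features F_DE to a
# bias on the W+ latent codes. feat_dim and n_latents are illustrative.
feat_dim, n_latents = 512, 18
latent_bias_mapper = nn.Linear(feat_dim, n_latents * 512)

f_de = torch.randn(1, feat_dim)                          # pooled F_DE
b_w = latent_bias_mapper(f_de).view(1, n_latents, 512)   # latent bias B_w
w0 = torch.randn(1, n_latents, 512)                      # initial latents
w_hat = w0 + b_w                                         # Eq. (3)
```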

D. FEATURE BIAS
Previous research [7], [28] has proved that, given latent codes, changing the weights and changing the intermediate feature maps of StyleGAN are both effective ways to correct the output images. We tested the effects of the two methods separately and found that changing the weights of the pre-trained generator can generate realistic images without strange distortions, but it has limited ability to reconstruct the image content.
On the other hand, changing the feature maps can greatly change the image content, but it is also prone to generating unreal ghosting artifacts. Considering the nature of the image restoration task and the impact on downstream tasks, we decided to adopt the approach of modifying the feature maps to achieve better reconstruction accuracy; the shortcomings can be compensated for by suitable training methods and loss functions. As a result, we propose a series of Feat Bias Mappers to map the output of the Difference Extractor to the size of the generator feature maps. We denote the three dimensions of a feature map as h (height), w (width), and c (number of channels). According to the design of the Difference Extractor, the guidance information it outputs and the feature map of StyleGAN have the same h and w but a different c in each convolution block. We use $c_{DE}$ to represent the number of channels output by the Difference Extractor, and $c_G$ to represent the number of channels of the StyleGAN feature map. Both GFP-GAN and StyleHEAT apply a Spatial Feature Transformation (SFT) to the feature map, which can be expressed as:
$$F_{out} = \gamma \odot F_{in} + \beta$$
If we used SFT, the modification of the feature map would require both addition and multiplication operations. Although it can correct the values of the feature map to a greater extent, it is beyond the scope of linear operations. In experiments, we found that using only the bias provides good enough reconstruction results, so our model omits the prediction of the multiplier. Additionally, we limit the number of bias channels to 1 instead of extending it to $c_G$ to reduce computation and limit model complexity. It should be emphasized that we deliberately choose a simplified design to trade part of the model's ability to generate details for more flexibility in data augmentation. Therefore, the Feat Bias Mapper maps the guidance information matrix of shape $h \times w \times c_{DE}$ to a feature map bias of shape $h \times w \times 1$. We implement the Feat Bias Mapper with a 1*1 convolution layer. We illustrate the changes in the feature maps at every stage of our method in Fig. 3.
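A sketch of a single Feat Bias Mapper under the straightforward reading of the text: a 1*1 convolution from $c_{DE}$ channels to one bias channel, broadcast-added across the generator's $c_G$ channels. The channel count is illustrative:

```python
import torch
import torch.nn as nn

# 1x1 convolution mapping guidance of shape (B, c_DE, h, w) to a
# single-channel bias of shape (B, 1, h, w). c_DE is illustrative.
c_de = 64
feat_bias_mapper = nn.Conv2d(c_de, 1, kernel_size=1)

def apply_bias(gen_feat: torch.Tensor, guidance: torch.Tensor) -> torch.Tensor:
    """Add the predicted bias to a StyleGAN feature map (sketch).

    The single bias channel broadcasts across the generator's c_G
    channels, the deliberate simplification described above.
    """
    b_f = feat_bias_mapper(guidance)   # (B, 1, h, w)
    return gen_feat + b_f              # broadcast over channels
```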

E. DATA AUGMENTATION
When constructing a dataset, the quantity and quality of data are of comparable importance. Besides face image restoration, our method provides a convenient way to generate new samples for downstream tasks. Benefiting from the concise model structure, our model provides an interface for linear operations for image generation. As long as we control the latent biases and feature biases obtained during restoration, we can generate new samples through linear interpolations.
The presence of rich semantic information in the latent space of GANs has been proved by many works [38], [39]. Researchers observed that the appearance and semantics of GANs' synthetic results change continuously when linearly interpolating from one latent code $w_A$ to another latent code $w_B$. Although many methods are based on a pre-trained GAN, they fail to inherit this nice property. We propose to generate new samples by linear interpolation on the intermediate results. Specifically, we save the initial latent codes $\hat{w}_0$, the latent bias $B_w$, and the feature biases $B_f$ during restoration.
These saved quantities define an extraction step:
$$\hat{w}, B_f = FE(I_{LQ})$$
where $FE$ stands for the extraction part of the model and $G$ stands for the generative model. For inputs $I_A$ and $I_B$, we get two sets of latent codes and biases, $(\hat{w}_A, B_{f,A})$ and $(\hat{w}_B, B_{f,B})$. We generate new samples by feeding the linearly interpolated latent codes and biases into the generative model:
$$\hat{w}_{new} = \alpha \hat{w}_A + (1-\alpha)\hat{w}_B, \quad B_{f,new} = \alpha B_{f,A} + (1-\alpha)B_{f,B}, \quad I_{new} = G(\hat{w}_{new}, B_{f,new})$$
where $\alpha$ is a linear interpolation coefficient between 0 and 1.
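A sketch of the interpolation step, reusing the hypothetical generator interface from the pipeline sketch in Section III-A; only linear arithmetic on the saved latents and biases is involved:

```python
def interpolate_sample(w_a, bf_a, w_b, bf_b, alpha: float, stylegan2):
    """Generate a new sample between two restored images (sketch).

    w_a, w_b:   biased latent codes (initial latents + latent bias).
    bf_a, bf_b: lists of per-scale feature biases saved at restoration.
    alpha:      interpolation coefficient in [0, 1].
    """
    w_new = alpha * w_a + (1.0 - alpha) * w_b
    bf_new = [alpha * a + (1.0 - alpha) * b for a, b in zip(bf_a, bf_b)]
    # The frozen generator accepts feature biases in this sketch's
    # hypothetical interface (see the pipeline sketch above).
    i_new, _ = stylegan2(w_new, feature_biases=bf_new)
    return i_new
```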

F. LOSS FUNCTIONS
Like most pre-trained GAN-based image restoration methods, our loss function consists of three parts: 1) a reconstruction loss, which makes the output close to the target $y$; 2) an ID loss, which lets the model learn the identity of the input face; and 3) an adversarial loss, which helps the model output natural, high-quality pictures.

1) RECONSTRUCTION LOSS
The reconstruction loss consists of two parts, the L2 loss and the LPIPS loss proposed by Zhang et al. [40]. Both of these losses are widely used in the field of image restoration. The pixel-level L2 loss is the most direct measure of image similarity. To compensate for the insensitivity of the L2 loss to blurring, the LPIPS loss measures the reconstruction quality in a way that is closer to human perception. These two losses can be represented as:
$$L_{L2} = \|\hat{y} - y\|_2^2, \qquad L_{LPIPS} = \sum_i \|\phi_i(\hat{y}) - \phi_i(y)\|_2^2$$
where $\hat{y} = F(I_{LQ}, I_{HQ,0}; \theta)$ is the output of the model and $\phi_i(\cdot)$ is the feature map of the $i$-th convolution block of a pre-trained AlexNet [41].

2) ID LOSS
ID loss is a commonly used loss in the field of face restoration. It is similar to the LPIPS loss in that it uses a pre-trained face recognition network to imitate human annotation. We choose the pre-trained ArcFace [42] model to score the outputs. However, unlike the usage of the LPIPS loss, we use the feature map of the last layer of ArcFace and take the dot product of the feature maps of the two images as a similarity measure:
$$L_{id} = 1 - \langle \eta(\hat{y}), \eta(y) \rangle$$
where $\eta(\cdot)$ denotes the last-layer ArcFace features.

3) ADVERSARIAL LOSS
Adversarial loss is the loss used by generative adversarial networks, proposed by Goodfellow et al. [2]. This loss measures the distance between the probability distribution of the generated data and that of the real data, allowing the model to generate more natural pictures. At the same time, since our training data are clear, high-quality face images, this loss also pushes the model to generate images full of realistic details. We adopt an implementation similar to StyleGAN, applying R1 regularization [43] to the discriminator to stabilize training:
$$L_{adv} = \mathbb{E}_{\hat{y}}\big[\mathrm{softplus}(-D(\hat{y}))\big], \qquad L_{D} = \mathbb{E}_{\hat{y}}\big[\mathrm{softplus}(D(\hat{y}))\big] + \mathbb{E}_{y}\big[\mathrm{softplus}(-D(y))\big] + \frac{\gamma}{2}\, R1(D)$$
where $R1(\cdot)$ denotes R1 regularization.
To sum up, the loss we use for the model is a weighted sum of the three losses:
$$L = \lambda_{rec} L_{rec} + \lambda_{id} L_{id} + \lambda_{adv} L_{adv} \quad (13)$$
where $\lambda_{rec}$, $\lambda_{id}$, and $\lambda_{adv}$ are the weights of the three losses, set as fixed hyper-parameters in our experiments.
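A hedged sketch of the full objective in PyTorch. The `lpips` package is Zhang et al.'s public LPIPS implementation; `arcface_embed` and the loss weights are illustrative assumptions, since the paper's weight values are not listed here:

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips; Zhang et al.'s LPIPS implementation

lpips_fn = lpips.LPIPS(net="alex")  # expects inputs scaled to [-1, 1]

def generator_loss(y_hat, y, disc, arcface_embed,
                   lambda_rec=1.0, lambda_id=0.1, lambda_adv=0.01):
    """Weighted sum of the three losses (sketch; weights are assumed)."""
    # Reconstruction: pixel-wise L2 plus perceptual LPIPS.
    l_rec = F.mse_loss(y_hat, y) + lpips_fn(y_hat, y).mean()
    # Identity: dot product of normalized ArcFace embeddings
    # (normalization is our assumption).
    e_hat = F.normalize(arcface_embed(y_hat), dim=-1)
    e_ref = F.normalize(arcface_embed(y), dim=-1)
    l_id = 1.0 - (e_hat * e_ref).sum(dim=-1).mean()
    # Adversarial: non-saturating GAN loss, as in StyleGAN2.
    l_adv = F.softplus(-disc(y_hat)).mean()
    return lambda_rec * l_rec + lambda_id * l_id + lambda_adv * l_adv

def r1_penalty(disc, y_real):
    """R1 regularization: gradient penalty on real images only."""
    y_real = y_real.detach().requires_grad_(True)
    score = disc(y_real).sum()
    (grad,) = torch.autograd.grad(score, y_real, create_graph=True)
    return grad.pow(2).flatten(1).sum(1).mean()
```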

IV. EXPERIMENTS ON IMAGE RESTORATION
A. DATASETS
1) TRAINING DATASET
We train our model on the FFHQ [44] dataset, which contains 70,000 high-quality face images at 1024*1024 resolution. We generate low-quality inputs by downsampling the original images. To approximate the degradation in real-world scenarios, we adopt a degradation model used in research [5], [45] on blind super-resolution:
$$I_{LQ} = \big[(I_{HQ} \otimes k_{\sigma}) \downarrow_{s} + n_{\delta}\big]_{\mathrm{JPEG}_{q}}$$
where high-quality images are convolved with a Gaussian blur kernel $k_\sigma$ with sigma $\sigma$ and then downsampled by a scale factor $s$. After that, white Gaussian noise $n_\delta$ with noise level $\delta$ is added to the image. Finally, the degradation caused by JPEG compression is imitated by a compression operator with quality factor $q$. The blur, downsampling, and compression operations are chosen to realistically represent the issues faced in downstream scenarios. We set the Gaussian blur kernel size to 41 and randomly sample $\sigma$, $s$, $\delta$, and $q$ from the ranges [0.1, 10], [4, 32], [0, 10], and [60, 100], respectively.
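A sketch of this degradation pipeline with OpenCV and NumPy; the exact kernel construction and sampling details in the paper may differ:

```python
import cv2
import numpy as np

def degrade(img_hq: np.ndarray, sigma: float, scale: int,
            noise_level: float, quality: int) -> np.ndarray:
    """Blur -> downsample -> Gaussian noise -> JPEG compression (sketch)."""
    # Gaussian blur with a 41x41 kernel, as stated in the paper.
    img = cv2.GaussianBlur(img_hq, (41, 41), sigma)
    # Downsample by scale factor s.
    h, w = img.shape[:2]
    img = cv2.resize(img, (w // scale, h // scale),
                     interpolation=cv2.INTER_LINEAR)
    # Additive white Gaussian noise with level delta.
    img = img + np.random.normal(0.0, noise_level, img.shape)
    img = np.clip(img, 0, 255).astype(np.uint8)
    # JPEG compression with quality factor q.
    _, buf = cv2.imencode(".jpg", img, [cv2.IMWRITE_JPEG_QUALITY, quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)

# Example: sample degradation parameters from the paper's ranges.
lq = degrade(np.random.randint(0, 256, (1024, 1024, 3), dtype=np.uint8),
             sigma=np.random.uniform(0.1, 10),
             scale=int(np.random.randint(4, 33)),
             noise_level=np.random.uniform(0, 10),
             quality=int(np.random.randint(60, 101)))
```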

2) TEST DATASETS
We use three different datasets to verify the performance of face image restoration methods. These datasets have no overlap with our training dataset, and they represent different types of images.
• CelebA-Test dataset. We use the test partition of CelebA-HQ [46] as a test dataset. It has 2,824 images of 512*512 resolution. We downsample the images with the same degradation model used in the training process.
• LFW dataset. Labeled Faces in the Wild (LFW) [47] is a dataset collected from the web. It contains images taken under various conditions, including poor lighting, extreme poses, and strong occlusions.
• Rfau dataset. Rfau [48] is a dataset of Asian teenagers captured by surveillance cameras in a school. The faces of students are cropped from panoramas, so the images have low resolution.

B. IMPLEMENTATION
Our model takes images of size 256*256 as inputs and outputs images of size 1024*1024. We adopt a StyleGAN2 pre-trained on the FFHQ dataset with 1024*1024 outputs as the generative prior. The Difference Extractor module consists of six downsampling convolution blocks as the encoder and eight upsampling convolution blocks as the decoder. We use one convolution layer as the Feat Bias Mapper for each upsampling block, and a one-layer MLP is adopted as the Latent Bias Mapper. We train the model with a batch size of 16 and a learning rate of 2 × 10^-3 for both the image restoration model and the discriminator. The training data are augmented by random horizontal flips. We implement our model with the PyTorch framework and train it on four Nvidia RTX 3090 GPUs.

C. METRICS
We use five metrics to evaluate the image restoration results from different aspects.
• PSNR. Peak Signal-to-Noise Ratio (PSNR) is a common way to quantify reconstruction quality for images and videos. It is based on the pixel-wise mean squared error:
$$PSNR = 10 \log_{10}\left(\frac{MAX_I^2}{\mathrm{MSE}(\hat{y}, y)}\right) \quad (15)$$
where $MAX_I$ denotes the maximum possible pixel value of the images, $\hat{y}$ is the predicted image of a model, and $y$ stands for the target image. Images with bigger PSNR values are better in terms of absolute reconstruction error (a minimal implementation is sketched after this list).
• SSIM. Structural Similarity (SSIM) [49] is used to compute the similarities of two images from the perspective of perceived change in structural information. A bigger SSIM score indicates better reconstruction.
• FID. Frechet inception distance (FID) [50] is a metric evaluating the quality of generated images. It compares the distributions of generated images and real images. A lower FID score implies more realistic images.
• LPIPS. Learned Perceptual Image Patch Similarity (LPIPS) [40] is a metric for perceptual quality with the reference image. We implement it in the same way as in the loss function. Images with smaller LPIPS scores are more perceptually similar to target images.
• ID. We use a pre-trained ArcFace model to score the similarity of the identities in images. Smaller ID values indicate better identity similarity.
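As a quick illustration of the PSNR definition in the first bullet, a minimal NumPy implementation:

```python
import numpy as np

def psnr(y_hat: np.ndarray, y: np.ndarray, max_i: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio from pixel-wise mean squared error."""
    mse = np.mean((y_hat.astype(np.float64) - y.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(max_i ** 2 / mse))
```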

D. COMPARISONS WITH STATE-OF-THE-ART METHODS
We compare four state-of-the-art face restoration models with ours. They all use a pre-trained StyleGAN as the generative prior and are trained on the FFHQ dataset. GFP-GAN and GPEN are designed for blind face restoration, while PULSE and GLEAN are general image restoration methods. We use the official code and weights of these methods.

1) CelebA-TEST DATASET
We show the quantitative results of the five methods on the CelebA-Test dataset in Table 1. Our method achieves the best PSNR and SSIM scores among the SOTA methods, which means that it achieves the best results in terms of reconstruction accuracy and structural similarity. This indicates that our model can restore the content of the original image most faithfully, without excessive deviation. At the same time, our model also takes second place in the ID score, only slightly behind GPEN. This proves that our model retains most of the facial features while maintaining the overall restoration of the picture. However, there are still some gaps between our method and the SOTA face image restoration methods in terms of LPIPS and FID. Both LPIPS and FID simulate human perception to measure picture quality, and these two indicators reflect the realism of the generated pictures to a certain extent. Our model prioritizes low bias during restoration, so the results lose some clarity, resulting in limited perceptual quality. Besides restoration quality, we report the number of parameters and the inference speed of the models in Table 1 as well. Our model consumes the least computation and time among the compared methods.

We demonstrate the restoration of some images in Fig. 4, where the first three rows are samples from the CelebA-Test dataset. Since we constructed the CelebA-Test dataset using the same degradation model used to construct the training dataset, this synthetic dataset contains various degrees of degradation. We selected three images with different levels of degradation to compare the performance of the methods in various situations. The first row of Fig. 4 shows a lightly degraded input. Except for PULSE, which does not retain the identity of the face well, and GLEAN, which generates unnatural lines, all other methods, including our model, generate very clear and high-fidelity pictures. The input of the second row has conspicuous noise. GPEN restores the noise as freckles, whereas GFP-GAN and our model remove the noise. The input of the third row is so severely degraded that the facial features can barely be distinguished. At this level of degradation, there are unnatural lattice marks in the pictures generated by GFP-GAN. Our model reproduces smooth faces, but the details are not clear. In this case, the restoration of GPEN works best.

2) LFW DATASET
The LFW dataset contains some moderate-quality celebrity images. Since this dataset does not have pristine original images, we show qualitative comparisons in the fourth and fifth rows of Fig. 4. We choose to show a picture of a man with his head up and a picture of a woman with sunglasses. GFP-GAN, GPEN, and our model all restore the first image well. In the second image, the GFP-GAN result shows unnatural eyes and nose. This phenomenon reminds us that methods relying on pre-trained GANs are vulnerable to occlusions on faces. Nevertheless, the image restored by our model is denoised without introducing any unnaturalness.

3) RFAU DATASET
The pictures in the Rfau dataset contain various expressions of Asian teenagers. This dataset has a very different style from FFHQ, so this experiment shows the performance of each method when used in circumstances different from those of the training dataset. The failure of PULSE on the LFW and Rfau datasets reflects the challenge that different data styles pose to pre-trained GAN-based methods. As Rfau is an expression dataset, the results of image restoration must preserve expression details. In the first picture, although both GFP-GAN and GPEN reconstruct the input into high-definition pictures, the student's slight pouting expression has disappeared: both restore this expression to closed, thick lips. The second sample is a picture of a student yawning. Most of the methods restore the wide-open mouth, but there are strange white spots inside the mouth in the results of GFP-GAN. Our method preserves all the details of the expressions in these cases.

4) DISCUSSION
Through the experiments on three different datasets and various degradation levels, we found that our model restores low-quality images well. Especially in terms of fidelity, our model does not introduce extraneous content from excessive priors of the pre-trained GAN, which fully guarantees the consistency of the image content. However, this is also the reason that the images reconstructed by our model lack details and are not clear enough when the degradation level is extremely high. Overall, it allows us to confidently apply our method to datasets for downstream tasks, since details such as wrinkles and freckles do not affect performance for the vast majority of downstream tasks. In the subsequent experiments, we demonstrate that this trade-off is worthwhile.

E. ABLATION STUDIES
In this section, we perform ablation studies to verify the effectiveness of the important components of our framework for image restoration. We propose to use a latent bias and feature biases to calibrate the output of the generator, and both components are indispensable for reconstruction quality. We compare the performance of models with and without these components on the CelebA-Test dataset and show the results in Table 2.
The model using only the initial latent codes to restore images degenerates to a GAN inversion model with a pSp encoder and StyleGAN. It achieves only a 15.0589 PSNR score and a 0.4181 SSIM score. If we use the proposed latent bias and the encoder part of the Difference Extractor, fidelity improves slightly, as the PSNR and SSIM scores show, while high image quality is preserved. As discussed in previous sections, operations conducted on the latent space have a constrained influence on image contents, because latent biases can only revise the results of the pSp encoder to a limited extent. On the other hand, using only feature biases yields a greater improvement over using the initial latent codes alone, but there is still a big gap with the complete model. Combining both the latent bias and the feature biases gives the best performance, with a 3.4 improvement in PSNR and a 0.15 improvement in SSIM.

FIGURE 5. Qualitative results of linear interpolation on the Rfau and LFW datasets. Rfau is a dataset for expression classification and LFW is a dataset for face recognition. In each row, we select two images from the same category and perform linear interpolation between the two images. We show the interpolation results for interpolation coefficients α = 0.2, 0.5, 0.8.

V. DATA AUGMENTATION AND DOWNSTREAM TASK
In this section, we examine the data augmentation capabilities of our model and its effects on downstream tasks. We demonstrate it on the expression classification problem on the Rfau Dataset. In this process, we propose a new data augmentation method and a potential solution to the class imbalance problem in face-related computer vision applications.

A. DATA AUGMENTATION
Benefiting from the model structure, our model retains and extends the linear interpolation ability of the pre-trained StyleGAN. In Fig. 5 we show the qualitative effect of linear interpolation on the Rfau dataset (expression classification task) and the LFW dataset (face recognition task). From the third column to the fifth column, we give three interpolations in the change from picture a to picture b: compositions of 80% image a and 20% image b, 50% of each, and 20% image a and 80% image b. From the figure, we can see that the new images obtained by linear interpolation look very natural, without many unreasonable artifacts. More importantly, the change from image a to image b is smooth in both identity and expression, constructing reasonable new pictures. The two pictures in the first row of Fig. 5 are taken from the ''happy'' category, and the two pictures in the second row are taken from the ''dissatisfied'' category. It can be seen that the pictures generated by linear interpolation preserve the emotions of happiness and unhappiness. For the task of face recognition, our method can generate new images of the same person, as shown in the last two rows of Fig. 5. Although images A and B are photos of the same person with different makeup, age, or expression, the generated samples preserve the identity well. We show that these results are useful in the downstream task.

B. EFFECT ON DOWNSTREAM TASK
We test our data augmentation on the expression classification task on the Rfau dataset. This dataset suffers from a severe class imbalance problem; we present the number of training samples for the 6 classes in the dataset in Fig. 6 (a). The category with the most samples has 2,389 samples, while the category with the fewest has only 40. Such a large gap poses a great challenge to the classifier. We generate new samples class by class. We first restore all the images in the dataset. For each class, we randomly pick two different images and a linear interpolation coefficient α from the range [0.2, 0.8], so that the new image does not coincide with the original ones. We use a ResNet18 [51] pre-trained on ImageNet [52] as the classifier and use macro F1, micro F1, and the per-category F1 scores as the classification metrics.
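A sketch of the class-balancing loop; `restore_fn` and `interp_fn` are hypothetical handles to the restoration model and the interpolation routine from Section V-A (e.g. `interpolate_sample` with the generator bound via `functools.partial`):

```python
import random

def balance_class(images, restore_fn, interp_fn, target_size=1000):
    """Grow one class to target_size samples via linear interpolation.

    images:     restored training images of a single class.
    restore_fn: returns (w_hat, feature_biases) saved at restoration.
    interp_fn:  interp_fn(w_a, bf_a, w_b, bf_b, alpha) -> new image.
    """
    latents = [restore_fn(img) for img in images]
    new_samples = []
    while len(images) + len(new_samples) < target_size:
        # Pick two distinct images of the same class.
        (w_a, bf_a), (w_b, bf_b) = random.sample(latents, 2)
        # alpha in [0.2, 0.8] keeps new samples away from the originals.
        alpha = random.uniform(0.2, 0.8)
        new_samples.append(interp_fn(w_a, bf_a, w_b, bf_b, alpha))
    return new_samples
```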
We show the effectiveness of image restoration and data augmentation on the downstream task in Fig. 6 (b). To examine the classification results, we use 10 random seeds for each method. As observed by [53] and [54], the random effect caused by random seeds during training is non-negligible and is even magnified when the dataset is small. First, we restore every image in the original dataset with three models: GPEN, GFP-GAN, and our model. The average macro F1 score of ResNet18 trained on the restored datasets is better than that trained on the original dataset. Among the three restoration models, ours outperforms the other two thanks to its higher fidelity. Then, we augment the dataset with the proposed linear interpolation method, generating new samples so that there are at least 1,000 images in each class to mitigate the imbalance. The performance enhancement brought by data augmentation is obvious. What's more, the random effect is significantly smaller when the quantity of data is increased.
We further discuss the impact of different augmentation strategies and show more detailed results by testing two data augmentation strategies:
• Augment V1: We increase the number of samples in all classes; the classes with more samples (happy, sleepy) are doubled, and the classes with fewer samples are tripled.
• Augment V2: We round up the number of samples for classes with fewer than 1,000 samples to 1,000.
We also include three baselines in the experiment:
• Original data: The original unbalanced data.
• Restored data: All the original images, including the training, validation, and test data, are restored by our restoration model.
• Restored+copy data: This is a dataset built upon Restored data. We supplement the ''small classes'' to 1,000 samples by duplicating the restored images.
We show the comparison results in Fig. 6 (c). In terms of the two overall indicators, macro F1 and micro F1, the restored data slightly improves performance over the original data. Both versions of the data perform excellently in the two big categories (happy and sleepy), but the performance in the other categories is very random. Training with the Augment V1 data improves the overall macro F1 by 0.13 over the restored data. The overall performance of the Augment V2 data is the best, with macro F1 remarkably improved from the original 0.4009 to 0.6684. When we try to augment data without generative models, the ''restore+copy'' strategy increases performance only slightly compared to the restored data. It is worth noting that both data augmentation strategies achieve great results, and most of the improvement is provided by the classes with few samples. Face restoration alone improves the F1 score for the small classes by only 0.12 over the original data, while Augment V1 improves it to 0.51 and Augment V2 to 0.61. Surprisingly, data augmentation for small classes can even lift their performance to a level close to that of the large classes, which is a huge improvement for downstream tasks.

C. DISCUSSION
As can be seen from this example, data augmentation with linear interpolation is a feasible and effective approach. In cases of class imbalance or few-shot learning, increasing the number of samples can achieve better results than pure image restoration. Image restoration models with the capability to increase the number of samples can therefore be extremely useful in downstream tasks.

VI. CONCLUSION
We propose a new downstream-friendly face image restoration method, which can provide downstream tasks with datasets of better quantity and quality simultaneously. It is based on a pre-trained StyleGAN, and we adjust the generated content of StyleGAN with both latent biases and feature map biases. To improve fidelity and make training easier, we propose a Difference Extractor module and a series of mappers. Our method achieves high fidelity in experiments, thus it is reliable in downstream tasks. Additionally, we propose a new approach to mitigating the problem of class imbalance by linear interpolation. Through experimental verification, we find that data augmentation by the image restoration model is useful for downstream tasks.
Our work also provides new insights for future research. In terms of face image restoration, we can explore methods that achieve both better restoration effects and better interpolation effects. In terms of data enhancement, we can explore the effect of image interpolation on domains other than human faces, the properties of linear interpolation, and more augmentation strategies for different categories.