Perceptual Quality Improvement in Videoconferencing using Keyframes-based GAN

In recent years, videoconferencing has taken on a fundamental role in interpersonal relations, both for personal and business purposes. Lossy video compression algorithms are the enabling technology for videoconferencing, as they reduce the bandwidth required for real-time video streaming. However, lossy video compression decreases the perceived visual quality. Thus, many techniques for reducing compression artifacts and improving video visual quality have been proposed in recent years. In this work, we propose a novel GAN-based method for compression artifact reduction in videoconferencing. Given that, in this context, the speaker is typically in front of the camera and remains the same for the entire duration of the transmission, we can maintain a set of reference keyframes of the person from the higher-quality I-frames that are transmitted within the video stream and exploit them to guide the visual quality improvement; a novel aspect of this approach is the update policy that maintains and updates a compact and effective set of reference keyframes. First, we extract multi-scale features from the compressed and reference frames. Then, our architecture combines these features in a progressive manner according to facial landmarks. This allows the restoration of the high-frequency details lost after video compression. Experiments show that the proposed approach improves visual quality and generates photo-realistic results even with high compression rates. Code and pre-trained networks are publicly available at https://github.com/LorenzoAgnolucci/Keyframes-GAN.


Introduction
In recent years, videoconferencing has become a primary means of personal and business communication all over the world, also because of the emergence of the COVID-19 pandemic.
Lossy video compression algorithms such as H.264 and H.265 allow for a decrease in the bandwidth required for video transmission but introduce compression artifacts that reduce the perceived quality of the video stream. The degradation of the visual quality worsens the user experience, even making it unacceptable in certain cases. For these reasons, the development of methods for video quality enhancement constitutes a very active area of research. In recent years, Generative Adversarial Networks (GANs) have emerged as one of the most promising and powerful tools for several image and video processing tasks, thanks to their ability to generate photorealistic and perceptually satisfying results [10,11,41].
Applying deep learning-based enhancement methods to videos has several advantages. Firstly, these methods can be applied as post-processing steps to existing video compression and transmission systems without changing any component, and they are independent of the specific video codec employed. Secondly, enhancing the visual quality of videos reduces compression artifacts and other types of degradation, thus improving the user experience. Finally, the improvement in the perceived quality makes it possible to transmit videos with higher compression rates, consequently reducing the needed bandwidth. For example, [12] uses semantic video coding and a GAN to obtain a quality comparable to the one obtained by standard H.264 with three times the bandwidth. [64] proposes a talking-head synthesis approach that reconstructs a video using one-tenth of the original bandwidth.
Contributions. In this work, we propose a novel GAN-based approach for improving visual quality in videoconferencing. In videoconferencing the background has so little relevance [71] that some commercial solutions provide features to blur or replace the background with a virtual one. For this reason, we focus on the enhancement of the framed person, and in particular on the head area, because it is the most expressive and important part of interpersonal communications. Our approach is based on the assumption that the subject speaking in front of the camera stays the same for a relatively long consecutive time frame so that we can exploit for enhancement the previous high-quality reference keyframes of the Group of Pictures (GOP) coding (i.e. the so-called I-frames), used in video compression algorithms as the base for motion-based compression. In particular, we propose a novel policy to create and update a set of reference keyframes in order to keep this set small, and thus memory efficient, and also to make it effective for the improvement of the visual quality. Our model extracts multi-scale features of the compressed frame and a reference keyframe and then combines them according to the facial landmarks (see Fig. 1). The feature fusion is performed with Adaptive Spatial Feature Fusion (ASFF) [37] and Spatial Feature Transform (SFT) [65] blocks in a progressive manner that helps in restoring coarse-to-fine details. We designed a pipeline for video enhancement that involves preserving a limited number of keyframes extracted from the video stream and using the most useful ones as a reference for restoring the compressed frame. The experiments and the comparison with competing state-of-the-art approaches show that our proposed method is very effective in generating photo-realistic results even with high compression rates.
Figure 1. Overview of the proposed system at runtime. High quality reference keyframes (video I-frames) are used in our GAN-based approach to improve the visual quality of the video conference stream. The algorithm used to update the keyframe reference set is a key element to improve the visual quality of the restored frames.

Related Work
Video Coding. Some interesting initial works have addressed the quality improvement of videos and images using coding based on neural networks [53,54]. These approaches are currently not deployable with satisfying visual results due to an unbearable computational cost. Moreover, fully learned compression requires the standardization and diffusion of a novel technology, which is a very high market barrier to practical use.
Video Quality Improvement. Recently, many learning-based image enhancement techniques have been proposed [2,6,10,11,26,45,46,60,61,69,77,78]. Such approaches learn deep convolutional architectures, often based on GANs, to restore low-quality images corrupted by compression artifacts into high-quality ones, and deal with generic video content. [17] presents a Multi-Frame Quality Enhancement approach for compressed videos. After observing that the quality of compressed videos fluctuates across frames, the authors developed a BiLSTM-based detector to locate Peak Quality Frames (PQFs), that is, frames that have a higher quality than their neighbors, whose information can be exploited to reduce the distortion of low-quality frames. A non-PQF and its nearest two PQFs are the input of a multi-frame CNN, composed of a motion compensation and a quality enhancement subnet. [66] presents EDVR, a video restoration framework with enhanced deformable convolutions. A pyramid, cascading and deformable module uses deformable convolutions in a coarse-to-fine manner to align the features of the reference frame to that of its neighboring frames and then a temporal and spatial attention fusion module combines them.
Face Quality Improvement. Face super-resolution has been addressed in [5], where the authors proposed GWAInet, a GAN-based approach that performs 8× face super-resolution using an HR reference image of the same person depicted in the LR image. A warper subnetwork aligns the contents of the reference image to the input image. Then, after extracting the features of the LR and HR images, a feature fusion chain combines them to exploit the reference image. A peculiarity of this method is that it does not require facial landmarks for the training. In [35] super-resolution of extremely degraded faces is addressed with a GAN that produces a coarse SR image. Then, the result is refined by exploiting facial components extracted from multiple high-quality warped images of the same person or a similar one. In [74] the problem of face quality improvement is formulated as a dual-blind restoration problem, lifting the requirements of both the degradation and structural prior for training. The authors present HiFaceGAN, a collaborative suppression and replenishment framework with a nested architecture for multi-stage face renovation with hierarchical semantic guidance. [75] proposes a GAN prior embedded network for blind face restoration, using a U-shaped DNN for face restoration as a decoder. PSFR-GAN, a GAN-based Progressive Semantic-aware Style Transformation framework presented in [3], uses a face parsing network to obtain a segmentation map given an LQ face image. The input image and the segmentation map are exploited to produce a multi-scale pyramid of the inputs modulating different scale features with a semantic-aware style transfer approach. A semantic-aware style loss accounts for each semantic region individually. In [34] the blind face restoration task is tackled with a Guided Face Restoration Network (GFRNet) that takes advantage of a high-quality reference image of the same identity. A warper subnetwork reduces the difference in pose and expression between the two images to better recover fine and identity-aware facial details with a reconstruction subnetwork.
The Deep Face Dictionary Network (DFDNet) proposed in [36] attempts to overcome the main limitation of reference-based methods by observing that facial components are similar between different people. Multi-scale dictionaries of facial parts are built offline with K-means from high-quality images. The features in the dictionaries most similar to the facial components of the degraded input are leveraged for restoration by means of Dictionary Feature Transfer and Spatial Feature Transform blocks. In [37] blind face restoration is tackled by exploiting a high-quality image selected from multiple available images of the same person as a reference to restore a degraded one. The features of the guidance image are warped to the low-quality ones according to the facial landmarks to reduce the difference in pose and expression. Multiple Adaptive Spatial Feature Fusion blocks combine the degraded and guidance features by generating an attention mask with facial landmarks to guide the restoration of the facial components. In [12] a method that combines semantic video coding and GAN-based video quality restoration is proposed for video conference systems, using a perceptual loss that accounts separately for the background and the foreground face. [8] presents HeadGAN, a method for head reenactment that conditions head synthesis on 3D face representations from a driving video. Audio features are exploited to better synthesize mouth movements. When driving and reference identities coincide, HeadGAN can be used for face reconstruction. In [67] facial priors encapsulated in a pre-trained GAN (GFP-GAN) are incorporated for blind face restoration by means of channel-split Spatial Feature Transform layers. Unlike GAN inversion methods, GFP-GAN can restore faces with a single forward pass. [43] tackles blind face restoration with a GAN that uses multi-scale facial features. A feature prior loss aims to reduce the difference in the feature space between the input and restored images, thus preserving the overall image content and spatial structure information.
[33] proposes a restoration with a memorized modulation framework for blind face restoration. Low-level spatial feature embedding, wavelet memory embedding, and disentangled high-level noise embedding are combined with adaptive attention maps. [82] presents DAVD-Net, a DCNN architecture that exploits the audio-video correlations to remove compression artifacts in close-up talking head videos. The audio features are extracted with a BiLSTM and organized in a 2D form. The video and audio features are aggregated with a spatial attention module. To further improve the restoration, the structural information of the encoder in the video compression standards is embedded into the network by adding a constraining projection module. In [42] face quality of compressed videos is enhanced with MRS-Net+, a multi-level architecture comprised of one base and two refined enhancement levels that restore small, medium, and large-scale faces, respectively. A landmark-assisted pyramid alignment subnet is developed to align faces across consecutive frames. [81] and [18] exploit a multi-modality neural network to restore strongly compressed face videos. They both use video and audio signals, combined with codec information in [81] and with an emotion state in [18]. [76] presents a multi-task face restoration network that relies on network architecture search to restore images affected by various degradations. Additionally, during training clean images of the same subject as the degraded image are exploited by means of an identity loss. [70] proposes a method based on fully-spatial attention to tackle blind face restoration. A multi-head cross-attention layer takes the features of a degraded face as queries while the key-value pairs are from high-quality facial priors. The key-value pairs are sampled from a reconstruction-oriented high-quality dictionary.
Even though our aim is to improve the perceptual quality of videos, we did not follow the standard multi-frame restoration approach that is commonly used in video restoration tasks, such as in MFQE 2.0 [17], MRS-Net+ [42] or DAVD-Net [82], because it usually involves looking also at future frames and this is not possible in a real-time stream. Taking into account only past neighboring frames is certainly a possibility, but we preferred to consider possibly very distant I-frames and not necessarily the closest one. This preference is possible in videoconferencing because the subject usually is the same for the entire transmission, so old I-frames can still be very useful in restoring the current compressed frame. This is similar to exemplar-guided face image restoration techniques, but given that our method is applied to videos we can exploit multiple I-frames from the same video stream as possible references, dynamically updating the set of keyframes with the policy we designed to obtain the best performance. It is precisely the update strategy for the dynamic set of keyframes, inspired by the Least-Frequently Used (LFU) policy, that mainly differentiates our work from the exemplar-based face restoration methods that constitute the state of the art. For instance, ASFFNet [37] relies on a given set of reference images representing the same person and cannot handle a dynamic set of references, nor a policy for updating it. Similarly, DFDNet [36] needs an offline-generated dictionary of features of different subjects, therefore it cannot exploit high-quality I-frames of the same subject that arrive in real time.

Proposed Approach
Since its introduction in [15], the Generative Adversarial Network (GAN) framework has emerged as a powerful tool for various image and video synthesis tasks, such as image-to-image translation [23], face reenactment [73] and pose transfer [63]. Compared to other deep generative models, like Deep Boltzmann Machines [9] or Variational AutoEncoders [29], GANs proved to be able to generate more photorealistic results [14,41], and have been successfully used to improve the visual quality of images [11] and videos [61]. Our method is based on such a framework.

Proposed Architecture
We propose a novel GAN architecture shown in Fig. 2 and inspired by [37] and [36]. Similarly to [37], we adopt the ASFF block and Moving Least Squares for warping. Unlike [37], we warp the reference image directly rather than its features, and we extract and fuse features at multiple scales in a progressive manner to help the network in restoring coarse-to-fine details. We took inspiration from [36] in the use of multi-scale features and of the SFT block, but we leverage a high-quality image of the same person to better restore subject-specific details. Unlike both [37] and [36], we select our reference image from the best-performing set of high-quality keyframes coming from the same video, which is built and updated with our proposed policy.
Our architecture is based on U-Net [55] and is composed of an encoder, which processes the input so that it is smaller in terms of spatial dimensions but deeper in terms of the number of channels, and a decoder, which inverts the process. Multi-scale reference features are combined with the features of the degraded image in a progressive manner. This approach lets the network learn coarse-to-fine details and is beneficial to the restoration process. Our model takes 3 inputs:
• a degraded (i.e. highly compressed) image;
• a high-quality reference image (i.e. a video I-frame);
• a binary image that is white only in correspondence with the facial landmarks of the compressed image.
The model produces a restored image from the compressed one.
We use a pre-trained VGG-19 [59] to extract multi-scale features from the degraded, reference and landmarks binary images. The reference (guidance) image is first warped to the degraded one based on the facial landmarks. To align the warped reference and degraded features we adopt AdaIN [20]. This helps reduce the difference in style and illumination between the two images and thus improves the restoration. We denote by $F_d$ and $F_g$ the degraded and guidance features. The AdaIN can be written as
$$\mathrm{AdaIN}(F_g, F_d) = \sigma(F_d)\,\frac{F_g - \mu(F_g)}{\sigma(F_g)} + \mu(F_d), \tag{1}$$
where $\mu(\cdot)$ and $\sigma(\cdot)$ represent the mean and the standard deviation.
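As an illustration, the AdaIN alignment can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions, not the released implementation: the function name `adain`, the `(C, H, W)` feature layout, per-channel statistics, and the small `eps` added for numerical stability are all choices made for this example.

```python
import numpy as np


def adain(f_g: np.ndarray, f_d: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """Re-normalize guidance features f_g to the statistics of degraded features f_d.

    Both inputs are assumed to have shape (C, H, W); statistics are per channel.
    """
    mu_g = f_g.mean(axis=(1, 2), keepdims=True)
    sigma_g = f_g.std(axis=(1, 2), keepdims=True) + eps  # eps avoids division by zero
    mu_d = f_d.mean(axis=(1, 2), keepdims=True)
    sigma_d = f_d.std(axis=(1, 2), keepdims=True)
    # Whiten the guidance features, then re-color them with the degraded statistics.
    return sigma_d * (f_g - mu_g) / sigma_g + mu_d
```

After this operation each channel of the guidance features has (up to `eps`) the same mean and standard deviation as the corresponding channel of the degraded features, which is what reduces the style and illumination gap.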
After going through multiple dilated residual blocks, the degraded features are progressively upsampled by enlarging the spatial resolution and reducing the number of channels. At the same time, they are combined with the reference features by means of Adaptive Spatial Feature Fusion (Sec. 3.2) and Spatial Feature Transform (SFT) [65] blocks.
The SFT block generates affine transformation parameters for spatial-wise feature modulation incorporating some prior condition. The scale $\alpha$ and the shift $\beta$ parameters are learned from the features outputted by the corresponding ASFF block. The output of the SFT block is formulated as
$$\mathrm{SFT}(F_r \mid \alpha, \beta) = \alpha \odot F_r + \beta, \tag{2}$$
where $\odot$ is the element-wise product and $F_r$ are the restored features, that is, the features originated from the degraded ones and restored in the decoding part of the architecture. Figure 4 shows the structure of the SFT block. Following [12], we train the network to learn the residual image, so there is a skip connection between the degraded image and the restored output. This choice reduces the overall training time and improves its stability.
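The SFT modulation can be illustrated with a small NumPy sketch. It assumes, purely for the example, that the scale and shift maps are each produced by a single 1×1 convolution applied to the ASFF output; the actual block (Figure 4) may use a deeper branch, and the names `conv1x1`, `sft`, `w_alpha`, `b_alpha`, `w_beta` and `b_beta` are hypothetical.

```python
import numpy as np


def conv1x1(x: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """1x1 convolution: x is (C_in, H, W), weight is (C_out, C_in), bias is (C_out,)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]


def sft(f_r, f_asff, w_alpha, b_alpha, w_beta, b_beta):
    """Modulate restored features f_r with scale/shift maps predicted from f_asff."""
    alpha = conv1x1(f_asff, w_alpha, b_alpha)  # spatially varying scale
    beta = conv1x1(f_asff, w_beta, b_beta)     # spatially varying shift
    return alpha * f_r + beta                  # element-wise affine modulation
```

Because `alpha` and `beta` have the same spatial size as `f_r`, the modulation can differ at every location, which is what distinguishes SFT from a global, per-channel affine transform.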

ASFF Block
The fusion of the features of the reference and degraded images is a fundamental part of exemplar-based approaches, as it allows the information supplied by the guidance image to be fully exploited. Adopting a concatenation-based approach, as in [5,34], does not take full advantage of the reference features.
Thus, in our multi-scale architecture, we rely on multiple Adaptive Spatial Feature Fusion (ASFF) blocks [37]. While the reference image generally contains more high-quality details, the degraded image should have more weight in the reconstruction of the overall face components. For example, if the mouth of the reference image is closed while that of the compressed image is open, the reconstruction of the teeth should be mainly based on the restored features from the degraded image. For this reason, ASFF blocks generate an attention mask based on the degraded image facial landmarks to guide the fusion of the guidance and restored features. Figure 5 shows the structure of the ASFF block.

Warping Reference with Moving Least Squares
For most guided face restoration methods, the performance is diminished by the pose and expression difference between reference and degraded images because it introduces artifacts in the reconstruction result. Thus, we spatially align the reference and compressed images with an image deformation method based on Moving Least Squares (MLS) [57].
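Let $p$ and $q$ be respectively the sets of facial landmarks of the reference and degraded image, with $|p| = |q| = N$. In our case, $N = 68$. We aim to find a deformation function $f$ to apply to all the points of the reference image. Given a point $v$ in the image, we solve for the best affine transformation $l_v(x)$ that minimizes
$$\sum_{i=1}^{N} w_i \left| l_v(p_i) - q_i \right|^2, \tag{3}$$
where $p_i$ and $q_i$ are row vectors and the weights $w_i$ decrease with the distance between $p_i$ and $v$, as in [57]. Because the weights $w_i$ are dependent on the point of evaluation $v$ we obtain a different transformation $l_v(x)$ for each $v$. We define the deformation function $f$ to be $f(v) = l_v(v)$. Since $l_v(x)$ is an affine transformation we can rewrite it in terms of a linear transformation matrix $M$
$$l_v(x) = (x - p_*)\,M + q_*, \tag{4}$$
where $p_*$ and $q_*$ are weighted centroids
$$p_* = \frac{\sum_i w_i p_i}{\sum_i w_i}, \qquad q_* = \frac{\sum_i w_i q_i}{\sum_i w_i}.$$
Based on this insight, the least squares problem of Eq. (3) can be rewritten as
$$\sum_{i=1}^{N} w_i \left| \hat{p}_i M - \hat{q}_i \right|^2, \tag{5}$$
where $\hat{p}_i = p_i - p_*$ and $\hat{q}_i = q_i - q_*$. The affine deformation that minimizes Eq. (5) is
$$M = \Big( \sum_i \hat{p}_i^{T} w_i \hat{p}_i \Big)^{-1} \sum_j w_j \hat{p}_j^{T} \hat{q}_j. \tag{6}$$
With this closed-form solution for $M$, we can write a simple expression for the deformation function $f$
$$f(v) = (v - p_*) \Big( \sum_i \hat{p}_i^{T} w_i \hat{p}_i \Big)^{-1} \sum_j w_j \hat{p}_j^{T} \hat{q}_j + q_*. \tag{7}$$
Applying this deformation function to each point of the reference image warps it according to the facial landmarks of the degraded image.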
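The closed-form affine MLS deformation can be sketched for a single point as follows. This is an illustrative NumPy version under assumptions: the weights are taken as an inverse power of the squared distance with exponent `alpha` (a common choice for MLS deformation), `eps` is added to avoid division by zero at the control points, and the function name `mls_affine` is hypothetical. Warping a full image additionally requires evaluating the map on the pixel grid and resampling, which is omitted here.

```python
import numpy as np


def mls_affine(v, p, q, alpha: float = 1.0, eps: float = 1e-8) -> np.ndarray:
    """Deform point v (shape (2,)) given control points p -> q (each of shape (N, 2))."""
    v = np.asarray(v, dtype=float)
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)

    # Weights depend on the evaluation point v (assumed form: 1 / |p_i - v|^(2*alpha)).
    w = 1.0 / (np.sum((p - v) ** 2, axis=1) ** alpha + eps)

    # Weighted centroids p_* and q_*.
    p_star = w @ p / w.sum()
    q_star = w @ q / w.sum()
    p_hat = p - p_star
    q_hat = q - q_star

    # M = (sum_i w_i p_hat_i^T p_hat_i)^(-1) (sum_j w_j p_hat_j^T q_hat_j)
    a = (p_hat * w[:, None]).T @ p_hat
    b = (p_hat * w[:, None]).T @ q_hat
    m = np.linalg.solve(a, b)

    return (v - p_star) @ m + q_star
```

A useful sanity check is that, when the target landmarks are an exact affine image of the source landmarks, the deformation reproduces that affine map at every point, regardless of the weights.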

Keyframes Selection and Set Maintenance
Although warping with MLS helps to reduce the distance between the compressed and reference images, if they are too different the results will still be sub-optimal. Thus it is natural to select the optimal reference keyframe as the one that has a similar pose and expression to the degraded image, instead of simply using the previous keyframe. We measure the similarity between a keyframe and the degraded frame with the Euclidean distance between the sets of facial landmarks. Considering videoconferencing, assuming that the talking subject stays the same, even very old keyframes can be useful. So, as the video progresses, one can save a limited set of keyframes, to reduce memory requirements, and then use the most similar one as a reference to restore the current compressed frame. This novel method is the key to improving the overall restoration quality of the video and limits the cases in which the compressed and reference frames are very different. We took inspiration from the Least-Frequently Used (LFU) cache replacement strategy: for each keyframe of the set, we keep count of how many times it was selected for reconstruction and when a new keyframe is received from the video stream the least used is evicted. However, in this way, the first keyframes of the video would be excessively rewarded. Indeed, since for the first seconds of the video they are the only ones available as a reference, they can be used not because of similarity with the compressed frame but for lack of alternatives. To overcome this problem we apply an exponential decay to the number of uses, i.e. when a new keyframe arrives the counter of the number of uses of all the keyframes of the set is halved.
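The selection and update policy described above can be sketched as a small data structure. This is a minimal illustration rather than the released code: the class and function names are hypothetical, eviction is assumed to happen only when the set is full, ties are broken in favor of evicting the oldest keyframe, and the landmark distance is computed as the Euclidean norm of the stacked coordinate differences.

```python
import math
from dataclasses import dataclass


@dataclass
class Keyframe:
    frame_id: int
    landmarks: list  # list of (x, y) tuples
    uses: float = 0.0


def landmark_distance(a, b) -> float:
    """Euclidean distance between two landmark sets of equal length."""
    return math.sqrt(sum((ax - bx) ** 2 + (ay - by) ** 2
                         for (ax, ay), (bx, by) in zip(a, b)))


class KeyframeSet:
    def __init__(self, max_size: int = 10):
        self.max_size = max_size
        self.items = []

    def add(self, frame_id: int, landmarks: list) -> None:
        """Insert a new I-frame: decay all counters, evict the least used if full."""
        for k in self.items:
            k.uses /= 2.0  # exponential decay of the usage counters
        if len(self.items) >= self.max_size:
            victim = min(self.items, key=lambda k: k.uses)
            self.items.remove(victim)
        self.items.append(Keyframe(frame_id, landmarks))

    def select(self, landmarks: list) -> Keyframe:
        """Return the keyframe closest to the current frame and count the use."""
        if not self.items:
            raise ValueError("no keyframe available yet")
        best = min(self.items, key=lambda k: landmark_distance(k.landmarks, landmarks))
        best.uses += 1
        return best
```

In a streaming loop, `add` would be called for each decoded I-frame and `select` for each compressed frame to be restored. Halving the counters on every insertion means that uses accumulated early in the video lose weight over time, so early keyframes are not kept merely because they were once the only option.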

Training Losses
As in [37], to train our model we employed a weighted sum of reconstruction and photo-realistic losses. We denote by $I_D$, $I_R$ and $I_{GT}$ the degraded, reconstructed and ground-truth (i.e. high-quality uncompressed) images, respectively.
The reconstruction loss constrains the reconstructed image to faithfully approximate the ground-truth one and is composed of two terms. First, we relied on the Mean Square Error (MSE), defined as
$$\ell_{MSE} = \frac{1}{CHW} \left\| I_R - I_{GT} \right\|^2,$$
where $C$, $H$ and $W$ denote the channel, height and width of the image. Second, we adopted the perceptual loss [7,25,32], defined on the VGG-19 feature space. The perceptual loss is formulated as
$$\ell_{perc} = \sum_{l \in L} \frac{1}{C_l H_l W_l} \left\| \Psi_l(I_R) - \Psi_l(I_{GT}) \right\|^2,$$
where $\Psi_l$ represents the features from the $l$-th layer of a pretrained VGG-19 model, $C_l$, $H_l$ and $W_l$ denote their dimensions, and $L$ = {relu2_2, relu3_4, relu4_4, conv5_4}. We also experimented using VGG-Face [50] for the perceptual loss, in particular by extracting the output taken from the third convolutional layer of the fifth block before the ReLU activation, but the results were worse than with VGG-19.
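The photo-realistic loss also contains two terms. First, we used the style loss [13] that is defined on the Gram matrix of the feature map for each layer in $L$
$$\ell_{style} = \sum_{l \in L} \frac{1}{C_l H_l W_l} \left\| \Psi_l(I_R)^{T} \Psi_l(I_R) - \Psi_l(I_{GT})^{T} \Psi_l(I_{GT}) \right\|^2.$$
Second, we employed the hinge version of the adversarial loss [40,79]. We adopted multi-scale discriminators [62], that is, 4 discriminators $D_r$ that have the same network structure but operate at different image scales. The adversarial loss can be formulated as
$$\ell_{adv,D} = \sum_{r \in R} \lambda_{adv,r} \Big( \mathbb{E}\big[ \max(0, 1 - D_r(I_{GT\downarrow r})) \big] + \mathbb{E}\big[ \max(0, 1 + D_r(I_{R\downarrow r})) \big] \Big),$$
$$\ell_{adv,G} = - \sum_{r \in R} \lambda_{adv,r}\, \mathbb{E}\big[ D_r(I_{R\downarrow r}) \big],$$
where $\downarrow r$ denotes the downsampling operation with scale factor $r \in R = \{1, 2, 4, 8\}$ and $\lambda_{adv,r}$ are the trade-off parameters for each scale discriminator. $\ell_{adv,D}$ and $\ell_{adv,G}$ are used to update respectively the discriminators and the generator. To stabilize the learning of the discriminators we adopted SNGAN [48], incorporating the spectral normalization after each convolutional layer of the discriminator. Spectral normalization is based on regularizing the spectral norm of each layer of the discriminator by simply dividing the weight matrix by its largest singular value. The overall training loss is defined as
$$\ell = \lambda_{MSE}\,\ell_{MSE} + \lambda_{perc}\,\ell_{perc} + \lambda_{style}\,\ell_{style} + \lambda_{adv}\,\ell_{adv,G},$$
where $\lambda_{MSE}$, $\lambda_{perc}$, $\lambda_{style}$, and $\lambda_{adv}$ are the trade-off parameters.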
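For concreteness, the hinge adversarial loss in its common minimized form and the weight rescaling at the core of spectral normalization can be sketched in NumPy. This is an illustrative sketch under assumptions: the function names are hypothetical, the discriminator outputs are passed in as plain arrays for a single scale, and the spectral norm is computed exactly, whereas SNGAN estimates it with power iteration during training.

```python
import numpy as np


def hinge_d_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Discriminator hinge loss: push real scores above +1 and fake scores below -1."""
    return float(np.mean(np.maximum(0.0, 1.0 - d_real))
                 + np.mean(np.maximum(0.0, 1.0 + d_fake)))


def hinge_g_loss(d_fake: np.ndarray) -> float:
    """Generator hinge loss: raise the discriminator score of restored images."""
    return float(-np.mean(d_fake))


def spectral_normalize(weight: np.ndarray) -> np.ndarray:
    """Divide a weight tensor by the largest singular value of its 2D reshaping."""
    w2d = weight.reshape(weight.shape[0], -1)
    sigma = np.linalg.norm(w2d, ord=2)  # spectral norm = largest singular value
    return weight / sigma
```

With multi-scale discriminators, these per-scale losses would be evaluated on each downsampled image pair and combined with the per-scale trade-off parameters.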

Datasets
Similarly to [12], we used the Deep Fake Detection (DFD) dataset [56], which is composed of 363 high-resolution and high-quality videos depicting different activities performed by 28 actors. We selected 55 videos of actions in which the actor is talking while facing the camera as in a video conference setup (i.e. "podium speech" and "talking against wall" scenes), for an overall size of ∼40 GB and a duration of ∼40 minutes. The first 22 identities were utilized for training and the last 6 for testing.
We also employed the High-Definition Talking Face (HDTF) dataset [83], which contains 362 videos collected from YouTube with a resolution of 720P or 1080P. We used the "WDA" subset since it is composed of the videos that have the highest quality among those in the whole dataset, for a total of 193 videos. Since the videos are much longer than those of the DFD dataset, we used only the initial 30 seconds to reduce the computational cost; this does not hamper the evaluation since the visual content remains extremely similar. We relied on this dataset only for testing purposes, to compare the proposed approach with competing state-of-the-art methods, and to evaluate the generalization capabilities of the models trained on the DFD dataset.
Starting from the raw (Constant Rate Factor 0) version of the original sequences, each video was compressed with the H.264 codec and CRF 32 and 42 using FFmpeg [1]. Then, only during training, the frames of each sequence were extracted by sampling one frame every five, both for the raw and compressed versions. In addition, for the compressed versions, the frames were extracted starting from a given offset measured in the number of frames to skip. This was because for the training the reference frames (i.e. the raw ones) need to precede the compressed ones. The offset used in the experiments was equal to 5.
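The compression and sampling steps can be sketched as follows. This is an assumption-laden illustration: the exact FFmpeg options used for the experiments are not given beyond the codec and CRF, so the command below uses only the standard `libx264` encoder with `-crf`, and the file names are placeholders. The pairing of raw frame indices with offset compressed frame indices is one plausible reading of the sampling scheme, and both function names are hypothetical.

```python
def ffmpeg_h264_command(src: str, dst: str, crf: int) -> list:
    """Build an FFmpeg command that re-encodes src with H.264 at the given CRF."""
    return ["ffmpeg", "-y", "-i", src, "-c:v", "libx264", "-crf", str(crf), dst]


def training_pairs(num_frames: int, step: int = 5, offset: int = 5) -> list:
    """Pair raw (reference) frame indices with later compressed frame indices.

    Raw frames are sampled every `step` frames from index 0; compressed frames
    are sampled with the same step but starting `offset` frames later, so each
    reference precedes the compressed frame it is paired with.
    """
    raw = range(0, num_frames, step)
    compressed = range(offset, num_frames, step)
    return list(zip(raw, compressed))
```

For example, the command list can be passed to `subprocess.run(cmd, check=True)` once per CRF value (32 and 42), and `training_pairs(20)` pairs raw frame 0 with compressed frame 5, raw frame 5 with compressed frame 10, and so on.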
Both for training and testing we relied on dlib [27] to detect the face rectangle and the 68 facial landmarks of each frame. Then, we leveraged an affine transformation to perform the crop and alignment of the detected faces based on the set of facial landmarks. Each reference image was warped to the corresponding degraded one with Moving Least Squares to reduce the difference in pose and expression. To this end, we extracted the facial landmarks of both images and then applied the MLS algorithm presented in Sec. 3.3. Finally, we used the facial landmarks of the compressed frame to generate the landmarks binary images. After the preprocessing, we ended up with 9,007 images for the training set and 12,568 images for the test set, considering the DFD dataset. Instead, all the 175,832 frames of the HDTF dataset were used for testing.
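Generating the landmarks binary image from a set of detected landmarks can be sketched as below. The landmark detection itself (dlib in our pipeline) is left out, and the function name `landmark_binary_image` and the `radius` parameter, which controls the size of the white square drawn around each landmark, are assumptions made for this example.

```python
import numpy as np


def landmark_binary_image(landmarks, height: int, width: int, radius: int = 1) -> np.ndarray:
    """Return a (height, width) uint8 image that is 255 around each (x, y) landmark."""
    img = np.zeros((height, width), dtype=np.uint8)
    for x, y in np.rint(landmarks).astype(int):
        # Clamp both ends of the slice so landmarks near or outside the border are safe.
        y0, y1 = np.clip([y - radius, y + radius + 1], 0, height)
        x0, x1 = np.clip([x - radius, x + radius + 1], 0, width)
        img[y0:y1, x0:x1] = 255
    return img
```

The input would be the 68 landmarks of the compressed frame, given as an array of shape `(68, 2)` in pixel coordinates of the cropped and aligned face.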

Training Setup
To train both the generator and the discriminator we employed the Adam optimizer [28] with batch size 4, learning rate $10^{-4}$ and momentum parameters $\beta_1 = 0.9$ and $\beta_2 = 0.99$. We trained all the models for 15 epochs because after that the outputs did not change significantly. We adopted several data augmentation techniques, such as shifting, 90° rotations and cutout [4]. We performed a grid search to find the optimal trade-off parameters for the training losses. After that, they were set as follows: $\lambda_{MSE} = 300$, $\lambda_{perc} = 10$, $\lambda_{style} = 1$, $\lambda_{adv} = 2$, $\lambda_{adv,1} = 4$, $\lambda_{adv,2} = 2$, $\lambda_{adv,4} = 1$ and $\lambda_{adv,8} = 1$. The 4 layers used to compute the perceptual loss were given the same weight, equal to 1. During testing, we set the maximum cardinality of the set of keyframes to 10.
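As a sketch of how the trade-off parameters listed above enter the generator objective, the weighted sum can be written as follows. The dictionary and function names are hypothetical, and the way the per-scale adversarial terms are combined before being multiplied by the global adversarial weight is an assumption of this example.

```python
# Trade-off parameters reported in the training setup.
LOSS_WEIGHTS = {"mse": 300.0, "perc": 10.0, "style": 1.0, "adv": 2.0}
ADV_SCALE_WEIGHTS = {1: 4.0, 2: 2.0, 4: 1.0, 8: 1.0}  # keyed by downsampling factor r


def generator_adv_loss(per_scale: dict) -> float:
    """Weighted sum of the per-scale generator adversarial terms."""
    return sum(ADV_SCALE_WEIGHTS[r] * per_scale[r] for r in ADV_SCALE_WEIGHTS)


def total_generator_loss(mse: float, perc: float, style: float, adv_per_scale: dict) -> float:
    """Overall generator loss as a weighted sum of its four components."""
    return (LOSS_WEIGHTS["mse"] * mse
            + LOSS_WEIGHTS["perc"] * perc
            + LOSS_WEIGHTS["style"] * style
            + LOSS_WEIGHTS["adv"] * generator_adv_loss(adv_per_scale))
```

The large weight on the MSE term reflects that raw pixel errors are numerically small compared with feature-space distances, so the weights balance magnitudes as well as priorities.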

Evaluation Metrics
The performance is evaluated using six full-reference and two no-reference visual quality metrics. Regarding the full-reference metrics, we employed: 1) Peak Signal-to-Noise Ratio (PSNR), which is often used to evaluate reconstruction and compression artifacts reduction, despite its issues in estimating the perceived quality [21,22]; 2) Structural Similarity Index Measure (SSIM) [68], another commonly used metric, although it is known that it does not perform well on the output of generative models [30]; 3) Learned Perceptual Image Patch Similarity (LPIPS) [80], using, in particular, the version with AlexNet [31] backbone. Typically LPIPS measures are in contrast with SSIM, i.e. distortions that are low for LPIPS are high in SSIM and vice versa. LPIPS has been shown to have a very strong correlation with perceived visual quality; 4) CONTRastive Image QUality Evaluator-Full Reference (CONTRIQUE-FR) [44], using, in particular, the LIVE FR model downloaded from the official repository; 5) Video Multimethod Assessment Fusion (VMAF) [38], a full-reference perceptual video quality assessment model that combines multiple elementary quality metrics; 6) Video Multimethod Assessment Fusion - No Enhancement Gain (VMAF-NEG) [39], which subtracts the effect of image enhancement from the VMAF score. Indeed, VMAF tends to overpredict the perceptual quality when image enhancement techniques, such as sharpening or histogram equalization, are performed [39]. Both VMAF and VMAF-NEG include an elementary metric that accounts for the temporal difference between adjacent frames of the videos, thus evaluating the presence of motion jitter and flicker. Regarding the no-reference metrics, we relied on: 1) Blind/Referenceless Image Spatial QUality Evaluator (BRISQUE) [47], which evaluates the naturalness of an image; 2) CONTRastive Image QUality Evaluator (CONTRIQUE) [44], using, in particular, the LIVE model downloaded from the official repository.
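Of the metrics listed, PSNR is simple enough to state in a few lines, which also makes clear why it is a pure signal-fidelity measure. The sketch below assumes 8-bit images (peak value 255) stored as NumPy arrays of identical shape; the function name `psnr` is chosen for this example.

```python
import numpy as np


def psnr(reference: np.ndarray, restored: np.ndarray, max_value: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB between two images of the same shape."""
    diff = reference.astype(np.float64) - restored.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))
```

Since PSNR depends only on the mean squared pixel error, a slightly blurred output can score higher than a sharp, photorealistic one whose fine texture does not match the ground truth pixel by pixel, which is one reason perceptual metrics such as LPIPS are reported alongside it.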

Baselines
We compare the proposed approach with several state-of-the-art methods: six methods for blind face restoration, HiFaceGAN [74], PSFR-GAN [3], GFP-GAN [67], GPEN [75], DFDNet [36] and ASFFNet [37], and one for face super-resolution, GWAINet [5]. DFDNet, PSFR-GAN, GPEN and GFP-GAN do not use a reference image but utilize extra face priors: offline-generated dictionaries of facial components for DFDNet, a segmentation mask for PSFR-GAN, and pre-trained GANs for GPEN and GFP-GAN. Instead, GWAINet exploits a reference image that is warped to the compressed one by means of a warper network. HiFaceGAN does not require any additional information w.r.t. the compressed input image. The most similar to our work is ASFFNet, which leverages a reference image and a binary landmark image. As ASFFNet needs a given static set of reference images, we make all the keyframes in the video available to it as possible guidance. Therefore, ASFFNet actually has an advantage over our approach, as, in our case, we limit the maximum cardinality of the set of keyframes to 10.

Quantitative Results
The quantitative results for the DFD dataset are reported in Tab. 1. The proposed method achieves the best performance for the LPIPS metric, which is the most indicative full-reference perceptual metric, as well as in terms of CONTRIQUE, CONTRIQUE-FR, and VMAF-NEG. PSFR-GAN performs better with regard to the signal metrics PSNR and SSIM, while GWAINet achieves the best result for BRISQUE. However, manual inspection shows that the images produced by GWAINet include excessive high-frequency artifacts and thus we did not consider this approach in the other experiments. GFP-GAN obtains the best VMAF value, probably because of its tendency to saturate colors and increase contrast at the cost of loss of photorealism, as is visible from the qualitative results. This tendency is similar to the application of image enhancement methods, which are known to boost the VMAF score [39]. In support of this theory, we can notice the large difference from the VMAF-NEG score, which in contrast is not affected by image enhancement techniques. Our method achieves both the second-best VMAF value and the best VMAF-NEG value, proving its ability to obtain great overall video quality while preserving photorealism. Moreover, the VMAF and VMAF-NEG scores show that our video results are temporally consistent and do not present too much motion jitter, flicker or mosquito noise.
In the second experiment, reported in Tab. 2, we compare the proposed method with the baselines on the HDTF dataset. It is important to note that our model has not been trained on this dataset, so that we can evaluate its generalization capabilities. Again, the proposed approach outperforms the other methods in terms of LPIPS, CONTRIQUE, CONTRIQUE-FR, and VMAF-NEG. Manual examination of the results suggests that this may be explained by the fact that several competing approaches tend to add (or, on the contrary, hide) skin imperfections or excessively boost the color of lips and eyebrows.
Overall, our method is the one that performs best with the highest consistency, as none of the baselines achieves better performance on multiple metrics simultaneously. The results obtained for the HDTF dataset also show that the proposed model is capable of generalization. In addition, we argue that the metrics for which our method performs best, namely LPIPS, CONTRIQUE, CONTRIQUE-FR, and VMAF-NEG, are those that correlate best with the actual quality of the restored frames. In Sec. 5 we provide some examples that support this argument.

Qualitative Results
Qualitative results for the DFD dataset are shown in Fig. 6. Our approach outperforms all the baselines in generating photorealistic and detailed results. GWAINet, HiFaceGAN and PSFR-GAN produce unsatisfactory images that still present visible artifacts, see for example the mouth in the second row. GFP-GAN and GPEN generate detailed but artsy and not photorealistic results, such as the eyes in the first and fifth rows. DFDNet and ASFFNet achieve a better tradeoff between details and photorealism but, as can be seen in the last row, still produce visible artifacts. Our model exploits the reference keyframe and reproduces the high-frequency details lost after such strong compression without loss of photorealism. It is interesting to note that often the reference image (i.e. the bottom-left image in the input column) is not too similar to the degraded image, but the proposed method is still able to exploit it. For example, in the last row, the reference image has open eyes while the compressed one has them closed, and despite this, our model correctly depicts the restored frame with closed eyes.

Table 1. Quantitative comparison between the proposed approach and other state-of-the-art methods for CRF 42 on the DFD dataset [56]. Best and second best results are in bold and underlined, respectively. ↑ = higher values are better, ↓ = lower values are better.

Figure 7 shows the qualitative results for the HDTF dataset. Again, our method produces the most detailed and photorealistic images. All the baselines generate blurry hair in both the first and second rows, as well as a beard lacking detail in the third row. In the first row, PSFR-GAN, GFP-GAN, GPEN and ASFFNet mistake the shadow of the glasses for their border and thus produce unrealistic results. In the fourth row, GPEN and DFDNet hallucinate moles that are not present in the ground truth. In the fifth row, our method is the only one capable of depicting the eyes as closed without adding artifacts. In the last row, GFP-GAN and GPEN add traces of glasses, while ASFFNet exploits the reference incorrectly and portrays the eyes as open. In general, our method is the one that most consistently generates satisfactory results that are similar to the ground truth.

Subjective Experiments
In this experiment we conducted a subjective test based on the three-alternative forced choice (3-AFC) methodology, using the AVrate Voyager tool [16,52]. The test included the inspection of 15 sets of videos, 8 from the DFD dataset and 7 from the HDTF one, so as to keep the completion time at around 15-20 minutes and avoid excessive fatigue, as recommended by ITU-R BT.500-13 [24]. Each original video was compressed with CRF 42 and restored using our proposed method, GPEN [75] and GFP-GAN [67]; using 3-AFC allowed us to reduce the number of required comparisons [51]. Participants (18, i.e. almost double the minimum required [72]) were asked to choose the reconstruction that most closely matched the original high-quality video, without considering aesthetic preferences. The position of the results of all the methods was changed randomly for each evaluation. Figure 8 reports the percentages of the forced choices for the 15 sets. The much larger preference given to our proposed method can be attributed to the fact that the proposed GAN adds fewer high-frequency details and color shifts than GPEN [75] and GFP-GAN [67]; these additions tend to be more visible in a video sequence than when the frames are evaluated separately using the quality metrics.
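As an illustration of how the percentages of forced choices for one set of videos can be obtained, the following minimal sketch tallies the votes of the participants. The function name and the vote values are hypothetical and are used only to show the computation; they are not the results of our experiment.

```python
from collections import Counter


def choice_percentages(votes, methods):
    """Return the percentage of forced choices received by each method."""
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    total = len(votes)
    # Methods that received no vote get 0.0 instead of being omitted.
    return {method: 100.0 * counts[method] / total for method in methods}


methods = ["Ours", "GPEN", "GFP-GAN"]
# Hypothetical votes of 18 participants for a single set of videos.
votes = ["Ours"] * 12 + ["GPEN"] * 4 + ["GFP-GAN"] * 2
percentages = choice_percentages(votes, methods)
for method in methods:
    print(f"{method}: {percentages[method]:.1f}%")
```

Since every participant must pick exactly one of the three alternatives, the percentages of each set always sum to 100.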

Inference Time
We compared the Frames Per Second (FPS) processed by our model with those of the baselines. The experiments were performed on an NVIDIA RTX 2080 Ti GPU. As shown in Tab. 3, our method achieves a number of FPS similar to or better than the baselines while outperforming them in terms of quality. Given that our model runs at almost 45 FPS, it is capable of real-time inference and therefore suitable for videoconferencing.
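A minimal sketch of how such a throughput measurement can be set up is given below. It is not the benchmarking code used for Tab. 3: `measure_fps`, `frame_budget_ms` and the dummy `process_frame` callable are hypothetical names, and the dummy workload only stands in for a restoration network.

```python
import time


def measure_fps(process_frame, frames, warmup=5):
    """Average number of frames processed per second by `process_frame`."""
    # Warm-up iterations are excluded from the timing.
    for frame in frames[:warmup]:
        process_frame(frame)
    # Note: with a GPU model, the device should be synchronized before
    # reading the clock (e.g. torch.cuda.synchronize() in PyTorch).
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed


def frame_budget_ms(fps):
    """Time available for each frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# Dummy "frames" and a dummy workload standing in for the real model.
frames = [list(range(1000)) for _ in range(200)]
fps = measure_fps(lambda frame: sum(frame), frames)
print(f"measured: {fps:.0f} FPS")
print(f"budget at 45 FPS: {frame_budget_ms(45):.1f} ms per frame")
```

At 45 FPS each frame is processed in roughly 22 ms, which is below the 33.3 ms available per frame in a 30 FPS stream.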

Ablation Studies
Architecture We performed ablation studies to evaluate the importance of each component of our architecture. In particular, we measure the effect of using: i) Multi-scale features; ii) ASFF blocks; iii) SFT blocks. We start from

Method            # parameters (M)    FPS
HiFaceGAN [74]    72.22               44
PSFR-GAN [3]      67.26               28
GFP-GAN [67]      86.44               49
GPEN [75]         71.00               39
DFDNet [36]       113.31              4
ASFFNet [37]      23.

Additionally, we substitute the ASFF and SFT blocks one at a time with SPADE [49], a spatially-adaptive denormalization block. The proposed architecture outperforms both versions that make use of SPADE, showing that ASFF and SFT blocks are more effective in our architecture.
Keyframes Selection Policy Tables 5 and 6 compare the proposed LFU update policy with a different approach that maximizes the diversity of the keyframes, called "Max distance". The "Max distance" policy consists of maximizing the Euclidean distance between the facial landmarks of the frames of the set, in order to have a wide range of poses and expressions. The idea is that, in this way, every future frame of the video should always have a reference in the set that is not too different. For each new keyframe, its distance to all the keyframes in the set is computed. Then, among all the possible combinations of frames, we choose the group of keyframes that maximizes the total distance, so the new keyframe is not necessarily added to the set. Table 5 reports the results obtained with CRF 32 and Table 6 those for CRF 42. The maximum number of keyframes in the group was set to 10 in both cases. The results show that the proposed LFU strategy outperforms the "Max distance" one for almost all the metrics.
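The "Max distance" update described above can be sketched as follows. This is a minimal illustration rather than the released implementation: the function names are hypothetical, each keyframe is represented only by its list of (x, y) facial landmarks, and the total distance of a group is assumed to be the sum of the pairwise Euclidean distances between its members.

```python
import math
from itertools import combinations


def landmark_distance(a, b):
    """Euclidean distance between two lists of (x, y) facial landmarks."""
    return math.sqrt(sum((xa - xb) ** 2 + (ya - yb) ** 2
                         for (xa, ya), (xb, yb) in zip(a, b)))


def total_distance(group):
    """Sum of the pairwise landmark distances within a group of keyframes."""
    return sum(landmark_distance(a, b) for a, b in combinations(group, 2))


def max_distance_update(keyframes, new_keyframe, max_size=10):
    """Return the group of at most `max_size` keyframes with maximum total distance."""
    candidates = keyframes + [new_keyframe]
    if len(candidates) <= max_size:
        return candidates
    # Evaluate every group of `max_size` candidates; the new keyframe
    # is kept only if it belongs to the best group.
    return list(max(combinations(candidates, max_size), key=total_distance))


# Toy example with one landmark per frame and a set of at most 2 keyframes.
current = [[(0, 0)], [(10, 0)]]
print(max_distance_update(current, [(5, 0)], max_size=2))   # new keyframe discarded
print(max_distance_update(current, [(20, 0)], max_size=2))  # new keyframe replaces one
```

With a full set of 10 keyframes and one new candidate, only 11 groups have to be evaluated, so the exhaustive search remains cheap.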
Keyframes Set Cardinality Regarding the size of the set of keyframes, we expect that as the maximum cardinality increases, the results will improve. In fact, with more possible references available, it is less likely that a compressed frame has no similar reference. The results reported in Tab. 7 confirm our assumption, although the increase in performance is modest. Nevertheless, we set the maximum cardinality to 10, because choosing the best keyframe still takes only about 0.1 milliseconds, so a larger set of keyframes to choose from does not significantly affect the computational cost.
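To make the cost of the selection concrete, the sketch below keeps a bounded set of keyframes and picks a reference with a linear scan. It relies on two assumptions that are not stated in this section: the best keyframe is taken to be the one whose facial landmarks are closest to those of the current frame, and the LFU policy is taken to evict the keyframe that has been selected the fewest times when the set is full. The class and method names are hypothetical.

```python
import math


class KeyframeSet:
    """Bounded set of keyframes with least-frequently-used (LFU) eviction."""

    def __init__(self, max_size=10):
        self.max_size = max_size
        self.landmarks = []   # one list of (x, y) landmarks per keyframe
        self.use_counts = []  # number of times each keyframe was selected

    @staticmethod
    def _distance(a, b):
        return math.sqrt(sum((xa - xb) ** 2 + (ya - yb) ** 2
                             for (xa, ya), (xb, yb) in zip(a, b)))

    def add(self, landmarks):
        """Insert a new keyframe, evicting the least used one if the set is full."""
        if len(self.landmarks) >= self.max_size:
            # On ties, the oldest among the least used keyframes is evicted.
            victim = self.use_counts.index(min(self.use_counts))
            del self.landmarks[victim]
            del self.use_counts[victim]
        self.landmarks.append(landmarks)
        self.use_counts.append(0)

    def select(self, frame_landmarks):
        """Return the index of the closest keyframe and update its usage count."""
        if not self.landmarks:
            raise ValueError("the set of keyframes is empty")
        best = min(range(len(self.landmarks)),
                   key=lambda i: self._distance(self.landmarks[i], frame_landmarks))
        self.use_counts[best] += 1
        return best


keyframes = KeyframeSet(max_size=2)
keyframes.add([(0, 0)])
keyframes.add([(10, 0)])
first = keyframes.select([(1, 0)])   # closest to the first keyframe
second = keyframes.select([(2, 0)])  # closest to the first keyframe again
keyframes.add([(20, 0)])             # the never-used second keyframe is evicted
print(first, second, keyframes.landmarks)
```

Under these assumptions the selection is a single pass over at most 10 landmark sets, which is consistent with its negligible cost compared to the inference time of the network.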
Feature Extractor In this experiment, we replace the VGG-19 backbone with different feature extractors. In particular, we exploit the small and large versions of MobileNetV3 [19], a popular and lightweight CNN designed for mobile platforms which, in our experiments, roughly halves the inference time of our model. Table 8 reports the quantitative results. As expected, the version with VGG-19 outperforms the MobileNetV3 ones, but its number of parameters is an order of magnitude greater. However, the qualitative results obtained with MobileNetV3 as the feature extractor are still more than acceptable, which shows the effectiveness of our approach and suggests that these backbones could be used for deployment on mobile devices.
Discriminator We substitute the multi-scale discriminators with a standard single-scale discriminator. Consequently, we also replace the adversarial loss described in Eq. (10) with its single-scale counterpart, whose terms ℓ_adv,D and ℓ_adv,G are used to update the discriminator and the generator, respectively. Tables 9 and 10 report the quantitative results for the DFD and HDTF datasets, respectively. Even if the version with the single-scale discriminator outperforms the multi-scale one for some metrics, the qualitative results clearly show that the multi-scale discriminators yield less blurry, sharper and more detailed outputs. This is also confirmed by the lower values of the LPIPS metric for both datasets. For instance, Fig. 9 shows that the multi-scale version has less blurred and more detailed hair and eyes than the single-scale one, as well as an overall color more faithful to the ground truth. Additionally, the multi-scale discriminators achieve higher VMAF and VMAF-NEG values, which correspond to better temporal consistency.

Conclusion
In this paper, we have proposed a novel GAN-based method and a keyframe selection system that improve the visual quality of videoconference videos by enhancing the appearance of faces. A key element of the system is the policy that updates a set of previous I-frames and exploits them to guide the visual quality improvement process. The proposed approach improves over competing state-of-the-art methods in terms of perceptual metrics and is rated much better in terms of fidelity by human evaluators.

Supplementary Material
Quantitative Results Analysis In Sec. 4.5 we reported the quantitative results for the DFD and HDTF datasets. Our method obtains the best performance in terms of LPIPS, CONTRIQUE, CONTRIQUE-FR and VMAF-NEG. We argue that these metrics best correlate with the perceived visual quality. In Figs. 10 and 11 we show two examples supporting our argument. In Fig. 10 we compare a frame restored by our method and by GWAINet and present the corresponding values of the no-reference metrics BRISQUE and CONTRIQUE. The proposed approach clearly generates a more satisfying image than GWAINet, which adds high-frequency artifacts. We argue that these artifacts deceive BRISQUE, which mistakes them for the high-frequency details that are distinctive of high-quality images [58]. In Fig. 11 we report the values of the full-reference metrics PSNR, SSIM, LPIPS and CONTRIQUE-FR obtained by our approach and HiFaceGAN for a restored frame. Again, the proposed method produces a more detailed and photorealistic image, while HiFaceGAN generates a frame with visible artifacts. However, HiFaceGAN obtains better values for PSNR and SSIM. PSNR and SSIM are signal-based metrics that do not correlate well with the perceived visual quality for the output of generative models [21,22,30]. On the contrary, LPIPS and CONTRIQUE-FR are perceptual-based metrics and are good indicators of the actual perceived quality of an image.
Regarding VMAF, it is known that image enhancement techniques tend to boost its values [39]. As Fig. 12 shows, some baselines, such as GFP-GAN, saturate colors and increase the contrast of the restored frames, making them more visually pleasing but less similar to the ground truth. This tendency is similar to the application of image enhancement methods. We argue that this is the reason why such baselines perform so well in terms of VMAF. Our argument is supported by the large difference, in Tabs. 1 and 2, between the VMAF values of these baselines and their VMAF-NEG values, since VMAF-NEG is not affected by image enhancement techniques. On the contrary, our method obtains high values for both metrics without a substantial difference between them, meaning that they are due to the actual quality and temporal consistency of the results, and not to color enhancement.
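To make the signal-based nature of PSNR concrete, the following minimal sketch computes it from the standard definition, PSNR = 10 log10(MAX^2 / MSE), on toy grayscale images stored as lists of rows. The pixel values are hypothetical and only illustrate that the score depends exclusively on the pixel-wise error, not on how natural the image looks.

```python
import math


def psnr(reference, restored, max_value=255.0):
    """PSNR in dB between two equally sized grayscale images (lists of rows)."""
    squared_error = 0.0
    count = 0
    for ref_row, out_row in zip(reference, restored):
        for ref_pixel, out_pixel in zip(ref_row, out_row):
            squared_error += (ref_pixel - out_pixel) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


reference = [[100, 100], [100, 100]]
small_errors = [[101, 99], [101, 99]]     # every pixel is off by 1 (MSE = 1)
localized_error = [[104, 100], [100, 100]]  # a single pixel is off by 4 (MSE = 4)

print(f"{psnr(reference, small_errors):.2f} dB")
print(f"{psnr(reference, localized_error):.2f} dB")
```

The first image scores about 6 dB higher than the second only because its mean squared error is four times smaller. In the same way, an output that stays close to the ground truth on average is rewarded by PSNR, while an output with sharp but slightly displaced details is penalized.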

Figure 2. Overview of the proposed architecture. Best viewed in color on PDF.

Figure 3. Diagram of the multi-scale feature extraction with VGG-19.

Figure 6. Qualitative comparison between the proposed approach and the baselines for the DFD dataset and CRF 42. The bottom-left image in the input column represents the reference frame exploited by our approach. Best viewed in full screen.

Figure 7. Qualitative comparison between the proposed approach and the baselines for the HDTF dataset and CRF 42. The bottom-left image in the input column represents the reference frame exploited by our approach. Best viewed in full screen.

Figure 8. Subjective results using 3-AFC. Videos from 1 to 8 belong to the DFD dataset, the others to the HDTF dataset.

Figure 9. Qualitative results for different discriminators for max cardinality 10 and CRF 42 on the DFD dataset.

Figure 10. Comparison between our method and GWAINet [5]. The reported values represent BRISQUE ↓ / CONTRIQUE ↓, respectively, where ↓ means that lower values are better. Best results for each image are highlighted in bold.

Table 3. FPS comparison between the proposed approach and other state-of-the-art methods. Best and second best results are in bold and underlined, respectively.

Table 6. Ablation studies on the keyframes selection policy for the DFD dataset and CRF 42. Best results are in bold. ↑ = higher values are better, ↓ = lower values are better.

Table 7. Ablation studies on the maximum cardinality of the set of references for the DFD dataset and CRF 42. Best results are in bold. ↑ = higher values are better, ↓ = lower values are better.