A multiresolution mixture generative adversarial network for video super-resolution

Generative adversarial networks (GANs) have been used to obtain super-resolution (SR) videos that have improved visual perception quality and more coherent details. However, the latest methods perform poorly in areas with dense textures. To better recover the areas with dense textures in video frames and improve the visual perception quality and coherence in videos, this paper proposes a multiresolution mixture generative adversarial network for video super-resolution (MRMVSR). We propose a multiresolution mixture network (MRMNet) as the generative network that can simultaneously generate multiresolution feature maps. In MRMNet, the high-resolution (HR) feature maps can continuously extract information from low-resolution (LR) feature maps to supplement information. In addition, we propose a residual fluctuation loss function for video super-resolution. The residual fluctuation loss function is used to reduce the overall residual fluctuation on SR and HR video frames to avoid a scenario where local differences are too large. Experimental results on the public benchmark dataset show that our method outperforms the state-of-the-art methods for the majority of the test sets.


Introduction
Super-resolution (SR) imaging techniques are used to solve the classic problem of recovering high-resolution (HR) images from low-resolution (LR) images. These techniques are widely used in image processing. At present, there are many ways to obtain SR, but there is still room for further development to improve upon the techniques.
With the relatively recent development of artificial intelligence, the use of deep learning to achieve SR has attracted widespread attention [1][2][3][4][5][6][7][8]. Many deep learning-based image methods are superior to traditional methods, achieving breakthroughs in the peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) metrics [9]. Among them, image super-resolution (ISR) based on generative adversarial networks (GANs) [10] has recorded improvements in visual perception quality. However, using adversarial training for video super-resolution (VSR) has not received the same attention. Unlike ISR, VSR has to consider the relation between consecutive video frames, using both spatial and temporal information to generate results with temporal and spatial consistency. The creators of the VSR algorithm TecoGAN [11] proposed a spatio-temporal discriminator and a Ping-Pong loss function to achieve such consistency. Using GANs in VSR can generate coherent and clear video details. However, there is still a large difference between the SR video produced by TecoGAN and the real video. The performance of TecoGAN still needs to be improved, especially in areas with dense textures.
A multiresolution mixture generative adversarial network for video super-resolution (MRMVSR) is proposed in this paper. In order to make full use of the information in LR video frames and generate results with better visual quality, this paper proposes a multiresolution mixture network (MRMNet) for VSR. In contrast to traditional networks, the MRMNet maintains feature maps at multiple resolutions simultaneously during the training process, so that information can be continuously extracted from the LR feature maps to supplement the HR feature maps. The LR frame is gradually enlarged to the target resolution after being put into the network. Moreover, we propose a residual fluctuation loss function to avoid excessive local differences in the generated frames and to generate better results in areas with dense textures. Experiments on public datasets were used to test our proposed model against other state-of-the-art methods.
The contributions of the proposed method are summarized as follows: 1) an MRMNet is proposed for VSR, which makes full use of the information from LR images, 2) a residual fluctuation loss function is proposed for VSR to improve the visual perception quality of the resulting image, and 3) the performance of the MRMVSR model is fully evaluated and is shown to be superior to that of state-of-the-art video super-resolution methods.
There are five sections in this paper. We have briefly introduced the research content in Section 1. Section 2 examines related studies which motivate the proposed method. Section 3 introduces the proposed method in detail. In Section 4, the results and corresponding discussion of the experiments are presented, with concluding statements given in Section 5.

Related work
In recent years, image and video acquisition, processing, and analysis have received growing attention from researchers [12][13][14][15][16]. A significant amount of work has addressed ISR, and Wang et al. [16] have reviewed this body of work. The SR convolutional neural network [15] proposed by Dong et al. was the first to apply deep learning to ISR; it performs single image super-resolution and surpasses the traditional methods in terms of the PSNR and SSIM metrics. Subsequent research has mainly improved the network itself, accelerating the training process and improving the performance of ISR. The network structure of improved models for VSR can be divided into three distinct types: network front-end upsampling, network back-end upsampling, and iterative upsampling [4,8,17,18,19,20]. Although these methods have achieved excellent results according to PSNR and SSIM, the visual perception quality is still poor. Several models follow SRGAN [19] in combining the GAN and perceptual loss [21] for ISR, which yields results with improved visual quality [22][23][24][25].
VSR differs from ISR in that it must generate continuous multi-frame images with a certain relationship between frames. Simply using ISR methods for VSR loses the related information between adjacent frames. VSR has been realized by complex calculation methods that consequently have very high computational complexity [26,27]. However, most existing methods that use a deep-learning technique to complete VSR divide the task into multiple subtasks. Each sub-task recovers an HR image from multiple LR video frames [1,3,7,28]. One such method, frame-recurrent VSR (FRVSR) [29], takes multiple LR frames as input and learns motion compensation information through a flow estimation network. This method uses the generated HR video frames to recurrently generate subsequent video frames, which allows it to reuse high-frequency details and improve temporal consistency. GANs for ISR have been successful in improving visual perception quality; however, few studies apply such a method to VSR. To address this problem, TempoGAN [30] uses a GAN for VSR, improving overall temporal consistency. Furthermore, the TecoGAN algorithm introduced a novel spatio-temporal discriminator that goes beyond the single function of authenticity identification and can guide the network to generate spatio-temporally consistent results. Experimental results have shown that TecoGAN can generate clear and coherent details. Although existing methods produce improved video clarity, their performance in dense texture areas is still not satisfactory. Moreover, the visual perception quality needs to be further improved. In order to improve the performance in texture-dense regions and generate results with better visual perception quality, this paper proposes an MRMNet and a residual fluctuation loss function.

Methodology
In this section, we introduce our proposed method, which consists of the network architecture MRMNet and the residual fluctuation loss function used for training.

MRMNet
To make full use of the information from LR frames such that the quality of VSR improves, we propose an MRMNet. This network architecture, a generator network in adversarial training, is illustrated in Fig 1. There are three types of resolution feature map (x1, x2, x4) in the whole network. The label x1 signifies that the resolution of the feature map is the same as the original resolution, whilst x2 and x4 denote that the resolution of the feature map is magnified two and four times the original resolution, respectively. The entire network gradually enlarges x1 resolution frames to x4 resolution in multiple stages.
Specifically, the MRMNet consists of three components: the bottleneck module, the exchange unit, and the residual module. The bottleneck module is responsible for feature extraction from LR frames and for expressing LR features efficiently. The exchange unit is the central component of MRMNet, enlarging frames and obtaining exchanged features. The exchanged features are higher-resolution feature maps obtained from feature maps of the same or lower resolution. These multiresolution feature maps are then combined to form a new feature map. The residual module is responsible for feature extraction and representation learning after the exchange unit.
The details of the bottleneck module are shown in Fig 2. It consists mainly of convolution layers and activation layers. The details of the convolution layers are also shown in Fig 2, with the kernel size (k), the number of channels (n), and stride (s) all given. In the activation layers, we chose the parametric rectified linear unit (PReLU) [31] as the activation function. The input and output of the bottleneck module are both LR feature maps. It should be noted that the input feature map passes through two routes, one with a single convolution layer and one with three convolution layers; the two resulting feature maps are combined and then activated to form the output of the bottleneck module.
The exchange unit of the MRMNet is shown in Fig 3. The resolutions of the input feature maps are x1, x2, and x4, while the resolutions of the output feature maps are x2 and x4. The feature map of x1 resolution is enlarged to x2 resolution by the deconvolution operation [32], and is enlarged to x4 resolution by applying the deconvolution operation twice. If the x1 resolution feature map were enlarged to x4 resolution by applying the deconvolution operation only once, the training results would contain checkerboard artifacts. The feature map of x2 resolution is enlarged to x4 resolution by applying the deconvolution operation once. In all deconvolution layers, the kernel size is 3x3, the stride is 2, and the numbers of channels of the x1, x2, and x4 feature maps are 128, 64, and 32, respectively. A feature map with the same resolution can also be obtained through the convolution operation, where the kernel size is 3x3, the stride is 1, and the numbers of channels of the x1, x2, and x4 feature maps are 128, 64, and 32, respectively. To produce an output feature map at a specified resolution, the exchange unit takes all input feature maps whose resolution is less than or equal to that resolution. For example, if the x4 resolution feature map is the desired output, feature maps must first be obtained from the x1 and x2 resolution feature maps through deconvolution. Second, if the inputs of the exchange unit contain an x4 resolution feature map, a new feature map with the same resolution should be obtained from it via convolution. Then, all feature maps (if there is more than one) are merged into a single feature map by concatenation in the channel dimension. The number of channels is adjusted by a convolution layer that has a kernel size of 1x1 and a stride of 1, so that the numbers of channels of the x1, x2, and x4 feature maps are 128, 64, and 32, respectively. The final output is obtained using an activation layer.
Through this structure, the exchange unit can supplement the information gathered from the LR frame and transfer it to the HR frame, obtain enlarged frames, and exchange features. After the first exchange unit, the number of channels of x1, x2, and x4 feature maps are 128, 64, and 32, respectively.
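The routing rules of the exchange unit can be sketched in a few lines of Python. This is only an illustrative plan of which operations feed each output resolution, not the authors' implementation: the names `CHANNELS`, `exchange_plan`, and `concat_channels` are hypothetical, and the sketch assumes that every branch feeding a given output emits the channel count stated for that output resolution before concatenation.

```python
# Channel widths per resolution scale (x1, x2, x4), as stated in the text.
CHANNELS = {1: 128, 2: 64, 4: 32}


def exchange_plan(input_scales, output_scales=(2, 4)):
    """Return, for each output scale, the branches that feed it.

    Each branch is (source_scale, operations), where an operation is
    "deconv" (3x3, stride 2, doubles the resolution) or
    "conv" (3x3, stride 1, keeps the resolution).
    """
    plan = {}
    for out in output_scales:
        branches = []
        for src in sorted(input_scales):
            if src > out:
                continue  # only equal or lower resolutions feed an output
            ops = []
            scale = src
            while scale < out:
                ops.append("deconv")  # one x2 step at a time, never x4 at once
                scale *= 2
            if not ops:
                ops.append("conv")  # same resolution: plain convolution
            branches.append((src, ops))
        plan[out] = branches
    return plan


def concat_channels(plan, out):
    """Channels after concatenation (assumption: each branch emits CHANNELS[out])."""
    return CHANNELS[out] * len(plan[out])


plan = exchange_plan([1, 2, 4])
for out, branches in plan.items():
    merged = concat_channels(plan, out)
    print(f"x{out}: branches={branches}, concat={merged} -> 1x1 conv -> {CHANNELS[out]}")
```

Under these assumptions, the x4 output is fed by three branches (two deconvolutions from x1, one from x2, and a convolution from x4), and the 1x1 convolution then reduces the concatenated channels back to 32.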
The residual module of the MRMNet, as shown in Fig 4, is a classic residual network without the batch normalization (BN) layer. The MRMNet also has a convolutional layer and an activation layer at the beginning of the network, and a convolutional layer after the bottleneck module and at the end of the network. At the beginning, the kernel size of the convolutional layer is 3x3, the number of channels is 64, and the stride is 1. After the bottleneck module, the kernel size is 3x3, the number of channels is 128, and the stride is 1. At the end, the kernel size is 9x9, the number of channels is 3, and the stride is 1. In the MRMNet, the numbers of bottleneck modules and residual modules used in each stage can be adjusted. In this paper, the number of each component is fixed at two in each stage.

Loss function
In deep learning-based methods, the loss function is necessary since it can guide the neural network to learn desired information. In order to reduce the overall fluctuation of residual values between the SR frame and HR frame, hence avoiding a result with excessive local differences, we propose a residual fluctuation loss function. We will introduce the details of the residual fluctuation loss and the final loss function as follows.
Residual fluctuation loss function. The residual fluctuation loss function $l^{SR}_{rf}$ is proposed to reduce the overall fluctuation and avoid excessive local differences in the generated frame. It is based on the variance from probability theory and statistics: the function is calculated as the variance of the difference between the HR feature map and the SR feature map,

$$l^{SR}_{rf} = \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H} \left( Res_{x,y} - \mathrm{mean}(Res) \right)^2 \quad (1)$$

where H and W denote the dimensions of the feature maps, Res denotes the residual between the reference HR frame and the generated SR frame, $Res_{x,y}$ denotes the pixel value of Res at point (x, y), and mean(Res) denotes the average pixel value of the residual Res. The residual Res is calculated using

$$Res = F_{VGG}(I^{HR}) - F_{VGG}(G(I^{LR})) \quad (2)$$

Here, $F_{VGG}$ denotes the feature map that is obtained by the VGG19 [33] network, $I^{LR}$ is the LR version of the HR frame $I^{HR}$, and G denotes the generator network.

Final loss function. The generator loss of TecoGAN, $L_{G,F-TecoGAN}$, is used as part of the loss of MRMVSR's generator. $L_{G,F-TecoGAN}$ is calculated using Eq (3), in which $g_t$ and $y_t$ denote the generated frame and reference frame for the t-th frame, respectively, while $IN_g$ and $IN_y$ denote the generated frames and reference frames for three consecutive frames, respectively. Furthermore, $F_D$ and $F_{VGG}$ denote the feature maps of the discriminator network D and the VGG19 network, respectively. $\lambda_a$ and $\lambda_l$ are the coefficients of the loss function. $L_{PP}$ and $L_{warp}$ are the Ping-Pong loss and warp loss, which are consistent with TecoGAN [11].
Using these two terms, the final loss function $L_{G,F-MRMVSR}$ is formulated by combining $L_{G,F-TecoGAN}$ with the residual fluctuation loss $l^{SR}_{rf}$. The proposed model is trained with this final loss function.
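As a minimal sketch of the residual fluctuation loss described above, the variance of the residual can be computed as follows. The sketch assumes single-channel feature maps given as nested Python lists that have already been extracted (for example by VGG19); the function name `residual_fluctuation_loss` and the toy inputs are hypothetical and are not taken from the authors' code.

```python
def residual_fluctuation_loss(hr_features, sr_features):
    """Population variance of the residual between two H x W feature maps."""
    # Residual: element-wise difference between HR and SR feature maps.
    residual = [
        [h - s for h, s in zip(hr_row, sr_row)]
        for hr_row, sr_row in zip(hr_features, sr_features)
    ]
    values = [v for row in residual for v in row]
    mean = sum(values) / len(values)
    # Mean squared deviation of the residual from its own mean.
    return sum((v - mean) ** 2 for v in values) / len(values)


sr = [[0.0, 0.0], [0.0, 0.0]]
# Residuals 1, 3, 1, 3: the error fluctuates across positions.
loss_uneven = residual_fluctuation_loss([[1.0, 3.0], [1.0, 3.0]], sr)
# Residuals 2, 2, 2, 2: the error is the same everywhere.
loss_uniform = residual_fluctuation_loss([[2.0, 2.0], [2.0, 2.0]], sr)
```

Note that a spatially uniform residual gives a loss of zero even though the frames differ, because the variance only measures how unevenly the error is distributed. This is consistent with the term being combined with the other generator loss terms rather than used alone.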

Datasets
The training data used for MRMVSR had the same source as that of TecoGAN and was obtained from the HR video dataset Vimeo [34]. Specifically, this training set has 290 video clips that were extracted from 28 high-definition videos. Each video clip consists of 120 frames, and hence 34,800 images were included in the training set. We used 250 video clips (a total of 30,000 images) for training and 40 video clips (a total of 4800 images) for validation. Image augmentation, including rotations and flipping, was used in the training process. Four scene sets in the Vid4 dataset [26], namely Calendar, City, Foliage, and Walk, were used as the test data for both the MRMVSR model and the comparative models. These four scene data sets all contain 50 consecutive video frames and are commonly used in the field of VSR.

Training details
In the training process, following the TecoGAN model, the MRMVSR model enlarges the video frame resolution four times to obtain the SR video frames. The LR video frame was obtained by down-sampling the HR video frame via bicubic interpolation. The LR video frame was normalized to [0, 1], whilst the HR video frame was normalized to [-1, 1]. To prevent the generation of local noise, the values of the feature map were clipped to [-5, 5] after the exchange unit.
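The value ranges mentioned above can be illustrated with a short sketch. It assumes 8-bit input pixels (0 to 255), which the text does not state, and it reads the [-5, 5] restriction as element-wise clipping of feature values; the helper names `normalize_lr`, `normalize_hr`, and `clip_feature` are hypothetical.

```python
def normalize_lr(pixel):
    """Map an 8-bit pixel value (0..255) to [0, 1] (assumed input range)."""
    return pixel / 255.0


def normalize_hr(pixel):
    """Map an 8-bit pixel value (0..255) to [-1, 1] (assumed input range)."""
    return pixel / 127.5 - 1.0


def clip_feature(value, low=-5.0, high=5.0):
    """Clamp a feature value to [low, high], as applied after the exchange unit."""
    return max(low, min(high, value))


lr_range = (normalize_lr(0), normalize_lr(255))
hr_range = (normalize_hr(0), normalize_hr(255))
clipped = [clip_feature(v) for v in (-9.0, 0.25, 7.3)]
```

Clamping bounds the magnitude of any single activation, which is one plausible way to keep isolated outliers from appearing as local noise in the output.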
The training process had two stages, with each stage comprising 500,000 steps. The training time of the proposed MRMVSR is about 100 hours, and the inference time is about 0.8 seconds per image on average. Each batch contained four different videos. Ten consecutive frames were selected from each video, so each batch consisted of 40 frames. The LR images were cropped to the same size of 24×24. The first stage was a pre-training stage without a discriminator, where the loss function was $\sum \|g_t - y_t\|_2 + L_{warp}$. The second stage was the formal training stage, which used the GAN with a discriminator. The learning rate in both stages was set to 5e-5. Adam was chosen as the optimization algorithm, with $\beta_1$ = 0.9 and $\beta_2$ = 0.99. The final loss function used was $L_{G,F-MRMVSR}$, where $\lambda_a$ = 0.01 and $\lambda_l$ = 0.02.
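For reference, the hyperparameters reported above can be collected in a small configuration object. The class and field names are hypothetical; the values are those given in the text, and the HR crop size of 96 is derived here as 24 × 4 rather than stated by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    scale: int = 4                  # x4 super-resolution
    steps_per_stage: int = 500_000  # pre-training stage, then GAN stage
    num_stages: int = 2
    videos_per_batch: int = 4
    frames_per_video: int = 10
    lr_crop: int = 24               # LR crop is 24 x 24
    learning_rate: float = 5e-5
    adam_beta1: float = 0.9
    adam_beta2: float = 0.99
    lambda_a: float = 0.01
    lambda_l: float = 0.02

    @property
    def frames_per_batch(self):
        return self.videos_per_batch * self.frames_per_video

    @property
    def hr_crop(self):
        # Derived value: LR crop size times the upscaling factor.
        return self.lr_crop * self.scale

    @property
    def total_steps(self):
        return self.steps_per_stage * self.num_stages


cfg = TrainConfig()
print(cfg.frames_per_batch, cfg.hr_crop, cfg.total_steps)
```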

Experimental results
To demonstrate the performance of the MRMVSR method, we made several experimental comparisons analyzing from three main aspects: the network, loss function, and overall performance of the model. The details are described separately as follows.

Network performance
To verify the performance of the MRMNet, the generator of TecoGAN, TecoGAN-G, was selected as the comparative network. The loss functions $L_{G,F-TecoGAN}$ of TecoGAN and $L_{G,F-MRMVSR}$ of MRMVSR were used to train the generator. Learned perceptual image patch similarity (LPIPS) [35] and temporal learned perceptual image patch similarity (tLP) [11] were selected as the evaluation metrics on the Vid4 test set, for measuring the visual perception quality and temporal coherence, respectively. For both metrics, a smaller value represents better performance. The tLP is calculated using

$$tLP = \left\| LP(y_{t-1}, y_t) - LP(g_{t-1}, g_t) \right\|_1$$

where LP denotes the LPIPS distance, $g_t$ and $y_t$ denote the current generated frame and reference frame, and $g_{t-1}$ and $y_{t-1}$ denote the previous generated frame and reference frame. The tLP employs LPIPS to measure the visual similarity of two consecutive frames in comparison to the reference, which quantifies realistic temporal coherence and video continuity. We chose tLP and LPIPS as the evaluation metrics because PSNR and SSIM cannot evaluate the visual perception quality very well in the super-resolution field. Several GAN-based SR methods (e.g., SRGAN and ESRGAN) have better visual perception quality that is close to the real image, but their quantitative results are not satisfactory according to PSNR and SSIM. We first took $L_{G,F-TecoGAN}$ as the generator loss function to evaluate the network performance. Figs 5 and 6 show the evaluation results according to the LPIPS metric and tLP metric, respectively.
From Fig 5, we can observe that TecoGAN-G and MRMNet have their own advantages on the four different scene datasets when using the same loss function according to the LPIPS metric. The evaluation results indicate that both networks can generate SR video with better visual quality. From Fig 6, we can observe that the performance of MRMNet is better than TecoGAN-G according to the tLP metric, significantly outperforming the other method in three out of the four datasets. This indicates that the SR video derived from MRMNet has an improved visual perception quality whilst also having better continuity. From the above conclusions, the performance of MRMNet can be considered to be better than TecoGAN-G.
In addition, we define $L_{G,F-MRMVSR}$ as the loss function of the generator and conduct comparative experiments to evaluate the network performance. As shown in Fig 7, according to the LPIPS metric, the MRMNet with loss function $L_{G,F-MRMVSR}$ is better than TecoGAN-G with the same loss function in different test sets. This means that MRMNet can produce videos with a higher visual quality than the other method. Fig 8 shows that the performance of MRMNet is also better than that of the TecoGAN-G network according to the tLP metric. It can be found that the SR video generated by MRMNet has better visual perception quality and continuity than the video generated by TecoGAN-G.
According to the above experiments, it is shown that our MRMNet is superior to TecoGAN-G in both visual perception quality and continuity.
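The tLP metric used in this subsection can be sketched as follows, assuming the definition from TecoGAN [11] as we read it: the absolute difference between the perceptual distance of two consecutive reference frames and that of the corresponding generated frames. The function `mean_abs_distance` is a hypothetical stand-in for the LPIPS network, used only so that the example is self-contained; it is not a perceptual metric.

```python
def mean_abs_distance(a, b):
    """Stand-in frame distance (mean absolute difference). NOT the real LPIPS."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    return sum(abs(x - y) for x, y in zip(flat_a, flat_b)) / len(flat_a)


def tlp(prev_gen, gen, prev_ref, ref, distance=mean_abs_distance):
    """|d(y_{t-1}, y_t) - d(g_{t-1}, g_t)| for one pair of consecutive frames."""
    return abs(distance(prev_ref, ref) - distance(prev_gen, gen))


prev_ref, ref = [[0.0, 0.0]], [[1.0, 1.0]]  # the reference changes between frames
# Generated frames that change exactly as the reference does.
tlp_matching = tlp([[0.0, 0.0]], [[1.0, 1.0]], prev_ref, ref)
# Generated frames that stay static while the reference changes.
tlp_static = tlp([[1.0, 1.0]], [[1.0, 1.0]], prev_ref, ref)
```

In this toy case the static output is penalized even though its current frame equals the reference, which illustrates why tLP is read as a measure of temporal behaviour rather than per-frame quality.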

Performance results and analysis of loss function
In order to show the effectiveness of the proposed residual fluctuation loss function, we use MRMNet as the generator network with two different loss functions, $L_{G,F-TecoGAN}$ and $L_{G,F-MRMVSR}$. Vid4 was used again as the evaluation dataset, while LPIPS and tLP were again selected as evaluation metrics. Figs 9 and 10 show the experimental evaluation results, using the LPIPS metric and tLP metric, respectively, to compare the effectiveness of the loss functions.
From Fig 9, we can observe that the result generated with the proposed loss function $L_{G,F-MRMVSR}$, which includes the residual fluctuation term, has a smaller LPIPS value than the result generated with $L_{G,F-TecoGAN}$. This shows that adding a residual fluctuation loss function can effectively improve the visual perception quality of SR video. From the results in Fig 10, the model with our proposed loss function $L_{G,F-MRMVSR}$ also yields a smaller tLP value on the majority of datasets. This indicates that the video generated by the proposed method has better continuity. In summary, adding a residual fluctuation loss function can effectively improve the performance of VSR, with both the visual perception quality and video continuity improved.
The loss curve of the proposed residual fluctuation loss function during training is shown in Fig 11. As training proceeds, the loss value decreases gradually, which also suggests that the training hyperparameters were set properly.

Results and analysis of overall performance of the model
In Table 1, the evaluation results of MRMVSR compared to the Bicubic, dynamic upsampling filter (DUF) [36], FRVSR [29], and TecoGAN [11] methods are given, according to the LPIPS and tLP metrics. As shown in Table 1, the proposed method MRMVSR achieved the best performance in most scenarios according to the tLP and LPIPS metrics, scoring the lowest evaluation results for three of the four data sets. This implies that the proposed MRMVSR method can produce videos with better visual quality and temporal coherence, as well as better continuity between video frames, compared to the other tested methods. However, on the Walk data set, the tLP and LPIPS of MRMVSR are worse than those of TecoGAN, and the tLP is also inferior to that of FRVSR and DUF. One possible reason is that the Walk data set contains a large amount of local motion. We will try to address this problem in future work.

PLOS ONE

The limitations and future work
Although the MRMVSR has achieved good performance in some aspects, there are still some limitations. First, the loss function of the generator is relatively complex, so its terms may conflict with each other, and in some scenarios this may lead to poor performance. One potential solution is to simplify the loss function. Another solution is to set appropriate weights for the different loss function terms. Second, the training time of the model is too long. To solve this problem, we will try to simplify the discriminator to speed up training in future work.

Conclusion
In this paper, we propose an MRMVSR method with a new generative network and a residual fluctuation loss function for VSR tasks, with the proposed generative network called MRMNet. In this network, the HR feature map can continuously extract LR feature map information to supplement the images. The LR feature map is gradually enlarged to obtain the target resolution, hence the utilization rate of the LR feature map information is improved. The proposed residual fluctuation loss function is able to restrict large variations in the quality of the generated images, to avoid the large local differences. By comparing the MRMVSR method with other state-of-the-art models using the LPIPS and tLP metrics over four test data sets, the experimental results showed that the proposed method offers a significant improvement. The proposed method has the ability to generate videos with better visual perception quality and temporal coherence, significantly improving performance in areas with dense textures.