Better Visual Image Super-Resolution with Laplacian Pyramid of Generative Adversarial Networks

Abstract: Although convolutional neural networks have brought great breakthroughs in the accuracy and speed of single-image super-resolution (SR) reconstruction, an important problem remains unresolved: how to restore finer texture details during super-resolution reconstruction. This paper proposes an Enhanced Laplacian Pyramid Super-Resolution Generative Adversarial Network (ELSRGAN), which builds on the Laplacian pyramid to capture the high-frequency details of the image. By combining Laplacian pyramids with generative adversarial networks, super-resolution images can be reconstructed progressively, making the model more flexible in application. To alleviate vanishing gradients, we introduce the Residual-in-Residual Dense Block (RRDB) as the basic network unit: the dense connections increase network capacity, allowing the model to capture more visual features and achieve better reconstruction, while removing the BN layers increases calculation speed and reduces computational complexity. In addition, a content loss driven by perceptual similarity is used instead of one driven by spatial similarity, enhancing the visual quality of the super-resolution image and making it more consistent with human visual perception. Extensive qualitative and quantitative evaluation on benchmark datasets shows that the proposed algorithm achieves a higher mean-sort-score (MSS) than state-of-the-art methods and better visual perception.


Introduction
Image super-resolution refers to reconstructing one or more low-resolution (LR) images into high-resolution (HR) images [Harris (1964); Goodman and Gustafson (1968)]. Super-resolution is an important research problem in computer vision and image processing, but many shortcomings remain in reconstructing high-resolution images from low-resolution ones. First, since existing methods are optimized by algorithms that reduce the mean square error (MSE), the resulting picture is too smooth [Wang, Simoncelli and Bovik (2003); Wang, Bovik, Sheikh et al. (2004)]. Gupta et al. proposed a regression method to learn the mapping relationship between LR and HR images, learning the regression through the structural characteristics of the DCT domain to obtain high-resolution images. Chang et al. [Chang, Yeung and Xiong (2004)] proposed an image SR method based on locally linear embedding: by finding the K nearest neighbors of the input low-resolution image blocks in the training library, the high-resolution image block is finally reconstructed as a linear combination of the nearest-neighbor feature blocks. Singh et al. [Singh and Ahuja (2014)] decomposed the mapping patch into directional frequency subbands and independently matched each subband pyramid to obtain better results. In order to expand the size of the training set, Huang et al. [Huang, Singh and Ahuja (2015)] applied small geometric and shape transformations to the images. Kim et al. [Kim and Kwon (2010)] used kernel ridge regression to estimate the high-frequency information of high-resolution images; to reduce the practical complexity of the method, a sparse kernel ridge regression model was proposed based on matching pursuit and gradient descent, which is more general than its predecessors. Yue et al. [Yue, Sun, Yang et al. (2013)] proposed to retrieve similar HR images from the web and introduced perceptual matching criteria for aligning structures.
However, these methods are limited by the training set size: the mappings between LR and HR images may not cover the texture variation in the image, and the HR counterparts must be known. Dong et al. [Dong and Loy (2015)] first used deep learning [Hinton and Salakhutdinov (2006)] to solve the super-resolution problem. Wang et al. [Wang, He, Sun et al. (2019)] used convolutional neural networks to remove image noise. Shocher et al. [Shocher, Cohen and Irani (2018)] proposed an unsupervised CNN-based SR method that exploits the recurrence of internal image information. To handle LR degradations other than bicubic downsampling from HR, Zhang et al. [Zhang, Zuo and Zhang (2018)] proposed a general framework with a dimensionality stretching strategy. A discriminator network attached to the feature domain is used by Park et al. [Park, Son, Cho et al. (2018)] to generate high-frequency features. Kim et al. [Kim, Lee and Lee (2016)] proposed a network with multiple recursive layers, a high-performance architecture that captures long-distance pixel dependencies while maintaining a small number of parameters. These super-resolution methods all learn the mapping function through the MSE loss; although this effectively yields high PSNR and SSIM, it causes the output image to be too smooth. To prevent smooth output, methods that extract a high-level perceptual loss from a trained network were proposed by Bruna et al. [Bruna, Sprechmann and Lecun (2015)] and Johnson et al. [Johnson, Alahi and Li (2016)]. Huang et al. [Huang, He, Sun et al. (2019)] proposed a wavelet-domain generation method that enlarges low-resolution faces to higher magnification factors. To create more realistic texture details, Sajjadi et al. [Sajjadi, Scholkopf and Hirsch (2017)] proposed a texture-matching loss function. Lai et al. [Lai, Huang, Ahuja et al.
(2018)] proposed the Charbonnier loss function to penalize the difference between the SR image and the HR image. Ledig et al. [Ledig, Theis and Huszar (2016)] utilized a generative adversarial network to perform super-resolution reconstruction by optimizing the adversarial loss and content loss. We adopt this form of network architecture and modify its basic network elements. Unlike Ledig et al., whose SRGAN focuses on 4× magnification, our method targets higher magnifications such as 8× and 16×. Moreover, Ledig et al. focus on learning the mapping between low-frequency and high-frequency images, whereas our architecture focuses on capturing the exact high-frequency details.

Laplacian pyramid
Burt et al. [Burt and Adelson (1983)] proposed the concept of the Laplacian pyramid, which interprets an image at multiple resolutions and restores it to the greatest extent through residual prediction. Denton et al. [Denton, Chintala and Fergus (2015)] proposed LAPGAN (Laplacian Pyramid GAN) to produce realistic images. Recently, Lai et al. [Lai, Huang, Ahuja et al. (2018)] proposed LapSRN for image super-resolution. Our method differs from LapSRN as follows: 1) LapSRN combined a deep convolutional network with the Laplacian pyramid, while we combine a GAN with the Laplacian pyramid; the GAN enhances super-resolution image quality, gives better results, and recovers more high-frequency details. 2) LapSRN implemented feature sharing to ensure higher inference speed, but does not focus on recovering high-frequency texture details. This article focuses on recovering the high-frequency contour information of the image to generate a picture more consistent with human perception: we train step by step on LR images to capture the missing high-frequency features at different magnifications. 3) LapSRN utilized the Charbonnier loss function to better capture the difference between the SR image and the HR image; since we focus on restoring images that match human perception and on capturing finer high-frequency details, we use the VGG loss function.

Generative adversarial network
In 2014, Goodfellow et al. [Goodfellow, Pouget-Abadie and Mirza (2014)] proposed the Generative Adversarial Network, which uses a generator network G to generate samples as realistic as possible and a discriminator network D to determine whether an image is generated or a real HR image; the two complete training through iterative adversarial competition. Ledig et al. [Ledig, Theis and Huszar (2016)] proposed to address the min-max problem, in which the discriminator too easily distinguishes generated images, by optimizing the loss function. A method for super-resolution using wavelets was proposed by Huang et al. [Huang, Li, He et al. (2018)].

Laplacian pyramid generative adversarial network for super-resolution

Network architecture
Our network works as shown in Fig. 1. The LR image is fed to the GAN network, which generates a residual image; the upsampled LR image and the residual image are added pixel by pixel to recover the desired SR image. For 4× and 8× outputs, we repeatedly apply the ELSRGAN network to the resulting SR image until the required magnification is met. For generator training, we combine the Laplacian pyramid with the GAN. This differs from an ordinary GAN network, which directly maps the LR input to the corresponding HR image: in order to capture more high-frequency details, we instead generate from the LR image the residual images (DR) of the different Laplacian pyramid levels, and finally add each residual image pixel by pixel to the upsampled image. The ELSRGAN training schematic is shown in Fig. 2.

Figure 2: ELSRGAN training diagram. The difference between the HR image and the upsampled LR image is captured for training. The generator network generates a predicted residual image, and the discriminator network determines whether the image is generated or a true HR image
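The per-level update described above (upsample the current estimate, then add the predicted residual pixel by pixel) can be sketched as follows. This is a minimal sketch: nearest-neighbour upsampling and precomputed residual arrays stand in for the network's learned components.

```python
import numpy as np

def upsample2x(img):
    # Nearest-neighbour 2x upsampling, a stand-in for the upsampling
    # used in the real pipeline (simplifying assumption).
    return np.kron(img, np.ones((2, 2)))

def reconstruct(lr, residuals):
    """Laplacian-pyramid-style progressive reconstruction: at each level,
    upsample the current estimate and add the predicted residual
    (high-frequency) image pixel by pixel."""
    x = lr
    for res in residuals:  # one residual image per 2x pyramid level
        x = upsample2x(x) + res
    return x

# Two levels of residuals turn a 4x4 LR image into a 16x16 SR image.
lr = np.zeros((4, 4))
residuals = [np.ones((8, 8)), np.full((16, 16), 0.5)]
sr = reconstruct(lr, residuals)  # shape (16, 16)
```

For an 8× output, the same call would simply be repeated with a third residual level, mirroring how ELSRGAN is reapplied until the target magnification is reached.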

Generative adversarial network
The generator network G and the discriminator network D form the generative adversarial network model. The generator network is responsible for generating as realistic a picture as possible, while the discriminator network judges whether an image is a real HR image or one generated by the generator. The two networks are trained iteratively against each other until the generator produces results highly similar to real images. However, the min-max problem easily occurs in GAN training: the discriminator identifies generated results too easily and causes training to collapse. To solve this problem, we use the loss function of Eq. (1).
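Eq. (1) itself is not reproduced in this excerpt. As a sketch of the standard GAN min-max objective it presumably builds on, together with the non-saturating generator variant commonly used to keep gradients alive when the discriminator wins early:

```python
import numpy as np

def d_loss(d_real, d_fake):
    # Discriminator maximizes log D(x) + log(1 - D(G(z)));
    # we minimize the negative of that quantity.
    return -(np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake)))

def g_loss(d_fake):
    # Non-saturating generator loss: minimize -log D(G(z)) rather than
    # log(1 - D(G(z))), so the generator still gets a useful gradient
    # even when the discriminator classifies its outputs confidently.
    return -np.mean(np.log(d_fake))
```

Note how `g_loss` grows without bound as `D(G(z))` approaches 0, which is exactly the regime where the saturating form would stall.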
For the generator network architecture, SRGAN implemented the residual block structure shown in Fig. 3. We introduce RRDB as the new generator network unit, as shown in Fig. 4. We removed the BN layers used in SRGAN to ensure training stability and reduce computational complexity, and we combine the multi-layer residual network with dense connections. It has been confirmed that removing BN layers in super-resolution improves speed and reduces computational complexity and GPU memory consumption. We thus propose a deeper, more expressive basic network unit whose capacity benefits from the dense connections [Luo, Qin, Xiang et al. (2019)].
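The residual-in-residual and dense-connection wiring can be sketched as follows. This is a numpy toy with 1-D linear maps standing in for convolutions; the channel sizes, layer counts, and the residual scaling factor `BETA` are assumptions in the style of ESRGAN-type RRDB blocks, not values taken from this paper.

```python
import numpy as np

rng = np.random.default_rng(0)
C, G = 8, 4    # base channels and growth channels (hypothetical sizes)
BETA = 0.2     # residual scaling factor (assumed, ESRGAN-style)

def lrelu(x):
    return np.where(x > 0, x, 0.2 * x)

def make_weights(n_layers=4):
    # One linear map per dense layer, plus a final fusion back to C channels.
    ws = [rng.normal(0, 0.1, (C + i * G, G)) for i in range(n_layers)]
    ws.append(rng.normal(0, 0.1, (C + n_layers * G, C)))
    return ws

def dense_block(x, weights):
    # Each layer sees the concatenation of the block input and every
    # earlier layer's output (dense connections); no BN anywhere.
    feats = [x]
    for W in weights[:-1]:
        feats.append(lrelu(np.concatenate(feats) @ W))
    out = np.concatenate(feats) @ weights[-1]  # fuse back to C channels
    return x + BETA * out                      # local residual

def rrdb(x, blocks):
    # Residual-in-Residual: chain dense blocks, then add the scaled
    # block input back in (the outer, "global" residual).
    y = x
    for ws in blocks:
        y = dense_block(y, ws)
    return x + BETA * y
```

The outer residual plus the per-block residuals are what let gradients bypass the deep dense stack, which is the vanishing-gradient motivation given above.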
To avoid max-pooling, LeakyReLU is used as the activation function of the discriminator network, and the discriminator effectively mitigates the min-max problem through Eq. (1). Fig. 5 shows our discriminator network model. Whenever the number of feature maps doubles, we use a strided convolutional layer to reduce the resolution. A Sigmoid activation produces the final classification probability.

Figure 5: Discriminator network. It consists of eight 3×3 kernel convolution layers. The number above each convolution layer is the stride; the number below indicates the number of feature maps
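As a sketch of the layer schedule such a discriminator follows, assuming the SRGAN-style pattern that Fig. 5 appears to describe (the exact channel counts and strides here are assumptions, not read from the figure):

```python
import numpy as np

# Eight 3x3 conv layers: channels double from 64 to 512, and a stride-2
# layer halves the spatial resolution at each doubling (assumed schedule).
LAYERS = [(64, 1), (64, 2), (128, 1), (128, 2),
          (256, 1), (256, 2), (512, 1), (512, 2)]  # (channels, stride)

def output_resolution(h, w):
    # Track how the four stride-2 layers shrink the feature maps
    # (assuming 'same' padding for the 3x3 kernels).
    for _, stride in LAYERS:
        h, w = int(np.ceil(h / stride)), int(np.ceil(w / stride))
    return h, w

def leaky_relu(x, alpha=0.2):
    # Used after each conv: keeps a small slope for negative inputs.
    return np.where(x >= 0, x, alpha * x)

def sigmoid(x):
    # Final activation mapping the logit to the real/generated probability.
    return 1.0 / (1.0 + np.exp(-x))
```

Under this schedule a 96×96 input reaches the dense head at 6×6 spatial resolution, since each of the four stride-2 layers halves both dimensions.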

Perceptual loss function
The perceptual loss used in this paper follows that of Ledig et al. [Ledig, Theis and Huszar (2016)]. The common content loss relies on an MSE-optimized loss function, which yields higher PSNR but produces overly smooth textures that are unsatisfactory in visual perception. We instead utilize a VGG loss that is closer to perceptual similarity, defined as the Euclidean distance between the feature representations of the reconstructed image and the reference image. The adversarial loss in this paper encourages the generator network to fool the discriminator network so as to produce a variety of natural images. The generator loss is defined as follows:
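Since the loss equations themselves are not reproduced in this excerpt, here is a minimal sketch of an SRGAN-style objective of the shape described above. The VGG feature maps are assumed to be precomputed by a fixed, pre-trained layer, and the `1e-3` adversarial weighting is an assumption, not a value from this paper.

```python
import numpy as np

def vgg_loss(feat_sr, feat_hr):
    """Perceptual (VGG) content loss: mean squared Euclidean distance
    between feature maps of the SR and HR images, computed on the output
    of a fixed VGG layer (features assumed precomputed)."""
    return np.mean((feat_sr - feat_hr) ** 2)

def adversarial_gen_loss(d_sr):
    # Adversarial term: pushes the generator toward outputs that the
    # discriminator scores as real (non-saturating -log D form).
    return -np.mean(np.log(d_sr))

def perceptual_loss(feat_sr, feat_hr, d_sr, adv_weight=1e-3):
    # Total generator objective: content loss plus a small adversarial
    # term, in the style of SRGAN (weighting is an assumption).
    return vgg_loss(feat_sr, feat_hr) + adv_weight * adversarial_gen_loss(d_sr)
```

Measuring distance in VGG feature space rather than pixel space is what lets the objective tolerate pixel-level deviations while penalizing perceptually wrong textures.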

Training data and test data
The training data in this paper comes from the common benchmark dataset DIV2K; the upsampled LR image is subtracted pixel by pixel from the HR image to obtain the DR image for training. Testing uses the widely used benchmark datasets Set5 [Bevilacqua, Roumy, Guillemot et al. (2012)], Set14 [Zeyde, Elad and Protter (2012)], BSD100 [Yang, Wright, Huang et al. (2008)] and Urban100 [Yang, Wright, Huang et al. (2010)]; the first three consist of natural scenes, while Urban100 contains challenging urban scene images.

ELSRGAN model experimental analysis
We conducted an experimental analysis of the ELSRGAN model structure and compared the two residual block designs, RRDB and RB. We find that using RRDB captures more image details.

Figure 6: Comparison of results of different network units. We denote the ELSRGAN model using RRDB as ELSRGAN-RRDB and the one using RB as ELSRGAN-RB. Using RRDB obtains higher reconstruction quality; the figure shows results at 4× super-resolution

We also compared different numbers of RRDB units and finally decided to use 24 RRDB units in our generator network, which guarantees that the network is not too deep while not sacrificing reconstruction quality.

Perceptual mean-sort-score
We propose a mean-sort-score (MSS) to reflect the visual quality of an image. Ledig et al. proposed the MOS score, in which 26 evaluators rate images from 1 to 5; we found that, under such assessment, images with perceptible visual differences can still receive the same score, and evaluators struggle with adjacent scores (such as the choice between 2 and 3). To better distinguish the perceptual quality of different methods, we propose MSS. The images of all methods and the HR image are provided to the judges simultaneously, and each scorer ranks all the images by perceived quality; the lowest-ranked image receives 1 point and the highest N points (N is the number of images in a single ranking). For ease of comparison, the score is normalized as in Eq. (5): where S is the total score of the picture and n is the number of test sets. Using 50 evaluators to rank a total of 219 images, we found that the HR scores were closer to full marks than under the MOS method (at 4× and 8×, all graders ranked HR first). The bicubic method received the lowest MSS, showing that our scoring method better evaluates perceptual quality. The comparison results are shown in Section 4.4.
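Eq. (5) is not reproduced in this excerpt. One plausible reading of the normalization described in the text is sketched below; the exact formula, in particular dividing by the maximum attainable rank, is an assumption made here for illustration.

```python
import numpy as np

def mss(rank_scores, n_max):
    """Hypothetical reading of the mean-sort-score: each evaluator ranks
    the N images of a scene from 1 (worst) to N (best); a method's score
    is its total S over all ranked test images, normalized by the maximum
    attainable score so the result lies in (0, 1]. The paper's Eq. (5)
    may differ in detail."""
    S = np.sum(rank_scores)   # total score across graders/scenes
    n = len(rank_scores)      # number of ranked test images
    return S / (n * n_max)
```

Under this reading, a method that every grader ranks first scores exactly 1.0, matching the observation that HR images come close to full marks.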

Comparison with state-of-the-arts
We compare ELSRGAN with the most advanced methods: Bicubic, ScSR, Kim, SRCNN, SelfExSR and LapSRN; SRGAN is used only in the 4× comparison. As shown in Figs. 8-10, we examine the detailed features of the pictures after super-resolution at different magnifications. The comparison shows that our proposed ELSRGAN reconstructs more high-frequency details and more accurate texture lines.

Figure: Visual comparisons at 4× magnification. At 4× magnification, ELSRGAN reconstructs more high-frequency texture features and the angles of the statue are more obvious

Figure 10: Visual comparison at 8× magnification. At 8× magnification, our method hallucinates plausible detail to make the image look more realistic

We detail the comparison of PSNR, structural similarity index (SSIM) and MSS. Although the proposed ELSRGAN method yields slightly lower PSNR and SSIM, its MSS is higher than that of the other methods and is more in line with human visual perception.
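For reference, the PSNR metric used in this comparison is computed from the MSE between the HR and SR images:

```python
import numpy as np

def psnr(hr, sr, peak=255.0):
    # Peak signal-to-noise ratio in dB. Higher is numerically better,
    # though, as argued above, MSE-driven metrics reward over-smooth
    # results that perceptual scoring (MSS) penalizes.
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```

Because PSNR is a monotone function of MSE, a blurry average of plausible textures can outscore a sharp but slightly shifted texture, which is exactly the gap MSS is designed to expose.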

Discussion
Comparing MSS scores shows that our proposed ELSRGAN method is more in line with human visual perception and restores more texture features in high-frequency detail. Although we have improved the scoring criteria for perceptual image quality, we still rely on human scoring. In future work, we will further study perceptual scoring criteria and propose corresponding formulas to evaluate visual quality and high-frequency texture features. At the same time, although our method can restore fine structure, the finer details hallucinated from self-similarity may not be suitable for medical or surveillance applications.

Conclusion
We use generative adversarial networks combined with Laplacian pyramids to achieve single-image super-resolution reconstruction. We use RRDB as the basic network unit to capture more content detail, together with a perceptual loss function to obtain finer texture features. Extensive evaluation on benchmark datasets with the MSS score confirms that the proposed ELSRGAN method is more visually authentic.
Funding Statement: This work was supported in part by the National Science Foundation of China under Grant 61572526.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.