Image Super-Resolution Based on Generative Adversarial Networks: A Brief Review

Abstract: Single image super-resolution (SISR) is an important research topic in computer vision and image processing. With the rapid development of deep neural networks, a variety of image super-resolution models have emerged. Unlike some traditional super-resolution methods, deep learning-based methods can complete the super-resolution task from a single image. In addition, compared with SISR methods that use conventional convolutional neural networks, SISR based on generative adversarial networks (GAN) achieves state-of-the-art visual quality. In this review, we first explore the challenges faced by SISR and introduce some common datasets and evaluation metrics. Then, we review the improved network structures and loss functions of GAN-based perceptual SISR. Subsequently, the advantages and disadvantages of different networks are analyzed through multiple comparative experiments. Finally, we summarize the paper and discuss future development trends of GAN-based perceptual SISR.


Introduction
Image super-resolution reconstruction is a research hotspot in computer vision and plays an important role in remote sensing [Haut, Fernandez-Beltran, Paoletti et al. (2018)], medical imaging [Nie, Trullo, Lian et al. (2017); Guo, Cui, Yang et al. (2019); Mahapatra, Bozorgtabar and Garnavi (2019)], video surveillance and so on. Image super-resolution refers to recovering a high-resolution image from a single low-resolution image or from a sequence of low-resolution images. Traditional super-resolution methods often require multiple low-resolution images to restore the high-frequency details of the high-resolution image. However, multiple images are sometimes difficult to obtain in real-world scenarios. Super-resolution of a single image is an ill-posed inverse problem: since a single low-resolution image has lost much of its high-frequency information, many different reconstructions are consistent with it. In recent years, deep learning has achieved remarkable results in image processing [He, Zhang, Ren et al. (2016); Huang, Liu, Maaten et al. (2017)] due to its powerful learning ability. Therefore, most studies now use deep learning for single image super-resolution and have achieved good results. Moreover, an adversarial learning framework, the GAN, has been proposed in recent years. GAN-based methods can achieve lifelike results in image synthesis [Karras, Laine and Aila (2019)]. Most GAN-based single image super-resolution methods obtain lower peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) scores than traditional methods, but higher mean opinion scores (MOS). In this paper, our main goal is to survey recent GAN-based perceptual SISR. We focus on the challenges, structural approaches, and results of GAN-based SISR.
These represent the problems to be solved and the possible directions for applying GANs to image super-resolution. The rest of the paper is organized as follows. In Section 2, we explore the challenges of single image super-resolution. In Section 3, the datasets and evaluation metrics for single image super-resolution experiments are presented. In Section 4, we review recent studies of single image perceptual super-resolution using GAN methods. In Section 5, the results of GAN-based single image super-resolution reconstruction are summarized and analyzed. In Section 6, we summarize the paper and look to the future.

Standard GAN model and SISR challenges
GAN-based single image super-resolution trains neural networks as a zero-sum game. The generator tries to learn the distribution of the real data, and the discriminator tries to judge whether its input is real or produced by the generator. Both are optimized alternately until, ideally, they reach a Nash equilibrium. The objective of the standard GAN is:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{1}$$
where x represents a real sample, p_data(x) represents the real sample distribution, z represents random noise, p_z(z) represents a prior distribution (generally assumed to be Gaussian), and E(·) denotes the expected value.
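To make the zero-sum objective concrete, the following minimal sketch estimates the standard GAN value function, E[log D(x)] + E[log(1 - D(G(z)))], from a handful of discriminator outputs. The function name `gan_value` and the sample values are illustrative assumptions, not part of any reviewed method.

```python
import math

def gan_value(d_real, d_fake):
    """Monte Carlo estimate of V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))].

    d_real: discriminator outputs D(x) on real samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on generated samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere,
# which gives V = log(0.5) + log(0.5) = -log(4).
v_equilibrium = gan_value([0.5, 0.5], [0.5, 0.5])

# A confident, correct discriminator pushes V towards 0 from below.
v_strong_d = gan_value([0.99, 0.98], [0.01, 0.02])
```

The discriminator is trained to increase this value and the generator to decrease it, which is the source of both the realism of the outputs and the training instability discussed next.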

GAN model challenge
First, a GAN-based model trains neural networks adversarially, completing the single image super-resolution task by alternately optimizing the generator and the discriminator. Second, the adversarial formulation avoids the need to hand-design complex loss functions. However, the same adversarial formulation causes problems such as unstable training, vanishing gradients, and mode collapse. For this open problem, most GAN research focuses on improving the architecture and the loss function [Pan, Yu, Yi et al. (2019)]. A natural idea is therefore to apply different GAN models to SISR. However, different GAN models trade off convergence time, memory requirements and image quality differently.

Evaluation metric challenge
In image super-resolution experiments, most models use the mean square error (MSE) to minimize the distance between the reconstructed high-resolution image and the ground truth (GT) image. This objective directly favors the traditional super-resolution (SR) evaluation metrics, peak signal-to-noise ratio (PSNR) and structural similarity (SSIM). However, optimizing with MSE makes the reconstructed high-resolution images overly smooth. Moreover, images reconstructed by traditional SISR methods obtain lower mean opinion scores (MOS) than those reconstructed by GAN-based SISR. MOS, however, is a subjective measure and cannot quantify the perceptual quality of an image automatically. Therefore, it is a better choice to evaluate perceptual quality with a metric that combines perception and distortion [Blau, Mechrez, Timofte et al. (2018)].

Real-world dataset challenge
In most existing research, the low-resolution images are obtained by down-sampling high-resolution images with bicubic interpolation. When the down-sampling is unknown, the scores of single image super-resolution evaluation metrics decrease [Timofte, Agustsson, Van Gool et al. (2017)]. In addition, real-world low-resolution images tend to contain more complex motion. Building a large-scale training set of real-world super-resolution images may solve this problem to a certain extent. Furthermore, there are also methods that use a GAN architecture to learn the image degradation and then adapt to real datasets [Bulat and Yang (2018)].

Up-sampling multiple challenge
In super-resolution experiments, the ×4 up-sampling factor is currently the most studied setting. Single image super-resolution with a higher up-sampling factor is very challenging, because more high-frequency information must be recovered. Recovering the detail of the high-resolution image is therefore crucial in such experiments. One option for higher up-sampling factors is a multi-scale reconstruction method that integrates the reconstruction information of different scales [Wang, Perazzi, McWilliams et al. (2018)].

Public datasets
The following sections introduce the common training and evaluation datasets for GAN-based single image super-resolution.

Training dataset
DIV2K dataset. The dataset consists of high-quality images collected from the Internet and contains 1,000 RGB images with 2K resolution. The DIV2K dataset covers a variety of content, including people, natural environments, flora and fauna, and more. The dataset is divided into a training set, a validation set, and a test set of 800, 100, and 100 images, respectively. All three sets provide three down-sampling factors: ×2, ×3, and ×4.

Evaluation datasets
Set5 dataset. The Set5 dataset contains 5 images, and the images are generally small. This dataset contains high-low resolution image pairs with down-sampling factors ×2, ×3, and ×4.
Set14 dataset. The Set14 dataset contains 14 images. The high-low resolution image pairs in this dataset cover three down-sampling factors: ×2, ×3, and ×4.
BSD100 dataset. The BSD100 dataset contains 100 images of real-life scenes. The high-low resolution image pairs in this dataset cover three down-sampling factors: ×2, ×3, and ×4.
Urban100 dataset. The Urban100 dataset contains 100 high-resolution images of urban scenes. The high-low resolution image pairs in this dataset cover two down-sampling factors: ×2 and ×4.
PIRM dataset. The PIRM dataset contains a validation set and a test set, each with 100 pairs of high-low resolution images. The only down-sampling factor is ×4. This dataset is currently used primarily to measure the perception-distortion trade-off of images.

Evaluation metrics
Existing image super-resolution evaluation metrics can be divided into two categories: those based on reconstruction accuracy (PSNR, SSIM) and those based on visual perception (PI, used together with the root mean square error).
Peak signal-to-noise ratio (PSNR). PSNR represents the ratio of the maximum power of the signal to the power of the noise. Given a real image x and a reconstructed image y, each with N pixels, MSE and PSNR can be defined as:
$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2 \tag{2}$$
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{L^2}{\mathrm{MSE}}\right) \tag{3}$$
Here, L is generally 255, which is the largest pixel value of an 8-bit image.
Structural similarity (SSIM). SSIM is used to measure the similarity of two images, and its value lies between 0 and 1. A larger SSIM value indicates a smaller degree of image distortion, which means that the image quality is better. Given a real image x and a reconstructed image y, SSIM can be defined as:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} \tag{4}$$
Here, μ_x and μ_y, σ_x^2 and σ_y^2 represent the mean and variance of the image x and the image y, respectively, and σ_xy is the covariance of x and y. The constants are c_1 = (k_1 L)^2 and c_2 = (k_2 L)^2, where k_1 and k_2 are small constants (e.g., 0.01) and L represents the largest possible pixel value.
Perceptual index (PI) and root mean square error (RMSE). The perception-distortion plane has the PI on the vertical axis and the RMSE on the horizontal axis. By thresholding the RMSE, the plane is divided into three regions (regions 1/2/3 are defined by RMSE ≤ 11.5/12.5/16, respectively), as shown in Fig. 2. In Fig. 2, no model can reach the area below the curve, which shows that perception and distortion can only be improved by trading one off against the other. PI combines the no-reference image quality measure of Ma et al. [Ma, Yang, Yang et al. (2017)] and NIQE [Mittal, Soundararajan and Bovik (2012)], and can be expressed as:
$$\mathrm{PI} = \frac{1}{2}\left((10 - \mathrm{Ma}) + \mathrm{NIQE}\right) \tag{5}$$
The RMSE is calculated as the square root of the mean squared error (MSE) of all pixels in all images and can be expressed as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{M} \sum_{i=1}^{M} \frac{1}{N_i} \left\| I_i^{HR} - I_i^{SR} \right\|_2^2} \tag{6}$$
where M is the number of images and N_i is the number of pixels in the i-th image.
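The metrics above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions rather than a reference implementation: images are flat lists of pixel values, `global_ssim` uses whole-image statistics instead of the sliding local windows used in practice, k2 = 0.03 is the commonly used default (the text only gives 0.01 as an example), and `perceptual_index` expects Ma and NIQE scores that have already been computed elsewhere.

```python
import math

def mse(x, y):
    """Mean squared error between two equally sized flat pixel lists."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)

def psnr(x, y, peak=255.0):
    """PSNR in dB; `peak` is L, the largest possible pixel value."""
    err = mse(x, y)
    return math.inf if err == 0 else 10.0 * math.log10(peak ** 2 / err)

def global_ssim(x, y, peak=255.0, k1=0.01, k2=0.03):
    """SSIM from whole-image statistics (simplification: no sliding window)."""
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (k1 * peak) ** 2, (k2 * peak) ** 2
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

def perceptual_index(ma_score, niqe_score):
    """PI = 0.5 * ((10 - Ma) + NIQE); both scores are supplied by the caller."""
    return 0.5 * ((10.0 - ma_score) + niqe_score)

# Toy 4-pixel example (values are made up for illustration).
reference = [52.0, 55.0, 61.0, 59.0]
reconstructed = [54.0, 55.0, 60.0, 57.0]
rmse = math.sqrt(mse(reference, reconstructed))  # 1.5 for this toy pair
```

Note that a lower PI and a lower RMSE are both better, whereas a higher PSNR and a higher SSIM are better.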

SISR reconstruction based on GAN model
Traditional SISR methods train neural networks with an L1/L2 cost function, which gives the reconstruction results higher PSNR and SSIM. However, the reconstructed high-resolution images lack rich details. At present, the generative adversarial network (GAN) and its variants have achieved remarkable results on image tasks. GAN-based methods use a discriminator to judge the authenticity of the reconstructed high-resolution images, which makes the generated images closer to real images as a whole. Besides, the reconstructed images have more details and are more consistent with human visual perception. GAN-based image super-resolution network structures are improvements on the standard GAN (SGAN). The basic generator network and discriminator network are shown in Figs. 3 and 4, respectively. In Fig. 3, the gray box represents a feature extraction module (FEM), which includes multiple feature extraction blocks, a convolution layer (Conv) and an optional batch normalization layer (BN). Most GAN-based image super-resolution methods improve the discriminator network, the generator network, or the loss function. There are also other improvement directions. For example, Shama et al. [Shama, Mechrez, Shoshan et al. (2019)] added an adversarial feedback loop (AFL) to the SGAN. In other words, they added a feedback module between the discriminator and the generator, which enables the model to use the discriminator information during the testing stage. Besides, in order to maintain a good balance between RMSE and PI, Luo et al. [Luo, Chen, Xie et al. (2018)] used two complementary GAN networks, named Bi-GANs-ST. One memory residual GAN (MR-GAN) is used to reduce the RMSE score, and another weight-aware GAN (WP-GAN) is used to reduce the PI score. Zhong et al. [Zhong and Zhou (2020)] used a latent space regularization (LSR) generator network in the proposed LSRGAN. They added a companion encoder that applies regularization conditions to the GAN so that it generates a more ideal image manifold. Next, we introduce the improvements in discriminator networks, generator networks, and loss functions.

Discriminator network
Currently, part of the work improves the discriminator network [Park, Son, Cho et al. (2018); Wang, Perazzi, McWilliams et al. (2018)]. Lee et al. studied the discriminator of SRGAN [Ledig, Theis, Huszár et al. (2017)] and found that the strided convolution layers and max-pooling layers of the VGG network cause loss of detail and visual artifacts. Therefore, they proposed a resolution-preserving SRGAN (RPSRGAN). First, in order to keep the details in the discriminator network, they set the stride of the convolutional layers to 1. Then, the max-pooling layers are removed after training the VGG network. To reduce the small amount of high-frequency noise generated by GAN-based SISR, Park et al. [Park, Son, Cho et al. (2018)] used an additional discriminator in the feature domain. A conventional image discriminator operates on the pixel-level image, while a feature discriminator tries to distinguish the authenticity of the reconstructed image based on the extracted feature map. Similarly, Zhu et al. [Zhu, Chen, Peng et al. (2020)] used multiple discriminators in the proposed GAN-IMC. They used three discriminators to distinguish three aspects of the input image: image, morphology, and color. The experimental results of these improved discriminators show that the perceptual quality of reconstructed images can be improved to a certain extent.

Generator network
At present, most research on improving the GAN generator network enhances the FEM of the single generator network in Fig. 3. In contrast, a multi-scale generator network extends a single generator network to multiple scales for reconstruction. Next, we introduce the improvement methods of these two parts.

FEM improvements
Residual block (RB) with/without batch normalization. Yu et al. [Yu and Porikli (2016)] were among the first to use a generative adversarial network for image super-resolution reconstruction. They used multiple convolutional layers to build a generator network and output face images with a ×8 up-sampling factor. It is well known that deeper networks can improve the performance of a model, but as the number of layers increases, gradients tend to vanish or explode. This problem can be addressed with normalized initialization and intermediate normalization layers. However, when the network reaches a certain depth, model degradation appears. The deep residual network solved this problem, so early improved generator networks mostly use residual networks. Ledig et al. [Ledig, Theis, Huszár et al. (2017)] proposed SRGAN, whose generator network is shown in Fig. 3. They used 16 residual blocks with BN layers as the feature extraction block (FEB), as shown in Fig. 5(a). Similarly, Lee et al. also use the same generator network. Recently, some experimental results have shown that removing BN layers can improve the performance of generator networks and reduce their computational complexity. Therefore, subsequent work removed all BN layers from the FEM. The EUSR-PCL proposed by Cheon et al. [Cheon, Kim, Choi et al. (2018)] used the FEB shown in Fig. 5(b) and removed the PReLU in the dashed box in Fig. 3. Furthermore, Choi et al. [Choi, Kim, Cheon et al. (2019)] used a similar architecture in the proposed 4PP-EUSR.
Residual block with scaling (RB-Scaling). Model performance can be improved by increasing the number of feature maps. However, too many feature maps make the network training process unstable. Therefore, some models add a residual scaling factor (e.g., 0.1) after the last convolution layer of the residual block to stabilize training. PESR [Vu, Luu, Yoo et al. (2018)] and EPSR [Vasu, Thekke and Rajagopalan (2018)] both used scaled residual blocks, as shown in Fig. 5(c).
Residual dense block (RDB) with scaling. Recent research results show that densely connected neural network layers can alleviate the vanishing gradient problem. In addition, densely connected networks can better reuse features. However, densely connected networks cannot be designed as deeply as residual networks. Therefore, part of the work combined the residual network and the densely connected network into a residual dense block (RDB). Wang et al. [Wang, Yu, Wu et al. (2018)] used residual dense blocks with scaling factors to improve the FEB in the proposed ESRGAN, as shown in

Multi-scale generator network
The multi-scale generator network is shown in Fig. 6. The network can complete image super-resolution tasks at three different scales. In Fig. 6, the FEB and FEM refer to Figs. 5(b) and 3, respectively.
Improved up-sampling. The sub-pixel convolution layer was first proposed by Shi et al. [Shi, Caballero, Huszár et al. (2016)]. They fed a single LR image of size H×W×C into convolutional layers to generate r^2 C feature maps, each of size H×W. Here, H, W, and C are the height, width, and number of channels of the LR image, respectively, and r represents the up-sampling factor. The sub-pixel convolution operator rearranges the generated feature maps into a single super-resolution image of size rH×rW×C, as shown in Fig. 7(a). Instead of generating the r^2 C feature maps directly through convolutional layers, Cheon et al. and Choi et al. [Cheon, Kim, Choi et al. (2018); Choi, Kim, Cheon et al. (2019)] both use enhanced up-sampling modules [Kim and Lee (2018)]. Specifically, they replaced the Conv in the original up-sampling with 4 FEMs, as shown in Fig. 7(b). It should be noted that each FEM generates the same number of feature maps as its input.
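The rearrangement step of sub-pixel convolution can be illustrated with a short sketch. It assumes the feature maps are stored as a nested list of shape H×W×(r·r·C) and uses one particular channel layout (channel c·r·r + i·r + j feeds output channel c at sub-pixel offset (i, j)); other frameworks may order the channels differently. The function name `pixel_shuffle` is chosen here for illustration.

```python
def pixel_shuffle(features, r):
    """Rearrange an H x W x (r*r*C) nested list into an rH x rW x C image.

    Channel layout assumption: channel index c*r*r + i*r + j holds the value
    for output channel c at sub-pixel offset (i, j).
    """
    height, width = len(features), len(features[0])
    channels = len(features[0][0]) // (r * r)
    out = [[[0.0] * channels for _ in range(width * r)]
           for _ in range(height * r)]
    for h in range(height):
        for w in range(width):
            for c in range(channels):
                for i in range(r):
                    for j in range(r):
                        src = c * r * r + i * r + j
                        out[h * r + i][w * r + j][c] = features[h][w][src]
    return out

# One LR pixel with r*r*C = 4 feature channels (r = 2, C = 1)
# becomes a 2 x 2 single-channel patch.
lr_features = [[[1.0, 2.0, 3.0, 4.0]]]
sr_image = pixel_shuffle(lr_features, r=2)
# sr_image == [[[1.0], [2.0]], [[3.0], [4.0]]]
```

Because the operation is a pure rearrangement, all learning happens in the convolutions that produce the r·r·C feature maps, which is why replacing that convolution with stronger modules is a natural place for improvement.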

Loss function
GAN-based perceptual SISR can generate natural images mainly because of the adversarial network. In order to further improve the perceptual quality of images, most studies apply different loss functions while improving the GAN network. Ledig et al. [Ledig, Theis, Huszár et al. (2017)] used a perceptual loss function in SRGAN, which includes a content loss and an adversarial loss. The content loss includes a pixel-based MSE loss (L2 loss) and a feature space-based VGG loss. Generally, the network uses MSE to ensure that the reconstructed image is similar to the ground truth image. The MSE loss is defined as:
$$L_{MSE} = \left\| I^{SR} - I^{HR} \right\|_2^2 \tag{7}$$
Here, I^SR represents a reconstructed super-resolution image, and I^HR represents a ground truth image. Besides, the network also uses the VGG loss to improve image perceptual quality. The loss function can be expressed as:
$$L_{VGG} = \left\| \phi(I^{SR}) - \phi(I^{HR}) \right\|_2^2 \tag{8}$$
where ϕ denotes the feature layers of the VGG19 network. In addition to the content loss introduced above, there are the adversarial loss L_G_adv of the GAN generator and the adversarial loss L_D_adv of the discriminator. They are defined as:
$$L_{G\_adv} = -\log D(G(I^{LR})) \tag{9}$$
$$L_{D\_adv} = -\log D(I^{HR}) - \log\left(1 - D(G(I^{LR}))\right) \tag{10}$$
Here, G and D represent the generator and the discriminator, respectively, and I^LR represents a low-resolution image. Since SRGAN was proposed, most studies have attempted to improve its content loss and adversarial loss. Following the terminology in the existing literature, we reclassify the content loss into content loss (e.g., L2 loss) and perceptual loss (e.g., L_VGG loss). Next, we introduce some improved loss functions.
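As a rough illustration of how the SRGAN-style terms fit together, the sketch below computes a squared-L2 content loss (in pixel space, or in feature space when a feature extractor is supplied) and the standard binary cross-entropy adversarial losses. The names, the flat-list image representation, the hypothetical `phi` callable standing in for VGG features, and the adversarial weight of 1e-3 are assumptions made for this example.

```python
import math

def content_loss(sr, hr, phi=lambda img: img):
    """Squared L2 distance between phi(sr) and phi(hr).

    With the default identity `phi` this is the pixel-wise MSE-style loss;
    passing a feature extractor (hypothetical here) gives a VGG-style loss.
    """
    f_sr, f_hr = phi(sr), phi(hr)
    return sum((a - b) ** 2 for a, b in zip(f_sr, f_hr))

def generator_adv_loss(d_fake):
    """-log D(G(I_LR)), averaged over the batch of discriminator outputs."""
    return -sum(math.log(p) for p in d_fake) / len(d_fake)

def discriminator_adv_loss(d_real, d_fake):
    """-log D(I_HR) - log(1 - D(G(I_LR))), averaged over the batch."""
    real = -sum(math.log(p) for p in d_real) / len(d_real)
    fake = -sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real + fake

def generator_loss(sr, hr, d_fake, adv_weight=1e-3):
    """Content loss plus a weighted adversarial term (weight is an assumption)."""
    return content_loss(sr, hr) + adv_weight * generator_adv_loss(d_fake)

# The generator is rewarded when the discriminator scores its output as real.
loss_fooled = generator_loss([1.0, 2.0], [1.0, 2.0], d_fake=[0.9])
loss_caught = generator_loss([1.0, 2.0], [1.0, 2.0], d_fake=[0.1])
```

Keeping the feature extractor as a parameter shows why the pixel loss and the VGG loss share the same form: only the space in which the distance is measured changes.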

Improved content loss
L1 loss. In addition to the L2 loss, some work also uses the L1 loss to measure the 1-norm distance between the reconstructed super-resolution image and the ground truth image [Cheon, Kim, Choi et al. (2018); Wang, Yu, Wu et al. (2018); Chen, Liu, Liu et al. (2019)]. The L1 loss can be calculated as:
$$L_1 = \left\| G(I^{LR}) - I^{HR} \right\|_1 \tag{11}$$
where I^LR denotes a low-resolution image, and I^HR denotes a ground truth image.
Total variation loss. To alleviate the amplification of high-frequency noise in GANs, Vu et al. [Vu, Luu, Yoo et al. (2018)] used a total variation loss function, which penalizes intensity differences between neighboring pixels of the reconstructed image.
Rank-content loss. RankSRGAN (2019) trains a Ranker that learns the behavior of perceptual metrics, and then uses a rank-content loss to optimize perceptual quality. The loss function is:
$$L_R = \mathrm{sigmoid}\left(R(G(I^{LR}))\right) \tag{13}$$
Here, R(G(I^LR)) represents the ranking score of the image generated from I^LR.
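The L1 and total variation terms discussed above can be sketched as follows. The total variation shown is the common anisotropic form (sum of absolute differences between horizontal and vertical neighbors); this is an assumption for illustration, and the exact form used in the cited work may differ. Images are 2-D nested lists of gray values.

```python
def l1_loss(sr, hr):
    """Sum of absolute pixel differences between two 2-D images."""
    return sum(abs(a - b)
               for row_sr, row_hr in zip(sr, hr)
               for a, b in zip(row_sr, row_hr))

def total_variation(img):
    """Anisotropic total variation of a 2-D image (assumed form)."""
    height, width = len(img), len(img[0])
    vertical = sum(abs(img[i + 1][j] - img[i][j])
                   for i in range(height - 1) for j in range(width))
    horizontal = sum(abs(img[i][j + 1] - img[i][j])
                     for i in range(height) for j in range(width - 1))
    return vertical + horizontal

flat_patch = [[0.5, 0.5], [0.5, 0.5]]    # smooth: no neighbor differences
noisy_patch = [[0.0, 1.0], [1.0, 0.0]]   # checkerboard: maximal differences

tv_flat = total_variation(flat_patch)    # 0.0
tv_noisy = total_variation(noisy_patch)  # 4.0
```

A smooth patch has zero total variation while a checkerboard of the same mean has a large one, which is why adding this term to the generator loss discourages amplified high-frequency noise.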

Improved perceptual loss
VGG19 loss combined with ResNet50 loss. Most GAN-based image super-resolution methods use a pre-trained VGG19 network to calculate the perceptual loss. They calculate the MSE between I^SR and I^HR in the feature space, as shown in Eq. (8). In contrast, Chen et al. [Chen, Liu, Liu et al. (2019)] calculated the 1-norm distance between I^SR and I^HR in the feature space. Moreover, they combined VGG19 and ResNet50 features to form a new perceptual loss.
Differential content loss. Cheon et al. [Cheon, Kim, Choi et al. (2018)] used a differential content loss to evaluate the distance between the reconstructed super-resolution image and the ground truth image at a deeper level. The loss compares the two images after applying d_x and d_y, which represent the horizontal and vertical differential operators, respectively.
Discrete cosine transform loss. With traditional content losses, improving the distortion-based performance of the image reduces its perceptual quality. Perceptual losses can alleviate this situation, so applying new perceptual losses is a key to improving both distortion and perception. To this end, Cheon et al. [Cheon, Kim, Choi et al. (2018)] used a discrete cosine transform (DCT) loss function to compare the differences between two images in the frequency domain. The loss is computed on DCT(I^SR) and DCT(I^HR), where DCT(I) represents the DCT coefficients of the image I.

Improved adversarial loss
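Relativistic GAN loss. SRGAN uses a binary cross-entropy adversarial loss, which can make model training unstable. Therefore, some studies have tried to stabilize the training of GAN models with other adversarial losses [Vu, Luu, Yoo et al. (2018); Wang, Yu, Wu et al. (2018)]. In SRGAN, the discriminator can be expressed as D(x)=sigmoid(C(x)), where C(x) represents the non-transformed output of the discriminator (before the sigmoid). Vu et al. [Vu, Luu, Yoo et al. (2018)] and Wang et al. [Wang, Yu, Wu et al. (2018)] both introduced the relativistic GAN [Jolicoeur-Martineau (2018)] into their improved networks. Among the following, Eqs. (17) and (18) are the relativistic GAN (RGAN) loss, and Eqs. (19) and (20) are the relativistic average GAN loss. These loss functions are defined as:
$$L_G^{R} = -\mathbb{E}\left[\log D_R(G(I^{LR}), I^{HR})\right] \tag{17}$$
$$L_D^{R} = -\mathbb{E}\left[\log D_R(I^{HR}, G(I^{LR}))\right] \tag{18}$$
$$L_G^{Ra} = -\mathbb{E}_{I^{HR}}\left[\log\left(1 - D_{Ra}(I^{HR}, G(I^{LR}))\right)\right] - \mathbb{E}_{I^{LR}}\left[\log D_{Ra}(G(I^{LR}), I^{HR})\right] \tag{19}$$
$$L_D^{Ra} = -\mathbb{E}_{I^{HR}}\left[\log D_{Ra}(I^{HR}, G(I^{LR}))\right] - \mathbb{E}_{I^{LR}}\left[\log\left(1 - D_{Ra}(G(I^{LR}), I^{HR})\right)\right] \tag{20}$$
where D_R(I^HR, G(I^LR)) = sigmoid(C(I^HR) - C(G(I^LR))), D_Ra(I^HR, G(I^LR)) = sigmoid(C(I^HR) - E_{I^LR}[C(G(I^LR))]), and D_Ra(G(I^LR), I^HR) = sigmoid(C(G(I^LR)) - E_{I^HR}[C(I^HR)]).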
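A small sketch may help show how the relativistic average discriminator D_Ra defined above is turned into losses. It follows the commonly used relativistic average formulation, takes the raw (pre-sigmoid) outputs C(·) for a batch of real and generated images, and returns the generator and discriminator losses. The function names and the toy batches are assumptions for illustration.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def mean(values):
    return sum(values) / len(values)

def ragan_losses(c_real, c_fake):
    """Relativistic average GAN losses from raw discriminator outputs.

    c_real: C(I_HR) for each real image in the batch.
    c_fake: C(G(I_LR)) for each generated image in the batch.
    Returns (generator_loss, discriminator_loss).
    """
    mean_real, mean_fake = mean(c_real), mean(c_fake)
    # D_Ra(real, fake): how much more realistic a real image looks
    # than the average generated image, and vice versa.
    d_real_vs_fake = [sigmoid(c - mean_fake) for c in c_real]
    d_fake_vs_real = [sigmoid(c - mean_real) for c in c_fake]
    loss_d = (-mean([math.log(p) for p in d_real_vs_fake])
              - mean([math.log(1.0 - p) for p in d_fake_vs_real]))
    loss_g = (-mean([math.log(1.0 - p) for p in d_real_vs_fake])
              - mean([math.log(p) for p in d_fake_vs_real]))
    return loss_g, loss_d

# When real images score far above generated ones, the discriminator
# loss is small and the generator loss is large.
g_loss, d_loss = ragan_losses(c_real=[3.0, 3.0], c_fake=[-3.0, -3.0])
```

Unlike the standard loss, the generator term here also depends on the real images, so the generator receives gradients from both real and generated data.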
Focal RGAN loss. In single image super-resolution experiments, images with rich textures and patches are more difficult to reconstruct. To address this, Vu et al. [Vu, Luu, Yoo et al. (2018)] used a focal loss that emphasizes difficult training samples and reduces the weight of easy samples, which improves the texture reconstruction in the images. The loss function is expressed as:
$$L_{FRG} = -\sum_i (1 - p_i)^{\gamma} \log(p_i) \tag{21}$$
Here, γ is a focusing parameter, p = σ(C(G(I^LR)) - C(I^HR)), and C is the discriminator output before the last sigmoid function σ.
Other GAN losses. In addition to the improved adversarial losses above, some studies use other GAN loss variants. Purohit et al. [Purohit, Mandal, Rajagopalan et al. (2018)] used the conditional GAN loss [Mirza and Osindero (2014)].
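The effect of the focusing parameter γ can be seen in a short sketch. It assumes the standard focal-loss form, -(1 - p)^γ · log(p), applied to the relativistic probability p = σ(C(G(I_LR)) - C(I_HR)) described above; the function name, the default γ and the toy scores are illustrative assumptions.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def focal_rgan_generator_loss(c_real, c_fake, gamma=1.0):
    """Mean of -(1 - p)^gamma * log(p) over paired samples (assumed form).

    p = sigmoid(C(G(I_LR)) - C(I_HR)) is close to 1 for easy samples,
    where the generated image already looks more real than the target.
    """
    losses = []
    for real_score, fake_score in zip(c_real, c_fake):
        p = sigmoid(fake_score - real_score)
        losses.append(-((1.0 - p) ** gamma) * math.log(p))
    return sum(losses) / len(losses)

# Easy sample: generated image scores well above the real one.
easy = focal_rgan_generator_loss([0.0], [4.0], gamma=2.0)
# Hard sample: generated image scores well below the real one.
hard = focal_rgan_generator_loss([4.0], [0.0], gamma=2.0)
# With gamma = 0 the focal weight disappears and -log(p) remains.
plain_hard = focal_rgan_generator_loss([4.0], [0.0], gamma=0.0)
```

With γ = 2 the easy sample contributes almost nothing while the hard sample keeps most of its loss, so training effort concentrates on textured regions that are difficult to reconstruct.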

Results of SISR reconstruction in recent years
In this section, we analyze the advantages and disadvantages of different GAN-based SISR methods through multiple comparative experiments. It should be noted that the experimental up-sampling factor for all the analyses below is ×4.

Comparison of training time and model parameters
First, we compare the training time and parameters of different models, as shown in Tab. 2. The lower the PI calculated on the PIRM-self validation set, the better the perceptual quality. "Parameters" of Tab. 2 indicates the size of the generator network. "*" of Tab. 2 indicates that the experimental result is from RPSRGAN. As can be seen from Tab. 2, EPSR and ESRGAN can achieve better perceptual quality. However, their model parameters are 28 times and 33 times larger than that of SRGAN, respectively. Moreover, EPSR takes about 3 times more training time than SRGAN. Furthermore, it can be seen from Tab. 2 that the currently widely used training dataset is the DIV2K dataset. Since the dataset has fewer classes, the Flickr2K dataset [Lim, Son, Kim et al. (2017)] is also used in the ESRGAN experiment. They found that using additional training data with rich textures was more conducive to restoring the texture of the image.

Comparison of PI scores for PSNR models
At present, most GAN-based perceptual SISR studies use PSNR-oriented network structures as the generators of their GAN models. Tab. 3 briefly summarizes the PSNR and PI values obtained by some PSNR-oriented models on different datasets. The results of SRCNN [Dong, Loy, He et al. (2014)], VDSR [Kim, Lee, Lee et al. (2016)] and EDSR [Lim, Son, Kim et al. (2017)] are from Vu et al. [Vu, Luu, Yoo et al. (2018)]. The DBPN [Haris, Shakhnarovich and Ukita (2018)] and RCAN [Zhang, Li, Li et al. (2018)] results are from Cheon et al. [Cheon, Kim, Choi et al. (2018)] and Lee et al. [Lee, Chuang and Wang (2019)], respectively. It can be seen from Tab. 3 that a PSNR-oriented model can, to a certain extent, achieve a lower PI value while achieving a higher PSNR value. This shows that PSNR and PI are not completely opposed, and that better PSNR-oriented models can be used in future perceptual SISR experiments. However, the PI scores obtained by the PSNR-oriented models are still far worse (higher) than the PI values computed on the high-resolution datasets themselves. This shows that most current PSNR-oriented models, used alone, are not well suited to perceptual SISR. On the other hand, using a better PSNR-oriented model inside a GAN-based perceptual SISR architecture may further improve the perceptual quality of the images.

Comparison of perceptual SISR models
Tab. 4 compares the PSNR and PI of different models on Set5, Set14, BSD100 (B100), and Urban100 (U100). It can be seen from Tab. 4 that models with lower PI scores tend to have lower PSNR scores. Combining Tabs. 3 and 4 shows that the PI values obtained by most models on different datasets are already lower than the PI scores computed on the high-resolution datasets themselves. This suggests that GAN-based perceptual SISR methods may generate excessively realistic images. Moreover, it can be seen from Fig. 8 that the image reconstructed by the GAN method is sharper and has a lower PI than the image reconstructed by the bicubic method, although its reconstruction accuracy (PSNR) is lower. This shows that it is meaningful to use PI to evaluate the perceptual quality of the generated images. Furthermore, SRResNet in Tab. 4 is the generator network of SRGAN, which is a PSNR-oriented model. It can also be seen from Tab. 4 that the super-resolution images generated by GAN architectures are inferior in PSNR to the images generated by PSNR-oriented models. However, GAN-based perceptual SISR can balance PSNR and PI through perceptual trade-off schemes, which is advantageous for future tasks that require both image quality and reconstruction accuracy. From the above analysis, pursuing only the perceptual quality or only the distortion of the image is one-sided for perceptual SR methods. A good reconstructed image requires both good perceptual quality and good reconstruction accuracy. Therefore, a solution that balances the accuracy of image reconstruction against the perceptual quality of the image may have more potential.

Comparison of different network structures and loss functions
In this experiment, we compare the network structures and loss functions of different methods. Tab. 5 shows the RMSE and PI on the PIRM test set. It can be seen from Tab. 5 that the generator loss of most models currently combines a pixel loss, a perceptual loss, and an adversarial loss, and that other losses are also being tried. At present, the method adopted by ESRGAN achieves the best perceptual quality in the third region. ESRGAN combined the residual dense network with the relativistic GAN loss and used some training tricks. In contrast, EPSR uses a slightly worse adversarial loss function and generator network. Nevertheless, EPSR achieved performance close to ESRGAN by setting non-negative scale factors for the terms of the generator loss. With this method, EPSR can achieve better perceptual quality in multiple regions of the perception-distortion plane. In short, different architectures combined with different loss functions may achieve state-of-the-art results in a certain region of the perception-distortion plane. However, methods that explicitly consider the perception-distortion trade-off are more likely to prevail over the entire plane.

Comparison of real dataset results
Currently, most GAN-based perceptual SISR methods are studied on bicubic degradation datasets. They use datasets with known degradation in both training and testing. In order to briefly analyze the performance of GAN-based perceptual SISR on a real dataset, we list some results in Tab. 6, taken from a 2019 study. Among them, DA-SRNet is a GAN-based perceptual SISR method, and the others are PSNR-oriented methods. In addition, RCAN-Bic and RCAN-Bic1 are models trained on a bicubic degradation (known degradation) dataset, while RCAN-Real and DA-SRNet are models trained on a real dataset (unknown degradation). It can be seen from Tab. 6 that the models trained on the known degradation dataset perform worse on the real dataset. However, RCAN-Real, trained on a real dataset, achieves better performance than RCAN-Bic, trained on a known degradation dataset. Furthermore, it should be noted that GAN-based perceptual SISR can achieve a lower PI score.

Conclusion and future direction
In this paper, recent single image super-resolution reconstruction based on generative adversarial networks is reviewed. We introduce the challenges faced by SISR as well as common evaluation datasets. In addition, we review the optimization directions of GAN-based perceptual SISR and summarize some reconstruction results from recent years. Although GAN-based perceptual SISR has achieved considerable success, single image super-resolution with generative adversarial networks is still at an early stage, so there remains a gap between the reconstructed image and the real image. Moreover, due to the problems of GANs themselves, the latest research has not brought major breakthroughs. However, we believe that these problems can be further addressed as neural networks improve. Based on this review, we propose the following possible future research directions:
(1) SISR reconstruction of video. Most studies currently use the DIV2K [Agustsson and Timofte (2017)] image dataset for training. At present, there are too few studies on image super-resolution using image sequences (video files). In the future, video files may be used to assist the super-resolution reconstruction of a single image.
(2) Improving the efficiency of models. Most studies pursue the expressive power of the model but ignore its time cost. The current state-of-the-art model (EPSR) has a very high runtime. Therefore, reducing the computation of the model and improving the speed of the algorithm is a direction worth considering.
(3) Establishment of new assessment measures. Most current research focuses on improving the PSNR and SSIM of reconstructed images, but these indexes do not correspond to the visual perception of the human eye. Although perception and distortion are used together to assess the perceptual quality of an image, a low perception score may indicate an overly realistic result. Therefore, a fairer image quality index is needed to measure the quality of reconstructed images in the future.
(4) SISR for unknown degradation. Most current methods obtain low-resolution images by degrading images with known algorithms (e.g., bicubic interpolation). However, these methods cannot adapt well to images with unknown degradation in the real world. Therefore, the reconstruction of low-resolution images with unknown degradation can be considered in the future [Bulat, Yang, Tzimiropoulos et al. (2018); Gong, Sun, Shi et al. (2020)].
(5) Design of GAN models. On the one hand, the performance improvement of perceptual super-resolution comes from loss functions. On the other hand, it comes from the design of the GAN models. In perceptual SR, it is a natural choice to use the latest image feature extraction networks to improve the GAN model. Therefore, other advanced or newly designed network structures can be considered for extracting image features in the future.