Generative Adversarial Network-Based Edge-Preserving Superresolution Reconstruction of Infrared Images

Convolutional neural networks have achieved good results in the superresolution reconstruction of single-frame images. However, because infrared images lack detail and suffer from poor contrast and blurred edges, superresolution reconstruction that preserves the edge structure and delivers good visual quality remains challenging. To address the low resolution and unclear edges of infrared images, this work proposes a two-stage generative adversarial network model that reconstructs realistic superresolution images from infrared images downsampled by a factor of four. The first stage focuses on recovering the overall contour information of the image to obtain clear image edges; the second stage focuses on recovering the detailed feature information of the image and has a stronger ability to express details. The proposed infrared image superresolution reconstruction method produces highly realistic visual results and good objective quality evaluation results.


Introduction
Infrared imaging technology used for passive noncontact detection and identification has the advantages of good concealment, strong transmission ability, no interference from electromagnetic waves, and good low-light and night-vision capabilities [1]. In addition to its main military applications, this technology can also be widely used in civilian fields, such as industry, agriculture, medicine, and public security reconnaissance. However, infrared images have many disadvantages, such as low resolution, low contrast, and blurred edges. Although the hardware performance of an infrared imaging system can be improved by improving the manufacturing process of the infrared detector, doing so requires tremendous human and financial resources and is difficult to achieve in the short term. Therefore, digital signal processing is an economical and effective way to improve the quality of infrared images [2][3][4]. Superresolution reconstruction refers to the reconstruction of a high-resolution image or sequence from single-frame or multiframe low-resolution images and includes three main types: interpolation-based methods, reconstruction-based methods, and instance-based learning methods. Instance-based learning methods are flexible in algorithm structure, can provide more details under high magnification, and have thus become a research hotspot of superresolution reconstruction in recent years. Chao et al. [5] proposed the use of a convolutional neural network (CNN) to achieve the superresolution reconstruction of visible light images, learning the mapping between low-resolution and high-resolution images by training on a large dataset. Other researchers [6] used the perceptual loss to replace the minimum mean square error and learned upsampling to replace bicubic interpolation, achieving better results.
Subsequently, network structures with more layers, such as the deeply recursive convolutional network (DRCN) [7] and the efficient subpixel convolutional neural network (ESPCN) [8], were proposed to achieve better outcomes. In the field of machine learning, generative modeling has always been a difficult problem. The generative adversarial network (GAN) [9] meets the demands for generative models in research and application in many fields. A GAN uses only backpropagation and thus avoids complex Markov chains. At the same time, a GAN is trained in an unsupervised manner, enabling the production of clearer and more realistic samples. In recent research, GANs have been widely used [10][11][12]. Among them, the SaliencyGAN model proposed by Wang et al. [10] is a semisupervised salient object detection method for the Internet of Things. Yang et al. [12] proposed a new model based on a conditional generative adversarial network (DAGAN) to reconstruct Compressed Sensing Magnetic Resonance Imaging (CS-MRI). Literature [11] proposed a novel fast CS-MRI deep learning architecture based on a conditional generative adversarial network. SRGAN [13] is a GAN for image superresolution (SR) and can recover a photorealistic natural image from an input downsampled by a factor of four. However, the magnified details of the images generated by this method usually show unsightly artifacts. To further improve the visual quality, researchers [14] have proposed a residual-in-residual dense block (RRDB) network unit and made improvements to the perceptual loss. Researchers [15] have also proposed a novel generative adversarial framework to improve the edge structure and texture information in compressed images. ESRGAN+ [16] uses a network architecture with novel basic blocks to replace the basic structure used by the original ESRGAN. In the field of infrared imaging, SR reconstruction has been mostly achieved using sparse coding methods [17][18][19][20][21].
In this study, we propose a new GAN framework to improve the perceived quality of infrared images. We then design a multiconstraint loss function by combining image fidelity loss, adversarial loss, feature fidelity loss, and edge fidelity loss. By iteratively minimizing these loss functions, the proposed method produces reconstructed images with high resolution and sharp edges (Figure 1).
The main contributions of this paper are as follows:
(i) We propose a GAN-based SR reconstruction framework for edge preservation of infrared images that enables the GAN to better restore the edge structure of the infrared image while maintaining the detailed information.
(ii) To preserve the characteristics and the edge information in the image, we propose a multiple-constraint loss function applicable to SR reconstruction.
(iii) We validate the proposed method using images from publicly available datasets and compare its performance with that of other mainstream methods. The results confirm that, compared with other methods, the proposed method obtains more realistic reconstructed infrared SR images with sharper edges.

International Journal of Digital Multimedia Broadcasting
We describe the proposed network framework and loss function in Section 2 and their quantitative evaluations using the public datasets in Section 3, followed by recommended future study on the proposed method and the conclusion.

Method
To generate a high-resolution image with photograph-grade realistic details, and inspired by the literature [22,23], we propose a simple and effective two-stage GAN, in which the image generation process is divided into two stages (Figure 2).
Figure 6: Comparison of the reconstructed results of images from the FLIR_ADAS_1_3 validation set. Rows from top to bottom: original images, SRCNN, ESPCN, SRGAN, ESRGAN, ESRGAN+, and our proposed method.

Stage 1 GAN.
In Stage 1, we use the 128 × 128 low-resolution image $I_{LR0}$ as input and feed it to the Stage 1 generator $G_0$ to generate a fake 256 × 256 image ($I_{SR0}$), which, together with a real 256 × 256 image ($I_{HR0}$), is passed to the Stage 1 discriminator $D_0$ for classification. The core of the network structure of the generator $G_0$ is shown in Figure 3. We adopt the network design of Ledig et al. [13] and introduce the skip connection, which has been proven effective in training deep neural networks. We adopt the residual block proposed in the literature [24] and stack six residual blocks to extract features from the image. Recent work on single-image superresolution [25][26][27][28] has pointed out that hallucination artifacts often appear as the network deepens and when training and testing are carried out under the GAN framework. To address this problem, we follow the method in [29] and add batch normalization (BN) layers to the network. Each residual block contains two convolutional layers with a kernel size of 3 × 3 and 64 feature maps, two BN layers, and one parametric rectified linear unit (PReLU) layer [30].
Figure 7: Comparison of the reconstructed results of images from the Itir_v1_0 validation set. Rows from top to bottom: original images, SRCNN, ESPCN, SRGAN, ESRGAN, ESRGAN+, and our proposed method.
The specific settings of each layer of the generative model are described with the following notation. C(PR,64) denotes a set of convolutional layers with 64 feature maps and activation function PReLU; C(64)BN(PR)C(64)BN(PR)SC represents a residual block, where BN(PR) is a batch normalization layer with activation function PReLU and SC denotes a skip connection. There are six residual blocks in total. C(t,3) represents a convolutional layer with 3 feature maps and activation function tanh.
To distinguish the generated SR samples from real high-resolution (HR) samples, we train the discriminator network $D_0$, whose overall framework is shown in Figure 4. It follows the architectural guidelines summarized by Radford et al. [31]: LeakyReLU [32] activation is used, and max pooling is avoided throughout the network. The discriminator network includes ten convolutional blocks. All blocks except for the first contain a convolutional layer, a BN layer, and a LeakyReLU layer. As in the Visual Geometry Group (VGG) network [33], the number of filter kernels increases progressively, here from 64 to 1024, and each time the number of kernels increases, a strided convolution is used to lower the image resolution. After that, a special residual block containing two convolution layers and a LeakyReLU layer is connected. The output of the last convolution unit is sent to a dense layer with a sigmoid activation function to obtain a real-or-fake decision. The structure and parameters of each layer of the discriminator $D_0$ are described with the following notation. LR denotes the activation function LeakyReLU; C(LR,64) denotes a set of convolutional layers with 64 feature maps and activation function LeakyReLU; C(64)BN(LR) denotes a set of convolutional layers with 64 feature maps followed by batch normalization with activation function LeakyReLU; and D is the dense output layer. The number of feature maps increases from 64 to 1024.
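The effect of the strided convolutions on the feature-map size can be traced with the standard convolution output-size formula. The sketch below is only illustrative: the channel sequence, the placement of stride 2 on every second block, and the 3 × 3 kernel with padding 1 are assumptions chosen to match the "ten blocks, 64 to 1024 feature maps" description, not the exact layer specification of $D_0$.

```python
def conv_out(size, kernel=3, stride=1, padding=1):
    """Spatial output size of a convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1


# Hypothetical layout (assumption): ten blocks, feature maps 64 -> 1024,
# with a stride-2 convolution on every second block.
CHANNELS = [64, 64, 128, 128, 256, 256, 512, 512, 1024, 1024]
STRIDES = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]


def trace_discriminator(size=256):
    """Return the (channels, height, width) shape after each block."""
    shapes = []
    for channels, stride in zip(CHANNELS, STRIDES):
        size = conv_out(size, stride=stride)
        shapes.append((channels, size, size))
    return shapes


shapes = trace_discriminator(256)
for index, shape in enumerate(shapes, start=1):
    print(f"block {index:2d}: {shape}")
```

Under these assumptions, a 256 × 256 input is halved five times, so the dense layer would receive 1024 feature maps of size 8 × 8.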
Stage 1. Reconstruction Loss Function. The loss function contains three parts: adversarial loss, image fidelity loss, and edge fidelity loss (Eq. (3)). These parts each capture different perceptual characteristics of the reconstructed image to obtain a more visually satisfactory reconstructed image.
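Written out from the verbal definitions in this subsection, the Stage 1 objective would take roughly the following form. This is a hedged reconstruction rather than the authors' exact formulas: the symbols $L_{G_0}$ and $L_{adv0}$, the assignment of $\lambda_1$, $\lambda_2$, $\lambda_3$ to the terms in the order they are listed, the squared differences, and the normalization constants are all assumptions.

```latex
\begin{align*}
L_{G_0} &= \lambda_1 L_{adv0} + \lambda_2 L_{mse} + \lambda_3 L_{edge}, \\
L_{adv0} &= -\log D_{\theta_D}\!\left(G_{\theta_G}(I_{LR0})\right), \\
L_{mse} &= \frac{1}{WHC} \sum_{x=1}^{W} \sum_{y=1}^{H} \sum_{c=1}^{C}
  \left( I_{HR0}(x,y,c) - G_{\theta_G}(I_{LR0})(x,y,c) \right)^2, \\
L_{edge} &= \frac{1}{WH} \sum_{x=1}^{W} \sum_{y=1}^{H}
  \left( I_E(x,y) - \hat{I}_E(x,y) \right)^2 .
\end{align*}
```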
The weights $\{\lambda_i\}$ in Eq. (3) are trade-off parameters that balance the loss components. The first term is the adversarial loss between the generator $G_0$ and the discriminator $D_0$, which encourages the generator to fool the discriminator by producing more realistic HR images. In this term, $D_{\theta_D}(G_{\theta_G}(I_{LR0}))$ is the estimated probability that the reconstructed image $G_{\theta_G}(I_{LR0})$ is a true HR image. To obtain better gradients, we minimize $-\log D_{\theta_D}(G_{\theta_G}(I_{LR0}))$ instead of $\log[1 - D_{\theta_D}(G_{\theta_G}(I_{LR0}))]$. The second term of Eq. (3), $L_{mse}$, ensures the fidelity of the restored image using the pixel-level mean square error (MSE) loss, where $W$, $H$, and $C$ are the width, the height, and the number of channels, respectively, of the image. The third term of Eq. (3), $L_{edge}$, is the edge fidelity loss, which is intended to reproduce sharp edge information; here $W$ and $H$ are again the width and the height of the image. The labeled edge map $I_E$ is extracted by an edge filter from the real 256 × 256 image $I_{HR0}$, while $\hat{I}_E$ is extracted by an edge filter from the 256 × 256 image $I_{SR0}$ generated by the generator $G_0$. In our experiments, we chose the Canny edge detection operator. By minimizing the edge fidelity loss, the network continuously guides edge recovery.
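To make the three Stage 1 terms concrete, the following is a minimal pure-Python sketch on tiny single-channel images. It is not the authors' implementation: the edge filter here is a simple forward-difference gradient magnitude used as a stand-in for the Canny operator, the function names are hypothetical, and the weights passed to `stage1_loss` are placeholders, since the Stage 1 values of $\lambda_i$ are not stated in the text.

```python
import math


def mse_loss(hr, sr):
    """Pixel-level mean square error between two equally sized 2-D images."""
    height, width = len(hr), len(hr[0])
    total = sum((hr[y][x] - sr[y][x]) ** 2
                for y in range(height) for x in range(width))
    return total / (width * height)


def edge_map(img):
    """Stand-in edge filter (assumption): forward-difference gradient
    magnitude. The paper uses the Canny operator instead."""
    height, width = len(img), len(img[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(height - 1):
        for x in range(width - 1):
            gx = img[y][x + 1] - img[y][x]
            gy = img[y + 1][x] - img[y][x]
            out[y][x] = math.hypot(gx, gy)
    return out


def edge_loss(hr, sr):
    """Edge fidelity: MSE between the edge maps of the real and generated images."""
    return mse_loss(edge_map(hr), edge_map(sr))


def adversarial_loss(d_of_sr, eps=1e-12):
    """-log D(G(I_LR0)); d_of_sr is the discriminator's probability output."""
    return -math.log(max(d_of_sr, eps))


def stage1_loss(hr, sr, d_of_sr, lambdas):
    """Weighted sum of adversarial, image fidelity, and edge fidelity terms."""
    l_adv, l_mse, l_edge = lambdas
    return (l_adv * adversarial_loss(d_of_sr)
            + l_mse * mse_loss(hr, sr)
            + l_edge * edge_loss(hr, sr))


# A sharp vertical edge and a blurred version of it.
sharp = [[0.0, 0.0, 1.0, 1.0] for _ in range(4)]
blurred = [[0.0, 0.25, 0.75, 1.0] for _ in range(4)]

print(mse_loss(sharp, blurred))   # pixel error of the blurred edge
print(edge_loss(sharp, blurred))  # additional penalty on the smeared edge
print(stage1_loss(sharp, blurred, d_of_sr=0.5, lambdas=(1.0, 1.0, 1.0)))
```

In this toy case the blurred edge is penalized by both the pixel term and the edge term, which illustrates why adding the edge fidelity loss pushes the generator toward sharper contours than MSE alone.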

Stage 2 GAN.
In the second stage, we use the generated 256 × 256 image $I_{SR0}$ as the low-resolution input and feed it to the Stage 2 generator $G_1$ to generate a 512 × 512 image $I_{SR1}$, which, together with a real 512 × 512 image $I_{HR1}$, is passed to the Stage 2 discriminator $D_1$ to decide whether it is real or fake. For the generator model $G_1$ (Figure 5), we adopt the network design described in the literature [15]. In the layer notation, C(R,64) denotes a set of convolutional layers with 64 feature maps and activation function ReLU, and C(64)BN(R)C(64)BN(R)SC represents a residual block. The discriminator network $D_1$ in the second stage adopts a network structure similar to that of the discriminator $D_0$.
Stage 2. Reconstruction Loss Function. The loss function contains three parts: adversarial loss, image fidelity loss, and feature fidelity loss (Eq. (9)). The weights $\{\lambda'_i\}$ are trade-off parameters that balance the loss components. The first term, $L_{adv1}$, is the adversarial loss between the generator $G_1$ and the discriminator $D_1$. The second term, $L_{mse1}$, is the image fidelity loss. The third term, $L_{feature}$, is the feature fidelity loss, which is based on the feature space distance defined in the literature [34] and encourages the reconstructed image to preserve a feature representation similar to that of the original image. In this term, $W$, $H$, and $C$ are the width, the height, and the number of channels, respectively, of the image, and $\phi(\cdot)$ is the feature space function, a pretrained VGG-19 [33] network that maps images to feature space. The L2 distance between the feature activations at the fourth pooling layer is used as the feature fidelity loss.
First, we downsample all experimental data by a factor of 4 to obtain low-resolution (LR) images from the HR images. We set the batch size to 4 and use the Adam optimizer [35] with a momentum term of β = 0.9. To bring the loss components to the same order of magnitude so that they are better balanced, we set $\lambda'_1$, $\lambda'_2$, and $\lambda'_3$ of Eq. (9) to $10^{-3}$, 1, and $10^{-6}$, respectively. When training the Stage 1 GAN, we set the learning rate to $10^{-4}$, which is lowered to $10^{-5}$ when training the Stage 2 GAN.
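The role of the trade-off weights can be illustrated with a small sketch that combines three loss terms using the stated values $10^{-3}$, 1, and $10^{-6}$. The mapping of these weights to the adversarial, image fidelity, and feature fidelity terms follows the order in which the terms are listed and is an assumption, as are the raw loss magnitudes, which are made up purely for illustration.

```python
# Weights from the text; their assignment to terms is assumed from the listing order.
STAGE2_WEIGHTS = {"adv": 1e-3, "mse": 1.0, "feature": 1e-6}


def stage2_loss(terms, weights=STAGE2_WEIGHTS):
    """Return the weighted total and the per-term weighted contributions."""
    contributions = {name: weights[name] * terms[name] for name in weights}
    return sum(contributions.values()), contributions


# Hypothetical raw magnitudes: feature-space distances are typically much
# larger than pixel-space MSE on normalized images.
raw_terms = {"adv": 5.0, "mse": 0.004, "feature": 3000.0}

total, parts = stage2_loss(raw_terms)
for name, value in parts.items():
    print(f"{name:8s} weighted contribution: {value:.4f}")
print(f"total: {total:.4f}")
```

With these illustrative magnitudes, the weighted contributions all land around $10^{-3}$, which is the balancing effect the weights are meant to achieve.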

Experimental Evaluation.
To verify the effectiveness of the proposed method, we conduct the validation on two public datasets: the FLIR_ADAS_1_3 validation set (1366 images) and the Itir_v1_0 dataset (11262 images). We compared the proposed method with state-of-the-art methods, including superresolution using deep convolutional networks (SRCNN) [5], ESPCN [8], SRGAN [13], ESRGAN [15], and ESRGAN+ [16]. Three images are selected from the validation set of FLIR_ADAS_1_3 and from the Itir_v1_0 dataset, and the subjective results of the methods are shown in Figures 6 and 7. The reconstruction results show that our proposed method produces finer texture and edge details.
To facilitate a fair quantitative comparison, we used six objective indicators to evaluate the quality of the reconstructed images and the SR methods: the correlation coefficient (CC) [36], the peak signal-to-noise ratio (PSNR) [37], the Structural Similarity Index Measure (SSIM) [38], the visual information fidelity (VIF) [39], the Universal Image Quality Index (UIQI), and the time consumption. The quantitative results of the comparison of different reconstruction methods are shown in Table 1, which shows that our method is superior to the SRCNN [5], ESPCN [8], SRGAN [13], ESRGAN [15], and ESRGAN+ [16] methods in the indicators of CC, PSNR, SSIM, and UIQI on the FLIR data set. The VIF index is slightly lower than that of the ESRGAN+ method, and the time consumption is greater than that of the ESRGAN method.
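Two of these indicators have simple closed forms and can be sketched in a few lines. The example below computes PSNR and the Pearson correlation coefficient on flattened pixel lists; it assumes 8-bit images (peak value 255) and uses hypothetical function names, and it is not the evaluation code used to produce Tables 1 and 2.

```python
import math


def psnr(ref, img, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two flattened images."""
    mse = sum((a - b) ** 2 for a, b in zip(ref, img)) / len(ref)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)


def correlation_coefficient(ref, img):
    """Pearson correlation coefficient between two flattened images."""
    n = len(ref)
    mean_ref = sum(ref) / n
    mean_img = sum(img) / n
    cov = sum((a - mean_ref) * (b - mean_img) for a, b in zip(ref, img))
    var_ref = sum((a - mean_ref) ** 2 for a in ref)
    var_img = sum((b - mean_img) ** 2 for b in img)
    return cov / math.sqrt(var_ref * var_img)


reference = [0, 64, 128, 255]
reconstructed = [1, 63, 129, 254]  # every pixel off by one gray level

print(f"PSNR: {psnr(reference, reconstructed):.2f} dB")
print(f"CC:   {correlation_coefficient(reference, reconstructed):.6f}")
```

Higher values of both indicators mean the reconstruction is closer to the reference; CC is insensitive to a global brightness or contrast change, whereas PSNR is not.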
The quantitative results of comparison of different reconstruction methods on the Itir_v1_0 dataset are shown in Table 2, indicating that the proposed method is superior to the SRCNN [5], ESPCN [8], SRGAN [13], ESRGAN [15], and ESRGAN+ [16] methods on the Itir_v1_0 dataset. Only the time consumption is slightly greater than the ESRGAN method.
In order to more intuitively illustrate the effectiveness of the superresolution method proposed in this work for improving the edge features of infrared images, we show in Figure 8 the comparison of the edge detection results of the superresolution reconstruction results of various methods. It can be seen from the figure that the image reconstructed by our method has more and clearer edge information, which is meaningful for the application of infrared images.

Use Advanced Vision Tasks to Compare Superresolution Results.
Low-level vision tasks, including image superresolution reconstruction, ultimately serve high-level vision tasks. Infrared images are widely used in target detection and target matching tasks, but their low resolution and unclear edges reduce the accuracy of these tasks. Therefore, whether infrared image superresolution reconstruction improves the accuracy of these tasks can serve as an evaluation index for the reconstruction result. To further verify our method, we match the superresolution images generated by several methods with real high-resolution images. The Scale Invariant Feature Transform (SIFT) describes the statistics of Gaussian image gradients in the neighborhood of feature points and is a commonly used algorithm for extracting local image features. In the matching result, the number of matching points can be used as a criterion for matching quality, and the corresponding matching points also indicate the similarity of the local features of the two images. Figure 9 shows the result of matching the superresolution reconstructed image with the original high-resolution image through the SIFT algorithm. The number of matches shows that the reconstructed image produced by our proposed method obtains more correct matching pairs than those of the other methods.
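Counting matching points amounts to pairing each descriptor in one image with its nearest neighbour in the other and keeping only unambiguous pairs. The sketch below illustrates this on toy 2-D descriptors using a nearest-neighbour search with a distance-ratio test. The ratio test and its threshold of 0.75 are assumptions (a common choice for SIFT matching, not a setting stated in the text), and real SIFT descriptors are 128-dimensional vectors produced by a feature extractor.

```python
import math


def match_descriptors(desc_a, desc_b, ratio=0.75):
    """Return index pairs (i, j) where desc_b[j] is the nearest neighbour of
    desc_a[i] and is clearly closer than the second-nearest neighbour."""
    matches = []
    for i, descriptor in enumerate(desc_a):
        dists = sorted((math.dist(descriptor, other), j)
                       for j, other in enumerate(desc_b))
        if len(dists) >= 2 and dists[0][0] < ratio * dists[1][0]:
            matches.append((i, dists[0][1]))
    return matches


# Toy descriptors (hypothetical): the first two SR descriptors lie close to
# their HR counterparts, the third is equally far from all HR descriptors.
hr_descriptors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
sr_descriptors = [(0.5, 0.0), (10.0, 0.5), (5.0, 5.0)]

matches = match_descriptors(sr_descriptors, hr_descriptors)
print(f"{len(matches)} matches: {matches}")
```

The ambiguous third descriptor is rejected, so only two pairs are counted; a reconstruction with sharper, more distinctive local structure would retain more such pairs.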
In this experiment, we use the classic YOLO [40] method for image target detection (Figure 10). The superresolution reconstructed image generated by our proposed method yields better detection results, with more targets detected.

Discussion and Future Work
Through experiments, we demonstrate that compared with other methods, the proposed method has a better perceptual performance. However, during the experiment, we also found that for some images, the reconstruction results are not satisfactory, as shown in Figure 11. By analyzing these images, we found that they have common characteristics, i.e., when the imaging device and the imaging object are moving at high relative speeds, the captured image may contain motion blur. For such images, ordinary SR reconstruction methods cannot achieve effective edge recovery. Therefore, in future studies, we will address these problems.

Conclusion
In this study, we propose a two-stage GAN framework that reconstructs SR images by recovering edge structure information and retaining feature information. In the first stage, the image fidelity loss, the adversarial loss, and the edge fidelity loss are combined to preserve the edges of the image. In the second stage, the adversarial loss, the image fidelity loss, and the feature fidelity loss are combined to mine image visual features. By iteratively updating the generative network and the discriminator network, the SR reconstruction of infrared images with preserved edges is achieved. Experimental verification results show that the proposed method outperforms several other image reconstruction methods in reconstructing SR infrared images.
Figure 11: Images with motion blur cannot obtain clear edge information using the ordinary SR reconstruction methods.

Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.