The Fusion of Unmatched Infrared and Visible Images Based on Generative Adversarial Networks



Introduction
Image fusion uses mathematical methods to comprehensively process the important information acquired by multiple sensors and produce a composite image that is easier to understand, thereby greatly improving the utilization of image information and the reliability and degree of automation of target detection and recognition systems. Image fusion technology currently plays an important role in military, remote sensing, medical, computer vision, target recognition, intelligence acquisition, and other applications. The fusion of visible and infrared images is one of the most useful applications of this technology. Infrared imaging sensors capture the heat radiation emitted by objects. On the one hand, infrared images are less affected by darkness or severe weather conditions but typically lack sufficient background and contour edge details. On the other hand, visible images obtained by spectral reflection offer high resolution, excellent image quality, and rich background details but cannot capture objects under hidden, low-light, or nighttime conditions. The advantages of visible and infrared images can be combined by constructing fused images that retain richer feature information, making them suitable for subsequent processing tasks.
Image fusion is divided into three levels (from lowest to highest level): pixel-, feature-, and decision-level image fusion. Currently, the most highly studied and frequently applied image fusion is performed at the pixel level, and the majority of proposed image fusion algorithms work at this level. According to different image fusion processing domains, image fusion can be roughly divided into two categories: the spatial domain and the transform domain. Image fusion based on the spatial domain is directly conducted on the pixel gray space of an image. Common image fusion methods based on the spatial domain include linear weighted image fusion, false color image fusion, image fusion based on modulation, image fusion based on statistics, and image fusion based on neural networks [1,2]. Image fusion based on the transformation domain involves transforming multisource images, combining coefficients from the transformation to obtain transformation coefficients of the fused images, and conducting inverse transformation to obtain the fused images. Common fusion algorithms based on the transform domain include those based on the discrete cosine transform (DCT), the fast Fourier transform (FFT), the multiscale transform [3][4][5], image subspace technology [6,7], the saliency method [8,9], the sparse representation method [10,11], and others [12][13][14][15].
Currently, image fusion based on the transform domain is a widely researched approach. Most image fusion algorithms are based on multiscale decomposition and typically use the same transformation or representation for different source images. Since the thermal radiation in infrared images and the texture information in visible images differ in nature, multiscale decomposition methods are not well suited to the fusion of infrared and visible images. To overcome this problem, the developers of FusionGAN [16] proposed an infrared and visible image fusion method based on the novel perspective of generative adversarial networks (GANs) [17]. In FusionGAN, both the visible image and the generated fused image enter the discriminator. To "deceive" the discriminator, the fused image retains more visible information but loses infrared thermal radiation information. Moreover, it is often difficult to obtain perfectly matched infrared and visible images. To solve these problems, this paper proposes a new fusion method that generates matching infrared images for visible images and produces fused images that retain both visible texture details and infrared thermal radiation information, as shown in Figure 1. The main contributions of this paper are as follows: (1) a new GAN framework is proposed to retain more information from visible and infrared images in fused images; (2) for visible images without matching infrared images, approximate infrared images are generated to facilitate subsequent image fusion; (3) to verify the feasibility and effectiveness of the proposed method, experiments are conducted on publicly available visible and infrared image datasets, and the proposed method is compared with other methods using several popular evaluation metrics. The remainder of the paper is arranged as follows: Section 2 briefly reviews related studies on image fusion and GANs. Section 3 introduces the proposed method.
In Section 4, the fusion performance of the proposed method is experimentally evaluated. Section 5 presents the conclusion of the paper.

Related Studies
In this section, several methods for the fusion of visible and infrared images are briefly introduced along with GANs.

Infrared and Visible Image Fusion Using a Deep Learning Framework.
Li et al. [18] proposed a method for fusing infrared and visible images using a deep learning framework. The authors decompose each source image into base and detail content. The base content is fused by weighted averaging. A deep learning network is used to extract multilayer features, and the L1-norm and a weighted-average strategy are then used to generate candidates for the fused detail content. The final fused detail content is obtained using a max-selection strategy.

Infrared and Visual Image Fusion through Infrared Feature Extraction and Visual Information Preservation.
Zhang et al. [19] proposed an image fusion method that uses infrared feature extraction and visual information preservation.
This method uses quadtree decomposition and Bézier interpolation to reconstruct the infrared background and then subtracts the reconstructed background from the infrared image to obtain bright infrared features. The processed infrared features are then added to the visible image to produce the final fused image.

Generative Adversarial Networks.
A GAN [17] consists of a generator G and a discriminator D that play a minimax game. The generator attempts to generate realistic images to trick the discriminator, and the discriminator must distinguish real images from the images produced by the generator until G and D reach the Nash equilibrium:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].    (1)
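As a quick numerical illustration (a toy sketch, not from the paper): with scalar discriminator outputs, the per-sample value function behaves as follows, and at the equilibrium, where the optimal discriminator outputs 1/2 everywhere, it attains the classical value −log 4.

```python
import math

def value(d_real, d_fake):
    # per-sample GAN value: log D(x) + log(1 - D(G(z)))
    return math.log(d_real) + math.log(1.0 - d_fake)

# A discriminator that separates well (e.g., 0.9 on real, 0.1 on fake)
# achieves a higher value than the equilibrium discriminator, which
# outputs 0.5 everywhere and yields value(0.5, 0.5) = -2 log 2 = -log 4.
```

The discriminator ascends this value while the generator descends it, which is exactly the two-player game used below.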

Conversion between Matched Images.
The Pix2Pix model [20], based on a conditional GAN (CGAN) [21], performs translation between a variety of matched image pairs. In Pix2Pix, generator G does not require random noise and accepts only one input image X as condition C, producing translated image Y as the output. Meanwhile, discriminator D accepts an X sample and a Y sample, where Y is either a real sample or a sample produced by the generator, and D determines whether X and Y form an actual matched pair.

FusionGAN.
FusionGAN [16] uses a GAN to fuse the thermal radiation information of infrared images with the high resolution and clear texture details of visible images. FusionGAN's generator produces a fused image with infrared intensity and additional visible gradients, and the discriminator distinguishes the fused image from the real visible image so that the fused image retains both infrared and visible image information.

Method
This section introduces the proposed method. First, the structural framework of the model is described; the model components are then described in greater detail.

Structural Framework of the Model.
In this paper, the GAN two-player game is used to fuse visible and infrared images. The structural framework of the training process is shown in Figure 2. Visible image I_V is input as condition C into generator G1 to generate a fake infrared image I_RG. Next, the visible image and the fake infrared image are concatenated along the channel dimension and input into generator G2, which outputs fused image I_F. Discriminator D1 distinguishes between real visible image I_V and fused image I_F so that the fused image moves closer to the visible image and gains more visible texture details. Simultaneously, discriminator D2 distinguishes real infrared image I_R from generated infrared image I_RG and fused image I_F. Through continuous updating, the generated infrared image becomes closer to the real infrared image, and the fused image comes to contain more thermal radiation information. As shown in Figure 3, at test time a visible image alone is input to obtain a fused image with both visible texture and infrared radiation information.
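The data flow above can be sketched at the shape level (a minimal sketch; the 256 × 256 size and single-channel images are assumptions for illustration, not taken from the paper):

```python
import numpy as np

def forward_shapes(h=256, w=256):
    """Shape-level sketch of one forward pass; zeros stand in for real
    network outputs, so only the tensor shapes are meaningful."""
    I_V = np.zeros((h, w, 1))                     # visible image (condition C)
    I_RG = np.zeros((h, w, 1))                    # G1(I_V): generated infrared image
    g2_in = np.concatenate([I_V, I_RG], axis=-1)  # channel-wise concatenation
    I_F = np.zeros((h, w, 1))                     # G2(g2_in): fused image
    # D1 compares I_V against I_F; D2 compares I_R against I_RG and I_F
    return g2_in.shape, I_F.shape
```

G2 therefore receives a two-channel input, matching the concatenated-channel description above.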

Network Structure.
In our model, generator G1 is a three-part convolutional neural network (CNN), as shown in Figure 4. It contains a downsampling component built from convolutions, an upsampling component built from deconvolutions, and a tanh activation component. The downsampling component contains 7 convolution blocks. Except for the first block, each block contains one convolution layer and one LeakyReLU activation layer. The upsampling component likewise contains 7 deconvolution blocks. Each convolution layer adopts a 4 × 4 filter with a stride of 2 and "same" padding.
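With "same" padding and a stride of 2, each block halves the spatial size (rounding up). A small sketch of the resulting feature-map sizes, assuming a hypothetical 256 × 256 input (the input resolution is not stated above):

```python
import math

def same_conv_out(size, stride=2):
    # output spatial size of a strided convolution with "same" padding
    return math.ceil(size / stride)

def encoder_sizes(size, blocks=7):
    # feature-map sizes through the 7 downsampling blocks of G1
    sizes = [size]
    for _ in range(blocks):
        sizes.append(same_conv_out(sizes[-1]))
    return sizes
```

For a 256-pixel side this yields [256, 128, 64, 32, 16, 8, 4, 2], and the 7 deconvolution blocks mirror the sizes back up to the input resolution.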
Generator G2 is a simple five-layer CNN, as shown in Figure 5. The first two layers use a 5 × 5 filter, layers 3 and 4 use a 3 × 3 filter, and the last layer uses a 1 × 1 filter. Each convolution layer has a stride of 1 and no padding.
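Because G2 uses stride 1 with no padding, each k × k layer shrinks the spatial size by k − 1 pixels. A quick sanity check of the total shrinkage (the 256 × 256 input is again a hypothetical choice):

```python
def g2_output_size(size, kernels=(5, 5, 3, 3, 1)):
    # valid (unpadded) stride-1 convolutions: each k x k layer removes k - 1 pixels
    for k in kernels:
        size = size - (k - 1)
    return size
```

The five layers thus remove 12 pixels per side in total (e.g., 256 → 244); a common workaround, not described here, is to pad the input so the fused image matches the source size.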
Discriminators D1 and D2 adopt the same network structure, shown in Figure 6. Each discriminator contains a four-layer CNN followed by a linear, fully connected layer. The four convolution layers use a 3 × 3 filter with a stride of 2, no padding, and a LeakyReLU activation layer. The final fully connected layer is used for classification.
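The discriminator's four stride-2 valid convolutions shrink the feature map as follows (a sketch assuming a hypothetical 256 × 256 input, which is not stated above):

```python
def valid_conv_out(size, k=3, s=2):
    # output size of an unpadded ("valid") convolution: floor((size - k) / s) + 1
    return (size - k) // s + 1

def disc_feature_size(size, layers=4):
    # spatial size after the four 3 x 3, stride-2 convolution layers
    for _ in range(layers):
        size = valid_conv_out(size)
    return size
```

For a 256-pixel side the map shrinks 256 → 127 → 63 → 31 → 15 before being flattened into the fully connected classification layer.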

Loss Function.
The loss function of the proposed method consists of four elements: loss L_G1 of generator G1, loss L_G2 of generator G2, loss L_D1 of discriminator D1, and loss L_D2 of discriminator D2.

The loss L_G1 of generator G1 is given by equation (2), where the first term represents the adversarial loss between the generator and discriminator D2 and the second term represents the loss of structural similarity between the input visible image and the output infrared image:

L_G1 = E[(D2(I_RG) − c)²] + λ1 (1 − SSIM(I_V, I_RG)).    (2)

The loss L_G2 of generator G2 contains the adversarial losses between generator G2 and discriminators D1 and D2 and the content loss of the fused image relative to the visible and infrared images:

L_G2 = E[(D1(I_F) − c)²] + E[(D2(I_F) − c)²] + λ2 L_content.    (3)

The loss L_D1 of discriminator D1 is defined as follows, where the first term represents the classification results for the visible images and the second term represents those for the fused images:

L_D1 = E[(D1(I_V) − b)²] + E[(D1(I_F) − a)²].    (4)

The loss L_D2 of discriminator D2 is given by equation (5), which includes an additional term representing the classification results of the generated infrared images:

L_D2 = E[(D2(I_R) − b)²] + E[(D2(I_RG) − a)²] + E[(D2(I_F) − a)²].    (5)

Here, a and b denote the labels of generated and real data, c denotes the value the generators want the discriminators to assign to their outputs, and λ1 and λ2 balance the loss terms.

The training parameters were set to an image batch size of 32 and a learning rate of 10⁻⁴, and the generator was trained once for every two discriminator training runs. The chosen optimizer was Adam. Training the model took 16.5 hours.

Experimental Evaluation
In the first part of this section, several common image fusion evaluation indexes are introduced. In the second part, two datasets are used to validate the effectiveness of the proposed method in comparison with three popular image fusion methods.

Common Image Fusion Evaluation Indexes.
The evaluation of fused images is performed by combining multiple indexes. Objective quantitative evaluation methods are mainly divided into two categories: nonreference and reference image evaluation methods. Nonreference image evaluation methods include the standard deviation (SD) [22] and information entropy (EN) [23]. Reference image evaluation methods include the correlation coefficient (CC) [24], peak signal-to-noise ratio (PSNR) [25], structural similarity index measure (SSIM) [25], visual information fidelity (VIF) [26], root mean square error (RMSE), and universal image quality index (UIQI) [27]. These indexes are defined as follows.
SD reflects the dispersion about the mean gray value and is mathematically defined as follows:

SD = sqrt( (1/(M × N)) Σ_{i=1}^{M} Σ_{j=1}^{N} (F(i, j) − μ)² ),

where μ is the average gray value of the fused image F of size M × N. A greater SD indicates higher contrast in the fused image and typically a better visual effect.

EN is a statistical feature that reflects the average amount of information in an image. EN is mathematically defined as follows:

EN = − Σ_{l=0}^{L−1} p_l log2 p_l,

where L represents the number of gray levels in the image and p_l represents the proportion of pixels with gray value l among all pixels. A larger EN means that a greater amount of information exists in the fused image.

The CC measures the degree of linear correlation between the fused image and the infrared and visible images and is mathematically defined as follows:

CC(X, Y) = Cov(X, Y) / sqrt( Var(X) Var(Y) ),

where Cov(X, Y) represents the covariance of X and Y and Var(X) and Var(Y) represent the variances of X and Y, respectively. The larger the CC, the higher the correlation, and hence the similarity, between the fused image and the visible and infrared images.

The PSNR treats the difference between the fused image and a reference image as noise and is mathematically defined as follows:

PSNR = 10 log10( MAX² / MSE ),

where MAX represents the maximum possible pixel value and MSE is the mean squared error between the two images. The larger the PSNR, the more similar the two images. The common benchmark is 30 dB, and fused images with PSNR < 30 dB are clearly degraded.

The SSIM evaluates image distortion by comparing changes in image structure information, thereby providing an objective quality evaluation. The mathematical definition of SSIM is as follows:

SSIM(x, y) = l(x, y) · c(x, y) · s(x, y),
l(x, y) = (2 u_x u_y + c_1) / (u_x² + u_y² + c_1),
c(x, y) = (2 σ_x σ_y + c_2) / (σ_x² + σ_y² + c_2),
s(x, y) = (σ_xy + c_3) / (σ_x σ_y + c_3),

where x and y are the reference image and fused image, respectively; u_x, u_y, σ_x², σ_y², and σ_xy represent the means, variances, and covariance of images x and y; and c_1, c_2, and c_3 are small positive constants that avoid a zero denominator.

VIF is a reference image evaluation method based on natural scene statistics and on the amount of image information that the human visual system can extract. The mathematical definition of VIF is as follows:

VIF = Σ_z I(C; F | z) / Σ_z I(C; E | z),

where I(C; E | z) is the information content of the reference image and I(C; F | z) is the mutual information between the reference and fused images. The RMSE is defined as follows:

RMSE = sqrt( (1/(M × N)) Σ_{i=1}^{M} Σ_{j=1}^{N} (R(i, j) − F(i, j))² ),

where R and F are the reference and fused images of size M × N. The UIQI models distortion as a combination of correlation loss, luminance distortion, and contrast distortion:

UIQI = 4 σ_xy u_x u_y / ((σ_x² + σ_y²)(u_x² + u_y²)),

which is equivalent to the SSIM with c_1 = c_2 = 0.

Validation by the TNO Dataset.
Figure 7 shows the fusion results of DenseFuse [28], DeepFuse [29], FusionGAN [16], and the proposed method, respectively. Intuitively, all four methods fuse the texture information of the visible image and the thermal radiation information of the infrared image to some extent. However, the fusion results of our method align more closely with human visual perception, better preserve visible information, and retain more infrared information, making the image look richer and clearer with higher contrast. In addition, the target area is more prominent than in the other three methods.

Quantitative comparison: the qualitative illustrations in Figure 7 cannot objectively determine the quality of the results. Therefore, the fusion methods were further compared quantitatively. Eight indexes were evaluated on 56 pairs of images from the TNO dataset, of which six indexes require a reference (the fused image is compared against the corresponding visible and infrared images). The results are shown in Figure 8. The proposed method achieves the best performance for the majority of image pairs, and, for some individual image pairs, its comprehensive fusion index is much higher than that of the other methods. In addition, compared with the other three methods, the proposed method has the best averages over the evaluation indexes. Because the proposed method uses two discriminators, its results when referring to visible images are comparable to those of FusionGAN.
However, when referring to infrared images, the proposed method considerably outperforms the FusionGAN method. This shows that the proposed method retains more infrared thermal radiation information while retaining sufficient visible texture information. Thus, our training framework is both effective and essential.
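For reference, several of the evaluation indexes used above can be implemented directly. The following is a minimal sketch for single-channel images (8-bit grayscale for SD, EN, and PSNR; the SSIM here is the common single-window, global-statistics form with the structure term folded into the contrast term via c3 = c2/2, and its default constants assume images scaled to [0, 1]):

```python
import numpy as np

def sd(img):
    # standard deviation about the mean gray value
    img = np.asarray(img, dtype=float)
    return float(np.sqrt(np.mean((img - img.mean()) ** 2)))

def en(img, levels=256):
    # Shannon entropy of the gray-level histogram
    hist = np.bincount(np.asarray(img, dtype=int).ravel(), minlength=levels)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def cc(x, y):
    # Pearson correlation coefficient between two images
    return float(np.corrcoef(np.ravel(x).astype(float),
                             np.ravel(y).astype(float))[0, 1])

def psnr(ref, fused, max_val=255.0):
    # peak signal-to-noise ratio in decibels
    mse = np.mean((np.asarray(ref, dtype=float) - np.asarray(fused, dtype=float)) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    # single-window SSIM from global statistics
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    ux, uy, vx, vy = x.mean(), y.mean(), x.var(), y.var()
    cov = np.mean((x - ux) * (y - uy))
    return float(((2 * ux * uy + c1) * (2 * cov + c2))
                 / ((ux ** 2 + uy ** 2 + c1) * (vx + vy + c2)))

def rmse(ref, fused):
    # root mean square error between reference and fused images
    diff = np.asarray(ref, dtype=float) - np.asarray(fused, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))
```

In published comparisons, SSIM is usually computed over sliding windows and averaged; the global form above is a simplification for illustration.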

Validation by the VEDAI Dataset.
The Vehicle Detection in Aerial Imagery (VEDAI) dataset [30] contains publicly available aerial orthonormalized images from Utah's State Geographic Information Database (SGID), provided by the Automated Geographic Reference Center (AGRC). These images contain a wide variety of vehicles, backgrounds, and distracting objects. Each image has three visible channels and one near-infrared channel. In this section, DenseFuse, DeepFuse, FusionGAN, and the proposed method are further tested on the VEDAI dataset. Figure 9 shows the proposed method's generation of infrared images from visible images: Figure 9(a) shows the visible images, Figure 9(b) shows the actual infrared images corresponding to the visible images in the dataset, and Figure 9(c) shows the infrared images generated directly from the visible images by the proposed method. Figure 9 illustrates that the infrared images generated by the proposed method accurately reflect actual thermal radiation information while remaining consistent with the real infrared images.
A total of 40 images from the VEDAI dataset were selected for a quantitative comparison. Figure 10 shows a quantitative analysis of the fusion results of the four methods using the 8 evaluation indexes. The proposed method achieves the best SSIM, CC, PSNR, UIQI, and RMSE results on the majority of images. Compared with the other three methods, the proposed method also attains the highest averages on the remaining indexes. These experiments show that the proposed method generalizes well to other datasets.

Conclusion
In this paper, we propose a new fusion method that generates a matched infrared image from a visible image and generates a fused image that retains more visible texture details and infrared heat radiation information than other methods. Experimental evaluations on two public datasets show that the proposed method generates infrared images with thermal radiation information relatively consistent with real infrared images and generates fused images with clearly prominent texture information and rich thermal radiation information. A quantitative analysis of eight evaluation indexes for fused images shows that the proposed method produces better visual effects while retaining more information than other methods.

Data Availability
All data included in this study are available from the corresponding author upon request.

Conflicts of Interest
The authors declare that they have no conflicts of interest.