Entropy-Based Image Fusion with Joint Sparse Representation and Rolling Guidance Filter

Image fusion is a practical technology with applications in many fields, such as medicine, remote sensing and surveillance. This paper introduces an image fusion method that combines multi-scale decomposition and joint sparse representation. First, joint sparse representation is applied to decompose two source images into a common image and two innovation images. Second, two initial weight maps are generated by filtering the two source images separately. Final weight maps are obtained by applying joint bilateral filtering to the initial weight maps. Then, the multi-scale decomposition of the innovation images is performed through the rolling guidance filter. Finally, the final weight maps are used to generate the fused innovation image. The fused innovation image and the common image are combined to generate the ultimate fused image. The experimental results show that our method's average metrics are as follows: mutual information (MI), 5.3377; feature mutual information (FMI), 0.5600; normalized weighted edge preservation value (Q AB/F ), 0.6978; and nonlinear correlation information entropy (NCIE), 0.8226. Our method achieves better performance than the state-of-the-art methods in both visual perception and objective quantification.


Introduction
Image fusion combines multiple source images of the same scene to make the fused image more suitable for human visual perception or computer processing [1]. The source images are obtained from different sensors or under different imaging conditions. Each source image contains redundant and complementary information about the scene. The purpose of image fusion is to integrate the redundant and complementary information so that the fused image contains more relevant information specific to an application or task. Image fusion technology has good application prospects and has been applied in many fields. In the medical field, fused images can be reconstructed from multiple medical images acquired by different sensors. The fused images provide complementary information for medical analysis, enabling doctors to diagnose more quickly and accurately [2]. Moreover, surveillance is a typical application of image fusion [3]. Using infrared-visible image fusion, a surveillance system can work effectively all day.
Image fusion methods based on the transform domain can be divided into two types: multi-scale decomposition and sparse representation [4]. Multi-scale decomposition (MSD) is a very classical approach to performing image fusion. The basic process of MSD-based methods contains three steps. First, source images are converted into a specific transform domain, which typically decomposes the source images into multi-scale representations. Then, different fusion strategies are adopted for each scale level to acquire the multi-scale representation of the fused image in the transform domain. Finally, the corresponding inverse transformation is utilized to reconstruct the fused image. Traditional MSD methods can be divided into two categories. One includes the pyramid-based methods such as Laplacian pyramid (LP) [5] and gradient pyramid (GP) [6]. The other contains the wavelet transform-based methods, such as discrete wavelet transform (DWT) [7] and dual-tree complex wavelet transform (DTCWT) [8]. In addition, there are some new MSD methods, such as non-subsampled contourlet transform (NSCT) [9], shift-invariant shearlet transform (SIST) [10], non-subsampled shearlet transform (NSST) [11] and complex shearlet transform (CST) [12]. In recent years, edge-preserving filter-based MSD has been a hot research direction. Edge-preserving smoothing filters, such as the guided filter (GF) [13], joint bilateral filter (JBF) [14] and rolling guidance filter (RGF) [15], can avoid ringing artifacts, since they do not blur strong edges in the decomposition process. Li et al. put forward a method of image fusion using GF [16]. Chen et al. proposed an infrared-visible image fusion method combining RGF and multi-directional decomposition [17]. Jian et al. combined RGF and JBF for image fusion [18].
Although these methods achieve quite good performance on many types of images, they still have the following disadvantages: (1) the redundant information between source images leads to low information entropy of fused images; (2) contrast of fused images is likely to decrease.
Sparse representation (SR) is another classic image processing method. SR conforms to the physiological characteristics of human vision [19]. Besides, it is robust to additive noise [20]. SR has been successfully applied in many image processing fields, such as image classification [21], image denoising [22], image communication [23] and image super-resolution [24]. Yang and Li [25] were the first to utilize SR to deal with the image fusion problem. The general flow of SR-based methods contains three steps. First, an appropriate over-complete sparse dictionary is constructed through a mathematical model or example learning. Second, the sparse representations of the source images on the given sparse dictionary are obtained. The sparse representations are usually sets of sparse coefficient vectors. Finally, the corresponding vectors are fused to get the coefficient vector set of the fused image. The fused image is reconstructed with the same dictionary. Recently, a new SR-based method called joint sparse representation (JSR) was applied to image fusion successfully. The idea of JSR is to use sparse representation to divide two source images into a common part and two innovation parts. The common part contains the redundant information shared by the two source images, while the two innovation parts represent the complementary information of each source image. After that, different fusion rules are used to fuse the two kinds of parts separately. JSR preserves the advantages of the SR method while eliminating the correlation between source images so that the fused image is less affected by the redundant information. Yu et al. proposed a JSR-based approach to carry out image denoising and fusion simultaneously [26]. Ma et al. combined JSR and optimum theory to address the multi-focus image fusion problem [27].
However, all SR-based methods, including JSR, generally have the following disadvantages: (1) an over-complete dictionary may result in visual artifacts in the reconstructed image; (2) a simple fusion strategy for sparse coefficient vectors leads to spatial inconsistency.
In order to make full use of the advantages and overcome the shortcomings of the above methods, we propose a new image fusion method combining JSR and MSD. Specifically, first, JSR is used to decompose two source images into a common image and two innovation images. Then, the innovation images are sent to an RGF-based MSD fusion framework to obtain the fused innovation image. Finally, the fused innovation image and the common image are combined to obtain the fused image.
The main contributions of our work are as follows:
1. To improve the low information entropy caused by the redundant information between source images, edge-preserving MSD through RGF is performed only on the innovation images.
2. To suppress the artifacts that may be brought by JSR, weight maps are used to balance the contribution of innovation images.
3. To make fused images have high contrast, innovation images are used to guide the optimization of the weight maps.
4. To ensure the spatial consistency of fused images, the fusion of innovation images is performed according to optimized weight maps.

Joint Sparse Representation
JSR is a novel method with which to perform image fusion. As with SR, JSR has two key issues: (1) sparse dictionary construction; (2) joint sparse coding to obtain coefficients. For JSR, constructing an over-complete sparse dictionary is exactly the same as in SR. In this study, we used K-SVD [28] to train the dictionary.
The biggest difference between JSR and SR is the method of sparse coding. In image processing, the objects for sparse coding are overlapping small image patches of source images. These patches are extracted by the sliding-window technique. The small patches of all the source images at the same position constitute a patch set. SR encodes each patch separately, while JSR simultaneously encodes all patches in the same set. JSR supposes that every patch consists of a common component and an innovation component. The common component is shared by all patches in the same set, while each patch has its own innovation component. There are two strategies for selecting the dictionaries for JSR. One strategy is to use one fixed dictionary for all the components for sparse encoding and reconstruction [26]. This strategy has low training cost, is easy to operate, and is suitable for the sparse representation of multiple types of images. The other strategy is to use a fixed dictionary for common components and an adaptive dictionary for innovation components for sparse coding and reconstruction [27]. This strategy may yield better results, but comes with additional computational costs. Our method uses the first strategy, which is to use a fixed dictionary for all components. Since there are always two source images for fusion, each patch set contains two patches. Consider a flattened patch $v_i \in \mathbb{R}^n$ ($1 \le i \le 2$) extracted from a source image and an over-complete dictionary $D \in \mathbb{R}^{n \times m}$ ($n < m$), where $n$ represents the length of the vector and $m$ represents the number of atoms in the dictionary. The goal of JSR is to estimate a common sparse vector $x^C \in \mathbb{R}^m$ and innovation sparse vectors $x^I_i \in \mathbb{R}^m$ with only a few nonzero entries, such that
$$v_i = D x^C + D x^I_i, \quad i = 1, 2, \tag{1}$$
where $D x^C$ and $D x^I_i$ represent the common component and the innovation component of $v_i$, respectively.
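When performing sparse coding, JSR first needs to concatenate the source image patches and the dictionary separately. The encoded sparse coefficient vectors are also concatenated. Let
$$V = \begin{bmatrix} v_1 \\ v_2 \end{bmatrix}, \quad \tilde{D} = \begin{bmatrix} D & D & 0 \\ D & 0 & D \end{bmatrix}, \quad X = \begin{bmatrix} x^C \\ x^I_1 \\ x^I_2 \end{bmatrix},$$
where $V$ denotes the concatenated source patch, $\tilde{D}$ denotes the concatenated dictionary, $X$ denotes the concatenated sparse coefficient vector and $0$ is the $n \times m$ zero matrix. The joint version of Equation (1) can be defined as follows:
$$V = \tilde{D} X.$$
The goal of JSR is to find an approximate optimal solution of $X$. This problem can be formulated as
$$\arg\min_{X} \|X\|_0 \quad \text{s.t.} \quad \|V - \tilde{D} X\|_2^2 \le \varepsilon,$$
where $\varepsilon$ is the allowed sparse reconstruction error.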
Our proposed method uses JSR as an image decomposition method to reduce correlations between source images and increase the information entropy of the fused image. First, the two source images are divided into small overlapping image patches through the sliding-window technique. The dimensions of the patches depend on the specific dictionary. Assuming the dictionary D ∈ R n×m , the size of each patch should be √n × √n. Every patch is rearranged into a column vector. Second, the two patches located at the same position compose a patch set. JSR is performed independently on each patch set. For every patch set, a common sparse vector belonging to the entire set and two innovation sparse vectors, one for each patch in the set, can be obtained. Then, a common patch and two innovation patches can be reconstructed from the corresponding sparse vectors for each set. Finally, all the common patches and the corresponding innovation patches are put back in the same order in which they were extracted during the sliding-window step, with overlapping pixels averaged. A common image and two innovation images are generated. The correlation between the two innovation images is lower than that between the two source images. We define the process of obtaining a common image and innovation images from source images as JSR decomposition.
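The joint coding of one patch set can be sketched as follows. This is a minimal illustration rather than the implementation used in the experiments: the function names `omp` and `jsr_decompose_patches` are hypothetical, a plain orthogonal matching pursuit is assumed as the sparse solver (the text does not name one), and the dictionary in the usage note is random instead of being trained with K-SVD. Patch extraction and the averaging of overlapping patches are omitted.

```python
import numpy as np

def omp(D, v, n_nonzero, tol=1e-8):
    """Orthogonal matching pursuit: greedy sparse coding of v over the columns of D."""
    v = np.asarray(v, dtype=float)
    norms = np.linalg.norm(D, axis=0)
    x = np.zeros(D.shape[1])
    residual = v.copy()
    support = []
    for _ in range(n_nonzero):
        if np.linalg.norm(residual) <= tol:
            break
        # Pick the atom most correlated with the current residual.
        corr = np.abs(D.T @ residual) / norms
        if support:
            corr[support] = 0.0  # never pick the same atom twice
        support.append(int(np.argmax(corr)))
        # Re-fit the coefficients of all selected atoms by least squares.
        coef, *_ = np.linalg.lstsq(D[:, support], v, rcond=None)
        x[:] = 0.0
        x[support] = coef
        residual = v - D @ x
    return x

def jsr_decompose_patches(D, v1, v2, n_nonzero=8):
    """Split two flattened patches into one common and two innovation patches."""
    n, m = D.shape
    zeros = np.zeros((n, m))
    # Concatenated dictionary [[D, D, 0], [D, 0, D]] and concatenated patch [v1; v2].
    D_joint = np.block([[D, D, zeros], [D, zeros, D]])
    V = np.concatenate([v1, v2])
    X = omp(D_joint, V, n_nonzero)
    x_c, x_1, x_2 = X[:m], X[m:2 * m], X[2 * m:]
    return D @ x_c, D @ x_1, D @ x_2
```

For example, with `D` of shape `(64, 512)` and two flattened 8 × 8 patches `v1` and `v2`, `jsr_decompose_patches(D, v1, v2)` returns three vectors of length 64; the sum of the common patch and the i-th innovation patch approximates the i-th source patch.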

Rolling Guidance Filter
The purpose of multi-scale decomposition is to obtain images of different blur levels to make full use of the information contained in the source images. Recently, edge-preserving filter-based MSD methods have become the mainstream of research. These methods are able to preserve high-contrast edges and obvious structures while blurring the image. The state-of-the-art edge-preserving MSD method was proposed by Jian et al. [18]. It uses a rolling guidance filter (RGF).
An RGF can be seen as an extension of the joint bilateral filter (JBF). JBF was first proposed for image denoising by Petschnigg et al. [14]. JBF accepts an input image and a guidance image as input. The content of the output image is similar to the input image, while its structures and edges are similar to those of the guidance image. First, the Gaussian kernel $g_\delta$ is given as:
$$g_\delta(p, q) = \exp\left(-\frac{\|p - q\|^2}{2\delta^2}\right), \tag{4}$$
where $p$ and $q$ index pixel coordinates in the image, and $\delta$ is the standard deviation. With Equation (4), given a guidance image $G$ and an input image $I_{in}$, the definition of JBF is as follows:
$$I_{out}(p) = \frac{\sum_{q \in N(p)} g_{\delta_s}(p, q)\, g_{\delta_r}(G(p), G(q))\, I_{in}(q)}{\sum_{q \in N(p)} g_{\delta_s}(p, q)\, g_{\delta_r}(G(p), G(q))}, \tag{5}$$
where $I_{out}$ is the output image, the denominator is the normalization factor, $N(p)$ is the set of neighboring pixels of $p$, $\delta_s$ is the spatial standard deviation and $\delta_r$ is the range standard deviation. The function $g_{\delta_s}$ sets the weight in the spatial domain based on the distance between the pixels, while the function $g_{\delta_r}$ sets the weight on the range based on intensity differences. $\delta_s$ and $\delta_r$ control the spatial and range weights, respectively. For the sake of simplicity, we abbreviate Equation (5) as follows:
$$I_{out} = JBF(I_{in}, G, \delta_s, \delta_r), \tag{6}$$
where $JBF(\cdot)$ denotes the JBF process.
RGF was proposed by Zhang et al. [15]. The biggest feature of RGF is the ability to remove small structures while preserving the main content of the image. RGF is composed of two main steps: small structure removal and edge recovery. As shown in Figure 1, RGF is an iterative process. Suppose $J^t$ is the result of the $t$-th iteration and $M$ is the total number of iterations; the iterative process of RGF is as follows:
$$J^t = JBF(I_{in}, J^{t-1}, \sigma_s, \sigma_r), \quad t = 1, 2, \cdots, M, \tag{7}$$
where $\sigma_s$ and $\sigma_r$ denote the standard deviations of RGF in order to distinguish them from $\delta_s$ and $\delta_r$ of JBF. According to Equation (7), the iteration process of RGF is given as Algorithm 1.

Algorithm 1: The iteration process of RGF.
Input: Input image $I_{in}$; spatial standard deviation $\sigma_s$; range standard deviation $\sigma_r$; iteration number $M$.
1: Set $J^0$ as a constant image, i.e., $\forall p$, $J^0(p) = C$, where $C$ is a constant value.
2: for t = 1 : 1 : M do
3:     $J^t = JBF(I_{in}, J^{t-1}, \sigma_s, \sigma_r)$
4: end for
Output: $I_{out} = J^M$.

Since $J^0$ is set as a constant image, the first iteration is equivalent to blurring $I_{in}$ with a Gaussian filter with the standard deviation of $\sigma_s$ to remove small structures. The remaining iterations are equivalent to continuous filtering of $I_{in}$ by the joint bilateral filter. Specifically, in each iteration, JBF takes $I_{in}$ as input, $J^{t-1}$ as guidance, and $\sigma_s$ and $\sigma_r$ as standard deviations. Edges are recovered gradually during this process. After all the iterations are completed, $J^M$ is the output $I_{out}$.
For simplicity, we denote the RGF filtering operation as follows:
$$I_{out} = RGF(I_{in}, \sigma_s, \sigma_r, M), \tag{8}$$
where $RGF(\cdot)$ denotes the RGF function.
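The two filters can be sketched in a few lines of NumPy. This is a minimal, unoptimized illustration under stated assumptions: the function names and the `radius` parameter are hypothetical, the Gaussian kernels use the common exp(−d²/(2σ²)) form, the neighborhood is a square window of radius ⌈2σ_s⌉ and borders are handled by edge replication. None of these choices is confirmed by the text.

```python
import numpy as np

def joint_bilateral_filter(img, guide, sigma_s, sigma_r, radius=None):
    """Filter img with spatial weights and range weights taken from guide."""
    img = np.asarray(img, dtype=float)
    guide = np.asarray(guide, dtype=float)
    if radius is None:
        radius = max(1, int(np.ceil(2 * sigma_s)))  # assumed window size
    h, w = img.shape
    pad_img = np.pad(img, radius, mode="edge")
    pad_guide = np.pad(guide, radius, mode="edge")
    num = np.zeros_like(img)
    den = np.zeros_like(img)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y0, x0 = radius + dy, radius + dx
            shifted_img = pad_img[y0:y0 + h, x0:x0 + w]
            shifted_guide = pad_guide[y0:y0 + h, x0:x0 + w]
            # Spatial weight depends on the pixel offset only.
            w_s = np.exp(-(dx * dx + dy * dy) / (2.0 * sigma_s ** 2))
            # Range weight depends on intensity differences in the guidance image.
            w_r = np.exp(-((guide - shifted_guide) ** 2) / (2.0 * sigma_r ** 2))
            weight = w_s * w_r
            num += weight * shifted_img
            den += weight
    return num / den  # den >= 1 because the centre pixel has weight 1

def rolling_guidance_filter(img, sigma_s, sigma_r, iterations=4):
    """Start from a constant guidance image, then refine it with repeated JBF."""
    img = np.asarray(img, dtype=float)
    guide = np.zeros_like(img)  # constant J^0, so iteration 1 is a Gaussian blur
    for _ in range(iterations):
        guide = joint_bilateral_filter(img, guide, sigma_s, sigma_r)
    return guide
```

Because every output pixel is a normalized weighted average of input pixels, the output stays within the intensity range of the input image.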

JSR Decomposition
We first use JSR decomposition to decompose two source images S A and S B into a common image C and two innovation images I A and I B . An example of JSR decomposition is given in Figure 3.

Weight Map Construction
The weight maps are references for fusing detail layers. In order to make the information accurate and complete, weight maps are generated from the source images. First, $S_A$ and $S_B$ are processed with the Kirsch operator to obtain the saliency maps $R_A$ and $R_B$. Next, the initial weight maps $P_A$ and $P_B$ are obtained through a pixel-by-pixel comparison of the saliency maps $R_A$ and $R_B$, defined as follows:
$$P_A(q) = \begin{cases} 1, & R_A(q) > R_B(q) \\ 0, & \text{otherwise} \end{cases}, \qquad P_B(q) = 1 - P_A(q), \tag{9}$$
where $q$ denotes pixel coordinates. The order in which the two source images are considered does not affect the initial weight maps. At each pixel position $q$, the initial weight of the image with the larger saliency value is set to 1 and the other is set to 0. If $R_A(q) = R_B(q)$, then either of the two initial weights can be set to 1 and the other to 0, which does not affect the result; in practice, it is almost impossible for two saliency values to be exactly equal. Finally, JBF is used to filter $P_A$ and $P_B$, with the initial weight maps as input images and the innovation images as guidance images, and the final weight maps are obtained. This step is as follows:
$$W^i_A = JBF(P_A, I_A, \delta^i_s, \delta^i_r), \quad W^i_B = JBF(P_B, I_B, \delta^i_s, \delta^i_r), \quad i = 1, 2, \cdots, K - 1. \tag{10}$$
By adjusting $\delta^i_s$ and $\delta^i_r$, weight maps for detail layers are optimized. The innovation images $I_A$ and $I_B$ are used as guidance images to enhance the difference between weight maps. This results in higher contrast in the fused image. Besides, this step assigns the same weight values to adjacent pixels with similar brightness, so that the problems caused by spatial inconsistency can be avoided [16].

Multi-Scale Decomposition
RGF is the key to performing this stage. When M and σ r are set to constants, changing σ s yields different blur levels of rolling guidance filtering. As σ s increases, the output of RGF becomes more blurred, which means that low-frequency components dominate it more strongly. Therefore, the output obtained with the largest σ s should be regarded as the base layer. The outputs obtained with the successively smaller values of σ s are differenced sequentially, and the differences are regarded as the detail layers. Now we take I A as an example to give the concrete multi-scale decomposition method. First, I A is normalized to the range [0, 1] and blurred into K − 1 levels through RGF; this process is described by Equation (11).
Finally, the base layer B A and K − 1 detail layers H i A (i = 1, 2, · · · , K − 1) can be obtained by Equation (12). The same operation is performed on I B to get B B and H i B (i = 1, 2, · · · , K − 1).
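The decomposition described above can be sketched as follows. The function name `multiscale_decompose` and the `smooth` callback are assumptions made for illustration; `smooth(img, sigma_s)` stands in for RGF with fixed M and σ r, and the `box_blur` helper is only a simple placeholder smoother so that the example is self-contained. The first detail layer is assumed to be the difference between the normalized image and its least blurred version.

```python
import numpy as np

def box_blur(img, sigma_s):
    """Placeholder smoother (mean filter); a real pipeline would call RGF here."""
    r = int(round(sigma_s))
    h, w = img.shape
    padded = np.pad(img, r, mode="edge")
    acc = np.zeros_like(img)
    for dy in range(2 * r + 1):
        for dx in range(2 * r + 1):
            acc += padded[dy:dy + h, dx:dx + w]
    return acc / (2 * r + 1) ** 2

def multiscale_decompose(img, sigmas, smooth):
    """Return (base, details) with len(sigmas) detail layers.

    sigmas must be increasing, so the last blurred level is the coarsest one.
    """
    img = np.asarray(img, dtype=float)
    lo, hi = img.min(), img.max()
    img = (img - lo) / (hi - lo) if hi > lo else np.zeros_like(img)
    # Level 0 is the normalized image; levels 1..K-1 are increasingly blurred.
    levels = [img] + [smooth(img, s) for s in sigmas]
    details = [levels[i] - levels[i + 1] for i in range(len(sigmas))]
    base = levels[-1]
    return base, details
```

Because the differences telescope, adding the base layer and all detail layers gives back the normalized image exactly, whichever smoother is used.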

Fused Image Reconstruction
There are four steps to reconstruct the final fused image: (1) reconstruct the fused base layer $F_B$; (2) reconstruct the fused detail layers $F^i_H$; (3) reconstruct the fused innovation image $F_I$ by combining $F_B$ and $F^i_H$; (4) reconstruct the final fused image $F$ by combining $F_I$ and the common image $C$. First, the fused base layer $F_B$ is reconstructed by an entropy-based average. The base layer contains the low-frequency information of the image, which is equivalent to the average value of the image. Fusing base layers by the traditional simple averaging method may cause the fused image to have low contrast and information entropy. Since the global variance of a base layer represents its overall contrast and amount of information, we regard the global variance of a base layer as its entropy. Then, an entropy-based average method is used to fuse the base layers. Specifically, the entropies $E_A$ and $E_B$ of the two base layers $B_A$ and $B_B$ are calculated by:
$$E_A = \mathrm{var}(B_A), \quad E_B = \mathrm{var}(B_B),$$
where $\mathrm{var}(\cdot)$ denotes the global variance function. The fused base layer $F_B$ is then reconstructed by a weighted average using $E_A$ and $E_B$ as weights:
$$F_B = \frac{E_A B_A + E_B B_B}{E_A + E_B}.$$
Second, the fused detail layers $F^i_H$ are reconstructed. The detail layers are simply multiplied by the corresponding weight maps and summed up to achieve the fusion. This process is described as follows:
$$F^i_H = W^i_A H^i_A + W^i_B H^i_B, \quad i = 1, 2, \cdots, K - 1,$$
where $W^i_A$ and $W^i_B$ are the final weight maps and the products are taken pixel by pixel. Then, with the fused detail layers $F^i_H$ and base layer $F_B$, the fused innovation image $F_I$ can be obtained as follows:
$$F_I = F_B + \sum_{i=1}^{K-1} F^i_H.$$
Finally, the fused innovation image $F_I$ and the common image $C$ are merged together to obtain the ultimate fused image $F$:
$$F = F_I + C.$$
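The four reconstruction steps can be sketched as follows. The function name `fuse_layers` and its argument layout are hypothetical. The base-layer weights are assumed to be the two variances normalized by their sum, and the fallback to a plain average when both variances are zero is an added safeguard rather than part of the described method.

```python
import numpy as np

def fuse_layers(base_a, base_b, details_a, details_b, weights_a, weights_b, common):
    """Fuse base and detail layers, then add the common image."""
    # Step 1: variance-weighted ("entropy-based") average of the base layers.
    e_a, e_b = np.var(base_a), np.var(base_b)
    if e_a + e_b > 0:
        fused_base = (e_a * base_a + e_b * base_b) / (e_a + e_b)
    else:
        fused_base = 0.5 * (base_a + base_b)  # safeguard for two flat layers
    # Step 2: pixel-wise weighted sum of the detail layers on each level.
    fused_details = [
        w_a * h_a + w_b * h_b
        for h_a, h_b, w_a, w_b in zip(details_a, details_b, weights_a, weights_b)
    ]
    # Step 3: fused innovation image = fused base + all fused detail layers.
    fused_innovation = fused_base + sum(fused_details)
    # Step 4: add the common image back without modification.
    return fused_innovation + common
```

With this weighting, a base layer with zero variance contributes nothing to the fused base layer, which illustrates why the rule favors the layer with more contrast.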

Workflow of Our Proposed Method
The workflow of our proposed method can be summarized as follows. Consider two source images S A and S B , dictionary D, the decomposition level K and the parameters of JBF and RGF. First, JSR is used to decompose the two source images into one common image C and two innovation images I A and I B . Second, the Kirsch operator is used to extract saliency maps R A and R B from the source images, and the initial weight maps P A and P B are obtained by comparing the saliency maps pixel by pixel. Regarding the innovation images as guidance, JBF is applied to the initial weight maps to obtain the weight maps W i A and W i B (i = 1, 2, · · · , K − 1). Then, K-level multi-scale decomposition of the innovation images is performed by RGF to obtain the detail layers H i A and H i B (i = 1, 2, · · · , K − 1) and the base layers B A and B B . Finally, the detail layers are fused according to the corresponding weight maps. The base layers are fused by an entropy-based fusion rule. By summing the fused detail layers F i H (i = 1, 2, · · · , K − 1) and the fused base layer F B , the fused innovation image F I is obtained. The last step is to add F I to the common image C, and the fused image is obtained.
When doing addition and multiplication, the problem of dynamic range of images needs to be addressed. In our workflow, multiplication and addition occur during the fusion of the detail and base layers. For the fusion of detail layers, two detail layers on the same decomposition level are respectively multiplied with their corresponding weight maps, and the products are added to obtain the fused detail layer. The two corresponding weight maps are complementary, that is, the sum of two weights at the same pixel position is very close to 1. Therefore, the value range of the fused detail layers hardly changes. For the fusion of base layers, since the two weights are also complementary, the value range of the fused base layer does not change either. By adding all the fused detail layers and the fused base layer together, the fused image can be reconstructed. This final addition may cause a small number of pixels to be out of the reasonable range of [0, 255]. For these pixels, the values less than 0 are set to 0 and the values greater than 255 are set to 255. Finally, the value of each pixel is rounded into an integer.
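The final range handling amounts to a clip followed by rounding. A minimal sketch, assuming the fused image is held as a floating-point NumPy array on the [0, 255] scale (the helper name `to_uint8` is hypothetical, and `np.rint` rounds exact halves to the nearest even integer):

```python
import numpy as np

def to_uint8(fused):
    """Clamp values to [0, 255] and round them to integers."""
    return np.rint(np.clip(fused, 0, 255)).astype(np.uint8)
```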
The pseudo code of our proposed method is shown as Algorithm 2.

Algorithm 2 (excerpt):
if R A (q) ≤ R B (q) then
    P A (q) = 0, P B (q) = 1.

Experimental Settings and Objective Evaluations
We tested all the methods using four categories of images, with four sets of source images in each category. Specifically, there are infrared-visible images shown in Figure 4, medical images shown in Figure 5, multi-focus images shown in Figure 6 and remote sensing images shown in Figure 7. The images we used for our experiments are downloadable from the following website: https://sites.google.com/view/durgaprasadbavirisetti/datasets.
The default parameters in our method are set according to [18]. Specifically, for the parameters of JBF, we set δ i s = {1, 3 …
The objective evaluation has a certain reference value for judging the quality of image fusion. In our experiments, four metrics were used to measure the information entropy and visual quality of the results of different methods:

1. Mutual information (MI) [29] is based on Shannon entropy and relative entropy. It measures the correlation between the source image and the fused image to indicate how much information is retained.
2. Feature mutual information (FMI) [30] indicates the entropy of features in the fused image. It measures the amount of information in image features carried from the source images to the fused image. It is also a non-reference image fusion metric.
3. The normalized weighted edge preservation value (Q AB/F ) [31] measures the visual information quality of the fusion; the more edge information is preserved, the higher the value of this metric.
4. Nonlinear correlation information entropy (NCIE) [32] is based on nonlinear joint entropy. It measures the general correlation between the source images and the fused image.
Higher values of the above four metrics demonstrate a better fusion effect. The codes of the metrics are provided by Qu et al. [33].
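As an illustration of the first metric, the mutual information between one source image and the fused image can be estimated from their joint gray-level histogram. This is a generic sketch, not the evaluation code of [33]: the function name is hypothetical, 8-bit images are assumed, and base-2 logarithms are used. A fusion score is commonly formed by summing the values obtained for the two source images, which is stated here as an assumption about the metric rather than taken from the text.

```python
import numpy as np

def mutual_information(x, y, bins=256):
    """Mutual information (in bits) between two equally sized 8-bit images."""
    joint, _, _ = np.histogram2d(
        np.ravel(x), np.ravel(y), bins=bins, range=[[0, 256], [0, 256]]
    )
    p_xy = joint / joint.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)  # marginal of x, shape (bins, 1)
    p_y = p_xy.sum(axis=0, keepdims=True)  # marginal of y, shape (1, bins)
    nz = p_xy > 0  # skip empty cells to avoid log(0)
    return float(np.sum(p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])))
```

For a hypothetical pair of source images `src_a`, `src_b` and a fused image `fused`, the assumed fusion score would be `mutual_information(src_a, fused) + mutual_information(src_b, fused)`.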

Size of JSR Dictionary
This experiment tested the effect of different JSR dictionary sizes on the fusion performance. Suppose the dictionary D ∈ R n×m (n < m); then n affects the size of image patches while m affects the completeness of the dictionary. A larger n indicates a larger size of image patches, which leads to an enlarged window to perform joint sparse representation, while a larger m increases the completeness of the dictionary. It is important to find a suitable dictionary size. n and m values which are too small cause large reconstruction errors and loss of details, while overly large n and m easily cause artifacts and take more time.
In this experiment, the decomposition level K was set to 5 while other parameters were set to default values. All the dictionaries used in this experiment were trained with 100,000 natural image patches for 180 iterations by the K-SVD algorithm, as suggested in [34]. All 16 sets of source images were tested. The average values of the objective metrics and the average time cost were compared. First, m was fixed to 512 and n was set to {36, 64, 100}, respectively. Then, n was fixed to 64 and m was set to {128, 256, 512}, respectively. The results are shown in Tables 1 and 2. It can be seen that the size of 64 × 512 achieved the highest MI, Q AB/F and NCIE in both experiments, and its FMI values were both the second highest. At the same time, its time cost was not too high compared to others. Therefore, it is reasonable to choose the size of the JSR dictionary as 64 × 512.

Number of Decomposition Levels
This experiment tested the effect of different decomposition levels K on the fusion performance. For the innovation image of each source image, RGF decomposition generates K − 1 detail layers and one base layer. The base layer can be considered as the last detail layer with the highest degree of blur. At the same time, K − 1 weight maps are generated by JBF to guide the fusion of detail layers. To ensure that at least one detail layer exists, the minimum value of K should be 2.
In this experiment, K was set to {2, 3, 4, 5}, respectively. The other parameters were set to default. The 64 × 512 dictionary was used for JSR as mentioned before. All 16 sets of source images were tested. The average values of the objective metrics were compared. The results are shown in Figure 8.
As K increases, MI and NCIE also increase. The highest values of MI and NCIE were at K = 5, but both growth rates from K = 4 to K = 5 were low. Both of the highest growth rates were from K = 3 to K = 4. FMI reached its highest value at K = 2 and dropped afterwards. Q AB/F reached its highest value at K = 3. Except from K = 2 to K = 3, the growth rate of Q AB/F was approximately equal to zero. Since two metrics reached their maximum values at K = 5, and the growth rates of all four metrics tended to zero at K = 5, K = 5 is a reasonable default in terms of fusion quality. We also compared the calculation times required for different decomposition levels. The comparison results are shown in Table 3. As K increases, the time cost increases at a very slow rate. Compared with the time cost of JSR decomposition, the time cost of MSD using RGF is very small. The 5-level decomposition is only about 6% slower than the 2-level decomposition. It is worthwhile to trade this time cost for the performance improvement. Overall, it is reasonable to choose K = 5 as the default decomposition level.

Validity of Our Combination Strategy
In this experiment, our proposed method was compared to original JSR-based [26] and RGF-based [18] image fusion methods to validate the effectiveness of our combination strategy. Our method adopted the aforementioned default parameters, and the JSR-based and RGF-based methods adopted the default parameters from their papers. All four categories of source images were tested. The average values of the objective metrics of each category were compared separately. Some examples of the fused images are shown in Figure 9. The average values of the objective metrics are shown in Table 4.
According to Figure 9, RGF and our method perform similarly in terms of subjective effects, but the subjective effect of JSR is not as good as that of the other two. For the fused images of JSR, the details in some images are smoothed, and some images have local artifacts. The objective metrics in Table 4 show that for images other than medical ones, our method is the best. This means that the fused images of our method have higher information entropy and visual quality. In the results of medical images, our method achieves the best MI and Q AB/F , while JSR achieves the best FMI and NCIE. However, according to the medical image fusion example in Figure 9, our method generates clearer details than the JSR method.
Our method combines JSR and RGF for better image fusion performance, while JBF can be regarded as a special form of RGF. First, JSR decomposition is used to obtain two innovation images and a common image. The two innovation images contain the complementary information of the two source images, and therefore have more information entropy and less redundancy. The complementary information is what really needs to be fused. Since the common image contains the redundant information shared by the two source images, it should be directly included in the fused image without modification. Second, the combination of JBF and RGF for image fusion was proven to be very effective. RGF is used for multi-scale edge-preserving decomposition, which makes the most use of details and textures. JBF is used to obtain the corresponding weight maps, which take spatial consistency well into account and reduce local artifacts [18]. To further improve the quality of fusion, the innovation images generated by JSR were used as the input images of RGF and the guidance images of JBF. The MSD using RGF can extract complementary details of the innovation images at different levels. Meanwhile, using the innovation images as the guidance images of JBF can make the weight maps balance the contribution of the innovation images well. Then, different strategies are used to fuse the detail layers and the base layer to make the fused innovation image have high contrast and more details. Finally, adding the common image directly to the fused innovation image ensures that common information can be preserved. Overall, our method is superior to the single JSR-based and RGF-based methods. This experiment supports the validity of our proposed combination strategy.

Comparison with Other Methods
In order to demonstrate the superiority of the proposed method, we compare it with 13 other image fusion methods here. The comparative methods include the adaptive sparse representation (ASR) [35], the convolutional sparse representation (CSR) [36], curvelet transform (CVT) [37], dual-tree complex wavelet transform (DTCWT) [8], the gradient transfer fusion (GTF) [38], the hybrid multi-scale decomposition (H-MSD) [39], the convolutional neural network (CNN) [40], Laplacian pyramid (LP) [5], the general framework based on multi-scale transform and sparse representation (MSSR) [34], the multi-resolution singular value decomposition (MSVD) [41], nonsubsampled contourlet transform (NSCT) [9], the visual saliency map and weighted least square optimization (WLS) [42] and the fast filtering image fusion (FFIF) [43]. All these methods were given the default parameters from their related papers. For our proposed method, the dictionary size was 64 × 512, the decomposition level K was set to 5 and the other parameters were set to defaults.
According to Figure 10, our method produces clear structures of the windows and the barrel. GTF produces the sharpest silhouettes and structures. However, in the GTF result, the light under the windows in the visible image is not fused, which reduces the local contrast and makes the letters unclear. CNN performs similarly to our method. The results of other methods either have unclear structures or have low contrast.
According to Table 5, our method is the best on Q AB/F and the second best on the other three metrics. FFIF is the best on MI, FMI and NCIE, while it is the second best on Q AB/F . However, the visual quality of FFIF is not so good. Some information in the visible image is not fused at all. Some details from the visible image are missing in the fused image. Although the metrics show that the results of FFIF have high information entropy, this is in exchange for the decline in visual quality. In contrast, our method achieves a balance between visual quality and information entropy.

Analysis of Medical Results
According to Figure 11, our method produces both high contrast and clear structures. With the exception of CNN, MSSR, FFIF and our method, all other methods make the texture from the middle part of (b) unclear due to low contrast. However, the structures produced by CNN, MSSR and FFIF are not as clear as those produced by our method.
According to Table 6, our method is the best on Q AB/F and the second best on MI and NCIE. FFIF is the best on MI, FMI and NCIE, while it is the second best on Q AB/F . ASR is the second best on FMI. However, the edges and structure of the FFIF result are unclear. The structures of the two source images are mixed together, and it looks confusing. The result of ASR has low contrast and the middle part of the fused image is not clear. As pointed out by Q AB/F , our method has the best visual quality. At the same time, our method also has high information entropy.

Analysis of Multi-Focus Results
According to Figure 12, our method preserves the clear letters and edges well. The letters in the ASR, H-MSD, MSVD and WLS results are not very clear. The edges of the right clock in the MSSR and NSCT results are not well preserved. In the CSR result, there is an obvious artifact above the clock edge. The fusion results of GTF and FFIF are very poor: the clock on the right is very blurry, and the information in source image (a) is hardly fused. CVT, DTCWT, CNN and LP perform similarly to our method.
According to Table 7, our method is the best on Q AB/F and the second best on the other three metrics. FFIF is the best on MI, FMI and NCIE. CNN is the second best on Q AB/F. However, as mentioned earlier, the fusion quality of FFIF for some multi-focus images is very poor, so its high metric values are not convincing in this case. Our method achieves high metric values while also ensuring visual quality.

Analysis of Remote Sensing Results
According to Figure 13, our method preserves the textures on the ground and the structures of the buildings well. ASR and MSVD fail to preserve the straight-line structures in the right box. Apart from our method, none of the methods preserves the textures on the ground well.
According to Table 8, our method is the best on Q AB/F and the second best on the other three metrics. FFIF is the best on MI, FMI and NCIE, and the second best on Q AB/F. However, the results of FFIF suffer from spatial inconsistencies, and there are many discontinuous black patches on the ground. In contrast, our method has the best visual quality while also having high information entropy.

Summary of the Analysis
According to Figures 10-13 and the Q AB/F metric, our method is the best in terms of visual quality among the 14 methods compared. The MI, FMI and NCIE metrics of our method are slightly lower than those of FFIF. Although these metrics indicate that the fused images of FFIF have higher information entropy, FFIF does not make full use of the information in each source image, which degrades the visual quality of its results. In contrast, our method achieves a good balance between visual quality and information entropy.
Overall, this experiment demonstrates that our method is comparable to or better than state-of-the-art methods in both visual and objective evaluations.
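To make the information-based comparison concrete, the following is a minimal sketch of how an MI score for a fused image could be computed from joint gray-level histograms. It assumes the form commonly used in the fusion literature, in which the score is the sum of the mutual information between each source image and the fused image. The exact definition, binning and normalization behind the reported values are not stated in this section, so the function names `mutual_information` and `fusion_mi`, the 256-bin histogram and the 8-bit intensity range are illustrative assumptions.

```python
import numpy as np


def mutual_information(x, y, bins=256):
    """Mutual information (in bits) between two equally sized gray images.

    Assumes 8-bit intensities in [0, 255]; the bin count is an assumption.
    """
    joint, _, _ = np.histogram2d(
        np.ravel(x), np.ravel(y), bins=bins, range=[[0, 256], [0, 256]]
    )
    pxy = joint / joint.sum()              # joint distribution p(x, y)
    px = pxy.sum(axis=1, keepdims=True)    # marginal p(x), column vector
    py = pxy.sum(axis=0, keepdims=True)    # marginal p(y), row vector
    nz = pxy > 0                           # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))


def fusion_mi(src_a, src_b, fused, bins=256):
    """Assumed fusion MI: MI(A, F) + MI(B, F)."""
    return mutual_information(src_a, fused, bins) + mutual_information(
        src_b, fused, bins
    )
```

Under this definition, a fused image that simply copies one source image can score highly even though it ignores the other source, which is consistent with the observation above that high MI alone does not guarantee good visual quality.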

Conclusions
In this paper, a JSR- and RGF-based image fusion method is proposed. Our method uses JSR for image decomposition, which reduces the correlation between the source images and highlights their complementary information. This mitigates the low information entropy caused by redundant information between the source images. Multi-scale decomposition using RGF removes small structures while preserving obvious edges in the innovation images, so that complementary details of the innovation images can be extracted at different levels. Using weight maps to balance the contributions of the innovation images suppresses the artifacts that JSR may introduce. The innovation images are used to guide the optimization of the weight maps so that the fused image has high contrast. The fusion of the innovation images is performed according to the optimized weight maps to ensure the spatial consistency of the fused innovation image. Finally, adding the common image directly to the fused innovation image without further processing ensures that the common information contained in the two source images is well retained. Experimental results demonstrate that these goals are achieved and that our method can achieve better performance than state-of-the-art methods.
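As an illustration of the RGF-based multi-scale decomposition summarized above, the following is a minimal NumPy sketch and not the implementation used in the experiments. It assumes the standard rolling guidance scheme, in which a joint bilateral filter is applied repeatedly to the input with the previous output as the guide, starting from a constant guide so that the first pass is a Gaussian blur. The function names, the doubling of the spatial scale per level, the choice to filter the previous smoothed layer, and the intensity range [0, 1] are all assumptions made for this example.

```python
import numpy as np


def joint_bilateral(src, guide, sigma_s=2.0, sigma_r=0.1):
    """Brute-force joint bilateral filter of `src` steered by `guide`."""
    r = int(np.ceil(2 * sigma_s))
    h, w = src.shape
    pad_src = np.pad(src, r, mode="edge")
    pad_gd = np.pad(guide, r, mode="edge")
    num = np.zeros((h, w), dtype=np.float64)
    den = np.zeros((h, w), dtype=np.float64)
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            s = pad_src[r + dy:r + dy + h, r + dx:r + dx + w]
            g = pad_gd[r + dy:r + dy + h, r + dx:r + dx + w]
            # Spatial weight times range weight measured on the guide image.
            wgt = np.exp(-(dx * dx + dy * dy) / (2 * sigma_s ** 2)
                         - (g - guide) ** 2 / (2 * sigma_r ** 2))
            num += wgt * s
            den += wgt
    return num / den  # den >= 1 because the centre weight is always 1


def rolling_guidance(img, sigma_s=2.0, sigma_r=0.1, iters=4):
    """Remove small structures, then iteratively recover large-scale edges."""
    guide = np.zeros_like(img, dtype=np.float64)  # constant guide: Gaussian blur
    for _ in range(iters):
        guide = joint_bilateral(img, guide, sigma_s, sigma_r)
    return guide


def rgf_decompose(img, levels=3, sigma_s=1.0, sigma_r=0.1, iters=4):
    """Split `img` into `levels` detail layers plus one base layer.

    Assumed schedule: the spatial scale doubles at every level.
    """
    details, current = [], img.astype(np.float64)
    for k in range(levels):
        smooth = rolling_guidance(current, sigma_s * 2 ** k, sigma_r, iters)
        details.append(current - smooth)  # structures removed at this scale
        current = smooth
    return details, current
```

Because each detail layer is the difference between consecutive smoothed images, summing all detail layers and the base layer reproduces the input exactly, so any weighting applied to the layers of the two innovation images redistributes their content without losing it.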