MEEAFusion: Multi-Scale Edge Enhancement and Joint Attention Mechanism Based Infrared and Visible Image Fusion

Infrared and visible image fusion can integrate rich edge details and salient infrared targets, resulting in high-quality images suitable for advanced tasks. However, most available algorithms struggle to fully extract detailed features and overlook the interaction of complementary features across different modal images during the feature fusion process. To address this gap, this study presents a novel fusion method based on multi-scale edge enhancement and a joint attention mechanism (MEEAFusion). Initially, convolution kernels of varying scales were utilized to obtain shallow features with multiple receptive fields unique to the source image. Subsequently, a multi-scale gradient residual block (MGRB) was developed to capture the high-level semantic information and low-level edge texture information of the image, enhancing the representation of fine-grained features. Then, the complementary feature between infrared and visible images was defined, and a cross-transfer attention fusion block (CAFB) was devised with joint spatial attention and channel attention to refine the critical supplemental information. This allowed the network to obtain fused features that were rich in both common and complementary information, thus realizing feature interaction and pre-fusion. Lastly, the features were reconstructed to obtain the fused image. Extensive experiments on three benchmark datasets demonstrated that the MEEAFusion proposed in this research has considerable strengths in terms of rich texture details, significant infrared targets, and distinct edge contours, and it achieves superior fusion performance.


Introduction
Image fusion, as a subset of information fusion, involves analyzing and fusing image data from an identical scene acquired by multiple sensors to create more informative fused images. In general, infrared sensors like infrared cameras are suitable for harsh environments and occlusion circumstances such as darkness, rain, and fog. However, infrared images bear the shortcomings of low contrast and resolution, as well as lacking detailed texture. Additionally, they are susceptible to interference from background noise [1]. Visible light sensors like optical cameras record images more closely aligned with human vision, with higher resolution and more detailed texture. Notably, the imaging quality is inferior under low light and occlusion conditions [2]. Therefore, complementing the advantages between these sensors and fusing visible and infrared images enables the resulting images to include both texture details and infrared target information. Embedding infrared and visible image fusion (IVIF) algorithms on carriers equipped with infrared and visible light sensors can assist in realizing tasks such as personnel search and rescue, autonomous driving, remote sensing monitoring, and defense reconnaissance. Moreover, integrating IVIF with advanced vision tasks such as target detection, tracking, and semantic segmentation can facilitate the performance improvement of these tasks. Therefore, IVIF has become a research focus in recent years.
The process of IVIF mainly comprises two steps: information extraction and information fusion. The challenge lies in how to achieve the maximum extraction of features and information from the image and integrate them naturally, to acquire high-quality images with complementary information. The available fusion algorithms are divided into two types: traditional methods and deep learning (DL)-based methods. Traditional approaches primarily include multi-scale transform [3,4], sparse representation [5], and subspace [6]. These methods rely on pyramidal and wavelet transform techniques, overcomplete dictionaries, and principal and secondary component analysis to implement image decomposition and fusion. Traditional fusion methods provide strong interpretability but hinge on manually designing feature extraction and fusion rules. The computation is complex and time-consuming, and the fusion results frequently show texture blurring, poor contrast, and artifacts. Benefiting from the breakthrough of computational power resources, DL technology has rapidly evolved and been introduced into the field of image fusion. This facilitates the automatic training of model parameters, thereby minimizing the influence of human factors. As a result, the ability of the network to extract complementary image information is optimized, realizing simple yet efficient fusion. Since Liu et al. [7] introduced CNN into the multi-focus image fusion task, numerous DL-based fusion methods have been developed, including autoencoder (AE)-, CNN-, GAN-, and task-driven-based algorithms.
Some current IVIF algorithms suffer from inadequate extraction of source image detail features or loss of global features, resulting in distorted fusion images, unclear edge outlines, and artifacts (as demonstrated in Figure 1a,b). Some approaches prioritize retaining the textural details from visible images and the salient objects from infrared images and employ manually created masks to achieve clear infrared targets. However, such approaches are time-consuming and laborious and suffer from inadequate labeled data. The limited feature-fitting capacity of the fusion network causes the loss of infrared or visible information, leading to darker or brighter images (as illustrated in Figure 1c). Regarding the feature fusion strategy, some AE networks necessitate the manual design of rules when fusing image features, which brings in subjective factors and makes it challenging to determine the optimal fusion strategy, leaving the fusion quality poor (as illustrated in Figure 1d,e). Most IVIF methods overlook the significance of intermediate layer features and solely focus on fusing the final depth features. They ignore the complementarity of information between the source images, ultimately leading to feature degradation. Even when some end-to-end fusion algorithms accomplish feature interaction in the middle layer, they just perform simple convolution operations upon the features from the dual branches. Insufficient supplementary information mining leads to fusion results with unclear texture details and insignificant infrared targets (as displayed in Figure 1f,g).
In response to the aforementioned shortcomings, this work develops a multi-scale edge enhancement and joint attention mechanism-based IVIF method (MEEAFusion). MEEAFusion utilizes convolution kernels of varying scales to fully extract the shallow feature information from the original image. Subsequently, it devises a multi-scale gradient residual block (MGRB) and a cross-transfer attention fusion block (CAFB) to enhance the edge contours of deep feature representations, while realizing feature interactions and pre-fusion. Ultimately, the shallow and last deep features are merged to reconstruct the image without manually designing any fusion rules.
The contributions of this paper are summarized below.
(1) An end-to-end IVIF model is provided, primarily composed of a shallow multi-scale feature extraction module, a multi-scale gradient residual block (MGRB), and a cross-transfer attention fusion block (CAFB). Dense connections are also incorporated to alleviate the information loss caused by the feature flow.
(2) MGRB is innovatively constructed by integrating several scales of gradient convolution, which could fully extract the textural details and semantic information from the image and obtain the depth features with enhanced edge contour. The MGRB module enables further promotion of the detailed description capability of the network.
(3) CAFB is created by incorporating the spatial and channel attention mechanism to achieve deep feature interaction between two paths, obtaining pre-fused features enriched with complementary information and common features from the two source images. The CAFB module abandons the traditional mindset of extracting first and fusing later and realizes the parallel processing of feature extraction, interaction, and fusion.
(4) Experiments on three generalized IVIF datasets demonstrate that the fusion results produced by the proposed method have abundant texture details, prominent infrared targets, and sharp edge contours, and they outperform those of mainstream fusion approaches in terms of subjective and objective evaluation metrics. It offers a novel, structurally simple yet high-performance solution for complementary multimodal image fusion.

Figure 1. (a-g) show the fusion results of FusionGAN [8], IPLF [9], STDFusionNet [10], DenseFuse [11], RFN-Nest [12], PMGI [13], and FLFuse-Net [14], respectively. The red and green boxes outline the salient targets and detail regions.

DL-Based IVIF Methods
The AE-based method is the most classical approach among the DL-based IVIF, and the essential idea is consistent with that of traditional approaches. The encoder first obtains the feature from the original image via a feature extraction network, followed by feature information fusion according to pre-established fusion strategies. Eventually, the decoder yields a fused image by reconstructing the fused features [15,16]. DeepFuse [17] performs the same convolution operation on a set of overexposed and underexposed images to extract the feature maps, which are then summed and fused to rebuild the image. This strategy only focuses on the last layer of features and ignores the usage of information from the intermediate layer of features. The DenseFuse [11] network introduces densely connected blocks to extract multi-layer depth information without losing information from the intermediate layer, considerably boosting the fused image quality. NestFuse [18] utilizes Nest connections to decode the fused features at several scales, which could increase the multi-scale representation of the image features. RFN-Nest [12] replaces the fusion strategy of NestFuse with residual networks to fuse the features at different scales, resulting in the fused image containing more detailed information. An atrous spatial pyramid network with different expansion rates is utilized by EDAfuse [19] to extract depth features with different scales, and the fused image can contain more details and salient object characters. FPN, as an encoding network, can fully extract and fuse features from multiple scales and levels in two images, resulting in methods like FPNFuse [20] and PG-Fusion [21].
CNN-based fusion methods eliminate the need to construct fusion rules to accomplish end-to-end fusion. Liu et al. [7] first completed the multi-focus image fusion task with the help of CNN. PMGI [13] views image fusion as an issue of preserving intensity and gradient information. To this end, the method employs two identical paths to extract the gradient information and intensity information and introduces dense connections to reuse the intermediate layer features to prevent information loss. Additionally, a cross-path exchange module is developed to pre-fuse the features and strengthen the information interaction. FLFuse-Net [14] realizes fast fusion with the fully convolutional network and designs a significant information edge compensation branch between the infrared image and the decoder, which is used to retrieve the edge information of the significant infrared target to sharpen the fused image contour. U2Fusion [22] combines information measurement and adaptive preservation strategies, making it suitable for multiple image fusion tasks. DRSNFuse [23] adopts deep residual blocks to extract features and further separates the base and detail parts from the feature map. Finally, the shallow features extracted by the base, detail, and residual blocks are integrated and reconstructed to yield the fused image.
IVIF typically lacks standard ground-truth labels, and GAN is particularly suitable for such unsupervised fusion tasks. The advantage lies in its reliance on the confrontation between the discriminator and the generator to optimize the generative capacity for creating the desired fused image. FusionGAN [8] innovatively introduces GAN to the IVIF field. Without base labeling, the discriminator is trained with the visible image as the true value, and the generator is prompted to produce an image containing more visible image details. Nevertheless, an individual discriminator might lead to algorithm collapse, meaning that the fused results are biased toward infrared or visible images. Consequently, DDcGAN [24] proposed a dual discriminative conditional GAN that leverages two discriminators to ensure that the fused image retains both the thermal object information from the infrared image and the detailed texture from the visible image, thereby improving the robustness of the fusion network and keeping the information balanced between the two images. The generator of DUGAN [25] integrates information from image content and gradient, while the discriminator uses a U-shaped architecture to drive the fused image to incorporate richer detailed features and global information.
The union of image fusion and downstream tasks as a whole can guide and promote each other to achieve well-fused results. SeAFusion [26] combines IVIF with high-level semantic segmentation tasks and feeds the fusion network output into the segmentation network to evaluate the fusion quality with the segmentation outcomes. Furthermore, a joint training strategy was developed, and semantic loss and content loss were proposed to simultaneously guide and optimize fusion network and segmentation network training. Zhang et al. [27] proposed a real-time fusion approach employing an adaptive weighting strategy to trade off the speed and quality of IVIF on an embedded platform, which joins image fusion with downstream target detection to attain faster fusion speeds and higher detection accuracies. IRFS [28] developed a joint paradigm of image fusion and advanced vision tasks, prioritizing image fusion as the primary objective while incorporating multimodal salient target detection as a subsidiary task. This method aims to facilitate saliency-guided image fusion, and a cross-loop training strategy is designed to aid in training network parameters. RSDFusion [29] incorporates IVIF and semantic segmentation to achieve the real-time fusion effect, successfully preserving the detailed textures and remarkable objects in the source image.
More recently, researchers have also introduced transformer techniques [30-32], the Mamba model [33], and diffusion models [34,35] to the multimodal image fusion task and achieved favorable fusion results. A growing number of novel algorithms can be expected to emerge and develop in the IVIF field.

Attention Mechanism
The attention mechanism imitates the human visual focusing mechanism by assigning different weights to various targets, reflecting the level of relevance of the information. It has been broadly applied in computer vision and has gradually been introduced to the IVIF problem, yielding a series of advanced fusion algorithms.
Xu et al. [36] presented an IVIF algorithm based on the CBAM [37] module with different scale kernels to extract multi-scale feature maps from both spatial and channel dimensions. RDCa-Net [38] employs channel attention to focus on salient features across different feature layers and adopts a self-attention mechanism to capture contextual information. This approach adaptively obtains the weight parameters while calculating the loss function weights, allowing the fused result to retain more detailed features. Zhan et al. [39] embedded a global attention mechanism into the fusion algorithm based on a semantic segmentation task to capture the contextual dependencies over a long distance, thus adjusting the channel weights. AttentionFGAN [40] introduces the attention mechanism into GAN-based methods to allow the network to focus more on the texture details of visible images and prominent targets of infrared images. Cross-modal attention [41] is suitable for the IVIF task due to the ability to notice the information differences between different modalities. CrossFuse [42] devised a special cross-attention module to enhance the mutual information across multimodal images, attaining superior fusion performance.
Although the existing algorithms can focus more on crucial region information in the source image, they neglect the complementary features between different modalities, resulting in insufficient information exchange. Therefore, this paper designs the CAFB module to capture complementary information across multiple modalities adequately.

Method
This section presents a novel IVIF method, MEEAFusion, and provides a detailed description of the design and composition of each module and loss function.

Network Framework
As illustrated in Figure 2, the MEEAFusion algorithm in this study first focuses on the feature heterogeneity between source images. In the shallow feature extraction stage, the convolutional layers of infrared and visible image paths are trained independently to generate feature maps containing the unique characteristics of each source image. To avoid the fixed receptive fields and potential loss of feature information caused by single convolution, the method utilizes convolution kernels of different scales to implement the feature extraction process. This ensures that the generated feature map covers a wide range of receptive fields, thereby enhancing the diversity of image information. Then, in the deep feature extraction stage, a multi-scale gradient residual block (MGRB) is constructed, which consists of different scales of Sobel gradient operators and residual connections. The MGRB module is aimed at optimizing image semantic information while strengthening feature edges and details. To eliminate the information loss caused by manual intervention or fusing features only before reconstructing the image, MEEAFusion designs the cross-transfer attention fusion block (CAFB). This module facilitates feature interaction between source images while carrying out complementary feature cross-transfer between the two branches. Thus, each branch has access to the complementary information of the other branch in advance, i.e., pre-fusion. During feature extraction, the MGRB is densely connected to thoroughly convey different layers of feature information without adding any parameter burden, which prevents information loss of deep features as the network goes deeper. Under continuous feature extraction, interaction, and transmission, the feature information of visible and infrared images is sufficiently integrated and unified. A completely symmetric network structure with shared parameters is adopted for the deep feature extraction to train convolutional kernels adapted to the feature extraction of both visible and infrared images and to further reduce the number of network parameters.
Finally, the shallow and last deep feature results from the two paths are concatenated as inputs to the image reconstruction module to maximize the use of feature data and alleviate feature loss. Four consecutive 3 × 3 convolutions and one 1 × 1 convolution are utilized to reconstruct the features to obtain the ultimate fused image. Apart from the last convolution, whose activation function is Tanh, all the other convolution operations adopt the Leaky Rectified Linear Unit (LReLU) as their activation function.
In addition, pooling, upsampling, and downsampling operations are not employed throughout the fusion network, as they are prone to information loss and may also introduce noise. MEEAFusion adopts padding to always ensure that the image size remains constant during feature extraction and reconstruction. A detailed description of the design and components of each module is given below.
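The reconstruction stage described above can be sketched in PyTorch as follows. This is only a minimal illustration: the input channel count, the channel schedule `widths`, and the LeakyReLU slope of 0.2 are assumptions, since the text specifies only the kernel sizes, the number of layers, and the activations.

```python
import torch
import torch.nn as nn


def build_reconstructor(in_channels: int = 64, out_channels: int = 1) -> nn.Sequential:
    """Four 3x3 convolutions with LeakyReLU, then a 1x1 convolution with Tanh."""
    # Assumed channel schedule; the paper does not list the widths.
    widths = [in_channels, 64, 32, 16, 8]
    layers = []
    for c_in, c_out in zip(widths[:-1], widths[1:]):
        # padding=1 keeps the spatial size constant, as the text requires.
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.LeakyReLU(0.2)]
    layers += [nn.Conv2d(widths[-1], out_channels, kernel_size=1), nn.Tanh()]
    return nn.Sequential(*layers)
```

Because every layer is padded and no pooling or resampling is used, the output has the same height and width as the concatenated input features.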

Shallow Multi-Scale Feature Extraction Module
Most fusion networks use a series of 3 × 3 convolutional kernels for feature extraction on the input image, resulting in a fixed receptive field. By contrast, using convolutional kernels of varying scales can obtain feature maps with different receptive fields, which have rich local and contextual information and can provide high-quality feature information for subsequent deep feature extraction.
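For the given infrared image $I_{ir}$ and visible image $I_{vis}$, this paper uses three scales of convolution kernel, 1 × 1, 3 × 3, and 5 × 5, to extract image feature information. The resulting features are concatenated; then, the channel number is reduced by 1 × 1 convolution, and the source image shallow features are obtained after activation by the LReLU function. The output features are calculated as follows:

$$F_{ir} = \sigma\left(\mathrm{Conv}_1\left(C\left(\mathrm{Conv}_1(I_{ir}), \mathrm{Conv}_3(I_{ir}), \mathrm{Conv}_5(I_{ir})\right)\right)\right)$$

$$F_{vis} = \sigma\left(\mathrm{Conv}_1\left(C\left(\mathrm{Conv}_1(I_{vis}), \mathrm{Conv}_3(I_{vis}), \mathrm{Conv}_5(I_{vis})\right)\right)\right)$$

where $\mathrm{Conv}_1$, $\mathrm{Conv}_3$, and $\mathrm{Conv}_5$ denote 1 × 1, 3 × 3, and 5 × 5 convolutions, $C(\cdot)$ denotes channel concatenation, and $\sigma(\cdot)$ stands for the activation function LReLU. $F_{ir}$ and $F_{vis}$ denote the shallow features of the infrared image and visible image.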
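A minimal PyTorch sketch of this shallow module is given below. The class name `ShallowMultiScale`, the channel counts, and the LeakyReLU slope are illustrative assumptions; only the three kernel sizes, the concatenation, the 1 × 1 reduction, and the LReLU activation come from the description above.

```python
import torch
import torch.nn as nn


class ShallowMultiScale(nn.Module):
    """Parallel 1x1, 3x3 and 5x5 convolutions -> concat -> 1x1 reduction -> LeakyReLU."""

    def __init__(self, in_channels: int = 1, branch_channels: int = 16, out_channels: int = 16):
        super().__init__()
        # padding = k // 2 keeps the spatial size unchanged for odd kernel sizes.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_channels, branch_channels, kernel_size=k, padding=k // 2)
             for k in (1, 3, 5)]
        )
        self.reduce = nn.Conv2d(3 * branch_channels, out_channels, kernel_size=1)
        self.act = nn.LeakyReLU(0.2)  # slope is an assumption

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # Concatenate the three receptive-field views along the channel axis.
        multi_scale = torch.cat([conv(image) for conv in self.branches], dim=1)
        return self.act(self.reduce(multi_scale))
```

Since the infrared and visible shallow paths are trained independently, one instance of such a module would be created per modality.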

Multi-Scale Gradient Residual Block (MGRB)
The MGRB module is designed by combining different scales of gradient operators and contains one main stream and two residual gradient streams. The detailed structure is presented in Figure 3. Conventional convolutional operations are performed on the main stream to extract feature semantic information, and efficient Sobel gradient convolution is employed on the branches to improve the fine grain of the features. The two branches utilize convolution kernels with k = 3 and k = 5 for the extraction of edge information. When given the input feature $F_{ir,vis}$, the outputs of the gradient convolution with different scales are

where $G_{3\times3\,horizontal}$, $G_{3\times3\,vertical}$, $G_{5\times5\,horizontal}$, and $G_{5\times5\,vertical}$ represent the 3 × 3 and 5 × 5 gradient convolution components in the horizontal and vertical directions, respectively, and the specific convolution kernel parameters are as follows:

The 1 × 1 convolution of the branch allows for adjusting the channel dimension of the feature map to ensure consistency with that of the main stream. Finally, elementwise addition is performed to map the gradient features of the residual streams onto the main stream features, thereby realizing edge compensation of the depth feature. Thus, for the MGRB module, in the case of a given input feature $F_{ir,vis}$, the output $F'_{ir,vis}$ is calculated as follows:

where $\mathrm{Conv}_{1,3}$ indicates that the $\mathrm{Conv}_1$ + LReLU and $\mathrm{Conv}_3$ + LReLU operations are performed sequentially.

Figure 4 specifically demonstrates the two gradient convolution results. The 3 × 3 gradient convolution roughly captures the image edge information, while the 5 × 5 gradient convolution has a wider receptive field and could extract clearer texture details, such as the contours of the human and the tree branches. In particular, the texture inside the tree trunk, which can hardly be distinguished by the eye, is notably outlined as well. Thus, using gradient operators of different scales can acquire richer texture information and strengthen the fused image edge details. Replacing the regular stacked convolution with the MGRB module can further enhance the fine-grained features and the ability of the network to describe the details.
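The structure described above (one main stream plus 3 × 3 and 5 × 5 Sobel residual streams, merged by elementwise addition) can be sketched as follows. Several details are assumptions rather than the authors' settings: the 5 × 5 kernel is one common separable extension of the Sobel operator, the horizontal and vertical responses are combined by summing their absolute values, and the main stream is a single $\mathrm{Conv}_1$ + LReLU followed by $\mathrm{Conv}_3$ + LReLU.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Standard 3x3 Sobel kernel (horizontal derivative), built as smoothing x difference.
SOBEL_3 = torch.outer(torch.tensor([1., 2., 1.]), torch.tensor([-1., 0., 1.]))
# Assumed 5x5 extension; the paper's exact 5x5 parameters are not reproduced here.
SOBEL_5 = torch.outer(torch.tensor([1., 4., 6., 4., 1.]), torch.tensor([-1., -2., 0., 2., 1.]))


class SobelConv(nn.Module):
    """Fixed depthwise gradient filter returning |horizontal| + |vertical| responses."""

    def __init__(self, channels: int, kernel: torch.Tensor):
        super().__init__()
        k = kernel.shape[-1]
        # Depthwise weights of shape (channels, 1, k, k); buffers are not trained.
        self.register_buffer("kx", kernel.expand(channels, 1, k, k).clone())
        self.register_buffer("ky", kernel.t().expand(channels, 1, k, k).clone())
        self.channels, self.pad = channels, k // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gx = F.conv2d(x, self.kx, padding=self.pad, groups=self.channels)
        gy = F.conv2d(x, self.ky, padding=self.pad, groups=self.channels)
        return gx.abs() + gy.abs()


class MGRB(nn.Module):
    """Main conv stream plus two gradient residual streams, added elementwise."""

    def __init__(self, channels: int):
        super().__init__()
        self.main = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 3, padding=1), nn.LeakyReLU(0.2),
        )
        # Each gradient stream ends in a 1x1 convolution to match the main stream channels.
        self.grad3 = nn.Sequential(SobelConv(channels, SOBEL_3), nn.Conv2d(channels, channels, 1))
        self.grad5 = nn.Sequential(SobelConv(channels, SOBEL_5), nn.Conv2d(channels, channels, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.main(x) + self.grad3(x) + self.grad5(x)
```

On a horizontal intensity ramp the 3 × 3 stream responds uniformly in the interior, while a flat region gives zero response away from the zero-padded border, which is the behavior expected of an edge-compensation branch.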

Cross-Transfer Attention Fusion Block (CAFB)
CAFB works during deep feature extraction, with a refined workflow as follows: firstly, it obtains the complementary features between the two source image features. Then, spatial attention (SA) and channel attention (CA) are utilized jointly to assign weights to both sets of complementary information, effectively suppressing the irrelevant features in spatial and channel dimensions. Finally, the weighted complementary feature maps are incorporated into the input features to realize the feature information pre-fusion. Figure 5 exhibits the particular architecture of CAFB. Specifically, the differential idea in [43] is adopted; i.e., the two source image features can be reformulated as

Therefore, the two modal features can be regarded as an integration of common features and complementary features. Consequently, for the image features $F'_{ir}$ and $F'_{vis}$ output from the MGRB module, the complementary features embedded within the visible features $Q^{c}_{vis}$, as well as those inherent in the infrared features $Q^{c}_{ir}$, can be obtained, which are defined as

The CAFB module is designed to fully exploit the complementary information across the dual branch features to realize feature interaction and pre-fusion. For a more precise definition, this study selects only the positive part of the complementary features as the supplementary information, which is expressed as follows:

where $Q_{vis,ij}$ and $Q_{ir,ij}$ denote the values of $Q^{c}_{vis}$ and $Q^{c}_{ir}$ at point (i, j).

The weighted refined complementary features are acquired using SA and CA, respectively. For $Q^{c}_{vis}$, average pooling (AVGP) and max pooling (MAXP) operations are first performed on the spatial locations, and then the resulting features are concatenated along the channel dimension. The spatial weight map is generated after dimensionality reduction using a 5 × 5 convolution operation and activation via the Sigmoid function. Lastly, the refined complementary feature $Q^{1}_{vis}$ with spatial location weights is derived by multiplying it with the input feature $Q^{c}_{vis}$. The computational process is represented as

For the CA stream, AVGP and MAXP are performed in the channel dimension followed by feature summation. The other operations are nearly identical to those of the SA stream, except that the convolutional layer is replaced with two fully connected layers. A complementary feature, $Q^{2}_{vis}$, with channel weights is ultimately yielded. The calculation process is as follows:

where $FC_1$ and $FC_2$ represent the first and second fully connected layers. Adding $Q^{1}_{vis}$ and $Q^{2}_{vis}$ yields a weighted complementary feature, $Q^{c}_{vis}$, which is summed with the input infrared feature $F'_{ir}$ to obtain a pre-fused feature $F_{ir}$ that contains both common features and complementary information of the two source images.
Pre-fused features $F_{vis}$ on the visible image path can be derived through similar steps.
$F_{ir}$ and $F_{vis}$ are transferred as the feature output of the CAFB module to the deep feature extraction branch of source images to realize the information interaction and pre-fusion of infrared and visible features.
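The CAFB workflow above can be sketched as follows. This is a hedged illustration, not the authors' implementation: the complementary features are assumed to be plain feature differences, the reduction ratio and the ReLU between the two fully connected layers are assumptions, and separate attention weights are used for the two transfer directions for simplicity.

```python
import torch
import torch.nn as nn


class JointAttention(nn.Module):
    """Spatial and channel attention applied in parallel to a complementary feature."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        # SA: 2-channel pooled map -> 5x5 convolution -> 1-channel weight map.
        self.spatial = nn.Conv2d(2, 1, kernel_size=5, padding=2)
        # CA: two fully connected layers (hidden width is an assumption).
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # Spatial stream: per-location average and max, concatenated along channels.
        pooled = torch.cat([q.mean(dim=1, keepdim=True), q.amax(dim=1, keepdim=True)], dim=1)
        q_spatial = q * torch.sigmoid(self.spatial(pooled))
        # Channel stream: per-channel average and max, summed before the FC layers.
        descriptor = q.mean(dim=(2, 3)) + q.amax(dim=(2, 3))
        q_channel = q * torch.sigmoid(self.fc(descriptor))[:, :, None, None]
        return q_spatial + q_channel


class CAFB(nn.Module):
    """Cross-transfers weighted complementary features between the two branches."""

    def __init__(self, channels: int):
        super().__init__()
        self.att_vis = JointAttention(channels)
        self.att_ir = JointAttention(channels)

    def forward(self, f_ir: torch.Tensor, f_vis: torch.Tensor):
        # Positive part of what each modality has beyond the other (assumed form).
        q_vis = torch.relu(f_vis - f_ir)
        q_ir = torch.relu(f_ir - f_vis)
        # Visible complements go to the infrared branch, and vice versa.
        return f_ir + self.att_vis(q_vis), f_vis + self.att_ir(q_ir)
```

When the two inputs are identical there is nothing complementary to transfer, so the block returns both features unchanged; this is a quick sanity check on the cross-transfer logic.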

Loss Function
To yield fused images characterized by abundant texture details, prominent infrared targets, and clear edge contours, and to realize an end-to-end training approach, this research integrates the content loss $L_{Content}$, structural similarity loss $L_{SSIM}$, and perceptual loss $L_{Per}$ between fused images $I_f$ and original images $I_{ir}$, $I_{vis}$. Consequently, a comprehensive total loss function $L_{total}$ is devised, formulated as follows:

$$L_{total} = \alpha L_{Content} + \beta L_{SSIM} + \gamma L_{Per}$$

where α, β, and γ represent the weighting parameters for each loss component.

Content Loss
Content loss enables the generated image to gain rich texture information and approximate pixel distribution from the source image. It includes two parts, pixel loss $L_{pixel}$ and gradient loss $L_{grad}$, and is defined as follows:

$$L_{Content} = \lambda L_{pixel} + L_{grad}$$

where λ is the weighting coefficient of pixel loss.
$L_{pixel}$ keeps the created image consistent with source images at the pixel level, and $L_{grad}$ forces the fused result to preserve greater high-frequency information for sharper texture details. The maximum pixel and gradient values among the visible and infrared images are employed to participate in the calculation to achieve the optimal luminance distribution and sharper texture details. The two types of losses are given by

where W and H represent the image width and height, $\|\cdot\|_1$ denotes the l1-norm, and ∇ represents the gradient operator.
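As a rough illustration of how these terms could be combined, the sketch below computes a content loss against the elementwise maximum of the source pixels and source gradients, and then forms a weighted total. The exact loss expressions, the use of a 3 × 3 Sobel operator for ∇, and the mapping of α, β, γ to the content, SSIM, and perceptual terms are assumptions; the default weights are the values reported in the implementation details. Inputs are assumed to be single-channel tensors of shape (B, 1, H, W).

```python
import torch
import torch.nn.functional as F


def sobel_magnitude(img: torch.Tensor) -> torch.Tensor:
    """Approximate |gradient| of a (B, 1, H, W) image with 3x3 Sobel filters."""
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3).to(img)
    ky = kx.transpose(2, 3).contiguous()
    return F.conv2d(img, kx, padding=1).abs() + F.conv2d(img, ky, padding=1).abs()


def content_loss(fused, ir, vis, lam: float = 8.0) -> torch.Tensor:
    # Pixel term: fused image should match the brighter of the two sources.
    pixel = F.l1_loss(fused, torch.max(ir, vis))
    # Gradient term: fused gradients should match the stronger source gradient.
    target_grad = torch.max(sobel_magnitude(ir), sobel_magnitude(vis))
    grad = F.l1_loss(sobel_magnitude(fused), target_grad)
    return lam * pixel + grad


def total_loss(content, ssim, perceptual, alpha=0.5, beta=10.0, gamma=1.0):
    # Weighted sum of the three loss components.
    return alpha * content + beta * ssim + gamma * perceptual
```

`F.l1_loss` averages over all elements, which plays the role of the 1/(HW) normalization for single-channel images.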

Structural Similarity Loss
To diminish the edge imbalance and distortion of fused images, the structural similarity (SSIM) loss $L_{SSIM}$ between the source and fused image is calculated from three aspects: brightness, contrast, and structure. For any two images x and f, SSIM is computed as follows:

$$SSIM(x, f) = \frac{(2\mu_x \mu_f + C_1)(2\sigma_{xf} + C_2)}{(\mu_x^2 + \mu_f^2 + C_1)(\sigma_x^2 + \sigma_f^2 + C_2)}$$

where $\mu_x$, $\mu_f$, $\sigma_x$, and $\sigma_f$ indicate the mean and standard deviation of image pixels, $\sigma_{xf}$ denotes covariance between two image pixels, and $C_1$ and $C_2$ are constants. Then, $L_{SSIM}$ is given by

A smaller $L_{SSIM}$ indicates higher structural similarity and better fusion performance.
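The SSIM expression can be checked numerically with a small NumPy sketch. For brevity it uses statistics over the whole image rather than the sliding local windows used in common SSIM implementations, and the constants follow the widely used choice $C_1 = (0.01L)^2$, $C_2 = (0.03L)^2$ for dynamic range L; both choices are assumptions here, not settings taken from the paper.

```python
import numpy as np


def global_ssim(x: np.ndarray, f: np.ndarray, data_range: float = 1.0) -> float:
    """SSIM from whole-image mean, variance and covariance (no local windows)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_x, mu_f = x.mean(), f.mean()
    var_x, var_f = x.var(), f.var()
    cov = ((x - mu_x) * (f - mu_f)).mean()
    numerator = (2 * mu_x * mu_f + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_f ** 2 + c1) * (var_x + var_f + c2)
    return float(numerator / denominator)
```

An image compared with itself scores 1, and any loss that decreases as SSIM increases (for example, one built from 1 − SSIM against each source image) would then be small for structurally similar images.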

Perceptual Loss
Perceptual loss was initially used for image style transfer and image super-resolution to ensure the output image resembles the source image, thus obtaining a more favorable visual effect. Similarly, applying the perceptual loss to the image fusion task could enhance the feature similarity between fused and source images. A simple and efficient VGG-16 model is chosen as the feature extraction network for calculating $L_{Per}$ to prevent information loss due to the over-extraction of features. Given that the VGG-16 network receives 3-channel images as input, the fusion image is copied three times, while the target image is composed of the source images $I_{ir}$ and $I_{vis}$ and the adjusted image $I_{adj}$. $I_{adj}$ is computed as follows:

The feature maps convolved and activated in the tenth and thirteenth layers of the VGG-16 network are selected to compute the perceptual loss.

where n indicates the sequence number of VGG-16 convolutional layers, $F^{n}_{t}$ and $F^{n}_{f}$ denote feature maps of the target image and fused image after the nth convolution, $C_n$, $W_n$, and $H_n$ are the channel number, width, and height of feature maps, respectively, and $\|\cdot\|_2$ denotes the l2-norm.

Experimental Configuration

Datasets
The Multi-Spectral Road Scenarios (MSRS) [43] dataset, which comprises 1444 matched visible and infrared image pairs, is adopted to train MEEAFusion, and this study divides the image pairs into the training set and test set in a 3:1 ratio. Three datasets including MSRS, TNO [48], and RoadScene [22] are used to evaluate the fusion effect of MEEAFusion in the testing phase. Notably, when testing on the latter two datasets, original images are directly fed into the test network without any re-training and parameter tuning to investigate the generalization performance of the algorithm.

Implementation Detail
During the 50 epochs of MEEAFusion training, eight groups of images are randomly selected and randomly cropped to the size of 128 × 128 as a batch input at each iteration. The Adam optimizer is adopted to update network parameters, with the learning rate initially set to 1 × 10^−3, which is kept constant for the first 25 epochs and then decays linearly. Loss function weights for each part are set to α = 0.5, β = 10, γ = 1, and λ = 8. MEEAFusion is executed on a single GPU (RTX 3090), and the code is implemented in PyTorch.
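The learning-rate schedule described here can be written as a small function. The sketch assumes 0-indexed epochs and a linear decay that would reach zero at epoch 50; the text states only that the rate is constant at first and decays linearly after 25 epochs, so the end point is an assumption.

```python
def learning_rate(epoch: int, base_lr: float = 1e-3,
                  total_epochs: int = 50, decay_start: int = 25) -> float:
    """Constant learning rate, then linear decay toward zero (assumed end point)."""
    if epoch < decay_start:
        return base_lr
    # Fraction of the decay phase that is still remaining.
    remaining = (total_epochs - epoch) / (total_epochs - decay_start)
    return base_lr * remaining
```

In PyTorch, the same shape could be obtained by passing `lambda epoch: learning_rate(epoch) / 1e-3` as the multiplier to `torch.optim.lr_scheduler.LambdaLR` on an Adam optimizer created with `lr=1e-3`.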

Evaluation Indicator
This paper applies both qualitative and quantitative methods to evaluate the fusion effect. During the qualitative evaluation, the information richness, the clarity of edge contour, and the overall visual effect are mainly considered, and these judgments are inevitably somewhat subjective. It is also hard for the eyes to distinguish the strengths and weaknesses of fused images with minimal differences. Hence, a series of quantitative evaluation metrics have been proposed to assess the performance of fusion algorithms more fairly and accurately. This research chooses eight objective evaluation indicators: spatial frequency (SF) [49], average gradient (AG) [50], edge-information-based indicator (Q abf) [51], edge feature mutual information (FMI edge) [52], multiscale structural similarity (MS_SSIM) [53], the sum of correlation differences (SCD) [54], natural image quality evaluator (NIQE) [55], and perception-based image quality evaluator (PIQUE) [56]. Among them, smaller NIQE and PIQUE values indicate superior fusion outcomes, while larger values of the other metrics indicate better results.
Since these objective indicators mainly measure a single aspect of the image and cannot thoroughly determine the fusion effect, this paper establishes an average ranking as the overall evaluation indicator, which is calculated by averaging the rankings of all the indicators and serves to comprehensively assess the fusion quality.
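The average ranking can be computed as in the sketch below. The data layout (`scores[method][metric]`), the tie handling (ties keep input order), and the toy numbers in the usage example are hypothetical and are not results from the paper; only the rule that NIQE and PIQUE are better when smaller comes from the text.

```python
LOWER_IS_BETTER = {"NIQE", "PIQUE"}


def average_ranking(scores: dict) -> dict:
    """Map {method: {metric: value}} to {method: mean rank}; rank 1 is best."""
    methods = list(scores)
    metrics = list(next(iter(scores.values())))
    totals = {method: 0.0 for method in methods}
    for metric in metrics:
        # Sort descending for "larger is better" metrics, ascending otherwise.
        descending = metric not in LOWER_IS_BETTER
        ordered = sorted(methods, key=lambda m: scores[m][metric], reverse=descending)
        for rank, method in enumerate(ordered, start=1):
            totals[method] += rank
    return {method: totals[method] / len(metrics) for method in methods}


# Hypothetical toy values for three placeholder methods.
toy = {
    "A": {"SF": 10.0, "NIQE": 3.0},
    "B": {"SF": 12.0, "NIQE": 4.0},
    "C": {"SF": 8.0, "NIQE": 5.0},
}
print(average_ranking(toy))  # {'A': 1.5, 'B': 1.5, 'C': 3.0}
```

A lower average ranking means the method is closer to the top across all indicators.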

Test Results on MSRS

Qualitative Comparison
Several image pairs from the MSRS dataset are selected to visually compare the fused results yielded by different approaches, as presented in Figures 6-8. The scenarios cover both daytime and nighttime conditions, and each scene contains salient human targets and detailed regions. For clear comparison, the salient and detailed regions are highlighted with red and green boxes, respectively, and the boxed regions are enlarged in the corners of the images as needed. Figure 6 illustrates that the fused images of the various algorithms differ markedly in the daytime scene. GTF, FusionGAN, and IRFS have prominent human targets (red boxes) but blurred details (green boxes). Among them, the GTF result is close to the infrared image in contrast, with a severe loss of detail texture; the FusionGAN image has poorly defined edge contours; and the IRFS result is dark overall. The detail texture in MDLatLRR, DATFuse, and U2Fusion is relatively clear, but the infrared target is relatively weak. CMRFusion's result strongly resembles the visible image, leading to a significant loss of infrared information. U2Fusion also lacks infrared features, which appears as image distortion. RFN-Nest, PMGI, FLFuse-Net, and DDFM have weak salient targets, and the image details appear less clear; in the FLFuse-Net image in particular, the leaves in the background are difficult to distinguish. MSLFusion introduces spectral contamination and artifacts into the fused image. Only DenseFuse and the proposed MEEAFusion preserve infrared targets and background details well, and the fused image yielded by MEEAFusion has more salient targets and clearer edge contours thanks to the designed shallow multi-scale module, MGRB module, and CAFB module.

In the night scene, it is inaccurate to define IVIF narrowly as the preservation of background texture from visible images and salient targets from infrared images. This is because the imaging quality of the optical device is relatively poor, while the infrared image contains the primary prominent target information as well as important background details. Therefore, the salient targets and the detailed texture in both images need to be considered simultaneously in this case. As can be seen from Figure 7, GTF, RFN-Nest, and FusionGAN lose both salient targets (red box) and background texture information (green box), resulting in very poor fusion. The images produced by MDLatLRR, FLFuse-Net, U2Fusion, and DDFM resemble weakened infrared images, exhibiting inconspicuous infrared targets and dark background information. CMRFusion's result has a faint infrared target, while the IRFS image presents a distinct target but quite blurred edge contours. PMGI injects noise into the fused image, leading to poor image contrast. Figure 8 exhibits similar fusion results. Only DenseFuse, DATFuse, MSLFusion, and the proposed method yield better fusion results with well-preserved infrared targets. Among these, MEEAFusion has the sharpest detailed texture, which can be attributed to the CAFB modules that fully preserve the complementary features in the visible and infrared images and consequently mitigate the interference of illumination factors.

Quantitative Comparison
For quantitative evaluation, 36 image pairs covering multiple scenes, including daytime and nighttime, are randomly selected from the MSRS dataset. The calculated results of the objective evaluation metrics are displayed in Figure 9 and Table 1. Notably, MEEAFusion achieves the best result on six metrics. The top SF and AG values indicate improved clarity and texture detail in the MEEAFusion fused images. The maximum Qabf illustrates that the presented method retains more edge details from the source images. The optimal MS_SSIM highlights the remarkable performance of the proposed approach in terms of contrast and brightness, while the best NIQE and SCD verify that MEEAFusion generates the highest-quality image that is closest to the original images. In addition, the second-best FMI_edge and the third-best PIQUE indicate higher mutual information and less image distortion. Together, these indicator values place MEEAFusion first in the average ranking, indicating that its fusion results are superior in various aspects and have the best comprehensive quality.
Overall, the proposed method effectively preserves the prominent targets and background textures of the original images, generating images with clear details, distinct edge contours, and optimal visual quality. This is mainly attributed to three factors: the shallow feature extraction module, the MGRB module, and the CAFB module. It also benefits from the constraints imposed by each part of the loss function. These factors collectively facilitate the generation of high-quality fusion results.
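The distribution plots that accompany the quantitative results (Figure 9 for MSRS) show, for each metric, points (x, y) such that a fraction x of the fused images has a metric value not exceeding y. A minimal sketch of how such points could be produced from per-image metric values is given below; the function name and the sample values are hypothetical.

```python
def empirical_cdf(values):
    """Return points (x, y): a fraction x of the values does not exceed y.

    With repeated values, the last point sharing a given y is the exact one;
    the earlier points simply draw the step.
    """
    ordered = sorted(values)
    n = len(ordered)
    return [((k + 1) / n, y) for k, y in enumerate(ordered)]


# Made-up per-image metric values for one method.
sample = [0.4, 0.1, 0.3, 0.2]
print(empirical_cdf(sample))
```

For a larger-is-better metric, a curve lying above those of the other methods at every x means the method scores higher across the whole test set, not only on average.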

Generalization Experiment
In the above comparative experiments, both training and testing of MEEAFusion are conducted on the same dataset. To validate the generalization performance of the proposed method, two additional datasets, TNO and RoadScene, are selected to further compare the fusion effects of multiple algorithms. A total of 20 image pairs from TNO and 30 image pairs from RoadScene, covering different scenes, are randomly picked for subjective and objective evaluation.

Test Results on TNO
Qualitative Comparison. Figures 10 and 11 display two image pairs selected from TNO for visual presentation. As observed in Figure 10, all methods except GTF, DenseFuse, FusionGAN, PMGI, and the proposed MEEAFusion exhibit significant loss of the salient infrared human targets. However, the fused images of GTF and FusionGAN show imbalanced background information and unclear target edge contours. DenseFuse loses details, and the overall image becomes hazy. Only the PMGI and MEEAFusion results demonstrate superior visual quality, with moderate global brightness, prominent human targets, and sharp edge contours. Figure 11 further confirms these results. MDLatLRR, PMGI, U2Fusion, MSLFusion, and DDFM all produce darkened images with indistinct branch outlines. MSLFusion even exhibits light pollution. The results of DenseFuse, CMRFusion, DATFuse, IRFS, and MEEAFusion are more congruent with visual perception. Among these, DenseFuse and MEEAFusion generate fused images with brighter human targets and sharper human and tree branch outlines.

Quantitative Comparison. Figure 12 and Table 2 present the objective evaluation results of MEEAFusion and the other 13 advanced algorithms on the TNO dataset. MEEAFusion achieves the best results on four metrics, SF, AG, MS_SSIM, and NIQE, indicating that the method obtains clearer images, richer texture information, higher structural consistency, and better visual presentation. The second-best result on Qabf demonstrates that MEEAFusion retains more edge information. Although the results for FMI_edge, SCD, and PIQUE are slightly worse, they still rank near the top. Overall, MEEAFusion ranks first, implying that the proposed approach generalizes well to the TNO dataset, on which it was not trained.

Test Results on RoadScene

Qualitative Comparison. Figure 13 depicts a visible image that appears overexposed due to strong sunlight. In this challenging scenario, the GTF, RFN-Nest, FusionGAN, CMRFusion, and IRFS algorithms produce fusion results that suffer from a severe loss of infrared detail information and complete blurring of the license plate number. The fused images generated by MDLatLRR, PMGI, FLFuse-Net, U2Fusion, and DDFM are biased toward darkness, exhibiting low brightness in the detail regions. DATFuse fails to preserve the detailed texture of the overexposed region. Only MSLFusion and MEEAFusion manage to retain both the texture information of the overexposed region and the infrared details. In Figure 14, the performance of the same five methods led by GTF remains far from satisfactory: their fused images continue to show significant target edge blurring, image distortion, and spectral contamination. The five approaches led by MDLatLRR over-preserve infrared detail information, resulting in low image contrast. DenseFuse, MSLFusion, and MEEAFusion retain texture details clearly and have better visual effects.

Quantitative Comparison. Figure 15 and Table 3 present the objective evaluation results of all methods on the RoadScene dataset. MEEAFusion exhibits a distinct advantage in the SF and AG metrics, with values 18.61% and 16.72% higher than those of the second-placed method, respectively. This indicates that the images produced by MEEAFusion are clearer and richer in texture. The second-best Qabf demonstrates that MEEAFusion is capable of retaining more edge information. The values of MS_SSIM, NIQE, and PIQUE rank near the top, implying that the fused images have a higher overall visual quality. Overall, MEEAFusion ranks second, slightly below its result on the TNO dataset. The reason is that the scenes in the TNO dataset are simpler and have higher contrast, while the RoadScene images are more complex and include a variety of road and traffic scenes. Furthermore, the most probable explanation for MSLFusion ranking first is that MSLFusion is trained on both the TNO and RoadScene datasets, whereas the approach proposed in this research is trained on the MSRS dataset and then evaluated directly on the RoadScene dataset without any parameter fine-tuning. The aforementioned qualitative and quantitative comparisons indicate that MEEAFusion still generates fused images with salient targets and clear textures on datasets it was not trained on, verifying that the proposed approach has outstanding generalization performance.

Efficiency Evaluation
Efficiency also plays an essential role in fusion performance evaluation. For a fairer and more accurate evaluation of fusion efficiency, only one pair of source images is fused at a time. The fusion time is measured from the moment the source images enter the fusion system to the moment the fused image is output. This metric excludes reading and saving the images, so that only the running speed of the fusion network is compared. The fusion efficiency is defined as the average fusion time over all image pairs. All DL-based methods are run on a single RTX 3090 GPU, and the MATLAB code is executed on a single 2.60 GHz Intel i5-13500H CPU. Table 4 displays the parameter counts and inference times obtained for the different fusion approaches. The traditional methods take longer, whereas the fusion efficiency of the DL-based methods is dramatically improved owing to GPU acceleration. FLFuse-Net has the fewest parameters and the fastest inference speed. Notably, DDFM employs 100 sampling iterations to generate the final fused image and consequently runs at the slowest speed, which hinders its use in fields where real-time performance is required. The MEEAFusion algorithm has relatively few parameters, which allows for its deployment on devices with more limited memory. It is observed that discarding dense connections reduces the number of parameters by 21.1% and lowers the memory occupancy while boosting the running efficiency of the algorithm. The CAFB module slows down inference because of its fully connected layers. Nevertheless, MEEAFusion achieves fusion efficiency comparable to that of the other algorithms, maintaining excellent real-time performance on the MSRS dataset on which it was trained. More importantly, the experimental results, in both qualitative analysis and quantitative evaluation, demonstrate that MEEAFusion yields fused images with better quality.
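The timing protocol described above (one pair at a time, image reading and saving excluded, mean over all pairs) can be sketched as follows. This is an illustrative harness rather than the authors' benchmark code: `average_fusion_time` and `dummy_fuse` are hypothetical names, the element-wise maximum merely stands in for a fusion network, and the untimed warm-up call is an added assumption that the text does not mention.

```python
import time


def average_fusion_time(fuse, pairs, warmup=1):
    """Mean seconds per call of fuse(ir, vis), fusing one pair at a time.

    `pairs` must already be loaded in memory so that reading and saving
    images are excluded from the measurement.
    """
    for ir, vis in pairs[:warmup]:
        fuse(ir, vis)  # warm-up runs are not timed (an assumption)
    elapsed = 0.0
    for ir, vis in pairs:
        start = time.perf_counter()
        fuse(ir, vis)
        elapsed += time.perf_counter() - start
    return elapsed / len(pairs)


def dummy_fuse(ir, vis):
    """Placeholder for a fusion network: element-wise maximum."""
    return [max(a, b) for a, b in zip(ir, vis)]


# Three tiny in-memory "image pairs" stand in for a test set.
pairs = [([0, 2], [1, 1])] * 3
print(average_fusion_time(dummy_fuse, pairs))
```

When the fusion function runs on a GPU through PyTorch, calling `torch.cuda.synchronize()` before reading the timer at the start and end of each call is generally needed, because CUDA kernels are launched asynchronously.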

Ablation Experiment
For the method proposed in this paper, the shallow feature extraction module, the MGRB module, and the CAFB module are the most critical components of the fusion network. Ablation experiments are performed to confirm the effectiveness of each module, and the depth of the network is kept constant during the ablation process. The experimental outcomes are displayed in Figure 16.
In the ablation variants, the shallow multi-scale feature extraction module is replaced with a single 3 × 3 convolution, while the MGRB module has its residual gradient stream removed and retains only the convolution operations on the main path for deep feature extraction. Two approaches without feature interaction are adopted as alternatives to the CAFB module. One is F_Conv, which employs ordinary convolution operations, and the other is F_CASA, which still uses SA and CA but concentrates solely on the features of the respective source image. Without the shallow feature extraction module, the fused image produced by MEEAFusion has inconspicuous edge contours that can hardly be distinguished from the background. Additionally, the partial information loss leads to a decrease in brightness and blurring of the salient targets. The reason is that the shallow feature extraction module provides informative multi-scale features for the deep feature extraction network and the image reconstruction module. Without MGRB, the fusion results show hazy edge contours of the salient target, unclear detail texture, and more image artifacts, which verifies that MGRB, using multi-scale gradient convolution, acquires deep features with enhanced edge information. The brightness, edge details, and contrast of the fused image degrade without CAFB, owing to the loss of complementary feature information from the different paths. F_CASA compels the infrared and visible paths to pay more attention to their respective features through the attention mechanism, resulting in a slight improvement in fusion performance compared to F_Conv. However, this also causes redundancy of common features and spectral pollution. Only MEEAFusion preserves both clear texture details and salient infrared targets, thereby achieving a balance between infrared and visible information. A quantitative comparison of the average ranking results of the ablation experiments is given in Table 5. The proposed algorithm achieves the best ranking on all three datasets. The worst results on the MSRS and RoadScene datasets occur when the MGRB module is removed, which illustrates that the MGRB module dramatically improves the quality of the fused images in complex scenes. The numerical results on the TNO dataset remain unchanged, which is probably due to the simplicity of its scenes. Without the CAFB module, the ranking metrics all drop; in particular, the F_Conv fusion results are noticeably worse than those of F_CASA, which further verifies the significance of the attention mechanism in image fusion.

Expansion to Object Detection Tasks
To assess how image fusion improves the results of downstream advanced tasks, the pre-trained YOLOv5s [57] algorithm is utilized to perform target detection on the fused images. Eighty image pairs are randomly picked from the MSRS dataset as a test set, containing a variety of urban scenes with persons and cars labeled. As illustrated in Figure 17, in the daytime scene, the detection result on the infrared image misses the vehicle. The human target in the visible image is in shadow and difficult to distinguish, and it is also missed. The GTF image is close to the infrared image, but the details are more blurred, resulting in missed detection of both human and vehicle targets. The detection results of RFN-Nest, CMRFusion, and DATFuse miss either the person or the car. Moreover, MDLatLRR, DenseFuse, PMGI, and IRFS show false detections. FusionGAN, FLFuse-Net, U2Fusion, MSLFusion, DDFM, and MEEAFusion correctly detect all persons and cars. In the night scenario, all targets are successfully detected in the infrared image, whereas the leftmost human target in the dark is missed in the visible image, as shown in Figure 18. Apart from GTF and CMRFusion, all fusion algorithms detect all objects correctly, which supports the benefit of image fusion for target detection.
The quantitative detection results of YOLOv5s on the MSRS dataset are shown in Table 6. Precision (P), recall (R), mean average precision (mAP), and average ranking are adopted to evaluate the detection results. As can be seen, the infrared image detection results are slightly better than the visible image results, but both source images yield unsatisfactory detection precision. After image fusion, the results of all fusion algorithms except GTF achieve improved accuracy, indicating that the fused image integrates the salient targets and detailed textures of the two source images and provides semantically rich input data for target detection. Among them, U2Fusion, MEEAFusion, and PMGI obtain the best detection results. The mAP@0.5 value for the fused images produced by MEEAFusion is improved by 13.5% and 19.3% relative to those of the infrared and visible images, respectively. This indicates that the fusion process enhances the complementary semantic information, thereby improving target detection accuracy. In particular, the second-best mAP@0.5:0.95 indicates that the detection results on MEEAFusion fused images have higher confidence.
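For reference, the precision and recall indicators follow the standard definitions P = TP / (TP + FP) and R = TP / (TP + FN). The sketch below computes them from hypothetical detection counts and also shows a relative-gain calculation. Whether the 13.5% and 19.3% figures above are relative gains or percentage-point differences is not specified in the text, so the `relative_improvement` helper reflects only one possible reading, and all numbers in the example are made up.

```python
def precision_recall(tp, fp, fn):
    """Standard detection precision and recall from match counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline` (one possible reading)."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical counts: 8 correct detections, 2 false alarms, 2 misses.
print(precision_recall(tp=8, fp=2, fn=2))
# Hypothetical mAP values, not taken from Table 6.
print(relative_improvement(new=0.90, baseline=0.75))
```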

Conclusions
This study introduces MEEAFusion, an innovative IVIF method leveraging multi-scale edge enhancement and a joint attention mechanism. Two primary modules, MGRB and CAFB, have been devised. Firstly, the shallow multi-scale module is exploited to extract the unique feature information of the source images. Then, the MGRB module enhances the edge texture of the deep feature map by collecting gradient information at different scales. Finally, after defining the complementary features between the visible and infrared images, the CAFB module facilitates the information interaction between the two source image features while diminishing feature redundancy. The subjective and objective evaluation results demonstrate that MEEAFusion has promising generalization performance and yields fused images with more prominent infrared targets and clearer image textures than state-of-the-art algorithms, at comparable fusion efficiency. Moreover, the fusion results of the proposed method can improve detection performance. However, although MEEAFusion fulfills the fusion task well under low light, its fusion performance declines under strong exposure conditions. Further exploration of how to achieve balance under diverse lighting conditions is essential in the future. Additionally, integrating MEEAFusion with advanced vision tasks to fully exploit the advantages of image fusion is another direction for future investigation.

Figure 6. Visual display of fusion results for scene 00537D.

Figure 7. Visual display of fusion results for scene 00878N.

Figure 8. Visual display of fusion results for scene 01024N.

Figure 9. Data distribution of fusion results for 36 pairs of MSRS images over the eight objective evaluation criteria. Each point (x, y) in this figure means (100 × x)% of fused images whose metric values do not exceed y.

Figure 10. Visual display of fusion results for bench scene. The salient regions are highlighted with red boxes.

Figure 11. Visual display of fusion results for Kaptein_1123 scene. The salient and detailed regions are highlighted with red and green boxes.

Figure 12. Data distribution of fusion results for 20 pairs of TNO images over the eight objective evaluation criteria. Each point (x, y) in this figure means (100 × x)% of fused images whose metric values do not exceed y.

Figure 13. Visual display of fusion results for scene FLIR_00006. The detailed regions are highlighted with red boxes.

Figure 14. Visual display of fusion results for scene FLIR_06570. The salient and detailed regions are highlighted with red and green boxes.

Figure 15. Data distribution of fusion results for 30 pairs of RoadScene images over the eight objective evaluation criteria. Each point (x, y) in this figure means (100 × x)% of fused images whose metric values do not exceed y.

Figure 16. Visual display of fusion results for Kaptein_1123 scene. The salient and detailed regions are highlighted with red and green boxes.

Figures 17 and 18 show the prediction results of the fused images.

Figure 17. Visual display of YOLOv5s prediction results for fused images of scene 00479D.

Figure 18. Visual display of YOLOv5s prediction results for fused images of scene 01348N.

Table 1. Mean metric values for 36 sets of MSRS fusion images. Bolded: optimal. Red and underlined: second best. Blue and italicized: third best.

Table 2. Mean metric values for 20 sets of TNO fusion images. Bolded: optimal. Red and underlined: second best. Blue and italicized: third best.

Table 3. Mean metric values for 30 sets of RoadScene fusion images. Bolded: optimal. Red and underlined: second best. Blue and italicized: third best.

Table 4. Parameter numbers and inference time (s) for different fusion methods. Bolded: optimal.

Table 5. Quantitative comparison of average ranking results of ablation experiments. Bolded: optimal.

Table 6. Quantitative detection results of YOLOv5s on MSRS fused images. Bolded: optimal. Red and underlined: second best. Blue and italicized: third best.